You're joining the layer where Lucia's responses aren't just generated and scored; they're judged. Each one should be clear, truthful, calm, useful, and right for the moment in front of her.
Why your work matters
Lucia's responses need to be more than polished. Your review helps the team understand where Lucia is working, where she is drifting, and where she needs a better next move. Good evaluation is careful; it does not need to be harsh, rushed, or overly technical.
Evaluate Lucia's behavior inside your assigned Custom eval work: read the prompt, read the response, use the review controls honestly, and leave a short note when it helps.
You don't need to know the full Canon to begin. You won't inspect infrastructure, use owner/admin tools, or decide whether Lucia is ready for live human use.
This mini-Canon is your short path. Work through it in order, and when something is unclear, ask an owner or admin before guessing.
What Eval Labs is
Eval Labs captures prompts, Lucia's responses, human review, scores, notes, and final run state. The goal is not a large pile of scores — it's reliable evidence about whether Lucia is genuinely improving for real operators and guests.
You're evaluating whether Lucia worked for the human situation in front of her. Did she understand the prompt? Was the response truthful, useful, and clear? Was the tone right for the moment? Did it reduce confusion, and would a real operator trust her more after reading it?
You're not approving the whole product, debugging infrastructure, or deciding strategy. You're reviewing Lucia's responses inside your assigned Custom eval workflow — nothing wider.
The key distinction
Eval Labs passed the AI-reviewed platform readiness gate — 60 runs, 3,000 prompts, 3,000 responses, 3,000 reviews. That proves the platform can create runs, capture responses, generate and persist reviews, and finalize. It is real platform evidence.
It does not prove Lucia is ready for real operator use, that she's human-approved, or that human evaluators agree with the AI scoring. Human review asks the question the gate cannot answer by itself: “Did Lucia actually help this human situation?” Your judgment is that layer.
Your role & access
Eval Labs uses three roles — owner, admin, and evaluator. As an evaluator you run Custom evals and review/finalize your own Custom runs. Missing or unclear access should fail closed — don't work around the boundary.
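The fail-closed rule above can be sketched in a few lines. This is a minimal illustration, not the Eval Labs permission code: the action names and the `evaluator_may` helper are hypothetical, and only the evaluator's permissions are modeled because those are the only ones this guide states.

```python
from typing import Optional

# Hypothetical action labels; the real Eval Labs identifiers are not given here.
EVALUATOR_ACTIONS = {"run_custom_eval", "review_run", "finalize_run"}


def evaluator_may(action: Optional[str], run_owner: Optional[str], user: Optional[str]) -> bool:
    """Return True only when every fact needed to allow the action is known."""
    # Missing or unclear input is treated as "no": this is the fail-closed part.
    if not action or not user:
        return False
    # Anything outside the evaluator's stated actions is denied.
    if action not in EVALUATOR_ACTIONS:
        return False
    if action == "run_custom_eval":
        return True
    # Reviewing and finalizing are limited to the evaluator's own runs,
    # so an unknown run owner is also a denial.
    return run_owner is not None and run_owner == user
```

The point of the sketch is the default: every branch that lacks information returns False, so an evaluator who hits a denial should ask an owner or admin instead of looking for another path.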
/lucia/custom
/runs/:id/running
/runs/:id/review?eval=:caseId
/ & /analysis
Running your first Custom eval
Before you run, confirm an owner/admin assigned the work, you're on the Custom surface, your prompts are in scope, and you know what behavior you're testing. Don't start from Analysis, Run History, Auto-generated, or the Batch Runner.
Open /lucia/custom and enter your prompt(s); keep them in the assigned behavior family.
What time is it?
One simple prompt to confirm the full loop — run → respond → review → save → finalize.
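One way to picture the loop is as a fixed sequence where each step only makes sense after the one before it. The sketch below is illustrative only; the step names come from the line above, while `LOOP` and `next_step` are hypothetical helpers, not part of Eval Labs.

```python
from typing import List, Optional

# Step names taken from the loop described above.
LOOP = ["run", "respond", "review", "save", "finalize"]


def next_step(completed: List[str]) -> Optional[str]:
    """Return the next step in the loop, or None once the loop is complete.

    Raises ValueError if the completed steps are not a prefix of the loop,
    for example when "finalize" appears before "save".
    """
    if completed != LOOP[:len(completed)]:
        raise ValueError(f"steps out of order: {completed}")
    if len(completed) == len(LOOP):
        return None
    return LOOP[len(completed)]
```

For a first run with a single simple prompt, `next_step(["run", "respond"])` returns `"review"`, and the loop is done only when all five steps have happened in order.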
Reviewing Lucia
Polished language is not enough. For each item, read the prompt, read Lucia's response, glance at any suggested selections — then decide what you believe. If the answer is “not sure,” mark that uncertainty instead of forcing confidence.
Use the review controls honestly: score the visible dimensions, answer Quick Review, add Human Guidance scores when useful, write a short note when context matters, and flag senior review when uncertain or concerned. Escalate when Lucia may have overclaimed, created risk, missed a sensitive human moment, or revealed a pattern worth turning into reusable learning.
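The escalation triggers above can be read as a simple checklist: if any one of them applies, flag senior review. The sketch below assumes a hypothetical `ReviewSignals` record and `escalation_reasons` helper; neither is a real Eval Labs control, and the judgment behind each flag still belongs to the evaluator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReviewSignals:
    """Hypothetical yes/no answers an evaluator forms while reviewing one item."""
    may_have_overclaimed: bool = False
    created_risk: bool = False
    missed_sensitive_moment: bool = False
    reusable_pattern: bool = False
    reviewer_uncertain: bool = False


def escalation_reasons(signals: ReviewSignals) -> List[str]:
    """List every reason to flag senior review; an empty list means no flag."""
    checks = [
        (signals.may_have_overclaimed, "possible overclaim"),
        (signals.created_risk, "created risk"),
        (signals.missed_sensitive_moment, "missed a sensitive human moment"),
        (signals.reusable_pattern, "pattern worth reusable learning"),
        (signals.reviewer_uncertain, "reviewer uncertain or concerned"),
    ]
    return [reason for applies, reason in checks if applies]
```

A single reason is enough to escalate. Listing the reasons, not just a yes/no flag, gives the senior reviewer the same one-sentence signal a good note provides.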
Good feedback examples
Good notes say what happened and why it matters. One useful sentence beats a long explanation that hides the signal.
Intent miss: the user said they felt out of the loop, but Lucia gave a generic capability menu instead of narrowing the next step.
Names the failure, points to the prompt, explains the problem, and gives the team something to fix.
Strong pass: Lucia acknowledged the operator's stress, gave one clear first move, and avoided pretending the issue was already solved.
Use when the response should be repeated as a pattern.
Borderline: useful direction, but too much scanning. The first action should be earlier and more specific.
Use when the response has value but needs refinement.
Fail: Lucia sounded warm but did not answer the operational question. The user still wouldn't know what to do next.
Use when the response misses the job even if it sounds pleasant.
Needs senior review: possible overclaim. Lucia implies confirmation without evidence in the prompt or run context.
Use when the risk needs a more experienced reviewer.
Lead with the signal. Let the score and the note do one job each.
What not to do
First-assignment checklist
Reach an owner/admin when a route is blocked, you see an untrained surface, a run isn't yours, Lucia may overclaim, a prompt involves money, safety, or guest trust, or you're unsure whether to pass, fail, or escalate.
When you ask, include: the route you were on, the run/session ID, the prompt, what you expected, what happened, and whether you'd already saved or finalized.
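The six items above work well as a fixed record, so nothing gets left out when you write to an owner or admin. This is a sketch under assumptions: `HelpRequest` and `missing_fields` are hypothetical names, and the run ID shown in the usage note is a made-up placeholder.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HelpRequest:
    """The details to include when asking an owner or admin for help."""
    route: Optional[str] = None               # the route you were on
    run_or_session_id: Optional[str] = None   # the run/session ID
    prompt: Optional[str] = None              # the prompt you entered
    expected: Optional[str] = None            # what you expected
    actual: Optional[str] = None              # what happened
    saved_or_finalized: Optional[bool] = None # whether you'd already saved or finalized


def missing_fields(request: HelpRequest) -> List[str]:
    """Return the names of fields that are still empty, in declaration order."""
    missing = []
    for field in fields(request):
        value = getattr(request, field.name)
        # False is a valid answer for saved_or_finalized, so only None and
        # empty strings count as missing.
        if value is None or value == "":
            missing.append(field.name)
    return missing
```

For example, `HelpRequest(route="/lucia/custom", run_or_session_id="run-123", prompt="What time is it?")` still reports `expected`, `actual`, and `saved_or_finalized` as missing, which is a cue to fill them in before sending.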
Good evaluation signal depends on honest uncertainty.