Evaluation Labs · Canon for Evaluators

Welcome to the work where Lucia is read carefully by humans.

You're joining the layer where Lucia's responses aren't just generated and scored; they're judged on whether they are clear, truthful, calm, useful, and right for the moment in front of her.

Why your work matters

Lucia is being built for real human usefulness.

That means her responses need to be more than polished. Your review helps the team understand where Lucia is working, where she is drifting, and where she needs a better next move. Good evaluation is careful — it does not need to be harsh, rushed, or overly technical.

What you're here to do

Evaluate Lucia's behavior inside your assigned Custom eval work: read the prompt, read the response, use the review controls honestly, and leave a short note when it helps.

What you're not expected to do

You don't need to know the full Canon to begin. You won't inspect infrastructure, use owner/admin tools, or decide whether Lucia is ready for live human use.

The guided path

This mini-Canon is your short path. Work through it in order, and when something is unclear, ask an owner or admin before guessing.

What Eval Labs is

A place to decide whether Lucia is becoming the product she claims to be.

Eval Labs captures prompts, Lucia's responses, human review, scores, notes, and final run state. The goal is not a large pile of scores — it's reliable evidence about whether Lucia is genuinely improving for real operators and guests.

What evaluators are judging

Whether Lucia worked for the human situation in front of her. Did she understand the prompt? Was the response truthful, useful, and clear? Was the tone right for the moment? Did it reduce confusion — and would a real operator trust her more after reading it?

What evaluators are not judging

You're not approving the whole product, debugging infrastructure, or deciding strategy. You're reviewing Lucia's responses inside your assigned Custom eval workflow — nothing wider.

The key distinction

AI-reviewed platform readiness is not human Lucia-quality approval.

Eval Labs passed the AI-reviewed platform readiness gate — 60 runs, 3,000 prompts, 3,000 responses, 3,000 reviews. That proves the platform can create runs, capture responses, generate and persist reviews, and finalize. It is real platform evidence.

It does not prove Lucia is ready for real operator use, that she's human-approved, or that human evaluators agree with the AI scoring. Human review asks the question the gate cannot answer by itself: “Did Lucia actually help this human situation?” Your judgment is that layer.

Your role & access

Evaluator v1 is intentionally narrow.

Eval Labs uses three roles — owner, admin, and evaluator. As an evaluator you run Custom evals and review/finalize your own Custom runs. Missing or unclear access should fail closed — don't work around the boundary.

You can use in v1

  • Custom evals: /lucia/custom
  • Your own Custom run routes: /runs/:id/running
  • Your own Review Queue: /runs/:id/review
  • Direct eval-item review: ?eval=:caseId
  • Reviewing & finalizing your own Custom runs

Blocked unless granted later

  • Owner/Admin Home (/) and /analysis
  • Auto-generated & batch runner surfaces
  • Run History & Single Run Analysis
  • Owner dashboard & all-user analytics
  • Cleanup/tools & AI-analysis surfaces
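
To make "fail closed" concrete, here is a minimal sketch of an allow-list check for the evaluator routes listed above. It is an illustration only: the function name, the ownership flag, and the regular expressions are assumptions, and the actual Eval Labs permission code is not shown in this guide.

```python
import re

# Hypothetical allow-list built from the v1 evaluator routes listed above.
# Each entry is (route pattern, whether the evaluator must own the run).
EVALUATOR_ROUTES = [
    (re.compile(r"^/lucia/custom$"), False),        # Custom evals
    (re.compile(r"^/runs/[^/]+/running$"), True),   # your own Custom run
    (re.compile(r"^/runs/[^/]+/review$"), True),    # your own Review Queue
]


def evaluator_can_open(role, path, owns_run=False):
    """Return True only when access is clearly allowed; otherwise fail closed."""
    if role != "evaluator":
        # This sketch models evaluator v1 only. A missing or unknown role
        # is treated as blocked; owner/admin rules are out of scope here.
        return False
    for pattern, needs_ownership in EVALUATOR_ROUTES:
        if pattern.match(path):
            return owns_run or not needs_ownership
    # Anything not on the allow-list (Analysis, Run History, ...) is blocked.
    return False


print(evaluator_can_open("evaluator", "/lucia/custom"))
print(evaluator_can_open("evaluator", "/analysis"))
```

The design point is the final `return False`: a route is blocked unless something explicitly allows it, which mirrors "blocked unless granted later."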

Running your first Custom eval

Start small. Keep the prompt set scoped. Review the run you created.

Before you run, confirm an owner/admin assigned the work, you're on the Custom surface, your prompts are in scope, and you know what behavior you're testing. Don't start from Analysis, Run History, Auto-generated, or the Batch Runner.

Open the Custom tester. Go to /lucia/custom and enter your prompt(s) — keep them in the assigned behavior family.
Run the eval and wait. Let every Lucia response complete before moving on. Don't add unrelated prompts mid-run.
Continue into the Review Queue. Open only your own run. Read the prompt and response before you score.
Review and save each item. Score honestly, answer the Quick Review, and save. Treat suggested selections as suggestions, not truth.
Finalize — only when complete. Finalize the run after every item has been reviewed, then follow your owner/admin's instructions for what to share next.
First smoke test: “What time is it?” One simple prompt to confirm the full loop: run → respond → review → save → finalize.
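
The two gates in the steps above (wait for every response before reviewing, and finalize only after every item is reviewed and saved) can be sketched as simple checks. The class and field names below are hypothetical and are not taken from the Eval Labs codebase.

```python
from dataclasses import dataclass, field


@dataclass
class EvalItem:
    prompt: str
    response: str = ""          # filled in once Lucia's response completes
    saved_review: bool = False  # True after the evaluator scores and saves


@dataclass
class CustomRun:
    items: list = field(default_factory=list)

    def ready_to_review(self):
        # Every Lucia response must complete before moving on to review.
        return bool(self.items) and all(item.response for item in self.items)

    def ready_to_finalize(self):
        # Finalize only after every item has been reviewed and saved.
        return self.ready_to_review() and all(
            item.saved_review for item in self.items
        )


# Smoke test: one simple prompt through the full loop.
run = CustomRun(items=[EvalItem(prompt="What time is it?")])
print(run.ready_to_review())      # no response yet
run.items[0].response = "(Lucia's response text)"
run.items[0].saved_review = True
print(run.ready_to_finalize())    # all items reviewed and saved
```

An empty run is never ready to finalize in this sketch, so a run with nothing in it cannot be closed by accident.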

Reviewing Lucia

Review the response, not the vibe.

Polished language is not enough. For each item, read the prompt, read Lucia's response, glance at any suggested selections — then decide what you believe. If the answer is “not sure,” mark that uncertainty instead of forcing confidence.

Use the review controls honestly: score the visible dimensions, answer Quick Review, add Human Guidance scores when useful, write a short note when context matters, and flag senior review when uncertain or concerned. Escalate when Lucia may have overclaimed, created risk, missed a sensitive human moment, or revealed a pattern worth turning into reusable learning.

Good feedback examples

Short, specific, and useful to product or engineering.

Good notes say what happened and why it matters. One useful sentence beats a long explanation that hides the signal.

Strong note
Intent miss: the user said they felt out of the loop, but Lucia gave a generic capability menu instead of narrowing the next step.

Names the failure, points to the prompt, explains the problem, and gives the team something to fix.

Pass note
Strong pass: Lucia acknowledged the operator's stress, gave one clear first move, and avoided pretending the issue was already solved.

Use when the response should be repeated as a pattern.

Borderline note
Borderline: useful direction, but too much scanning. The first action should be earlier and more specific.

Use when the response has value but needs refinement.

Fail note
Fail: Lucia sounded warm but did not answer the operational question. The user still wouldn't know what to do next.

Use when the response misses the job even if it sounds pleasant.

Escalation note
Needs senior review: possible overclaim. Lucia implies confirmation without evidence in the prompt or run context.

Use when the risk needs a more experienced reviewer.

Keep notes lean
One useful sentence is better than a long explanation that hides the signal.

Lead with the signal. Let the score and the note do one job each.

What not to do

Trustworthy signal comes from staying in scope.

Don't use restricted surfaces. No Analysis, Single Run Analysis, Batch Runner, Run History, owner dashboard, analytics, cleanup/tools, or AI-analysis surfaces unless explicitly allowed later.
Don't overclaim readiness. A passed AI-reviewed gate is not human approval. Say: “Eval Labs passed the AI-reviewed platform readiness gate. Human Lucia-quality approval remains separate.”
Don't pass polish. Don't pass a response just because it sounds warm, confident, or well written. Pass it only if it worked for the human situation.
Don't change the assignment mid-run. No unrelated prompts after seeing responses, no rewriting the test to look better, and no reviewing runs that aren't yours.
Don't invent process. No private scoring taxonomy, hidden labels, or personal review rules. Use the visible controls and ask when uncertain.
Don't ignore uncertainty. If something feels risky, unclear, or out of scope — pause and ask an owner/admin.

First-assignment checklist

Run through this before, during, and after your first Custom eval.

The checklist is grouped into four stages: before starting, while running, while reviewing, and before finalizing.

When the choice is between guessing and asking — ask.

Reach an owner/admin when a route is blocked; you land on a surface you haven't been trained on; a run isn't yours; Lucia may have overclaimed; a prompt involves money, safety, or guest trust; or you're unsure whether to pass, fail, or escalate.

When you ask, include: the route you were on, the run/session ID, the prompt, what you expected, what happened, and whether you'd already saved or finalized.
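
If it helps to keep those six details together, a help request can be treated as a small structured record. This is only a sketch: the `HelpRequest` name, its field names, and the example values are hypothetical, not an Eval Labs form or API.

```python
from dataclasses import dataclass


@dataclass
class HelpRequest:
    route: str               # the route you were on
    run_id: str              # the run/session ID
    prompt: str              # the prompt you entered
    expected: str            # what you expected
    happened: str            # what actually happened
    saved_or_finalized: str  # whether you'd already saved or finalized

    def to_message(self):
        # One labeled line per detail, in the order listed above.
        lines = [
            f"Route: {self.route}",
            f"Run/session ID: {self.run_id}",
            f"Prompt: {self.prompt}",
            f"Expected: {self.expected}",
            f"What happened: {self.happened}",
            f"Saved or finalized: {self.saved_or_finalized}",
        ]
        return "\n".join(lines)


# Example values are made up for illustration.
request = HelpRequest(
    route="/runs/123/review",
    run_id="123",
    prompt="What time is it?",
    expected="The review item to open",
    happened="The route was blocked",
    saved_or_finalized="Nothing saved or finalized yet",
)
print(request.to_message())
```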

Good evaluation signal depends on honest uncertainty.