Demo

Prompt optimization before retraining

Walk through a small eval loop: define the baseline, score it against a rubric, try candidate rewrites, and keep the prompt that improves behavior. The model stays fixed throughout.

Eval case context

Account facts: Analytics was enabled on May 6, 2026. It adds $49 per seat. The team has 4 seats. No duplicate invoice exists.

Customer: My invoice doubled after we added the analytics add-on. What happened?

Baseline prompt

You are a helpful support assistant. Answer the customer's question clearly.

Selected candidate prompt

You are a helpful support assistant. Answer the customer's question clearly.

Rubric

Score the behavior before changing the model.

Selected score: 48 out of 100

Grounding

Uses the supplied account facts instead of guessing.

Score: 2/5, weight: 35%

Actionability

Gives the user a concrete next step or decision.

Score: 2/5, weight: 30%

Tone

Keeps the response concise, direct, and calm.

Score: 4/5, weight: 20%

Risk control

Avoids unsupported promises and escalates when needed.

Score: 2/5, weight: 15%
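The demo does not state how the four criterion scores combine into the total. A minimal sketch, assuming the total is a weighted sum of each score (as a fraction of 5) times its percentage weight, reproduces the selected score of 48. The names `RUBRIC_SCORES`, `MAX_SCORE`, and `weighted_score` are hypothetical and not part of the demo.

```python
# Rubric values from the demo: criterion -> (score out of 5, weight in percent).
RUBRIC_SCORES = {
    "Grounding": (2, 35),
    "Actionability": (2, 30),
    "Tone": (4, 20),
    "Risk control": (2, 15),
}

MAX_SCORE = 5  # every criterion is scored on a 0-5 scale


def weighted_score(scores):
    """Combine per-criterion scores into a 0-100 total.

    Assumption: total = sum(score / MAX_SCORE * weight), with weights
    expressed in percent and summing to 100.
    """
    total_weight = sum(weight for _, weight in scores.values())
    if total_weight != 100:
        raise ValueError(f"weights must sum to 100, got {total_weight}")
    return sum(score * weight for score, weight in scores.values()) / MAX_SCORE


baseline_total = weighted_score(RUBRIC_SCORES)
print(baseline_total)  # 48.0
```

Under the same assumption, a response that scores 5/5 on every criterion totals 100, which is consistent with the winning prompt's reported score.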

Candidate prompts

The optimizer keeps the variant that scores highest on the rubric, not the fanciest one.

Winner: Rubric-optimized

Scored output

Baseline / Billing question

Your invoice may have increased because the analytics add-on changed your plan. You can check billing settings or contact support for details.

Target behavior: Explain the add-on charge, show the seat math, and offer to remove or downgrade it.

Winner selection

Rubric-optimized wins this eval with 100 points.

The currently selected prompt, the baseline, scores 48. The optimizer would keep Rubric-optimized because it better matches the rubric on this eval case.
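The selection step can be sketched as a comparison of rubric totals. This is an illustration only: the demo does not show the optimizer's code, and the function `select_winner` and its tie-breaking rule (keep the incumbent unless a candidate scores strictly higher) are assumptions. The two scores are the ones reported for this eval case.

```python
def select_winner(scores: dict[str, float], incumbent: str) -> str:
    """Return the prompt variant to keep.

    Assumption: a candidate replaces the incumbent only if it scores
    strictly higher, so ties never trigger a prompt change.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] > scores[incumbent] else incumbent


# Rubric totals reported in the demo for this single eval case.
candidate_scores = {"Baseline": 48, "Rubric-optimized": 100}

winner = select_winner(candidate_scores, incumbent="Baseline")
print(winner)  # Rubric-optimized
```

Keeping the incumbent on ties is a conservative choice: a rewrite has to measurably beat the current prompt before it ships.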

Winning output

Analytics was enabled on May 6 and bills at $49 per seat. With 4 seats, that adds $196 to the invoice. I can help remove the add-on or downgrade it before the next billing cycle.
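The seat math in the winning output follows directly from the account facts. A small sketch of a grounding check, with hypothetical names (`PRICE_PER_SEAT`, `SEATS`, `mentions_amount`), recomputes the charge and confirms the response quotes it. This is not how the demo grades Grounding; it only illustrates the arithmetic.

```python
# Account facts from the eval case context.
PRICE_PER_SEAT = 49  # dollars per seat for the analytics add-on
SEATS = 4

addon_charge = PRICE_PER_SEAT * SEATS  # 4 seats at $49 each

winning_output = (
    "Analytics was enabled on May 6 and bills at $49 per seat. "
    "With 4 seats, that adds $196 to the invoice. "
    "I can help remove the add-on or downgrade it before the next billing cycle."
)


def mentions_amount(text: str, dollars: int) -> bool:
    """Check that the response quotes the given dollar amount verbatim."""
    return f"${dollars}" in text


print(addon_charge)                                   # 196
print(mentions_amount(winning_output, addon_charge))  # True
```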

Lesson

Retraining is a later lever, not the first lever.

If the eval shows that a prompt rewrite reliably fixes the failure, ship the prompt. Save retraining for gaps the prompt cannot express: missing skills, missing domain behavior, or failures that remain after retrieval, rubric design, and optimizer search are exhausted.