Key concepts
- Golden outputs are the “this is correct” snapshots for a small but representative set of inputs.
- Fixtures are the saved inputs and context your workflow expects.
- Regression checks are automated rules that compare new runs to your goldens and fail the build when something breaks.
What to put in your fixture pack
Pick 20 to 100 real, anonymized cases that reflect the messiness of production. Include clear edge cases: empty fields, strange punctuation, non‑English text, very short and very long inputs, and outdated facts. Label what “good” means for each case. Keep every file in the pack human-readable.
Regression checks that catch breakage
Use several simple checks instead of one fancy one.
- Exact or near‑exact match for structured outputs. Compare parsed JSON so that key order does not affect the result.
- Schema & type checks to validate shape: required fields, enums, arrays, length ranges.
- Policy checks for brand voice, claims, banned terms, and compliance rules.
- Length & count limits for headlines, meta descriptions, bullet counts.
- Semantic similarity for free text. Use embeddings or a rubric-based judge to compare against the golden.
- Cost/latency drift checks to catch “it works but it’s 3× slower or pricier.”
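A minimal sketch of the first check in the list above, assuming outputs are stored as JSON files. The function name json_diff is illustrative, not part of any tool mentioned here. It parses both sides first, so key order never matters, and it returns the paths that differ so the failure report is readable.

```python
import json


def json_diff(golden, actual, path="$"):
    """Return human-readable differences between two parsed JSON values.

    Dict keys are compared as sets, so key order is ignored.
    List order is still significant.
    """
    if isinstance(golden, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(golden) | set(actual)):
            child = f"{path}.{key}"
            if key not in golden:
                diffs.append(f"{child}: unexpected key")
            elif key not in actual:
                diffs.append(f"{child}: missing key")
            else:
                diffs.extend(json_diff(golden[key], actual[key], child))
        return diffs
    if isinstance(golden, list) and isinstance(actual, list):
        if len(golden) != len(actual):
            return [f"{path}: expected {len(golden)} items, got {len(actual)}"]
        diffs = []
        for i, (g, a) in enumerate(zip(golden, actual)):
            diffs.extend(json_diff(g, a, f"{path}[{i}]"))
        return diffs
    return [] if golden == actual else [f"{path}: expected {golden!r}, got {actual!r}"]


golden = json.loads('{"headline": "Run faster", "tags": ["a", "b"]}')
actual = json.loads('{"tags": ["a", "b"], "headline": "Run faster"}')
print(json_diff(golden, actual))  # same content, different key order
```

An empty list means the structured output matches the golden. Anything else is a list of paths to show in the pass/fail report.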
Make “Execute Workflow” your safety switch
Wire your test harness to your Execute Workflow action so anyone can run the whole suite before merging a change.
In the UI
- Choose the workflow.
- Select Test Suite: /tests/fixtures.
- Toggle Record vs. Verify. Verify should be the default.
- Click Execute Workflow and review the pass/fail report.
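The Record vs. Verify toggle can be pictured as a small harness. This is only a sketch under assumptions: run_suite and its workflow callable are hypothetical stand-ins for whatever your platform runs, and fixtures are assumed to be JSON files with per-case rules in *.meta.json files.

```python
import json
from pathlib import Path


def run_suite(fixtures_dir, golden_dir, workflow, mode="verify"):
    """Run `workflow` on every fixture; record goldens or verify against them.

    `workflow` is any callable that maps parsed fixture inputs to an output dict.
    Returns a list of failure messages (always empty in record mode).
    """
    if mode not in ("record", "verify"):
        raise ValueError(f"unknown mode: {mode!r}")
    fixtures_dir, golden_dir = Path(fixtures_dir), Path(golden_dir)
    failures = []
    for fixture in sorted(fixtures_dir.glob("*.json")):
        if fixture.name.endswith(".meta.json"):
            continue  # per-case rules, not workflow inputs
        actual = workflow(json.loads(fixture.read_text(encoding="utf-8")))
        golden_path = golden_dir / fixture.name
        if mode == "record":
            golden_dir.mkdir(parents=True, exist_ok=True)
            golden_path.write_text(
                json.dumps(actual, indent=2, sort_keys=True), encoding="utf-8"
            )
        elif not golden_path.exists():
            failures.append(f"{fixture.name}: no golden recorded")
        elif json.loads(golden_path.read_text(encoding="utf-8")) != actual:
            failures.append(f"{fixture.name}: output differs from golden")
    return failures
```

Keeping verify as the default argument mirrors the UI advice: overwriting a golden should take a deliberate extra step.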
CLI-style example (adapt to your stack)
# Run the workflow against fixtures
execute-workflow run --id ad-copy --fixtures tests/fixtures/ad_copy --out runs/2025-09-18
# Compare new outputs to goldens with multiple checks
execute-workflow check \
--golden tests/golden/ad_copy \
--actual runs/2025-09-18 \
--checks schema,length,policy,semantic \
--semantic-threshold 0.86
Examples for marketing & ops
Ad copy generator
- Fixtures include product blurbs with constraints.
- Checks: headline ≤ 30 chars, no banned claims, CTA present, tone tag = “confident,” semantic ≥ 0.86 vs. golden.
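The non-semantic ad copy checks could look roughly like this. The output field names (headline, body, cta, tone) are assumptions about your workflow's JSON, and check_ad_copy is an illustrative name. Banned terms are matched as whole words or phrases so that “cure” does not fire on “Secure”.

```python
import re


def contains_term(text, term):
    """True if `term` appears in `text` as a whole word or phrase, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def check_ad_copy(output, banned_terms, headline_max=30, required_tone="confident"):
    """Return a list of failures for one ad copy output (field names are assumed)."""
    failures = []
    headline = str(output.get("headline") or "")
    if len(headline) > headline_max:
        failures.append(f"headline is {len(headline)} chars, max is {headline_max}")
    # Search every customer-facing field for banned claims.
    text = " ".join(str(output.get(k) or "") for k in ("headline", "body", "cta"))
    for term in banned_terms:
        if contains_term(text, term):
            failures.append(f"banned term present: {term!r}")
    if not str(output.get("cta") or "").strip():
        failures.append("CTA missing")
    if output.get("tone") != required_tone:
        failures.append(f"tone is {output.get('tone')!r}, expected {required_tone!r}")
    return failures


good = {
    "headline": "Run farther, recover faster",
    "body": "Secure checkout in one tap.",
    "cta": "Shop now",
    "tone": "confident",
}
print(check_ad_copy(good, ["100% guaranteed", "cure"]))
```

The semantic comparison against the golden is left out on purpose; it needs an embedding model or judge and is better kept as its own check.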
SEO brief writer
- Checks: sections exist [H1, H2s, outline], target keywords included, reading grade ≤ 9, link count within range.
Email subject lines
- Checks: 5 variants, each ≤ 45 chars, no words from the spam-word list, diversity score ≥ 0.6 across variants.
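The text above does not define “diversity score”, so the sketch below picks one plausible definition and labels it as an assumption: 1 minus the mean pairwise Jaccard similarity of the variants' word sets. Swap in your own metric if you already have one. All function names are illustrative.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def diversity_score(variants):
    """ASSUMED metric: 1 - mean pairwise Jaccard similarity of word sets."""
    pairs = list(combinations([tokens(v) for v in variants], 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return 1.0 - total / len(pairs)


def check_subject_lines(variants, spam_words, count=5, max_chars=45, min_diversity=0.6):
    """Return a list of failures for one set of subject line variants."""
    failures = []
    if len(variants) != count:
        failures.append(f"expected {count} variants, got {len(variants)}")
    spam = {w.lower() for w in spam_words}
    for i, line in enumerate(variants, start=1):
        if len(line) > max_chars:
            failures.append(f"variant {i}: {len(line)} chars, max is {max_chars}")
        hits = sorted(spam & tokens(line))
        if hits:
            failures.append(f"variant {i}: spam words {hits}")
    score = diversity_score(variants)
    if score < min_diversity:
        failures.append(f"diversity {score:.2f} is below {min_diversity}")
    return failures
```

With this definition, five copies of the same line score 0.0 and five lines with no shared words score 1.0.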
PIM transform / import mappings
- Checks: normalized brand and model names match mapping table, SKU format regex passes, JSON schema valid, 0 nulls in required fields.
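A sketch of the per-record PIM checks, with several assumptions stated up front: the SKU pattern is a made-up example format, the field names (sku, brand_raw, brand) are hypothetical, and the mapping table is a plain dict keyed by the lowercased raw brand. JSON Schema validation is omitted because it needs a third-party validator.

```python
import re

# Hypothetical SKU format: three capital letters, a dash, five digits.
SKU_PATTERN = re.compile(r"[A-Z]{3}-\d{5}")


def check_pim_record(record, brand_map, required_fields):
    """Return a list of failures for one transformed record (field names are assumed)."""
    failures = []
    for field in required_fields:
        if record.get(field) is None:
            failures.append(f"{field}: null or missing")
    sku = record.get("sku")
    if sku is not None and not SKU_PATTERN.fullmatch(str(sku)):
        failures.append(f"sku: {sku!r} does not match {SKU_PATTERN.pattern}")
    # Look up the raw brand in the mapping table and compare with the output.
    raw_brand = str(record.get("brand_raw") or "").strip().lower()
    expected = brand_map.get(raw_brand)
    if expected is None:
        failures.append(f"brand_raw: {raw_brand!r} not in mapping table")
    elif record.get("brand") != expected:
        failures.append(f"brand: expected {expected!r}, got {record.get('brand')!r}")
    return failures


brand_map = {"acme corp.": "Acme", "acme": "Acme"}
record = {"sku": "ABC-12345", "brand_raw": " ACME Corp. ", "brand": "Acme", "price": 19.99}
print(check_pim_record(record, brand_map, ["sku", "brand", "price"]))
```

Model names can be checked the same way with a second mapping table.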
Salesforce or CMS sync steps
- Checks: payload schema, required IDs present, no duplicate external keys, dry‑run diff shows only expected fields.
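Two of the sync checks are easy to sketch without touching a real Salesforce or CMS API. The key name external_id and both function names are illustrative assumptions; the dry-run diff here simply compares a record before and after the planned sync.

```python
from collections import Counter


def duplicate_keys(records, key="external_id"):
    """Return external keys that appear on more than one record."""
    counts = Counter(r[key] for r in records if r.get(key) is not None)
    return sorted(k for k, n in counts.items() if n > 1)


def unexpected_changes(before, after, allowed_fields):
    """Return fields whose value changed but that are not in `allowed_fields`."""
    changed = {f for f in set(before) | set(after) if before.get(f) != after.get(f)}
    return sorted(changed - set(allowed_fields))


records = [{"external_id": "A1"}, {"external_id": "B2"}, {"external_id": "A1"}]
print(duplicate_keys(records))

before = {"email": "a@example.com", "phone": "555-0100", "owner": "kim"}
after = {"email": "b@example.com", "phone": "555-0199", "owner": "kim"}
print(unexpected_changes(before, after, allowed_fields=["email"]))
```

Both functions return an empty list when the payload is safe to send, which makes them easy to fold into the same pass/fail report as the other checks.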
Goldens: when to update vs. when to fix
Update the golden only when the new output is objectively better by your rubric. Do not update goldens to paper over regressions. Require a brief note on why the golden changed.
Handle LLM variability without hand‑wringing
Lower the temperature and enable deterministic sampling (for example, a fixed seed) if your platform supports it. Favor structural and policy checks where possible. For free text, use semantic checks with a clear threshold and a short list of allowed deviations.
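A semantic check with a clear threshold can be as small as a cosine similarity between two embedding vectors. This sketch assumes you already have embeddings for the golden and the new output from whatever model your platform offers; it does not call any embedding API. The 0.86 default is the threshold used elsewhere in this guide.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_check(golden_vec, actual_vec, threshold=0.86):
    """Return (passed, score) for one golden/actual pair of embeddings."""
    score = cosine_similarity(golden_vec, actual_vec)
    return score >= threshold, score


# Toy vectors standing in for real embeddings.
passed, score = semantic_check([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print(passed, round(score, 2))
```

Reporting the score alongside the pass/fail flag helps when tuning the threshold, since near-misses are visible instead of hidden behind a boolean.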
Minimal starter kit
/tests
  /fixtures
    /ad_copy
      001.json        # inputs
      001.meta.json   # rules/constraints for this case
  /golden
    /ad_copy
      001.json        # expected output snapshot
  /checks
    schema.json       # JSON Schema for outputs
    policy.yml        # banned terms, brand voice rules
    thresholds.yml    # similarity, length, latency
  README.md           # how to run "Execute Workflow"
Example policy.yml
banned_terms:
- "100% guaranteed"
- "cure"
required_elements:
- "CTA"
tone: ["confident", "helpful"]
Example thresholds.yml
semantic_similarity: 0.86
headline_char_max: 30
latency_ms_p95: 1200
cost_per_run_usd_max: 0.03
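One way to enforce the latency and cost thresholds, sketched under assumptions: p95 is computed with the nearest-rank method, the cost limit is applied to each run individually, and the default limits simply repeat the example values above. The function names are illustrative.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budgets(latencies_ms, costs_usd, latency_ms_p95=1200, cost_per_run_usd_max=0.03):
    """Return a list of failures for one suite run's latency and cost measurements."""
    failures = []
    p95 = percentile_nearest_rank(latencies_ms, 95)
    if p95 > latency_ms_p95:
        failures.append(f"p95 latency {p95} ms exceeds {latency_ms_p95} ms")
    worst = max(costs_usd)
    if worst > cost_per_run_usd_max:
        failures.append(f"cost {worst} USD exceeds {cost_per_run_usd_max} USD per run")
    return failures


print(check_budgets(list(range(1, 101)), [0.01, 0.02]))
```

With small fixture packs a single slow run may not move p95 at all, so consider tracking the maximum as well if outliers matter to you.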
What trips teams up
- Overfitting prompts to pass tests that don’t reflect real work.
- Relying only on similarity scores and ignoring structure, policy, and cost.
- Letting anyone update goldens without review notes.
- Skipping fixtures for “rare” cases that are not rare at all in production.
If you remember one thing
Make it easy to run all checks with a single Execute Workflow. If it is possible to push changes without that step, close off that path.