Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.
Usage:
  -mode similarity   existing MR 5 path, unchanged.
  -mode category     new: scrapes quellen URLs, asks the LLM for
                     {category, opening_hours, description},
                     scores category only.
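A minimal sketch of the mode dispatch, assuming a plain switch on the -mode flag. The helper name dispatch is hypothetical, and the real runSimilarityMode / runCategoryMode calls are replaced by print statements; only the flag name, the similarity default, and the exit code 2 for unknown modes come from this MR.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// dispatch is a hypothetical helper: it returns the process exit code
// for a given -mode value. Unknown modes map to exit code 2.
func dispatch(mode string) int {
	switch mode {
	case "similarity":
		fmt.Println("would call runSimilarityMode")
		return 0
	case "category":
		fmt.Println("would call runCategoryMode")
		return 0
	default:
		fmt.Fprintf(os.Stderr, "unknown -mode %q (want similarity or category)\n", mode)
		return 2
	}
}

func main() {
	mode := flag.String("mode", "similarity", "eval mode: similarity or category")
	flag.Parse()
	if code := dispatch(*mode); code != 0 {
		os.Exit(code)
	}
}
```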
Structure:
- main.go: split into runSimilarityMode + runCategoryMode. Both
share ai.Client construction and the ctx timeout (bumped to 15min
for category mode since scraping adds I/O). Mode dispatched on
-mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
to the similarity files, not shared because the data shapes differ
enough that generics would add more noise than they save. Cache
key is sha256(markt_name_lower|stadt_lower|year|model); separate
from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
exercising the categories we expect the LLM to produce —
mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
URL the enricher will scrape live (first run only; cache takes
over after).
- normalizeCategory: strips casing + German umlauts + the -märkte
plural drift so a correctly-categorised row doesn't get scored
wrong for cosmetic LLM output variation.
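For illustration, normalizeCategory could look roughly like the sketch below. The folding rules are assumptions: umlauts are mapped to their base vowel (ä to a, not ae), ß is mapped to ss, and only the folded "-markte" suffix is collapsed. The actual function in category.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// umlautFolder maps German umlauts and ß to ASCII stand-ins.
// The exact mapping is an assumption for this sketch.
var umlautFolder = strings.NewReplacer("ä", "a", "ö", "o", "ü", "u", "ß", "ss")

// normalizeCategory trims whitespace, lowercases, folds umlauts and
// collapses the "-märkte" plural to "-markt".
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = umlautFolder.Replace(s)
	// After folding, the plural "-märkte" reads "-markte".
	if strings.HasSuffix(s, "markte") {
		s = strings.TrimSuffix(s, "e")
	}
	return s
}

func main() {
	for _, in := range []string{"Weihnachtsmärkte", "  MITTELALTERMARKT ", "Kirchweih"} {
		fmt.Printf("%q -> %q\n", in, normalizeCategory(in))
	}
}
```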
Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
Threshold gate reads accuracy (not F1): category is multi-class, so
precision/recall are per-label quantities and a single F1 would need
an averaging choice.
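A rough sketch of the metrics described above. The type names, the choice to drop errored rows from the accuracy denominator, and the exact row layout (`label=count`, comma-separated) are assumptions; only the `want → predictions` shape, the `!` marker on off-diagonal predictions, and "errors excluded from confusion" come from this MR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// result is one scored fixture row (labels already normalised).
type result struct {
	Want, Got string
	Err       bool // scrape or LLM failure
}

type metrics struct {
	Total, Correct int
	Confusion      map[string]map[string]int // want -> got -> count
}

// score tallies accuracy counts and the confusion matrix, skipping errored rows.
func score(results []result) metrics {
	m := metrics{Confusion: map[string]map[string]int{}}
	for _, r := range results {
		if r.Err {
			continue
		}
		m.Total++
		if r.Want == r.Got {
			m.Correct++
		}
		if m.Confusion[r.Want] == nil {
			m.Confusion[r.Want] = map[string]int{}
		}
		m.Confusion[r.Want][r.Got]++
	}
	return m
}

// formatRow renders one confusion row, marking off-diagonal predictions with "!".
func formatRow(want string, preds map[string]int) string {
	labels := make([]string, 0, len(preds))
	for l := range preds {
		labels = append(labels, l)
	}
	sort.Strings(labels)
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		mark := ""
		if l != want {
			mark = "!"
		}
		parts = append(parts, fmt.Sprintf("%s%s=%d", mark, l, preds[l]))
	}
	return want + " → " + strings.Join(parts, ", ")
}

func main() {
	m := score([]result{
		{Want: "mittelaltermarkt", Got: "mittelaltermarkt"},
		{Want: "mittelaltermarkt", Got: "ritterfest"},
		{Want: "kirchweih", Got: "kirchweih"},
		{Want: "kirchweih", Err: true},
	})
	fmt.Printf("accuracy: %d/%d\n", m.Correct, m.Total)
	wants := make([]string, 0, len(m.Confusion))
	for w := range m.Confusion {
		wants = append(wants, w)
	}
	sort.Strings(wants)
	for _, w := range wants {
		fmt.Println(formatRow(w, m.Confusion[w]))
	}
}
```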
Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.
.gitignore adds .cat-eval-cache.json and cat-eval-report.json.
Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.
discovery-eval
CLI that grades discovery's AI-backed components against labelled fixtures. Two modes:
- -mode similarity (default): MistralSimilarityClassifier on pair-labelled fixtures. Reports precision / recall / F1 / accuracy + a confidence calibration table.
- -mode category: MistralLLMEnricher's category output on row-labelled fixtures. Reports accuracy + a per-label confusion matrix.
File-based cache keeps reruns free. Each mode has its own cache key shape, so switching modes doesn't churn entries.
Run it
Similarity (default)
export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest
go run ./backend/cmd/discovery-eval \
-mode similarity \
-fixture backend/cmd/discovery-eval/fixtures/similarity.json \
-cache .eval-cache.json \
-threshold 0.8 \
-report eval-report.json
Exit code is 1 when F1 < threshold (0 = gating disabled).
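The gating rule can be sketched as a small pure function. The name gateExitCode is hypothetical and not taken from main.go; the behaviour (exit 1 below threshold, threshold 0 disables gating) follows the sentence above.

```go
package main

import "fmt"

// gateExitCode returns 1 when score is below threshold and 0 otherwise.
// A threshold of 0 disables gating entirely.
func gateExitCode(score, threshold float64) int {
	if threshold == 0 {
		return 0
	}
	if score < threshold {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(gateExitCode(0.75, 0.8)) // below threshold
	fmt.Println(gateExitCode(0.85, 0.8)) // at or above threshold
	fmt.Println(gateExitCode(0.10, 0))   // gating disabled
}
```

The same function would serve category mode by passing accuracy instead of F1.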
Category
export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest
go run ./backend/cmd/discovery-eval \
-mode category \
-fixture backend/cmd/discovery-eval/fixtures/category.json \
-cache .cat-eval-cache.json \
-threshold 0.7 \
-report cat-eval-report.json
Category mode scrapes each row's quellen URLs live (first run only; cache
covers subsequent runs) and asks the LLM enricher to produce a category.
The comparison is normalised: differences in casing, German umlauts, and the
-märkte/-markt plural drift are ignored. Exit code is 1 when accuracy < threshold.
Extending the fixture
fixtures/similarity.json is hand-curated. Add pairs that exercise patterns
the classifier has gotten wrong, or edge cases the prompt documentation
mentions but we haven't tested. Prefer real crawler output (anonymised
if needed) over invented pairs — the model's failure modes on real data
are the ones that matter.
Keep pairs deterministic (don't include confidence bands or other
stochastic signals) and ensure each pair's note field explains the
edge case. When you change a pair's labels, purge the cache so the
classifier re-answers.
Cache
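-cache .eval-cache.json stores similarity verdicts keyed by
sha256(name_normalized_a | stadt_a | year_a | name_normalized_b | stadt_b | year_b) + "|" + model.
Category mode uses its own cache file (.cat-eval-cache.json in the example above), keyed by sha256(markt_name_lower | stadt_lower | year | model).
Delete the file to force a full re-run. Changing the model string invalidates every entry automatically, because the model is part of the key.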
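A sketch of how the similarity key could be built, assuming the six fields are joined with "|" before hashing and the digest is hex-encoded. The row type, the pairKey name, and the hex encoding are assumptions; the real SimilarityPairKey may differ in detail.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// row holds the fields that feed the key (hypothetical shape).
type row struct {
	NameNormalized string
	Stadt          string
	Year           int
}

// pairKey hashes both rows and appends the model, so entries written
// under one model are never hit when running with another.
func pairKey(a, b row, model string) string {
	fields := []string{
		a.NameNormalized, a.Stadt, strconv.Itoa(a.Year),
		b.NameNormalized, b.Stadt, strconv.Itoa(b.Year),
	}
	sum := sha256.Sum256([]byte(strings.Join(fields, "|")))
	return hex.EncodeToString(sum[:]) + "|" + model
}

func main() {
	a := row{NameNormalized: "mittelaltermarkt beispielstadt", Stadt: "beispielstadt", Year: 2025}
	b := row{NameNormalized: "ritterfest beispielstadt", Stadt: "beispielstadt", Year: 2025}
	fmt.Println(pairKey(a, b, "mistral-large-latest"))
}
```

Note that this sketch is order-sensitive: pairKey(a, b, m) and pairKey(b, a, m) produce different keys.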
Writes are atomic (temp file + rename), so a crashed run won't corrupt the cache.
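The temp-file-plus-rename pattern can be sketched as follows. The function names and the map[string]string cache shape are assumptions for illustration; the temp file is created in the target directory so the rename stays on one filesystem.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeCacheAtomic writes entries to a temp file next to path, then
// renames it into place. Readers see either the old file or the new one.
func writeCacheAtomic(path string, entries map[string]string) error {
	data, err := json.MarshalIndent(entries, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Cleans up on failure; after a successful rename the temp name no
	// longer exists and the ignored error is harmless.
	defer os.Remove(tmpName)
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// roundTrip writes a one-entry cache into a scratch directory and reads it back.
func roundTrip() (map[string]string, error) {
	dir, err := os.MkdirTemp("", "eval-cache-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, ".eval-cache.json")
	if err := writeCacheAtomic(path, map[string]string{"some-key": "same"}); err != nil {
		return nil, err
	}
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var got map[string]string
	if err := json.Unmarshal(raw, &got); err != nil {
		return nil, err
	}
	return got, nil
}

func main() {
	got, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```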
Interpreting the output
- Precision is about false merges: how often does the classifier say "same" when it isn't? Low precision means auto-merge will fuse distinct markets — hard to recover from.
- Recall is about missed merges: how often does it say "different" when it should have said "same"? Low recall means operators review duplicate rows manually — annoying but safe.
- F1 balances both. The default gate threshold isn't set here — pick a number after an initial baseline run.
- Calibration tells you whether "90% confident" verdicts are actually correct 90% of the time. For downstream auto-decisions, a poorly calibrated model is worse than a less accurate but well-calibrated one.
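As a worked example of the first three metrics, the sketch below computes them from raw counts, treating "same" as the positive class. The counts type and the numbers in main are made up for illustration and are not results from the harness.

```go
package main

import "fmt"

// counts holds the binary confusion counts with "same" as the positive class.
type counts struct{ TP, FP, FN, TN int }

// precision: of all "same" verdicts, how many were right (false merges lower it).
func (c counts) precision() float64 {
	if c.TP+c.FP == 0 {
		return 0
	}
	return float64(c.TP) / float64(c.TP+c.FP)
}

// recall: of all true "same" pairs, how many were found (missed merges lower it).
func (c counts) recall() float64 {
	if c.TP+c.FN == 0 {
		return 0
	}
	return float64(c.TP) / float64(c.TP+c.FN)
}

// f1 is the harmonic mean of precision and recall.
func (c counts) f1() float64 {
	p, r := c.precision(), c.recall()
	if p+r == 0 {
		return 0
	}
	return 2 * p * r / (p + r)
}

func main() {
	c := counts{TP: 8, FP: 2, FN: 4, TN: 6} // illustrative numbers only
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", c.precision(), c.recall(), c.f1())
}
```

With these counts, two false merges cost precision (8/10) while four missed merges cost recall (8/12), and F1 lands between the two.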
Out of scope for now
- Enrichment eval for opening_hours and description. Category is covered by -mode category; scoring the fuzzy text outputs needs its own design and is tracked for a future MR.
- CI wiring. Once we have a baseline F1 the harness can run in GitLab CI with a fixed threshold.