88d0ae9d96247fdc391db4d4af2e070959bc9d86
Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.
Usage:
  -mode similarity: existing MR 5 path, unchanged.
  -mode category:   new. Scrapes quellen URLs, asks the LLM for
                    {category, opening_hours, description}, and
                    scores category only.
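A minimal sketch of how the -mode flag could be dispatched, including the exit status 2 for unknown modes. The dispatch helper, the stubbed run functions and the "similarity" default are assumptions for illustration, not the actual main.go:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Stubs standing in for the real runSimilarityMode / runCategoryMode.
// Each returns a process exit code.
func runSimilarityMode() int { return 0 }
func runCategoryMode() int   { return 0 }

// dispatch maps a -mode value to an exit code (hypothetical helper).
// Unknown modes yield 2.
func dispatch(mode string) int {
	switch mode {
	case "similarity":
		return runSimilarityMode()
	case "category":
		return runCategoryMode()
	default:
		fmt.Fprintf(os.Stderr, "unknown -mode %q (want similarity or category)\n", mode)
		return 2
	}
}

func main() {
	// Defaulting to "similarity" is an assumption made for this sketch.
	mode := flag.String("mode", "similarity", "evaluation mode: similarity or category")
	flag.Parse()
	if code := dispatch(*mode); code != 0 {
		os.Exit(code)
	}
}
```

Returning the exit code from dispatch rather than calling os.Exit inside it keeps the switch testable without spawning a process.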
Structure:
- main.go: split into runSimilarityMode + runCategoryMode. Both
share ai.Client construction and the ctx timeout (bumped to 15min
for category mode since scraping adds I/O). Mode dispatched on
-mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
to the similarity files, not shared because the data shapes differ
enough that generics would add more noise than they save. Cache
key is sha256(markt_name_lower|stadt_lower|year|model); separate
from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
exercising the categories we expect the LLM to produce —
mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
URL the enricher will scrape live (first run only; cache takes
over after).
- normalizeCategory: folds casing, German umlauts and the -märkte
plural drift, so a correctly categorised row isn't scored as wrong
because of cosmetic variation in the LLM output.
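The cache key and the normalisation can be sketched roughly as below. This is an illustration under assumptions, not the code in category.go: the function name categoryCacheKey, the int year, the hex encoding, the ä→a style umlaut mapping and the sample values are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// categoryCacheKey hashes markt_name_lower|stadt_lower|year|model.
// Hex-encoding the digest is an assumption of this sketch.
func categoryCacheKey(marktName, stadt string, year int, model string) string {
	raw := strings.ToLower(marktName) + "|" + strings.ToLower(stadt) +
		"|" + strconv.Itoa(year) + "|" + model
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// umlautFolder maps umlauts to their base letters (assumed mapping;
// the real implementation may transliterate differently).
var umlautFolder = strings.NewReplacer("ä", "a", "ö", "o", "ü", "u", "ß", "ss")

// normalizeCategory trims, lowercases, folds umlauts and maps the
// plural "-märkte" (now "-markte") back to the singular "-markt".
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = umlautFolder.Replace(s)
	if strings.HasSuffix(s, "markte") {
		s = strings.TrimSuffix(s, "markte") + "markt"
	}
	return s
}

func main() {
	// Placeholder row and model names, not taken from the fixture.
	fmt.Println(categoryCacheKey("Beispielmarkt", "Musterstadt", 2025, "model-a"))
	fmt.Println(normalizeCategory("  Mittelaltermärkte "))
}
```

Including the model in the hashed string means a model change misses the cache instead of reusing stale predictions.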
Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
Threshold gate reads accuracy (not F1): category is multi-class, so
precision/recall are per-label and there is no single F1 without
picking an averaging scheme.
Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.
.gitignore adds .cat-eval-cache.json and cat-eval-report.json.
Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.