marktvogt.de

vikingowl 88d0ae9d96 feat(discovery): category eval mode for the LLM enricher

Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.

Usage:
  -mode similarity  — existing MR 5 path, unchanged.
  -mode category    — new: scrapes quellen URLs, asks LLM for
                       {category, opening_hours, description},
                       scores category only.
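
A minimal sketch of how the -mode dispatch could look, assuming the standard flag package. The function names runSimilarityMode and runCategoryMode come from this commit; their int return values, the stub bodies, the helper name runMode, and the default of "similarity" are assumptions made for illustration only.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Placeholder bodies: the real functions build the ai.Client and run the eval.
func runSimilarityMode() int { return 0 }
func runCategoryMode() int   { return 0 }

// runMode (hypothetical helper) maps a -mode value to an exit code.
// Unknown modes return 2, matching the behaviour described in this commit.
func runMode(mode string) int {
	switch mode {
	case "similarity":
		return runSimilarityMode()
	case "category":
		return runCategoryMode()
	default:
		fmt.Fprintf(os.Stderr, "unknown mode %q (want similarity or category)\n", mode)
		return 2
	}
}

func main() {
	// The default value is an assumption; the commit does not state one.
	mode := flag.String("mode", "similarity", "eval mode: similarity or category")
	flag.Parse()
	os.Exit(runMode(*mode))
}
```

Returning the exit code from a helper instead of calling os.Exit inside each branch keeps the dispatch testable without spawning a subprocess.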

Structure
- main.go: split into runSimilarityMode + runCategoryMode. Both
  share ai.Client construction and the ctx timeout (bumped to 15min
  for category mode since scraping adds I/O). Mode dispatched on
  -mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
  to the similarity files, not shared because the data shapes differ
  enough that generics would add more noise than they save. Cache
  key is sha256(markt_name_lower|stadt_lower|year|model); separate
  from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
  exercising the categories we expect the LLM to produce —
  mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
  handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
  URL the enricher will scrape live (first run only; cache takes
  over after).
- normalizeCategory: lowercases, strips German umlauts, and collapses
  the -märkte plural drift so a correctly categorised row isn't
  scored wrong for cosmetic variation in LLM output.
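
The cache key and the normalisation step above can be sketched roughly as follows. This is an illustration, not the code in category.go: the function name categoryCacheKey, its signature, the hex encoding of the digest, and the exact folding rules (ä to a, ö to o, ü to u, ß to ss, then dropping the trailing "e" of "markte") are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// categoryCacheKey (hypothetical name) hashes
// markt_name_lower|stadt_lower|year|model, so a cached answer is
// scoped to the model that produced it.
func categoryCacheKey(marktName, stadt string, year int, model string) string {
	raw := strings.ToLower(marktName) + "|" +
		strings.ToLower(stadt) + "|" +
		strconv.Itoa(year) + "|" + model
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}

// umlautFolder strips umlauts to their base letters (assumed rule).
var umlautFolder = strings.NewReplacer("ä", "a", "ö", "o", "ü", "u", "ß", "ss")

// normalizeCategory trims, lowercases, strips umlauts, and maps the
// plural "...märkte" back to the singular "...markt".
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = umlautFolder.Replace(s)
	if strings.HasSuffix(s, "markte") {
		s = strings.TrimSuffix(s, "e")
	}
	return s
}

func main() {
	fmt.Println(normalizeCategory("  Weihnachtsmärkte "))
	fmt.Println(categoryCacheKey("Beispielmarkt", "Musterstadt", 2026, "example-model"))
}
```

The market, city, and model names in main are made-up sample values.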

Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
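
A rough sketch of the metrics described above, under stated assumptions: the type and function names (result, metrics, computeMetrics, formatRow) are hypothetical, errored rows are assumed to be left out of the accuracy denominator as well as the confusion matrix, and the `label=count` rendering of each prediction is a guess at the terminal format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// result is one graded row; Want and Got are already normalised labels.
type result struct {
	Want, Got string
	Err       bool // scrape or LLM call failed
}

type metrics struct {
	Scored, Correct, Errors int
	Accuracy                float64
	Confusion               map[string]map[string]int // want -> got -> count
}

// computeMetrics counts correct rows and fills the confusion matrix.
// Errored rows are tallied separately and never enter the matrix.
func computeMetrics(rs []result) metrics {
	m := metrics{Confusion: map[string]map[string]int{}}
	for _, r := range rs {
		if r.Err {
			m.Errors++
			continue
		}
		m.Scored++
		if m.Confusion[r.Want] == nil {
			m.Confusion[r.Want] = map[string]int{}
		}
		m.Confusion[r.Want][r.Got]++
		if r.Want == r.Got {
			m.Correct++
		}
	}
	if m.Scored > 0 {
		m.Accuracy = float64(m.Correct) / float64(m.Scored)
	}
	return m
}

// formatRow renders one `want → predictions` line, prefixing
// off-diagonal predictions with "!". Labels are sorted for stable output.
func formatRow(want string, preds map[string]int) string {
	labels := make([]string, 0, len(preds))
	for l := range preds {
		labels = append(labels, l)
	}
	sort.Strings(labels)
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		mark := ""
		if l != want {
			mark = "!"
		}
		parts = append(parts, fmt.Sprintf("%s%s=%d", mark, l, preds[l]))
	}
	return want + " → " + strings.Join(parts, " ")
}

func main() {
	m := computeMetrics([]result{
		{Want: "ritterfest", Got: "ritterfest"},
		{Want: "ritterfest", Got: "mittelaltermarkt"},
		{Want: "kirchweih", Got: "kirchweih"},
		{Want: "schlossfest", Err: true},
	})
	fmt.Printf("accuracy %.2f (%d/%d, %d errors)\n", m.Accuracy, m.Correct, m.Scored, m.Errors)
	fmt.Println(formatRow("ritterfest", m.Confusion["ritterfest"]))
}
```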

Threshold gate reads accuracy (not F1): category is multi-class, so
precision/recall are defined per label and have no single overall
value without an averaging choice.

Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.

.gitignore adds .cat-eval-cache.json and cat-eval-report.json.

Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge), which is
its own design problem.

2026-04-24 12:44:26 +02:00