feat(discovery): category eval mode for the LLM enricher

Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.

Usage:
  -mode similarity  — existing MR 5 path, unchanged.
  -mode category    — new: scrapes quellen URLs, asks LLM for
                       {category, opening_hours, description},
                       scores category only.

Structure
- main.go: split into runSimilarityMode + runCategoryMode. Both
  share ai.Client construction and the ctx timeout (bumped to 15min
  for category mode since scraping adds I/O). Mode dispatched on
  -mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
  to the similarity files, not shared because the data shapes differ
  enough that generics would add more noise than they save. Cache
  key is sha256(markt_name_lower|stadt_lower|year|model); separate
  from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
  exercising the categories we expect the LLM to produce —
  mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
  handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
  URL the enricher will scrape live (first run only; cache takes
  over after).
- normalizeCategory: strips casing + German umlauts + the -märkte
  plural drift so a correctly-categorised row doesn't get scored
  wrong for cosmetic LLM output variation.

Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.

Threshold gate reads accuracy (not F1) — category is multi-class,
precision/recall don't have a single-label meaning.

Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.

.gitignore adds .cat-eval-cache.json and cat-eval-report.json.

Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.
This commit is contained in:
2026-04-24 12:44:26 +02:00
parent 169fa1b3c4
commit 88d0ae9d96
6 changed files with 778 additions and 32 deletions

2
.gitignore vendored
View File

@@ -10,7 +10,9 @@ vendor/
# discovery-eval local caches + generated reports
.eval-cache.json
.cat-eval-cache.json
eval-report.json
cat-eval-report.json
# ── Web ──────────────────────────────────────
/web/node_modules/

View File

@@ -1,24 +1,53 @@
# discovery-eval
CLI that measures the `MistralSimilarityClassifier` against a labelled
fixture of same-/different-market pairs. Reports precision, recall, F1,
accuracy, and confidence calibration. File-based cache keeps reruns free.
CLI that grades discovery's AI-backed components against labelled fixtures.
Two modes:
- `-mode similarity` (default) — `MistralSimilarityClassifier` on pair-
labelled fixtures. Reports precision / recall / F1 / accuracy + a
confidence calibration table.
- `-mode category``MistralLLMEnricher`'s `category` output on row-
labelled fixtures. Reports accuracy + a per-label confusion matrix.
File-based cache keeps reruns free. Each mode has its own cache key shape,
so switching modes doesn't churn entries.
## Run it
### Similarity (default)
```
export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest
go run ./backend/cmd/discovery-eval \
-mode similarity \
-fixture backend/cmd/discovery-eval/fixtures/similarity.json \
-cache .eval-cache.json \
-threshold 0.8 \
-report eval-report.json
```
Exit code is 1 when `F1 < threshold` (0 = gating disabled). That makes it
usable as a CI regression gate once a baseline F1 is known.
Exit code is 1 when `F1 < threshold` (0 = gating disabled).
### Category
```
export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest
go run ./backend/cmd/discovery-eval \
-mode category \
-fixture backend/cmd/discovery-eval/fixtures/category.json \
-cache .cat-eval-cache.json \
-threshold 0.7 \
-report cat-eval-report.json
```
Category mode scrapes each row's `quellen` URLs live (first run only; cache
covers subsequent runs) and asks the LLM enricher to produce a category.
Normalised comparison: casing + German umlauts + the -märkte/-markt plural
drift are all treated as equal. Exit code is 1 when `accuracy < threshold`.
## Extending the fixture

View File

@@ -0,0 +1,357 @@
package main
import (
"context"
"crypto/sha256"
"encoding/hex"
"encoding/json"
"errors"
"fmt"
"io"
"io/fs"
"log/slog"
"os"
"path/filepath"
"sort"
"strings"
"marktvogt.de/backend/internal/domain/discovery/enrich"
)
// CategoryFixture is the parsed shape of fixtures/category.json — labelled
// rows where `expected_category` is the ground truth for the MistralLLMEnricher's
// output category field.
type CategoryFixture struct {
Rows []CategoryRow `json:"rows"`
}
// CategoryRow is one ground-truth example. quellen is the list of source URLs
// the enricher will scrape. `expected_category` is the operator's judgement
// of what the correct German label should be — normalised before comparison
// so case/umlaut drift doesn't falsely grade as wrong.
type CategoryRow struct {
MarktName string `json:"markt_name"`
Stadt string `json:"stadt"`
Bundesland string `json:"bundesland,omitempty"`
Land string `json:"land,omitempty"`
Year int `json:"year,omitempty"`
Quellen []string `json:"quellen"`
ExpectedCategory string `json:"expected_category"`
Note string `json:"note,omitempty"`
}
func loadCategoryFixture(path string) (*CategoryFixture, error) {
data, err := os.ReadFile(path)
if err != nil {
return nil, fmt.Errorf("read fixture: %w", err)
}
var f CategoryFixture
if err := json.Unmarshal(data, &f); err != nil {
return nil, fmt.Errorf("parse fixture: %w", err)
}
if len(f.Rows) == 0 {
return nil, fmt.Errorf("fixture has no rows")
}
return &f, nil
}
// CategoryCache is the category mode's sibling to Cache. Keyed on the row's
// content tuple + model so a model bump forces a refresh.
type CategoryCache struct {
Entries map[string]enrich.Enrichment `json:"entries"`
}
func newCategoryCache() *CategoryCache {
return &CategoryCache{Entries: map[string]enrich.Enrichment{}}
}
// categoryCacheKey hashes (markt_name_lower|stadt_lower|year|model). Separate
// from SimilarityPairKey because that function takes two rows and sorts —
// here we key on one row.
func categoryCacheKey(r CategoryRow, model string) string {
raw := fmt.Sprintf("%s|%s|%d|%s",
strings.ToLower(r.MarktName), strings.ToLower(r.Stadt), r.Year, model)
sum := sha256.Sum256([]byte(raw))
return hex.EncodeToString(sum[:])
}
func (c *CategoryCache) Get(r CategoryRow, model string) (enrich.Enrichment, bool) {
v, ok := c.Entries[categoryCacheKey(r, model)]
return v, ok
}
func (c *CategoryCache) Put(r CategoryRow, model string, v enrich.Enrichment) {
c.Entries[categoryCacheKey(r, model)] = v
}
func loadCategoryCache(path string) (*CategoryCache, error) {
data, err := os.ReadFile(path)
if err != nil {
if errors.Is(err, fs.ErrNotExist) {
return newCategoryCache(), nil
}
return nil, fmt.Errorf("read cache: %w", err)
}
c := newCategoryCache()
if err := json.Unmarshal(data, c); err != nil {
return newCategoryCache(), fmt.Errorf("parse cache (starting empty): %w", err)
}
if c.Entries == nil {
c.Entries = map[string]enrich.Enrichment{}
}
return c, nil
}
// saveCategoryCache is the same atomic-write pattern as saveCache. Duplicated
// rather than generic-ed because Go generics on JSON types add more noise
// than they save for two call sites.
func saveCategoryCache(path string, c *CategoryCache) error {
data, err := json.MarshalIndent(c, "", " ")
if err != nil {
return fmt.Errorf("marshal cache: %w", err)
}
dir := filepath.Dir(path)
if dir == "" {
dir = "."
}
tmp, err := os.CreateTemp(dir, ".cat-cache-*.tmp")
if err != nil {
return fmt.Errorf("create tmp: %w", err)
}
tmpPath := tmp.Name()
if _, err := tmp.Write(data); err != nil {
_ = tmp.Close()
_ = os.Remove(tmpPath)
return fmt.Errorf("write tmp: %w", err)
}
if err := tmp.Close(); err != nil {
_ = os.Remove(tmpPath)
return fmt.Errorf("close tmp: %w", err)
}
if err := os.Rename(tmpPath, path); err != nil {
_ = os.Remove(tmpPath)
return fmt.Errorf("rename tmp: %w", err)
}
return nil
}
// normalizeCategory strips casing drift + German umlauts for comparison.
// "Mittelaltermarkt" == "mittelaltermarkt" == "Mittelaltermärkte" (last one
// loses the 'e' pluralisation — too aggressive for identity, but good enough
// for categorical matching when the LLM occasionally emits plurals).
func normalizeCategory(s string) string {
s = strings.TrimSpace(strings.ToLower(s))
replacer := strings.NewReplacer(
"ä", "a", "ö", "o", "ü", "u", "ß", "ss",
)
s = replacer.Replace(s)
// Drop trailing 'e' on plurals (märkte → markte → markt). Only a light
// heuristic — applied only when stripping produces a known stem.
if strings.HasSuffix(s, "markte") {
s = strings.TrimSuffix(s, "e")
}
return s
}
// CategoryResult mirrors Result for the category mode. `Got` is the raw
// category the LLM returned (before normalisation) so mismatches stay
// legible in the report.
type CategoryResult struct {
Row CategoryRow
Got string
Want string
Correct bool
FromCache bool
// Err records the scrape/LLM failure message when the run couldn't
// produce a category at all; scored as not-correct.
Err string
}
// runCategory is the category-mode equivalent of run(). Uses the real
// MistralLLMEnricher — that's the whole point of category eval (scrape +
// LLM against labelled outputs).
func runCategory(
ctx context.Context,
enricher enrich.LLMEnricher,
cache *CategoryCache,
fixture *CategoryFixture,
model string,
) ([]CategoryResult, error) {
results := make([]CategoryResult, 0, len(fixture.Rows))
for i, r := range fixture.Rows {
if err := ctx.Err(); err != nil {
return results, err
}
if v, ok := cache.Get(r, model); ok {
results = append(results, scoreCategoryResult(r, v, "", true))
continue
}
req := enrich.LLMRequest{
MarktName: r.MarktName,
Stadt: r.Stadt,
Land: r.Land,
Bundesland: r.Bundesland,
Quellen: r.Quellen,
}
got, err := enricher.EnrichMissing(ctx, req)
if err != nil {
slog.Warn("enrich failed; scoring as incorrect",
"row_index", i, "markt", r.MarktName, "error", err)
results = append(results, CategoryResult{
Row: r,
Want: r.ExpectedCategory,
Err: err.Error(),
})
continue
}
cache.Put(r, model, got)
results = append(results, scoreCategoryResult(r, got, "", false))
}
return results, nil
}
func scoreCategoryResult(r CategoryRow, got enrich.Enrichment, errMsg string, fromCache bool) CategoryResult {
gotCat := strings.TrimSpace(got.Category)
return CategoryResult{
Row: r,
Got: gotCat,
Want: r.ExpectedCategory,
Correct: normalizeCategory(gotCat) == normalizeCategory(r.ExpectedCategory),
FromCache: fromCache,
Err: errMsg,
}
}
// CategoryMetrics summarises a category-mode run. Unlike similarity (binary,
// P/R/F1 matter), categorical eval cares about accuracy + per-category
// confusion — which labels get mixed up.
type CategoryMetrics struct {
Total int `json:"total"`
Correct int `json:"correct"`
Incorrect int `json:"incorrect"`
Errors int `json:"errors"` // rows that failed to produce any category
CacheHits int `json:"cache_hits"`
LLMCalls int `json:"llm_calls"`
// Confusion: label → {predicted label → count}. Excludes errored rows.
Confusion map[string]map[string]int `json:"confusion"`
Accuracy float64 `json:"accuracy"`
}
func computeCategoryMetrics(results []CategoryResult) CategoryMetrics {
m := CategoryMetrics{
Total: len(results),
Confusion: map[string]map[string]int{},
}
if m.Total == 0 {
return m
}
for _, r := range results {
if r.FromCache {
m.CacheHits++
} else {
m.LLMCalls++
}
if r.Err != "" {
m.Errors++
continue
}
want := normalizeCategory(r.Want)
got := normalizeCategory(r.Got)
if r.Correct {
m.Correct++
} else {
m.Incorrect++
}
if _, ok := m.Confusion[want]; !ok {
m.Confusion[want] = map[string]int{}
}
m.Confusion[want][got]++
}
m.Accuracy = float64(m.Correct) / float64(m.Total)
return m
}
// printCategorySummary writes a human-readable report. Mirrors printSummary's
// shape so operators switching modes see the same vocabulary.
func printCategorySummary(w io.Writer, results []CategoryResult, m CategoryMetrics, model string) {
wf(w, "\n=== discovery-eval (category) ===\n")
wf(w, "model: %s\n", model)
wf(w, "rows: %d\n", m.Total)
wf(w, "cache: %d hits, %d llm calls\n", m.CacheHits, m.LLMCalls)
wf(w, "\n")
wf(w, "correct: %d\n", m.Correct)
wf(w, "incorrect: %d\n", m.Incorrect)
wf(w, "errors: %d\n", m.Errors)
wf(w, "accuracy: %.3f\n", m.Accuracy)
// Confusion matrix — one row per expected label showing how predictions
// distributed. Labels sorted alphabetically for stable output.
if len(m.Confusion) > 0 {
labels := make([]string, 0, len(m.Confusion))
for k := range m.Confusion {
labels = append(labels, k)
}
sort.Strings(labels)
wf(w, "\nconfusion (want → predictions):\n")
for _, want := range labels {
preds := m.Confusion[want]
predKeys := make([]string, 0, len(preds))
for k := range preds {
predKeys = append(predKeys, k)
}
sort.Strings(predKeys)
wf(w, " %-24s", want)
first := true
for _, p := range predKeys {
if !first {
wf(w, ", ")
}
marker := ""
if p != want {
marker = "!"
}
wf(w, "%s%s×%d", marker, p, preds[p])
first = false
}
wf(w, "\n")
}
}
// Surface individual mistakes so the operator can patch the prompt or fixture.
wrong := make([]CategoryResult, 0)
for _, r := range results {
if !r.Correct || r.Err != "" {
wrong = append(wrong, r)
}
}
if len(wrong) > 0 {
wf(w, "\nmismatches (%d):\n", len(wrong))
for _, r := range wrong {
if r.Err != "" {
wf(w, " ERROR %q (%s): %s\n", r.Row.MarktName, r.Row.Stadt, r.Err)
continue
}
wf(w, " want=%q got=%q %q (%s)\n", r.Want, r.Got, r.Row.MarktName, r.Row.Stadt)
}
}
wf(w, "\n")
}
// CategoryReport is the on-disk shape for -report in category mode.
type CategoryReport struct {
Mode string `json:"mode"`
Model string `json:"model"`
Metrics CategoryMetrics `json:"metrics"`
Results []CategoryResult `json:"results"`
}
func writeCategoryReport(path string, results []CategoryResult, m CategoryMetrics, model string) error {
rep := CategoryReport{Mode: "category", Model: model, Metrics: m, Results: results}
data, err := json.MarshalIndent(rep, "", " ")
if err != nil {
return err
}
return os.WriteFile(path, data, 0o644)
}

View File

@@ -0,0 +1,171 @@
package main
import (
"os"
"path/filepath"
"testing"
"marktvogt.de/backend/internal/domain/discovery/enrich"
)
func TestNormalizeCategory_UmlautAndCase(t *testing.T) {
cases := []struct {
in, want string
}{
{"Mittelaltermarkt", "mittelaltermarkt"},
{"MITTELALTERMARKT", "mittelaltermarkt"},
{"Mittelaltermärkte", "mittelaltermarkt"},
{"Weihnachtsmarkt", "weihnachtsmarkt"},
{"Schönbrunn-Fest", "schonbrunn-fest"},
{"weißwurst", "weisswurst"},
{" Ritterfest ", "ritterfest"},
{"", ""},
}
for _, c := range cases {
t.Run(c.in, func(t *testing.T) {
got := normalizeCategory(c.in)
if got != c.want {
t.Errorf("normalizeCategory(%q) = %q; want %q", c.in, got, c.want)
}
})
}
}
func TestScoreCategoryResult_CaseInsensitive(t *testing.T) {
// Same category but casing/diacritic drift — must be counted as correct.
r := CategoryRow{MarktName: "X", Stadt: "Y", ExpectedCategory: "mittelaltermarkt"}
got := scoreCategoryResult(r, enrich.Enrichment{Category: "Mittelaltermarkt"}, "", false)
if !got.Correct {
t.Errorf("casing drift should score as correct, got %+v", got)
}
}
func TestScoreCategoryResult_UmlautDrift(t *testing.T) {
r := CategoryRow{ExpectedCategory: "mittelaltermarkt"}
got := scoreCategoryResult(r, enrich.Enrichment{Category: "Mittelaltermärkte"}, "", false)
if !got.Correct {
t.Errorf("umlaut + plural drift should normalise to correct, got %+v", got)
}
}
func TestScoreCategoryResult_WrongLabel(t *testing.T) {
r := CategoryRow{ExpectedCategory: "mittelaltermarkt"}
got := scoreCategoryResult(r, enrich.Enrichment{Category: "weihnachtsmarkt"}, "", false)
if got.Correct {
t.Errorf("distinct labels must not score as correct: %+v", got)
}
}
func TestComputeCategoryMetrics_BasicAccuracy(t *testing.T) {
results := []CategoryResult{
{Row: CategoryRow{ExpectedCategory: "a"}, Got: "a", Want: "a", Correct: true},
{Row: CategoryRow{ExpectedCategory: "a"}, Got: "b", Want: "a", Correct: false},
{Row: CategoryRow{ExpectedCategory: "c"}, Got: "c", Want: "c", Correct: true, FromCache: true},
}
m := computeCategoryMetrics(results)
if m.Total != 3 || m.Correct != 2 || m.Incorrect != 1 || m.Errors != 0 {
t.Errorf("counts wrong: %+v", m)
}
if m.CacheHits != 1 || m.LLMCalls != 2 {
t.Errorf("cache accounting wrong: hits=%d calls=%d", m.CacheHits, m.LLMCalls)
}
if m.Accuracy < 0.666 || m.Accuracy > 0.667 {
t.Errorf("accuracy = %v; want ~0.666", m.Accuracy)
}
// Confusion should have a[a]=1, a[b]=1, c[c]=1
if m.Confusion["a"]["a"] != 1 || m.Confusion["a"]["b"] != 1 || m.Confusion["c"]["c"] != 1 {
t.Errorf("confusion matrix unexpected: %+v", m.Confusion)
}
}
func TestComputeCategoryMetrics_ErrorsExcludedFromConfusion(t *testing.T) {
results := []CategoryResult{
{Row: CategoryRow{ExpectedCategory: "a"}, Want: "a", Err: "network down"},
{Row: CategoryRow{ExpectedCategory: "a"}, Got: "a", Want: "a", Correct: true},
}
m := computeCategoryMetrics(results)
if m.Errors != 1 {
t.Errorf("errors = %d; want 1", m.Errors)
}
// Only the non-errored row should appear in confusion.
total := 0
for _, inner := range m.Confusion {
for _, v := range inner {
total += v
}
}
if total != 1 {
t.Errorf("confusion should exclude errors; total=%d", total)
}
}
func TestCategoryCache_ModelScoped(t *testing.T) {
c := newCategoryCache()
r := CategoryRow{MarktName: "x", Stadt: "y", Year: 2026}
c.Put(r, "m1", enrich.Enrichment{Category: "mittelaltermarkt"})
if _, ok := c.Get(r, "m2"); ok {
t.Error("cache hit under different model; should be a miss")
}
if v, ok := c.Get(r, "m1"); !ok || v.Category != "mittelaltermarkt" {
t.Error("cache miss on exact model match")
}
}
func TestCategoryCache_RoundTrip(t *testing.T) {
dir := t.TempDir()
path := filepath.Join(dir, "cat-cache.json")
c := newCategoryCache()
r := CategoryRow{MarktName: "Markt X", Stadt: "Dresden", Year: 2026}
c.Put(r, "m", enrich.Enrichment{
Category: "mittelaltermarkt",
Description: "ein Markt",
})
if err := saveCategoryCache(path, c); err != nil {
t.Fatal(err)
}
loaded, err := loadCategoryCache(path)
if err != nil {
t.Fatal(err)
}
v, ok := loaded.Get(r, "m")
if !ok || v.Category != "mittelaltermarkt" {
t.Errorf("round-trip lost data: %+v ok=%v", v, ok)
}
}
func TestLoadCategoryCache_MissingAndCorrupt(t *testing.T) {
// Missing file → empty cache, no error.
c, err := loadCategoryCache(filepath.Join(t.TempDir(), "missing.json"))
if err != nil || c == nil || c.Entries == nil {
t.Errorf("missing file should yield empty cache: err=%v", err)
}
// Corrupt file → empty cache + parse error reported.
dir := t.TempDir()
path := filepath.Join(dir, "cache.json")
if err := os.WriteFile(path, []byte("{garbage"), 0o644); err != nil {
t.Fatal(err)
}
c2, err := loadCategoryCache(path)
if err == nil {
t.Error("expected parse error so operator can investigate")
}
if c2 == nil || c2.Entries == nil {
t.Error("corrupt file should still return usable empty cache")
}
}
func TestCategoryCacheKey_Stable(t *testing.T) {
r := CategoryRow{MarktName: "Markt X", Stadt: "Dresden", Year: 2026}
k1 := categoryCacheKey(r, "m")
k2 := categoryCacheKey(r, "m")
if k1 != k2 {
t.Error("cache key must be deterministic")
}
// Different year → different key.
r2 := r
r2.Year = 2027
if categoryCacheKey(r2, "m") == k1 {
t.Error("year change should produce different key")
}
}

View File

@@ -0,0 +1,94 @@
{
"rows": [
{
"markt_name": "Mittelaltermarkt Dresden",
"stadt": "Dresden",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.marktkalendarium.de/markt/dresden-mittelaltermarkt"],
"expected_category": "mittelaltermarkt",
"note": "baseline: unambiguous Mittelaltermarkt"
},
{
"markt_name": "Dresdner Striezelmarkt",
"stadt": "Dresden",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://striezelmarkt.dresden.de/"],
"expected_category": "weihnachtsmarkt",
"note": "historical proper name; LLM must recognise Striezelmarkt as Christmas market"
},
{
"markt_name": "Kaiser-Ludwig-Markt Landsberg",
"stadt": "Landsberg am Lech",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.landsberg.de/kaiser-ludwig-markt"],
"expected_category": "mittelaltermarkt",
"note": "themed medieval market"
},
{
"markt_name": "Ritterfest Burg Stolpen",
"stadt": "Stolpen",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.burg-stolpen.org/ritterfest"],
"expected_category": "ritterfest",
"note": "knight-themed festival, not general market"
},
{
"markt_name": "Weihnachtsmarkt Nürnberg",
"stadt": "Nürnberg",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.christkindlesmarkt.de/"],
"expected_category": "weihnachtsmarkt",
"note": "the canonical German christmas market"
},
{
"markt_name": "Handwerkermarkt Rothenburg",
"stadt": "Rothenburg ob der Tauber",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.rothenburg.de/handwerkermarkt"],
"expected_category": "handwerkermarkt",
"note": "craft market; not medieval-themed"
},
{
"markt_name": "Schlossfest Schönbrunn",
"stadt": "Wien",
"land": "Oesterreich",
"year": 2026,
"quellen": ["https://www.schoenbrunn.at/schlossfest"],
"expected_category": "schlossfest",
"note": "castle festival, broader than market — LLM should not default to mittelaltermarkt"
},
{
"markt_name": "Ritterturnier Burg Kreuzenstein",
"stadt": "Leobendorf",
"land": "Oesterreich",
"year": 2026,
"quellen": ["https://www.burg-kreuzenstein.com/ritterturnier"],
"expected_category": "ritterturnier",
"note": "jousting event; distinct from market / fest"
},
{
"markt_name": "Kirchweih Fürth",
"stadt": "Fürth",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.fuerth.de/kirchweih"],
"expected_category": "kirchweih",
"note": "traditional Bavarian parish fair"
},
{
"markt_name": "Mittelaltermarkt auf der Ronneburg",
"stadt": "Ronneburg",
"land": "Deutschland",
"year": 2026,
"quellen": ["https://www.ronneburg.de/mittelaltermarkt"],
"expected_category": "mittelaltermarkt",
"note": "venue-prefixed mittelaltermarkt"
}
]
}

View File

@@ -1,18 +1,24 @@
// discovery-eval measures the MistralSimilarityClassifier against a labelled
// fixture. Reports precision/recall/F1/accuracy + a confidence calibration
// table, optionally gated on an F1 threshold for CI use.
// discovery-eval measures discovery's AI-backed components against labelled
// fixtures. Two modes:
//
// -mode similarity (default) — grades MistralSimilarityClassifier on
// pair-labelled fixtures. Precision/recall/F1/accuracy
// + confidence calibration.
// -mode category — grades MistralLLMEnricher's `category` output on
// row-labelled fixtures. Accuracy + per-label confusion.
//
// Usage:
//
// AI_API_KEY=... AI_MODEL_COMPLEX=mistral-large-latest \
// discovery-eval \
// -mode similarity \
// -fixture backend/cmd/discovery-eval/fixtures/similarity.json \
// -cache .eval-cache.json \
// -threshold 0.8 \
// -report eval-report.json
//
// The cache file is keyed on (pair_key, model) — rerunning against the same
// model+fixtures is free. Bump the model or edit a fixture to force a refresh.
// Each mode has its own cache key so switching modes doesn't churn entries.
// Bump AI_MODEL_COMPLEX or edit a fixture to force a refresh.
package main
import (
@@ -25,6 +31,12 @@ import (
"marktvogt.de/backend/internal/domain/discovery/enrich"
"marktvogt.de/backend/internal/pkg/ai"
"marktvogt.de/backend/internal/pkg/scrape"
)
const (
modeSimilarity = "similarity"
modeCategory = "category"
)
// realMain returns the desired exit code. Kept separate from main() so
@@ -36,63 +48,144 @@ func realMain() int {
})))
var (
fixturePath = flag.String("fixture", "backend/cmd/discovery-eval/fixtures/similarity.json", "path to labelled fixture JSON")
mode = flag.String("mode", modeSimilarity, "eval mode: similarity | category")
fixturePath = flag.String("fixture", "", "path to labelled fixture JSON (defaults per mode)")
cachePath = flag.String("cache", ".eval-cache.json", "path to local verdict cache (gitignored)")
reportPath = flag.String("report", "", "optional path to write machine-readable JSON report")
threshold = flag.Float64("threshold", 0.0, "fail (exit 1) when F1 is below this value; 0 disables gating")
threshold = flag.Float64("threshold", 0.0, "fail (exit 1) when F1/accuracy is below this value; 0 disables gating")
)
flag.Parse()
fixture, err := loadFixture(*fixturePath)
if err != nil {
slog.Error("load fixture", "path", *fixturePath, "error", err)
return 2
}
slog.Info("loaded fixture", "pairs", len(fixture.Pairs), "path", *fixturePath)
apiKey := os.Getenv("AI_API_KEY")
model := os.Getenv("AI_MODEL_COMPLEX")
if model == "" {
model = "mistral-large-latest"
}
userAgent := os.Getenv("AI_USER_AGENT")
if userAgent == "" {
userAgent = "marktvogt-eval/1.0 (+https://marktvogt.de)"
}
client := ai.New(apiKey, "", model, 1.0)
if !client.Enabled() {
slog.Error("AI client not configured (set AI_API_KEY)")
return 2
}
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
defer cancel()
switch *mode {
case modeSimilarity:
return runSimilarityMode(ctx, client, model, pathWithDefault(*fixturePath, "backend/cmd/discovery-eval/fixtures/similarity.json"), *cachePath, *reportPath, *threshold)
case modeCategory:
scraper := scrape.New(userAgent)
enricher := enrich.NewMistralLLMEnricher(client, scraper)
return runCategoryMode(ctx, enricher, model, pathWithDefault(*fixturePath, "backend/cmd/discovery-eval/fixtures/category.json"), *cachePath, *reportPath, *threshold)
default:
slog.Error("unknown mode", "mode", *mode, "valid", []string{modeSimilarity, modeCategory})
return 2
}
}
func pathWithDefault(p, dflt string) string {
if p == "" {
return dflt
}
return p
}
// runSimilarityMode is the original MR 5 eval path, lifted out of main() so
// the mode switch stays readable.
func runSimilarityMode(
ctx context.Context,
client *ai.Client,
model, fixturePath, cachePath, reportPath string,
threshold float64,
) int {
fixture, err := loadFixture(fixturePath)
if err != nil {
slog.Error("load fixture", "path", fixturePath, "error", err)
return 2
}
slog.Info("loaded fixture", "mode", modeSimilarity, "pairs", len(fixture.Pairs), "path", fixturePath)
classifier := enrich.NewMistralSimilarityClassifier(client)
cache, err := loadCache(*cachePath)
cache, err := loadCache(cachePath)
if err != nil {
slog.Warn("cache load failed; starting empty", "path", *cachePath, "error", err)
slog.Warn("cache load failed; starting empty", "path", cachePath, "error", err)
cache = newCache()
}
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
defer cancel()
results, err := run(ctx, classifier, cache, fixture, model)
if err != nil {
slog.Error("eval run failed", "error", err)
return 2
}
if err := saveCache(*cachePath, cache); err != nil {
// Non-fatal — results still computed; next run just pays again.
slog.Warn("cache save failed; metrics still reported", "path", *cachePath, "error", err)
if err := saveCache(cachePath, cache); err != nil {
slog.Warn("cache save failed; metrics still reported", "path", cachePath, "error", err)
}
metrics := computeMetrics(results)
printSummary(os.Stdout, results, metrics, model)
if *reportPath != "" {
if err := writeReport(*reportPath, results, metrics, model); err != nil {
slog.Warn("report write failed", "path", *reportPath, "error", err)
if reportPath != "" {
if err := writeReport(reportPath, results, metrics, model); err != nil {
slog.Warn("report write failed", "path", reportPath, "error", err)
}
}
if *threshold > 0 && metrics.F1 < *threshold {
fmt.Fprintf(os.Stderr, "\nFAIL: F1=%.3f < threshold=%.3f\n", metrics.F1, *threshold)
if threshold > 0 && metrics.F1 < threshold {
fmt.Fprintf(os.Stderr, "\nFAIL: F1=%.3f < threshold=%.3f\n", metrics.F1, threshold)
return 1
}
return 0
}
// runCategoryMode grades MistralLLMEnricher's category field against a
// labelled fixture. Uses its own cache shape (CategoryCache) so the
// similarity and category runs don't collide on disk.
func runCategoryMode(
ctx context.Context,
enricher enrich.LLMEnricher,
model, fixturePath, cachePath, reportPath string,
threshold float64,
) int {
fixture, err := loadCategoryFixture(fixturePath)
if err != nil {
slog.Error("load fixture", "path", fixturePath, "error", err)
return 2
}
slog.Info("loaded fixture", "mode", modeCategory, "rows", len(fixture.Rows), "path", fixturePath)
cache, err := loadCategoryCache(cachePath)
if err != nil {
slog.Warn("cache load failed; starting empty", "path", cachePath, "error", err)
cache = newCategoryCache()
}
results, err := runCategory(ctx, enricher, cache, fixture, model)
if err != nil {
slog.Error("eval run failed", "error", err)
return 2
}
if err := saveCategoryCache(cachePath, cache); err != nil {
slog.Warn("cache save failed; metrics still reported", "path", cachePath, "error", err)
}
metrics := computeCategoryMetrics(results)
printCategorySummary(os.Stdout, results, metrics, model)
if reportPath != "" {
if err := writeCategoryReport(reportPath, results, metrics, model); err != nil {
slog.Warn("report write failed", "path", reportPath, "error", err)
}
}
if threshold > 0 && metrics.Accuracy < threshold {
fmt.Fprintf(os.Stderr, "\nFAIL: accuracy=%.3f < threshold=%.3f\n", metrics.Accuracy, threshold)
return 1
}
return 0