vikingowl cf5408ab66 feat(discovery): eval harness for the AI similarity classifier
Ship 2 MR 5. Adds a CLI that measures MistralSimilarityClassifier
against a labelled fixture: precision, recall, F1, accuracy, plus a
confidence calibration table so we can tell whether "90% confident"
verdicts are actually right 90% of the time.

Usage: go run ./backend/cmd/discovery-eval -fixture ... -cache ...
-threshold 0.8 -report eval-report.json.

Structure
- main.go: arg parsing + wiring (ai.Client, classifier, cache,
  metrics). The work happens in realMain(), which returns an exit code
  so that main() can exit only after deferred cleanup has run,
  including on error paths.
- fixture.go: parses labelled pairs JSON. Fixture authors only need to
  fill in name/stadt/year; name_normalized falls back to name when
  omitted.
- cache.go: file-backed map keyed by SimilarityPairKey + model string.
  Symmetric (a,b) == (b,a). Atomic writes (temp file + rename) so a
  crashed run cannot corrupt the cache. Corrupt-file load returns an
  empty usable cache and reports the parse error.
- run.go: executes each pair through the classifier, populating the
  cache. Individual classify errors are downgraded to "not correct"
  and logged — the run always finishes so the operator sees whatever
  data is available.
- metrics.go: confusion matrix, P/R/F1/accuracy, per-confidence-
  bucket calibration ([0-0.5), [0.5-0.75), [0.75-0.9), [0.9-1.0]).
  Prints human summary + surfaces highest-confidence mismatches
  first (most actionable for prompt iteration). Optional JSON report.
- Threshold gate: -threshold N exits non-zero when F1<N. Default 0
  (gating disabled until we have a baseline F1).
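
A minimal sketch of the arithmetic behind the metrics and threshold-gate bullets above, not the code in this MR. The names Result, Metrics, compute, bucket and exitCode are hypothetical; the bucket boundaries and the "exit non-zero when F1<N" rule are taken from the description, and returning 0 instead of NaN for an empty denominator is an assumption about how the no-NaN case is handled.

```go
package main

import "fmt"

// Result is a hypothetical per-pair outcome: the classifier's verdict,
// the ground-truth label from the fixture, and the reported confidence.
type Result struct {
	Predicted  bool
	Expected   bool
	Confidence float64
}

// Metrics holds the confusion matrix and the scores derived from it.
type Metrics struct {
	TP, FP, TN, FN                  int
	Precision, Recall, F1, Accuracy float64
}

// ratio returns num/den, or 0 when den is 0, so a run with no positive
// predictions (or no input at all) yields 0 rather than NaN.
func ratio(num, den float64) float64 {
	if den == 0 {
		return 0
	}
	return num / den
}

// compute tallies the confusion matrix and derives P/R/F1/accuracy.
func compute(results []Result) Metrics {
	var m Metrics
	for _, r := range results {
		switch {
		case r.Predicted && r.Expected:
			m.TP++
		case r.Predicted && !r.Expected:
			m.FP++
		case !r.Predicted && r.Expected:
			m.FN++
		default:
			m.TN++
		}
	}
	m.Precision = ratio(float64(m.TP), float64(m.TP+m.FP))
	m.Recall = ratio(float64(m.TP), float64(m.TP+m.FN))
	m.F1 = ratio(2*m.Precision*m.Recall, m.Precision+m.Recall)
	m.Accuracy = ratio(float64(m.TP+m.TN), float64(len(results)))
	return m
}

// bucket maps a confidence to one of the four calibration ranges.
// The last range is closed on both ends, so 1.0 lands in it.
func bucket(c float64) string {
	switch {
	case c < 0.5:
		return "[0-0.5)"
	case c < 0.75:
		return "[0.5-0.75)"
	case c < 0.9:
		return "[0.75-0.9)"
	default:
		return "[0.9-1.0]"
	}
}

// exitCode applies the threshold gate: non-zero when F1 < threshold.
// With the default threshold of 0 the gate can never fire.
func exitCode(f1, threshold float64) int {
	if f1 < threshold {
		return 1
	}
	return 0
}

func main() {
	m := compute([]Result{
		{Predicted: true, Expected: true, Confidence: 0.95},
		{Predicted: true, Expected: false, Confidence: 0.80},
		{Predicted: false, Expected: true, Confidence: 0.60},
		{Predicted: false, Expected: false, Confidence: 0.30},
	})
	fmt.Printf("P=%.2f R=%.2f F1=%.2f acc=%.2f\n", m.Precision, m.Recall, m.F1, m.Accuracy)
	fmt.Println(bucket(0.80), exitCode(m.F1, 0.8))
}
```

Calibration then amounts to grouping results by bucket(r.Confidence) and comparing each group's share of correct verdicts with the range it claims.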

Fixture: seeds 15 hand-crafted DACH-market pairs covering the edge
cases we actually care about: umlaut/ß spelling drift
(Straßburg/Strassburg), year difference on a recurring series,
word reordering, distinct events at the same venue, historical proper
names (Striezelmarkt), and the same city with multiple distinct
Christmas markets. The operator extends the fixture over time; each
pair carries a `note` explaining the case it locks in.
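
For illustration, a fixture entry and the name_normalized fallback might look like the sketch below. The sample pair is made up for this example and is not copied from the real fixture; the row, pair, parsePairs and fillNormalized names are hypothetical, and where the real fallback is applied is not stated here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// row and pair are hypothetical stand-ins using the JSON field names
// mentioned in the description (name/stadt/year/name_normalized).
type row struct {
	Name           string `json:"name"`
	Stadt          string `json:"stadt"`
	Year           int    `json:"year"`
	NameNormalized string `json:"name_normalized"`
}

type pair struct {
	A    row    `json:"a"`
	B    row    `json:"b"`
	Same bool   `json:"same"`
	Note string `json:"note,omitempty"`
}

// sample is an illustrative entry, not taken from the real fixture.
// name_normalized is deliberately left out on both sides.
const sample = `{"pairs": [{
  "a": {"name": "Weihnachtsmarkt Straßburg", "stadt": "Straßburg", "year": 2025},
  "b": {"name": "Weihnachtsmarkt Strassburg", "stadt": "Strassburg", "year": 2025},
  "same": true,
  "note": "spelling drift: ß vs ss"
}]}`

// fillNormalized applies the fallback: an omitted name_normalized
// takes the value of name.
func fillNormalized(r *row) {
	if r.NameNormalized == "" {
		r.NameNormalized = r.Name
	}
}

// parsePairs decodes the fixture JSON and applies the fallback to
// both sides of every pair.
func parsePairs(data []byte) ([]pair, error) {
	var f struct {
		Pairs []pair `json:"pairs"`
	}
	if err := json.Unmarshal(data, &f); err != nil {
		return nil, fmt.Errorf("parse fixture: %w", err)
	}
	for i := range f.Pairs {
		fillNormalized(&f.Pairs[i].A)
		fillNormalized(&f.Pairs[i].B)
	}
	return f.Pairs, nil
}

func main() {
	pairs, err := parsePairs([]byte(sample))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, p := range pairs {
		fmt.Printf("%q vs %q same=%v (%s)\n", p.A.NameNormalized, p.B.NameNormalized, p.Same, p.Note)
	}
}
```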

.gitignore adds .eval-cache.json and eval-report.json — neither
should land in the repo.

Tests cover metrics edge cases (all correct, imbalanced,
no-positive-predictions-no-NaN, calibration bucket assignment,
cache accounting, empty input) and cache behaviour (round-trip,
symmetric lookup, model-scoped invalidation, missing/corrupt file
handling, atomic-write leaves no temp files).
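
The cache properties these tests exercise can be sketched as follows. This is an assumption-laden outline, not the cache.go in this MR: pairKey stands in for SimilarityPairKey plus the model string, the verdict type and the "|" separator are invented for the example, and treating a missing file as an empty cache without an error is a guess at the intended behaviour.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// verdict is a hypothetical cached classifier answer.
type verdict struct {
	Same       bool    `json:"same"`
	Confidence float64 `json:"confidence"`
}

// pairKey orders the two sides so (a,b) and (b,a) share one key, and
// prefixes the model so a model change misses the old entries.
func pairKey(a, b, model string) string {
	if a > b {
		a, b = b, a
	}
	return model + "|" + a + "|" + b
}

// saveAtomic writes to a temp file in the target's directory and then
// renames it over the target, so readers never see a half-written file.
func saveAtomic(path string, entries map[string]verdict) error {
	data, err := json.Marshal(entries)
	if err != nil {
		return fmt.Errorf("encode cache: %w", err)
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".eval-cache-*.tmp")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	// Clean up on failure; after a successful rename the temp name no
	// longer exists and this Remove is a harmless no-op.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err := os.Rename(tmp.Name(), path); err != nil {
		return fmt.Errorf("rename temp file: %w", err)
	}
	return nil
}

// loadCache always returns a usable map. A corrupt file yields an
// empty map together with the parse error.
func loadCache(path string) (map[string]verdict, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return map[string]verdict{}, nil // assumption: missing file is not an error
	}
	if err != nil {
		return map[string]verdict{}, fmt.Errorf("read cache: %w", err)
	}
	entries := map[string]verdict{}
	if err := json.Unmarshal(data, &entries); err != nil {
		return map[string]verdict{}, fmt.Errorf("parse cache: %w", err)
	}
	if entries == nil { // a literal JSON null decodes to a nil map
		entries = map[string]verdict{}
	}
	return entries, nil
}

func main() {
	dir, err := os.MkdirTemp("", "eval-cache-demo")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".eval-cache.json")
	key := pairKey("row-a", "row-b", "model-x")
	if err := saveAtomic(path, map[string]verdict{key: {Same: true, Confidence: 0.9}}); err != nil {
		fmt.Println(err)
		return
	}
	got, err := loadCache(path)
	fmt.Println(got[pairKey("row-b", "row-a", "model-x")], err)
}
```

Creating the temp file in the same directory as the target matters because a rename is only atomic within one filesystem.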

Out of scope for MR 5: enrichment field accuracy (fuzzy text
scoring is its own problem — tracked for a follow-up), CI wiring
(needs a baseline F1 first).
2026-04-24 12:26:18 +02:00


package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Fixture is the parsed shape of similarity.json — a hand-labelled set of
// row pairs where `same` is the ground truth.
type Fixture struct {
	Pairs []LabelledPair `json:"pairs"`
}

// LabelledPair is one ground-truth example. Note is free-text for the
// human maintainer — why this pair was added, what edge case it exercises.
type LabelledPair struct {
	A    PairRow `json:"a"`
	B    PairRow `json:"b"`
	Same bool    `json:"same"`
	Note string  `json:"note,omitempty"`
}

// PairRow mirrors enrich.SimilarityRow but stays JSON-serialisable.
type PairRow struct {
	Name           string `json:"name"`
	Stadt          string `json:"stadt"`
	Year           int    `json:"year"`
	NameNormalized string `json:"name_normalized"`
}

func loadFixture(path string) (*Fixture, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read fixture: %w", err)
	}
	var f Fixture
	if err := json.Unmarshal(data, &f); err != nil {
		return nil, fmt.Errorf("parse fixture: %w", err)
	}
	if len(f.Pairs) == 0 {
		return nil, fmt.Errorf("fixture has no pairs")
	}
	return &f, nil
}