- Add confidence scale (0.95-1.00 / 0.70-0.90 / 0.50-0.70 / 0.00-0.50)
with four annotated few-shot examples to the similarity system prompt
- Add two Ronneburg real-world pairs to similarity.json: descriptive-prefix
swap and low-trigram-overlap rename, both expected same=true
Rename mistral.go → llm_enricher.go and mistral_test.go →
llm_enricher_test.go; update all test function names and stale model
strings (mistral-large-latest → gemini-2.5-flash-lite); drop Ollama
block from .env; mark superseded planning specs; update provider
references in planning docs and CLAUDE.md to Google Gemini.
Replace the Mistral + Ollama AI stack with a single Google Gemini provider
backed by google.golang.org/genai. The API key moves from env/Helm to the DB,
stored AES-256-GCM-encrypted under a key derived from JWT_SECRET via HKDF,
so it can be rotated via the admin UI without a pod restart.
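The encrypt-at-rest scheme above can be sketched stdlib-only. This is a minimal sketch, not the real pkg/crypto/secretbox API: function names and the nonce||ciphertext layout are assumptions, and HKDF-SHA256 (RFC 5869) is inlined as extract plus one expand block, which is enough for a single 32-byte AES-256 key.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
)

// deriveKey inlines HKDF-SHA256 with a zero salt and a single expand
// block: HMAC(prk, info||0x01) yields exactly 32 bytes for AES-256.
func deriveKey(jwtSecret []byte, info string) []byte {
	extract := hmac.New(sha256.New, make([]byte, sha256.Size)) // salt = zeros
	extract.Write(jwtSecret)
	prk := extract.Sum(nil)
	expand := hmac.New(sha256.New, prk)
	expand.Write([]byte(info))
	expand.Write([]byte{0x01})
	return expand.Sum(nil)
}

// sealSecret returns nonce||ciphertext so openSecret can split it back out.
func sealSecret(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func openSecret(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed value too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}
```

GCM authentication means a key rotated on the wrong JWT_SECRET fails loudly on decrypt instead of returning garbage.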
New:
- pkg/crypto/secretbox — AES-256-GCM encrypt/decrypt for secrets at rest
- pkg/ai/gemini — GeminiProvider with grounding, structured output, usage
recording, and hot-reload (Reinitialize swaps client under mutex)
- pkg/ai/usage — UsageRecorder interface + UsageEvent struct
- domain/settings/store — DB-backed settings (model, grounding toggle, key)
- domain/settings/usage — UsageRepo implementing UsageRecorder; ai_usage table
- migrations 000021 (system_settings) + 000022 (ai_usage)
- settings API: GET /ai, POST /ai/key, POST /ai/model, POST /ai/grounding,
GET /ai/usage
- admin UI: 4-card settings page — provider status, model selector, grounding
toggle with quota, usage rollups + recent-calls table
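The hot-reload swap named above can be sketched in miniature. Type and field names here are illustrative, not the real pkg/ai/gemini API: requests read the current client under RLock while Reinitialize installs a replacement under the write lock, so in-flight calls finish on the old client and new calls pick up the rotated key.

```go
package main

import "sync"

// client stands in for the underlying genai client.
type client struct{ apiKey string }

type provider struct {
	mu sync.RWMutex
	c  *client
}

// Reinitialize installs a client built from the freshly rotated key.
func (p *provider) Reinitialize(apiKey string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.c = &client{apiKey: apiKey}
}

// current is what each request path calls; reads never block each other.
func (p *provider) current() *client {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.c
}
```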
Removed:
- pkg/ai/ollama, mistral_provider, ratelimiter (+ tests)
- Helm AI_API_KEY, AI_PROVIDER, AI_MODEL_COMPLEX, AI_AGENT_DISCOVERY,
AI_RATE_LIMIT_RPS env vars
Call sites set Grounded+CallType: research (true/"research"), enrich Pass B
(true/"enrich_b"), similarity (false/"similarity"). Integration test updated
to use a stub ai.Provider instead of a fake Ollama HTTP server.
Replaces the Mistral-only ai.Client with an ai.Provider interface backed by
Ollama and Mistral implementations. Migrates enrichment + similarity callers
to ai.Provider.Chat. Research endpoint returns 501 until commit 2 reinstates
it on the new orchestrator.
- Extract readJSONFile + writeJSONAtomic in cache.go; category cache
reuses them (saveCategoryCache is one line, loadCategoryCache uses
the standard load-or-empty shape).
- Drop dead errMsg param from scoreCategoryResult (always "").
- Wrap writeCategoryReport errors with context for consistency.
- Wrap runSimilarityMode / runCategoryMode's 5 per-mode flags into an
evalConfig struct so params don't drift.
- Promote validModes to a package-level var.
- Remove redundant cache = new...() fallback after load* (both load
helpers already return a non-nil empty cache on error).
- Strip narrating / diff-referencing comments per CLAUDE.md; keep the
one genuine WHY on normalizeCategory (divergence from normalize.Name).
Net -54 lines across 4 files; go build + go vet + tests green.
Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Reports accuracy plus a per-label confusion matrix so mix-ups between
similar categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.
Usage:
-mode similarity — existing MR 5 path, unchanged.
-mode category — new: scrapes quellen URLs, asks the LLM for
{category, opening_hours, description},
scores category only.
Structure
- main.go: split into runSimilarityMode + runCategoryMode. Both
share ai.Client construction and the ctx timeout (bumped to 15min
for category mode since scraping adds I/O). Mode dispatched on
-mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
to the similarity files, not shared because the data shapes differ
enough that generics would add more noise than they save. Cache
key is sha256(markt_name_lower|stadt_lower|year|model); separate
from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
exercising the categories we expect the LLM to produce —
mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
URL the enricher will scrape live (first run only; cache takes
over after).
- normalizeCategory: normalizes casing, folds German umlauts, and strips
  the -märkte plural drift so a correctly-categorised row doesn't get
  scored wrong for cosmetic LLM output variation.
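The three normalisations above can be sketched as follows. The umlaut replacement table and the exact plural rule are assumptions; the commit only names casing, umlauts, and the -märkte drift.

```go
package main

import "strings"

// normalizeCategory folds cosmetic LLM output variation: lowercase,
// German umlauts to ASCII digraphs, and the "-märkte" plural back to
// the singular "-markt" the fixtures are labelled with.
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.NewReplacer("ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss").Replace(s)
	if strings.HasSuffix(s, "maerkte") {
		s = strings.TrimSuffix(s, "maerkte") + "markt"
	}
	return s
}
```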
Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
Threshold gate reads accuracy (not F1) — category is multi-class,
precision/recall don't have a single-label meaning.
Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.
.gitignore adds .cat-eval-cache.json and cat-eval-report.json.
Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.
Ship 2 MR 5. Adds a CLI that measures MistralSimilarityClassifier
against a labelled fixture: precision, recall, F1, accuracy, plus a
confidence calibration table so we can tell whether "90% confident"
verdicts are actually right 90% of the time.
Usage: go run ./backend/cmd/discovery-eval -fixture ... -cache ...
-threshold 0.8 -report eval-report.json.
Structure
- main.go: arg parsing + wiring (ai.Client, classifier, cache,
  metrics). The work happens in realMain(), which returns an exit code
  so deferred cleanup still runs on error paths.
- fixture.go: parses labelled pairs JSON. Fixture authors only need to
fill in name/stadt/year; name_normalized falls back to name when
omitted.
- cache.go: file-backed map keyed by SimilarityPairKey + model string.
Symmetric (a,b) == (b,a). Atomic writes (temp file + rename) so a
crashed run cannot corrupt the cache. Corrupt-file load returns an
empty usable cache and reports the parse error.
- run.go: executes each pair through the classifier, populating the
cache. Individual classify errors are downgraded to "not correct"
and logged — the run always finishes so the operator sees whatever
data is available.
- metrics.go: confusion matrix, P/R/F1/accuracy, per-confidence-
bucket calibration ([0-0.5), [0.5-0.75), [0.75-0.9), [0.9-1.0]).
Prints human summary + surfaces highest-confidence mismatches
first (most actionable for prompt iteration). Optional JSON report.
- Threshold gate: -threshold N exits non-zero when F1<N. Default 0
(gating disabled until we have a baseline F1).
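The bucket assignment follows directly from the ranges listed above; only the label strings are an invention here.

```go
package main

// calibrationBucket maps a classifier confidence to its calibration
// bucket: [0,0.5), [0.5,0.75), [0.75,0.9), [0.9,1.0].
func calibrationBucket(confidence float64) string {
	switch {
	case confidence < 0.5:
		return "[0.00,0.50)"
	case confidence < 0.75:
		return "[0.50,0.75)"
	case confidence < 0.9:
		return "[0.75,0.90)"
	default:
		return "[0.90,1.00]"
	}
}
```

Comparing per-bucket accuracy against the bucket's confidence range is what tells you whether "90% confident" verdicts are right 90% of the time.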
Fixture: seeds 15 hand-crafted DACH-market pairs covering the edge
cases we actually care about — umlaut drift (Straßburg/Strassburg),
year difference on a recurring series, word-reordering, distinct
events at the same venue, historical proper names (Striezelmarkt),
same city with multiple distinct Christmas markets. Operator extends
over time; each pair carries a `note` explaining the case it locks.
.gitignore adds .eval-cache.json and eval-report.json — neither
should land in the repo.
Tests cover metrics edge cases (all correct, imbalanced,
no-positive-predictions-no-NaN, calibration bucket assignment,
cache accounting, empty input) and cache behaviour (round-trip,
symmetric lookup, model-scoped invalidation, missing/corrupt file
handling, atomic-write leaves no temp files).
Out of scope for MR 5: enrichment field accuracy (fuzzy text
scoring is its own problem — tracked for a follow-up), CI wiring
(needs a baseline F1 first).
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
- Service.Crawl derives Konfidenz from merged source count + rank instead of
hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
"mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
so the validator's agent-specific rules don't fire on them and operators
can filter the queue by origin. Added Konfidenz* and AgentStatus*
constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
(festival_alarm one-day events, suendenfrei lines without a "bis" range).
Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
rate limit (previously silently floored to 1/hour). Documented behavior
for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
not safe for production routing.
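The Konfidenz rules in the first bullet reduce to a small decision, sketched here with an assumed signature; the real Service.Crawl also weighs source rank, which this collapses into the curated vs suendenfrei distinction.

```go
package main

// deriveKonfidenz applies the merge-quality rules: agreement across
// sources beats any single source, and a lone prose-regex source is
// the weakest signal.
func deriveKonfidenz(sourceCount int, onlySuendenfrei bool) string {
	switch {
	case sourceCount >= 2:
		return "hoch" // two or more merged sources agree
	case onlySuendenfrei:
		return "niedrig" // single prose-regex source
	default:
		return "mittel" // single curated source
	}
}
```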
- Admin CRUD endpoints for markets with role-based middleware
- Anonymous market submission with Cloudflare Turnstile verification
- SMTP email notifications on new submissions (LogSender fallback)
- Market status workflow (pending/approved/rejected) with admin notes
- Nullable location column for submissions without coordinates
- CLI tool for promoting users to admin role
- Slug generation package extracted from seed
- Rate limiting on submission endpoint (3/hour per IP)
- Mailpit added to docker-compose for local email testing
- Reduced from 312 to 272 markets (removed 40 unverified/unconfirmed entries)
- All 272 markets now have real descriptions sourced from official websites
- Added admission prices (adult/child cents + notes) where available
- Added street addresses and corrected venue names
- Added opening hours for markets where published
- Updated websites to full https:// URLs
- Updated seedMarket struct with new fields: Street, Description,
AdmissionAdultCents, AdmissionChildCents, AdmissionNotes
- Seed INSERT now uses description/admission from JSON instead of
generating template descriptions; falls back to generated desc if empty
- Added jsonString() helper for SQL-safe JSON encoding
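A guess at jsonString()'s shape, based only on "SQL-safe JSON encoding": marshal the value, then double single quotes so the result can be inlined as a SQL string literal. The real seed helper may differ.

```go
package main

import (
	"encoding/json"
	"strings"
)

// jsonString encodes v as JSON and quotes it for a SQL string literal,
// doubling embedded single quotes per SQL escaping rules.
func jsonString(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "''"
	}
	return "'" + strings.ReplaceAll(string(b), "'", "''") + "'"
}
```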
Scrape marktkalendarium.de for 2026 market data and replace the
placeholder seed with real events across DE/AT/CH.
- Embed 311 markets as JSON with name, dates, city, zip, venue,
organizer and website
- Geocode coordinates via Nominatim with caching and rate limiting
- Auto-derive Bundesland from postal code prefix
- Generate descriptions based on event type keywords
- Support DATABASE_URL env var for direct production seeding
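The Bundesland derivation can be sketched as a prefix lookup. German postal prefixes do not split cleanly on state borders, so the table below is a tiny illustrative subset of whatever mapping the seed actually ships, not an authoritative one.

```go
package main

// bundeslandFromZip derives a German state from the leading two digits
// of a postal code; unknown prefixes return "".
func bundeslandFromZip(zip string) string {
	if len(zip) < 2 {
		return ""
	}
	prefixes := map[string]string{
		"01": "Sachsen",
		"10": "Berlin",
		"20": "Hamburg",
		"80": "Bayern",
	}
	return prefixes[zip[:2]]
}
```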
- Disable revive exported/package-comments rules (style, not correctness)
- Use errors.Is instead of == for pgx.ErrNoRows comparisons
- Use errors.As instead of type assertion on validator errors
- Use http.NewRequestWithContext instead of client.Get (noctx)
- Check resp.Body.Close error return (errcheck)
- Run gofmt on files with formatting drift