
discovery-eval

CLI that grades discovery's AI-backed components against labelled fixtures. Two modes:

  • -mode similarity (default) — MistralSimilarityClassifier on pair-labelled fixtures. Reports precision / recall / F1 / accuracy + a confidence calibration table.
  • -mode category — MistralLLMEnricher's category output on row-labelled fixtures. Reports accuracy + a per-label confusion matrix.

File-based cache keeps reruns free. Each mode has its own cache key shape, so switching modes doesn't churn entries.

Run it

Similarity (default)

export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest

go run ./backend/cmd/discovery-eval \
  -mode similarity \
  -fixture backend/cmd/discovery-eval/fixtures/similarity.json \
  -cache   .eval-cache.json \
  -threshold 0.8 \
  -report  eval-report.json

Exit code is 1 when F1 < threshold (0 = gating disabled).
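The gating rule above is simple enough to state in a few lines. This is an illustrative sketch, not the tool's actual code — the function name gateExit is made up here:

```go
package main

import "fmt"

// gateExit sketches the exit-code rule: non-zero when F1 misses the
// threshold, with a threshold of 0 meaning the gate is disabled.
func gateExit(f1, threshold float64) int {
	if threshold == 0 {
		return 0 // gating disabled
	}
	if f1 < threshold {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(gateExit(0.75, 0.8)) // 0.75 fails the 0.8 gate → 1
}
```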

Category

export AI_API_KEY=...
export AI_MODEL_COMPLEX=mistral-large-latest

go run ./backend/cmd/discovery-eval \
  -mode category \
  -fixture backend/cmd/discovery-eval/fixtures/category.json \
  -cache   .cat-eval-cache.json \
  -threshold 0.7 \
  -report  cat-eval-report.json

Category mode scrapes each row's quellen URLs live (first run only; the cache covers subsequent runs) and asks the LLM enricher to produce a category. Comparison is normalised: casing, German umlauts, and the -märkte/-markt plural drift are all treated as equal. Exit code is 1 when accuracy < threshold.
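For concreteness, a normalisation along the lines described above could look like this. A sketch only — the real implementation lives in the enricher eval and the function name here is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeCategory sketches the comparison: lowercase, fold German
// umlauts to their ASCII digraphs, and collapse the -märkte/-markt
// plural drift so both forms compare equal.
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.NewReplacer("ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss").Replace(s)
	// After umlaut folding, "…märkte" has become "…maerkte".
	if strings.HasSuffix(s, "maerkte") {
		s = strings.TrimSuffix(s, "maerkte") + "markt"
	}
	return s
}

func main() {
	fmt.Println(normalizeCategory("Weihnachtsmärkte")) // weihnachtsmarkt
}
```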

Extending the fixture

fixtures/similarity.json is hand-curated. Add pairs that exercise patterns the classifier has gotten wrong, or edge cases the prompt documentation mentions but we haven't tested. Prefer real crawler output (anonymised if needed) over invented pairs — the model's failure modes on real data are the ones that matter.

Keep pairs deterministic (don't include confidence bands or other stochastic signals) and ensure each pair's note field explains the edge case. When you change a pair's labels, purge the cache so the classifier re-answers.

Cache

-cache .eval-cache.json stores verdicts by sha256(name_normalized_a | stadt_a | year_a | name_normalized_b | stadt_b | year_b) + "|" + model.

Delete the file to force a full re-run. Changing the model string invalidates every entry automatically.

Atomic writes (temp file + rename) — a crashed run won't corrupt the cache.

Interpreting the output

  • Precision is about false merges: how often does the classifier say "same" when it isn't? Low precision means auto-merge will fuse distinct markets — hard to recover from.
  • Recall is about missed merges: how often does it say "different" when it should have said "same"? Low recall means operators review duplicate rows manually — annoying but safe.
  • F1 balances both. The default gate threshold isn't set here — pick a number after an initial baseline run.
  • Calibration tells you whether "90% confident" verdicts are actually correct 90% of the time. An under-calibrated model is worse than a less-accurate but well-calibrated one for downstream auto-decisions.
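The three headline metrics above come straight from the confusion counts (tp = correct "same" verdicts, fp = false merges, fn = missed merges). A minimal sketch, with guards for the degenerate zero-count cases:

```go
package main

import "fmt"

// prf1 computes precision, recall, and F1 from confusion counts,
// returning 0 for any metric whose denominator would be zero.
func prf1(tp, fp, fn int) (precision, recall, f1 float64) {
	if tp+fp > 0 {
		precision = float64(tp) / float64(tp+fp) // how many "same" verdicts were right
	}
	if tp+fn > 0 {
		recall = float64(tp) / float64(tp+fn) // how many true merges were found
	}
	if precision+recall > 0 {
		f1 = 2 * precision * recall / (precision + recall) // harmonic mean
	}
	return
}

func main() {
	p, r, f := prf1(8, 2, 2)
	fmt.Printf("%.2f %.2f %.2f\n", p, r, f) // 0.80 0.80 0.80
}
```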

Out of scope for now

  • Enrichment eval (category/opening_hours/description). Scoring fuzzy text outputs needs its own design; tracked for a future MR.
  • CI wiring. Once we have a baseline F1 the harness can run in GitLab CI with a fixed threshold.