gin panics at startup with:
':aid' in new path '/api/v1/admin/discovery/queue/:aid/similar/:bid/classify'
conflicts with existing wildcard ':id' in existing prefix
'/api/v1/admin/discovery/queue/:id'
Gin's trie requires identical parameter names at the same prefix position.
All sibling routes use :id; the tiebreak route was registered with :aid,
crashing the server on every deploy since e0b73ac. Prod has been running
the pre-tiebreak image (52f3e4c0) the whole time because every Helm
upgrade crash-looped and rolled back.
Rename :aid to :id in both the route and the handler's c.Param read.
:bid is in a different slot and stays.
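A minimal sketch of the conflict and the rename, assuming a stock gin
router; handler bodies are illustrative, the route paths are the real
ones from this fix:

```go
package main

import "github.com/gin-gonic/gin"

// Illustrative handlers; only the param names matter here.
func getQueueRow(c *gin.Context) { _ = c.Param("id") }

func classify(c *gin.Context) {
	aid := c.Param("id")  // was c.Param("aid") before the rename
	bid := c.Param("bid") // different path slot, unchanged
	_, _ = aid, bid
}

func main() {
	r := gin.New()
	r.GET("/api/v1/admin/discovery/queue/:id", getQueueRow)
	// Panics at registration: ':aid' conflicts with the sibling ':id'
	// at the same prefix position in gin's routing trie.
	// r.POST("/api/v1/admin/discovery/queue/:aid/similar/:bid/classify", classify)
	r.POST("/api/v1/admin/discovery/queue/:id/similar/:bid/classify", classify)
	_ = r
}
```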
- Extract readJSONFile + writeJSONAtomic in cache.go; category cache
reuses them (saveCategoryCache is one line, loadCategoryCache uses
the standard load-or-empty shape).
- Drop dead errMsg param from scoreCategoryResult (always "").
- Wrap writeCategoryReport errors with context for consistency.
- Wrap runSimilarityMode / runCategoryMode's 5 per-mode flags into an
evalConfig struct so params don't drift.
- Promote validModes to a package-level var.
- Remove redundant cache = new...() fallback after load* (both load
helpers already return a non-nil empty cache on error).
- Strip narrating / diff-referencing comments per CLAUDE.md; keep the
one genuine WHY on normalizeCategory (divergence from normalize.Name).
Net -54 lines across 4 files; go build + go vet + tests green.
Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.
Usage:
-mode similarity — existing MR 5 path, unchanged.
-mode category — new: scrapes quellen URLs, asks LLM for
{category, opening_hours, description},
scores category only.
Structure
- main.go: split into runSimilarityMode + runCategoryMode. Both
share ai.Client construction and the ctx timeout (bumped to 15min
for category mode since scraping adds I/O). Mode dispatched on
-mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
to the similarity files, not shared because the data shapes differ
enough that generics would add more noise than they save. Cache
key is sha256(markt_name_lower|stadt_lower|year|model); separate
from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
exercising the categories we expect the LLM to produce —
mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
URL the enricher will scrape live (first run only; cache takes
over after).
- normalizeCategory: strips casing + German umlauts + the -märkte
plural drift so a correctly-categorised row doesn't get scored
wrong for cosmetic LLM output variation.
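A hedged sketch of that normalisation, assuming the three rules above
(casing, umlauts, plural) are the whole story; the shipped
normalizeCategory may differ in detail:

```go
package eval // illustrative package name

import "strings"

// normalizeCategory folds casing, German umlauts, and the -märkte
// plural so cosmetic LLM output variation doesn't count as a miss.
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.NewReplacer("ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss").Replace(s)
	// Plural drift: "weihnachtsmärkte" should score as "weihnachtsmarkt".
	if strings.HasSuffix(s, "maerkte") {
		s = strings.TrimSuffix(s, "maerkte") + "markt"
	}
	return s
}
```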
Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
Threshold gate reads accuracy (not F1): category is multi-class, so
precision/recall only exist per label, and a single F1 would require
picking an averaging scheme.
Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.
.gitignore adds .cat-eval-cache.json and cat-eval-report.json.
Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.
Ship 2 MR 8. Operator-productivity layer on top of the detail drawer:
j/k to walk rows, Enter to open, a/r to accept-reject the selection,
e/s to jump into the drawer with AI enrich / Similar already visible,
? for a help modal listing everything. Escape closes the drawer (or
the help modal if it's open).
Implementation
- selectedId $state drives a subtle indigo ring on the highlighted
row. Follows drawerId when the drawer opens so Esc → j leaves you
on the same row. Auto-resets to queue[0] if the selected row is no
longer on the current page (pagination / refresh).
- Global <svelte:window onkeydown> listener. isTypingTarget() bails
out when focus is inside an input/textarea/select/contenteditable
so typing in the drawer's edit form doesn't trigger shortcuts.
Cmd/Ctrl/Alt combos also skipped so browser shortcuts stay intact.
- selectRelative() updates selectedId + scrolls the row into view
(block: 'nearest') so keyboard-driven scanning through a long
queue keeps the highlight visible.
- submitRowAction() builds + submits a hidden <form> for a/r so the
SvelteKit action pipeline (invalidations, form result propagation)
runs the same way a button click would.
Decisions baked in
- 'e' (AI enrich) and 's' (Similar) open the drawer rather than
firing the LLM call directly. LLM calls cost money; keeping the
UI explicit avoids hidden side effects from a misclick.
- Persistent '?' button bottom-right for discoverability — operators
shouldn't have to read docs to find the help.
- Modal uses click-outside-to-dismiss + Esc + ✕ button, all three.
No backend changes. Frontend-only.
Ship 2 MR 6. Consolidates every market-specific action that used to
expand into the queue table into a single side drawer. Queue rows
keep Accept/Reject for fast-path review; clicking anywhere else on a
row opens the drawer with the full context.
State via URL param ?drawer=<id>. F5 preserves the open row; links
like /admin/discovery?drawer=<uuid>&sort=konfidenz are shareable and
compose with existing pagination/sort state.
DetailDrawer.svelte (new) sections:
- Header: name, konfidenz, source count, Accept/Reject, close (✕)
- Identity: editable form (name, stadt, bundesland, start/end, website)
- Enrichment: full payload with per-field provenance tags + AI enrich
button; "Noch keine Enrichment-Daten" empty state
- Quellen: URL list (link-out)
- Quellen-Vergleich: per-source contribution diff (reuses
ContributionsPanel) — only rendered when >=2 sources
- Similar: candidates loaded lazily on drawer open; AI? tiebreak
button per candidate shows ✓ same / ✗ diff chips with LLM reason
- Audit: discovered_at, agent_status, hinweis
+page.svelte: removed the three inline <tr> panels (Similar,
Quellen-Vergleich, expanded) and their associated state (expandedId,
similarOpenId, quellenVergleichOpenId, similarLoading, similarEntries,
similarVerdicts, similarClassifying, toggleSimilar, classifySimilar,
toggleQuellenVergleich). Row actions collapsed from 5 buttons
(Accept/Reject/Similar/AI/Quellen-Vergleich) to 2 (Accept/Reject).
The chevron glyph stays as a visual affordance but is inert — the
whole row is clickable. Buttons/forms/links inside the row stop
propagation via a closest()-based guard so fast-path Accept/Reject
don't accidentally open the drawer.
No backend changes; the drawer consumes existing queue data +
existing endpoints (similar, similar/classify, enrich).
Follow-ups: MR 8 adds keyboard shortcuts that naturally compose with
the drawer (j/k navigation, Enter opens, Esc closes).
Ship 2 MR 5. Adds a CLI that measures MistralSimilarityClassifier
against a labelled fixture: precision, recall, F1, accuracy, plus a
confidence calibration table so we can tell whether "90% confident"
verdicts are actually right 90% of the time.
Usage: go run ./backend/cmd/discovery-eval -fixture ... -cache ...
-threshold 0.8 -report eval-report.json.
Structure
- main.go: arg parsing + wiring (ai.Client, classifier, cache,
metrics). The work happens in realMain() which returns an exit code
— keeps defers running on error paths.
- fixture.go: parses labelled pairs JSON. Fixture authors only need to
fill in name/stadt/year; name_normalized falls back to name when
omitted.
- cache.go: file-backed map keyed by SimilarityPairKey + model string.
Symmetric (a,b) == (b,a). Atomic writes (temp file + rename; sketched
after this list) so a crashed run cannot corrupt the cache.
Corrupt-file load returns an empty usable cache and reports the parse
error.
- run.go: executes each pair through the classifier, populating the
cache. Individual classify errors are downgraded to "not correct"
and logged — the run always finishes so the operator sees whatever
data is available.
- metrics.go: confusion matrix, P/R/F1/accuracy, per-confidence-
bucket calibration ([0-0.5), [0.5-0.75), [0.75-0.9), [0.9-1.0]).
Prints human summary + surfaces highest-confidence mismatches
first (most actionable for prompt iteration). Optional JSON report.
- Threshold gate: -threshold N exits non-zero when F1<N. Default 0
(gating disabled until we have a baseline F1).
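The atomic-write and load-or-empty behaviour described for cache.go,
sketched with illustrative names and a generic value type; not the
shipped API:

```go
package cache // illustrative package name

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONAtomic writes to a temp file in the target directory and
// renames it into place, so a crashed run can't leave a torn file.
func writeJSONAtomic(path string, v any) error {
	data, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".eval-cache-*")
	if err != nil {
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic on POSIX filesystems
}

// loadCache returns a usable empty cache for missing or corrupt
// files, reporting (rather than failing on) the parse error.
func loadCache(path string) (map[string]json.RawMessage, error) {
	c := map[string]json.RawMessage{}
	data, err := os.ReadFile(path)
	if err != nil {
		return c, nil // missing file: start empty
	}
	if err := json.Unmarshal(data, &c); err != nil {
		return map[string]json.RawMessage{}, fmt.Errorf("corrupt cache %s: %w", path, err)
	}
	return c, nil
}
```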
Fixture: seeds 15 hand-crafted DACH-market pairs covering the edge
cases we actually care about — umlaut drift (Straßburg/Strassburg),
year difference on a recurring series, word-reordering, distinct
events at the same venue, historical proper names (Striezelmarkt),
same city with multiple distinct Christmas markets. Operator extends
over time; each pair carries a `note` explaining the case it locks.
.gitignore adds .eval-cache.json and eval-report.json — neither
should land in the repo.
Tests cover metrics edge cases (all correct, imbalanced,
no-positive-predictions-no-NaN, calibration bucket assignment,
cache accounting, empty input) and cache behaviour (round-trip,
symmetric lookup, model-scoped invalidation, missing/corrupt file
handling, atomic-write leaves no temp files).
Out of scope for MR 5: enrichment field accuracy (fuzzy text
scoring is its own problem — tracked for a follow-up), CI wiring
(needs a baseline F1 first).
Ship 2 MR 7. Replaces the "drop on duplicate" branch of the crawl
loop with a cross-run auto-merge: when a new crawl brings a source
that a pending queue row doesn't yet carry, the new source's data
merges into the existing row instead of spawning a second entry.
Operator review burden stays bounded to one row per market even as
coverage grows across sources.
Konfidenz upgrades come for free: a row that starts with one source
at konfidenz=mittel flips to hoch the moment a second independent
source confirms the same (name, city, start_date) triple.
Repo changes
- QueueHasPending (bool) replaced by FindPendingMatch returning
*DiscoveredMarket. Same exact-tuple lookup; now callers see the
full match so they can merge.
- MergePendingSources appends new sources/quellen/contributions onto
a pending row using set-union semantics. source_contributions
dedupe by SourceName so repeat crawls don't stack duplicate entries.
Konfidenz and hinweis are overwritten with caller-computed values.
- Idempotent: send the same delta twice, nothing changes the second
time.
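A sketch of the set-union and idempotence semantics just described,
with types pared down to the dedupe keys; the shipped
MergePendingSources works on the full row:

```go
package discovery // sketch; types reduced to the merge's needs

// Contribution is reduced to the dedupe key used below.
type Contribution struct {
	SourceName string
	Fields     map[string]string
}

// unionStrings appends only unseen values, so replaying the same
// delta is a no-op (idempotence).
func unionStrings(existing, add []string) []string {
	seen := make(map[string]bool, len(existing))
	for _, s := range existing {
		seen[s] = true
	}
	for _, s := range add {
		if !seen[s] {
			existing = append(existing, s)
			seen[s] = true
		}
	}
	return existing
}

// unionContribs dedupes by SourceName so repeat crawls don't stack
// duplicate entries for the same source.
func unionContribs(existing, add []Contribution) []Contribution {
	seen := make(map[string]bool, len(existing))
	for _, c := range existing {
		seen[c.SourceName] = true
	}
	for _, c := range add {
		if !seen[c.SourceName] {
			existing = append(existing, c)
			seen[c.SourceName] = true
		}
	}
	return existing
}
```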
Service.Crawl flow
- On match + incoming source already on the row -> DedupedQueue.
Same semantics as before, just more tightly scoped (same source
re-emits an event; previously any match counted as dedup).
- On match + incoming source not yet on the row -> auto-merge path:
compute the source/quellen/contribution delta, call
MergePendingSources, count in summary.AutoMerged.
- The crawlerKonfidenz helper is now a thin wrapper over a shared
konfidenzForSources(sources []string), reused by the merge path.
Source-name constants extracted to un-hardcode the switch cases
and the test references.
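A sketch of the shared helper, following the konfidenz rules stated in
the crawler commit further down this log; the constant's value is an
assumption:

```go
package discovery // sketch; constant value is an assumption

const sourceSuendenfrei = "suendenfrei" // illustrative constant value

// konfidenzForSources derives konfidenz from the merged source set:
// two+ independent sources confirm the tuple, a single prose-regex
// source is least trusted, a single curated source sits in between.
func konfidenzForSources(sources []string) string {
	switch {
	case len(sources) >= 2:
		return "hoch"
	case len(sources) == 1 && sources[0] == sourceSuendenfrei:
		return "niedrig"
	case len(sources) == 1:
		return "mittel"
	default:
		return "niedrig"
	}
}
```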
Summary + UI
- CrawlSummary gains AutoMerged int. Logged alongside the other
counters.
- +page.svelte crawl-result grid gets an "Auto-merged" tile.
Tests
- Same-source redundant pickup -> DedupedQueue=1, no MergePendingSources
call, no insert.
- New-source auto-merge -> AutoMerged=1, MergePendingSources called with
exact delta (addSources=[new only], addQuellen=[new only], addContribs
labelled with new source_name), konfidenz upgraded to hoch.
- Existing TestServiceCrawlDedupQueue renamed to
TestServiceCrawlDedupQueue_SameSourceRedundant, reflecting the
tightened semantics.
No migration — existing text[] and jsonb columns support the union
operations via SQL.
Ship 2 MR 4. Adds per-pair AI-backed classification for operator use
inside the existing Similar panel: an "AI?" button next to each
candidate asks Mistral whether the two queue rows refer to the same
underlying market. Result shown inline as a green "✓ same N%" or
grey "✗ diff N%" chip with the LLM's reason on hover.
No scraping — the classifier works from (name, city, year) alone,
which is enough for the common cases (same venue on two calendars,
typos, cross-year recurrence). The call is short (usually <3s), so the
handler is synchronous with a 15s deadline.
Caching
- Migration 000020 adds similarity_ai_cache keyed on a content hash
over (normalized_name|stadt|year) for both rows, sorted for
symmetry. Survives queue row accept/reject because the hash is
about markt-content, not queue-row lifecycle.
- enrich.SimilarityPairKey computes the key (sketch after this list).
Classify(a,b) and Classify(b,a) hit the same entry. Stadt casing
drift doesn't invalidate.
- Repo methods GetSimilarityCache / SetSimilarityCache + corresponding
mock hooks. DefaultSimilarityCacheTTL=30d.
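A sketch of the symmetric key derivation; strings.ToLower stands in
for the real name normalisation, and the exact separator bytes are
assumptions:

```go
package enrich // sketch; not the exact shipped SimilarityPairKey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// similarityPairKey hashes (normalized_name|stadt|year) per row and
// sorts the two row keys, so Classify(a,b) and Classify(b,a) hit the
// same cache entry and stadt casing drift doesn't invalidate it.
func similarityPairKey(aName, aStadt string, aYear int, bName, bStadt string, bYear int) string {
	ka := fmt.Sprintf("%s|%s|%d", strings.ToLower(aName), strings.ToLower(aStadt), aYear)
	kb := fmt.Sprintf("%s|%s|%d", strings.ToLower(bName), strings.ToLower(bStadt), bYear)
	if kb < ka {
		ka, kb = kb, ka // sorted for (a,b) == (b,a) symmetry
	}
	sum := sha256.Sum256([]byte(ka + "||" + kb))
	return hex.EncodeToString(sum[:])
}
```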
Mistral integration
- enrich.MistralSimilarityClassifier reuses the same aiPass2
interface as the enricher. English system prompt asks for
JSON-only output with {same_market, confidence 0..1, reason}.
Confidence clamped to [0,1] because models occasionally return
1.2 or -0.1. Reason is short German justification.
- NoopSimilarityClassifier returns an error — callers must check
ai.Enabled() before deciding which binding to pass.
Service.ClassifySimilarPair loads both rows, computes pair key,
cache-first, calls classifier on miss, writes cache, returns
verdict. Rejects self-comparison (the pair key collapses to one
row). Handler
POST /admin/discovery/queue/:aid/similar/:bid/classify.
UI: new AI? column inside the Similar panel. Per-candidate pending
state via Set<string>, disabled button while in-flight, inline
verdict chip after response. Tooltip shows the LLM's reason.
Tests: pair-key symmetry + differentiation + casing tolerance;
Mistral classifier happy path, clamping edge cases, error
propagation, bad-JSON handling, Noop rejection. Service tests:
happy path writes cache, cache-hit skips LLM, self-comparison
rejected, classifier errors don't poison the cache.
NewService signature grows by one param (sim enrich.SimilarityClassifier).
All 14 existing callers (routes.go + tests) updated; tests pass nil.
Completes the manual two-pass enrichment flow: the crawl-enrich-all
button (MR 3) fills deterministic fields across the queue; this MR
adds a per-row "AI" button that scrapes the row's quellen URLs and
asks Mistral to fill category, opening_hours, description.
Flow per click:
1. Load row, compute CacheKey(name_normalized, stadt, year).
2. Cache hit -> skip LLM, merge cached payload onto current
crawl-enrich base, persist, return.
3. Miss -> scrape up to 5 quellen URLs via pkg/scrape (goquery
text extraction, 4000-char truncation), concatenate into labeled
blocks, call ai.Client.Pass2 with JSON response format.
4. Parse response into Enrichment{category, opening_hours,
description}, stamp provenance=llm + model + token counts.
5. Cache the raw LLM payload (not the merged one) under the tuple
key with DefaultCacheTTL=30d, so later re-crawls can layer new
crawl-enrich bases on the same cached answer.
6. Merge(crawl, llm) -- crawl fields survive. Persist via
SetEnrichment(status=done). Return merged to the operator.
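Step 6's base-wins merge, sketched with the three LLM-fillable fields;
the shipped Merge covers the full enrichment shape:

```go
package enrich // sketch; Enrichment pared down to the LLM-fillable fields

type Enrichment struct {
	Category     string
	OpeningHours string
	Description  string
}

// merge applies the overlay only where base is empty: crawl-populated
// fields always survive an LLM overlay.
func merge(base, overlay Enrichment) Enrichment {
	pick := func(b, o string) string {
		if b != "" {
			return b
		}
		return o
	}
	return Enrichment{
		Category:     pick(base.Category, overlay.Category),
		OpeningHours: pick(base.OpeningHours, overlay.OpeningHours),
		Description:  pick(base.Description, overlay.Description),
	}
}
```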
ErrNoScrapedContent fails fast when zero URLs return usable text;
LLMs without grounding hallucinate, and a 400-style operator error is
better than inventing details. Individual scrape failures don't halt
the flow as long as at least one source succeeds.
pkg/scrape (new, reusable)
- Client.Fetch: HTTP GET, strip script/style/nav/footer/aside via
goquery, gather body text, collapse whitespace, truncate (sketched
after this list). DefaultTimeout=10s, DefaultMaxChars=4000.
User-Agent configurable.
- Tests cover noise stripping, whitespace collapsing, truncation,
body-less fragments.
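A sketch of the fetch-and-strip pass, assuming the selector set named
above; error handling is abbreviated and the User-Agent plumbing is
omitted:

```go
package scrape // sketch of the Fetch described above

import (
	"context"
	"net/http"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)

const (
	defaultTimeout  = 10 * time.Second
	defaultMaxChars = 4000
)

func fetch(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := (&http.Client{Timeout: defaultTimeout}).Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}
	doc.Find("script, style, nav, footer, aside").Remove()
	// Collapse runs of whitespace, then truncate (byte-wise here; the
	// real client may cut on a rune boundary).
	text := strings.Join(strings.Fields(doc.Text()), " ")
	if len(text) > defaultMaxChars {
		text = text[:defaultMaxChars]
	}
	return text, nil
}
```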
enrich.MistralLLMEnricher
- Takes ai.Client + Scraper (both injectable; tests use stubs).
- Prompt: English system instructions asking for JSON-only output
with category/opening_hours/description in German. User prompt
includes markt identifiers, already-filled fields (so the LLM
doesn't waste tokens re-deriving them), and scraped blocks.
- Tests: happy path, all-scrapes-fail (-> ErrNoScrapedContent),
partial-scrape-success, empty LLM fields yield no provenance,
URL cap at 5.
Service.RunLLMEnrichOne + handler POST /admin/discovery/queue/:id/enrich
(sync, 30s timeout). NewService gains an llm enrich.LLMEnricher param;
routes.go constructs a MistralLLMEnricher when ai.Client is enabled and
falls back to NoopLLMEnricher otherwise.
UI: per-row AI button next to Similar, tracks per-row pending state
via a Set<string>, disables the button while the request is in
flight and shows "AI..." label. Success invalidates the page, the
row's expanded view picks up the new category/opening_hours/
description fields with llm provenance tags. Inline error message on
the row if the enrich action fails.
Replaces the originally-planned async-worker design with operator-
triggered bulk runs (see memory/project_ship2_enrichment.md). Crawl-
enrichment is cheap enough to always run against the whole list but
runs only when the admin clicks — the flow stays predictable and the
crawl itself stays fast.
Endpoints
- POST /admin/discovery/enrichment/crawl-all — 202 + goroutine, mirrors
the crawl pattern. Per-process CAS gate prevents concurrent runs.
- GET /admin/discovery/enrichment/crawl-all-status — polled shape
identical to /crawl-status for UI reuse.
Service RunCrawlEnrichAll iterates enrichment_status='pending' rows,
builds an enrich.Input from each, runs CrawlEnrich (consolidation +
Nominatim geocoding via the shared geocoder), and persists via
SetEnrichment(status=done). Per-row errors count toward Failed and
append to a bounded Errors slice; the pass never halts.
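The never-halting pass, sketched with pared-down types; enrich/persist
stand in for CrawlEnrich + SetEnrichment, and the error bound is an
assumed value:

```go
package discovery // sketch; types reduced to the loop's needs

import "fmt"

type row struct{ ID string }

type summary struct {
	Enriched, Failed int
	Errors           []string
}

const maxErrors = 20 // hypothetical bound on the Errors slice

// runCrawlEnrichAll never halts on a single row: per-row failures
// increment Failed and append (bounded) to Errors, then continue.
func runCrawlEnrichAll(pending []row, enrich, persist func(row) error) summary {
	var s summary
	for _, r := range pending {
		err := enrich(r)
		if err == nil {
			err = persist(r)
		}
		if err != nil {
			s.Failed++
			if len(s.Errors) < maxErrors {
				s.Errors = append(s.Errors, fmt.Sprintf("%s: %v", r.ID, err))
			}
			continue
		}
		s.Enriched++
	}
	return s
}
```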
Enrich package refactor
- Enrichment, Sources, Provenance constants moved from discovery ->
enrich (they are the enrich package's own types; discovery previously
held them for historical reasons).
- CrawlEnrich now takes a narrow enrich.Input / enrich.Contribution so
the enrich package no longer imports the parent discovery package.
This breaks the import cycle that appeared once discovery needed to
call enrich (the MR 2 structure only worked because no caller went
in that direction yet).
- LLMEnricher takes an LLMRequest (primitives) instead of a
DiscoveredMarket. NoopLLMEnricher updated; real Mistral impl lands
in MR 3b.
- CacheKey signature switched from (DiscoveredMarket) to primitive
(nameNormalized, stadt, year).
Service geocoder wiring: discovery.NewService gains a Geocoder param
(routes.go passes the shared Nominatim client; the interface lives in
discovery to avoid another circular edge with enrich).
UI: "Run crawl-enrich" button next to "Run crawl"; identical poll +
summary card pattern. Queue row expand shows enrichment status badge
plus the PLZ/Venue/Organizer/Lat-Lng fields inline with per-field
provenance tag.
Tests: three new service tests (happy path, per-row SetEnrichment
failure, empty-queue no-op). Existing enrich package tests updated
for the primitive input signature. All 13 test NewService call-sites
updated for the new geocoder param.
Lays infrastructure for Ship 2 crawl-time enrichment. Design principles
(see memory/project_ship2_enrichment.md):
- async worker (not inline in crawl) — MR 3 wires it up
- single enrichment jsonb column, not typed columns — shape still in flux
- per-row LLM budget, global soft cap logged
- crawl-enrich runs first; LLM only fills gaps it cannot reach
Migration 000019: adds discovered_markets.enrichment{,_status,_attempts}
and enriched_at; partial index on enrichment_status for the worker's
claim query; enrichment_cache table keyed by sha256(name|city|year).
enrich package:
- crawl.go — pure consolidator over SourceContributions (PLZ, venue,
organizer), first non-empty wins. Optional Geocoder pulls lat/lng via
Nominatim; failures are non-fatal. Everything marked provenance=crawl.
- llm.go — LLMEnricher interface + NoopLLMEnricher. Real Mistral-backed
impl lands in MR 3 along with the worker.
- enrich.go — Merge(base, overlay) with base-wins semantics, enforcing
the crawl-over-llm invariant at the type level: even a confident LLM
pass can't overwrite a crawl-populated field.
- cache.go — CacheKey() stable across re-crawls; DefaultCacheTTL=30d.
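A sketch of a stable key per the sha256(name|city|year) scheme; the
normalisation inside the shipped CacheKey is not shown here:

```go
package enrich // sketch; not the shipped CacheKey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// cacheKey stays stable across re-crawls because it hashes only the
// identity tuple, never crawl-run state.
func cacheKey(nameNormalized, stadt string, year int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%d", nameNormalized, strings.ToLower(stadt), year)))
	return hex.EncodeToString(sum[:])
}
```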
Repository: scan/persist the new columns, GetEnrichmentCache /
SetEnrichmentCache / SetEnrichment. The SetEnrichment UPDATE increments
attempts server-side and stamps enriched_at only for terminal states
(done|failed) — 'skipped' keeps the previous timestamp.
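The UPDATE shape that follows from that description (Postgres; column
names per migration 000019, exact statement illustrative):

```go
// setEnrichmentSQL sketches the server-side attempt counter and the
// terminal-state timestamp rule; 'skipped' leaves enriched_at alone.
const setEnrichmentSQL = `
UPDATE discovered_markets
SET    enrichment          = $2,
       enrichment_status   = $3,
       enrichment_attempts = enrichment_attempts + 1,
       enriched_at = CASE WHEN $3 IN ('done', 'failed') THEN now()
                          ELSE enriched_at END
WHERE  id = $1`
```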
No UI changes and no worker binary yet. Noop LLM enricher in place so
MR 3 can wire the worker without refactoring shape.
Admin queue table gains clickable sort on Markt, Stadt, Datum, Quellen
(count), and Konfidenz. Default on page load is konfidenz desc with
start_datum ASC NULLS LAST as the within-tier tiebreaker — operators
see highest-confidence, soonest-upcoming markets first. URL state
(?sort=&order=) is the single source of truth; F5 preserves, localStorage
is not used.
Backend: ListQueue takes (sortBy, order); repository builds ORDER BY
from a closed whitelist — konfidenz uses a CASE rank (hoch=3, mittel=2,
niedrig=1), quellen_count uses cardinality(quellen). Handler
normalisers reject anything off the whitelist and echo the effective
values in meta.sort / meta.order so the UI can render arrows. Unit
tests lock the emitted SQL per combination and assert raw input cannot
leak into ORDER BY.
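A sketch of the whitelist shape; the plain column names are
assumptions, while the konfidenz CASE rank and cardinality(quellen)
come from the description above:

```go
package discovery // sketch; column names are assumptions

import "fmt"

var sortExprs = map[string]string{
	"markt":         "name", // assumed column
	"stadt":         "stadt",
	"datum":         "start_datum",
	"quellen_count": "cardinality(quellen)",
	"konfidenz":     "CASE konfidenz WHEN 'hoch' THEN 3 WHEN 'mittel' THEN 2 WHEN 'niedrig' THEN 1 ELSE 0 END",
}

// orderBy only ever emits whitelisted expressions; raw input that
// isn't a map key falls back to the default sort.
func orderBy(sortBy, order string) string {
	expr, ok := sortExprs[sortBy]
	dir := "DESC"
	if !ok {
		expr = sortExprs["konfidenz"] // default: konfidenz desc
	}
	if order == "asc" {
		dir = "ASC"
	}
	return fmt.Sprintf("ORDER BY %s %s, start_datum ASC NULLS LAST", expr, dir)
}
```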
MR 6's backend + MR 7's UI had mismatched envelope assumptions: the
backend returned {data, total, limit, offset} with pagination as
sibling fields, but the shared ApiResponse<T> envelope only types the
data field. The UI's load function therefore treated queueRes.data as
a wrapper and read body.data (undefined) as the row list. Result:
empty queue in the UI despite 1384 pending rows in the DB.
Fix: backend moves total/limit/offset into meta (matches PaginationMeta
convention from web/src/lib/api/types.ts). UI casts to read the meta
slot alongside typed data.
Migration 000018 adds sources text[] + source_contributions jsonb
columns to discovered_markets. Crawler's merger now preserves the raw
per-source RawEvents through Merge() so they can be stored alongside
the merged row. Admin UI gains two surfaces: (a) compact "merged from
source1 + source2" chip + amber Datumskonflikt badge when hinweis
flags it, (b) expandable Quellen-Vergleich panel showing a per-field
comparison table with diverging fields highlighted. Forensic visibility
into what each source said vs what the merger picked.
Queue endpoint now returns {data, total, limit, offset}. Admin UI
reads ?page + ?limit from URL, renders prev/next + page-size selector
+ "Showing X-Y of Z" label. Per-row Similar button fetches the MR 5
/queue/:id/similar endpoint via a new SvelteKit proxy route and
renders matches inline with score/name/city/date. Essential for
navigating the 1000+ row queue after MR 5's crawl fixes.
Drops link-check from crawl path (was timing-bound, misleading counter).
Fixes suendenfrei pagination footer-link infinite loop. Adds similarity
helper with Levenshtein-based fuzzy name match + city match + date
proximity, exposed as GET /queue/:id/similar for admin duplicate review.
- Service.Crawl no longer link-verifies Quellen/Website for crawler
events. Those URLs come from real HTML of trusted sources and have
been implicitly verified at parse time. Removing this makes the
insert phase complete in well under a minute even for 1500+ events
and stops misreporting timeout-truncated processing as link failures.
LinkCheckFailed counter retained for JSON shape stability.
- Suendenfrei pagination now stops on len(events) == 0. Previously the
site's footer <h3><a> links kept anchors.Length() > 0 indefinitely,
sending the crawler to page-90 before the outer ctx timeout.
- New similarity helper (SimilarityScore, FindSimilar) and endpoint
GET /api/v1/admin/discovery/queue/:id/similar. Multiplicative score
of normalized-name Levenshtein ratio gating city-match and date-
proximity bonuses. Prevents coincident-city/date events from being
incorrectly flagged as near-duplicates when their names differ.
Lets operators flag near-duplicates in admin review that slip past
exact-match dedup (date typos, city variants, trailing-word swaps).
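A sketch of the multiplicative idea: the Levenshtein name ratio is the
base factor, so city/date bonuses can never rescue a pair whose names
disagree. Bonus weights and the day window are illustrative, not the
shipped constants:

```go
package discovery // sketch; weights and thresholds are illustrative

// levenshtein is a plain two-row DP edit distance (Go 1.21+ for min).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min(cur[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// nameRatio normalises distance to [0,1]; 1.0 = identical names.
func nameRatio(a, b string) float64 {
	n := max(len([]rune(a)), len([]rune(b)))
	if n == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(n)
}

// similarityScore: the name ratio gates the bonuses multiplicatively,
// so coincident-city/date pairs with different names stay low.
func similarityScore(aName, bName string, sameCity bool, daysApart int) float64 {
	score := nameRatio(aName, bName)
	if sameCity {
		score *= 1.2
	}
	if daysApart >= 0 && daysApart <= 3 {
		score *= 1.1
	}
	return min(score, 1.0)
}
```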
Gateway (NGF) ignored our HTTPRoute timeouts field (UnsupportedField),
so the crawl path flips to fire-and-forget and the gateway timeout
drops out of the crawl critical path entirely.
Handler.Crawl now spawns a goroutine with a 5-minute detached context
and returns 202 immediately. Admin UI polls the new
GET /admin/discovery/crawl-status every 3s until running=false, then
renders CrawlSummary. Bypasses the 60s nginx-gateway proxy_read_timeout
entirely — HTTP requests are all sub-second.
Concurrency: atomic.Bool guard (CompareAndSwap) replaces TryLock,
resultMu RWMutex protects the summary/error state, rateMu protects
the rate-limit check. Rate limit semantics unchanged (still applies
to admin-session path, bearer-token bypass via context flag).
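A sketch of the guard trio; field names follow the description above
but the shipped struct may differ:

```go
package discovery // sketch of the concurrency guards described above

import (
	"sync"
	"sync/atomic"
)

type CrawlSummary struct{ /* counters elided */ }

type crawlState struct {
	running  atomic.Bool  // CAS gate against concurrent runs
	resultMu sync.RWMutex // protects summary/error state
	summary  *CrawlSummary
	err      error
}

// tryStart wins for exactly one caller; unlike TryLock, there is no
// lock to release across the goroutine boundary.
func (s *crawlState) tryStart() bool {
	return s.running.CompareAndSwap(false, true)
}

func (s *crawlState) finish(sum *CrawlSummary, err error) {
	s.resultMu.Lock()
	s.summary, s.err = sum, err
	s.resultMu.Unlock()
	s.running.Store(false)
}
```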
Gateway cut the HTTP request at 60s, which cancelled the request ctx
and cascaded into the link-verifier in Service.Crawl's insert pipeline.
Every merged event was then dropped as LinkCheckFailed, resulting in
zero new queue rows despite the crawler parsing ~1500 events.
Fix is three parts: HTTPRoute timeout 300s for /crawl*, insert-phase
context detached from the HTTP request ctx, and a CrawlSummary INFO
log line for diagnosability.
- HTTPRoute: add 300s request+backendRequest timeout rule for
/api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
60s default was cutting the connection mid-crawl.
- Service.Crawl: detach insert pipeline from HTTP request context with
a 3-minute internal timeout (sketch after this list). Previously a
canceled request ctx cascaded into the link-verifier, failing every
URL check and counting every merged event as LinkCheckFailed. Inserts
now complete even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
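The detachment in the second bullet, sketched with
context.WithoutCancel (Go 1.21+); earlier Go would rebuild from
context.Background and copy values by hand:

```go
package discovery // sketch

import (
	"context"
	"time"
)

// insertCtx keeps request-scoped values but drops the request's
// cancellation, so a gateway cut can no longer cascade into the
// link-verifier; the insert phase gets its own 3-minute bound.
func insertCtx(reqCtx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(reqCtx), 3*time.Minute)
}
```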
Deletes the Mistral Pass 0 code path from discovery, flips the k8s
CronJob to the crawler endpoint on a daily schedule, and adds a
Run crawl button to the admin UI that renders CrawlSummary.
Net change: ~-900 lines / +150 lines. Mistral remains wired for Pass 1
and Pass 2 research — only Pass 0 discovery is replaced by the deterministic
5-source Go crawler.
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
- Service.Crawl derives Konfidenz from merged source count + rank instead of
hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
"mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
so the validator's agent-specific rules don't fire on them and operators
can filter the queue by origin. Added Konfidenz* and AgentStatus*
constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
(festival_alarm one-day events, suendenfrei lines without a "bis" range).
Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
rate limit (previously silently floored to 1/hour). Documented behavior
for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
not safe for production routing.
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
Extract normalize helpers into discovery/normalize subpackage to break
the otherwise circular import (discovery/crawler → discovery →
discovery/crawler).
NormalizeName/NormalizeCity in discovery become thin wrappers; merger.go
switches to discovery/normalize directly.
Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).