gin panics at startup with:
':aid' in new path '/api/v1/admin/discovery/queue/:aid/similar/:bid/classify'
conflicts with existing wildcard ':id' in existing prefix
'/api/v1/admin/discovery/queue/:id'
Gin's trie requires identical parameter names at the same prefix position.
All sibling routes use :id; the tiebreak route was registered with :aid,
crashing the server on every deploy since e0b73ac. Prod has been running
the pre-tiebreak image (52f3e4c0) the whole time because every Helm
upgrade crash-looped and rolled back.
Rename :aid to :id in both the route and the handler's c.Param read.
:bid is in a different slot and stays.
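A minimal sketch of the conflict and the rename, assuming a stock gin
router; handler bodies are illustrative, the route paths are the real
ones from this fix:

```go
package main

import "github.com/gin-gonic/gin"

// Illustrative handlers; only the param names matter here.
func getQueueRow(c *gin.Context) { _ = c.Param("id") }

func classify(c *gin.Context) {
	aid := c.Param("id")  // was c.Param("aid") before the rename
	bid := c.Param("bid") // different path slot, unchanged
	_, _ = aid, bid
}

func main() {
	r := gin.New()
	r.GET("/api/v1/admin/discovery/queue/:id", getQueueRow)
	// Panics at registration: ':aid' conflicts with the sibling ':id'
	// at the same prefix position in gin's routing trie.
	// r.POST("/api/v1/admin/discovery/queue/:aid/similar/:bid/classify", classify)
	r.POST("/api/v1/admin/discovery/queue/:id/similar/:bid/classify", classify)
	_ = r
}
```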
- Extract readJSONFile + writeJSONAtomic in cache.go; category cache
reuses them (saveCategoryCache is one line, loadCategoryCache uses
the standard load-or-empty shape).
- Drop dead errMsg param from scoreCategoryResult (always "").
- Wrap writeCategoryReport errors with context for consistency.
- Wrap runSimilarityMode / runCategoryMode's 5 per-mode flags into an
evalConfig struct so params don't drift.
- Promote validModes to a package-level var.
- Remove redundant cache = new...() fallback after load* (both load
helpers already return a non-nil empty cache on error).
- Strip narrating / diff-referencing comments per CLAUDE.md; keep the
one genuine WHY on normalizeCategory (divergence from normalize.Name).
Net -54 lines across 4 files; go build + go vet + tests green.
Ship 2 MR 5b. Extends discovery-eval with a second mode that grades
MistralLLMEnricher's category output against labelled ground truth.
Accuracy + per-label confusion matrix so mix-ups between similar
categories (mittelaltermarkt vs ritterfest, weihnachtsmarkt vs
kirchweih) are visible at a glance.
Usage:
-mode similarity — existing MR 5 path, unchanged.
-mode category — new: scrapes quellen URLs, asks LLM for
{category, opening_hours, description},
scores category only.
Structure
- main.go: split into runSimilarityMode + runCategoryMode. Both
share ai.Client construction and the ctx timeout (bumped to 15min
for category mode since scraping adds I/O). Mode dispatched on
-mode flag; unknown modes exit 2.
- category.go: fixture / cache / run / metrics / report — parallel
to the similarity files, not shared because the data shapes differ
enough that generics would add more noise than they save. Cache
key is sha256(markt_name_lower|stadt_lower|year|model); separate
from SimilarityPairKey since that one takes two rows.
- fixtures/category.json: 10 hand-labelled DACH-market rows
exercising the categories we expect the LLM to produce —
mittelaltermarkt, weihnachtsmarkt, ritterfest, ritterturnier,
handwerkermarkt, schlossfest, kirchweih. Each row lists a quelle
URL the enricher will scrape live (first run only; cache takes
over after).
- normalizeCategory: strips casing + German umlauts + the -märkte
plural drift so a correctly-categorised row doesn't get scored
wrong for cosmetic LLM output variation.
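A hedged sketch of that normalisation, assuming the three rules above
(casing, umlauts, plural) are the whole story; the shipped
normalizeCategory may differ in detail:

```go
package eval // illustrative package name

import "strings"

// normalizeCategory folds casing, German umlauts, and the -märkte
// plural so cosmetic LLM output variation doesn't count as a miss.
func normalizeCategory(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = strings.NewReplacer("ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss").Replace(s)
	// Plural drift: "weihnachtsmärkte" should score as "weihnachtsmarkt".
	if strings.HasSuffix(s, "maerkte") {
		s = strings.TrimSuffix(s, "maerkte") + "markt"
	}
	return s
}
```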
Metrics: Accuracy + per-label confusion matrix. Confusion format is
`want → predictions` with `!` markers on off-diagonal predictions —
readable in a terminal, machine-parseable in the JSON report.
Mismatches are listed at the end with want/got pairs so operators
can spot prompt failures and patch either the prompt or the fixture.
Threshold gate reads accuracy (not F1): category is multi-class, so
precision/recall only exist per label, and a single F1 would require
picking an averaging scheme.
Tests: normalisation edge cases (casing, umlaut, plural, trimming),
scoring drift tolerance, metrics counts + confusion matrix shape,
errors excluded from confusion, cache round-trip + model scoping,
missing/corrupt file handling.
.gitignore adds .cat-eval-cache.json and cat-eval-report.json.
Follow-ups (MR 5c / later): opening_hours and description scoring.
Both need fuzzier matching (regex structure vs LLM judge) which is
its own design problem.
Ship 2 MR 8. Operator-productivity layer on top of the detail drawer:
j/k to walk rows, Enter to open, a/r to accept-reject the selection,
e/s to jump into the drawer with AI enrich / Similar already visible,
? for a help modal listing everything. Escape closes the drawer (or
the help modal if it's open).
Implementation
- selectedId $state drives a subtle indigo ring on the highlighted
row. Follows drawerId when the drawer opens so Esc → j leaves you
on the same row. Auto-resets to queue[0] if the selected row is no
longer on the current page (pagination / refresh).
- Global <svelte:window onkeydown> listener. isTypingTarget() bails
out when focus is inside an input/textarea/select/contenteditable
so typing in the drawer's edit form doesn't trigger shortcuts.
Cmd/Ctrl/Alt combos also skipped so browser shortcuts stay intact.
- selectRelative() updates selectedId + scrolls the row into view
(block: 'nearest') so keyboard-driven scanning through a long
queue keeps the highlight visible.
- submitRowAction() builds + submits a hidden <form> for a/r so the
SvelteKit action pipeline (invalidations, form result propagation)
runs the same way a button click would.
Decisions baked in
- 'e' (AI enrich) and 's' (Similar) open the drawer rather than
firing the LLM call directly. LLM calls cost money; keeping the
UI explicit avoids hidden side effects from a misclick.
- Persistent '?' button bottom-right for discoverability — operators
shouldn't have to read docs to find the help.
- Modal uses click-outside-to-dismiss + Esc + ✕ button, all three.
No backend changes. Frontend-only.
Ship 2 MR 6. Consolidates every market-specific action that used to
expand into the queue table into a single side drawer. Queue rows
keep Accept/Reject for fast-path review; clicking anywhere else on a
row opens the drawer with the full context.
State via URL param ?drawer=<id>. F5 preserves the open row; links
like /admin/discovery?drawer=<uuid>&sort=konfidenz are shareable and
compose with existing pagination/sort state.
DetailDrawer.svelte (new) sections:
- Header: name, konfidenz, source count, Accept/Reject, close (✕)
- Identity: editable form (name, stadt, bundesland, start/end, website)
- Enrichment: full payload with per-field provenance tags + AI enrich
button; "Noch keine Enrichment-Daten" empty state
- Quellen: URL list (link-out)
- Quellen-Vergleich: per-source contribution diff (reuses
ContributionsPanel) — only rendered when >=2 sources
- Similar: candidates loaded lazily on drawer open; AI? tiebreak
button per candidate shows ✓ same / ✗ diff chips with LLM reason
- Audit: discovered_at, agent_status, hinweis
+page.svelte: removed the three inline <tr> panels (Similar,
Quellen-Vergleich, expanded) and their associated state (expandedId,
similarOpenId, quellenVergleichOpenId, similarLoading, similarEntries,
similarVerdicts, similarClassifying, toggleSimilar, classifySimilar,
toggleQuellenVergleich). Row actions collapsed from 5 buttons
(Accept/Reject/Similar/AI/Quellen-Vergleich) to 2 (Accept/Reject).
The chevron glyph stays as a visual affordance but is inert — the
whole row is clickable. Buttons/forms/links inside the row stop
propagation via a closest()-based guard so fast-path Accept/Reject
don't accidentally open the drawer.
No backend changes; the drawer consumes existing queue data +
existing endpoints (similar, similar/classify, enrich).
Follow-ups: MR 8 adds keyboard shortcuts that naturally compose with
the drawer (j/k navigation, Enter opens, Esc closes).
Ship 2 MR 5. Adds a CLI that measures MistralSimilarityClassifier
against a labelled fixture: precision, recall, F1, accuracy, plus a
confidence calibration table so we can tell whether "90% confident"
verdicts are actually right 90% of the time.
Usage: go run ./backend/cmd/discovery-eval -fixture ... -cache ...
-threshold 0.8 -report eval-report.json.
Structure
- main.go: arg parsing + wiring (ai.Client, classifier, cache,
metrics). The work happens in realMain() which returns an exit code
— keeps defers running on error paths.
- fixture.go: parses labelled pairs JSON. Fixture authors only need to
fill in name/stadt/year; name_normalized falls back to name when
omitted.
- cache.go: file-backed map keyed by SimilarityPairKey + model string.
Symmetric (a,b) == (b,a). Atomic writes (temp file + rename; sketched
after this list) so a crashed run cannot corrupt the cache.
Corrupt-file load returns an empty usable cache and reports the parse
error.
- run.go: executes each pair through the classifier, populating the
cache. Individual classify errors are downgraded to "not correct"
and logged — the run always finishes so the operator sees whatever
data is available.
- metrics.go: confusion matrix, P/R/F1/accuracy, per-confidence-
bucket calibration ([0-0.5), [0.5-0.75), [0.75-0.9), [0.9-1.0]).
Prints human summary + surfaces highest-confidence mismatches
first (most actionable for prompt iteration). Optional JSON report.
- Threshold gate: -threshold N exits non-zero when F1<N. Default 0
(gating disabled until we have a baseline F1).
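The atomic-write and load-or-empty behaviour described for cache.go,
sketched with illustrative names and a generic value type; not the
shipped API:

```go
package cache // illustrative package name

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONAtomic writes to a temp file in the target directory and
// renames it into place, so a crashed run can't leave a torn file.
func writeJSONAtomic(path string, v any) error {
	data, err := json.MarshalIndent(v, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".eval-cache-*")
	if err != nil {
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		os.Remove(tmp.Name())
		return err
	}
	if err := tmp.Close(); err != nil {
		os.Remove(tmp.Name())
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic on POSIX filesystems
}

// loadCache returns a usable empty cache for missing or corrupt
// files, reporting (rather than failing on) the parse error.
func loadCache(path string) (map[string]json.RawMessage, error) {
	c := map[string]json.RawMessage{}
	data, err := os.ReadFile(path)
	if err != nil {
		return c, nil // missing file: start empty
	}
	if err := json.Unmarshal(data, &c); err != nil {
		return map[string]json.RawMessage{}, fmt.Errorf("corrupt cache %s: %w", path, err)
	}
	return c, nil
}
```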
Fixture: seeds 15 hand-crafted DACH-market pairs covering the edge
cases we actually care about — umlaut drift (Straßburg/Strassburg),
year difference on a recurring series, word-reordering, distinct
events at the same venue, historical proper names (Striezelmarkt),
same city with multiple distinct Christmas markets. Operator extends
over time; each pair carries a `note` explaining the case it locks.
.gitignore adds .eval-cache.json and eval-report.json — neither
should land in the repo.
Tests cover metrics edge cases (all correct, imbalanced,
no-positive-predictions-no-NaN, calibration bucket assignment,
cache accounting, empty input) and cache behaviour (round-trip,
symmetric lookup, model-scoped invalidation, missing/corrupt file
handling, atomic-write leaves no temp files).
Out of scope for MR 5: enrichment field accuracy (fuzzy text
scoring is its own problem — tracked for a follow-up), CI wiring
(needs a baseline F1 first).
Ship 2 MR 7. Replaces the "drop on duplicate" branch of the crawl
loop with a cross-run auto-merge: when a new crawl brings a source
that a pending queue row doesn't yet carry, the new source's data
merges into the existing row instead of spawning a second entry.
Operator review burden stays bounded to one row per market even as
coverage grows across sources.
Konfidenz upgrades come for free: a row that starts with one source
at konfidenz=mittel flips to hoch the moment a second independent
source confirms the same (name, city, start_date) triple.
Repo changes
- QueueHasPending (bool) replaced by FindPendingMatch returning
*DiscoveredMarket. Same exact-tuple lookup; now callers see the
full match so they can merge.
- MergePendingSources appends new sources/quellen/contributions onto
a pending row using set-union semantics. source_contributions
dedupe by SourceName so repeat crawls don't stack duplicate entries.
Konfidenz and hinweis are overwritten with caller-computed values.
- Idempotent: send the same delta twice, nothing changes the second
time.
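A sketch of the set-union and idempotence semantics just described,
with types pared down to the dedupe keys; the shipped
MergePendingSources works on the full row:

```go
package discovery // sketch; types reduced to the merge's needs

// Contribution is reduced to the dedupe key used below.
type Contribution struct {
	SourceName string
	Fields     map[string]string
}

// unionStrings appends only unseen values, so replaying the same
// delta is a no-op (idempotence).
func unionStrings(existing, add []string) []string {
	seen := make(map[string]bool, len(existing))
	for _, s := range existing {
		seen[s] = true
	}
	for _, s := range add {
		if !seen[s] {
			existing = append(existing, s)
			seen[s] = true
		}
	}
	return existing
}

// unionContribs dedupes by SourceName so repeat crawls don't stack
// duplicate entries for the same source.
func unionContribs(existing, add []Contribution) []Contribution {
	seen := make(map[string]bool, len(existing))
	for _, c := range existing {
		seen[c.SourceName] = true
	}
	for _, c := range add {
		if !seen[c.SourceName] {
			existing = append(existing, c)
			seen[c.SourceName] = true
		}
	}
	return existing
}
```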
Service.Crawl flow
- On match + incoming source already on the row -> DedupedQueue.
Same semantics as before, just more tightly scoped (same source
re-emits an event; previously any match counted as dedup).
- On match + incoming source not yet on the row -> auto-merge path:
compute the source/quellen/contribution delta, call
MergePendingSources, count in summary.AutoMerged.
- The crawlerKonfidenz helper is now a thin wrapper over a shared
konfidenzForSources(sources []string), reused by the merge path.
Source-name constants extracted to un-hardcode the switch cases
and the test references.
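A sketch of the shared helper, following the konfidenz rules stated in
the crawler commit further down this log; the constant's value is an
assumption:

```go
package discovery // sketch; constant value is an assumption

const sourceSuendenfrei = "suendenfrei" // illustrative constant value

// konfidenzForSources derives konfidenz from the merged source set:
// two+ independent sources confirm the tuple, a single prose-regex
// source is least trusted, a single curated source sits in between.
func konfidenzForSources(sources []string) string {
	switch {
	case len(sources) >= 2:
		return "hoch"
	case len(sources) == 1 && sources[0] == sourceSuendenfrei:
		return "niedrig"
	case len(sources) == 1:
		return "mittel"
	default:
		return "niedrig"
	}
}
```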
Summary + UI
- CrawlSummary gains AutoMerged int. Logged alongside the other
counters.
- +page.svelte crawl-result grid gets an "Auto-merged" tile.
Tests
- Same-source redundant pickup -> DedupedQueue=1, no MergePendingSources
call, no insert.
- New-source auto-merge -> AutoMerged=1, MergePendingSources called with
exact delta (addSources=[new only], addQuellen=[new only], addContribs
labelled with new source_name), konfidenz upgraded to hoch.
- Existing TestServiceCrawlDedupQueue renamed to
TestServiceCrawlDedupQueue_SameSourceRedundant, reflecting the
tightened semantics.
No migration — existing text[] and jsonb columns support the union
operations via SQL.
Ship 2 MR 4. Adds per-pair AI-backed classification for operator use
inside the existing Similar panel: an "AI?" button next to each
candidate asks Mistral whether the two queue rows refer to the same
underlying market. Result shown inline as a green "✓ same N%" or
grey "✗ diff N%" chip with the LLM's reason on hover.
No scraping — the classifier works from (name, city, year) alone,
which is enough for the common cases (same venue on two calendars,
typos, cross-year recurrence). The call is short (usually <3s), so the
handler is synchronous with a 15s deadline.
Caching
- Migration 000020 adds similarity_ai_cache keyed on a content hash
over (normalized_name|stadt|year) for both rows, sorted for
symmetry. Survives queue row accept/reject because the hash is
about markt-content, not queue-row lifecycle.
- enrich.SimilarityPairKey computes the key (sketch after this list).
Classify(a,b) and Classify(b,a) hit the same entry. Stadt casing
drift doesn't invalidate.
- Repo methods GetSimilarityCache / SetSimilarityCache + corresponding
mock hooks. DefaultSimilarityCacheTTL=30d.
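A sketch of the symmetric key derivation; strings.ToLower stands in
for the real name normalisation, and the exact separator bytes are
assumptions:

```go
package enrich // sketch; not the exact shipped SimilarityPairKey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// similarityPairKey hashes (normalized_name|stadt|year) per row and
// sorts the two row keys, so Classify(a,b) and Classify(b,a) hit the
// same cache entry and stadt casing drift doesn't invalidate it.
func similarityPairKey(aName, aStadt string, aYear int, bName, bStadt string, bYear int) string {
	ka := fmt.Sprintf("%s|%s|%d", strings.ToLower(aName), strings.ToLower(aStadt), aYear)
	kb := fmt.Sprintf("%s|%s|%d", strings.ToLower(bName), strings.ToLower(bStadt), bYear)
	if kb < ka {
		ka, kb = kb, ka // sorted for (a,b) == (b,a) symmetry
	}
	sum := sha256.Sum256([]byte(ka + "||" + kb))
	return hex.EncodeToString(sum[:])
}
```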
Mistral integration
- enrich.MistralSimilarityClassifier reuses the same aiPass2
interface as the enricher. English system prompt asks for
JSON-only output with {same_market, confidence 0..1, reason}.
Confidence clamped to [0,1] because models occasionally return
1.2 or -0.1. Reason is short German justification.
- NoopSimilarityClassifier returns an error — callers must check
ai.Enabled() before deciding which binding to pass.
Service.ClassifySimilarPair loads both rows, computes pair key,
cache-first, calls classifier on miss, writes cache, returns
verdict. Rejects self-comparison (the pair key collapses to one
row). Handler
POST /admin/discovery/queue/:aid/similar/:bid/classify.
UI: new AI? column inside the Similar panel. Per-candidate pending
state via Set<string>, disabled button while in-flight, inline
verdict chip after response. Tooltip shows the LLM's reason.
Tests: pair-key symmetry + differentiation + casing tolerance;
Mistral classifier happy path, clamping edge cases, error
propagation, bad-JSON handling, Noop rejection. Service tests:
happy path writes cache, cache-hit skips LLM, self-comparison
rejected, classifier errors don't poison the cache.
NewService signature grows by one param (sim enrich.SimilarityClassifier).
All 14 existing callers (routes.go + tests) updated; tests pass nil.
Completes the manual two-pass enrichment flow: the crawl-enrich-all
button (MR 3) fills deterministic fields across the queue; this MR
adds a per-row "AI" button that scrapes the row's quellen URLs and
asks Mistral to fill category, opening_hours, description.
Flow per click:
1. Load row, compute CacheKey(name_normalized, stadt, year).
2. Cache hit -> skip LLM, merge cached payload onto current
crawl-enrich base, persist, return.
3. Miss -> scrape up to 5 quellen URLs via pkg/scrape (goquery
text extraction, 4000-char truncation), concatenate into labeled
blocks, call ai.Client.Pass2 with JSON response format.
4. Parse response into Enrichment{category, opening_hours,
description}, stamp provenance=llm + model + token counts.
5. Cache the raw LLM payload (not the merged one) under the tuple
key with DefaultCacheTTL=30d, so later re-crawls can layer new
crawl-enrich bases on the same cached answer.
6. Merge(crawl, llm) -- crawl fields survive. Persist via
SetEnrichment(status=done). Return merged to the operator.
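Step 6's base-wins merge, sketched with the three LLM-fillable fields;
the shipped Merge covers the full enrichment shape:

```go
package enrich // sketch; Enrichment pared down to the LLM-fillable fields

type Enrichment struct {
	Category     string
	OpeningHours string
	Description  string
}

// merge applies the overlay only where base is empty: crawl-populated
// fields always survive an LLM overlay.
func merge(base, overlay Enrichment) Enrichment {
	pick := func(b, o string) string {
		if b != "" {
			return b
		}
		return o
	}
	return Enrichment{
		Category:     pick(base.Category, overlay.Category),
		OpeningHours: pick(base.OpeningHours, overlay.OpeningHours),
		Description:  pick(base.Description, overlay.Description),
	}
}
```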
ErrNoScrapedContent fails fast when zero URLs return usable text;
LLMs without grounding hallucinate, and a 400-style operator error is
better than inventing details. Individual scrape failures don't halt
the flow as long as at least one source succeeds.
pkg/scrape (new, reusable)
- Client.Fetch: HTTP GET, strip script/style/nav/footer/aside via
goquery, gather body text, collapse whitespace, truncate (sketched
after this list). DefaultTimeout=10s, DefaultMaxChars=4000.
User-Agent configurable.
- Tests cover noise stripping, whitespace collapsing, truncation,
body-less fragments.
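A sketch of the fetch-and-strip pass, assuming the selector set named
above; error handling is abbreviated and the User-Agent plumbing is
omitted:

```go
package scrape // sketch of the Fetch described above

import (
	"context"
	"net/http"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)

const (
	defaultTimeout  = 10 * time.Second
	defaultMaxChars = 4000
)

func fetch(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := (&http.Client{Timeout: defaultTimeout}).Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "", err
	}
	doc.Find("script, style, nav, footer, aside").Remove()
	// Collapse runs of whitespace, then truncate (byte-wise here; the
	// real client may cut on a rune boundary).
	text := strings.Join(strings.Fields(doc.Text()), " ")
	if len(text) > defaultMaxChars {
		text = text[:defaultMaxChars]
	}
	return text, nil
}
```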
enrich.MistralLLMEnricher
- Takes ai.Client + Scraper (both injectable; tests use stubs).
- Prompt: English system instructions asking for JSON-only output
with category/opening_hours/description in German. User prompt
includes markt identifiers, already-filled fields (so the LLM
doesn't waste tokens re-deriving them), and scraped blocks.
- Tests: happy path, all-scrapes-fail (-> ErrNoScrapedContent),
partial-scrape-success, empty LLM fields yield no provenance,
URL cap at 5.
Service.RunLLMEnrichOne + handler POST /admin/discovery/queue/:id/enrich
(sync, 30s timeout). NewService gains an llm enrich.LLMEnricher param;
routes.go constructs a MistralLLMEnricher when ai.Client is enabled and
falls back to NoopLLMEnricher otherwise.
UI: per-row AI button next to Similar, tracks per-row pending state
via a Set<string>, disables the button while the request is in
flight and shows "AI..." label. Success invalidates the page, the
row's expanded view picks up the new category/opening_hours/
description fields with llm provenance tags. Inline error message on
the row if the enrich action fails.
Replaces the originally-planned async-worker design with operator-
triggered bulk runs (see memory/project_ship2_enrichment.md). Crawl-
enrichment is cheap enough to always run against the whole list but
runs only when the admin clicks — the flow stays predictable and the
crawl itself stays fast.
Endpoints
- POST /admin/discovery/enrichment/crawl-all — 202 + goroutine, mirrors
the crawl pattern. Per-process CAS gate prevents concurrent runs.
- GET /admin/discovery/enrichment/crawl-all-status — polled shape
identical to /crawl-status for UI reuse.
Service RunCrawlEnrichAll iterates enrichment_status='pending' rows,
builds an enrich.Input from each, runs CrawlEnrich (consolidation +
Nominatim geocoding via the shared geocoder), and persists via
SetEnrichment(status=done). Per-row errors count toward Failed and
append to a bounded Errors slice; the pass never halts.
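The never-halting pass, sketched with pared-down types; enrich/persist
stand in for CrawlEnrich + SetEnrichment, and the error bound is an
assumed value:

```go
package discovery // sketch; types reduced to the loop's needs

import "fmt"

type row struct{ ID string }

type summary struct {
	Enriched, Failed int
	Errors           []string
}

const maxErrors = 20 // hypothetical bound on the Errors slice

// runCrawlEnrichAll never halts on a single row: per-row failures
// increment Failed and append (bounded) to Errors, then continue.
func runCrawlEnrichAll(pending []row, enrich, persist func(row) error) summary {
	var s summary
	for _, r := range pending {
		err := enrich(r)
		if err == nil {
			err = persist(r)
		}
		if err != nil {
			s.Failed++
			if len(s.Errors) < maxErrors {
				s.Errors = append(s.Errors, fmt.Sprintf("%s: %v", r.ID, err))
			}
			continue
		}
		s.Enriched++
	}
	return s
}
```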
Enrich package refactor
- Enrichment, Sources, Provenance constants moved from discovery ->
enrich (they are the enrich package's own types; discovery previously
held them for historical reasons).
- CrawlEnrich now takes a narrow enrich.Input / enrich.Contribution so
the enrich package no longer imports the parent discovery package.
This breaks the import cycle that appeared once discovery needed to
call enrich (the MR 2 structure only worked because no caller went
in that direction yet).
- LLMEnricher takes an LLMRequest (primitives) instead of a
DiscoveredMarket. NoopLLMEnricher updated; real Mistral impl lands
in MR 3b.
- CacheKey signature switched from (DiscoveredMarket) to primitive
(nameNormalized, stadt, year).
Service geocoder wiring: discovery.NewService gains a Geocoder param
(routes.go passes the shared Nominatim client; the interface lives in
discovery to avoid another circular edge with enrich).
UI: "Run crawl-enrich" button next to "Run crawl"; identical poll +
summary card pattern. Queue row expand shows enrichment status badge
plus the PLZ/Venue/Organizer/Lat-Lng fields inline with per-field
provenance tag.
Tests: three new service tests (happy path, per-row SetEnrichment
failure, empty-queue no-op). Existing enrich package tests updated
for the primitive input signature. All 13 test NewService call-sites
updated for the new geocoder param.
Lays infrastructure for Ship 2 crawl-time enrichment. Design principles
(see memory/project_ship2_enrichment.md):
- async worker (not inline in crawl) — MR 3 wires it up
- single enrichment jsonb column, not typed columns — shape still in flux
- per-row LLM budget, global soft cap logged
- crawl-enrich runs first; LLM only fills gaps it cannot reach
Migration 000019: adds discovered_markets.enrichment{,_status,_attempts}
and enriched_at; partial index on enrichment_status for the worker's
claim query; enrichment_cache table keyed by sha256(name|city|year).
enrich package:
- crawl.go — pure consolidator over SourceContributions (PLZ, venue,
organizer), first non-empty wins. Optional Geocoder pulls lat/lng via
Nominatim; failures are non-fatal. Everything marked provenance=crawl.
- llm.go — LLMEnricher interface + NoopLLMEnricher. Real Mistral-backed
impl lands in MR 3 along with the worker.
- enrich.go — Merge(base, overlay) with base-wins semantics, enforcing
the crawl-over-llm invariant at the type level: even a confident LLM
pass can't overwrite a crawl-populated field.
- cache.go — CacheKey() stable across re-crawls; DefaultCacheTTL=30d.
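A sketch of a stable key per the sha256(name|city|year) scheme; the
normalisation inside the shipped CacheKey is not shown here:

```go
package enrich // sketch; not the shipped CacheKey

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// cacheKey stays stable across re-crawls because it hashes only the
// identity tuple, never crawl-run state.
func cacheKey(nameNormalized, stadt string, year int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%s|%d", nameNormalized, strings.ToLower(stadt), year)))
	return hex.EncodeToString(sum[:])
}
```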
Repository: scan/persist the new columns, GetEnrichmentCache /
SetEnrichmentCache / SetEnrichment. The SetEnrichment UPDATE increments
attempts server-side and stamps enriched_at only for terminal states
(done|failed) — 'skipped' keeps the previous timestamp.
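The UPDATE shape that follows from that description (Postgres; column
names per migration 000019, exact statement illustrative):

```go
// setEnrichmentSQL sketches the server-side attempt counter and the
// terminal-state timestamp rule; 'skipped' leaves enriched_at alone.
const setEnrichmentSQL = `
UPDATE discovered_markets
SET    enrichment          = $2,
       enrichment_status   = $3,
       enrichment_attempts = enrichment_attempts + 1,
       enriched_at = CASE WHEN $3 IN ('done', 'failed') THEN now()
                          ELSE enriched_at END
WHERE  id = $1`
```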
No UI changes and no worker binary yet. Noop LLM enricher in place so
MR 3 can wire the worker without refactoring shape.
Admin queue table gains clickable sort on Markt, Stadt, Datum, Quellen
(count), and Konfidenz. Default on page load is konfidenz desc with
start_datum ASC NULLS LAST as the within-tier tiebreaker — operators
see highest-confidence, soonest-upcoming markets first. URL state
(?sort=&order=) is the single source of truth; F5 preserves, localStorage
is not used.
Backend: ListQueue takes (sortBy, order); repository builds ORDER BY
from a closed whitelist — konfidenz uses a CASE rank (hoch=3, mittel=2,
niedrig=1), quellen_count uses cardinality(quellen). Handler
normalisers reject anything off the whitelist and echo the effective
values in meta.sort / meta.order so the UI can render arrows. Unit
tests lock the emitted SQL per combination and assert raw input cannot
leak into ORDER BY.
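A sketch of the whitelist shape; the plain column names are
assumptions, while the konfidenz CASE rank and cardinality(quellen)
come from the description above:

```go
package discovery // sketch; column names are assumptions

import "fmt"

var sortExprs = map[string]string{
	"markt":         "name", // assumed column
	"stadt":         "stadt",
	"datum":         "start_datum",
	"quellen_count": "cardinality(quellen)",
	"konfidenz":     "CASE konfidenz WHEN 'hoch' THEN 3 WHEN 'mittel' THEN 2 WHEN 'niedrig' THEN 1 ELSE 0 END",
}

// orderBy only ever emits whitelisted expressions; raw input that
// isn't a map key falls back to the default sort.
func orderBy(sortBy, order string) string {
	expr, ok := sortExprs[sortBy]
	dir := "DESC"
	if !ok {
		expr = sortExprs["konfidenz"] // default: konfidenz desc
	}
	if order == "asc" {
		dir = "ASC"
	}
	return fmt.Sprintf("ORDER BY %s %s, start_datum ASC NULLS LAST", expr, dir)
}
```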
MR 6's backend + MR 7's UI had mismatched envelope assumptions: the
backend returned {data, total, limit, offset} with pagination as
sibling fields, but the shared ApiResponse<T> envelope only types the
data field. The UI's load function therefore treated queueRes.data as
a wrapper and read body.data (undefined) as the row list. Result:
empty queue in the UI despite 1384 pending rows in the DB.
Fix: backend moves total/limit/offset into meta (matches PaginationMeta
convention from web/src/lib/api/types.ts). UI casts to read the meta
slot alongside typed data.
Migration 000018 adds sources text[] + source_contributions jsonb
columns to discovered_markets. Crawler's merger now preserves the raw
per-source RawEvents through Merge() so they can be stored alongside
the merged row. Admin UI gains two surfaces: (a) compact "merged from
source1 + source2" chip + amber Datumskonflikt badge when hinweis
flags it, (b) expandable Quellen-Vergleich panel showing a per-field
comparison table with diverging fields highlighted. Forensic visibility
into what each source said vs what the merger picked.
Queue endpoint now returns {data, total, limit, offset}. Admin UI
reads ?page + ?limit from URL, renders prev/next + page-size selector
+ "Showing X-Y of Z" label. Per-row Similar button fetches the MR 5
/queue/:id/similar endpoint via a new SvelteKit proxy route and
renders matches inline with score/name/city/date. Essential for
navigating the 1000+ row queue after MR 5's crawl fixes.
Drops link-check from crawl path (was timing-bound, misleading counter).
Fixes suendenfrei pagination footer-link infinite loop. Adds similarity
helper with Levenshtein-based fuzzy name match + city match + date
proximity, exposed as GET /queue/:id/similar for admin duplicate review.
- Service.Crawl no longer link-verifies Quellen/Website for crawler
events. Those URLs come from real HTML of trusted sources and have
been implicitly verified at parse time. Removing this makes the
insert phase complete in well under a minute even for 1500+ events
and stops misreporting timeout-truncated processing as link failures.
LinkCheckFailed counter retained for JSON shape stability.
- Suendenfrei pagination now stops on len(events) == 0. Previously the
site's footer <h3><a> links kept anchors.Length() > 0 indefinitely,
sending the crawler to page-90 before the outer ctx timeout.
- New similarity helper (SimilarityScore, FindSimilar) and endpoint
GET /api/v1/admin/discovery/queue/:id/similar. Multiplicative score
of normalized-name Levenshtein ratio gating city-match and date-
proximity bonuses. Prevents coincident-city/date events from being
incorrectly flagged as near-duplicates when their names differ.
Lets operators flag near-duplicates in admin review that slip past
exact-match dedup (date typos, city variants, trailing-word swaps).
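A sketch of the multiplicative idea: the Levenshtein name ratio is the
base factor, so city/date bonuses can never rescue a pair whose names
disagree. Bonus weights and the day window are illustrative, not the
shipped constants:

```go
package discovery // sketch; weights and thresholds are illustrative

// levenshtein is a plain two-row DP edit distance (Go 1.21+ for min).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = min(cur[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return prev[len(rb)]
}

// nameRatio normalises distance to [0,1]; 1.0 = identical names.
func nameRatio(a, b string) float64 {
	n := max(len([]rune(a)), len([]rune(b)))
	if n == 0 {
		return 1
	}
	return 1 - float64(levenshtein(a, b))/float64(n)
}

// similarityScore: the name ratio gates the bonuses multiplicatively,
// so coincident-city/date pairs with different names stay low.
func similarityScore(aName, bName string, sameCity bool, daysApart int) float64 {
	score := nameRatio(aName, bName)
	if sameCity {
		score *= 1.2
	}
	if daysApart >= 0 && daysApart <= 3 {
		score *= 1.1
	}
	return min(score, 1.0)
}
```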
Gateway (NGF) ignored our HTTPRoute timeouts field (UnsupportedField),
so the crawl path flips to fire-and-forget and the gateway timeout
drops out of the crawl critical path entirely.
Handler.Crawl now spawns a goroutine with a 5-minute detached context
and returns 202 immediately. Admin UI polls the new
GET /admin/discovery/crawl-status every 3s until running=false, then
renders CrawlSummary. Bypasses the 60s nginx-gateway proxy_read_timeout
entirely — HTTP requests are all sub-second.
Concurrency: atomic.Bool guard (CompareAndSwap) replaces TryLock,
resultMu RWMutex protects the summary/error state, rateMu protects
the rate-limit check. Rate limit semantics unchanged (still applies
to admin-session path, bearer-token bypass via context flag).
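A sketch of the guard trio; field names follow the description above
but the shipped struct may differ:

```go
package discovery // sketch of the concurrency guards described above

import (
	"sync"
	"sync/atomic"
)

type CrawlSummary struct{ /* counters elided */ }

type crawlState struct {
	running  atomic.Bool  // CAS gate against concurrent runs
	resultMu sync.RWMutex // protects summary/error state
	summary  *CrawlSummary
	err      error
}

// tryStart wins for exactly one caller; unlike TryLock, there is no
// lock to release across the goroutine boundary.
func (s *crawlState) tryStart() bool {
	return s.running.CompareAndSwap(false, true)
}

func (s *crawlState) finish(sum *CrawlSummary, err error) {
	s.resultMu.Lock()
	s.summary, s.err = sum, err
	s.resultMu.Unlock()
	s.running.Store(false)
}
```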
Gateway cut the HTTP request at 60s, which cancelled the request ctx
and cascaded into the link-verifier in Service.Crawl's insert pipeline.
Every merged event was then dropped as LinkCheckFailed, resulting in
zero new queue rows despite the crawler parsing ~1500 events.
Fix is three parts: HTTPRoute timeout 300s for /crawl*, insert-phase
context detached from the HTTP request ctx, and a CrawlSummary INFO
log line for diagnosability.
- HTTPRoute: add 300s request+backendRequest timeout rule for
/api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
60s default was cutting the connection mid-crawl.
- Service.Crawl: detach insert pipeline from HTTP request context with
a 3-minute internal timeout (sketch after this list). Previously a
canceled request ctx cascaded into the link-verifier, failing every
URL check and counting every merged event as LinkCheckFailed. Inserts
now complete even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
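The detachment in the second bullet, sketched with
context.WithoutCancel (Go 1.21+); earlier Go would rebuild from
context.Background and copy values by hand:

```go
package discovery // sketch

import (
	"context"
	"time"
)

// insertCtx keeps request-scoped values but drops the request's
// cancellation, so a gateway cut can no longer cascade into the
// link-verifier; the insert phase gets its own 3-minute bound.
func insertCtx(reqCtx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(reqCtx), 3*time.Minute)
}
```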
Deletes the Mistral Pass 0 code path from discovery, flips the k8s
CronJob to the crawler endpoint on a daily schedule, and adds a
Run crawl button to the admin UI that renders CrawlSummary.
Net change: ~-900 lines / +150 lines. Mistral remains wired for Pass 1
and Pass 2 research — only Pass 0 discovery is replaced by the deterministic
5-source Go crawler.
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
- Service.Crawl derives Konfidenz from merged source count + rank instead of
hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
"mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
so the validator's agent-specific rules don't fire on them and operators
can filter the queue by origin. Added Konfidenz* and AgentStatus*
constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
(festival_alarm one-day events, suendenfrei lines without a "bis" range).
Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
rate limit (previously silently floored to 1/hour). Documented behavior
for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
not safe for production routing.
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
Extract normalize helpers into discovery/normalize subpackage to break
the otherwise circular import (discovery/crawler → discovery →
discovery/crawler).
NormalizeName/NormalizeCity in discovery become thin wrappers; merger.go
switches to discovery/normalize directly.
Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).