Commit Graph

218 Commits

073e55c7fc feat(discovery): drop link-check from crawl path, fix suendenfrei pagination, add similarity helper
- Service.Crawl no longer link-verifies Quellen/Website for crawler
  events. Those URLs come from real HTML of trusted sources and have
  been implicitly verified at parse time. Removing this makes the
  insert phase complete in well under a minute even for 1500+ events
  and stops misreporting timeout-limited processing as link failures.
  LinkCheckFailed counter retained for JSON shape stability.

- Suendenfrei pagination now stops on len(events) == 0. Previously the
  site's footer <h3><a> links kept anchors.Length() > 0 indefinitely,
  driving the crawler past page 90 until the outer ctx timed out.

- New similarity helper (SimilarityScore, FindSimilar) and endpoint
  GET /api/v1/admin/discovery/queue/:id/similar. The score multiplies a
  normalized-name Levenshtein ratio with city-match and date-proximity
  bonuses, and the name ratio gates those bonuses, so events that merely
  share a city and date are not flagged as near-duplicates when their
  names differ. Lets admin review surface near-duplicates that slip past
  exact-match dedup (date typos, city variants, trailing-word swaps).
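
A minimal Go sketch of the gating idea; the 0.6 gate and the bonus factors are illustrative assumptions, not the committed SimilarityScore values:

```go
package main

import "fmt"

// levenshtein is a plain dynamic-programming edit distance.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, minInt(cur[j-1]+1, prev[j-1]+cost))
		}
		prev = cur
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// similarityScore: the normalized-name ratio gates the bonuses, so two
// events in the same city on the same date still score low when their
// names differ.
func similarityScore(nameA, nameB string, sameCity bool, daysApart int) float64 {
	maxLen := len([]rune(nameA))
	if l := len([]rune(nameB)); l > maxLen {
		maxLen = l
	}
	if maxLen == 0 {
		return 0
	}
	ratio := 1 - float64(levenshtein(nameA, nameB))/float64(maxLen)
	if ratio < 0.6 { // gate: dissimilar names never earn bonuses
		return ratio
	}
	score := ratio
	if sameCity {
		score *= 1.2 // illustrative bonus factor
	}
	if daysApart >= 0 && daysApart <= 3 {
		score *= 1.1 // illustrative bonus factor
	}
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	fmt.Printf("%.2f\n", similarityScore("mittelaltermarkt esslingen", "mittelaltermarkt essligen", true, 0))
}
```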
2026-04-18 20:05:07 +02:00
cdd43cc45a Merge branch 'feat/crawl-async' — async crawl handler, UI polls status
Gateway (NGF) ignored our HTTPRoute timeouts field (UnsupportedField).
Flipping to fire-and-forget: handler returns 202 immediately, goroutine
runs crawl with detached 5-min context, GET /admin/discovery/crawl-status
returns state, admin UI polls every 3s until running=false.

HTTP requests are now all sub-second; gateway timeout is no longer in
the crawl critical path. Concurrent-run protection via atomic.Bool
(replaces TryLock), rate limit semantics unchanged.
2026-04-18 19:25:37 +02:00
9f286b8029 feat(discovery): async crawl — 202 Accepted, status endpoint, UI polls
Handler.Crawl now spawns a goroutine with a 5-minute detached context
and returns 202 immediately. Admin UI polls the new
GET /admin/discovery/crawl-status every 3s until running=false, then
renders CrawlSummary. Bypasses the 60s nginx-gateway proxy_read_timeout
entirely — HTTP requests are all sub-second.

Concurrency: atomic.Bool guard (CompareAndSwap) replaces TryLock,
resultMu RWMutex protects the summary/error state, rateMu protects
the rate-limit check. Rate limit semantics unchanged (still applies
to admin-session path, bearer-token bypass via context flag).
2026-04-18 19:24:48 +02:00
2ea8a9a6f3 Merge branch 'fix/discovery-crawl-timeout' — crawl survives gateway timeout
Gateway cut the HTTP request at 60s, which cancelled the request ctx
and cascaded into the link-verifier in Service.Crawl's insert pipeline.
Every merged event was then dropped as LinkCheckFailed, resulting in
zero new queue rows despite the crawler parsing ~1500 events.

Fix is three parts: HTTPRoute timeout 300s for /crawl*, insert-phase
context detached from the HTTP request ctx, and a CrawlSummary INFO
log line for diagnosability.
2026-04-18 18:40:30 +02:00
f6e4e5c29f fix(discovery): crawl survives gateway timeout and long-running runs
- HTTPRoute: add 300s request+backendRequest timeout rule for
  /api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
  60s default was cutting the connection mid-crawl.
- Service.Crawl: detach insert pipeline from HTTP request context with
  a 3-minute internal timeout. Previously a canceled request ctx
  cascaded into the link-verifier, failing every URL check and
  counting every merged event as LinkCheckFailed. Inserts now complete
  even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
  backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
2026-04-18 18:39:21 +02:00
2bb5156c0b Merge branch 'feat/discovery-crawler-mr2' — Ship 1 MR 2 cutover
Deletes the Mistral Pass 0 code path from discovery, flips the k8s
CronJob to the crawler endpoint on a daily schedule, and adds a
Run crawl button to the admin UI that renders CrawlSummary.

Net change: ~-900 lines / +150 lines. Mistral remains wired for Pass 1
and Pass 2 research — only Pass 0 discovery is replaced by the deterministic
5-source Go crawler.
2026-04-18 17:49:08 +02:00
ba453a910f chore(helm): daily discovery cron hits /crawl endpoint 2026-04-18 17:46:39 +02:00
3add4fb7ad refactor(discovery): remove Mistral Pass 0 path; /crawl is canonical
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
2026-04-18 17:42:30 +02:00
a729412478 feat(admin): add Run crawl button and CrawlSummary rendering to discovery page 2026-04-18 17:29:05 +02:00
4c7c3dcb37 Merge branch 'feat/discovery-crawler' — DACH discovery crawler MR 1
Replaces Mistral Pass 0 with a deterministic 5-source Go crawler
(marktkalendarium.de, mittelalterkalender.info, festival-alarm.com,
mittelaltermarkt.online Tribe REST, suendenfrei.tv). Pass 1/2 enrichment
paths unchanged. Existing Mistral Tick path preserved alongside; cutover
gated on coverage verification via cmd/discovery-compare.

Spec: docs/superpowers/specs/2026-04-18-dach-discovery-crawler-design.md
Plan: docs/superpowers/plans/2026-04-18-dach-discovery-crawler.md
2026-04-18 17:03:27 +02:00
7c8a8c6419 fix(discovery): review follow-ups — konfidenz signal, end-date default, determinism, rate-limit=0
- Service.Crawl derives Konfidenz from merged source count + rank instead of
  hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
  "mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
  so the validator's agent-specific rules don't fire on them and operators
  can filter the queue by origin. Added Konfidenz* and AgentStatus*
  constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
  (festival_alarm one-day events, suendenfrei lines without a "bis" range).
  Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
  merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
  rate limit (previously silently floored to 1/hour). Documented behavior
  for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
  LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
  cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
  not safe for production routing.
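The Konfidenz mapping in the first bullet can be sketched like this; matching the suendenfrei source by name is a simplification of the commit's rank check:

```go
package main

import "fmt"

// deriveKonfidenz maps merged source count (and, approximately, rank)
// to a confidence level for a queue row.
func deriveKonfidenz(sources []string) string {
	switch {
	case len(sources) >= 2:
		return "hoch" // independently confirmed by two or more sources
	case len(sources) == 1 && sources[0] == "suendenfrei":
		return "niedrig" // prose-regex source, weakest signal
	case len(sources) == 1:
		return "mittel" // single curated source
	default:
		return "niedrig"
	}
}

func main() {
	fmt.Println(deriveKonfidenz([]string{"marktkalendarium", "festival_alarm"}))
}
```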
2026-04-18 16:35:26 +02:00
c5a4bc441c feat(cmd): discovery-compare CLI for pre-cutover coverage verification 2026-04-18 16:08:48 +02:00
0bed4401fe feat(config): crawler user-agent and manual rate-limit knobs 2026-04-18 15:50:21 +02:00
91cd4d89b3 feat(discovery): POST /admin/discovery/crawl with mutex and rate limit
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
2026-04-18 15:22:24 +02:00
b3289bc6e6 feat(discovery): Service.Crawl — orchestrate crawler through existing pipeline
Extract normalize helpers into a discovery/normalize subpackage to break
an otherwise circular import (discovery imports discovery/crawler, whose
merger would import discovery for the normalize helpers).
NormalizeName/NormalizeCity in discovery become thin wrappers; merger.go
switches to discovery/normalize directly.

Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).
2026-04-18 15:03:02 +02:00
20176dd51f refactor(discovery): validator accepts *Bucket, skips bucket checks when nil 2026-04-18 14:43:07 +02:00
310673940e feat(discovery): migration 000017 — nullable bucket_id; model uses *uuid.UUID 2026-04-18 14:30:54 +02:00
507052e375 feat(discovery/crawler): source config and RunAll orchestrator 2026-04-18 14:09:22 +02:00
c013f6bc54 feat(discovery/crawler): cross-source merger with source-rank tiebreaks 2026-04-18 13:40:21 +02:00
3aed982e1c feat(discovery/crawler): log unparseable suendenfrei entries at INFO 2026-04-18 13:33:51 +02:00
2163621415 feat(discovery/crawler): suendenfrei.tv parser 2026-04-18 13:09:47 +02:00
94aa261c90 refactor(discovery/crawler): hoist land constants; document Tribe date format assumption 2026-04-18 13:04:28 +02:00
1cc7de0bb6 feat(discovery/crawler): mittelaltermarkt.online Tribe REST client 2026-04-18 12:53:32 +02:00
a55bb7e15b docs(discovery/crawler): clarify unused year param in parseDateAttr 2026-04-18 12:48:38 +02:00
93efb90967 feat(discovery/crawler): festival-alarm.com parser 2026-04-18 12:36:01 +02:00
91c058105e feat(discovery/crawler): mittelalterkalender.info parser 2026-04-18 12:24:49 +02:00
e6ec97c09d feat(discovery/crawler): marktkalendarium.de parser 2026-04-18 12:12:13 +02:00
57120beac0 feat(discovery/crawler): polite HTTP fetcher with retry and 429 backoff 2026-04-18 12:02:47 +02:00
31fea6fa3c test(discovery/crawler): add PLZ boundary + range coverage cases 2026-04-18 12:00:42 +02:00
eed76f1e76 docs(discovery/crawler): align PLZ helper comment with implementation 2026-04-18 11:57:34 +02:00
4694804331 feat(discovery/crawler): PLZ-to-land inference helper 2026-04-18 11:54:17 +02:00
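The shape of such a helper, with a hypothetical three-row excerpt of a range table (the real boundaries are finer-grained and not shown in this log):

```go
package main

import "fmt"

// plzRange is one row of a hypothetical PLZ-to-land lookup table.
type plzRange struct {
	lo, hi int
	land   string
}

var plzRanges = []plzRange{
	{1067, 1998, "Sachsen"},
	{20095, 21149, "Hamburg"},
	{80331, 81929, "Bayern"},
}

// landForPLZ returns the Bundesland for a postal code, or false when
// the code falls outside every known range.
func landForPLZ(plz int) (string, bool) {
	for _, r := range plzRanges {
		if plz >= r.lo && plz <= r.hi {
			return r.land, true
		}
	}
	return "", false
}

func main() {
	land, ok := landForPLZ(80331)
	fmt.Println(land, ok)
}
```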
e359d06d13 test(discovery/crawler): capture golden fixtures from five sources 2026-04-18 11:45:53 +02:00
5135f0a3be feat(discovery/crawler): scaffold subpackage with Source interface and RawEvent types 2026-04-18 11:36:07 +02:00
adf417b731 fix(research): 429-aware error handling for Pass 1/2
Pass 1 and Pass 2 now detect Mistral web_search rate limits (shared with
the Pass 0 CronJob) and return a proper HTTP 429 with Retry-After: 60
instead of a generic 500 "AI research failed". Pass 2 is enrichment-only,
so rate-limits there fall through with pass1 results intact.

- pkg/ai: new shared IsRateLimit helper + DefaultRetryAfterSeconds=60.
  discovery/service.go drops its local copy and imports the shared one.
- apierror.TooManyRequests now accepts an optional custom message so the
  response body can include "try again in ~60s".
- market/research.go: respondRateLimited helper sets Retry-After,
  downgrades the log line from ERROR to WARN (rate-limits are expected
  state, not a fault), and returns 429 with a structured rate_limited
  code the admin UI can key off of.
2026-04-18 10:33:13 +02:00
8e8bb8d4c3 Merge branch 'feature/discovery-validator' into 'main'
feat(discovery): validator — catches agent self-contradictions before insert

See merge request vikingowl/marktvogt.de!14
2026-04-18 08:05:20 +00:00
de1a3f6efb feat(discovery): validator — catches agent self-contradictions before insert
Pass 0 agents produce schema-valid but semantically wrong output: markets
claimed in the wrong bundesland, status 'bestaetigt' with a hinweis about
Vorjahresdaten, etc. The schema alone can't catch these. This validator
does, as a blocking gate before InsertDiscovered.

Checks (Pass 0 scope):
- bundesland_mismatch: agent's bundesland must equal bucket.region, with
  a light normalizer for CH 'Kanton X' prefix so Phase B can refine the
  Schweiz seed without a signature break.
- status_hinweis_inconsistent: if agent_status=='bestaetigt' AND hinweis
  contains 'vorjahr' (case-insensitive), the agent contradicted itself.

Errors drop the market (counted as summary.validation_failed); warnings
would get merged into hinweis — no warning-level checks exist yet at
Pass 0 scope, placeholder reserved.

Phase B (research agent) checks will extend this file: oeffnungszeiten
dedup, start_datum window coverage, full quellen liveness for Pass 1.
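The two Pass 0 checks can be sketched as follows; signatures and error plumbing are simplified, not the real validator API:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegion applies the light CH normalizer described above:
// strip a leading "Kanton " so "Kanton Zürich" matches "Zürich".
func normalizeRegion(s string) string {
	return strings.TrimPrefix(strings.TrimSpace(s), "Kanton ")
}

// validatePass0 returns the error codes of failed blocking checks.
func validatePass0(bundesland, bucketRegion, agentStatus, hinweis string) []string {
	var errs []string
	if normalizeRegion(bundesland) != normalizeRegion(bucketRegion) {
		errs = append(errs, "bundesland_mismatch")
	}
	if agentStatus == "bestaetigt" &&
		strings.Contains(strings.ToLower(hinweis), "vorjahr") {
		errs = append(errs, "status_hinweis_inconsistent")
	}
	return errs
}

func main() {
	fmt.Println(validatePass0("Bayern", "Hessen", "bestaetigt", "Termin aus dem Vorjahr"))
}
```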
2026-04-18 10:05:08 +02:00
fda30de158 Merge branch 'feature/discovery-pass0-halbmonat-plus-verify' into 'main'
feature/discovery pass0 halbmonat plus verify

See merge request vikingowl/marktvogt.de!13
2026-04-18 07:52:26 +00:00
cd836564f1 feat(discovery): Pass 0 halbmonat buckets + konfidenz/status + link verification
Pass 0 splits every month into two halves (H1 = days 1-15, H2 = 16-EOM)
so each agent call fits within Mistral's 4096 max_tokens budget. The
response schema picks up richer per-market signals and dead agent URLs
get filtered before they land in the admin queue.

DB:
- 000015: add halbmonat char(2) to discovery_buckets, widen unique key,
  backfill existing rows as H1 + insert H2 siblings (624 → 1248 rows).
- 000016: rename discovered_markets.extraktion → konfidenz with
  best-effort value mapping (verbatim→hoch, abgeleitet→mittel); add
  agent_status column.

Backend:
- model: Bucket gains Halbmonat; Pass0Bucket same. Pass0Market renames
  Extraktion → Konfidenz and adds AgentStatus (JSON tag "status").
  DiscoveredMarket mirrors both fields; queue-lifecycle Status column
  stays distinct from agent-reported AgentStatus.
- repository: all SELECT/INSERT touched to use the new columns; picker
  orders by year_month, halbmonat so H1 runs before H2 in the same
  month.
- agent client: prompt now injects halbmonat and recherche_datum (today)
  so the agent has explicit date context.
- link verification: new LinkChecker does concurrent HEAD (GET fallback
  on 405) with a 5s timeout. FilterURLs runs before InsertDiscovered —
  markets whose quellen all fail are dropped and counted as
  link_check_failed in TickSummary. Failing website URLs are cleared
  but don't block insert.
- Service.linkChecker is a narrow interface so tests inject a noop
  stub instead of hitting the network.

Web:
- DiscoveredMarket type gains konfidenz + agent_status, drops extraktion.
- Queue column renames "Extraktion" → "Konfidenz" with three-level
  coloring (hoch=emerald, mittel=amber, niedrig=red, else neutral).
- A small pill next to markt_name surfaces agent_status when it's not
  "bestaetigt" — red for "abgesagt", amber for "unklar" and
  "vorjahr_unbestaetigt" — so risky entries are obvious before accept.
2026-04-18 09:51:57 +02:00
1af97bda21 Merge branch 'feature/discovery-edit-and-sources' into 'main'
feat(discovery): edit pending entries + surface quellen links

See merge request vikingowl/marktvogt.de!12
2026-04-18 07:33:33 +00:00
bf72095348 feat(discovery): edit pending entries + surface quellen links
Expanding any row in the discovery queue now reveals:
- Quellen as clickable URLs (was just a count)
- Hinweis if the agent emitted one
- Inline edit form for markt_name, stadt, bundesland, start/end date,
  and website — the fields the Pass 0 agent gets wrong most often

Backend:
- PATCH /admin/discovery/queue/:id applies a partial update to pending
  entries via a COALESCE-based SQL update. Only fields that were set
  are written.
- Service recomputes name_normalized when markt_name or stadt change so
  dedup stays consistent after edits.
- Status check ensures only 'pending' entries are mutable.

Web:
- Row state $expandedId holds at most one open drawer at a time.
- Dates round-trip through <input type="date"> using the shared
  dateInputValue helper; form action converts back to RFC3339 for Go.
- Existing Accept/Reject buttons untouched — workflow is edit-then-accept.
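
The partial-update idea can be sketched like this; the SQL shape and the generic helper are illustrative (the log only says "COALESCE-based SQL update"), and the column list is trimmed:

```go
package main

import "fmt"

// Hypothetical shape of the COALESCE-based update: parameters bound
// from unset PATCH fields arrive as NULL and keep the stored value.
const patchQuery = `
UPDATE discovered_markets
SET markt_name = COALESCE($2, markt_name),
    stadt      = COALESCE($3, stadt)
WHERE id = $1 AND status = 'pending'`

// coalesce mirrors COALESCE($n, column) in Go, e.g. for recomputing
// name_normalized: a nil pointer means the field was absent from the
// PATCH body.
func coalesce[T any](patch *T, current T) T {
	if patch != nil {
		return *patch
	}
	return current
}

func main() {
	name := "Mittelaltermarkt Esslingen"
	var stadt *string // not sent in the PATCH body
	fmt.Println(coalesce(&name, "alt"), coalesce(stadt, "Esslingen"))
}
```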
2026-04-18 09:33:14 +02:00
a44005b694 Merge branch 'fix/discovery-rate-limit-and-polish' into 'main'
fix(discovery): defer rate-limited buckets + polish queue table

See merge request vikingowl/marktvogt.de!11
2026-04-18 07:21:24 +00:00
98eae40755 fix(discovery): defer rate-limited buckets + polish queue table
Rate limits (Mistral web_search 429) used to get counted as hard errors,
marking the bucket as queried and bumping the Errors(24h) strip — even
though the right behavior is to wait and try again later.

Backend:
- isRateLimit() matches "rate limit" / "status 429" in the error string.
- On persistent rate-limit after one 10s retry: leave last_queried_at
  unchanged (bucket stays eligible for next tick) and abort the
  remainder of this tick — Mistral's web_search budget is shared, no
  point hammering more buckets in the same batch.
- TickSummary gains rate_limited counter; Errors stays for real failures.

Frontend:
- Dates: RFC3339 → 'DD.MM.YYYY' German format, range rendered as
  'DD.MM.YYYY – DD.MM.YYYY'.
- Queue table: cell horizontal padding, uppercase compact headers,
  scrollable on narrow viewports, dark-mode variants on every color
  (emerald/amber badges, link color, reject button), Region folds
  bundesland||land into a single column (Land was always 'Deutschland'
  for DACH anyway).
2026-04-18 09:21:05 +02:00
e4ef4adad6 Merge branch 'fix/discovery-json-tags' into 'main'
fix(discovery): add json tags to domain types

See merge request vikingowl/marktvogt.de!10
2026-04-18 07:11:55 +00:00
b6ace52ada fix(discovery): add json tags to domain types
Without snake_case json tags, Go serializes fields as PascalCase (ID,
MarktName, etc.) — but the Svelte frontend reads snake_case. Every
row.id on the client was undefined, which made Svelte 5 see identical
'undefined' keys across the {#each queue as row (row.id)} loop and
throw each_key_duplicate.

Adds explicit snake_case tags to Bucket, DiscoveredMarket, and
RejectedDiscovery to match what the TypeScript types already expect.
2026-04-18 09:11:44 +02:00
8f1efe73f2 Merge branch 'fix/discovery-cron-service-port' into 'main'
fix(helm): CronJob curls the Service port, not the container port

See merge request vikingowl/marktvogt.de!9
2026-04-18 07:00:26 +00:00
5a561b3092 fix(helm): CronJob curls the Service port, not the container port
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, which isn't exposed by the Service — every tick
timed out after ~135s with "Could not connect to server".

Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
2026-04-18 09:00:16 +02:00
1252044d60 Merge branch 'fix/discovery-empty-card-dark' into 'main'
fix(web): dark-mode variants for discovery empty-queue card

See merge request vikingowl/marktvogt.de!8
2026-04-18 06:53:43 +00:00
173e7c5013 fix(web): dark-mode variants for discovery empty-queue card 2026-04-18 08:53:25 +02:00
ce75448f1e Merge branch 'fix/discovery-empty-slice-null' into 'main'
fix(discovery): render empty queue as [] not null (500 on empty prod)

See merge request vikingowl/marktvogt.de!7
2026-04-18 06:44:52 +00:00
14e1a36622 fix(discovery): render empty queue as [] not null (500 on empty prod)
Go's nil slice marshals as JSON null, not [], which crashed the Svelte
page's .length access on fresh installs where no discovery tick has
happened yet. Reproduced in production: /admin/discovery → 500 because
data.queue was null and {queue.length} dereferenced it.

Backend: initialize every returning slice in repository.go via
make([]T, 0) so zero rows serialize as [] consistently. Also applies to
PickStaleBuckets, ListSeriesByCity, and Stats.RecentErrors.

Web: coalesce data.queue / data.stats.recent_errors at the top of the
Svelte script with `?? []` so future nil-slice regressions don't take
the whole page down.
2026-04-18 08:44:17 +02:00