Commit Graph

209 Commits

Author SHA1 Message Date
4c7c3dcb37 Merge branch 'feat/discovery-crawler' — DACH discovery crawler MR 1
Replaces Mistral Pass 0 with a deterministic 5-source Go crawler
(marktkalendarium.de, mittelalterkalender.info, festival-alarm.com,
mittelaltermarkt.online Tribe REST, suendenfrei.tv). Pass 1/2 enrichment
paths unchanged. Existing Mistral Tick path preserved alongside; cutover
gated on coverage verification via cmd/discovery-compare.

Spec: docs/superpowers/specs/2026-04-18-dach-discovery-crawler-design.md
Plan: docs/superpowers/plans/2026-04-18-dach-discovery-crawler.md
2026-04-18 17:03:27 +02:00
7c8a8c6419 fix(discovery): review follow-ups — konfidenz signal, end-date default, determinism, rate-limit=0
- Service.Crawl derives Konfidenz from merged source count + rank instead of
  hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
  "mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
  so the validator's agent-specific rules don't fire on them and operators
  can filter the queue by origin. Added Konfidenz* and AgentStatus*
  constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
  (festival_alarm one-day events, suendenfrei lines without a "bis" range).
  Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
  merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
  rate limit (previously silently floored to 1/hour). Documented behavior
  for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
  LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
  cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
  not safe for production routing.
2026-04-18 16:35:26 +02:00
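The Konfidenz rule from 7c8a8c6419 can be sketched as a small pure function. This is a minimal illustration, not the code in Service.Crawl: the helper name `deriveKonfidenz`, the source key `"suendenfrei"`, and the choice to pass sources as plain strings are all assumptions. The commit also mentions source rank, which is reduced here to the single suendenfrei special case.

```go
package main

import "fmt"

// Konfidenz levels named in the commit message. The constant names are
// assumptions; the commit only says such constants were added to model.go.
const (
	KonfidenzHoch    = "hoch"
	KonfidenzMittel  = "mittel"
	KonfidenzNiedrig = "niedrig"
)

// sourceSuendenfrei is a hypothetical key for the prose-regex source.
const sourceSuendenfrei = "suendenfrei"

// deriveKonfidenz is a hypothetical helper. sources lists the crawler
// sources that contributed to one merged event.
func deriveKonfidenz(sources []string) string {
	switch {
	case len(sources) >= 2:
		// Two or more sources agree on the event.
		return KonfidenzHoch
	case len(sources) == 1 && sources[0] != sourceSuendenfrei:
		// A single curated source.
		return KonfidenzMittel
	default:
		// A single suendenfrei hit, or (assumption) no source at all.
		return KonfidenzNiedrig
	}
}

func main() {
	fmt.Println(deriveKonfidenz([]string{"marktkalendarium", "festival_alarm"}))
	fmt.Println(deriveKonfidenz([]string{"marktkalendarium"}))
	fmt.Println(deriveKonfidenz([]string{sourceSuendenfrei}))
}
```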
c5a4bc441c feat(cmd): discovery-compare CLI for pre-cutover coverage verification 2026-04-18 16:08:48 +02:00
0bed4401fe feat(config): crawler user-agent and manual rate-limit knobs 2026-04-18 15:50:21 +02:00
91cd4d89b3 feat(discovery): POST /admin/discovery/crawl with mutex and rate limit
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
2026-04-18 15:22:24 +02:00
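The mutex and rate-limit behavior described in 91cd4d89b3 (together with the `<= 0` rule from 7c8a8c6419) might look roughly like the gate below. Everything here is hypothetical: the type `crawlGate`, its methods, and the error values are invented for illustration. In particular, rejecting a second caller with `TryLock` is an assumption; the commit only says the mutex blocks concurrent runs.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	errBusy        = errors.New("crawl already running")
	errRateLimited = errors.New("manual crawl rate limit reached")
)

// crawlGate is a hypothetical stand-in for the handler's guard state.
type crawlGate struct {
	running sync.Mutex // held while a crawl is in progress
	mu      sync.Mutex // guards recent
	perHour int        // <= 0 disables the manual rate limit
	recent  []time.Time
	now     func() time.Time
}

func newCrawlGate(perHour int) *crawlGate {
	return &crawlGate{perHour: perHour, now: time.Now}
}

// allowManual records a manual run if fewer than perHour happened in the
// last hour. A non-positive perHour always allows the run.
func (g *crawlGate) allowManual() bool {
	if g.perHour <= 0 {
		return true
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	now := g.now()
	cutoff := now.Add(-time.Hour)
	kept := g.recent[:0]
	for _, t := range g.recent {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	g.recent = kept
	if len(g.recent) >= g.perHour {
		return false
	}
	g.recent = append(g.recent, now)
	return true
}

// run executes crawl unless another crawl is active. manual=true models
// the admin-session path; manual=false models the bearer-token path,
// which bypasses the rate limit.
func (g *crawlGate) run(manual bool, crawl func()) error {
	if !g.running.TryLock() {
		return errBusy
	}
	defer g.running.Unlock()
	if manual && !g.allowManual() {
		return errRateLimited
	}
	crawl()
	return nil
}

func main() {
	g := newCrawlGate(1)
	fmt.Println(g.run(true, func() {}))  // first manual run passes
	fmt.Println(g.run(true, func() {}))  // second manual run is limited
	fmt.Println(g.run(false, func() {})) // token path ignores the limit
}
```

Checking the mutex before the rate limit means a rejected concurrent request does not consume a rate-limit slot; whether the real handler orders the checks this way is not stated in the commit.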
b3289bc6e6 feat(discovery): Service.Crawl — orchestrate crawler through existing pipeline
Extract normalize helpers into discovery/normalize subpackage to break
the otherwise circular import (discovery/crawler → discovery → crawler).
NormalizeName/NormalizeCity in discovery become thin wrappers; merger.go
switches to discovery/normalize directly.

Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).
2026-04-18 15:03:02 +02:00
20176dd51f refactor(discovery): validator accepts *Bucket, skips bucket checks when nil 2026-04-18 14:43:07 +02:00
310673940e feat(discovery): migration 000017 — nullable bucket_id; model uses *uuid.UUID 2026-04-18 14:30:54 +02:00
507052e375 feat(discovery/crawler): source config and RunAll orchestrator 2026-04-18 14:09:22 +02:00
c013f6bc54 feat(discovery/crawler): cross-source merger with source-rank tiebreaks 2026-04-18 13:40:21 +02:00
3aed982e1c feat(discovery/crawler): log unparseable suendenfrei entries at INFO 2026-04-18 13:33:51 +02:00
2163621415 feat(discovery/crawler): suendenfrei.tv parser 2026-04-18 13:09:47 +02:00
94aa261c90 refactor(discovery/crawler): hoist land constants; document Tribe date format assumption 2026-04-18 13:04:28 +02:00
1cc7de0bb6 feat(discovery/crawler): mittelaltermarkt.online Tribe REST client 2026-04-18 12:53:32 +02:00
a55bb7e15b docs(discovery/crawler): clarify unused year param in parseDateAttr 2026-04-18 12:48:38 +02:00
93efb90967 feat(discovery/crawler): festival-alarm.com parser 2026-04-18 12:36:01 +02:00
91c058105e feat(discovery/crawler): mittelalterkalender.info parser 2026-04-18 12:24:49 +02:00
e6ec97c09d feat(discovery/crawler): marktkalendarium.de parser 2026-04-18 12:12:13 +02:00
57120beac0 feat(discovery/crawler): polite HTTP fetcher with retry and 429 backoff 2026-04-18 12:02:47 +02:00
31fea6fa3c test(discovery/crawler): add PLZ boundary + range coverage cases 2026-04-18 12:00:42 +02:00
eed76f1e76 docs(discovery/crawler): align PLZ helper comment with implementation 2026-04-18 11:57:34 +02:00
4694804331 feat(discovery/crawler): PLZ-to-land inference helper 2026-04-18 11:54:17 +02:00
e359d06d13 test(discovery/crawler): capture golden fixtures from five sources 2026-04-18 11:45:53 +02:00
5135f0a3be feat(discovery/crawler): scaffold subpackage with Source interface and RawEvent types 2026-04-18 11:36:07 +02:00
adf417b731 fix(research): 429-aware error handling for Pass 1/2
Pass 1 and Pass 2 now detect Mistral web_search rate limits (shared with
the Pass 0 CronJob) and return a proper HTTP 429 with Retry-After: 60
instead of a generic 500 "AI research failed". Pass 2 is enrichment-only,
so rate-limits there fall through with pass1 results intact.

- pkg/ai: new shared IsRateLimit helper + DefaultRetryAfterSeconds=60.
  discovery/service.go drops its local copy and imports the shared one.
- apierror.TooManyRequests now accepts an optional custom message so the
  response body can include "try again in ~60s".
- market/research.go: respondRateLimited helper sets Retry-After,
  downgrades the log line from ERROR to WARN (rate-limits are expected
  state, not a fault), and returns 429 with a structured rate_limited
  code the admin UI can key off of.
2026-04-18 10:33:13 +02:00
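A rough sketch of the 429 response described in adf417b731, using only net/http. The function signature and the JSON body shape are assumptions; the commit confirms only the Retry-After header, the 60-second default, and a structured `rate_limited` code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// DefaultRetryAfterSeconds mirrors the constant named in the commit.
const DefaultRetryAfterSeconds = 60

// respondRateLimited is a hypothetical version of the helper: it sets
// Retry-After and returns 429 with a machine-readable code.
func respondRateLimited(w http.ResponseWriter) {
	w.Header().Set("Retry-After", strconv.Itoa(DefaultRetryAfterSeconds))
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusTooManyRequests)
	// Body shape is an assumption made for this sketch.
	_ = json.NewEncoder(w).Encode(map[string]string{
		"code":    "rate_limited",
		"message": "AI research is rate limited, try again in ~60s",
	})
}

// demoRateLimited exercises the helper against an in-memory recorder.
func demoRateLimited() (status int, retryAfter string) {
	rec := httptest.NewRecorder()
	respondRateLimited(rec)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	status, retryAfter := demoRateLimited()
	fmt.Println(status, retryAfter)
}
```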
8e8bb8d4c3 Merge branch 'feature/discovery-validator' into 'main'
feat(discovery): validator — catches agent self-contradictions before insert

See merge request vikingowl/marktvogt.de!14
2026-04-18 08:05:20 +00:00
de1a3f6efb feat(discovery): validator — catches agent self-contradictions before insert
Pass 0 agents produce schema-valid but semantically wrong output: markets
claimed in the wrong bundesland, status 'bestaetigt' with a hinweis about
Vorjahresdaten, etc. The schema alone can't catch these. This validator
does, as a blocking gate before InsertDiscovered.

Checks (Pass 0 scope):
- bundesland_mismatch: agent's bundesland must equal bucket.region, with
  a light normalizer for CH 'Kanton X' prefix so Phase B can refine the
  Schweiz seed without a signature break.
- status_hinweis_inconsistent: if agent_status=='bestaetigt' AND hinweis
  contains 'vorjahr' (case-insensitive), the agent contradicted itself.

Errors drop the market (counted as summary.validation_failed); warnings
would get merged into hinweis — no warning-level checks exist yet at
Pass 0 scope, placeholder reserved.

Phase B (research agent) checks will extend this file: oeffnungszeiten
dedup, start_datum window coverage, full quellen liveness for Pass 1.
2026-04-18 10:05:08 +02:00
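The two Pass 0 checks from de1a3f6efb can be illustrated with a small validator. The types, field names, and function names below are simplified assumptions, not the repository's model. The nil-bucket branch reflects the later refactor 20176dd51f rather than this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// Bucket and Market are trimmed-down, hypothetical stand-ins.
type Bucket struct{ Region string }

type Market struct {
	Bundesland  string
	AgentStatus string
	Hinweis     string
}

// normalizeRegion strips the Swiss "Kanton " prefix before comparison.
func normalizeRegion(s string) string {
	return strings.TrimPrefix(strings.TrimSpace(s), "Kanton ")
}

// validate returns the error codes that would drop the market.
func validate(m Market, b *Bucket) []string {
	var errs []string
	// Bucket checks are skipped when no bucket context exists.
	if b != nil && normalizeRegion(m.Bundesland) != normalizeRegion(b.Region) {
		errs = append(errs, "bundesland_mismatch")
	}
	// A confirmed status must not carry a hint about last year's data.
	if m.AgentStatus == "bestaetigt" &&
		strings.Contains(strings.ToLower(m.Hinweis), "vorjahr") {
		errs = append(errs, "status_hinweis_inconsistent")
	}
	return errs
}

func main() {
	m := Market{
		Bundesland:  "Kanton Bern",
		AgentStatus: "bestaetigt",
		Hinweis:     "Termin aus dem Vorjahr",
	}
	fmt.Println(validate(m, &Bucket{Region: "Bern"}))
	fmt.Println(validate(Market{Bundesland: "Bayern"}, &Bucket{Region: "Hessen"}))
}
```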
fda30de158 Merge branch 'feature/discovery-pass0-halbmonat-plus-verify' into 'main'
feature/discovery pass0 halbmonat plus verify

See merge request vikingowl/marktvogt.de!13
2026-04-18 07:52:26 +00:00
cd836564f1 feat(discovery): Pass 0 halbmonat buckets + konfidenz/status + link verification
Pass 0 splits every month into two halves (H1 = days 1-15, H2 = 16-EOM)
so each agent call fits within Mistral's 4096 max_tokens budget. The
response schema picks up richer per-market signals and dead agent URLs
get filtered before they land in the admin queue.

DB:
- 000015: add halbmonat char(2) to discovery_buckets, widen unique key,
  backfill existing rows as H1 + insert H2 siblings (624 → 1248 rows).
- 000016: rename discovered_markets.extraktion → konfidenz with
  best-effort value mapping (verbatim→hoch, abgeleitet→mittel); add
  agent_status column.

Backend:
- model: Bucket gains Halbmonat; Pass0Bucket same. Pass0Market renames
  Extraktion → Konfidenz and adds AgentStatus (JSON tag "status").
  DiscoveredMarket mirrors both fields; queue-lifecycle Status column
  stays distinct from agent-reported AgentStatus.
- repository: all SELECT/INSERT touched to use the new columns; picker
  orders by year_month, halbmonat so H1 runs before H2 in the same
  month.
- agent client: prompt now injects halbmonat and recherche_datum (today)
  so the agent has explicit date context.
- link verification: new LinkChecker does concurrent HEAD (GET fallback
  on 405) with a 5s timeout. FilterURLs runs before InsertDiscovered —
  markets whose quellen all fail are dropped and counted as
  link_check_failed in TickSummary. Failing website URLs are cleared
  but don't block insert.
- Service.linkChecker is a narrow interface so tests inject a noop
  stub instead of hitting the network.

Web:
- DiscoveredMarket type gains konfidenz + agent_status, drops extraktion.
- Queue column renames "Extraktion" → "Konfidenz" with three-level
  coloring (hoch=emerald, mittel=amber, niedrig=red, else neutral).
- A small pill next to markt_name surfaces agent_status when it's not
  "bestaetigt" — red for "abgesagt", amber for "unklar" and
  "vorjahr_unbestaetigt" — so risky entries are obvious before accept.
2026-04-18 09:51:57 +02:00
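The link verification step in cd836564f1 (concurrent HEAD, GET fallback on 405, 5s timeout) could be sketched as follows. The type and method bodies are illustrative only. Treating any status below 400 as alive is an assumption, and the demo uses a local httptest server rather than real market URLs.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// LinkChecker is a hypothetical reimplementation for illustration.
type LinkChecker struct{ client *http.Client }

func NewLinkChecker() *LinkChecker {
	return &LinkChecker{client: &http.Client{Timeout: 5 * time.Second}}
}

// alive tries HEAD first and falls back to GET only on 405.
func (c *LinkChecker) alive(ctx context.Context, url string) bool {
	for _, method := range []string{http.MethodHead, http.MethodGet} {
		req, err := http.NewRequestWithContext(ctx, method, url, nil)
		if err != nil {
			return false
		}
		resp, err := c.client.Do(req)
		if err != nil {
			return false
		}
		resp.Body.Close()
		if method == http.MethodHead && resp.StatusCode == http.StatusMethodNotAllowed {
			continue // server rejects HEAD, retry with GET
		}
		return resp.StatusCode < 400 // assumption: 2xx/3xx count as alive
	}
	return false
}

// FilterURLs checks all URLs concurrently and keeps the live ones in
// their original order.
func (c *LinkChecker) FilterURLs(ctx context.Context, urls []string) []string {
	ok := make([]bool, len(urls))
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			ok[i] = c.alive(ctx, u)
		}(i, u)
	}
	wg.Wait()
	live := make([]string, 0, len(urls))
	for i, u := range urls {
		if ok[i] {
			live = append(live, u)
		}
	}
	return live
}

// demoFilter runs the checker against a throwaway local server.
func demoFilter() []string {
	mux := http.NewServeMux()
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {})
	mux.HandleFunc("/no-head", func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodHead {
			w.WriteHeader(http.StatusMethodNotAllowed)
		}
	})
	mux.HandleFunc("/gone", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()
	urls := []string{srv.URL + "/ok", srv.URL + "/no-head", srv.URL + "/gone"}
	return NewLinkChecker().FilterURLs(context.Background(), urls)
}

func main() {
	fmt.Println(len(demoFilter()), "of 3 URLs are live")
}
```

A market whose quellen list comes back empty from such a filter would be the one counted as link_check_failed.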
1af97bda21 Merge branch 'feature/discovery-edit-and-sources' into 'main'
feat(discovery): edit pending entries + surface quellen links

See merge request vikingowl/marktvogt.de!12
2026-04-18 07:33:33 +00:00
bf72095348 feat(discovery): edit pending entries + surface quellen links
Expanding any row in the discovery queue now reveals:
- Quellen as clickable URLs (was just a count)
- Hinweis if the agent emitted one
- Inline edit form for markt_name, stadt, bundesland, start/end date,
  and website — the fields the Pass 0 agent gets wrong most often

Backend:
- PATCH /admin/discovery/queue/:id applies a partial update to pending
  entries via a COALESCE-based SQL update. Only fields that were set
  are written.
- Service recomputes name_normalized when markt_name or stadt change so
  dedup stays consistent after edits.
- Status check ensures only 'pending' entries are mutable.

Web:
- Row state $expandedId holds at most one open drawer at a time.
- Dates round-trip through <input type="date"> using the shared
  dateInputValue helper; form action converts back to RFC3339 for Go.
- Existing Accept/Reject buttons untouched — workflow is edit-then-accept.
2026-04-18 09:33:14 +02:00
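The COALESCE-based partial update in bf72095348 pairs naturally with pointer fields on the Go side: a nil pointer is sent as NULL, and COALESCE then keeps the stored value. The statement and struct below are a reduced sketch. Only `discovered_markets`, `markt_name`, `stadt`, `bundesland`, and the `'pending'` status come from the log; the exact SQL, the `id` column, and the type names are assumptions.

```go
package main

import "fmt"

// updatePendingSQL is a hypothetical, shortened form of the statement.
// The WHERE clause also enforces that only pending entries are mutable.
const updatePendingSQL = `
UPDATE discovered_markets
SET markt_name = COALESCE($2, markt_name),
    stadt      = COALESCE($3, stadt),
    bundesland = COALESCE($4, bundesland)
WHERE id = $1 AND status = 'pending'`

// QueuePatch carries only the fields the client set; nil means "leave
// the column unchanged".
type QueuePatch struct {
	MarktName  *string
	Stadt      *string
	Bundesland *string
}

// patchArgs builds the positional arguments for updatePendingSQL.
// database/sql and pgx both encode a nil *string as SQL NULL.
func patchArgs(id string, p QueuePatch) []any {
	return []any{id, p.MarktName, p.Stadt, p.Bundesland}
}

// ptr is a small helper for building patches in examples.
func ptr(s string) *string { return &s }

func main() {
	args := patchArgs("some-uuid", QueuePatch{Stadt: ptr("Köln")})
	fmt.Println(updatePendingSQL)
	fmt.Println(len(args), "args; only stadt is non-nil")
}
```

One consequence of this pattern is that a PATCH cannot reset a column to NULL, since NULL is the "not provided" signal.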
a44005b694 Merge branch 'fix/discovery-rate-limit-and-polish' into 'main'
fix(discovery): defer rate-limited buckets + polish queue table

See merge request vikingowl/marktvogt.de!11
2026-04-18 07:21:24 +00:00
98eae40755 fix(discovery): defer rate-limited buckets + polish queue table
Rate limits (Mistral web_search 429) used to get counted as hard errors,
marking the bucket as queried and bumping the Errors(24h) strip — even
though the right behavior is to wait and try again later.

Backend:
- isRateLimit() matches "rate limit" / "status 429" in the error string.
- On persistent rate-limit after one 10s retry: leave last_queried_at
  unchanged (bucket stays eligible for next tick) and abort the
  remainder of this tick — Mistral's web_search budget is shared, no
  point hammering more buckets in the same batch.
- TickSummary gains rate_limited counter; Errors stays for real failures.

Frontend:
- Dates: RFC3339 → 'DD.MM.YYYY' German format, range rendered as
  'DD.MM.YYYY – DD.MM.YYYY'.
- Queue table: cell horizontal padding, uppercase compact headers,
  scrollable on narrow viewports, dark-mode variants on every color
  (emerald/amber badges, link color, reject button), Region folds
  bundesland||land into a single column (Land was always 'Deutschland'
  for DACH anyway).
2026-04-18 09:21:05 +02:00
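The deferral logic from 98eae40755 can be sketched as a tick loop with an injected query and wait function. The loop shape, the `runTick` name, and the lower-casing inside `isRateLimit` are assumptions; the matched substrings, the single retry, and the abort-on-persistent-limit behavior come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isRateLimit matches the substrings named in the commit. Lower-casing
// the message first is an assumption of this sketch.
func isRateLimit(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "rate limit") || strings.Contains(msg, "status 429")
}

// TickSummary is reduced to the counters relevant here.
type TickSummary struct{ Queried, RateLimited, Errors int }

// runTick is a hypothetical loop: retry a rate-limited bucket once after
// wait(), and abort the rest of the tick if the limit persists.
func runTick(buckets []string, query func(string) error, wait func()) TickSummary {
	var s TickSummary
	for _, b := range buckets {
		err := query(b)
		if isRateLimit(err) {
			wait() // 10s in the commit
			err = query(b)
		}
		switch {
		case err == nil:
			s.Queried++
		case isRateLimit(err):
			// Bucket stays eligible: last_queried_at is not touched.
			s.RateLimited++
			return s
		default:
			s.Errors++
		}
	}
	return s
}

// demoTick simulates three buckets where the second is rate limited.
func demoTick() (TickSummary, int) {
	calls := 0
	query := func(b string) error {
		calls++
		if b == "b" {
			return errors.New("mistral: status 429")
		}
		return nil
	}
	return runTick([]string{"a", "b", "c"}, query, func() {}), calls
}

func main() {
	s, calls := demoTick()
	fmt.Printf("%+v after %d calls\n", s, calls)
}
```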
e4ef4adad6 Merge branch 'fix/discovery-json-tags' into 'main'
fix(discovery): add json tags to domain types

See merge request vikingowl/marktvogt.de!10
2026-04-18 07:11:55 +00:00
b6ace52ada fix(discovery): add json tags to domain types
Without snake_case json tags, Go serializes fields as PascalCase (ID,
MarktName, etc.) — but the Svelte frontend reads snake_case. Every
row.id on the client was undefined, which made Svelte 5 see identical
'undefined' keys across the {#each queue as row (row.id)} loop and
throw each_key_duplicate.

Adds explicit snake_case tags to Bucket, DiscoveredMarket, and
RejectedDiscovery to match what the TypeScript types already expect.
2026-04-18 09:11:44 +02:00
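The effect described in b6ace52ada is easy to reproduce with encoding/json. The two struct types below are illustrative stand-ins, not the repository's DiscoveredMarket.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// untagged serializes with the Go field names as JSON keys.
type untagged struct {
	ID        string
	MarktName string
}

// tagged serializes with the snake_case keys the frontend reads.
type tagged struct {
	ID        string `json:"id"`
	MarktName string `json:"markt_name"`
}

// encode marshals v and returns the JSON text, or the error text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(untagged{ID: "42", MarktName: "Beispielmarkt"}))
	fmt.Println(encode(tagged{ID: "42", MarktName: "Beispielmarkt"}))
}
```

With the untagged type, a client reading `row.id` gets undefined for every row, which matches the duplicate-key failure the commit describes.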
8f1efe73f2 Merge branch 'fix/discovery-cron-service-port' into 'main'
fix(helm): CronJob curls the Service port, not the container port

See merge request vikingowl/marktvogt.de!9
2026-04-18 07:00:26 +00:00
5a561b3092 fix(helm): CronJob curls the Service port, not the container port
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, which isn't exposed by the Service — every tick
timed out after ~135s with "Could not connect to server".

Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
2026-04-18 09:00:16 +02:00
1252044d60 Merge branch 'fix/discovery-empty-card-dark' into 'main'
fix(web): dark-mode variants for discovery empty-queue card

See merge request vikingowl/marktvogt.de!8
2026-04-18 06:53:43 +00:00
173e7c5013 fix(web): dark-mode variants for discovery empty-queue card 2026-04-18 08:53:25 +02:00
ce75448f1e Merge branch 'fix/discovery-empty-slice-null' into 'main'
fix(discovery): render empty queue as [] not null (500 on empty prod)

See merge request vikingowl/marktvogt.de!7
2026-04-18 06:44:52 +00:00
14e1a36622 fix(discovery): render empty queue as [] not null (500 on empty prod)
Go's nil slice marshals as JSON null, not [], which crashed the Svelte
page's .length access on fresh installs where no discovery tick has
happened yet. Reproduced in production: /admin/discovery → 500 because
data.queue was null and {queue.length} dereferenced it.

Backend: initialize every returning slice in repository.go via
make([]T, 0) so zero rows serialize as [] consistently. Also applies to
PickStaleBuckets, ListSeriesByCity, and Stats.RecentErrors.

Web: coalesce data.queue / data.stats.recent_errors at the top of the
Svelte script with `?? []` so future nil-slice regressions don't take
the whole page down.
2026-04-18 08:44:17 +02:00
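The nil-slice behavior behind 14e1a36622 can be shown in a few lines. The `listQueue` function is a hypothetical stand-in for a repository method that returns zero rows.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encode marshals v and returns the JSON text, or the error text.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return err.Error()
	}
	return string(b)
}

// listQueue mimics a repository call with no matching rows. With
// initialize=false the result slice is never allocated and stays nil.
func listQueue(initialize bool) []string {
	var rows []string
	if initialize {
		rows = make([]string, 0)
	}
	return rows
}

func main() {
	fmt.Println(encode(listQueue(false))) // nil slice
	fmt.Println(encode(listQueue(true)))  // empty, non-nil slice
}
```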
134cc9726b Merge branch 'feature/discovery-admin-stats' into 'main'
feat(discovery): admin stats strip + sidebar nav link

See merge request vikingowl/marktvogt.de!6
2026-04-18 06:35:10 +00:00
b7670b6152 feat(discovery): admin stats strip + sidebar nav link
Surfaces CronJob health signals without needing kubectl: last tick time
(stale-amber if > 6h), buckets due now, errors in the last 24h (with an
expandable list of the most recent failing buckets), and queue size.

Also wires the previously-orphaned /admin/discovery route into the admin
sidebar next to Märkte.

- backend: new GET /admin/discovery/stats endpoint; Stats + BucketError
  types; repository Stats() aggregates four counters + top 5 failing
  buckets.
- web: +page.server.ts fetches stats in parallel with queue;
  +page.svelte renders a 4-card strip above the queue table.
2026-04-18 08:34:34 +02:00
2debc15bc7 Merge branch 'fix/discovery-cron-podsecurity' into 'main'
fix(helm): add restricted PodSecurity settings to discovery CronJob

See merge request vikingowl/marktvogt.de!5
2026-04-18 06:27:52 +00:00
1ba8f856b4 fix(helm): add restricted PodSecurity settings to discovery CronJob
Previous deploys emitted 4 warnings on the discovery-tick Pod template
against the restricted:latest policy. Today they are warnings; if the
namespace enforcement tightens, admission will silently drop the Pod.

Pod-level: runAsNonRoot, runAsUser/runAsGroup 100 (curlimages/curl's
built-in non-root UID), seccompProfile RuntimeDefault.
Container-level: allowPrivilegeEscalation false, capabilities drop ALL.
2026-04-18 08:26:40 +02:00
0a408a40ba Merge branch 'ci/discovery-helm-vars' into 'main'
ci(deploy): wire AI_AGENT_DISCOVERY and DISCOVERY_TOKEN into helm upgrade

See merge request vikingowl/marktvogt.de!4
2026-04-18 06:17:23 +00:00
502d1d2109 ci(deploy): wire AI_AGENT_DISCOVERY and DISCOVERY_TOKEN into helm upgrade 2026-04-18 08:17:06 +02:00
f36558b784 Merge branch 'feature/dach-market-discovery' into 'main'
feat(discovery): DACH market discovery pipeline

See merge request vikingowl/marktvogt.de!3
2026-04-18 06:12:03 +00:00
b222d5fbc8 fix(discovery): use interval multiplication for forward-window query
pgx cannot implicitly encode int arg into text for the `$1 || ' month'`
concatenation pattern (error: "unable to encode 12 into text format for
text (OID 25): cannot find encode plan"). Multiplication with a known
interval works directly with the int parameter and is semantically
equivalent.

Discovered during the T19 smoke test — the tick endpoint returned 500
on every call before this fix.
2026-04-18 08:08:27 +02:00
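The two query shapes contrasted in b222d5fbc8 are sketched below as Go string constants. Both statements are illustrative: the real forward-window query is not shown in the log, and the function name `forwardWindow` is hypothetical. The point is only that the second form lets the driver send the month count as a number instead of text.

```go
package main

import "fmt"

// concatWindowSQL resembles the pattern the commit replaced: $1 takes
// part in a text concatenation, so the driver must encode it as text.
const concatWindowSQL = `SELECT now() + ($1 || ' month')::interval`

// multiplyWindowSQL is the replacement pattern: $1 stays numeric and
// scales a fixed one-month interval.
const multiplyWindowSQL = `SELECT now() + $1 * interval '1 month'`

// forwardWindow returns the query and its arguments for a window of the
// given number of months. The int is passed through unchanged.
func forwardWindow(months int) (string, []any) {
	return multiplyWindowSQL, []any{months}
}

func main() {
	q, args := forwardWindow(12)
	fmt.Println("before:", concatWindowSQL)
	fmt.Println("after: ", q, args)
}
```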
31ce937f55 feat(helm): add discovery CronJob + token secret + env wiring
Adds a batch/v1 CronJob that POSTs to /api/v1/admin/discovery/tick on a
configurable schedule (default every 4h). Wires DISCOVERY_TOKEN into the
ci-secrets Secret and projects discovery/AI env vars into the backend
Deployment.
2026-04-18 07:57:18 +02:00