- Service.Crawl no longer link-verifies Quellen/Website for crawler
events. Those URLs come from real HTML of trusted sources and have
been implicitly verified at parse time. Removing this makes the
insert phase complete in well under a minute even for 1500+ events
and stops timeout-truncated processing from being misreported as link failures.
LinkCheckFailed counter retained for JSON shape stability.
- Suendenfrei pagination now stops on len(events) == 0. Previously the
site's footer <h3><a> links kept anchors.Length() > 0 indefinitely,
sending the crawler as far as page 90 before the outer ctx timed out.
- New similarity helpers (SimilarityScore, FindSimilar) and endpoint
GET /api/v1/admin/discovery/queue/:id/similar. The score is multiplicative:
the normalized-name Levenshtein ratio gates the city-match and
date-proximity bonuses, so events that merely share a city and date are
not flagged as near-duplicates when their names differ. Lets the admin
review view surface near-duplicates that slip past exact-match dedup
(date typos, city variants, trailing-word swaps).
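A minimal sketch of that gating, assuming the existing DiscoveredMarket
fields plus hypothetical levenshteinRatio/daysApart helpers; the threshold
and bonus factors are illustrative, only the gating shape is from this
change:

```go
// Name similarity is the base score; city and date only add bonuses once
// the names are already close, so coincident city/date pairs stay low.
func SimilarityScore(a, b DiscoveredMarket) float64 {
	score := levenshteinRatio(NormalizeName(a.MarktName), NormalizeName(b.MarktName))
	if score < 0.6 {
		// Names differ too much: sharing a city or a date must not be
		// enough to flag two distinct events as near-duplicates.
		return score
	}
	if NormalizeCity(a.Stadt) == NormalizeCity(b.Stadt) {
		score *= 1.2 // city-match bonus
	}
	if daysApart(a.StartDatum, b.StartDatum) <= 14 {
		score *= 1.1 // date-proximity bonus
	}
	return math.Min(score, 1)
}
```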
Gateway (NGF) ignored our HTTPRoute timeouts field (UnsupportedField).
Flip to fire-and-forget instead: the handler returns 202 immediately, a
goroutine runs the crawl with a detached 5-minute context,
GET /admin/discovery/crawl-status reports the current state, and the admin
UI polls every 3s until running=false.
HTTP requests are now all sub-second; gateway timeout is no longer in
the crawl critical path. Concurrent-run protection via atomic.Bool
(replaces TryLock), rate limit semantics unchanged.
Handler.Crawl now spawns a goroutine with a 5-minute detached context
and returns 202 immediately. Admin UI polls the new
GET /admin/discovery/crawl-status every 3s until running=false, then
renders CrawlSummary. Bypasses the 60s nginx-gateway proxy_read_timeout
entirely — HTTP requests are all sub-second.
Concurrency: atomic.Bool guard (CompareAndSwap) replaces TryLock,
resultMu RWMutex protects the summary/error state, rateMu protects
the rate-limit check. Rate limit semantics unchanged (still applies
to admin-session path, bearer-token bypass via context flag).
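A minimal sketch of the handler shape described above, assuming
hypothetical field names (running, resultMu, lastSummary, lastErr) and a
409 for the already-running case; the 202, the CompareAndSwap guard, and
the detached 5-minute context are the parts taken from this change:

```go
func (h *Handler) Crawl(w http.ResponseWriter, r *http.Request) {
	// atomic.Bool guard: only one crawl at a time, without holding a
	// lock for the duration of the run.
	if !h.running.CompareAndSwap(false, true) {
		http.Error(w, "crawl already running", http.StatusConflict)
		return
	}

	// Detached context: the gateway closing the HTTP connection can no
	// longer cancel the crawl mid-insert.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	go func() {
		defer cancel()
		defer h.running.Store(false)

		summary, err := h.service.Crawl(ctx)

		h.resultMu.Lock()
		h.lastSummary, h.lastErr = summary, err
		h.resultMu.Unlock()
	}()

	// 202: the result arrives via crawl-status polling, not this response.
	w.WriteHeader(http.StatusAccepted)
}
```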
The gateway cut the HTTP request at 60s, which cancelled the request ctx;
the cancellation cascaded into the link-verifier in Service.Crawl's insert
pipeline. Every merged event was then dropped as LinkCheckFailed, yielding
zero new queue rows even though the crawler had parsed ~1500 events.
The fix has three parts: a 300s HTTPRoute timeout for /crawl*, an
insert-phase context detached from the HTTP request ctx, and a CrawlSummary
INFO log line for diagnosability.
- HTTPRoute: add 300s request+backendRequest timeout rule for
/api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
60s default was cutting the connection mid-crawl.
- Service.Crawl: detach the insert pipeline from the HTTP request context
with a 3-minute internal timeout (see the sketch after this list).
Previously a canceled request ctx
cascaded into the link-verifier, failing every URL check and
counting every merged event as LinkCheckFailed. Inserts now complete
even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
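The sketch referenced in the second bullet, assuming Go 1.21+
context.WithoutCancel and a hypothetical insertDiscovered helper; only the
3-minute budget is from this change:

```go
// Keep the request context's values but drop its cancellation, then cap
// the insert phase at 3 minutes so it finishes even if the gateway has
// already closed the connection.
insertCtx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 3*time.Minute)
defer cancel()
if err := s.insertDiscovered(insertCtx, merged); err != nil {
	return CrawlSummary{}, err
}
```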
Deletes the Mistral Pass 0 code path from discovery, flips the k8s
CronJob to the crawler endpoint on a daily schedule, and adds a
"Run crawl" button to the admin UI that renders CrawlSummary.
Net change: ~-900 lines / +150 lines. Mistral remains wired for Pass 1
and Pass 2 research — only Pass 0 discovery is replaced by the deterministic
5-source Go crawler.
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
- Service.Crawl derives Konfidenz from merged source count + rank instead of
the hardcoded "mittel": two+ sources -> "hoch"; single curated source ->
"mittel"; single suendenfrei (prose regex) -> "niedrig" (see the sketch
after this list).
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
so the validator's agent-specific rules don't fire on them and operators
can filter the queue by origin. Added Konfidenz* and AgentStatus*
constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
(festival_alarm one-day events, suendenfrei lines without a "bis" range).
Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
rate limit (previously silently floored to 1/hour). Documented behavior
for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
not safe for production routing.
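The sketch referenced in the first bullet, assuming Konfidenz* constant
names and a "suendenfrei" source identifier; only the three tiers are from
this change:

```go
// Confidence follows how well-corroborated a merged event is.
func deriveKonfidenz(sources []string) string {
	switch {
	case len(sources) >= 2:
		return KonfidenzHoch // corroborated by at least two sources
	case len(sources) == 1 && sources[0] == "suendenfrei":
		return KonfidenzNiedrig // prose-regex source, lowest-ranked
	default:
		return KonfidenzMittel // single curated source
	}
}
```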
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
Extract normalize helpers into a discovery/normalize subpackage to break
the otherwise circular import (discovery/crawler → discovery →
discovery/crawler). NormalizeName/NormalizeCity in discovery become thin
wrappers; merger.go switches to discovery/normalize directly.
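A sketch of the wrapper shape, with an illustrative module path and assumed
exported names inside the subpackage:

```go
package discovery

import "example.org/app/internal/discovery/normalize"

// Thin wrappers keep existing callers compiling; the implementation now
// lives in discovery/normalize, which the crawler can import directly.
func NormalizeName(s string) string { return normalize.Name(s) }
func NormalizeCity(s string) string { return normalize.City(s) }
```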
Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).
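A sketch of the seam, assuming a Run signature and RawEvent type that are
not spelled out here; only crawlerRunner and NewServiceWithCrawler are
named by this change:

```go
// crawlerRunner is the seam Service.Crawl depends on; tests inject a stub.
type crawlerRunner interface {
	Run(ctx context.Context) (map[string][]RawEvent, error) // keyed by source name
}

func NewServiceWithCrawler(repo Repository, c crawlerRunner) *Service {
	return &Service{repo: repo, crawler: c}
}
```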
Pass 1 and Pass 2 now detect Mistral web_search rate limits (shared with
the Pass 0 CronJob) and return a proper HTTP 429 with Retry-After: 60
instead of a generic 500 "AI research failed". Pass 2 is enrichment-only,
so rate-limits there fall through with pass1 results intact.
- pkg/ai: new shared IsRateLimit helper + DefaultRetryAfterSeconds=60.
discovery/service.go drops its local copy and imports the shared one.
- apierror.TooManyRequests now accepts an optional custom message so the
response body can include "try again in ~60s".
- market/research.go: respondRateLimited helper sets Retry-After,
downgrades the log line from ERROR to WARN (rate-limits are expected
state, not a fault), and returns 429 with a structured rate_limited
code the admin UI can key off of.
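A sketch combining the shared helper and the 429 path; the matched
substrings mirror the older discovery-local check, while the exact
apierror.TooManyRequests signature and the log fields are assumptions:

```go
// pkg/ai: shared classification of Mistral web_search rate limits.
func IsRateLimit(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "rate limit") || strings.Contains(msg, "status 429")
}

// market/research.go: answer 429 + Retry-After instead of a generic 500.
func respondRateLimited(w http.ResponseWriter, log *slog.Logger, pass string) {
	// WARN, not ERROR: a rate limit is expected state, not a fault.
	log.Warn("mistral web_search rate limited", "pass", pass)
	w.Header().Set("Retry-After", strconv.Itoa(ai.DefaultRetryAfterSeconds))
	apierror.TooManyRequests(w, "AI research rate limited, try again in ~60s")
}
```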
Pass 0 agents produce schema-valid but semantically wrong output: markets
claimed in the wrong bundesland, status 'bestaetigt' with a hinweis about
Vorjahresdaten, etc. The schema alone can't catch these. This validator
does, as a blocking gate before InsertDiscovered.
Checks (Pass 0 scope):
- bundesland_mismatch: agent's bundesland must equal bucket.region, with
a light normalizer for CH 'Kanton X' prefix so Phase B can refine the
Schweiz seed without a signature break.
- status_hinweis_inconsistent: if agent_status=='bestaetigt' AND hinweis
contains 'vorjahr' (case-insensitive), the agent contradicted itself.
Errors drop the market (counted as summary.validation_failed); warnings
would be merged into hinweis, but no warning-level checks exist yet at
Pass 0 scope, so that path is a reserved placeholder.
Phase B (research agent) checks will extend this file: oeffnungszeiten
dedup, start_datum window coverage, full quellen liveness for Pass 1.
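A sketch of the two blocking checks, with assumed field names on
Pass0Market and Bucket and a simplified string-slice error representation:

```go
func validatePass0(m Pass0Market, bucket Bucket) []string {
	var errs []string

	// bundesland_mismatch: the agent's bundesland must equal the bucket
	// region. Light normalizer: strip a Swiss "Kanton " prefix first.
	norm := func(s string) string { return strings.TrimPrefix(strings.TrimSpace(s), "Kanton ") }
	if norm(m.Bundesland) != norm(bucket.Region) {
		errs = append(errs, "bundesland_mismatch")
	}

	// status_hinweis_inconsistent: "bestaetigt" contradicted by a hinweis
	// that mentions Vorjahresdaten.
	if m.AgentStatus == "bestaetigt" && strings.Contains(strings.ToLower(m.Hinweis), "vorjahr") {
		errs = append(errs, "status_hinweis_inconsistent")
	}
	return errs
}
```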
Pass 0 splits every month into two halves (H1 = days 1-15, H2 = 16-EOM)
so each agent call fits within Mistral's 4096 max_tokens budget. The
response schema picks up richer per-market signals and dead agent URLs
get filtered before they land in the admin queue.
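A sketch of the half-month key, with an illustrative helper name:

```go
// Days 1-15 map to H1, day 16 through end of month to H2.
func halbmonat(t time.Time) string {
	if t.Day() <= 15 {
		return "H1"
	}
	return "H2"
}
```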
DB:
- 000015: add halbmonat char(2) to discovery_buckets, widen unique key,
backfill existing rows as H1 + insert H2 siblings (624 → 1248 rows).
- 000016: rename discovered_markets.extraktion → konfidenz with
best-effort value mapping (verbatim→hoch, abgeleitet→mittel); add
agent_status column.
Backend:
- model: Bucket gains Halbmonat; Pass0Bucket same. Pass0Market renames
Extraktion → Konfidenz and adds AgentStatus (JSON tag "status").
DiscoveredMarket mirrors both fields; queue-lifecycle Status column
stays distinct from agent-reported AgentStatus.
- repository: all SELECT/INSERT touched to use the new columns; picker
orders by year_month, halbmonat so H1 runs before H2 in the same
month.
- agent client: prompt now injects halbmonat and recherche_datum (today)
so the agent has explicit date context.
- link verification: new LinkChecker does concurrent HEAD (GET fallback
on 405) with a 5s timeout. FilterURLs runs before InsertDiscovered —
markets whose quellen all fail are dropped and counted as
link_check_failed in TickSummary. Failing website URLs are cleared
but don't block insert.
- Service.linkChecker is a narrow interface so tests inject a noop
stub instead of hitting the network.
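A sketch of the per-URL probe, assuming a client field on LinkChecker and
a helper name not spelled out here; the 5s timeout and the HEAD-then-GET
fallback on 405 are from the bullet above:

```go
// alive reports whether a URL responds with any 2xx/3xx status within 5s.
func (c *LinkChecker) alive(ctx context.Context, rawURL string) bool {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	probe := func(method string) (int, error) {
		req, err := http.NewRequestWithContext(ctx, method, rawURL, nil)
		if err != nil {
			return 0, err
		}
		resp, err := c.client.Do(req)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}

	status, err := probe(http.MethodHead)
	if err == nil && status == http.StatusMethodNotAllowed {
		status, err = probe(http.MethodGet) // some servers reject HEAD outright
	}
	return err == nil && status >= 200 && status < 400
}
```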
Web:
- DiscoveredMarket type gains konfidenz + agent_status, drops extraktion.
- Queue column renames "Extraktion" → "Konfidenz" with three-level
coloring (hoch=emerald, mittel=amber, niedrig=red, else neutral).
- A small pill next to markt_name surfaces agent_status when it's not
"bestaetigt" — red for "abgesagt", amber for "unklar" and
"vorjahr_unbestaetigt" — so risky entries are obvious before accept.
Expanding any row in the discovery queue now reveals:
- Quellen as clickable URLs (was just a count)
- Hinweis if the agent emitted one
- Inline edit form for markt_name, stadt, bundesland, start/end date,
and website — the fields the Pass 0 agent gets wrong most often
Backend:
- PATCH /admin/discovery/queue/:id applies a partial update to pending
entries via a COALESCE-based SQL update (see the sketch after this list).
Only fields that were set are written.
- Service recomputes name_normalized when markt_name or stadt change so
dedup stays consistent after edits.
- Status check ensures only 'pending' entries are mutable.
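The sketch referenced in the first bullet, with an assumed column list and
pgx-style positional parameters around the COALESCE pattern described
above:

```go
// NULL parameters fall through to the existing value; the status guard
// keeps non-pending rows immutable.
const patchDiscoveredMarket = `
UPDATE discovered_markets
SET markt_name  = COALESCE($2, markt_name),
    stadt       = COALESCE($3, stadt),
    bundesland  = COALESCE($4, bundesland),
    start_datum = COALESCE($5, start_datum),
    end_datum   = COALESCE($6, end_datum),
    website     = COALESCE($7, website)
WHERE id = $1 AND status = 'pending'`
```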
Web:
- Row state $expandedId holds at most one open drawer at a time.
- Dates round-trip through <input type="date"> using the shared
dateInputValue helper; form action converts back to RFC3339 for Go.
- Existing Accept/Reject buttons untouched — workflow is edit-then-accept.
Rate limits (Mistral web_search 429) used to get counted as hard errors,
marking the bucket as queried and bumping the Errors(24h) strip — even
though the right behavior is to wait and try again later.
Backend:
- isRateLimit() matches "rate limit" / "status 429" in the error string.
- On persistent rate-limit after one 10s retry: leave last_queried_at
unchanged (bucket stays eligible for next tick) and abort the
remainder of this tick — Mistral's web_search budget is shared, no
point hammering more buckets in the same batch.
- TickSummary gains rate_limited counter; Errors stays for real failures.
Frontend:
- Dates: RFC3339 → 'DD.MM.YYYY' German format, range rendered as
'DD.MM.YYYY – DD.MM.YYYY'.
- Queue table: cell horizontal padding, uppercase compact headers,
scrollable on narrow viewports, dark-mode variants on every color
(emerald/amber badges, link color, reject button), Region folds
bundesland||land into a single column (Land was always 'Deutschland'
for DACH anyway).
Without snake_case json tags, Go serializes fields as PascalCase (ID,
MarktName, etc.) — but the Svelte frontend reads snake_case. Every
row.id on the client was undefined, which made Svelte 5 see identical
'undefined' keys across the {#each queue as row (row.id)} loop and
throw each_key_duplicate.
Adds explicit snake_case tags to Bucket, DiscoveredMarket, and
RejectedDiscovery to match what the TypeScript types already expect.
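The fix in miniature, with the field list abbreviated:

```go
// Without the tags, encoding/json emits "ID"/"MarktName" and row.id is
// undefined on the client; with them, the payload matches the TS types.
type DiscoveredMarket struct {
	ID        int64  `json:"id"`
	MarktName string `json:"markt_name"`
	Stadt     string `json:"stadt"`
	// ...remaining fields tagged the same way
}
```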
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, which isn't exposed by the Service — every tick
timed out after ~135s with "Could not connect to server".
Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
Go's nil slice marshals as JSON null, not [], which crashed the Svelte
page's .length access on fresh installs where no discovery tick has
happened yet. Reproduced in production: /admin/discovery → 500 because
data.queue was null and {queue.length} dereferenced it.
Backend: initialize every returning slice in repository.go via
make([]T, 0) so zero rows serialize as [] consistently. Also applies to
PickStaleBuckets, ListSeriesByCity, and Stats.RecentErrors.
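A self-contained illustration of the nil-slice behaviour this guards
against:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var nilSlice []int      // what a "no rows" query used to return
	empty := make([]int, 0) // what repository.go returns now

	a, _ := json.Marshal(nilSlice)
	b, _ := json.Marshal(empty)
	fmt.Println(string(a)) // null -> client-side .length blows up
	fmt.Println(string(b)) // []   -> safe
}
```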
Web: coalesce data.queue / data.stats.recent_errors at the top of the
Svelte script with `?? []` so future nil-slice regressions don't take
the whole page down.