Commit Graph

218 Commits

073e55c7fc feat(discovery): drop link-check from crawl path, fix suendenfrei pagination, add similarity helper
- Service.Crawl no longer link-verifies Quellen/Website for crawler
  events. Those URLs come from real HTML of trusted sources and have
  been implicitly verified at parse time. Removing this makes the
  insert phase complete in well under a minute even for 1500+ events
  and stops misreporting timeout-limited processing as link failures.
  LinkCheckFailed counter retained for JSON shape stability.

- Suendenfrei pagination now stops on len(events) == 0. Previously the
  site's footer <h3><a> links kept anchors.Length() > 0 indefinitely,
  driving the crawler past page 90 until the outer ctx timed out.

- New similarity helper (SimilarityScore, FindSimilar) and endpoint
  GET /api/v1/admin/discovery/queue/:id/similar. The score multiplies a
  normalized-name Levenshtein ratio with city-match and date-proximity
  bonuses, and the name ratio gates those bonuses, so events that merely
  share a city and date are not flagged as near-duplicates when their
  names differ. Lets admin review surface near-duplicates that slip past
  exact-match dedup (date typos, city variants, trailing-word swaps).
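
A minimal Go sketch of the gating idea; the 0.6 gate and the bonus factors are illustrative assumptions, not the committed SimilarityScore values:

```go
package main

import "fmt"

// levenshtein is a plain dynamic-programming edit distance.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, minInt(cur[j-1]+1, prev[j-1]+cost))
		}
		prev = cur
	}
	return prev[len(rb)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// similarityScore: the normalized-name ratio gates the bonuses, so two
// events in the same city on the same date still score low when their
// names differ.
func similarityScore(nameA, nameB string, sameCity bool, daysApart int) float64 {
	maxLen := len([]rune(nameA))
	if l := len([]rune(nameB)); l > maxLen {
		maxLen = l
	}
	if maxLen == 0 {
		return 0
	}
	ratio := 1 - float64(levenshtein(nameA, nameB))/float64(maxLen)
	if ratio < 0.6 { // gate: dissimilar names never earn bonuses
		return ratio
	}
	score := ratio
	if sameCity {
		score *= 1.2 // illustrative bonus factor
	}
	if daysApart >= 0 && daysApart <= 3 {
		score *= 1.1 // illustrative bonus factor
	}
	if score > 1 {
		score = 1
	}
	return score
}

func main() {
	fmt.Printf("%.2f\n", similarityScore("mittelaltermarkt esslingen", "mittelaltermarkt essligen", true, 0))
}
```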
2026-04-18 20:05:07 +02:00
cdd43cc45a Merge branch 'feat/crawl-async' — async crawl handler, UI polls status
Gateway (NGF) ignored our HTTPRoute timeouts field (UnsupportedField).
Flipping to fire-and-forget: handler returns 202 immediately, goroutine
runs crawl with detached 5-min context, GET /admin/discovery/crawl-status
returns state, admin UI polls every 3s until running=false.

HTTP requests are now all sub-second; gateway timeout is no longer in
the crawl critical path. Concurrent-run protection via atomic.Bool
(replaces TryLock), rate limit semantics unchanged.
2026-04-18 19:25:37 +02:00
9f286b8029 feat(discovery): async crawl — 202 Accepted, status endpoint, UI polls
Handler.Crawl now spawns a goroutine with a 5-minute detached context
and returns 202 immediately. Admin UI polls the new
GET /admin/discovery/crawl-status every 3s until running=false, then
renders CrawlSummary. Bypasses the 60s nginx-gateway proxy_read_timeout
entirely — HTTP requests are all sub-second.

Concurrency: atomic.Bool guard (CompareAndSwap) replaces TryLock,
resultMu RWMutex protects the summary/error state, rateMu protects
the rate-limit check. Rate limit semantics unchanged (still applies
to admin-session path, bearer-token bypass via context flag).
2026-04-18 19:24:48 +02:00
2ea8a9a6f3 Merge branch 'fix/discovery-crawl-timeout' — crawl survives gateway timeout
Gateway cut the HTTP request at 60s, which cancelled the request ctx
and cascaded into the link-verifier in Service.Crawl's insert pipeline.
Every merged event was then dropped as LinkCheckFailed, resulting in
zero new queue rows despite the crawler parsing ~1500 events.

Fix is three parts: HTTPRoute timeout 300s for /crawl*, insert-phase
context detached from the HTTP request ctx, and a CrawlSummary INFO
log line for diagnosability.
2026-04-18 18:40:30 +02:00
f6e4e5c29f fix(discovery): crawl survives gateway timeout and long-running runs
- HTTPRoute: add 300s request+backendRequest timeout rule for
  /api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
  60s default was cutting the connection mid-crawl.
- Service.Crawl: detach insert pipeline from HTTP request context with
  a 3-minute internal timeout. Previously a canceled request ctx
  cascaded into the link-verifier, failing every URL check and
  counting every merged event as LinkCheckFailed. Inserts now complete
  even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
  backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
2026-04-18 18:39:21 +02:00
2bb5156c0b Merge branch 'feat/discovery-crawler-mr2' — Ship 1 MR 2 cutover
Deletes the Mistral Pass 0 code path from discovery, flips the k8s
CronJob to the crawler endpoint on a daily schedule, and adds a
Run crawl button to the admin UI that renders CrawlSummary.

Net change: ~-900 lines / +150 lines. Mistral remains wired for Pass 1
and Pass 2 research — only Pass 0 discovery is replaced by the deterministic
5-source Go crawler.
2026-04-18 17:49:08 +02:00
ba453a910f chore(helm): daily discovery cron hits /crawl endpoint 2026-04-18 17:46:39 +02:00
3add4fb7ad refactor(discovery): remove Mistral Pass 0 path; /crawl is canonical
Deletes agent_client.go, agent_client_test.go, and the discovery-compare
diagnostic CLI. Removes Tick/PickBuckets/processOneBucket/processBucketResponse
from Service; renames NewServiceWithCrawler to NewService. Drops BatchSize,
ForwardMonths, AgentDiscovery config fields and their env reads. PickStaleBuckets
and UpdateBucketQueried removed from Repository interface (no callers). Stats
hardcodes forwardMonths=12. /tick route removed; /crawl is now the only machine
path, still protected by requireTickToken middleware.
2026-04-18 17:42:30 +02:00
a729412478 feat(admin): add Run crawl button and CrawlSummary rendering to discovery page 2026-04-18 17:29:05 +02:00
4c7c3dcb37 Merge branch 'feat/discovery-crawler' — DACH discovery crawler MR 1
Replaces Mistral Pass 0 with a deterministic 5-source Go crawler
(marktkalendarium.de, mittelalterkalender.info, festival-alarm.com,
mittelaltermarkt.online Tribe REST, suendenfrei.tv). Pass 1/2 enrichment
paths unchanged. Existing Mistral Tick path preserved alongside; cutover
gated on coverage verification via cmd/discovery-compare.

Spec: docs/superpowers/specs/2026-04-18-dach-discovery-crawler-design.md
Plan: docs/superpowers/plans/2026-04-18-dach-discovery-crawler.md
2026-04-18 17:03:27 +02:00
7c8a8c6419 fix(discovery): review follow-ups — konfidenz signal, end-date default, determinism, rate-limit=0
- Service.Crawl derives Konfidenz from merged source count + rank instead of
  hardcoded "mittel". Two+ sources -> "hoch"; single curated source ->
  "mittel"; single suendenfrei (prose regex) -> "niedrig".
- New AgentStatus constant "crawler" replaces "bestaetigt" for crawler rows
  so the validator's agent-specific rules don't fire on them and operators
  can filter the queue by origin. Added Konfidenz* and AgentStatus*
  constants to model.go.
- Default EndDatum to StartDatum when a source reports a single date
  (festival_alarm one-day events, suendenfrei lines without a "bis" range).
  Avoids Service.Accept rejecting nil-EndDatum rows.
- Sort PerSource names before assembling raw events for merge — makes
  merged output order deterministic across runs.
- NewHandler: manualRateLimitPerHour <= 0 now explicitly disables the
  rate limit (previously silently floored to 1/hour). Documented behavior
  for all three cases in a constructor comment.
- Added four new tests for Service.Crawl failure/quality paths:
  LinkCheckFailed, DedupedQueue, EndDatum default, multi-source Konfidenz.
- Documented the substring-match approximation in
  cmd/discovery-compare/main.go's groupCrawlerByBucket — diagnostic-only,
  not safe for production routing.
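The Konfidenz mapping in the first bullet can be sketched like this; matching the suendenfrei source by name is a simplification of the commit's rank check:

```go
package main

import "fmt"

// deriveKonfidenz maps merged source count (and, approximately, rank)
// to a confidence level for a queue row.
func deriveKonfidenz(sources []string) string {
	switch {
	case len(sources) >= 2:
		return "hoch" // independently confirmed by two or more sources
	case len(sources) == 1 && sources[0] == "suendenfrei":
		return "niedrig" // prose-regex source, weakest signal
	case len(sources) == 1:
		return "mittel" // single curated source
	default:
		return "niedrig"
	}
}

func main() {
	fmt.Println(deriveKonfidenz([]string{"marktkalendarium", "festival_alarm"}))
}
```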
2026-04-18 16:35:26 +02:00
c5a4bc441c feat(cmd): discovery-compare CLI for pre-cutover coverage verification 2026-04-18 16:08:48 +02:00
0bed4401fe feat(config): crawler user-agent and manual rate-limit knobs 2026-04-18 15:50:21 +02:00
91cd4d89b3 feat(discovery): POST /admin/discovery/crawl with mutex and rate limit
Exposes Service.Crawl via two HTTP routes: a bearer-token path that
bypasses the manual rate limit, and an admin-session path subject to a
configurable per-hour cap. A sync.Mutex blocks concurrent runs.
Includes handler tests for mutex reentry and rate limit enforcement.
2026-04-18 15:22:24 +02:00
b3289bc6e6 feat(discovery): Service.Crawl — orchestrate crawler through existing pipeline
Extract normalize helpers into a discovery/normalize subpackage to break
an otherwise circular import (discovery imports discovery/crawler, whose
merger would import discovery for the normalize helpers).
NormalizeName/NormalizeCity in discovery become thin wrappers; merger.go
switches to discovery/normalize directly.

Adds crawlerRunner interface, NewServiceWithCrawler constructor, CrawlSummary/
SourceSummary types, and Service.Crawl which wires the crawler output through
link-verify, dedup, validation, and insert — same pipeline as processBucketResponse
but without a bucket context (BucketID is nil on crawler-produced rows).
2026-04-18 15:03:02 +02:00
20176dd51f refactor(discovery): validator accepts *Bucket, skips bucket checks when nil 2026-04-18 14:43:07 +02:00
310673940e feat(discovery): migration 000017 — nullable bucket_id; model uses *uuid.UUID 2026-04-18 14:30:54 +02:00
507052e375 feat(discovery/crawler): source config and RunAll orchestrator 2026-04-18 14:09:22 +02:00
c013f6bc54 feat(discovery/crawler): cross-source merger with source-rank tiebreaks 2026-04-18 13:40:21 +02:00
3aed982e1c feat(discovery/crawler): log unparseable suendenfrei entries at INFO 2026-04-18 13:33:51 +02:00
2163621415 feat(discovery/crawler): suendenfrei.tv parser 2026-04-18 13:09:47 +02:00
94aa261c90 refactor(discovery/crawler): hoist land constants; document Tribe date format assumption 2026-04-18 13:04:28 +02:00
1cc7de0bb6 feat(discovery/crawler): mittelaltermarkt.online Tribe REST client 2026-04-18 12:53:32 +02:00
a55bb7e15b docs(discovery/crawler): clarify unused year param in parseDateAttr 2026-04-18 12:48:38 +02:00
93efb90967 feat(discovery/crawler): festival-alarm.com parser 2026-04-18 12:36:01 +02:00
91c058105e feat(discovery/crawler): mittelalterkalender.info parser 2026-04-18 12:24:49 +02:00
e6ec97c09d feat(discovery/crawler): marktkalendarium.de parser 2026-04-18 12:12:13 +02:00
57120beac0 feat(discovery/crawler): polite HTTP fetcher with retry and 429 backoff 2026-04-18 12:02:47 +02:00
31fea6fa3c test(discovery/crawler): add PLZ boundary + range coverage cases 2026-04-18 12:00:42 +02:00
eed76f1e76 docs(discovery/crawler): align PLZ helper comment with implementation 2026-04-18 11:57:34 +02:00
4694804331 feat(discovery/crawler): PLZ-to-land inference helper 2026-04-18 11:54:17 +02:00
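The shape of such a helper, with a hypothetical three-row excerpt of a range table (the real boundaries are finer-grained and not shown in this log):

```go
package main

import "fmt"

// plzRange is one row of a hypothetical PLZ-to-land lookup table.
type plzRange struct {
	lo, hi int
	land   string
}

var plzRanges = []plzRange{
	{1067, 1998, "Sachsen"},
	{20095, 21149, "Hamburg"},
	{80331, 81929, "Bayern"},
}

// landForPLZ returns the Bundesland for a postal code, or false when
// the code falls outside every known range.
func landForPLZ(plz int) (string, bool) {
	for _, r := range plzRanges {
		if plz >= r.lo && plz <= r.hi {
			return r.land, true
		}
	}
	return "", false
}

func main() {
	land, ok := landForPLZ(80331)
	fmt.Println(land, ok)
}
```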
e359d06d13 test(discovery/crawler): capture golden fixtures from five sources 2026-04-18 11:45:53 +02:00
5135f0a3be feat(discovery/crawler): scaffold subpackage with Source interface and RawEvent types 2026-04-18 11:36:07 +02:00
adf417b731 fix(research): 429-aware error handling for Pass 1/2
Pass 1 and Pass 2 now detect Mistral web_search rate limits (shared with
the Pass 0 CronJob) and return a proper HTTP 429 with Retry-After: 60
instead of a generic 500 "AI research failed". Pass 2 is enrichment-only,
so rate-limits there fall through with pass1 results intact.

- pkg/ai: new shared IsRateLimit helper + DefaultRetryAfterSeconds=60.
  discovery/service.go drops its local copy and imports the shared one.
- apierror.TooManyRequests now accepts an optional custom message so the
  response body can include "try again in ~60s".
- market/research.go: respondRateLimited helper sets Retry-After,
  downgrades the log line from ERROR to WARN (rate-limits are expected
  state, not a fault), and returns 429 with a structured rate_limited
  code the admin UI can key off of.
2026-04-18 10:33:13 +02:00
8e8bb8d4c3 Merge branch 'feature/discovery-validator' into 'main'
feat(discovery): validator — catches agent self-contradictions before insert

See merge request vikingowl/marktvogt.de!14
2026-04-18 08:05:20 +00:00
de1a3f6efb feat(discovery): validator — catches agent self-contradictions before insert
Pass 0 agents produce schema-valid but semantically wrong output: markets
claimed in the wrong bundesland, status 'bestaetigt' with a hinweis about
Vorjahresdaten, etc. The schema alone can't catch these. This validator
does, as a blocking gate before InsertDiscovered.

Checks (Pass 0 scope):
- bundesland_mismatch: agent's bundesland must equal bucket.region, with
  a light normalizer for CH 'Kanton X' prefix so Phase B can refine the
  Schweiz seed without a signature break.
- status_hinweis_inconsistent: if agent_status=='bestaetigt' AND hinweis
  contains 'vorjahr' (case-insensitive), the agent contradicted itself.

Errors drop the market (counted as summary.validation_failed); warnings
would get merged into hinweis — no warning-level checks exist yet at
Pass 0 scope, placeholder reserved.

Phase B (research agent) checks will extend this file: oeffnungszeiten
dedup, start_datum window coverage, full quellen liveness for Pass 1.
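The two Pass 0 checks can be sketched as follows; signatures and error plumbing are simplified, not the real validator API:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegion applies the light CH normalizer described above:
// strip a leading "Kanton " so "Kanton Zürich" matches "Zürich".
func normalizeRegion(s string) string {
	return strings.TrimPrefix(strings.TrimSpace(s), "Kanton ")
}

// validatePass0 returns the error codes of failed blocking checks.
func validatePass0(bundesland, bucketRegion, agentStatus, hinweis string) []string {
	var errs []string
	if normalizeRegion(bundesland) != normalizeRegion(bucketRegion) {
		errs = append(errs, "bundesland_mismatch")
	}
	if agentStatus == "bestaetigt" &&
		strings.Contains(strings.ToLower(hinweis), "vorjahr") {
		errs = append(errs, "status_hinweis_inconsistent")
	}
	return errs
}

func main() {
	fmt.Println(validatePass0("Bayern", "Hessen", "bestaetigt", "Termin aus dem Vorjahr"))
}
```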
2026-04-18 10:05:08 +02:00
fda30de158 Merge branch 'feature/discovery-pass0-halbmonat-plus-verify' into 'main'
feature/discovery pass0 halbmonat plus verify

See merge request vikingowl/marktvogt.de!13
2026-04-18 07:52:26 +00:00
cd836564f1 feat(discovery): Pass 0 halbmonat buckets + konfidenz/status + link verification
Pass 0 splits every month into two halves (H1 = days 1-15, H2 = 16-EOM)
so each agent call fits within Mistral's 4096 max_tokens budget. The
response schema picks up richer per-market signals and dead agent URLs
get filtered before they land in the admin queue.

DB:
- 000015: add halbmonat char(2) to discovery_buckets, widen unique key,
  backfill existing rows as H1 + insert H2 siblings (624 → 1248 rows).
- 000016: rename discovered_markets.extraktion → konfidenz with
  best-effort value mapping (verbatim→hoch, abgeleitet→mittel); add
  agent_status column.

Backend:
- model: Bucket gains Halbmonat; Pass0Bucket same. Pass0Market renames
  Extraktion → Konfidenz and adds AgentStatus (JSON tag "status").
  DiscoveredMarket mirrors both fields; queue-lifecycle Status column
  stays distinct from agent-reported AgentStatus.
- repository: all SELECT/INSERT touched to use the new columns; picker
  orders by year_month, halbmonat so H1 runs before H2 in the same
  month.
- agent client: prompt now injects halbmonat and recherche_datum (today)
  so the agent has explicit date context.
- link verification: new LinkChecker does concurrent HEAD (GET fallback
  on 405) with a 5s timeout. FilterURLs runs before InsertDiscovered —
  markets whose quellen all fail are dropped and counted as
  link_check_failed in TickSummary. Failing website URLs are cleared
  but don't block insert.
- Service.linkChecker is a narrow interface so tests inject a noop
  stub instead of hitting the network.

Web:
- DiscoveredMarket type gains konfidenz + agent_status, drops extraktion.
- Queue column renames "Extraktion" → "Konfidenz" with three-level
  coloring (hoch=emerald, mittel=amber, niedrig=red, else neutral).
- A small pill next to markt_name surfaces agent_status when it's not
  "bestaetigt" — red for "abgesagt", amber for "unklar" and
  "vorjahr_unbestaetigt" — so risky entries are obvious before accept.
2026-04-18 09:51:57 +02:00
1af97bda21 Merge branch 'feature/discovery-edit-and-sources' into 'main'
feat(discovery): edit pending entries + surface quellen links

See merge request vikingowl/marktvogt.de!12
2026-04-18 07:33:33 +00:00
bf72095348 feat(discovery): edit pending entries + surface quellen links
Expanding any row in the discovery queue now reveals:
- Quellen as clickable URLs (was just a count)
- Hinweis if the agent emitted one
- Inline edit form for markt_name, stadt, bundesland, start/end date,
  and website — the fields the Pass 0 agent gets wrong most often

Backend:
- PATCH /admin/discovery/queue/:id applies a partial update to pending
  entries via a COALESCE-based SQL update. Only fields that were set
  are written.
- Service recomputes name_normalized when markt_name or stadt change so
  dedup stays consistent after edits.
- Status check ensures only 'pending' entries are mutable.

Web:
- Row state $expandedId holds at most one open drawer at a time.
- Dates round-trip through <input type="date"> using the shared
  dateInputValue helper; form action converts back to RFC3339 for Go.
- Existing Accept/Reject buttons untouched — workflow is edit-then-accept.
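
The partial-update idea can be sketched like this; the SQL shape and the generic helper are illustrative (the log only says "COALESCE-based SQL update"), and the column list is trimmed:

```go
package main

import "fmt"

// Hypothetical shape of the COALESCE-based update: parameters bound
// from unset PATCH fields arrive as NULL and keep the stored value.
const patchQuery = `
UPDATE discovered_markets
SET markt_name = COALESCE($2, markt_name),
    stadt      = COALESCE($3, stadt)
WHERE id = $1 AND status = 'pending'`

// coalesce mirrors COALESCE($n, column) in Go, e.g. for recomputing
// name_normalized: a nil pointer means the field was absent from the
// PATCH body.
func coalesce[T any](patch *T, current T) T {
	if patch != nil {
		return *patch
	}
	return current
}

func main() {
	name := "Mittelaltermarkt Esslingen"
	var stadt *string // not sent in the PATCH body
	fmt.Println(coalesce(&name, "alt"), coalesce(stadt, "Esslingen"))
}
```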
2026-04-18 09:33:14 +02:00
a44005b694 Merge branch 'fix/discovery-rate-limit-and-polish' into 'main'
fix(discovery): defer rate-limited buckets + polish queue table

See merge request vikingowl/marktvogt.de!11
2026-04-18 07:21:24 +00:00
98eae40755 fix(discovery): defer rate-limited buckets + polish queue table
Rate limits (Mistral web_search 429) used to get counted as hard errors,
marking the bucket as queried and bumping the Errors(24h) strip — even
though the right behavior is to wait and try again later.

Backend:
- isRateLimit() matches "rate limit" / "status 429" in the error string.
- On persistent rate-limit after one 10s retry: leave last_queried_at
  unchanged (bucket stays eligible for next tick) and abort the
  remainder of this tick — Mistral's web_search budget is shared, no
  point hammering more buckets in the same batch.
- TickSummary gains rate_limited counter; Errors stays for real failures.

Frontend:
- Dates: RFC3339 → 'DD.MM.YYYY' German format, range rendered as
  'DD.MM.YYYY – DD.MM.YYYY'.
- Queue table: cell horizontal padding, uppercase compact headers,
  scrollable on narrow viewports, dark-mode variants on every color
  (emerald/amber badges, link color, reject button), Region folds
  bundesland||land into a single column (Land was always 'Deutschland'
  for DACH anyway).
2026-04-18 09:21:05 +02:00
e4ef4adad6 Merge branch 'fix/discovery-json-tags' into 'main'
fix(discovery): add json tags to domain types

See merge request vikingowl/marktvogt.de!10
2026-04-18 07:11:55 +00:00
b6ace52ada fix(discovery): add json tags to domain types
Without snake_case json tags, Go serializes fields as PascalCase (ID,
MarktName, etc.) — but the Svelte frontend reads snake_case. Every
row.id on the client was undefined, which made Svelte 5 see identical
'undefined' keys across the {#each queue as row (row.id)} loop and
throw each_key_duplicate.

Adds explicit snake_case tags to Bucket, DiscoveredMarket, and
RejectedDiscovery to match what the TypeScript types already expect.
2026-04-18 09:11:44 +02:00
8f1efe73f2 Merge branch 'fix/discovery-cron-service-port' into 'main'
fix(helm): CronJob curls the Service port, not the container port

See merge request vikingowl/marktvogt.de!9
2026-04-18 07:00:26 +00:00
5a561b3092 fix(helm): CronJob curls the Service port, not the container port
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, which isn't exposed by the Service — every tick
timed out after ~135s with "Could not connect to server".

Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
2026-04-18 09:00:16 +02:00
1252044d60 Merge branch 'fix/discovery-empty-card-dark' into 'main'
fix(web): dark-mode variants for discovery empty-queue card

See merge request vikingowl/marktvogt.de!8
2026-04-18 06:53:43 +00:00
173e7c5013 fix(web): dark-mode variants for discovery empty-queue card 2026-04-18 08:53:25 +02:00
ce75448f1e Merge branch 'fix/discovery-empty-slice-null' into 'main'
fix(discovery): render empty queue as [] not null (500 on empty prod)

See merge request vikingowl/marktvogt.de!7
2026-04-18 06:44:52 +00:00
14e1a36622 fix(discovery): render empty queue as [] not null (500 on empty prod)
Go's nil slice marshals as JSON null, not [], which crashed the Svelte
page's .length access on fresh installs where no discovery tick has
happened yet. Reproduced in production: /admin/discovery → 500 because
data.queue was null and {queue.length} dereferenced it.

Backend: initialize every returning slice in repository.go via
make([]T, 0) so zero rows serialize as [] consistently. Also applies to
PickStaleBuckets, ListSeriesByCity, and Stats.RecentErrors.

Web: coalesce data.queue / data.stats.recent_errors at the top of the
Svelte script with `?? []` so future nil-slice regressions don't take
the whole page down.
2026-04-18 08:44:17 +02:00