Implements the remediation pass described in
planning/19-security-audit-2026-04-30.md. All Critical findings and the
Wave 1-4 High findings are closed; PoC tests added; full backend test
suite green; helm chart lints clean.
Wave 1 - Auth & identity
- C1 OAuth state nonce: PutOAuthState / ConsumeOAuthState (valkey,
GETDEL single-use, 15min TTL); Callback rejects missing/forged/cross-
provider state before token exchange.
- C2 OAuth identity linking: refuse silent linking to existing user
unless info.EmailVerified is true. fetchGitHubUser now consults the
/user/emails endpoint for the verified flag (no more hardcoded true);
fetchFacebookUser sets EmailVerified=false (FB exposes no per-email
verification flag).
- H1 Magic-link verify: replaced Get + MarkUsed with a single atomic
UPDATE...RETURNING (ConsumeMagicLink) - TOCTOU-free.
- H2 TOTP code replay: MarkTOTPCodeConsumed (valkey SET NX, 120s TTL)
prevents replay of a successfully validated code; fails closed on
transient store errors.
- H3 Backup-code orphan: DisableTOTP now also wipes totp_backup_codes.
Wave 2 - Middleware & network
- C3 CORS/CSRF regex anchoring: NewCORSConfig wraps each pattern with
\A...\z so substring spoofing of origins is impossible.
- H4 ClientIP: server reads APP_TRUSTED_PROXIES; gin SetTrustedProxies
is called explicitly (empty default = no proxy trust).
- H11 Body limit + DisallowUnknownFields: BodyLimitBytes middleware
(1 MiB default) wraps every request; validate.BindJSON now uses a
json.Decoder with DisallowUnknownFields and rejects trailing tokens;
413 envelope on body-limit overflow.
- H16 NetworkPolicy: backend.networkPolicy.enabled defaults to true;
new web-networkpolicy.yaml restricts web pod ingress to nginx-gateway
and egress to backend service + DNS + 443.
Wave 3 - Encryption at rest
- C4 TOTP secrets: CreateTOTPSecret writes encrypted secret_v2;
GetTOTPSecret prefers v2 with legacy fallback.
- C5 OAuth tokens: migration 000033 adds *_v2 columns; CreateOAuthAccount
and UpdateOAuthTokens write encrypted; GetOAuthAccount reads v2 with
legacy fallback.
- M1 Domain separation: crypto.DeriveKeyFor(secret, purpose) replaces
single-purpose DeriveKey; settings, totp, oauth each use a distinct
HKDF-derived subkey. DeriveKey kept as back-compat alias for settings.
Wave 4 - Input & AI safety
- C6 SSRF: new pkg/safehttp refuses to dial RFC1918, loopback, link-
local, ULA, multicast, unspecified, or cloud-metadata IPs; scheme
allowlist (http/https). Wired into pkg/scrape, discovery LinkChecker,
and imageURLReachable. NewForTesting opt-in for httptest.
- H13 PromptGuard German + Unicode: NFKC + Cf-class strip pre-pass
closes zero-width and full-width-homoglyph bypasses; new German rules
for ignoriere/missachte/vergiss/role-escalation/prompt-exfil/verbatim;
Gemma-style and pipe-delimited chat-template tokens covered;
source-fence rule prevents '=== Quelle:' splice in scraped text.
- H14 BudgetGate: new ai.BudgetGate interface; UsageRepo.CheckBudget
reads today's SUM(estimated_cost_usd) (10s cache) and refuses calls
when AI_DAILY_CAP_USD is exceeded; GeminiProvider.Chat checks the
gate before contacting Gemini.
OAuth routes remain disabled in server/routes.go, so C1/C2 are not
actively reachable today; fixes ensure correctness when re-enabled.
Replaces the Python pre-commit framework (.pre-commit-config.yaml) with
husky 9, kept faithful to the existing checks:
- Block direct commits to main (was: no-commit-to-branch).
- git diff --cached --check covers trailing-whitespace and merge-conflict
marker detection.
- Custom large-file check (>500KB), excluding crawler test fixtures.
- Backend Go: gofmt -l (fail on diff), go vet, golangci-lint — only when
backend/*.go is staged.
- Backend deps: go mod tidy -diff — only when go.mod/go.sum is staged.
- Web: prettier --check, eslint, svelte-check — only when web/ is
staged.
lint-staged was intentionally not adopted — the previous config ran
hooks tree-wide (pass_filenames: false), so per-file optimisation would
be a behaviour change.
Install: pnpm install at repo root (the prepare script wires husky into
.git/hooks via core.hooksPath=.husky/_).
RevokeSession, RevokeSessionsByFamilyID, DeleteUserSessions,
RevokeOtherSessions, and ConsumeRefreshToken updated revoked_at in
Postgres but did not invalidate the valkey access-token cache. The cache
serves the original Session JSON (RevokedAt: null) until its TTL expires
(JWT_ACCESS_TTL = 2h), so logout / admin-revoke / refresh-reuse-detection
took up to 2h to actually invalidate.
Fix: each revocation path now uses RETURNING access_token_hash and DELs
the cache key via new helper invalidateCachedSessions. revokeBulk handles
multi-row revocations.
Adds three router-level negative tests for the admin auth chain
(RequireAuth + RequireRole("admin")):
- TestAdminChain_UserRole_Returns403 — user role rejected with 403
- TestAdminChain_AdminRole_Passes — admin role accepted
- TestAdminChain_NoBearerToken_Returns401 — missing token rejected with
401 (auth runs before role check)
Repository-level regression test for the cache invalidation requires
real Valkey + Postgres, currently not in test harness — flagged as TODO
in planning/18-security-threat-model.md.
Audit findings H1, E (negative tests for session validation, authz).
First security audit of the marktvogt backend. Covers custom auth,
sessions, OAuth, TOTP+backup codes, magic links, admin endpoints,
LLM pipeline, and AI-cost endpoints.
Findings:
- H1: revoked sessions stay valid in valkey cache (FIXED in this branch)
- H2: prompt-injection via aggregator scrapes (MITIGATED via promptguard)
- H3: no threat-model artefact existed (RESOLVED by this doc)
- M1: VPA lost in monolithic-chart migration (FIXED in this branch)
- M2: fmt.Printf used for valkey-cache failure (FIXED in this branch)
- M3: AI per-day cost cap is logged-only, not enforcing (OPEN)
- M4: replicaCount=1 + PDB disabled for backend+web (OPEN)
- M5: no body-tampering tests on admin discovery endpoints (OPEN)
- L1: panic in startup paths (acceptable, documented)
- I1: Stripe pre-implementation guards (verify-first, idempotency)
Audit finding H3.
VPA was added to per-service charts (backend/deploy/helm, web/deploy/helm)
on 2026-04-20 but lost when those charts were deleted in the 2026-04-28
monolithic chart migration. The orphan branch gitlab/feat/helm-vpa-off-mode
never made it into helm/marktvogt/.
Restores VPA gated under <svc>.vpa.enabled (default false), updateMode
"Off" so the recommender observes without eviction. Activate via:
helm upgrade --reuse-values --set backend.vpa.enabled=true \
--set web.vpa.enabled=true
After ~1 week of recommender data, decide: tune resources.requests
manually, or flip updateMode to "Auto" (requires PDB + replicaCount>=2).
When flipping to Auto with HPA on CPU, drop "cpu" from
controlledResources to avoid the HPA+VPA-on-same-metric anti-pattern.
Audit finding M1.
New pkg/promptguard.Sanitize strips known structural injection patterns
(role labels, override directives, chat-template tokens, llama tokens,
prompt-exfil) from third-party scraped content before it reaches Gemini.
Wired into both LLM call sites:
- discovery/enrich.ProviderLLMEnricher.EnrichMissing (per-source quellen)
- market/research.buildUserPrompt (quellePage title + text)
Defense-in-depth on top of existing structural framing (JSON envelope in
research, JSON-Schema constrained decoding in enrich_b).
Audit finding H2.
Sweep of server action error strings — eight ASCII fallbacks replaced with
ä/ö/ü/ß across three +page.server.ts files: Pruefung -> Prüfung,
bestaetige -> bestätige, waehle -> wähle, fuelle -> fülle, Loeschen ->
Löschen, Statusaenderung -> Statusänderung, Ungueltiger -> Ungültiger.
Discovery agent_status enum literals ('bestaetigt', 'unklar', etc.) are
intentionally left as ASCII — they must match the LLM schema constants on
the backend.
Replace ASCII fallbacks (vollstaendig, uebermittelt, geprueft, Schliessen,
fuer, Rueckfragen, Datenschutzerklaerung) with proper German characters.
The ASCII-only convention applies to planning docs, not user-facing UI.
Native <dialog>.showModal() relies on the user-agent's margin: auto to
center the modal, but Tailwind 4's preflight resets margin to 0 on every
element, leaving the dialog pinned to the top-left edge of the viewport.
Add m-auto to the dialog class to restore the intended centering. Only one
dialog in the app, so a scoped class fix is sufficient — no global override
needed.
Two follow-ups to the apply-shape fix:
opening_hours: previous converter only kept datum_von's weekday, dropping
the rest of multi-day ranges. A Sat-Sun event ended up with only Samstag
saved. Now iterates [datum_von, datum_bis] inclusive, projects each date
to its German weekday, and dedupes by (weekday, open, close). Different
hours on the same weekday still produce separate rows so the admin sees
the conflict. Capped at 7 distinct weekdays — the form's expressive limit.
website validator: the LLM was emitting aggregator/listing URLs (e.g.
suendenfrei.tv detail pages) as the official event website because that's
where it found grounding info. The prompt already forbade this, but the
model ignores soft rules. Add a hard validator-level rejection for known
aggregator domains: suendenfrei.tv, mittelalterkalender.info,
marktkalendarium.de, festival-alarm.com, mittelaltermarkt.online. These
suggestions now land in the "rejected" bucket with a clear reason.
image_url / logo_url are unaffected (those legitimately come from any host).
Prompt: enumerate the same aggregator domains explicitly under both the
"primary source" fallback list and the "website forbidden" rule, so the
model has a concrete blacklist instead of a category description.
Existing markets with already-saved aggregator websites need manual
clearing — the validator only kicks in on new applies.
Tests: 4 new opening-hours subtests (range expansion, weekday dedupe,
cross-entry dedupe, hours-conflict-kept-separate) and 4 validator subtests
(three aggregator domains rejected as website; aggregator host as image_url
still ok).
When apply moved server-side (9b30863), the client-side conversion from
dd9a5ae was lost. opening_hours and admission_info were stored raw in the
LLM's German shape ([{datum_von,...}], [{betrag,name,waehrung}]), which the
form's reactive bindings could not parse — fields appeared empty after
Uebernehmen + reload.
applyFieldMerge now routes both fields through dedicated converters:
- admission_info: smart name-prefix mapping (Erwachsen*->adult_cents,
Kind*/Child*->child_cents, Ermaessigt*/Schueler*/Senior*/Rentner*/Reduced
->reduced_cents). Unmapped ticket names append to notes so admins still
see the extracted info. Already-form-shape input (map with adult_cents)
passes through unchanged.
- opening_hours: each LLM entry's datum_von is parsed as ISO date and
converted to a German weekday name. Entries with unparseable datum_von
are dropped (better than writing rows the form's day select cannot bind).
Already-form-shape input ([{day,...}]) passes through unchanged.
Forward-only: markets where a previously broken apply already wrote
LLM-shape JSON need a re-apply (or manual edit) to render correctly.
Tests: 8 new TestApplyFieldMerge_* subtests covering smart mapping, notes
fallback, weekday derivation, malformed dates, and pass-through.
Pre-flight for ArgoCD enable: ensure chart renders match cluster state so
the first sync is a no-op rather than rolling deployments back to :latest.
CI continues to bump via --set-string on each push.
depends_on: [backend] in web.yaml blocked web from triggering on
single-subtree web commits (backend's path filter excluded it from the
pipeline, so web's dependency was never satisfied; web never ran).
Switching to retry loop: 3 attempts with 30s backoff. When a cross-subtree
commit triggers both pipelines, the loser of the helm release-lock race
sleeps and retries — works whether or not the other workflow ran.
helm 4.1 in alpine/helm:4.1 errored 'release: already exists' on the latter
flag despite the release being deployed. --reuse-values is the older,
universally-supported variant — preserves the other service's image tag
between pipeline runs the same way.
CI deploy steps now target helm/marktvogt with --reset-then-reuse-values,
preserving the other service's image tag across pipeline runs. Each pipeline
sets only its own X.image.tag.
App-level secrets (smtp/turnstile/discovery/ai/JWT/oauth) moved out of CI's
--set chain in the previous phase — now pre-created via
scripts/k8s-secrets-sync.sh from .env.helm. The chart's conditional secret
templates remain for backward-compat with the live release's stored values
but will be removed in a follow-up once those values are cleared.
Old per-service chart directories deleted; only the monolithic
helm/marktvogt/ remains.
MIGRATION.md updated with the actual procedure that worked, including the
several pitfalls hit during the live tenant-2 migration on 2026-04-28
(helm uninstall trap, SSA field-manager swap for CRDs, kyverno hostname
allowlist for new subdomains).
New unified helm chart at helm/marktvogt/ that combines backend (Go API,
Postgres, Dragonfly, migrate hook, discovery cron) and web (SvelteKit SSR)
into a single release. Replaces the per-service charts at backend/deploy/helm
and web/deploy/helm — kept in place until the live migration is verified
(see helm/marktvogt/MIGRATION.md).
Selector labels and resource names match the existing per-service charts
exactly so migration is by re-annotation rather than recreate; CNPG cluster
and Dragonfly survive the cutover with no data loss.
Adds scripts/k8s-secrets-sync.sh + .env.helm.example for reproducible
out-of-band secret creation. .env.helm itself is gitignored.
Nil slices in MergePlan.AutoApply/ReviewRequired/Rejected serialized to
JSON null, causing the admin research panel to crash with
"can't access property 'map', plan.review_required is null". Initialize
the buckets as empty slices so the wire contract is always an array.
Tightened the empty-buckets test to assert the JSON shape.
Adds the native (Go) TypeScript compiler as a devDep and routes
svelte-check through it via --tsgo. Local pnpm run check goes from
~5s to ~3s on this codebase; pre-commit hook inherits the speed
automatically.
The linux-x64 prebuilt is a statically-linked Go binary (~25MB), so
the alpine builder in web/Dockerfile installs it cleanly even though
it never invokes svelte-check during the image build.
Re-prices every existing ai_usage row using the correct $/1M token rates
per model family. CASE clauses ordered specific-first (flash-lite before
flash) to mirror the longest-prefix-match in priceFor(). Aliases
(gemini-*-latest) resolve to the 2.5 family, the only one in production
during the affected window.
The grounding-fee component ($35/1k above 1500/day free tier) is not
recomputed: historical traffic shows zero grounded calls in the window,
so the bumper would be 0. Down is a no-op (irreversible by design — the
original miscalculated values are not preserved).
estimateCost ignored the model name and billed every Gemini call at
hardcoded flash-lite rates ($0.10 / $0.40 per 1M), under-counting Pro
calls by ~12-25x. Switch to priceFor(model) and prefer resp.ModelVersion
so aliases like gemini-pro-latest resolve to their concrete family.
Capture ThoughtsTokenCount as a separate ThinkingTokens column on
ai_usage (migration 000030) and bill it at the output rate.
Add a global thinking on/off toggle that mirrors the grounding pattern:
provider holds an in-memory cache (read at startup from settings.Store),
handler keeps it in sync, Chat() applies ThinkingConfig.ThinkingBudget=0
only when disabled. Default true preserves SDK behavior. Grounding+
thinking get/set helpers folded into shared getBool/setBool to keep
goconst happy.
Web admin settings: new "Modell-Reasoning" toggle card; usage panel sums
include thinking tokens. Types are optional with `?? 0` defaults so a
brief web-before-backend rollout window cannot render NaN.
The original sessions table has expires_at TIMESTAMPTZ NOT NULL with no
default. Migration 000027 added the new columns but did not drop this one,
so CreateSession must still supply a value. Using AbsoluteExpiresAt.
Replace HS256 JWT access tokens with two opaque 32-byte random tokens
(access + refresh), both stored as SHA-256 hashes in sessions + Valkey.
Key changes:
- GenerateOpaqueToken() replaces JWT issuance; TokenService removed
- Sessions now carry access_token_hash, refresh_token_hash, family_id,
parent_session_id, access_expires_at, absolute_expires_at, last_used_at,
revoked_at — per migration 000027 (updated to add access_expires_at)
- Refresh rotation is atomic (UPDATE...RETURNING); reuse detection kills
the entire token family and returns auth.refresh_reuse_detected
- RequireAuth/OptionalAuth now take SessionLookup (Valkey→Postgres) instead
of *TokenService; sets session_id in context alongside user_id
- last_used_at is bumped on each request, throttled to writes >60s old
- AuthConfig{AccessTTL,RefreshIdleTTL,RefreshAbsoluteTTL} replaces JWT TTL env
vars (AUTH_ACCESS_TTL=30m, AUTH_REFRESH_IDLE_TTL=168h, AUTH_REFRESH_ABSOLUTE_TTL=720h)
- JWT_SECRET kept for AI-settings key derivation (drops from auth flow)
Forced logout on deploy (D3 behaviour); pre-launch so acceptable.
structuredClone on a Svelte 5 reactive Proxy throws DataCloneError during
component init, causing MergeProposalPanel to silently fail to mount.
Replace with \$state.snapshot which is the documented way to deep-copy a
reactive prop into a local editable state.
Frontend budget was 180s — equal to the backend goroutine cap — so a race
determined which side timed out first. Bumped to 270s to guarantee the frontend
outlasts the backend's 3-minute window.
Added explicit null guard on result.proposal: if the LLM ever returns a
done-status without a proposal body the UI now surfaces a clear error instead
of silently assigning undefined (which kept the panel hidden with no feedback).
Also guards field_merges ?? {} in MergeProposalPanel to avoid Object.keys(null)
if the model returns a null map.