feat(security): close audit waves 1-4 (C1-C6, H1, H2, H4, H11, H13, H14, H16) #1

Merged
vikingowl merged 8 commits from feat/security-audit-waves-1-4-remediation into main 2026-05-01 00:02:10 +02:00
Owner

Implements the four-wave remediation plan from planning/19-security-audit-2026-04-30.md. All Critical findings (C1-C6) plus Wave 1-4 Highs (H1, H2, H4, H11, H13, H14, H16) closed in one branch. PoC security tests added per wave; full backend test suite green; helm chart lints clean; golangci-lint shows 0 issues.

Wave summary

Wave Findings closed Notes
1 C1, C2, H1, H2, H3 OAuth state nonce + email-verified gating; magic-link atomic consume; TOTP code replay guard; backup-code wipe on disable
2 C3, H4, H11, H16 CORS/CSRF regex anchored; ClientIP via APP_TRUSTED_PROXIES; body limit + DisallowUnknownFields; NetworkPolicy default-on + new web NP
3 C4, C5, M1 TOTP secret_v2 read/write; OAuth tokens encrypted via new migration 000033; HKDF domain separation per data class
4 C6, H13, H14 New pkg/safehttp blocks SSRF; promptguard German rules + NFKC + Cf-strip; ai.BudgetGate enforces AI_DAILY_CAP_USD pre-call

Test plan

  • go build ./... clean
  • go test ./... 18 packages pass; 8 new *_security_test.go files
  • go vet ./... clean
  • golangci-lint 0 issues
  • helm lint helm/marktvogt 0 failures
  • helm template helm/marktvogt renders both NetworkPolicies
  • Post-merge: run cmd/totp-encrypt to backfill any pre-migration TOTP rows; OAuth tokens self-rotate on next refresh
  • Post-merge: set APP_TRUSTED_PROXIES to ingress CIDR (currently empty = no proxy trust); set AI_DAILY_CAP_USD (currently 0 = no cap)

Notes

  • OAuth routes remain disabled in server/routes.go:60-61. C1/C2 fixes ensure correctness when re-enabled — nothing reachable on main is regressed.
  • Deferred Mediums (M2-M14) and Lows tracked in planning/19-security-audit-2026-04-30.md.
  • No frontend changes required: OAuth state already round-tripped through the IdP; magic-link verify is wire-compatible.
Implements the four-wave remediation plan from `planning/19-security-audit-2026-04-30.md`. All Critical findings (C1-C6) plus Wave 1-4 Highs (H1, H2, H4, H11, H13, H14, H16) closed in one branch. PoC security tests added per wave; full backend test suite green; helm chart lints clean; golangci-lint shows 0 issues. ## Wave summary | Wave | Findings closed | Notes | |------|-----------------|-------| | 1 | C1, C2, H1, H2, H3 | OAuth state nonce + email-verified gating; magic-link atomic consume; TOTP code replay guard; backup-code wipe on disable | | 2 | C3, H4, H11, H16 | CORS/CSRF regex anchored; ClientIP via APP_TRUSTED_PROXIES; body limit + DisallowUnknownFields; NetworkPolicy default-on + new web NP | | 3 | C4, C5, M1 | TOTP secret_v2 read/write; OAuth tokens encrypted via new migration 000033; HKDF domain separation per data class | | 4 | C6, H13, H14 | New `pkg/safehttp` blocks SSRF; promptguard German rules + NFKC + Cf-strip; `ai.BudgetGate` enforces `AI_DAILY_CAP_USD` pre-call | ## Test plan - [x] `go build ./...` clean - [x] `go test ./...` 18 packages pass; 8 new `*_security_test.go` files - [x] `go vet ./...` clean - [x] `golangci-lint` 0 issues - [x] `helm lint helm/marktvogt` 0 failures - [x] `helm template helm/marktvogt` renders both NetworkPolicies - [ ] Post-merge: run `cmd/totp-encrypt` to backfill any pre-migration TOTP rows; OAuth tokens self-rotate on next refresh - [ ] Post-merge: set `APP_TRUSTED_PROXIES` to ingress CIDR (currently empty = no proxy trust); set `AI_DAILY_CAP_USD` (currently 0 = no cap) ## Notes - OAuth routes remain disabled in `server/routes.go:60-61`. C1/C2 fixes ensure correctness when re-enabled — nothing reachable on `main` is regressed. - Deferred Mediums (M2-M14) and Lows tracked in `planning/19-security-audit-2026-04-30.md`. - No frontend changes required: OAuth state already round-tripped through the IdP; magic-link verify is wire-compatible.
vikingowl added 8 commits 2026-05-01 00:01:46 +02:00
Replace fmt.Printf to stderr with slog.Warn so cache-degradation events
are captured in Loki and queryable.

Audit finding M2.
New pkg/promptguard.Sanitize strips known structural injection patterns
(role labels, override directives, chat-template tokens, llama tokens,
prompt-exfil) from third-party scraped content before it reaches Gemini.

Wired into both LLM call sites:
- discovery/enrich.ProviderLLMEnricher.EnrichMissing (per-source quellen)
- market/research.buildUserPrompt (quellePage title + text)

Defense-in-depth on top of existing structural framing (JSON envelope in
research, JSON-Schema constrained decoding in enrich_b).

Audit finding H2.
VPA was added to per-service charts (backend/deploy/helm, web/deploy/helm)
on 2026-04-20 but lost when those charts were deleted in the 2026-04-28
monolithic chart migration. The orphan branch gitlab/feat/helm-vpa-off-mode
never made it into helm/marktvogt/.

Restores VPA gated under <svc>.vpa.enabled (default false), updateMode
"Off" so the recommender observes without eviction. Activate via:

  helm upgrade --reuse-values --set backend.vpa.enabled=true \
                                    --set web.vpa.enabled=true

After ~1 week of recommender data, decide: tune resources.requests
manually, or flip updateMode to "Auto" (requires PDB + replicaCount>=2).
When flipping to Auto with HPA on CPU, drop "cpu" from
controlledResources to avoid the HPA+VPA-on-same-metric anti-pattern.

Audit finding M1.
First security audit of the marktvogt backend. Covers custom auth,
sessions, OAuth, TOTP+backup codes, magic links, admin endpoints,
LLM pipeline, and AI-cost endpoints.

Findings:
- H1: revoked sessions stay valid in valkey cache (FIXED in this branch)
- H2: prompt-injection via aggregator scrapes (MITIGATED via promptguard)
- H3: no threat-model artefact existed (RESOLVED by this doc)
- M1: VPA lost in monolithic-chart migration (FIXED in this branch)
- M2: fmt.Printf used for valkey-cache failure (FIXED in this branch)
- M3: AI per-day cost cap is logged-only, not enforcing (OPEN)
- M4: replicaCount=1 + PDB disabled for backend+web (OPEN)
- M5: no body-tampering tests on admin discovery endpoints (OPEN)
- L1: panic in startup paths (acceptable, documented)
- I1: Stripe pre-implementation guards (verify-first, idempotency)

Audit finding H3.
RevokeSession, RevokeSessionsByFamilyID, DeleteUserSessions,
RevokeOtherSessions, and ConsumeRefreshToken updated revoked_at in
Postgres but did not invalidate the valkey access-token cache. The cache
serves the original Session JSON (RevokedAt: null) until its TTL expires
(JWT_ACCESS_TTL = 2h), so logout / admin-revoke / refresh-reuse-detection
took up to 2h to actually invalidate.

Fix: each revocation path now uses RETURNING access_token_hash and DELs
the cache key via new helper invalidateCachedSessions. revokeBulk handles
multi-row revocations.

Adds three router-level negative tests for the admin auth chain
(RequireAuth + RequireRole("admin")):
- TestAdminChain_UserRole_Returns403 — user role rejected with 403
- TestAdminChain_AdminRole_Passes — admin role accepted
- TestAdminChain_NoBearerToken_Returns401 — missing token rejected with
  401 (auth runs before role check)

Repository-level regression test for the cache invalidation requires
real Valkey + Postgres, currently not in test harness — flagged as TODO
in planning/18-security-threat-model.md.

Audit findings H1, E (negative tests for session validation, authz).
Replaces the Python pre-commit framework (.pre-commit-config.yaml) with
husky 9, kept faithful to the existing checks:

- Block direct commits to main (was: no-commit-to-branch).
- git diff --cached --check covers trailing-whitespace and merge-conflict
  marker detection.
- Custom large-file check (>500KB), excluding crawler test fixtures.
- Backend Go: gofmt -l (fail on diff), go vet, golangci-lint — only when
  backend/*.go is staged.
- Backend deps: go mod tidy -diff — only when go.mod/go.sum is staged.
- Web: prettier --check, eslint, svelte-check — only when web/ is
  staged.

lint-staged was intentionally not adopted — the previous config ran
hooks tree-wide (pass_filenames: false), so per-file optimisation would
be a behaviour change.

Install: pnpm install at repo root (the prepare script wires husky into
.git/hooks via core.hooksPath=.husky/_).
Implements the remediation pass described in
planning/19-security-audit-2026-04-30.md. All Critical findings and the
Wave 1-4 High findings are closed; PoC tests added; full backend test
suite green; helm chart lints clean.

Wave 1 - Auth & identity
- C1 OAuth state nonce: PutOAuthState / ConsumeOAuthState (valkey,
  GETDEL single-use, 15min TTL); Callback rejects missing/forged/cross-
  provider state before token exchange.
- C2 OAuth identity linking: refuse silent linking to existing user
  unless info.EmailVerified is true. fetchGitHubUser now consults the
  /user/emails endpoint for the verified flag (no more hardcoded true);
  fetchFacebookUser sets EmailVerified=false (FB exposes no per-email
  verification flag).
- H1 Magic-link verify: replaced Get + MarkUsed with a single atomic
  UPDATE...RETURNING (ConsumeMagicLink) - TOCTOU-free.
- H2 TOTP code replay: MarkTOTPCodeConsumed (valkey SET NX, 120s TTL)
  prevents replay of a successfully validated code; fails closed on
  transient store errors.
- H3 Backup-code orphan: DisableTOTP now also wipes totp_backup_codes.

Wave 2 - Middleware & network
- C3 CORS/CSRF regex anchoring: NewCORSConfig wraps each pattern with
  \A...\z so substring spoofing of origins is impossible.
- H4 ClientIP: server reads APP_TRUSTED_PROXIES; gin SetTrustedProxies
  is called explicitly (empty default = no proxy trust).
- H11 Body limit + DisallowUnknownFields: BodyLimitBytes middleware
  (1 MiB default) wraps every request; validate.BindJSON now uses a
  json.Decoder with DisallowUnknownFields and rejects trailing tokens;
  413 envelope on body-limit overflow.
- H16 NetworkPolicy: backend.networkPolicy.enabled defaults to true;
  new web-networkpolicy.yaml restricts web pod ingress to nginx-gateway
  and egress to backend service + DNS + 443.

Wave 3 - Encryption at rest
- C4 TOTP secrets: CreateTOTPSecret writes encrypted secret_v2;
  GetTOTPSecret prefers v2 with legacy fallback.
- C5 OAuth tokens: migration 000033 adds *_v2 columns; CreateOAuthAccount
  and UpdateOAuthTokens write encrypted; GetOAuthAccount reads v2 with
  legacy fallback.
- M1 Domain separation: crypto.DeriveKeyFor(secret, purpose) replaces
  single-purpose DeriveKey; settings, totp, oauth each use a distinct
  HKDF-derived subkey. DeriveKey kept as back-compat alias for settings.

Wave 4 - Input & AI safety
- C6 SSRF: new pkg/safehttp refuses to dial RFC1918, loopback, link-
  local, ULA, multicast, unspecified, or cloud-metadata IPs; scheme
  allowlist (http/https). Wired into pkg/scrape, discovery LinkChecker,
  and imageURLReachable. NewForTesting opt-in for httptest.
- H13 PromptGuard German + Unicode: NFKC + Cf-class strip pre-pass
  closes zero-width and full-width-homoglyph bypasses; new German rules
  for ignoriere/missachte/vergiss/role-escalation/prompt-exfil/verbatim;
  Gemma-style and pipe-delimited chat-template tokens covered;
  source-fence rule prevents '=== Quelle:' splice in scraped text.
- H14 BudgetGate: new ai.BudgetGate interface; UsageRepo.CheckBudget
  reads today's SUM(estimated_cost_usd) (10s cache) and refuses calls
  when AI_DAILY_CAP_USD is exceeded; GeminiProvider.Chat checks the
  gate before contacting Gemini.

OAuth routes remain disabled in server/routes.go, so C1/C2 are not
actively reachable today; fixes ensure correctness when re-enabled.
docs(security): record waves 1-4 remediation status in audit report
All checks were successful
ci/someci/push/backend Pipeline was successful
ci/someci/pr/backend Pipeline was successful
0d5788c951
vikingowl merged commit c3bd1c33e3 into main 2026-05-01 00:02:10 +02:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: vikingowl/marktvogt.de#1