CI deploy steps now target helm/marktvogt with --reset-then-reuse-values,
preserving the other service's image tag across pipeline runs. Each pipeline
sets only its own X.image.tag.
App-level secrets (smtp/turnstile/discovery/ai/JWT/oauth) moved out of CI's
--set chain in the previous phase; they are now pre-created via
scripts/k8s-secrets-sync.sh from .env.helm. The chart's conditional secret
templates remain for backward-compat with the live release's stored values
but will be removed in a follow-up once those values are cleared.
Old per-service chart directories deleted; only the monolithic
helm/marktvogt/ remains.
MIGRATION.md updated with the actual procedure that worked, including
several pitfalls hit during the live tenant-2 migration on 2026-04-28
(helm uninstall trap, SSA field-manager swap for CRDs, kyverno hostname
allowlist for new subdomains).
Merge-plan and research-plan both call Gemini, which can take >60s; the
default gateway timeout was killing those connections with a 504.
- Web HTTPRoute: add /admin/ rule with 120s request+backendRequest timeout
- Backend HTTPRoute: add /api/v1/admin/markets/ rule with 120s timeout
- MergePlan handler: add 110s context deadline for graceful degradation
before the gateway cuts the upstream connection
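A minimal sketch of that handler-side deadline, assuming a plain net/http
handler (handler and service names here are hypothetical): the 110s budget
sits deliberately under the route's 120s so the handler can emit a real
error body before the gateway severs the stream.

```go
package admin

import (
	"context"
	"encoding/json"
	"errors"
	"net/http"
	"time"
)

// Planner stands in for the real merge-plan service (hypothetical name).
type Planner interface {
	MergePlan(ctx context.Context, marketID string) (any, error)
}

type Handler struct{ planner Planner }

func (h *Handler) MergePlan(w http.ResponseWriter, r *http.Request) {
	// 110s < the route's 120s timeout: our deadline fires first, so the
	// client gets a JSON error instead of a cut connection.
	ctx, cancel := context.WithTimeout(r.Context(), 110*time.Second)
	defer cancel()

	plan, err := h.planner.MergePlan(ctx, r.PathValue("id")) // Go 1.22+ routing
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		http.Error(w, "merge-plan timed out, try again", http.StatusGatewayTimeout)
	case err != nil:
		http.Error(w, err.Error(), http.StatusInternalServerError)
	default:
		_ = json.NewEncoder(w).Encode(plan)
	}
}
```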
Replace the Mistral + Ollama AI stack with a single Google Gemini provider
backed by google.golang.org/genai. API key moves from env/Helm to the DB
(AES-256-GCM, key derived from JWT_SECRET via HKDF) so it can be rotated
via the admin UI without a pod restart.
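A sketch of what that at-rest scheme can look like with the standard
library plus golang.org/x/crypto/hkdf; the function names and the HKDF info
label are illustrative, not the actual pkg/crypto/secretbox API.

```go
package secretbox

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"io"

	"golang.org/x/crypto/hkdf"
)

// DeriveKey stretches JWT_SECRET into a dedicated 32-byte AES-256 key via
// HKDF-SHA256 so the signing secret is never used directly as a cipher key.
func DeriveKey(jwtSecret []byte) ([]byte, error) {
	key := make([]byte, 32)
	kdf := hkdf.New(sha256.New, jwtSecret, nil, []byte("ai-api-key-v1")) // label is illustrative
	if _, err := io.ReadFull(kdf, key); err != nil {
		return nil, err
	}
	return key, nil
}

// Seal encrypts plaintext with AES-256-GCM and prepends the random nonce.
func Seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Open splits off the nonce and decrypts; a wrong key or tampered
// ciphertext fails GCM authentication and returns an error.
func Open(key, box []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(box) < gcm.NonceSize() {
		return nil, errors.New("secretbox: ciphertext too short")
	}
	return gcm.Open(nil, box[:gcm.NonceSize()], box[gcm.NonceSize():], nil)
}
```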
New:
- pkg/crypto/secretbox — AES-256-GCM encrypt/decrypt for secrets at rest
- pkg/ai/gemini — GeminiProvider with grounding, structured output, usage
recording, and hot-reload (Reinitialize swaps the client under a mutex;
sketched after this list)
- pkg/ai/usage — UsageRecorder interface + UsageEvent struct
- domain/settings/store — DB-backed settings (model, grounding toggle, key)
- domain/settings/usage — UsageRepo implementing UsageRecorder; ai_usage table
- migrations 000021 (system_settings) + 000022 (ai_usage)
- settings API: GET /ai, POST /ai/key, POST /ai/model, POST /ai/grounding,
GET /ai/usage
- admin UI: 4-card settings page — provider status, model selector, grounding
toggle with quota, usage rollups + recent-calls table
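The hot-reload bullet above references this sketch: a client swap under an
RWMutex, assuming the google.golang.org/genai constructor; every name
beyond Reinitialize is a guess at the real GeminiProvider.

```go
package gemini

import (
	"context"
	"sync"

	"google.golang.org/genai"
)

// Provider approximates GeminiProvider's hot-reload: Reinitialize builds a
// client for the freshly decrypted key and swaps it in under a write lock,
// so a key rotated in the admin UI takes effect without a pod restart.
type Provider struct {
	mu     sync.RWMutex
	client *genai.Client
	model  string
}

func (p *Provider) Reinitialize(ctx context.Context, apiKey, model string) error {
	c, err := genai.NewClient(ctx, &genai.ClientConfig{APIKey: apiKey})
	if err != nil {
		return err
	}
	p.mu.Lock()
	p.client, p.model = c, model
	p.mu.Unlock()
	return nil
}

// snapshot pins one client/model pair for the duration of a single call;
// in-flight requests keep the old client, new ones see the rotated key.
func (p *Provider) snapshot() (*genai.Client, string) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.client, p.model
}
```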
Removed:
- pkg/ai/ollama, mistral_provider, ratelimiter (+ tests)
- Helm AI_API_KEY, AI_PROVIDER, AI_MODEL_COMPLEX, AI_AGENT_DISCOVERY,
AI_RATE_LIMIT_RPS env vars
Call sites set Grounded+CallType: research (true/"research"), enrich Pass B
(true/"enrich_b"), similarity (false/"similarity"). Integration test updated
to use a stub ai.Provider instead of a fake Ollama HTTP server.
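A plausible shape for that stub, assuming the ai.Provider contract roughly
mirrors the Grounded/CallType fields mentioned above (the real types likely
carry more):

```go
package ai_test

import (
	"context"
	"fmt"
)

// Request and Response approximate the ai package's types; only the fields
// named in this log (Grounded, CallType) are taken from it.
type Request struct {
	Prompt   string
	Grounded bool
	CallType string
}

type Response struct{ Text string }

type Provider interface {
	Complete(ctx context.Context, req Request) (Response, error)
}

// stubProvider returns canned responses keyed by CallType, so the
// integration test exercises the pipeline without any HTTP server.
type stubProvider struct{ responses map[string]Response }

func (s *stubProvider) Complete(_ context.Context, req Request) (Response, error) {
	if resp, ok := s.responses[req.CallType]; ok {
		return resp, nil
	}
	return Response{}, fmt.Errorf("stub: unexpected call type %q", req.CallType)
}
```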
Adds pkg/search (SearxNG impl), domain/market/research (orchestrator + embedded
German prompt and JSON schema), and reinstates POST /markets/:id/research on
top of the new pipeline. Seeds URLs from crawler provenance; falls back to
search when fewer than two distinct seed domains are known.
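The seeding rule might look like this sketch; Searcher, Market, and the
orchestrator wiring are hypothetical, and only the "fewer than two distinct
domains" threshold comes from the change itself:

```go
package research

import (
	"context"
	"net/url"
)

// Searcher abstracts pkg/search's SearxNG client (shape assumed).
type Searcher interface {
	Query(ctx context.Context, q string) ([]string, error)
}

type Market struct {
	Name, City     string
	ProvenanceURLs []string
}

type Orchestrator struct{ search Searcher }

// distinctDomains counts unique hostnames among crawler-provenance URLs.
func distinctDomains(seeds []string) int {
	hosts := map[string]struct{}{}
	for _, s := range seeds {
		if u, err := url.Parse(s); err == nil && u.Hostname() != "" {
			hosts[u.Hostname()] = struct{}{}
		}
	}
	return len(hosts)
}

// seedURLs prefers provenance and falls back to web search when fewer than
// two distinct seed domains are known.
func (o *Orchestrator) seedURLs(ctx context.Context, m Market) ([]string, error) {
	if distinctDomains(m.ProvenanceURLs) >= 2 {
		return m.ProvenanceURLs, nil
	}
	return o.search.Query(ctx, m.Name+" "+m.City)
}
```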
- HTTPRoute: add 300s request+backendRequest timeout rule for
/api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
60s default was cutting the connection mid-crawl.
- Service.Crawl: detach the insert pipeline from the HTTP request context
with a 3-minute internal timeout (sketched below, after this list).
Previously a canceled request ctx cascaded into the link-verifier, failing
every URL check and counting every merged event as LinkCheckFailed.
Inserts now complete even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
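The detach in the Service.Crawl bullet can be sketched with
context.WithoutCancel (Go 1.21+); names other than the 3-minute budget are
illustrative:

```go
package discovery

import (
	"context"
	"time"
)

type MergedEvent struct{ URL string }

type inserter interface {
	Run(ctx context.Context, events []MergedEvent) error
}

type Service struct{ inserts inserter }

// insertDetached keeps the request context's values (trace IDs, auth) but
// drops its cancellation, then applies an internal 3-minute deadline. The
// downstream link-verifier sees a live context even after the gateway has
// cut the HTTP connection, so URL checks no longer fail wholesale.
func (s *Service) insertDetached(reqCtx context.Context, events []MergedEvent) error {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 3*time.Minute)
	defer cancel()
	return s.inserts.Run(ctx, events)
}
```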
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, a port the Service doesn't expose, so every tick
timed out after ~135s with "Could not connect to server".
Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
Previous deploys emitted four warnings on the discovery-tick Pod template
against the restricted:latest policy. Today these are only warnings; if
namespace enforcement tightens, admission will silently drop the Pod.
Pod-level: runAsNonRoot, runAsUser/runAsGroup 100 (curlimages/curl's
built-in non-root UID), seccompProfile RuntimeDefault.
Container-level: allowPrivilegeEscalation false, capabilities drop ALL.
Adds a batch/v1 CronJob that POSTs to /api/v1/admin/discovery/tick on a
configurable schedule (default every 4h). Wires DISCOVERY_TOKEN into the
ci-secrets Secret and projects discovery/AI env vars into the backend
Deployment.
- Set resources req=limit (100m/128Mi) for Guaranteed QoS class
- Add ConfigMap checksum annotation to trigger rollouts on config changes
- Add retry limit (60 attempts) to migration init container
- Use TARGETARCH in Dockerfile for multi-arch build support
- Set GOMAXPROCS and GOMEMLIMIT from cgroup limits to prevent
thread oversubscription and unbounded GC memory growth (see the
sketch below)
- Add startup probe (60s budget) to gate liveness/readiness during
connection pool initialization
- Increase liveness failureThreshold to 5 to avoid restarts on
transient issues
- Remove initialDelaySeconds (startup probe replaces this)
- Upgrade CI from alpine/helm:3.17 to alpine/helm:4.1
- Replace deprecated --atomic with --rollback-on-failure + --wait=watcher
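The GOMAXPROCS/GOMEMLIMIT bullet above references this sketch. The log
doesn't say whether the limits are wired via env vars in the chart or set
inside the binary; one common in-process approach is the
automaxprocs/automemlimit pair, shown here as an assumption:

```go
package main

import (
	// Both packages act in init(), reading cgroup v1/v2 limits.
	_ "github.com/KimMachineGun/automemlimit" // GOMEMLIMIT ~90% of the memory limit
	_ "go.uber.org/automaxprocs"              // GOMAXPROCS = container CPU quota (min 1)
)

func main() {
	// Start the backend as usual; the scheduler and GC now respect the
	// pod's 100m CPU / 128Mi limits instead of the node's full capacity.
}
```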
BusyBox 1.37 nc -z is broken (outputs "punt!" and never exits),
causing the wait-for-cache init container to loop indefinitely.
The cache is healthy; the backend should handle reconnects itself.
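Assuming the backend talks to Dragonfly through go-redis (the cache speaks
the Redis protocol; the client library is a guess), owning reconnects
instead of an init-container wait can look like:

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// New dials the cache and tolerates it coming up after the backend:
// go-redis reconnects on its own, so a bounded readiness ping replaces the
// init-container wait. The address and budgets here are assumptions.
func New(ctx context.Context, addr string) (*redis.Client, error) {
	c := redis.NewClient(&redis.Options{
		Addr:            addr,
		MaxRetries:      5,
		MinRetryBackoff: 100 * time.Millisecond,
		MaxRetryBackoff: 2 * time.Second,
	})
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	for {
		if err := c.Ping(ctx).Err(); err == nil {
			return c, nil
		}
		select {
		case <-ctx.Done():
			// Hand back the client anyway: later calls retry once the
			// cache is reachable, matching "handle reconnects itself".
			return c, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```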
Prevents the backend from starting before the DragonflyDB operator
has the cache pod ready and reachable. Mirrors the existing
wait-for-postgres pattern in the migration job.
Replace manual Valkey Deployment+Service with DragonflyDB operator CRD.
Add sectionName to HTTPRoute for HTTPS listener pinning and a separate
HTTP→HTTPS 301 redirect route. Update resources from req=limit to
request/limit separation for pay-as-you-go billing. Fix NetworkPolicy
cache pod selector to match operator-managed labels.
Add Woodpecker secrets for AI_API_KEY, AI_AGENT_SIMPLE, and
TURNSTILE_SECRET_KEY. Create ci-secrets.yaml template and wire
them through the deploy step alongside existing SMTP secrets.
Set CPU and memory requests equal to limits (100m/100Mi) for backend,
cache, and web. Switch rolling update strategy to maxSurge=1,
maxUnavailable=0 so new pods start before old ones terminate.
Add readiness probe to cache deployment.
maxSurge=1 requires a second pod during rollout, but the tenant
ResourceQuota (1 CPU limit) is already at 900m; the extra 250m for the
surge pod exceeds the cap and the pod can't schedule, causing a 5min
timeout.
Switch to maxSurge=0/maxUnavailable=1 (kill-then-start) to stay
within quota. Matches the web deployment strategy.
- Add SMTP_PORT, SMTP_FROM, ADMIN_EMAIL, FRONTEND_URL to ConfigMap
- Add Helm-managed SMTP secret for credentials (host, user, password)
- Wire Woodpecker secrets into deploy step via --set flags
- SMTP secret conditionally created only when values are provided
- Admin CRUD endpoints for markets with role-based middleware
- Anonymous market submission with Cloudflare Turnstile verification
- SMTP email notifications on new submissions (LogSender fallback)
- Market status workflow (pending/approved/rejected) with admin notes
- Nullable location column for submissions without coordinates
- CLI tool for promoting users to admin role
- Slug generation package extracted from seed
- Rate limiting on the submission endpoint (3/hour per IP; see the
sketch below)
- Mailpit added to docker-compose for local email testing
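One way to implement the 3/hour per-IP limit referenced in the list above,
using golang.org/x/time/rate (a sketch; eviction of idle IPs and
proxy-header handling are omitted):

```go
package submit

import (
	"net/http"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// ipLimiter hands out one token bucket per client IP: burst 3, refilling
// one submission every 20 minutes, i.e. 3/hour on average.
type ipLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
}

func newIPLimiter() *ipLimiter {
	return &ipLimiter{limiters: map[string]*rate.Limiter{}}
}

func (l *ipLimiter) allow(ip string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.limiters[ip]
	if !ok {
		lim = rate.NewLimiter(rate.Every(20*time.Minute), 3)
		l.limiters[ip] = lim
	}
	return lim.Allow()
}

// Middleware rejects over-quota submissions with 429.
func (l *ipLimiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.allow(r.RemoteAddr) { // real code would strip the port / honor X-Forwarded-For
			http.Error(w, "too many submissions", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```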
Single-replica deployment with tight CPU quota (1 core) cannot run two
pods simultaneously during a rolling update. Recreate kills the old pod
before starting the new one.
Tenant SA lacks dragonflydb.io CRD permissions. Use a standard
Valkey Deployment+Service instead. Also re-enable CNPG (created via
kubectl) and the migrate job, and add a seccompProfile to the migrate pod.