44 Commits

Author SHA1 Message Date
75a626b127 chore: switch CI to monolithic chart, delete old per-service charts
CI deploy steps now target helm/marktvogt with --reset-then-reuse-values,
preserving the other service's image tag across pipeline runs. Each pipeline
sets only its own X.image.tag.

App-level secrets (SMTP/Turnstile/discovery/AI/JWT/OAuth) moved out of CI's
--set chain in the previous phase and are now pre-created via
scripts/k8s-secrets-sync.sh from .env.helm. The chart's conditional secret
templates remain for backward compatibility with the live release's stored
values but will be removed in a follow-up once those values are cleared.

Old per-service chart directories deleted; only the monolithic
helm/marktvogt/ remains.

MIGRATION.md updated with the actual procedure that worked, including the
several pitfalls hit during the live tenant-2 migration on 2026-04-28
(helm uninstall trap, SSA field-manager swap for CRDs, kyverno hostname
allowlist for new subdomains).
2026-04-28 16:33:53 +02:00
4916b0d6af fix(infra): increase gateway timeout for admin+market routes to 120s
Merge-plan and research-plan both call Gemini which can take >60s.
The default gateway timeout was killing connections with 504.

- Web HTTPRoute: add /admin/ rule with 120s request+backendRequest timeout
- Backend HTTPRoute: add /api/v1/admin/markets/ rule with 120s timeout
- MergePlan handler: add 110s context deadline for graceful degradation
  before the gateway cuts the upstream connection
2026-04-25 22:03:20 +02:00
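The 110s handler deadline in 4916b0d6af sits just inside the 120s gateway timeout, so the handler can answer with a degraded result before the gateway returns a 504. A minimal Go sketch of that pattern follows. The `Plan` type, `planWithDeadline`, and `simulate` are hypothetical names for illustration, not taken from the repository, and the durations are parameters so the behaviour can be exercised with short values.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Plan is a hypothetical result type; Degraded marks a fallback answer.
type Plan struct {
	Text     string
	Degraded bool
}

// planWithDeadline runs callModel under a budget that should be shorter than
// the gateway timeout. If the budget expires first, it returns a degraded
// Plan instead of an error, so the client still gets a response.
func planWithDeadline(ctx context.Context, budget time.Duration,
	callModel func(context.Context) (string, error)) (Plan, error) {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()

	text, err := callModel(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return Plan{Degraded: true}, nil
	}
	if err != nil {
		return Plan{}, err
	}
	return Plan{Text: text}, nil
}

// simulate runs a fake model call lasting workMs under a budget of budgetMs.
// It exists only to exercise planWithDeadline with short durations.
func simulate(budgetMs, workMs int) (Plan, error) {
	slow := func(ctx context.Context) (string, error) {
		select {
		case <-time.After(time.Duration(workMs) * time.Millisecond):
			return "full plan", nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return planWithDeadline(context.Background(),
		time.Duration(budgetMs)*time.Millisecond, slow)
}

func main() {
	// In production the budget would be 110*time.Second against a 120s gateway rule.
	p, err := simulate(20, 2000)
	fmt.Println(p.Degraded, err)
}
```

Keeping the budget a few seconds below the gateway value leaves time to serialize and send the fallback response.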
3ddfd87408 feat(ai): migrate to Google Gemini 2.5 Flash-Lite, drop Mistral/Ollama
Replace the Mistral + Ollama AI stack with a single Google Gemini provider
backed by google.golang.org/genai. API key moves from env/Helm to the DB
(AES-256-GCM, key derived from JWT_SECRET via HKDF) so it can be rotated
via the admin UI without a pod restart.

New:
- pkg/crypto/secretbox — AES-256-GCM encrypt/decrypt for secrets at rest
- pkg/ai/gemini — GeminiProvider with grounding, structured output, usage
  recording, and hot-reload (Reinitialize swaps client under mutex)
- pkg/ai/usage — UsageRecorder interface + UsageEvent struct
- domain/settings/store — DB-backed settings (model, grounding toggle, key)
- domain/settings/usage — UsageRepo implementing UsageRecorder; ai_usage table
- migrations 000021 (system_settings) + 000022 (ai_usage)
- settings API: GET /ai, POST /ai/key, POST /ai/model, POST /ai/grounding,
  GET /ai/usage
- admin UI: 4-card settings page — provider status, model selector, grounding
  toggle with quota, usage rollups + recent-calls table

Removed:
- pkg/ai/ollama, mistral_provider, ratelimiter (+ tests)
- Helm AI_API_KEY, AI_PROVIDER, AI_MODEL_COMPLEX, AI_AGENT_DISCOVERY,
  AI_RATE_LIMIT_RPS env vars

Call sites set Grounded+CallType: research (true/"research"), enrich Pass B
(true/"enrich_b"), similarity (false/"similarity"). Integration test updated
to use a stub ai.Provider instead of a fake Ollama HTTP server.
2026-04-25 09:54:49 +02:00
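Commit 3ddfd87408 describes storing the API key encrypted with AES-256-GCM under a key derived from JWT_SECRET via HKDF. The sketch below shows one way that combination can look using only the Go standard library. It is not the repository's pkg/crypto/secretbox: the function names, the salt, and the info label are assumptions, and the HKDF here is a hand-written single-block variant (RFC 5869 with SHA-256, which yields exactly 32 bytes).

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// deriveKey performs HKDF-SHA256 extract plus a single expand round.
// One round produces 32 bytes, the key size AES-256 needs.
func deriveKey(secret, salt []byte, info string) []byte {
	extract := hmac.New(sha256.New, salt)
	extract.Write(secret)
	prk := extract.Sum(nil)

	expand := hmac.New(sha256.New, prk)
	expand.Write([]byte(info))
	expand.Write([]byte{0x01}) // block counter for T(1)
	return expand.Sum(nil)
}

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext || tag with a fresh random nonce.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and authenticates before returning plaintext.
func decrypt(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	// The salt and info strings are illustrative placeholders.
	key := deriveKey([]byte("jwt-secret"), []byte("example-salt"), "ai-api-key")
	sealed, _ := encrypt(key, []byte("gemini-key"))
	plain, err := decrypt(key, sealed)
	fmt.Println(string(plain), err)
}
```

Because the encryption key is derived from JWT_SECRET, rotating that secret would make stored ciphertexts unreadable unless they are re-encrypted first.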
67b2eb5d74 feat(market): in-backend research orchestrator with SearxNG + schema-validated LLM
Adds pkg/search (SearxNG impl), domain/market/research (orchestrator + embedded
German prompt and JSON schema), and reinstates POST /markets/:id/research on
top of the new pipeline. Seeds URLs from crawler provenance; falls back to
search when fewer than two distinct seed domains are known.
2026-04-24 17:06:04 +02:00
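The fallback rule in 67b2eb5d74 (search only when fewer than two distinct seed domains are known) can be sketched in Go as below. The function names are hypothetical, and treating "www.example.org" and "example.org" as the same domain is an assumption of this sketch, not something the commit states.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// distinctDomains counts unique hostnames among the seed URLs, skipping
// entries that fail to parse or carry no host.
func distinctDomains(seeds []string) int {
	seen := map[string]struct{}{}
	for _, s := range seeds {
		u, err := url.Parse(s)
		if err != nil || u.Hostname() == "" {
			continue
		}
		// Assumption: a leading "www." does not make a different domain.
		host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
		seen[host] = struct{}{}
	}
	return len(seen)
}

// needsSearch reports whether the orchestrator should fall back to search.
func needsSearch(seeds []string) bool {
	return distinctDomains(seeds) < 2
}

func main() {
	seeds := []string{
		"https://markt.example/termine",
		"https://www.markt.example/anfahrt",
	}
	fmt.Println(distinctDomains(seeds), needsSearch(seeds))
}
```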
52f3e4c009 chore: replace personal emails with contact@marktvogt.de 2026-04-21 10:56:07 +02:00
d6b65501ec security: redact agent ID from helm values; gitignore superpowers docs
Remove Mistral agent ID from agentDiscovery comment in helm values.yaml.
Add docs/superpowers/ to .gitignore to prevent re-tracking internal AI plans.
2026-04-21 09:48:32 +02:00
b52ac7d861 docs(ship-2): handoff note + chore(helm): bump JWT access TTL 15m to 2h
Handoff captures end-of-Ship-1 state and Ship 2 scope (§4.10 expanded
product additions: crawl-time enrichment, AI-augmented similarity,
inline enrich-before-accept, detail drawer, eval harness, enrichment
cache, auto-merge during crawl, keyboard shortcuts). §4.12 tracks the
admin auth refresh-on-401 fix; until that lands, JWT_ACCESS_TTL is bumped
from 15m to 2h as interim relief.
2026-04-19 01:05:52 +02:00
f6e4e5c29f fix(discovery): crawl survives gateway timeout and long-running runs
- HTTPRoute: add 300s request+backendRequest timeout rule for
  /api/v1/admin/discovery/crawl; default rule unchanged. nginx-gateway's
  60s default was cutting the connection mid-crawl.
- Service.Crawl: detach insert pipeline from HTTP request context with
  a 3-minute internal timeout. Previously a canceled request ctx
  cascaded into the link-verifier, failing every URL check and
  counting every merged event as LinkCheckFailed. Inserts now complete
  even if the gateway cut the connection.
- Log CrawlSummary at INFO on completion so outcomes are visible in
  backend logs without needing the HTTP response body.
- New test: TestServiceCrawlDetachesInsertContextFromRequestCtx.
2026-04-18 18:39:21 +02:00
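The context detachment described in f6e4e5c29f can be sketched with `context.WithoutCancel` (available since Go 1.21) wrapped in its own timeout. Whether the repository uses this exact call is not stated in the commit, and `runDetached` and `survivesCancel` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runDetached calls work with a context that keeps the request's values but
// ignores its cancellation, bounded by an independent timeout.
func runDetached(reqCtx context.Context, timeout time.Duration,
	work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), timeout)
	defer cancel()
	return work(ctx)
}

// survivesCancel cancels a request context up front and reports whether the
// detached work still observed a live context.
func survivesCancel() bool {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the gateway cutting the connection

	err := runDetached(reqCtx, 3*time.Minute, func(ctx context.Context) error {
		return ctx.Err() // nil while the detached context is alive
	})
	return reqCtx.Err() != nil && err == nil
}

func main() {
	fmt.Println(survivesCancel())
}
```

The internal timeout still matters: without it, work detached from the request could run indefinitely if a downstream call hangs.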
ba453a910f chore(helm): daily discovery cron hits /crawl endpoint 2026-04-18 17:46:39 +02:00
5a561b3092 fix(helm): CronJob curls the Service port, not the container port
Service listens on port 80 (target: container 8080). The CronJob was
curling :8080 directly, which isn't exposed by the Service — every tick
timed out after ~135s with "Could not connect to server".

Switch to {{ .Values.service.port }} so the template always tracks the
actual Service port.
2026-04-18 09:00:16 +02:00
1ba8f856b4 fix(helm): add restricted PodSecurity settings to discovery CronJob
Previous deploys emitted 4 warnings on the discovery-tick Pod template
against the restricted:latest policy. Today they are warnings; if the
namespace enforcement tightens, admission will silently drop the Pod.

Pod-level: runAsNonRoot, runAsUser/runAsGroup 100 (curlimages/curl's
built-in non-root UID), seccompProfile RuntimeDefault.
Container-level: allowPrivilegeEscalation false, capabilities drop ALL.
2026-04-18 08:26:40 +02:00
31ce937f55 feat(helm): add discovery CronJob + token secret + env wiring
Adds a batch/v1 CronJob that POSTs to /api/v1/admin/discovery/tick on a
configurable schedule (default every 4h). Wires DISCOVERY_TOKEN into the
ci-secrets Secret and projects discovery/AI env vars into the backend
Deployment.
2026-04-18 07:57:18 +02:00
f9b77f362f chore(helm): right-size resource requests/limits per cluster telemetry
Drop requests to match observed peak usage and widen CPU limits for
burst headroom (Burstable QoS). Backend, web, Postgres, and Dragonfly
all had requests == limits pinned at defaults well above measured
7-day peaks.

- backend: req 100m/128Mi -> 50m/64Mi, lim 100m/128Mi -> 200m/128Mi
- web:     req 100m/128Mi -> 50m/96Mi, lim 100m/128Mi -> 200m/128Mi
- postgres (CNPG): req 50m/256Mi -> 15m/128Mi, lim 200m/512Mi -> 100m/256Mi
- dragonfly: req 100m/128Mi -> 100m/72Mi, lim 100m/128Mi -> 150m/128Mi

RAM limits unchanged where reasonable to preserve OOM protection;
Dragonfly CPU request kept at 100m (peak 74m) but limit raised to
avoid throttling under brief bursts.
2026-04-18 04:36:12 +02:00
a95d24876d fix(helm): update imagePullSecret to itsh-registry 2026-04-06 20:01:10 +02:00
e454e31472 fix(ci): switch container registry to registry.itsh.dev 2026-04-06 19:49:06 +02:00
53d7faae24 fix(helm): guaranteed QoS, config checksum, migration retry limit
- Set resources req=limit (100m/128Mi) for Guaranteed QoS class
- Add ConfigMap checksum annotation to trigger rollouts on config changes
- Add retry limit (60 attempts) to migration init container
- Use TARGETARCH in Dockerfile for multi-arch build support
2026-04-01 23:44:50 +02:00
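The bounded wait added in 53d7faae24 (give up after 60 attempts rather than loop forever) is done in a shell init container in the chart. The same idea expressed in Go, as a hedged sketch with hypothetical function names, looks like this:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitForTCP dials addr up to maxAttempts times, pausing between attempts.
// It returns the number of attempts used, plus an error once the limit is hit.
func waitForTCP(addr string, maxAttempts int, dialTimeout, pause time.Duration) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		conn, err := net.DialTimeout("tcp", addr, dialTimeout)
		if err == nil {
			conn.Close()
			return attempt, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(pause)
		}
	}
	return maxAttempts, fmt.Errorf("%s unreachable after %d attempts: %w",
		addr, maxAttempts, lastErr)
}

// probeLocal opens a listener on an ephemeral local port and probes it.
func probeLocal() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	return waitForTCP(ln.Addr().String(), 60, 2*time.Second, time.Second)
}

// probeClosed probes a port whose listener has just been closed.
func probeClosed() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	addr := ln.Addr().String()
	ln.Close()
	return waitForTCP(addr, 3, 200*time.Millisecond, 10*time.Millisecond)
}

func main() {
	fmt.Println(probeLocal())
	fmt.Println(probeClosed())
}
```

A finite attempt count turns an unreachable database into a visible failure instead of a pod stuck in its init phase.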
482fcd180a feat(helm): add Go runtime tuning, startup probe, upgrade to Helm 4
- Set GOMAXPROCS and GOMEMLIMIT from cgroup limits to prevent
  thread oversubscription and unbounded GC memory growth
- Add startup probe (60s budget) to gate liveness/readiness during
  connection pool initialization
- Increase liveness failureThreshold to 5 to avoid restarts on
  transient issues
- Remove initialDelaySeconds (startup probe replaces this)
- Upgrade CI from alpine/helm:3.17 to alpine/helm:4.1
- Replace deprecated --atomic with --rollback-on-failure + --wait=watcher
2026-04-01 00:07:01 +02:00
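To make the runtime tuning in 482fcd180a concrete, the sketch below derives both settings from container limits inside the process. This is a swapped-in illustration: the commit sets GOMAXPROCS and GOMEMLIMIT through the chart, and its exact mechanism is not shown here. The helper names and the 90% memory headroom are assumptions.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// maxProcsFor rounds a CPU limit in millicores up to whole threads,
// with a floor of one.
func maxProcsFor(milliCPU int64) int {
	n := int((milliCPU + 999) / 1000)
	if n < 1 {
		n = 1
	}
	return n
}

// memLimitFor leaves headroom below the container memory limit for memory
// the Go heap limit does not cover. The 90% factor is an assumption.
func memLimitFor(limitBytes int64) int64 {
	return limitBytes / 10 * 9
}

func main() {
	// Example limits only: 200m CPU and 128Mi memory.
	const cpuMilli, memBytes = 200, 128 << 20

	runtime.GOMAXPROCS(maxProcsFor(cpuMilli))
	debug.SetMemoryLimit(memLimitFor(memBytes))

	// GOMAXPROCS(0) and SetMemoryLimit(-1) read the current values back
	// without changing them.
	fmt.Println(runtime.GOMAXPROCS(0), debug.SetMemoryLimit(-1))
}
```

Without a cap, the runtime on older Go versions sizes its thread pool from the node's CPU count, which is the oversubscription the commit message refers to.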
74ee825039 fix(helm): switch migrate init container from busybox to alpine
busybox:1.37 nc -z is broken (outputs "punt!" and hangs).
Alpine 3.21 ships a working nc -z implementation.
2026-03-31 23:38:50 +02:00
08d83bc57e fix(helm): replace broken nc -z in migrate job init container
BusyBox 1.37 nc -z outputs "punt!" and hangs. Use nc -w 2 with
stdin redirect instead, which correctly tests TCP connectivity.
2026-03-31 23:06:51 +02:00
ab2484474e fix(helm): remove busybox init container blocking backend startup
BusyBox 1.37 nc -z is broken (outputs "punt!" and never exits),
causing the wait-for-cache init container to loop indefinitely.
The cache is healthy — the backend should handle reconnects itself.
2026-03-31 23:02:26 +02:00
9c051df350 feat(helm): add wait-for-cache init container to backend deployment
Prevents the backend from starting before the DragonflyDB operator
has the cache pod ready and reachable. Mirrors the existing
wait-for-postgres pattern in the migration job.
2026-03-08 20:00:52 +01:00
3d17e25764 fix(helm): bump dragonfly memory limit to 512Mi
DragonflyDB requires 256MiB minimum per thread. With container
overhead the 256Mi limit is insufficient, causing immediate exit.
2026-03-08 19:43:05 +01:00
b00e8df6db fix(helm): lower resource limits to fit within tenant-quota (1 CPU)
Set backend and cache limits to 200m/256Mi to stay within the
tenant-1 ResourceQuota of 1 CPU total.
2026-03-08 19:29:50 +01:00
2e1eed543d feat(helm): use DragonflyDB operator CRD, add HTTPRoute sectionName and HTTP→HTTPS redirects
Replace manual Valkey Deployment+Service with DragonflyDB operator CRD.
Add sectionName to HTTPRoute for HTTPS listener pinning and a separate
HTTP→HTTPS 301 redirect route. Update resources from req=limit to
request/limit separation for pay-as-you-go billing. Fix NetworkPolicy
cache pod selector to match operator-managed labels.
2026-03-08 19:01:34 +01:00
c7085e5337 chore: bump Go to 1.26 in CI and Dockerfile
Required by mistral-go-sdk which targets go 1.26.
2026-03-05 21:25:58 +01:00
02a03c3d41 feat: pass AI and Turnstile secrets via Helm deploy pipeline
Add Woodpecker secrets for AI_API_KEY, AI_AGENT_SIMPLE, and
TURNSTILE_SECRET_KEY. Create ci-secrets.yaml template and wire
them through the deploy step alongside existing SMTP secrets.
2026-03-05 18:41:42 +01:00
bec253506e chore: normalize resources to 100m/100Mi and enable zero-downtime deploys
Set CPU and memory requests equal to limits (100m/100Mi) for backend,
cache, and web. Switch rolling update strategy to maxSurge=1,
maxUnavailable=0 so new pods start before old ones terminate.
Add readiness probe to cache deployment.
2026-03-05 18:24:58 +01:00
fd879ba026 fix(deploy): use maxSurge=0 for rolling update to fit resource quota
maxSurge=1 requires a second pod during rollout, but the tenant
ResourceQuota (1 CPU limit) is already at 900m — the extra 250m
exceeds the cap and the pod can't schedule, causing a 5min timeout.

Switch to maxSurge=0/maxUnavailable=1 (kill-then-start) to stay
within quota. Matches the web deployment strategy.
2026-02-27 14:17:02 +01:00
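The quota arithmetic behind fd879ba026 can be checked in a few lines of Go. The figures come from the commit message; the assumption that the outgoing pod also carries a 250m limit (it belongs to the same Deployment) is mine, and `fitsQuota` is a hypothetical helper.

```go
package main

import "fmt"

// fitsQuota reports whether adding one pod's CPU limit keeps the namespace
// within its ResourceQuota. All values are millicores.
func fitsQuota(usedMilli, podMilli, quotaMilli int) bool {
	return usedMilli+podMilli <= quotaMilli
}

func main() {
	// maxSurge=1: the new pod is added while the old one still runs.
	fmt.Println(fitsQuota(900, 250, 1000)) // false: 1150m > 1000m

	// maxSurge=0, maxUnavailable=1: the old pod's 250m is released first.
	fmt.Println(fitsQuota(900-250, 250, 1000)) // true: 900m <= 1000m
}
```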
2def99d163 fix: switch backend deployment to RollingUpdate for zero downtime
maxUnavailable=0 ensures old pod stays up until new pod passes
readiness probes. maxSurge=1 allows one extra pod during rollout.
2026-02-27 13:43:04 +01:00
2e5d7b726b feat: add SMTP config to Helm chart and Woodpecker pipeline
- Add SMTP_PORT, SMTP_FROM, ADMIN_EMAIL, FRONTEND_URL to ConfigMap
- Add Helm-managed SMTP secret for credentials (host, user, password)
- Wire Woodpecker secrets into deploy step via --set flags
- SMTP secret conditionally created only when values are provided
2026-02-27 13:31:37 +01:00
580b9d5e3c feat: add admin panel, market submissions, and email notifications
- Admin CRUD endpoints for markets with role-based middleware
- Anonymous market submission with Cloudflare Turnstile verification
- SMTP email notifications on new submissions (LogSender fallback)
- Market status workflow (pending/approved/rejected) with admin notes
- Nullable location column for submissions without coordinates
- CLI tool for promoting users to admin role
- Slug generation package extracted from seed
- Rate limiting on submission endpoint (3/hour per IP)
- Mailpit added to docker-compose for local email testing
2026-02-27 11:03:44 +01:00
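The "3/hour per IP" limit on the submission endpoint in 580b9d5e3c could be implemented in several ways. The commit does not say whether the state lives in memory or in the cache, so the following is only a minimal in-memory sliding-window sketch with hypothetical names (`Limiter`, `Allow`, `simulate`).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter allows at most limit hits per key within a sliding window.
type Limiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   map[string][]time.Time
}

func NewLimiter(limit int, window time.Duration) *Limiter {
	return &Limiter{limit: limit, window: window, hits: map[string][]time.Time{}}
}

// Allow drops hits older than the window, then records the new hit if the
// key is still under its limit. Denied requests are not recorded.
func (l *Limiter) Allow(ip string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	cutoff := now.Add(-l.window)
	kept := l.hits[ip][:0]
	for _, t := range l.hits[ip] {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.limit {
		l.hits[ip] = kept
		return false
	}
	l.hits[ip] = append(kept, now)
	return true
}

// simulate replays requests from one IP at the given minute offsets against
// a 3-per-hour limiter and returns each decision.
func simulate(offsetsMinutes []int) []bool {
	l := NewLimiter(3, time.Hour)
	start := time.Date(2026, 2, 27, 11, 0, 0, 0, time.UTC)
	out := make([]bool, 0, len(offsetsMinutes))
	for _, m := range offsetsMinutes {
		out = append(out, l.Allow("203.0.113.7", start.Add(time.Duration(m)*time.Minute)))
	}
	return out
}

func main() {
	fmt.Println(simulate([]int{0, 1, 2, 3, 61}))
}
```

This version never evicts idle keys and does not share state between replicas, so a multi-replica deployment would need a shared store.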
8b478a11b8 fix(deploy): use Recreate strategy to fit tenant CPU quota
Single-replica deployment with tight CPU quota (1 core) cannot run two
pods simultaneously during a rolling update. Recreate kills the old pod
before starting the new one.
2026-02-22 12:01:50 +01:00
3236318e72 fix(deploy): add resource limits to migrate job to fit tenant quota 2026-02-22 11:55:42 +01:00
9e6608384c fix(deploy): increase migrate job deadline to 300s 2026-02-22 11:47:06 +01:00
e092a8d054 fix(deploy): replace Dragonfly CRD with plain Valkey deployment
Tenant SA lacks dragonflydb.io CRD permissions. Use a standard
Valkey Deployment+Service instead. Also re-enable CNPG (created
via kubectl), migrate job, and add seccompProfile to migrate pod.
2026-02-22 10:53:33 +01:00
f48a29c433 fix(deploy): disable CNPG and migrate job (tenant SA lacks CRD permissions)
Postgres, Dragonfly, and NetworkPolicy must be provisioned by the
platform admin or via itsh.dev dashboard, not by the tenant SA.
2026-02-22 10:32:10 +01:00
7c2e2cebff fix(deploy): disable Dragonfly CRD (tenant SA lacks dragonflydb.io permission) 2026-02-22 10:25:50 +01:00
e99ab896d3 fix(deploy): disable NetworkPolicy (tenant SA lacks networkpolicies permission) 2026-02-22 10:22:18 +01:00
ae54910f51 fix(docker): use existing nobody user instead of creating UID 65534 2026-02-22 10:19:19 +01:00
f6f07b2139 fix(deploy): add seccompProfile RuntimeDefault to satisfy PodSecurity restricted policy 2026-02-22 10:03:41 +01:00
a12c1b48f1 fix(ci): correct registry to somegit.dev, add golangci-lint v2 version field 2026-02-22 09:50:46 +01:00
10e1d15462 chore(deploy): remove superseded raw k8s manifests 2026-02-22 09:33:41 +01:00
7780c3378b feat(deploy): add Helm chart and update CI for k8s deployment
- Replace raw k8s manifests with a full Helm chart (deploy/helm/)
- Add CloudNativePG cluster with PostGIS extensions and hcloud-volumes storage
- Add DragonflyDB (Redis-compatible) cache via operator CRD
- Add migration Job as Helm pre-install/pre-upgrade hook
- Add NetworkPolicy restricting ingress to nginx-gateway, egress to DB/cache/DNS/HTTPS
- Add ServiceAccount with automountServiceAccountToken disabled
- Use HTTPRoute (Gateway API) instead of Ingress to match cluster setup
- Fix Dockerfile: explicit UID 65534, add golang-migrate CLI for migration Job
- Update CI: push immutable SHA tags, deploy via helm upgrade --install --atomic
2026-02-22 09:32:01 +01:00
a1d93f7a8e feat: implement MVP backend API
Go backend with Gin, pgx, Valkey (go-valkey), and PostGIS.

Domains:
- Market search with PostGIS geo-queries (ST_DWithin, ST_Distance),
  German full-text search (tsvector + ILIKE fallback for compound words),
  date range filtering, pagination, and slug-based detail endpoint
- Auth with email+password (bcrypt), JWT access tokens (15min),
  session tokens (30d, dual Valkey+Postgres storage), OAuth
  (Google/GitHub/Facebook), magic links, and TOTP 2FA
- User profile with CRUD, soft-delete (30d grace), and restore

Infrastructure:
- 6 database migrations (users, sessions, oauth_accounts, magic_links,
  markets with PostGIS+FTS, totp_secrets)
- Middleware: recovery, request ID, structured logging (slog), CORS,
  per-IP rate limiting, JWT auth
- Seed data: 10 medieval markets across DACH region
- Docker Compose (PostGIS 17 + Valkey 8), multi-stage Dockerfile,
  Woodpecker CI pipeline, Kubernetes manifests
- Justfile, golangci-lint config, env example
2026-02-18 05:52:20 +01:00
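The TOTP 2FA mentioned in a1d93f7a8e follows RFC 6238. The commit does not say whether the backend uses a library or its own implementation, so the function below is a standalone sketch of the algorithm (HMAC-SHA1, 30-second step) with a hypothetical name, checked against the RFC's published test secret.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

// totp computes an RFC 6238 code for the given Unix time and digit count,
// using HMAC-SHA1 and a 30-second time step.
func totp(secret []byte, unixTime int64, digits int) string {
	var counter [8]byte
	binary.BigEndian.PutUint64(counter[:], uint64(unixTime/30))

	mac := hmac.New(sha1.New, secret)
	mac.Write(counter[:])
	sum := mac.Sum(nil)

	// Dynamic truncation (RFC 4226, section 5.3): the low nibble of the last
	// byte selects a 4-byte window, whose top bit is masked off.
	offset := int(sum[len(sum)-1] & 0x0f)
	code := binary.BigEndian.Uint32(sum[offset:offset+4]) & 0x7fffffff

	mod := uint32(1)
	for i := 0; i < digits; i++ {
		mod *= 10
	}
	return fmt.Sprintf("%0*d", digits, code%mod)
}

func main() {
	secret := []byte("12345678901234567890") // RFC 6238 SHA-1 test secret
	fmt.Println(totp(secret, 59, 8))         // 94287082 (RFC 6238 Appendix B)
	fmt.Println(totp(secret, 59, 6))         // 287082
}
```

A real verifier would also accept the adjacent time steps to tolerate clock skew and compare codes in constant time.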