gnoma/TODO.md
vikingowl aca830e7db feat(engine): consumption-time stream-error failover
When a stream errors out before producing any user-visible content
(text, thinking, or tool calls), the engine now transparently retries
on the next-best arm instead of bubbling the error to the TUI. Covers
the case from the post-SLM screenshot: subprocess CLI agents that exit
non-zero on auth/config failures, network drops mid-stream, and
rate-limited arms whose error surfaces after Stream() has already
returned.

Mechanism: the stream-create + consume blocks are wrapped in a labeled
streamLoop. On s.Err() != nil with empty accumulator, the engine emits
a new EventFailover ("↻ <failed_arm> failed (<reason>) — retrying on
another arm"), excludes the failed arm via task.ExcludedArms, and
re-enters the loop. Cap of 4 failovers per round.

Guards:
- !acc.HasContent() — if text/tool calls already streamed, fail loud
  rather than duplicate visible output on retry.
- isFailoverable(err) — deny-list approach: context.Canceled,
  context.DeadlineExceeded, and HTTP 400/413 are fatal; everything else
  (auth, rate limit, 5xx, subprocess exit, network) is failoverable.
- Router.ForcedArm() == "" — when the user pinned an arm via --provider,
  failover is disabled by design.
- failoverAttempt < maxFailovers — bounded retry budget.

TUI renders EventFailover under the existing "cost" role styling.
shortFailReason strips the subprocess wrapper envelope so the user sees
"Invalid API key. Try again." instead of
"subprocess: exit status 1: Error: Invalid API key. Try again.".
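
As an illustration of the envelope stripping, a minimal sketch follows. The real shortFailReason may use different rules; the regular expression here is an assumption derived only from the single example above.

```go
package main

import (
	"fmt"
	"regexp"
)

// subprocessEnvelope matches the assumed wrapper prefix
// "subprocess: exit status N: " plus an optional "Error: " tag.
var subprocessEnvelope = regexp.MustCompile(`^subprocess: exit status \d+: (?:Error: )?`)

// shortFailReason removes the wrapper prefix and leaves other messages unchanged.
func shortFailReason(msg string) string {
	return subprocessEnvelope.ReplaceAllString(msg, "")
}

func main() {
	fmt.Println(shortFailReason("subprocess: exit status 1: Error: Invalid API key. Try again."))
}
```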

Tests cover the classifier (isFailoverable, shortFailReason), end-to-end
auth-error failover, content-already-streamed guard, and context-cancel
guard. Deterministic across 10x -race runs by giving the failing arm
IsCLIAgent=true to anchor it in tier 0 ahead of the API-tier backup.
2026-05-20 02:20:00 +02:00


Gnoma — TODO

Active plans, newest first:

Phases (2026-05-07 roadmap):

  1. M8 Cleanup (wiring gaps)
  2. PTY Interactive Shell (tea.ExecProcess)
  3. SLM Task Classifier (Ollama HTTP, opt-in) — complete
  4. Router Revisit — superseded by post-SLM plan
  5. USP Security Integration
  6. ELF Binary Support (deferred/opportunistic)
  7. Distribution (CI trigger for goreleaser)

Stable Backlog (not in active phases)

  • Thinking mode (disabled / budget / adaptive) — M12 in milestones
  • Structured output with JSON schema validation — M12
  • Native agy JSON output — update subprocess provider to use --output-format stream-json once supported by agy CLI, replacing the current prompt-augmentation fallback.
  • SQLite session persistence + serve mode — M10
  • Task learning (pattern recognition, persistent tasks) — M11
  • Web UI (gnoma web) — M15
  • OAuth / keyring — M13
  • Observability (feature flags, cost dashboards) — M14
  • PE / Mach-O support — future, after ELF Phase 6

Architecture References

  • Milestones: docs/essentials/milestones.md
  • Decisions: docs/essentials/decisions/
  • ADR-013 (SLM routing, supersedes ADR-009): docs/essentials/decisions/002-slm-routing.md