aca830e7db
When a stream errors out before producing any user-visible content
(text, thinking, or tool calls), the engine now transparently retries
on the next-best arm instead of bubbling the error to the TUI. This covers
the cases from the post-SLM screenshot: subprocess CLI agents that
exit non-zero on auth/config failures, network drops mid-stream, and
rate-limited arms whose error surfaces after Stream() has already returned.
Mechanism: the stream-create + consume blocks are wrapped in a labeled
streamLoop. On s.Err() != nil with empty accumulator, the engine emits
a new EventFailover ("↻ <failed_arm> failed (<reason>) — retrying on
another arm"), excludes the failed arm via task.ExcludedArms, and
re-enters the loop. Cap of 4 failovers per round.
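The retry loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the engine's actual code: the `arm` type, `pick`, and `runRound` are hypothetical stand-ins, the `excluded` map plays the role of task.ExcludedArms, and the event is reduced to a plain string. Only the labeled loop, the empty-accumulator check, and the cap of 4 come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// maxFailovers mirrors the cap of 4 failovers per round described above.
const maxFailovers = 4

// arm is a hypothetical stand-in for a routed provider. An empty failReason
// means the stream succeeds; partial is content emitted before any failure.
type arm struct {
	name       string
	partial    string
	failReason string
}

// pick returns the first arm that has not been excluded (assumed ordering).
func pick(arms []arm, excluded map[string]bool) (arm, bool) {
	for _, a := range arms {
		if !excluded[a.name] {
			return a, true
		}
	}
	return arm{}, false
}

// runRound returns the streamed text, the failover events emitted, and the
// final error, retrying on another arm only while nothing has been streamed.
func runRound(arms []arm) (string, []string, error) {
	excluded := map[string]bool{} // stands in for task.ExcludedArms
	var events []string
	failoverAttempt := 0

streamLoop:
	for {
		a, ok := pick(arms, excluded)
		if !ok {
			return "", events, errors.New("no arms left to try")
		}
		if a.failReason == "" {
			return a.partial, events, nil
		}
		err := errors.New(a.failReason)
		if a.partial == "" && failoverAttempt < maxFailovers {
			failoverAttempt++
			excluded[a.name] = true
			events = append(events, fmt.Sprintf("%s failed (%s)", a.name, a.failReason))
			continue streamLoop // re-enter stream creation on the next arm
		}
		// Content already streamed, or budget exhausted: fail loud.
		return a.partial, events, err
	}
}

func main() {
	text, events, err := runRound([]arm{
		{name: "cli-agent", failReason: "Invalid API key"},
		{name: "api-backup", partial: "hello"},
	})
	fmt.Println(text, events, err)
}
```

The labeled `continue` keeps stream creation and consumption in one loop body, so a retry goes back through arm selection rather than duplicating the setup code.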
Guards:
- !acc.HasContent() — if text/tool calls already streamed, fail loud
rather than duplicate visible output on retry.
- isFailoverable(err) — deny-list approach: context.Canceled,
context.DeadlineExceeded, and HTTP 400/413 are fatal; everything else
(auth, rate limit, 5xx, subprocess exit, network) is failoverable.
- Router.ForcedArm() == "" — when the user pinned an arm via --provider,
failover is disabled by design.
- failoverAttempt < maxFailovers — bounded retry budget.
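The four guards above could be combined along these lines. This is a sketch under assumptions: `httpError` is a hypothetical error type invented here to carry a status code, `shouldFailover` is an illustrative helper that does not appear in the commit, and the real isFailoverable may inspect provider errors differently. The deny-list itself (context cancellation/deadline and HTTP 400/413) follows the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// maxFailovers mirrors the bounded retry budget described above.
const maxFailovers = 4

// httpError is a hypothetical error type carrying an HTTP status code.
type httpError struct{ Status int }

func (e *httpError) Error() string { return fmt.Sprintf("http status %d", e.Status) }

// errWrappedCancel shows that wrapped context errors are still detected.
var errWrappedCancel = fmt.Errorf("stream: %w", context.Canceled)

// isFailoverable uses a deny-list: only errors known to be fatal return
// false; anything unrecognised is treated as worth retrying on another arm.
func isFailoverable(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	var he *httpError
	if errors.As(err, &he) {
		switch he.Status {
		case http.StatusBadRequest, http.StatusRequestEntityTooLarge: // 400, 413
			return false
		}
	}
	return true
}

// shouldFailover is an illustrative helper combining all four guards.
func shouldFailover(hasContent bool, err error, forcedArm string, attempt int) bool {
	return !hasContent && isFailoverable(err) && forcedArm == "" && attempt < maxFailovers
}

func main() {
	fmt.Println(isFailoverable(&httpError{Status: http.StatusTooManyRequests}))
	fmt.Println(isFailoverable(errWrappedCancel))
	fmt.Println(shouldFailover(false, errors.New("exit status 1"), "", 0))
}
```

A deny-list means an unfamiliar error type defaults to a retry, which fits the goal of hiding transient failures; the cost is that a genuinely fatal but unlisted error uses up failover attempts before surfacing.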
TUI renders EventFailover under the existing "cost" role styling.
shortFailReason strips the subprocess wrapper envelope so the user sees
"Invalid API key. Try again." instead of
"subprocess: exit status 1: Error: Invalid API key. Try again.".
Tests cover the classifier (isFailoverable, shortFailReason), end-to-end
auth-error failover, content-already-streamed guard, and context-cancel
guard. Deterministic across 10x -race runs by giving the failing arm
IsCLIAgent=true to anchor it in tier 0 ahead of the API-tier backup.
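For reference, the envelope stripping attributed to shortFailReason above might look like the following. This is a sketch only: the real function's signature and the exact prefixes it handles are not shown in the commit message, so the string-based signature and the "subprocess: ", "exit status N: ", and "Error: " prefixes are assumptions inferred from the single example given.

```go
package main

import (
	"fmt"
	"strings"
)

// shortFailReason strips an assumed "subprocess: exit status N: Error: "
// envelope so only the human-readable reason remains. Messages without the
// "subprocess: " prefix are returned unchanged.
func shortFailReason(msg string) string {
	const wrapper = "subprocess: "
	if !strings.HasPrefix(msg, wrapper) {
		return msg
	}
	rest := strings.TrimPrefix(msg, wrapper)
	// Drop the "exit status N: " segment if one is present.
	if strings.HasPrefix(rest, "exit status ") {
		if i := strings.Index(rest, ": "); i >= 0 {
			rest = rest[i+2:]
		}
	}
	rest = strings.TrimPrefix(rest, "Error: ")
	return strings.TrimSpace(rest)
}

func main() {
	fmt.Println(shortFailReason("subprocess: exit status 1: Error: Invalid API key. Try again."))
}
```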
Gnoma — TODO
Active plans, newest first:
- Post-audit security hardening — complete (2026-05-19). All 14 findings from the external review are closed across three waves + one ADR:
- docs/superpowers/plans/2026-05-19-post-slm-unlock.md — outstanding work after the SLM unlock session. Phases A (two-stage tool routing), B (CLI agent binary override), C (user profiles), and D (per-arm capability tags) are complete. Phase E (compound tools) is held until ≥50 SLM observations inform which primitives are worth adding.
- docs/superpowers/plans/2026-05-07-gnoma-roadmap.md — broader roadmap (PTY shell, USP integration, ELF, distribution). Phase 4 ("Router Revisit") is superseded by the post-SLM plan above.
Phases (2026-05-07 roadmap):
- M8 Cleanup (wiring gaps)
- PTY Interactive Shell (tea.ExecProcess)
- SLM Task Classifier (Ollama HTTP, opt-in) — complete
- Router Revisit — superseded by post-SLM plan
- USP Security Integration
- ELF Binary Support (deferred/opportunistic)
- Distribution (CI trigger for goreleaser)
Stable Backlog (not in active phases)
- Thinking mode (disabled / budget / adaptive) — M12 in milestones
- Structured output with JSON schema validation — M12
- Native agy JSON output — update subprocess provider to use --output-format stream-json once supported by agy CLI, replacing the current prompt-augmentation fallback.
- SQLite session persistence + serve mode — M10
- Task learning (pattern recognition, persistent tasks) — M11
- Web UI (gnoma web) — M15
- OAuth / keyring — M13
- Observability (feature flags, cost dashboards) — M14
- PE / Mach-O support — future, after ELF Phase 6
Architecture References
- Milestones: docs/essentials/milestones.md
- Decisions: docs/essentials/decisions/
- ADR-013 (SLM routing, supersedes ADR-009): docs/essentials/decisions/002-slm-routing.md