diff --git a/docs/essentials/INDEX.md b/docs/essentials/INDEX.md index 3dc5486..a674c06 100644 --- a/docs/essentials/INDEX.md +++ b/docs/essentials/INDEX.md @@ -21,15 +21,15 @@ essentials: | # | Essential | Status | Link | Last Updated | |---|-----------|--------|------|-------------| -| 1 | Vision | complete | [vision.md](vision.md) | 2026-04-02 | -| 2 | Domain Model | complete | [domain-model.md](domain-model.md) | 2026-04-02 | -| 3 | Architecture | complete | [architecture.md](architecture.md) | 2026-04-02 | -| 4 | Patterns | complete | [patterns.md](patterns.md) | 2026-04-02 | -| 5 | Process Flows | complete | [process-flows.md](process-flows.md) | 2026-04-02 | -| 6 | UML Diagrams | complete | [uml-diagrams.md](uml-diagrams.md) | 2026-04-02 | -| 7 | API Contracts | complete | [api-contracts.md](api-contracts.md) | 2026-04-02 | -| 8 | Tech Stack & Conventions | complete | [tech-stack.md](tech-stack.md) | 2026-04-02 | -| 9 | Constraints & Trade-offs | complete | [constraints.md](constraints.md) | 2026-04-02 | -| 10 | Milestones | complete | [milestones.md](milestones.md) | 2026-04-02 | -| 11 | Decision Log | complete | [decisions/001-initial-decisions.md](decisions/001-initial-decisions.md) | 2026-04-02 | -| 12 | Risk / Unknowns | complete | [risks.md](risks.md) | 2026-04-02 | +| 1 | Vision | complete | [vision.md](vision.md) | 2026-04-03 | +| 2 | Domain Model | complete | [domain-model.md](domain-model.md) | 2026-04-03 | +| 3 | Architecture | complete | [architecture.md](architecture.md) | 2026-04-03 | +| 4 | Patterns | complete | [patterns.md](patterns.md) | 2026-04-03 | +| 5 | Process Flows | complete | [process-flows.md](process-flows.md) | 2026-04-03 | +| 6 | UML Diagrams | complete | [uml-diagrams.md](uml-diagrams.md) | 2026-04-03 | +| 7 | API Contracts | complete | [api-contracts.md](api-contracts.md) | 2026-04-03 | +| 8 | Tech Stack & Conventions | complete | [tech-stack.md](tech-stack.md) | 2026-04-03 | +| 9 | Constraints & Trade-offs | complete | 
[constraints.md](constraints.md) | 2026-04-03 | +| 10 | Milestones | complete | [milestones.md](milestones.md) | 2026-04-03 | +| 11 | Decision Log | complete | [decisions/001-initial-decisions.md](decisions/001-initial-decisions.md) | 2026-04-03 | +| 12 | Risk / Unknowns | complete | [risks.md](risks.md) | 2026-04-03 | diff --git a/docs/essentials/architecture.md b/docs/essentials/architecture.md index 0452697..ac5abb1 100644 --- a/docs/essentials/architecture.md +++ b/docs/essentials/architecture.md @@ -85,9 +85,17 @@ graph TB | `internal/context` | Token tracking, compaction strategies, sliding window | Depends on message, provider | Internal | | `internal/config` | TOML layered config loading | BurntSushi/toml | Internal | | `internal/auth` | API key resolution from env/config | Pure Go | Internal | -| `internal/engine` | Agentic query loop, tool execution orchestration | Depends on all above | Internal | -| `internal/session` | Session lifecycle, channel-based UI decoupling | Depends on engine, stream | Internal | -| `internal/tui` | Terminal UI: chat, input, status, permission dialogs | Bubble Tea, lipgloss | Internal | +| `internal/security` | Firewall, secret scanner, unicode sanitizer, incognito mode | message, config | Security boundary | +| `internal/router` | Smart router: arm registry, pools, task classifier, selection | provider, message, config | Internal | +| `internal/engine` | Agentic query loop, tool execution orchestration | router, security, tool, stream, context | Internal | +| `internal/session` | Session lifecycle, channel-based UI decoupling | engine, stream | Internal | +| `internal/elf` | Sub-agent spawning, lifecycle, communication | engine, router, session | Internal | +| `internal/tui` | Terminal UI: chat, input, status, permission dialogs, config screen | session, stream, permission | Internal | +| `internal/hook` | Hook system: events, protocol, registration | message, tool | Internal | +| `internal/skill` | Skill loading, frontmatter 
parsing, discovery | message | Internal | +| `internal/mcp` | MCP client, tool discovery, tool replaceability | tool, config | External (stdio) | +| `internal/plugin` | Plugin manifest, loader, lifecycle | config | Internal | +| `internal/tasklearn` | Repetitive task detection, suggestions, persistent tasks | router, engine | Internal | ## Package Dependency Graph @@ -98,12 +106,20 @@ graph BT provider["provider"] tool["tool"] permission["permission"] + security["security"] + router["router"] context_mgr["context"] config["config"] auth["auth"] engine["engine"] session["session"] + elf["elf"] tui["tui"] + hook["hook"] + skill["skill"] + mcp["mcp"] + plugin["plugin"] + tasklearn["tasklearn"] cmd["cmd/gnoma"] stream --> message @@ -111,24 +127,44 @@ graph BT provider --> stream tool --> message permission --> message + permission --> config + security --> message + security --> config + router --> provider + router --> message + router --> config context_mgr --> message context_mgr --> provider - config --> permission - engine --> provider + engine --> router + engine --> security engine --> tool engine --> permission engine --> stream engine --> context_mgr session --> engine session --> stream + elf --> engine + elf --> router + elf --> session + hook --> message + hook --> tool + skill --> message + mcp --> tool + mcp --> config + plugin --> config + tasklearn --> router + tasklearn --> engine tui --> session tui --> stream + tui --> permission cmd --> tui cmd --> config cmd --> auth cmd --> session cmd --> provider cmd --> tool + cmd --> router + cmd --> security ``` ## Scope @@ -136,15 +172,19 @@ graph BT **In scope:** - Streaming chat with tool execution across 5+ LLM providers - Agentic loop (stream → tool calls → re-query → until done) -- Permission system for tool execution +- Security firewall with secret scanning, redaction, incognito mode +- Smart router with bandit-based multi-provider collaboration +- 6-mode permission system for tool execution - TUI 
and CLI pipe modes - TOML configuration with layering -- Context management and compaction -- Multi-agent (elfs) with per-elf provider routing -- Hook, skill, and MCP extensibility +- Context management and compaction (truncation + LLM summarization) +- Multi-agent (elfs) with router-integrated provider selection +- Hook, skill, MCP, and plugin extensibility +- Repetitive task learning and persistent tasks +- Session persistence (SQLite) and serve mode **Out of scope:** -- Web UI (future, via serve mode) +- Web UI (M15, via serve mode) - Cloud hosting / SaaS deployment - Training or fine-tuning models - IDE extension authoring (gnoma provides the backend, not the extension itself) diff --git a/docs/essentials/constraints.md b/docs/essentials/constraints.md index f24d342..29a8fa4 100644 --- a/docs/essentials/constraints.md +++ b/docs/essentials/constraints.md @@ -63,6 +63,35 @@ depends_on: [domain-model] - **Because:** User maintains the Mistral Go SDK, knows its internals. Good baseline — similar to OpenAI's API shape. Anthropic's unique features (thinking blocks, cache tokens) are better added as an M2 extension. - **Consequence:** Thinking block support tested later. Cache token tracking added with Anthropic provider. +### Security as core over plugin + +- **Chose:** Security firewall baked into gnoma core (`internal/security/`) +- **Over:** MCP-based security server (optional plugin) +- **Because:** Default-off security is no security. Every user should get secret scanning, unicode sanitization, and incognito mode out of the box. +- **Consequence:** Core binary is larger. False positives affect all users. Mitigated by configurable sensitivity and warn-first mode. 
+ +### Proper shell parsing over regex decomposition + +- **Chose:** `mvdan.cc/sh` (Go POSIX shell parser) for compound command decomposition +- **Over:** Regex-based `splitCommand()` (CC approach, caps at 50 subcommands) +- **Because:** AST-based parsing is accurate for nested structures, doesn't need arbitrary caps, handles edge cases CC's regex misses. +- **Consequence:** Additional dependency. But `mvdan.cc/sh` is well-maintained and widely used in the Go ecosystem. + +### Full 6 permission modes over simplified 3 + +- **Chose:** All 6 CC permission modes (default, acceptEdits, bypass, deny, plan, auto) +- **Over:** Simplified 3-mode system (allow, deny, prompt) +- **Because:** Users need fine-grained control. `acceptEdits` is crucial for trusting file tools while verifying bash. `plan` mode enables read-only exploration. `auto` mode uses router signals for smart defaults. +- **Consequence:** More complex permission system. Testing matrix is larger (6 modes × rule types × tool types). + +### Router split over monolithic + +- **Chose:** Router in two milestones: M4 (heuristic) + M9 (bandit learning) +- **Over:** Full router in one milestone +- **Because:** Engine needs routing abstraction early (M4). Bandit learning needs elf feedback (M7) that doesn't exist yet. Building everything at once blocks other milestones. +- **Consequence:** Two integration points. Heuristic → bandit migration must be seamless. + ## Changelog - 2026-04-02: Initial version +- 2026-04-03: Added trade-offs for security-as-core, shell parsing, 6 permission modes, router split diff --git a/docs/essentials/decisions/001-initial-decisions.md b/docs/essentials/decisions/001-initial-decisions.md index f02fcad..78d8f67 100644 --- a/docs/essentials/decisions/001-initial-decisions.md +++ b/docs/essentials/decisions/001-initial-decisions.md @@ -183,6 +183,153 @@ Multi-provider collaboration is a core feature and part of gnoma's identity. 
The **Positive:** Clear differentiator from all existing tools. Shapes architecture from day one. **Negative:** Elf system design must account for per-elf provider config from the start. +--- + +# ADR-007: Security Firewall as Core (Not Plugin) + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +gnoma needs to prevent secrets and sensitive data from leaking to LLM providers. Options: build it as an MCP server (plugin), or bake it into the core. + +## Decision + +Security firewall is a core component (`internal/security/`), not a plugin. It wraps all provider calls and tool results. Everyone benefits by default. + +## Alternatives Considered + +### Alternative A: MCP-based security server + +- **Pros:** Modular, replaceable, user can choose their own +- **Cons:** Users must opt-in. Default-off security is no security. MCP adds latency. + +## Consequences + +**Positive:** Every gnoma user gets secret scanning, unicode sanitization, and incognito mode out of the box. +**Negative:** Core binary is larger. False positives affect all users (mitigated by configurable sensitivity). + +--- + +# ADR-008: Router Split into Foundation + Advanced + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +The smart router is gnoma's core differentiator but is a massive system (arm registry, limit pools, task classification, bandit learning, feedback, ensemble strategies, state persistence). Building it all at once blocks other milestones. + +## Decision + +Split into M4 (foundation: arm registry, pools, task classifier, heuristic selection) and M9 (advanced: bandit, feedback, ensemble, persistence). M4 gives the engine a routing abstraction early. M9 adds learning after elfs provide real feedback signals. 
+ +## Alternatives Considered + +### Alternative A: Full router in one milestone + +- **Pros:** Complete system from day one +- **Cons:** Massive milestone, blocks TUI and other features, bandit needs elf feedback that doesn't exist yet + +## Consequences + +**Positive:** Engine routes from M4 onward. Heuristic selection is good enough for daily use. Bandit learning lands when feedback is available. +**Negative:** Two integration points instead of one. + +--- + +# ADR-009: Thompson Sampling for Multi-Armed Bandit + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +The router needs to learn which arm (provider+model) performs best per task type. Options: epsilon-greedy, UCB, LinUCB, Thompson Sampling. + +## Decision + +Discounted Thompson Sampling with per-arm, per-task-type Beta(α, β) distributions. No ML framework dependency — Beta distribution sampling via Marsaglia-Tsang Gamma (~30 lines of Go). + +## Alternatives Considered + +### Alternative A: LinUCB (contextual bandit) + +- **Pros:** Uses full task feature vector, theoretically optimal +- **Cons:** Matrix inversion per decision, complex implementation, marginal gain at v1 scale + +### Alternative B: Epsilon-greedy + +- **Pros:** Simplest to implement +- **Cons:** Fixed exploration rate, doesn't adapt, wastes budget on known-bad arms + +## Consequences + +**Positive:** Natural exploration via sampling. Handles non-stationarity with discounting. No external deps. Fast (<1ms per decision). +**Negative:** Per-task-type, not contextual — can't generalize across task clusters. Contextual bandit (v2) planned as future upgrade. + +--- + +# ADR-010: MCP Tool Replaceability via Priority Registry + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +MCP servers provide tools. Some users want MCP tools to replace gnoma's built-in tools (e.g., a custom file system tool). Need a mechanism for this. + +## Decision + +Tool registry has a priority system. 
MCP servers can declare `replace_default = "fs"` in config to replace all `fs.*` built-in tools. Resolution: MCP override > built-in. + +## Consequences + +**Positive:** Users can swap any built-in tool via config. No code changes needed. +**Negative:** MCP tool must implement the same contract (same parameter schema). Mismatch → runtime errors. + +--- + +# ADR-011: Task Learning as Late-Stage Feature (M11) + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +Task learning (detecting recurring patterns, suggesting persistent tasks) could be built early or late. + +## Decision + +M11 — after router advanced (M9) and persistence (M10). Task learning needs: (1) router feedback signals to understand quality, (2) session persistence to observe patterns across sessions, (3) enough real usage to detect meaningful repetitions. + +## Consequences + +**Positive:** Built on solid foundations. Feedback signals are real, not synthetic. +**Negative:** Users don't benefit from task learning until late in the roadmap. + +--- + +# ADR-012: Incognito Mode as Core Security Feature + +**Status:** Accepted +**Date:** 2026-04-03 + +## Context + +Users working with sensitive code need a way to prevent any data from being persisted, logged, or fed back to the learning system. + +## Decision + +Incognito mode is part of the security firewall (M3). When active: no session persistence, no router learning, no logging of content, optional local-only routing. Activated via `--incognito` flag or TUI toggle. Visual indicator in status bar. + +## Consequences + +**Positive:** Strong privacy guarantee. Users can work on sensitive projects without worrying about data leakage to disk or learning systems. +**Negative:** No learning improvement from incognito sessions. Router stays static. 
+ ## Changelog -- 2026-04-02: Initial decisions from architecture planning session +- 2026-04-02: Initial decisions (ADR-001 through ADR-006) +- 2026-04-03: Added ADR-007 through ADR-012 (security, router split, Thompson Sampling, MCP replaceability, task learning, incognito) diff --git a/docs/essentials/domain-model.md b/docs/essentials/domain-model.md index 590e8b6..bded259 100644 --- a/docs/essentials/domain-model.md +++ b/docs/essentials/domain-model.md @@ -79,15 +79,47 @@ classDiagram +Wait() ElfResult } + class Router { + +Select(task) RoutingDecision + +ClassifyTask(history) Task + } + + class Arm { + +ID: ArmID + +Provider: Provider + +ModelName: string + +IsLocal: bool + +Pools: []LimitPool + } + + class LimitPool { + +ID: string + +Kind: PoolKind + +TotalLimit: float64 + +Used: float64 + +Reserved: float64 + +ScarcityMultiplier() float64 + } + + class Firewall { + +ScanOutgoing(req) req + +ScanToolResult(result) result + +Incognito: IncognitoMode + } + Session "1" --> "1" Engine : owns - Engine "1" --> "1" Provider : uses + Engine "1" --> "1" Router : routes through + Engine "1" --> "1" Firewall : scans through + Router "1" --> "*" Arm : selects from + Arm "1" --> "1" Provider : wraps + Arm "1" --> "*" LimitPool : draws from Engine "1" --> "*" Tool : executes Engine "1" --> "*" Message : history Engine "1" --> "*" Turn : produces Message "1" --> "*" Content : contains Provider "1" --> "*" Stream : creates Stream "1" --> "*" Event : yields - Session "1" --> "*" Elf : spawns (future) + Session "1" --> "*" Elf : spawns Elf "1" --> "1" Engine : owns ``` @@ -98,7 +130,12 @@ classDiagram | gnoma | The host application — single binary, agentic coding assistant | `gnoma "list files"` | | Elf | A sub-agent (goroutine) with its own engine, history, and provider. Named after the elf owl. | Background elf exploring `auth/` on Ollama | | Session | A conversation boundary between UI and engine. Owns one engine, communicates via channels. 
| TUI session, CLI pipe session | -| Engine | The agentic loop orchestrator. Manages history, streams from provider, executes tools, loops until done. | Engine running on Mistral with 5 tools | +| Engine | The agentic loop orchestrator. Routes through firewall and router, executes tools, loops until done. | Engine running via router with 5 tools | +| Router | The smart routing layer. Classifies tasks, selects arms based on quality/cost/scarcity, learns from feedback. | Router picks local Qwen for boilerplate, Claude for security review | +| Arm | A provider+model pair registered in the router. Has capability metadata, pool memberships, and performance stats. | `ollama/mistral-7b`, `anthropic/claude-opus-4` | +| LimitPool | A shared resource budget that arms draw from. Tracks usage with optimistic reservation and scarcity multipliers. | Daily cost cap of 5 EUR shared across API providers | +| Firewall | Security layer that scans outgoing requests and tool results for sensitive data. Manages incognito mode. | Redacts `sk-ant-...` from prompts before sending to API | +| Incognito | Mode where no data is persisted, logged, or fed back to the router. Optional local-only routing. | User toggles incognito for sensitive work | | Provider | An LLM backend adapter. Translates gnoma types to/from SDK-specific types. | Anthropic provider, OpenAI-compat provider | | Stream | Pull-based iterator over streaming events from a provider. Unified interface across all SDKs. | `for s.Next() { e := s.Current() }` | | Event | A single streaming delta — text chunk, tool call fragment, thinking trace, or usage update. | `EventTextDelta{Text: "hello"}` | @@ -108,9 +145,11 @@ classDiagram | ToolResult | The output of executing a tool, correlated to a ToolCall by ID. | `{ToolCallID: "tc_1", Content: "file1.go\nfile2.go"}` | | Turn | The result of a complete agentic loop — may span multiple API calls and tool executions. 
| Turn with 3 rounds: stream → tool → stream → tool → stream → done | | Accumulator | Assembles a complete Response from a sequence of streaming Events. Shared across all providers. | Text fragments → complete assistant message | +| TaskType | Classification of a task for routing purposes. 10 types from boilerplate to security review. | `TaskGeneration`, `TaskRefactor`, `TaskSecurityReview` | | Callback | Function the engine calls for each streaming event, enabling real-time UI updates. | `func(evt stream.Event) { ch <- evt }` | | Round | A single API call within a Turn. A turn with 2 tool-use loops has 3 rounds. | Round 1: initial query. Round 2: after tool results. | | Routing | Directing tasks to different providers based on capability, cost, or latency rules. | Complex reasoning → Claude, quick lookups → local Qwen | +| PersistentTask | A user-confirmed recurring task pattern saved for re-execution. | `/task release v1.2.0` runs the saved release workflow | ## Invariants diff --git a/docs/essentials/milestones.md b/docs/essentials/milestones.md index 2483154..936425b 100644 --- a/docs/essentials/milestones.md +++ b/docs/essentials/milestones.md @@ -1,24 +1,46 @@ --- essential: milestones status: complete -last_updated: 2026-04-02 +last_updated: 2026-04-03 project: gnoma depends_on: [vision] --- # Milestones +## Overview + +| # | Name | Core Deliverable | Deps | +|---|------|-----------------|------| +| M1 | Core Engine | Pipe mode, Mistral, tools, agentic loop | — | +| M2 | Multi-Provider | All providers, config, dynamic switching | M1 | +| M3 | Security Firewall | Request/response scanning, redaction, incognito | M2 | +| M4 | Router Foundation | Arm registry, pools, task classifier, heuristic selection | M2 | +| M5 | TUI | Bubble Tea, 6 permission modes, config screen | M3, M4 | +| M6 | Context Intelligence | Local tokenizer, full compaction (truncate + summarize) | M5 | +| M7 | Elfs | Router-integrated sub-agents, parallel work | M4, M6 | +| M8 | 
Extensibility | Hooks, skills, MCP client, MCP tool replaceability, plugins | M7 | +| M9 | Router Advanced | Bandit core, feedback, ensemble strategies, state persistence | M7 | +| M10 | Persistence & Serve | SQLite sessions, serve mode, coordinator | M7 | +| M11 | Task Learning | Pattern recognition, task suggestions, persistent tasks | M9, M10 | +| M12 | Thinking & Structured Output | Thinking modes, schema validation | M2 | +| M13 | Auth | OAuth PKCE, keyring, multi-account | M5 | +| M14 | Observability | Feature flags, telemetry, cost dashboards | M10 | +| M15 | Web UI | `gnoma web` CLI subcommand, browser UI via serve mode | M10 | + +--- + ## M1: Core Engine (MVP) -**Scope:** First working assistant. CLI pipe mode. Mistral as reference provider. Bash + file tools. No TUI, no permissions, no config file. +**Scope:** First working assistant. CLI pipe mode. Mistral as reference provider. Bash + file tools (with 7 critical security checks). No TUI, no permissions, no config file. **Deliverables:** -- [ ] Architecture docs in `docs/essentials/` +- [x] Architecture docs in `docs/essentials/` - [ ] Foundation types (`internal/message/`) - [ ] Streaming abstraction (`internal/stream/`) - [ ] Provider interface + Mistral adapter -- [ ] Tool system: bash, fs.read, fs.write, fs.edit, fs.glob, fs.grep +- [ ] Tool system: bash (with security checks), fs.read, fs.write, fs.edit, fs.glob, fs.grep - [ ] Engine agentic loop (stream → tool → re-query → done) - [ ] CLI pipe mode (`echo "list files" | gnoma`) @@ -26,152 +48,221 @@ depends_on: [vision] ## M2: Multi-Provider -**Scope:** All remaining providers. Config file. Dynamic provider switching. +**Scope:** All remaining providers. TOML config with layered loading. Dynamic provider switching. 
**Deliverables:** +- [ ] TOML config system (defaults → user → project → env → flags) +- [ ] API key resolution from env vars and config - [ ] Anthropic provider (streaming + tool use + thinking blocks) - [ ] OpenAI provider (streaming + tool use) -- [ ] Google provider (streaming + function calling) +- [ ] Google provider (streaming + function calling, goroutine bridge) - [ ] OpenAI-compat for Ollama and llama.cpp -- [ ] TOML config (global + project + env + flags) -- [ ] `/model provider/model` switching mid-session +- [ ] `--provider` / `--model` flag switching -**Exit criteria:** Chat with any configured provider via CLI pipe. Switch providers mid-session. +**Exit criteria:** `echo "hello" | gnoma --provider openai` works. All 5+ providers functional. -## M3: TUI +## M3: Security Firewall -**Scope:** Interactive terminal UI. Permission system. +**Scope:** Core security layer built into gnoma. Scans outgoing LLM requests and incoming tool results for sensitive data. Redacts or blocks. Incognito mode. **Deliverables:** -- [ ] Bubble Tea TUI: chat panel, input box, streaming output -- [ ] Status bar (provider, model, token usage) -- [ ] Permission system (allow / deny / prompt modes) -- [ ] Permission dialog overlay +- [ ] Secret scanner (gitleaks-derived, 40+ regex patterns, Shannon entropy detection) +- [ ] Unicode sanitization (NFKC + Cf/Co/Cn stripping, recursive on nested structs) +- [ ] Redactor (replace matched groups with `[REDACTED]`, preserve context) +- [ ] Configurable rules (regex patterns, action: redact/block/warn) +- [ ] Remaining bash security checks (checks 8-23 from CC bashSecurity.ts) +- [ ] Incognito mode: no persistence, no learning, no logging, optional local-only routing +- [ ] `--incognito` CLI flag + +**Exit criteria:** Provider requests with embedded API keys get redacted. Incognito suppresses all persistence. Unicode attack vectors sanitized. 
+ +## M4: Router Foundation + +**Scope:** Arm registry, limit pools, task classification, heuristic selection. Engine switches from direct provider calls to `router.Select()`. + +**Deliverables:** + +- [ ] Arm type (provider+model pair) with capability introspection +- [ ] Limit pools (RPM, RPD, tokens/day, cost caps, custom units) +- [ ] Pool tracker with optimistic reservation and scarcity multipliers +- [ ] Task classifier (10 types: Boilerplate, Generation, Refactor, Review, UnitTest, Planning, Orchestration, SecurityReview, Debug, Explain) +- [ ] Complexity scoring and value scoring +- [ ] Heuristic arm selection (score = quality × value / effective_cost) +- [ ] Background provider discovery (poll ollama, llama.cpp, API providers) +- [ ] Engine integration: `router.Select()` replaces direct provider calls + +**Exit criteria:** Engine routes tasks through router. Limit pools track consumption. Task classification works for 10 types. + +## M5: TUI + +**Scope:** Interactive terminal UI. Full 6-mode permission system. Session management. In-app config. Incognito toggle. 
+ +**Deliverables:** + +- [ ] Permission system with all 6 modes: + - `default` — prompt for each tool invocation + - `acceptEdits` — auto-allow file ops, prompt for bash/destructive + - `bypass` — allow everything + - `deny` — deny all unless explicit allow rule + - `plan` — read-only tools only + - `auto` — router task classification + tool risk scoring +- [ ] Permission rules with compound bash command decomposition (via `mvdan.cc/sh` AST) +- [ ] 7-step permission decision flow (deny gates → tool check → safety → mode → allow → passthrough → hooks) +- [ ] Bubble Tea TUI: chat panel, input, streaming output +- [ ] Status bar (provider, model, tokens, incognito indicator) +- [ ] Permission prompt overlay - [ ] Model picker overlay -- [ ] Input history (up/down) +- [ ] In-app config editor (`/config` command) +- [ ] Incognito toggle (`/incognito` command) +- [ ] Session management (channel-based) -**Exit criteria:** Launch TUI, chat interactively, tools execute with permission prompts. +**Exit criteria:** Launch TUI, chat interactively, 6 permission modes work, config editable in-app, incognito toggleable. -## M4: Context Intelligence +## M6: Context Intelligence -**Scope:** Long sessions. Token tracking. Compaction. Local tokenizer. +**Scope:** Long sessions. Local tokenizer. Full compaction with both truncation and LLM summarization. 
**Deliverables:** -- [ ] Local tokenizer for accurate token counting without provider round-trips -- [ ] Token tracker (cumulative usage, OK/warning/critical states) -- [ ] Truncate compaction (drop old messages, keep system + recent) -- [ ] Summarize compaction (LLM summarizes dropped messages) -- [ ] Compact boundaries (transaction markers for crash recovery) -- [ ] Deferred tool loading (non-essential tools loaded on demand) -- [ ] Result persistence (large tool outputs written to disk) +- [ ] Local tokenizer for accurate token counting +- [ ] Token tracker with warning states (OK / Warning / Critical) +- [ ] TruncateStrategy: drop oldest, preserve system + recent +- [ ] SummarizeStrategy: spawn compaction elf, LLM-powered summary, image stripping, boundary messages +- [ ] Auto-compaction triggers (threshold-based, reactive on 413, circuit breaker after 3 failures) +- [ ] Pre/post compact hooks +- [ ] Tool result persistence (>50KB → disk, 2KB preview + filepath) +- [ ] Deferred tool loading (`ShouldDefer()`, full schema on demand) +- [ ] Post-compact restoration budget (50K total, 5K/file, 25K/skill) -**Exit criteria:** 100+ turn conversation stays coherent within token budget. Local token counting matches provider reports within 5%. +**Exit criteria:** 100+ turn conversation stays coherent. Summarization produces useful summaries. Token counting within 5% of provider. -## M5: Elfs (Multi-Agent + Multi-Provider Routing) +## M7: Elfs (Router-Integrated) -**Scope:** Sub-agents on different providers. Parallel work. Provider routing. +**Scope:** Sub-agents using router for provider selection. Parallel work. Feedback to router. 
**Deliverables:** -- [ ] Elf spawning (`Engine.SpawnElf` with per-elf provider config) -- [ ] Background elfs (independent goroutine + engine) +- [ ] Elf interface + SyncElf + BackgroundElf implementations +- [ ] ElfManager: spawn, monitor, cancel, collect results +- [ ] Router-integrated spawning (`router.Select()` picks arm per elf) - [ ] Parent ↔ elf communication via typed channels -- [ ] Concurrent tool execution (read-only parallel, writes sequential) -- [ ] Provider routing rules (route by capability, cost, latency) — research needed -- [ ] Coordinator dispatches tasks to elfs on different providers +- [ ] Concurrent tool execution (read-only parallel via errgroup, writes serial) +- [ ] Elf results feed back to router as quality signals +- [ ] Coordinator mode: orchestrator dispatches to worker elfs -**Exit criteria:** Coordinator on Claude spawns research elf on local Qwen + review elf on OpenAI, collects and synthesizes results. +**Exit criteria:** Parent spawns 3 background elfs on different providers (chosen by router), collects and synthesizes results. -## M6: Extensibility +## M8: Extensibility -**Scope:** Hooks, skills, MCP, plugin foundation. +**Scope:** Hooks, skills, MCP client with tool replaceability, plugin system. 
**Deliverables:** -- [ ] Hook system (PreToolUse / PostToolUse, stdin/stdout protocol) -- [ ] Skill loading (`.gnoma/skills/*.md` with frontmatter) -- [ ] MCP client (JSON-RPC over stdio, tool discovery) -- [ ] Plugin foundation (manifest, install, lifecycle) +- [ ] Hook system: PreToolUse, PostToolUse, SessionStart/End, PreCompact, Stop +- [ ] Hook protocol: stdin JSON, stdout JSON, exit codes (0=allow, 2=deny) +- [ ] Hook command types: command (shell), prompt (LLM), agent (spawn elf) +- [ ] Skill loading from .gnoma/skills/, ~/.config/gnoma/skills/, bundled, plugins +- [ ] Skill frontmatter: YAML (name, description, whenToUse, allowedTools, paths) +- [ ] MCP client: JSON-RPC over stdio, tool discovery +- [ ] MCP tool naming: `mcp__{server}__{tool}` +- [ ] MCP tool replaceability: `replace_default` config swaps built-in tools +- [ ] Plugin system: plugin.json manifest, install/enable/disable lifecycle -**Exit criteria:** MCP server tools appear in gnoma. Skills invocable by model. Hook logs all bash commands. +**Exit criteria:** MCP tools appear in gnoma. `replace_default` swaps built-ins. Skills invocable. Hooks fire on tool use. -## M7: Persistence & Serve +## M9: Router Advanced -**Scope:** Session persistence via SQLite. Serve mode for external clients. Coordinator mode. +**Scope:** Full bandit learning. Feedback collection. Ensemble execution strategies. State persistence. 
**Deliverables:** -- [ ] Session persistence with SQLite (save/restore conversations across restarts) -- [ ] Serve mode (Unix socket listener, external UI clients) -- [ ] Coordinator mode (orchestrator dispatches to worker elfs) +- [ ] Discounted Thompson Sampling (per-arm, per-task-type Beta distributions) +- [ ] Feedback collection: implicit (acceptance, edit distance, escalation) + explicit +- [ ] Delayed attribution for orchestration/planning tasks +- [ ] Execution strategies: SingleArm, CascadeWithReview, ParallelEnsemble, MultiRoundSynthesis +- [ ] Strategy selection as learned routing decision +- [ ] Background arm benchmarking (TTFT, tok/s) +- [ ] State persistence (gob, versioned schema, atomic writes, CRC32) +- [ ] Cold start: shipped default.state with embedded priors +- [ ] Heuristic fallback for <5 observations per arm-task pair -**Exit criteria:** Resume yesterday's conversation. VS Code extension connects via serve mode. Coordinator parallelizes subtasks. +**Exit criteria:** Bandit converges after ~50 observations. Ensemble outperforms single-arm on complex tasks. State persists across restarts. -## M8: Thinking & Structured Output +## M10: Persistence & Serve -**Scope:** Extended thinking support across providers. Schema-validated structured output. +**Scope:** SQLite session persistence. Serve mode. Coordinator mode. + +**Deliverables:** + +- [ ] SQLite session storage (messages, parentUuid chain, tombstones) +- [ ] Session memory: background elf extracts notes from conversation +- [ ] Incognito enforcement: sessions NOT persisted +- [ ] Serve mode: Unix socket listener, spawn session goroutine per client +- [ ] Coordinator mode: orchestrator dispatches to restricted worker elfs + +**Exit criteria:** Resume yesterday's conversation. External client connects via serve mode. + +## M11: Task Learning + +**Scope:** Detect recurring task patterns. Suggest persistent tasks. Refinement loop. 
+ +**Deliverables:** + +- [ ] Pattern detector: observe turn sequences, identify repeats (≥3 times) +- [ ] Task suggestion UX: prompt user to save as persistent task +- [ ] Persistent task definitions: parameterized sequences, stored in .gnoma/tasks/ or ~/.config/gnoma/tasks/ +- [ ] `/task [args]` execution command +- [ ] Router feedback integration: learn which arm works best per task step +- [ ] Task refinement: re-split tasks, measure improvement + +**Exit criteria:** gnoma suggests a persistent task after 3+ repetitions. `/task release v1.2.0` executes a saved workflow. + +## M12: Thinking & Structured Output **Deliverables:** - [ ] Thinking mode (disabled / enabled with budget / adaptive) -- [ ] Thinking block streaming and display in TUI +- [ ] Thinking block streaming and TUI display - [ ] Structured output with JSON schema validation - [ ] Retry logic for schema validation failures -**Exit criteria:** Extended thinking with budget works on Anthropic. Structured output validates against schema on all providers that support it. - -## M9: Auth - -**Scope:** OAuth 2.0 + PKCE for cloud providers. Credential management. +## M13: Auth **Deliverables:** -- [ ] OAuth 2.0 + PKCE flow (browser redirect → callback → token exchange) -- [ ] Token refresh (proactive, before expiry) -- [ ] OS keyring integration for secure credential storage +- [ ] OAuth 2.0 + PKCE flow (browser → callback → token exchange) +- [ ] Proactive token refresh (before expiry) +- [ ] OS keyring integration for credential storage - [ ] Multi-account support per provider -**Exit criteria:** `gnoma login anthropic` opens browser, completes OAuth flow, stores token in keyring. Automatic refresh works. - -## M10: Observability - -**Scope:** Feature flags. Opt-in telemetry and analytics. 
+## M14: Observability **Deliverables:** -- [ ] Feature flag system (local config + optional remote evaluation) +- [ ] Feature flag system (local config + optional remote) - [ ] Opt-in analytics (event queue, local-only by default) - [ ] Usage dashboards (token spend, provider usage, tool frequency) - [ ] Cost tracking per provider/model -**Exit criteria:** Feature flags gate experimental features. User can view their token spend breakdown. Analytics disabled by default. - -## M11: Web UI - -**Scope:** Browser-based UI as alternative to TUI. Requires serve mode (M7). +## M15: Web UI **Deliverables:** -- [ ] `gnoma web` CLI subcommand (or `gnoma --web`) starts local web server -- [ ] Web UI connects to serve mode backend +- [ ] `gnoma web` CLI subcommand starts local web server +- [ ] Connects to serve mode backend (M10 prerequisite) - [ ] Chat interface with streaming, tool output, permission prompts -- [ ] Responsive design for desktop browsers - -**Exit criteria:** `gnoma web` opens browser, full chat with streaming and tool execution. Serve mode required as prerequisite. ## Future -Ideas not yet committed: - - Voice input/output via provider audio APIs - Collaborative sessions (multiple humans + elfs) - Plugin marketplace - Remote agent execution +- Federated learning for router priors (opt-in, anonymized) ## Changelog -- 2026-04-02: Initial version — M1-M6 -- 2026-04-02: Split M2 into providers (M2) and TUI (M3). Added M8-M11 for thinking, auth, observability, web UI. Local tokenizer in M4. SQLite for session persistence in M7. +- 2026-04-02: Initial version (M1-M11) +- 2026-04-03: Restructured to M1-M15. Split providers/TUI. Added Security (M3), Router Foundation (M4), Router Advanced (M9), Task Learning (M11). Full 6 permission modes. Full compaction. CC pattern integration. 
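Note on the MCP tool naming deliverable (`mcp__{server}__{tool}`): the sketch below shows one way the qualified name could be built and parsed back. `MCPToolName` and `ParseMCPToolName` are hypothetical helpers, not part of gnoma. The rule that the first `__` after the prefix separates server from tool is an assumption, and it only holds if server names never contain `__`.

```go
package main

import (
	"fmt"
	"strings"
)

// mcpPrefix is the literal prefix from the naming scheme mcp__{server}__{tool}.
const mcpPrefix = "mcp__"

// MCPToolName builds the qualified name for a tool exposed by an MCP server.
// Hypothetical helper for illustration only.
func MCPToolName(server, tool string) string {
	return mcpPrefix + server + "__" + tool
}

// ParseMCPToolName splits a qualified name back into server and tool.
// Assumption: server names contain no "__", so the first "__" after the
// prefix is treated as the separator; the tool part may contain "__".
func ParseMCPToolName(name string) (server, tool string, ok bool) {
	if !strings.HasPrefix(name, mcpPrefix) {
		return "", "", false
	}
	server, tool, ok = strings.Cut(strings.TrimPrefix(name, mcpPrefix), "__")
	if !ok || server == "" || tool == "" {
		return "", "", false
	}
	return server, tool, true
}

func main() {
	name := MCPToolName("github", "create_issue")
	fmt.Println(name)

	server, tool, ok := ParseMCPToolName(name)
	fmt.Println(server, tool, ok)

	// A built-in tool name has no MCP prefix and is rejected.
	_, _, ok = ParseMCPToolName("bash")
	fmt.Println(ok)
}
```

If server names may contain `__`, the parser would need the registry of known servers to disambiguate. That would be worth settling before `replace_default` relies on these names.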
diff --git a/docs/essentials/risks.md b/docs/essentials/risks.md index 3d76c7b..392a12b 100644 --- a/docs/essentials/risks.md +++ b/docs/essentials/risks.md @@ -19,16 +19,24 @@ depends_on: [] | R-007 | Multi-provider routing complexity — coordinating elfs on different providers with different capabilities | High | Design routing interface early (M4), start simple (manual provider assignment), add rules incrementally | Open | | R-008 | Context compaction coherence — summarization may lose critical details | Medium | Truncation as safe default, summarization opt-in, compact boundaries for recovery | Open | | R-009 | Permission prompt UX in pipe mode — no TUI for interactive prompts | Low | Default to `allow` or `deny` in pipe mode, require explicit flag | Open | +| R-010 | Router complexity — bandit tuning, cold start problem | High | Ship default.state with embedded priors, heuristic fallback for <5 observations | Open | +| R-011 | Security false positives — blocking legitimate content | Medium | Warn-first mode, user override per-pattern, configurable sensitivity | Open | +| R-012 | Feedback attribution — delayed/noisy signals for orchestration tasks | Medium | Neutral default for missing signals, ensemble contribution rank as strong signal | Open | +| R-013 | Task learning privacy — pattern data persistence | Low | Patterns stored locally only, cleared in incognito mode | Open | +| R-014 | Ensemble synthesis quality — depends heavily on synthesis prompt | Medium | Invest in prompt engineering, A/B test with polisher arm | Open | +| R-015 | Shell parser dependency — `mvdan.cc/sh` for compound command decomposition | Low | Well-maintained Go package, fallback to regex-based decomposition if needed | Open | ## Open Questions - [ ] How should routing rules be expressed in config? Per-task rules, model capability tags, cost-based? — needs research before M5 - [ ] Which local tokenizer library to use? 
(tiktoken port, sentencepiece, or provider-specific) -- [ ] Serve mode protocol — choose what fits best when implementing M7 -- [x] ~~Should gnoma embed a tokenizer?~~ → Yes, include local tokenizer (M4) -- [x] ~~Session persistence format?~~ → SQLite (M7) +- [ ] Serve mode protocol — choose what fits best when implementing M10 +- [ ] What automated quality evaluation to use for router feedback? (compile check, linter, self-consistency, small local judge model) +- [x] ~~Should gnoma embed a tokenizer?~~ → Yes, include local tokenizer (M6) +- [x] ~~Session persistence format?~~ → SQLite (M10) - [x] ~~Mistral SDK as long-term reference?~~ → Yes for now, revisit after M2 ## Changelog - 2026-04-02: Initial version +- 2026-04-03: Added R-010 through R-015 for router, security, feedback, task learning, shell parser
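Note on R-010: the mitigation pairs per-arm Beta beliefs with a heuristic fallback for fewer than 5 observations. The sketch below illustrates only the discounted Beta update and the fallback check. It omits the sampling step, so it is not a full Thompson Sampling implementation. The `Arm` type, its field names, and the discount value 0.5 are assumptions chosen for illustration; only the threshold of 5 comes from the text above.

```go
package main

import "fmt"

// minObservations mirrors the "<5 observations" fallback rule in R-010.
const minObservations = 5

// Arm is a hypothetical record of the Beta(Alpha, Beta) belief about one
// arm's success rate for one task type.
type Arm struct {
	Alpha        float64
	Beta         float64
	Observations int
}

// Update decays the existing evidence by discount (0 < discount <= 1) and
// then adds the new reward, which is expected to lie in [0, 1].
func (a *Arm) Update(reward, discount float64) {
	a.Alpha = discount*a.Alpha + reward
	a.Beta = discount*a.Beta + (1 - reward)
	a.Observations++
}

// Mean returns the posterior mean of the Beta belief.
func (a Arm) Mean() float64 {
	return a.Alpha / (a.Alpha + a.Beta)
}

// UseHeuristic reports whether the arm has too few observations for the
// learned belief to be trusted, so a heuristic router should decide instead.
func (a Arm) UseHeuristic() bool {
	return a.Observations < minObservations
}

func main() {
	arm := Arm{Alpha: 1, Beta: 1} // uniform prior, an assumption

	for i := 0; i < 5; i++ {
		arm.Update(1.0, 0.5) // five successes with an illustrative discount
		fmt.Println(arm.Observations, arm.Mean(), arm.UseHeuristic())
	}
}
```

A shipped `default.state` could seed `Alpha` and `Beta` with non-uniform priors. Whether seeded priors should count toward the observation threshold is a design choice this sketch leaves open.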