feat: Ollama/gemma4 compat — /init flow, stream filter, safety fixes

provider/openai:
- Fix doubled tool call args (argsComplete flag): Ollama sends complete
  args in the first streaming chunk then repeats them as delta, causing
  doubled JSON and 400 errors in elfs
- Handle fs: prefix (gemma4 uses fs:grep instead of fs.grep)
- Add Reasoning field support for Ollama thinking output

cmd/gnoma:
- Early TTY detection so logger is created with correct destination
  before any component gets a reference to it (fixes slog WARN bleed
  into TUI textarea)

permission:
- Exempt spawn_elfs and agent tools from safety scanner: elf prompt text
  may legitimately mention .env/.ssh/credentials patterns and should not
  be blocked

tui/app:
- /init retry chain: no-tool-calls → spawn_elfs nudge → write nudge
  (ask for plain text output) → TUI fallback write from streamBuf
- looksLikeAgentsMD + extractMarkdownDoc: validate and clean fallback
  content before writing (reject refusals, strip narrative preambles)
- Collapse thinking output to 3 lines; ctrl+o to expand (live stream and
  committed messages)
- Stream-level filter for model pseudo-tool-call blocks: suppresses
  <<tool_code>>...</tool_code>> and <<function_call>>...<tool_call|>
  from entering streamBuf across chunk boundaries
- sanitizeAssistantText regex covers both block formats
- Reset streamFilterClose at every turn start
2026-04-05 19:24:51 +02:00
parent 14b88cadcc
commit cb2d63d06f
51 changed files with 2855 additions and 353 deletions
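
The doubled-args fix described in the commit message can be illustrated with a small sketch. This is not the code from provider/openai: the type toolCallAccumulator and its methods are hypothetical, and using JSON validity as the signal that arguments are complete is an assumption made for illustration. Only the argsComplete flag name comes from the message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// toolCallAccumulator is a hypothetical helper that collects the streamed
// argument fragments of a single tool call.
type toolCallAccumulator struct {
	buf          strings.Builder
	argsComplete bool // set once the buffered arguments parse as valid JSON
}

// Append adds a fragment unless the arguments are already complete.
// Assumption: a buffer that parses as valid JSON is treated as finished,
// so any later fragment is considered a repeat and dropped.
func (a *toolCallAccumulator) Append(fragment string) {
	if a.argsComplete {
		return
	}
	a.buf.WriteString(fragment)
	if json.Valid([]byte(a.buf.String())) {
		a.argsComplete = true
	}
}

// Args returns the accumulated argument JSON.
func (a *toolCallAccumulator) Args() string { return a.buf.String() }

func main() {
	// Ollama-style stream: full args in the first chunk, then repeated.
	ollama := &toolCallAccumulator{}
	ollama.Append(`{"pattern":"TODO"}`)
	ollama.Append(`{"pattern":"TODO"}`)
	fmt.Println(ollama.Args()) // prints {"pattern":"TODO"}

	// Incremental stream: args split across chunks still accumulate.
	split := &toolCallAccumulator{}
	split.Append(`{"pattern":`)
	split.Append(`"TODO"}`)
	fmt.Println(split.Args()) // prints {"pattern":"TODO"}
}
```

Without the flag, the first case would produce two concatenated JSON objects, which matches the doubled-JSON symptom the message describes.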
--- a/internal/router/task.go
+++ b/internal/router/task.go
@@ -99,17 +99,19 @@ type QualityThreshold struct {
 	Target     float64 // ideal
 }

+// DefaultThresholds are calibrated for M4 heuristic scores (range ~0–0.85).
+// M9 will replace these with bandit-derived values once quality data accumulates.
 var DefaultThresholds = map[TaskType]QualityThreshold{
-	TaskBoilerplate:    {0.50, 0.70, 0.80},
-	TaskGeneration:     {0.60, 0.75, 0.88},
-	TaskRefactor:       {0.65, 0.78, 0.90},
-	TaskReview:         {0.70, 0.82, 0.92},
-	TaskUnitTest:       {0.60, 0.75, 0.85},
-	TaskPlanning:       {0.75, 0.88, 0.95},
-	TaskOrchestration:  {0.80, 0.90, 0.96},
-	TaskSecurityReview: {0.88, 0.94, 0.99},
-	TaskDebug:          {0.65, 0.80, 0.90},
-	TaskExplain:        {0.55, 0.72, 0.85},
+	TaskBoilerplate:    {0.40, 0.55, 0.70}, // any capable arm works
+	TaskGeneration:     {0.45, 0.60, 0.75},
+	TaskRefactor:       {0.50, 0.65, 0.78},
+	TaskReview:         {0.55, 0.68, 0.80},
+	TaskUnitTest:       {0.45, 0.60, 0.75},
+	TaskPlanning:       {0.60, 0.72, 0.82},
+	TaskOrchestration:  {0.65, 0.75, 0.83},
+	TaskSecurityReview: {0.70, 0.78, 0.84}, // requires thinking or large context window
+	TaskDebug:          {0.50, 0.65, 0.78},
+	TaskExplain:        {0.40, 0.55, 0.72},
 }

 // ClassifyTask infers a TaskType from the user's prompt using keyword heuristics.