feat: Ollama/gemma4 compat — /init flow, stream filter, safety fixes

provider/openai:
- Fix doubled tool call args (argsComplete flag): Ollama sends complete
  args in the first streaming chunk then repeats them as delta, causing
  doubled JSON and 400 errors in elfs
- Handle fs: prefix (gemma4 uses fs:grep instead of fs.grep)
- Add Reasoning field support for Ollama thinking output
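
The argsComplete fix above can be sketched roughly as follows. This is a minimal illustration, not the actual provider/openai code: the toolCallAccumulator type and its methods are hypothetical, and treating "the buffered text already parses as valid JSON" as the completeness signal is an assumption (the message only names the flag, not how it is set).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// toolCallAccumulator collects streamed tool-call argument fragments.
// Hypothetical type for illustration only.
type toolCallAccumulator struct {
	args         strings.Builder
	argsComplete bool // true once the buffered text is a complete JSON value
}

// Append adds a streamed fragment unless the arguments are already complete.
// Assumption: tool arguments are a JSON object, so no proper prefix of the
// final text is itself valid JSON.
func (a *toolCallAccumulator) Append(fragment string) {
	if a.argsComplete || fragment == "" {
		return
	}
	a.args.WriteString(fragment)
	if json.Valid([]byte(a.args.String())) {
		a.argsComplete = true
	}
}

// Args returns the accumulated argument text.
func (a *toolCallAccumulator) Args() string { return a.args.String() }

func main() {
	var acc toolCallAccumulator
	full := `{"pattern":"TODO","path":"."}`
	acc.Append(full) // first chunk already carries the complete args
	acc.Append(full) // the same args repeated as a delta are dropped
	fmt.Println(acc.Args())
}
```

Ordinary incremental streams still work under this sketch, because the flag only flips once the fragments add up to a complete object.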

cmd/gnoma:
- Early TTY detection so logger is created with correct destination
  before any component gets a reference to it (fixes slog WARN bleed
  into TUI textarea)

permission:
- Exempt spawn_elfs and agent tools from safety scanner: elf prompt
  text may legitimately mention .env/.ssh/credentials patterns and
  should not be blocked

tui/app:
- /init retry chain: no-tool-calls → spawn_elfs nudge → write nudge
  (ask for plain text output) → TUI fallback write from streamBuf
- looksLikeAgentsMD + extractMarkdownDoc: validate and clean fallback
  content before writing (reject refusals, strip narrative preambles)
- Collapse thinking output to 3 lines; ctrl+o to expand (live stream
  and committed messages)
- Stream-level filter for model pseudo-tool-call blocks: suppresses
  <<tool_code>>...</tool_code>> and <<function_call>>...<tool_call|>
  from entering streamBuf across chunk boundaries
- sanitizeAssistantText regex covers both block formats
- Reset streamFilterClose at every turn start
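
A stream-level filter of the kind described above could look roughly like this. It is a sketch under stated assumptions, not the tui/app implementation: the blockFilter type, its fields, and its methods are hypothetical names. Only the two marker pairs come from the message. The idea is to hold back any tail of a chunk that might be the start of a marker, so a marker split across chunks is still caught.

```go
package main

import (
	"fmt"
	"strings"
)

// Marker pairs taken from the commit message.
var blocks = []struct{ open, close string }{
	{"<<tool_code>>", "</tool_code>>"},
	{"<<function_call>>", "<tool_call|>"},
}

// blockFilter is a hypothetical streaming filter that drops pseudo-tool-call blocks.
type blockFilter struct {
	pending  string // held-back tail that may be the start of a marker
	closeTag string // non-empty while inside a suppressed block
}

// partialSuffix returns the length of the longest proper prefix of marker
// that is also a suffix of s.
func partialSuffix(s, marker string) int {
	limit := len(marker) - 1
	if len(s) < limit {
		limit = len(s)
	}
	for n := limit; n > 0; n-- {
		if strings.HasSuffix(s, marker[:n]) {
			return n
		}
	}
	return 0
}

// Write consumes one chunk and returns the text that is safe to display.
func (f *blockFilter) Write(chunk string) string {
	f.pending += chunk
	var out strings.Builder
	for {
		if f.closeTag != "" {
			i := strings.Index(f.pending, f.closeTag)
			if i < 0 {
				// Still inside the block: keep only a possible partial close marker.
				keep := partialSuffix(f.pending, f.closeTag)
				f.pending = f.pending[len(f.pending)-keep:]
				return out.String()
			}
			f.pending = f.pending[i+len(f.closeTag):]
			f.closeTag = ""
			continue
		}
		best, bestLen, bestClose := -1, 0, ""
		for _, b := range blocks {
			if i := strings.Index(f.pending, b.open); i >= 0 && (best < 0 || i < best) {
				best, bestLen, bestClose = i, len(b.open), b.close
			}
		}
		if best < 0 {
			// No full open marker: emit everything except a possible partial one.
			hold := 0
			for _, b := range blocks {
				if n := partialSuffix(f.pending, b.open); n > hold {
					hold = n
				}
			}
			cut := len(f.pending) - hold
			out.WriteString(f.pending[:cut])
			f.pending = f.pending[cut:]
			return out.String()
		}
		out.WriteString(f.pending[:best])
		f.pending = f.pending[best+bestLen:]
		f.closeTag = bestClose
	}
}

// Reset clears all state, e.g. at the start of a new turn.
func (f *blockFilter) Reset() { f.pending, f.closeTag = "", "" }

func main() {
	var f blockFilter
	var shown strings.Builder
	for _, chunk := range []string{"Hello <<tool", "_code>>print(1)</tool_", "code>> world"} {
		shown.WriteString(f.Write(chunk))
	}
	fmt.Printf("%q\n", shown.String())
}
```

Resetting the state at turn start matters because an unterminated block in one turn would otherwise keep suppressing output in the next.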
2026-04-05 19:24:51 +02:00
parent 14b88cadcc
commit cb2d63d06f
51 changed files with 2855 additions and 353 deletions


@@ -99,17 +99,19 @@ type QualityThreshold struct {
     Target float64 // ideal
 }
+// DefaultThresholds are calibrated for M4 heuristic scores (range ~0–0.85).
+// M9 will replace these with bandit-derived values once quality data accumulates.
 var DefaultThresholds = map[TaskType]QualityThreshold{
-    TaskBoilerplate: {0.50, 0.70, 0.80},
-    TaskGeneration: {0.60, 0.75, 0.88},
-    TaskRefactor: {0.65, 0.78, 0.90},
-    TaskReview: {0.70, 0.82, 0.92},
-    TaskUnitTest: {0.60, 0.75, 0.85},
-    TaskPlanning: {0.75, 0.88, 0.95},
-    TaskOrchestration: {0.80, 0.90, 0.96},
-    TaskSecurityReview: {0.88, 0.94, 0.99},
-    TaskDebug: {0.65, 0.80, 0.90},
-    TaskExplain: {0.55, 0.72, 0.85},
+    TaskBoilerplate: {0.40, 0.55, 0.70}, // any capable arm works
+    TaskGeneration: {0.45, 0.60, 0.75},
+    TaskRefactor: {0.50, 0.65, 0.78},
+    TaskReview: {0.55, 0.68, 0.80},
+    TaskUnitTest: {0.45, 0.60, 0.75},
+    TaskPlanning: {0.60, 0.72, 0.82},
+    TaskOrchestration: {0.65, 0.75, 0.83},
+    TaskSecurityReview: {0.70, 0.78, 0.84}, // requires thinking or large context window
+    TaskDebug: {0.50, 0.65, 0.78},
+    TaskExplain: {0.40, 0.55, 0.72},
 }
 // ClassifyTask infers a TaskType from the user's prompt using keyword heuristics.