The SLM had two intended jobs — classify every prompt and execute the
small ones itself — but in practice three independent gates kept it
out of nearly all real work:
1. llamafile cold start kept the SLM out of pipe-mode runs, which
always finish before the 15 s health check passes
2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm
(ToolUse=false) from 9/10 task types
3. armTier hard-coded CLI agents > local > API, so even when the SLM
arm was feasible a CLI agent won
Each gate is addressed below. The result is an SLM that actually does
its job — small stuff stays local, complex stuff routes up — gated by
arm capability rather than by accidents of the boot order.
Backend layer (the bigger change)
The original implementation hard-coded llamafile. That's fine if you
have nothing else, but most users with a local model setup already run
Ollama or llama.cpp. The new factory at internal/slm/backend.go picks
between:
- ollama (any local Ollama daemon)
- llamacpp (any llama.cpp server)
- llamafile (gnoma-managed, current behaviour)
- openaicompat (LM Studio, vLLM, remote API)
- auto (probes in order, picks first reachable)
- disabled
[slm].backend in config.toml selects which. Documented in
docs/slm-backends.md with copy-paste presets for each. The factory
probes the underlying model's actual capabilities (Ollama /api/show,
llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the
arm picks up simple file-read style tasks on tool-capable models and
stays knowledge-only on completion-only models.
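
A minimal sketch of how a factory like this could dispatch on the
configured backend name and let a probe decide ToolUse. The Config and
Backend types, the New function, and the injected probeTools callback
are hypothetical names for illustration, not the contents of
internal/slm/backend.go; auto-detection is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the [slm] table; the field names are assumptions.
type Config struct {
	Backend string // "ollama", "llamacpp", "llamafile", "openaicompat", "auto", "disabled"
	BaseURL string
	Model   string
}

// Backend is a hypothetical handle to the selected SLM backend.
type Backend struct {
	Name    string
	BaseURL string
	ToolUse bool // filled in by the capability probe, never by config
}

// ErrDisabled signals that the SLM was switched off on purpose.
var ErrDisabled = errors.New("slm disabled")

// defaultURLs holds the default endpoints named in the docs.
var defaultURLs = map[string]string{
	"ollama":   "http://localhost:11434",
	"llamacpp": "http://localhost:8080",
}

// New picks a backend by name. The probe is injected so the sketch runs offline.
func New(cfg Config, probeTools func(Backend) bool) (Backend, error) {
	switch cfg.Backend {
	case "disabled":
		return Backend{}, ErrDisabled
	case "auto", "":
		return Backend{}, errors.New("auto-detection is out of scope for this sketch")
	case "ollama", "llamacpp", "llamafile", "openaicompat":
		b := Backend{Name: cfg.Backend, BaseURL: cfg.BaseURL}
		if b.BaseURL == "" {
			b.BaseURL = defaultURLs[cfg.Backend]
		}
		b.ToolUse = probeTools(b)
		return b, nil
	default:
		return Backend{}, fmt.Errorf("unknown slm backend %q", cfg.Backend)
	}
}

func main() {
	b, err := New(Config{Backend: "ollama"}, func(Backend) bool { return true })
	fmt.Println(b.Name, b.BaseURL, b.ToolUse, err)
}
```

Keeping ToolUse out of the config struct reflects the point above: the
arm's capability comes from what the model reports, not from what the
user typed.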
Trivial-prompt heuristic (Gate 2)
ClassifyTask now flips RequiresTools=false for short, low-complexity
prompts whose task type doesn't imply existing code (Explain,
Generation, Boilerplate). Tool-needing tokens (read, write, run, test,
file, …) keep RequiresTools=true even when the prompt is brief.
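
The rule above can be sketched roughly as follows. This is an
illustration, not the real ClassifyTask: the TaskType constants, the
20-word cutoff standing in for "short, low-complexity", and the token
list (the source elides the full set) are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

type TaskType int

const (
	Explain TaskType = iota
	Generation
	Boilerplate
	Debug
	Refactor
)

// toolTokens is an illustrative subset; the real list is longer.
var toolTokens = map[string]bool{
	"read": true, "write": true, "run": true, "test": true, "file": true,
}

// maxTrivialWords is an assumed cutoff for a "short" prompt.
const maxTrivialWords = 20

// requiresTools returns false only for short prompts of a task type that
// does not imply existing code and that mention no tool-needing token.
func requiresTools(prompt string, tt TaskType) bool {
	switch tt {
	case Explain, Generation, Boilerplate:
		// candidate for the knowledge-only path; keep checking
	default:
		return true
	}
	words := strings.Fields(strings.ToLower(prompt))
	if len(words) > maxTrivialWords {
		return true
	}
	for _, w := range words {
		if toolTokens[strings.Trim(w, ".,;:!?\"'()")] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(requiresTools("What is a goroutine?", Explain))
	fmt.Println(requiresTools("Read the file main.go", Explain))
	fmt.Println(requiresTools("Why does this crash?", Debug))
}
```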
Complexity-aware tier ordering (Gate 3)
armTier takes a Task and returns tier 0 for arms whose MaxComplexity
ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3.
For trivial tasks the SLM arm wins; for complex tasks the SLM falls
out of the feasible set (MaxComplexity exclusion) and the original
ordering reasserts.
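
A small sketch of that ordering, under stated assumptions: the Arm and
Task shapes, the pick helper, and the convention that MaxComplexity=0
means "no ceiling" are made up for illustration and may not match the
router's real types.

```go
package main

import "fmt"

type ArmKind int

const (
	CLIAgent ArmKind = iota
	Local
	API
)

type Arm struct {
	Name          string
	Kind          ArmKind
	MaxComplexity float64 // 0 = no ceiling (assumption of this sketch)
}

type Task struct{ Complexity float64 }

// feasible applies the MaxComplexity exclusion.
func feasible(a Arm, t Task) bool {
	return a.MaxComplexity == 0 || t.Complexity <= a.MaxComplexity
}

// armTier returns 0 for an arm whose ceiling fits the task, otherwise the
// kind-based tier: CLI agents 1, local 2, API 3.
func armTier(a Arm, t Task) int {
	if a.MaxComplexity > 0 && t.Complexity <= a.MaxComplexity {
		return 0
	}
	switch a.Kind {
	case CLIAgent:
		return 1
	case Local:
		return 2
	default:
		return 3
	}
}

// pick returns the feasible arm with the lowest tier.
func pick(arms []Arm, t Task) (Arm, bool) {
	var best Arm
	bestTier, found := 0, false
	for _, a := range arms {
		if !feasible(a, t) {
			continue
		}
		if tier := armTier(a, t); !found || tier < bestTier {
			best, bestTier, found = a, tier, true
		}
	}
	return best, found
}

func main() {
	arms := []Arm{
		{Name: "cli-agent", Kind: CLIAgent},
		{Name: "slm", Kind: Local, MaxComplexity: 0.3},
		{Name: "api", Kind: API},
	}
	for _, c := range []float64{0.1, 0.8} {
		a, _ := pick(arms, Task{Complexity: c})
		fmt.Printf("complexity %.1f -> %s\n", c, a.Name)
	}
}
```

With these three arms the 0.1 task goes to slm (tier 0) and the 0.8
task goes to cli-agent, because the SLM arm is no longer feasible.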
Eager boot with user-facing wait (Gate 1)
Removed the original goroutine-only path. SLM startup now blocks
synchronously inside the factory; for llamafile that means up to
[slm].startup_timeout (default 5 s) of waiting on the first
invocation, with "Starting SLM…" → "SLM ready (backend, model, tools,
boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp
backends boot instantly because the daemon is already running.
waitHealthy() now respects the caller's context deadline instead of
its old hardcoded 15 s ceiling.
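
For illustration, a health wait bounded only by the caller's context
might look like the sketch below. The signature, the polling interval,
and the runDemo helper (which uses a stub server so the example runs
offline) are assumptions, not the actual waitHealthy() implementation.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitHealthy polls url until it answers 200 OK or ctx is done. There is no
// internal ceiling: the caller's deadline is the only limit.
func waitHealthy(ctx context.Context, url string, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// runDemo starts a stub server that turns healthy on request number
// healthyAfter, then waits for it with a timeout given in milliseconds.
func runDemo(healthyAfter int32, timeoutMs int) error {
	var hits atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if hits.Add(1) < healthyAfter {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitHealthy(ctx, srv.URL, 10*time.Millisecond)
}

func main() {
	fmt.Println(runDemo(3, 2000))        // server turns healthy before the deadline
	fmt.Println(runDemo(1_000_000, 100)) // server never turns healthy; the deadline wins
}
```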
Classifier reliability
Classifier timeout bumped 2 s → 5 s for thinking-mode models like
Qwen3-distilled Tiny3.5. System prompt includes /no_think directive
for the same family. These help but don't eliminate small-model
JSON-contract failures — see the docs section on picking a model.
Probe + telemetry surfaces
gnoma slm status now prints the configured backend + model + a live
probe result (✓/✗) instead of just the llamafile manifest state.
`gnoma router stats` already (from the previous commit) shows the
classifier-source mix; with this change you can finally see slm /
slm_fallback / heuristic share rise from "always heuristic" to
something reflecting real SLM activity.
Tests
- 9 new backend-factory tests (httptest-backed Ollama probe, error
paths, auto-detection, capability flags)
- Tier-ordering tests cover the new "specialised small arm wins
trivial task" path
- Trivial-prompt heuristic tested for both halves (knowledge-only
flips RequiresTools=false; debug/file/run keeps it true)
Deletes the dead SLMManager field from the TUI Config — it was
declared but never read.
SLM Backends
The small-language-model (SLM) layer has two jobs:
- Classify every prompt into a TaskType + complexity score, feeding the router's arm selection.
- Execute trivial tasks itself (anything with complexity ≤ 0.3 and no tool use) so the heavy provider arms only see real work.
Gnoma supports several backends for the SLM role. Pick the one that matches what you already run; you don't need to install anything new for most setups.
Copy a preset into ~/.config/gnoma/config.toml (or the project-local .gnoma/config.toml) and adjust the model name to one you have available.
Choosing a backend
| Backend | Cold start | External daemon | Setup | Good for |
|---|---|---|---|---|
| ollama | none (already running) | Ollama daemon | ollama pull <model> once | Most local-model users |
| llamacpp | none (already running) | llama-server | manual server launch | llama.cpp users |
| llamafile | 15–30 s on first prompt | none (gnoma manages the process) | gnoma slm setup | Zero-dependency single-binary setups |
| openaicompat | none | user-managed | point at any OpenAI-compatible URL | LM Studio, vLLM, remote API, etc. |
| auto | depends on what's reachable | depends | none | Lazy default: gnoma probes and picks |
| disabled | n/a | n/a | n/a | Skip the SLM entirely; classifier stays heuristic |
The "ollama" path is the easiest if you're already running a local model — it has no cold-start cost. The "llamafile" path is the most portable (gnoma owns the lifecycle) but pays a one-time boot per gnoma invocation.
Presets
Presets use reecdev/tiny3.5:500m as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:
ollama pull reecdev/tiny3.5:500m # ~1 GB
# or the 1.5 B variant for slightly better quality:
ollama pull reecdev/tiny3.5:1.5b # ~3 GB
Substitute any small Ollama model you prefer. The startup probe reads each model's actual capabilities: with the tools capability, the SLM arm can handle simple file reads; without it, the SLM only handles knowledge-only prompts.
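
As a rough sketch of such a probe: Ollama's /api/show endpoint accepts a POST with the model name and, in recent versions, returns a capabilities list. The function below checks that list for "tools". The names hasToolsCapability and probeFake are hypothetical, the response struct keeps only the one field this sketch needs, and gnoma's real probe may read the response differently.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// showResponse keeps only the field this sketch needs from /api/show.
type showResponse struct {
	Capabilities []string `json:"capabilities"`
}

// hasToolsCapability asks an Ollama-style /api/show endpoint whether the
// model advertises the "tools" capability.
func hasToolsCapability(ctx context.Context, baseURL, model string) (bool, error) {
	body, err := json.Marshal(map[string]string{"model": model})
	if err != nil {
		return false, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/api/show", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("show %s: status %d", model, resp.StatusCode)
	}
	var sr showResponse
	if err := json.NewDecoder(resp.Body).Decode(&sr); err != nil {
		return false, err
	}
	for _, c := range sr.Capabilities {
		if c == "tools" {
			return true, nil
		}
	}
	return false, nil
}

// probeFake runs the probe against a stub server that reports caps, so the
// example works without a running Ollama daemon.
func probeFake(caps ...string) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(showResponse{Capabilities: caps})
	}))
	defer srv.Close()
	ok, err := hasToolsCapability(context.Background(), srv.URL, "reecdev/tiny3.5:500m")
	return err == nil && ok
}

func main() {
	fmt.Println(probeFake("completion", "tools"))
	fmt.Println(probeFake("completion"))
}
```

Against a real daemon, pass http://localhost:11434 as baseURL instead of the stub server's URL.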
Preset 1 — Ollama (recommended for most users)
[slm]
enabled = true
backend = "ollama"
model = "reecdev/tiny3.5:500m"
# base_url defaults to http://localhost:11434
Prereq: ollama pull reecdev/tiny3.5:500m (or any model you'd rather use).
Preset 2 — llama.cpp server
[slm]
enabled = true
backend = "llamacpp"
# base_url defaults to http://localhost:8080
# model defaults to "default" — llama.cpp's server ignores the field
Prereq: a running llama-server (or llama.cpp server) on the configured port. Model is determined by what you launched the server with.
Preset 3 — Llamafile (gnoma-managed)
[slm]
enabled = true
backend = "llamafile"
# Optional overrides:
# model_url = "https://huggingface.co/.../TinyLlama-...-llamafile"
# data_dir = "" # empty = XDG default (~/.local/share/gnoma/slm)
# startup_timeout = "10s" # how long to block on first-boot before falling back
Prereq: gnoma slm setup once to download the binary. After that gnoma starts/stops the llamafile process automatically. Expect ~15–30 s cold start on the first prompt of each gnoma invocation.
Preset 4 — LM Studio / generic OpenAI-compatible
[slm]
enabled = true
backend = "openaicompat"
base_url = "http://localhost:1234/v1" # LM Studio's default
model = "tinyllama-1.1b"
Use this for any OpenAI-compatible endpoint that isn't Ollama or llama.cpp: LM Studio, vLLM, llamaedge, a remote relay, etc.
Preset 5 — Auto (default)
[slm]
enabled = true
backend = "auto"
Gnoma probes in this order on startup:
- If you have model_url configured and llamafile is set up → use llamafile.
- Ollama at localhost:11434 → use it, picking the smallest model available.
- llama.cpp at localhost:8080 → use it.
- Llamafile (if it happens to be set up).
- Nothing reachable → SLM stays disabled, classifier stays heuristic.
This is what you get if you don't set backend at all.
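
The probe order above reduces to a first-match decision. The sketch below expresses it with the probe results passed in as plain booleans; the Env struct and autoDetect function are hypothetical names, and the real implementation performs network and filesystem checks rather than reading flags.

```go
package main

import "fmt"

// Env captures what the probes would discover. Each field stands in for a
// real check and is an assumption of this sketch.
type Env struct {
	ModelURLSet    bool // model_url is present in [slm]
	LlamafileSetUp bool // gnoma slm setup has been run
	OllamaUp       bool // something answers at localhost:11434
	LlamaCppUp     bool // something answers at localhost:8080
}

// autoDetect returns the first backend that matches, in the documented order.
func autoDetect(e Env) string {
	switch {
	case e.ModelURLSet && e.LlamafileSetUp:
		return "llamafile"
	case e.OllamaUp:
		return "ollama"
	case e.LlamaCppUp:
		return "llamacpp"
	case e.LlamafileSetUp:
		return "llamafile"
	default:
		return "disabled"
	}
}

func main() {
	fmt.Println(autoDetect(Env{OllamaUp: true, LlamaCppUp: true}))
	fmt.Println(autoDetect(Env{LlamafileSetUp: true}))
	fmt.Println(autoDetect(Env{}))
}
```

Note that an explicit model_url moves llamafile ahead of a running Ollama daemon; without it, llamafile is only the last resort.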
Preset 6 — Disabled
[slm]
enabled = false
Skips the SLM entirely. The router uses only the keyword-based heuristic classifier; the SLM arm isn't registered. Useful for slow systems or air-gapped setups.
Custom backends
The openaicompat backend IS the escape hatch — point it at any OpenAI-compatible URL and any model name. If you can curl it with a standard chat-completion payload, gnoma can use it as the SLM.
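
To make "a standard chat-completion payload" concrete, here is a minimal Go sketch of the request an OpenAI-compatible endpoint expects: a POST to base_url + /chat/completions with a model and a messages array. The complete and echoDemo helpers are illustrative names, not gnoma code; echoDemo talks to a stub server so the example runs offline, and a remote endpoint would normally also need an Authorization header, which is omitted here.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type choice struct {
	Message message `json:"message"`
}

type chatResponse struct {
	Choices []choice `json:"choices"`
}

// complete sends one user message and returns the first choice's content.
func complete(ctx context.Context, baseURL, model, prompt string) (string, error) {
	payload, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("chat completion: status %d", resp.StatusCode)
	}
	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	if len(out.Choices) == 0 {
		return "", errors.New("chat completion: no choices")
	}
	return out.Choices[0].Message.Content, nil
}

// echoDemo runs complete against a stub server that echoes the prompt back.
func echoDemo(prompt string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in chatRequest
		if r.URL.Path != "/v1/chat/completions" || json.NewDecoder(r.Body).Decode(&in) != nil || len(in.Messages) == 0 {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		reply := message{Role: "assistant", Content: "echo: " + in.Messages[0].Content}
		json.NewEncoder(w).Encode(chatResponse{Choices: []choice{{Message: reply}}})
	}))
	defer srv.Close()
	return complete(context.Background(), srv.URL+"/v1", "tinyllama-1.1b", prompt)
}

func main() {
	fmt.Println(echoDemo("hello"))
}
```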
What the SLM actually does
| Role | Triggered by | Effect |
|---|---|---|
| Classifier | Every prompt | Returns task_type (debug / generation / refactor / …), complexity (0–1), requires_tools (bool). Drives router arm selection. |
| Arm | Tasks with complexity ≤ 0.3 | Gnoma routes the task to the SLM directly, including simple tool calls like fs.read foo.go. |
Both roles use the same backend + model. The SLM arm is registered with MaxComplexity=0.3 so anything more complex automatically routes to a bigger arm. Trivial work — knowledge questions, short explanations, single file reads — stays local on the small model.
Picking a model
The two roles have different demands:
- Arm execution is forgiving. The model just has to answer the prompt or emit a single tool call. Tiny3.5-500M handles general chat, trivia, and simple file reads (when the tools capability is present). Any small model with tool support works here.
- Classifier is stricter. The model has to follow a JSON output schema. Models below ~3 B parameters frequently fail this contract: they emit prose, partial JSON, or thinking tokens instead of clean output. The classifier then falls back to the heuristic, which is fine but means the SLM signal isn't contributing. If gnoma router stats shows a high slm_fallback share, the model is missing the JSON contract; bumping to ~3 B parameters (qwen2.5-coder:3b, phi-3-mini, ministral-3:3b) typically resolves this.
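
To illustrate why small models trip over the classifier contract, here is a sketch of a strict parse with a heuristic fallback. The field names follow the classifier row in the table above (task_type, complexity, requires_tools); the Classification struct, the validation rules, and the classify helper are assumptions rather than gnoma's actual code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Classification is an assumed shape for the classifier's JSON output.
type Classification struct {
	TaskType      string  `json:"task_type"`
	Complexity    float64 `json:"complexity"`
	RequiresTools bool    `json:"requires_tools"`
}

// parseStrict accepts only a single clean JSON object. Leading prose,
// thinking tokens, or truncated JSON all make json.Unmarshal fail.
func parseStrict(raw string) (Classification, error) {
	var c Classification
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return Classification{}, err
	}
	if c.TaskType == "" || c.Complexity < 0 || c.Complexity > 1 {
		return Classification{}, errors.New("classification outside the contract")
	}
	return c, nil
}

// classify returns the SLM's answer when it honours the contract, otherwise
// the heuristic result, along with the source label recorded in stats.
func classify(raw string, heuristic Classification) (Classification, string) {
	if c, err := parseStrict(raw); err == nil {
		return c, "slm"
	}
	return heuristic, "slm_fallback"
}

func main() {
	h := Classification{TaskType: "explain", Complexity: 0.2}
	outputs := []string{
		`{"task_type":"debug","complexity":0.7,"requires_tools":true}`,
		`Sure! Here is the JSON: {"task_type":"debug"}`,
		`{"task_type":"debug","complexity":`,
	}
	for _, raw := range outputs {
		c, src := classify(raw, h)
		fmt.Println(src, c.TaskType)
	}
}
```

Only the first output counts as slm; the prose-wrapped and truncated ones fall back to the heuristic and would show up under slm_fallback.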
You don't have to pick a model that does both well. The common shape is:
- Use a small (500 M – 1.5 B) tool-capable model as the SLM arm — it answers trivial questions and runs simple tool calls without going up to a bigger model.
- Accept that the classifier role falls back to the keyword heuristic on small models. The heuristic is good enough for routing.
- Watch gnoma router stats to see the actual mix: the slm/<backend> arm row count tells you whether the SLM is executing real work, which is what matters most.
Verifying
After picking a preset:
gnoma slm status
Output looks like:
slm enabled: true
slm backend: ollama
model: reecdev/tiny3.5:500m
live probe:
✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)
Run a few prompts, then check:
gnoma router stats
The classifier-source breakdown reveals what's actually being used:
Classifier source breakdown:
SOURCE COUNT SHARE
slm 18 60.0%
slm_fallback 4 13.3%
heuristic 8 26.7%
total observations: 30
A healthy SLM share (≥50 %) means the classifier is firing reliably. High slm_fallback means the model is failing to return valid JSON — try a larger model.
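
The shares in that table are simply each count divided by the total. A small sketch of the arithmetic and of the ≥50 % rule follows; the share and verdict helpers are hypothetical, and since the text does not define what counts as a "high" slm_fallback share, the comparison used for it here is an assumption.

```go
package main

import "fmt"

// share returns the fraction of observations attributed to source.
func share(counts map[string]int, source string) float64 {
	total := 0
	for _, n := range counts {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(counts[source]) / float64(total)
}

// verdict applies the documented 50 % rule. Treating "more fallbacks than
// successes" as a high fallback share is an assumption of this sketch.
func verdict(counts map[string]int) string {
	switch {
	case share(counts, "slm") >= 0.5:
		return "healthy"
	case share(counts, "slm_fallback") > share(counts, "slm"):
		return "try a larger model"
	default:
		return "inconclusive"
	}
}

func main() {
	counts := map[string]int{"slm": 18, "slm_fallback": 4, "heuristic": 8}
	for _, src := range []string{"slm", "slm_fallback", "heuristic"} {
		fmt.Printf("%-13s %3d %5.1f%%\n", src, counts[src], 100*share(counts, src))
	}
	fmt.Println(verdict(counts))
}
```

For the sample counts this reproduces the 60.0 %, 13.3 %, and 26.7 % shares shown above and reports the SLM share as healthy.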