gnoma/docs/slm-backends.md
vikingowl a14fe8b504 feat(slm): pluggable backends + trivial-prompt routing
The SLM had two intended jobs — classify every prompt and execute the
small ones itself — but in practice three independent gates kept it
out of nearly all real work:

  1. llamafile cold-start kept the SLM out of pipe-mode runs (they
     always finish faster than the 15 s health check)
  2. ClassifyTask defaulted RequiresTools=true, excluding the SLM arm
     (ToolUse=false) from 9/10 task types
  3. armTier hard-coded CLI agents > local > API, so even when the SLM
     arm was feasible a CLI agent won

Each gate is addressed below. The result is an SLM that actually does
its job — small stuff stays local, complex stuff routes up — gated by
arm capability rather than by accidents of the boot order.

Backend layer (the bigger change)

The original implementation hard-coded llamafile. That's fine if you
have nothing else, but most users with a local model setup already run
Ollama or llama.cpp. The new factory at internal/slm/backend.go picks
between:

  - ollama (any local Ollama daemon)
  - llamacpp (any llama.cpp server)
  - llamafile (gnoma-managed, current behaviour)
  - openaicompat (LM Studio, vLLM, remote API)
  - auto (probes in order, picks first reachable)
  - disabled

[slm].backend in config.toml selects which. Documented in
docs/slm-backends.md with copy-paste presets for each. The factory
probes the underlying model's actual capabilities (Ollama /api/show,
llama.cpp /props) and sets the SLM arm's ToolUse accordingly — so the
arm picks up simple file-read style tasks on tool-capable models and
stays knowledge-only on completion-only models.
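
A minimal sketch of how a capability probe could map onto ToolUse. It
assumes the Ollama /api/show reply carries a "capabilities" string array
(as recent Ollama releases do); showResponse and toolUseFromShow are
hypothetical names, not the code in internal/slm/backend.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// showResponse models only the part of an /api/show reply this sketch needs.
// The "capabilities" field is an assumption; older daemons may omit it.
type showResponse struct {
	Capabilities []string `json:"capabilities"`
}

// toolUseFromShow reports whether the probed model advertises tool support.
// A missing or empty capability list is treated as "no tools".
func toolUseFromShow(body []byte) (bool, error) {
	var resp showResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode /api/show reply: %w", err)
	}
	for _, c := range resp.Capabilities {
		if c == "tools" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	withTools := []byte(`{"capabilities":["completion","tools"]}`)
	completionOnly := []byte(`{"capabilities":["completion"]}`)

	a, _ := toolUseFromShow(withTools)
	b, _ := toolUseFromShow(completionOnly)
	fmt.Println("tool-capable model:", a)
	fmt.Println("completion-only model:", b)
}
```

Treating an absent list as "no tools" keeps the arm knowledge-only when
the probe cannot tell, which is the safer default.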

Trivial-prompt heuristic (Gate 2)

ClassifyTask now flips RequiresTools=false for short, low-complexity
prompts whose task type doesn't imply existing code (Explain,
Generation, Boilerplate). Tool-needing tokens (read, write, run, test,
file, …) keep RequiresTools=true even when the prompt is brief.
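
The heuristic could look roughly like the sketch below. The TaskType
constants, the 20-word cutoff, and the requiresTools name are
hypothetical; prompt length stands in for "low-complexity", and the
token list holds only the five tokens named above (the real list is
longer).

```go
package main

import (
	"fmt"
	"strings"
)

// TaskType is a hypothetical stand-in for the router's task types.
type TaskType int

const (
	Explain TaskType = iota
	Generation
	Boilerplate
	Debug
	Refactor
)

// toolTokens holds only the tokens named in the commit message.
var toolTokens = map[string]bool{
	"read": true, "write": true, "run": true, "test": true, "file": true,
}

// maxTrivialWords is an assumed cutoff for "short" prompts.
const maxTrivialWords = 20

// requiresTools keeps the default (true) unless the prompt is short, its
// task type does not imply existing code, and it has no tool-needing token.
func requiresTools(prompt string, t TaskType) bool {
	switch t {
	case Explain, Generation, Boilerplate:
		// These types may be knowledge-only; keep checking.
	default:
		return true
	}
	words := strings.Fields(strings.ToLower(prompt))
	if len(words) > maxTrivialWords {
		return true
	}
	for _, w := range words {
		if toolTokens[strings.Trim(w, ".,;:!?\"'()")] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(requiresTools("What is a mutex?", Explain))
	fmt.Println(requiresTools("Read the config file", Explain))
	fmt.Println(requiresTools("What is a mutex?", Debug))
}
```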

Complexity-aware tier ordering (Gate 3)

armTier takes a Task and returns tier 0 for arms whose MaxComplexity
ceiling fits the task. CLI agents drop to tier 1, local to 2, API to 3.
For trivial tasks the SLM arm wins; for complex tasks the SLM falls
out of the feasible set (MaxComplexity exclusion) and the original
ordering reasserts.
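
A sketch of that ordering, under stated assumptions: Arm, Task, and
ArmKind are hypothetical shapes, a MaxComplexity of 0 is taken to mean
"no ceiling", and the feasibility filter that removes the SLM arm from
complex tasks is assumed to run before tiers are compared.

```go
package main

import "fmt"

// ArmKind is a hypothetical classification of router arms.
type ArmKind int

const (
	CLIAgent ArmKind = iota
	Local
	API
)

// Arm and Task carry only the fields this sketch needs.
type Arm struct {
	Name          string
	Kind          ArmKind
	MaxComplexity float64 // 0 is assumed to mean "no ceiling"
}

type Task struct {
	Complexity float64
}

// armTier returns 0 for a specialised arm whose ceiling fits the task,
// otherwise the original ordering shifted down by one tier.
func armTier(a Arm, t Task) int {
	if a.MaxComplexity > 0 && t.Complexity <= a.MaxComplexity {
		return 0
	}
	switch a.Kind {
	case CLIAgent:
		return 1
	case Local:
		return 2
	default:
		return 3
	}
}

func main() {
	slm := Arm{Name: "slm/ollama", Kind: Local, MaxComplexity: 0.3}
	cli := Arm{Name: "cli-agent", Kind: CLIAgent}
	trivial := Task{Complexity: 0.2}

	fmt.Println(slm.Name, armTier(slm, trivial))
	fmt.Println(cli.Name, armTier(cli, trivial))
}
```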

Eager boot with user-facing wait (Gate 1)

Removed the original goroutine-only path. SLM startup now blocks
synchronously inside the factory; for llamafile that means up to
[slm].startup_timeout (default 5 s) of waiting on the first
invocation, with "Starting SLM…" → "SLM ready (backend, model, tools,
boot=N)" / "SLM unavailable: …" messages on stderr. Ollama / llamacpp
backends boot instantly because the daemon is already running.

waitHealthy() now respects the caller's context deadline instead of
its old hardcoded 15 s ceiling.
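
A deadline-respecting poll loop could be sketched as follows. This is
not the actual waitHealthy(); the probe callback, the polling interval,
and the tryBoot helper are assumptions for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitHealthy polls probe until it succeeds or ctx is done. The only
// ceiling is the caller's context; there is no built-in timeout.
func waitHealthy(ctx context.Context, probe func(context.Context) error, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := probe(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("backend not healthy: %w (last probe error: %v)", ctx.Err(), err)
		case <-ticker.C:
			// Next attempt.
		}
	}
}

// tryBoot simulates a backend that fails `failures` probes before becoming
// healthy, and waits for it under the given timeout.
func tryBoot(failures int, timeout time.Duration) (calls int, err error) {
	probe := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("connection refused")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = waitHealthy(ctx, probe, 10*time.Millisecond)
	return calls, err
}

func main() {
	calls, err := tryBoot(2, time.Second)
	fmt.Println("calls:", calls, "err:", err)

	_, err = tryBoot(1000, 30*time.Millisecond)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

Passing the deadline in through the context lets one knob such as
[slm].startup_timeout bound the wait.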

Classifier reliability

Classifier timeout bumped 2 s → 5 s for thinking-mode models like
Qwen3-distilled Tiny3.5. System prompt includes /no_think directive
for the same family. These help but don't eliminate small-model
JSON-contract failures — see the docs section on picking a model.

Probe + telemetry surfaces

gnoma slm status now prints the configured backend + model + a live
probe result (✓/✗) instead of just the llamafile manifest state.

`gnoma router stats` already (from the previous commit) shows the
classifier-source mix; with this change you can finally see the slm /
slm_fallback / heuristic split move from "always heuristic" to
something reflecting real SLM activity.

Tests

  - 9 new backend-factory tests (httptest-backed Ollama probe, error
    paths, auto-detection, capability flags)
  - Tier-ordering tests cover the new "specialised small arm wins
    trivial task" path
  - Trivial-prompt heuristic tested for both halves (knowledge-only
    flips RequiresTools=false; debug/file/run keeps it true)

Deletes the dead SLMManager field from the TUI Config — it was
declared but never read.
2026-05-19 18:53:32 +02:00


SLM Backends

The small-language-model (SLM) layer has two jobs:

  • Classify every prompt into a TaskType + complexity score, feeding the router's arm selection.
  • Execute trivial tasks itself (anything with complexity ≤ 0.3: knowledge-only prompts, plus simple tool calls when the model supports tools) so the heavy provider arms only see real work.

Gnoma supports several backends for the SLM role. Pick the one that matches what you already run; you don't need to install anything new for most setups.

Copy a preset into ~/.config/gnoma/config.toml (or the project-local .gnoma/config.toml) and adjust the model name to one you have available.

Choosing a backend

| Backend | Cold start | External daemon | Setup | Good for |
| --- | --- | --- | --- | --- |
| ollama | none (already running) | Ollama daemon | ollama pull <model> once | Most local-model users |
| llamacpp | none (already running) | llama-server | manual server launch | llama.cpp users |
| llamafile | 15-30 s on first prompt | none (gnoma manages the process) | gnoma slm setup | Zero-dependency single-binary setups |
| openaicompat | none | user-managed | point at any OpenAI-compatible URL | LM Studio, vLLM, remote API, etc. |
| auto | depends on what's reachable | depends | none | Lazy default: gnoma probes and picks |
| disabled | n/a | n/a | n/a | Skip the SLM entirely; classifier stays heuristic |

The "ollama" path is the easiest if you're already running a local model — it has no cold-start cost. The "llamafile" path is the most portable (gnoma owns the lifecycle) but pays a one-time boot per gnoma invocation.

Presets

Presets use reecdev/tiny3.5:500m as the default model — a 500 M-parameter Qwen3.5 distillation with tool support, available on Ollama. Pull it once with:

ollama pull reecdev/tiny3.5:500m   # ~1 GB
# or the 1.5 B variant for slightly better quality:
ollama pull reecdev/tiny3.5:1.5b   # ~3 GB

Substitute any small Ollama model you prefer. The probe at startup reads each model's actual capabilities: the tools capability lets the SLM arm handle simple file reads; without it, the SLM handles only knowledge-only prompts.

Preset 1 — Ollama

[slm]
enabled = true
backend = "ollama"
model   = "reecdev/tiny3.5:500m"
# base_url defaults to http://localhost:11434

Prereq: ollama pull reecdev/tiny3.5:500m (or any model you'd rather use).

Preset 2 — llama.cpp server

[slm]
enabled = true
backend = "llamacpp"
# base_url defaults to http://localhost:8080
# model defaults to "default" — llama.cpp's server ignores the field

Prereq: a running llama-server (or llama.cpp server) on the configured port. Model is determined by what you launched the server with.

Preset 3 — Llamafile (gnoma-managed)

[slm]
enabled = true
backend  = "llamafile"
# Optional overrides:
# model_url       = "https://huggingface.co/.../TinyLlama-...-llamafile"
# data_dir        = ""        # empty = XDG default (~/.local/share/gnoma/slm)
# startup_timeout = "10s"     # how long to block on first-boot before falling back

Prereq: gnoma slm setup once to download the binary. After that gnoma starts/stops the llamafile process automatically. Expect ~15-30 s cold start on the first prompt of each gnoma invocation.

Preset 4 — LM Studio / generic OpenAI-compatible

[slm]
enabled = true
backend = "openaicompat"
base_url = "http://localhost:1234/v1"   # LM Studio's default
model    = "tinyllama-1.1b"

Use this for any OpenAI-compatible endpoint that isn't Ollama or llama.cpp: LM Studio, vLLM, llamaedge, a remote relay, etc.

Preset 5 — Auto (default)

[slm]
enabled = true
backend = "auto"

Gnoma probes in this order on startup:

  1. If you have model_url configured and llamafile is set up → use llamafile.
  2. Ollama at localhost:11434 → use it, picking the smallest model available.
  3. llama.cpp at localhost:8080 → use it.
  4. Llamafile (if it happens to be set up).
  5. Nothing reachable → SLM stays disabled, classifier stays heuristic.

This is what you get if you don't set backend at all.
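
The probe order above amounts to a first-match scan. A minimal sketch, assuming each probe is reduced to a boolean callback; candidate and pickBackend are hypothetical names, and a real probe would issue an HTTP request to the listed address rather than return a constant.

```go
package main

import "fmt"

// candidate pairs a backend name with a reachability check.
type candidate struct {
	name      string
	reachable func() bool
}

// pickBackend returns the first reachable candidate in list order, or
// "disabled" when nothing answers.
func pickBackend(cands []candidate) string {
	for _, c := range cands {
		if c.reachable() {
			return c.name
		}
	}
	return "disabled"
}

func main() {
	// Stubbed environment: no llamafile, Ollama and llama.cpp both running.
	modelURLConfigured, llamafileSetUp := false, false
	ollamaUp, llamacppUp := true, true

	order := []candidate{
		{"llamafile", func() bool { return modelURLConfigured && llamafileSetUp }},
		{"ollama", func() bool { return ollamaUp }},
		{"llamacpp", func() bool { return llamacppUp }},
		{"llamafile", func() bool { return llamafileSetUp }},
	}
	fmt.Println("selected backend:", pickBackend(order))
}
```

Because the scan stops at the first hit, a running Ollama daemon wins over llama.cpp even when both are reachable.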

Preset 6 — Disabled

[slm]
enabled = false

Skips the SLM entirely. The router uses only the keyword-based heuristic classifier; the SLM arm isn't registered. Useful for slow systems or air-gapped setups.

Custom backends

The openaicompat backend IS the escape hatch — point it at any OpenAI-compatible URL and any model name. If you can curl it with a standard chat-completion payload, gnoma can use it as the SLM.
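
For reference, a "standard chat-completion payload" looks roughly like the request built below. This is a sketch of the widely used OpenAI-style request shape, not gnoma's client code; newChatRequest is a hypothetical helper, and the URL and model name are the ones from Preset 4.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// chatMessage and chatRequest cover the minimal fields of an
// OpenAI-style chat-completion request.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest builds a POST to <baseURL>/chat/completions carrying a
// single user message. It does not send the request.
func newChatRequest(baseURL, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newChatRequest("http://localhost:1234/v1", "tinyllama-1.1b", "What is a mutex?")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

If an endpoint accepts this request and returns a chat-completion response, it should be usable through openaicompat.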

What the SLM actually does

| Role | Triggered by | Effect |
| --- | --- | --- |
| Classifier | Every prompt | Returns task_type (debug / generation / refactor / …), complexity (0-1), requires_tools (bool). Drives router arm selection. |
| Arm | Tasks with complexity ≤ 0.3 | Gnoma routes the task to the SLM directly, including simple tool calls like fs.read foo.go. |

Both roles use the same backend + model. The SLM arm is registered with MaxComplexity=0.3 so anything more complex automatically routes to a bigger arm. Trivial work — knowledge questions, short explanations, single file reads — stays local on the small model.
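
The MaxComplexity gate can be pictured as a feasibility check that runs before arm selection. A minimal sketch with hypothetical Arm and Task shapes; it assumes a MaxComplexity of 0 means "no ceiling" and that tool capability is checked in the same place.

```go
package main

import "fmt"

// Arm and Task carry only the fields this sketch needs.
type Arm struct {
	Name          string
	MaxComplexity float64 // 0 is assumed to mean "no ceiling"
	ToolUse       bool
}

type Task struct {
	Complexity    float64
	RequiresTools bool
}

// feasible reports whether an arm may take a task: the task must fit under
// the arm's complexity ceiling, and tool-needing tasks need a tool-capable arm.
func feasible(a Arm, t Task) bool {
	if a.MaxComplexity > 0 && t.Complexity > a.MaxComplexity {
		return false
	}
	if t.RequiresTools && !a.ToolUse {
		return false
	}
	return true
}

func main() {
	slm := Arm{Name: "slm/ollama", MaxComplexity: 0.3, ToolUse: true}

	fmt.Println(feasible(slm, Task{Complexity: 0.2}))                      // short explanation
	fmt.Println(feasible(slm, Task{Complexity: 0.2, RequiresTools: true})) // single file read
	fmt.Println(feasible(slm, Task{Complexity: 0.7}))                      // multi-file refactor
}
```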

Picking a model

The two roles have different demands:

  • Arm execution is forgiving. The model just has to answer the prompt or emit a single tool call. Tiny3.5-500M handles general chat, trivia, and simple file reads (when tools capability is present). Any small model with tool support works here.
  • Classifier is stricter. The model has to follow a JSON output schema. Models below ~3 B parameters frequently fail this contract — they emit prose, partial JSON, or thinking tokens instead of clean output. The classifier then falls back to the heuristic, which is fine but means the SLM signal isn't contributing. If gnoma router stats shows a high slm_fallback share, the model is missing the JSON contract; bumping to ~3 B parameters (qwen2.5-coder:3b, phi-3-mini, ministral-3:3b) typically resolves this.
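
The JSON contract and its fallback can be sketched as below. The field names task_type, complexity, and requires_tools and the slm / slm_fallback source labels come from this document; the Classification type and both functions are hypothetical, and the strictness rules are assumptions.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Classification mirrors the three fields the classifier returns.
type Classification struct {
	TaskType      string  `json:"task_type"`
	Complexity    float64 `json:"complexity"`
	RequiresTools bool    `json:"requires_tools"`
}

// parseClassification accepts only a single clean JSON object. Prose,
// thinking tokens, or trailing text make json.Unmarshal fail.
func parseClassification(raw string) (Classification, error) {
	var c Classification
	if err := json.Unmarshal([]byte(strings.TrimSpace(raw)), &c); err != nil {
		return Classification{}, fmt.Errorf("invalid classifier output: %w", err)
	}
	if c.TaskType == "" || c.Complexity < 0 || c.Complexity > 1 {
		return Classification{}, errors.New("classifier output violates the schema")
	}
	return c, nil
}

// classify returns the SLM's answer when it honours the contract, otherwise
// the heuristic result, plus the source label to record.
func classify(raw string, heuristic Classification) (Classification, string) {
	c, err := parseClassification(raw)
	if err != nil {
		return heuristic, "slm_fallback"
	}
	return c, "slm"
}

func main() {
	heuristic := Classification{TaskType: "explain", Complexity: 0.5, RequiresTools: true}

	_, src := classify(`{"task_type":"explain","complexity":0.2,"requires_tools":false}`, heuristic)
	fmt.Println(src)

	_, src = classify(`Sure! This looks like an explain task.`, heuristic)
	fmt.Println(src)
}
```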

You don't have to pick a model that does both well. The common shape is:

  • Use a small (500 M to 1.5 B) tool-capable model as the SLM arm: it answers trivial questions and runs simple tool calls without going up to a bigger model.
  • Accept that the classifier role falls back to the keyword heuristic on small models. The heuristic is good enough for routing.
  • Watch gnoma router stats to see the actual mix: the slm/<backend> arm row count tells you whether the SLM is executing real work, which is what matters most.

Verifying

After picking a preset:

gnoma slm status

Output looks like:

slm enabled: true
slm backend: ollama
  model:   reecdev/tiny3.5:500m

live probe:
  ✓ ollama ready (model=reecdev/tiny3.5:500m, boot=0s)

Run a few prompts, then check:

gnoma router stats

The classifier-source breakdown reveals what's actually being used:

Classifier source breakdown:
  SOURCE        COUNT  SHARE
  slm           18     60.0%
  slm_fallback  4      13.3%
  heuristic     8      26.7%
  total observations: 30

A healthy SLM share (≥50 %) means the classifier is firing reliably. High slm_fallback means the model is failing to return valid JSON — try a larger model.
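
The shares in the sample output are each source's count divided by the total number of observations. A small sketch of that arithmetic and of the ≥50 % rule of thumb; sharePercent and classifierHealthy are hypothetical helpers, not part of gnoma.

```go
package main

import (
	"fmt"
	"sort"
)

// sharePercent returns each source's share of all observations, in percent.
func sharePercent(counts map[string]int) map[string]float64 {
	total := 0
	for _, n := range counts {
		total += n
	}
	out := make(map[string]float64, len(counts))
	if total == 0 {
		return out
	}
	for src, n := range counts {
		out[src] = float64(n*100) / float64(total)
	}
	return out
}

// classifierHealthy applies the rule of thumb from the text: slm share >= 50 %.
func classifierHealthy(counts map[string]int) bool {
	return sharePercent(counts)["slm"] >= 50
}

func main() {
	// Counts taken from the sample output above.
	counts := map[string]int{"slm": 18, "slm_fallback": 4, "heuristic": 8}
	shares := sharePercent(counts)

	names := make([]string, 0, len(shares))
	for src := range shares {
		names = append(names, src)
	}
	sort.Strings(names)
	for _, src := range names {
		fmt.Printf("%-13s %5.1f%%\n", src, shares[src])
	}
	fmt.Println("healthy:", classifierHealthy(counts))
}
```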