gnoma/docs/benchmarks/README.md
vikingowl 6bb9c33d04 fix(m8): replace_default map, error UX, benchmarks, and launch prep
- Fix replace_default positional bug: []string → map[string]string for
  explicit MCP tool → built-in name mapping
- Improve error messages for missing API keys (3 actionable options) and
  unknown providers (early validation with available list)
- Remove python3 dependency from MCP tests (pure bash grep/sed parsing)
- Add router benchmark scaffold (6 benchmarks in bench_test.go + docs)
- Add .goreleaser.yml for cross-platform binary releases with ldflags
- Add launch-ready README with quickstart, extensibility docs, GIF placeholder
- Add CONTRIBUTING.md and Gitea issue templates (bug report, feature request)
2026-04-12 03:34:58 +02:00

Router Benchmarks

This document tracks how gnoma's router (M4 heuristic, M9 multi-armed bandit) performs across providers, task types, and cost envelopes.

Methodology

Each benchmark run:

  1. Registers a set of arms (provider/model pairs) with known cost profiles
  2. Generates synthetic tasks across all 10 task types with varying complexity
  3. Runs N routing decisions and records: arm selected, latency, quality score, cost
  4. Reports convergence metrics after simulated quality feedback
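The run loop above can be sketched roughly as follows. This is a minimal illustration, not gnoma's actual code: the `Arm`, `Task`, and `Decision` types, the `selectArm` policy, and the flat per-task cost are all assumptions made for the example. Latency recording and the convergence report (step 4) are left out so the output stays deterministic for a given seed.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Arm is a hypothetical provider/model pair with a known cost profile.
type Arm struct {
	Name        string
	CostPerTask float64 // assumed flat cost per task, in USD
	Quality     float64 // assumed mean quality in [0, 1]
}

// Task is a synthetic task; Type ranges over the 10 task types.
type Task struct {
	Type       int
	Complexity float64 // in [0, 1)
}

// Decision records the outcome of one routing decision.
type Decision struct {
	Arm     string
	Quality float64
	Cost    float64
}

const numTaskTypes = 10

// generateTasks cycles through all task types so each one is covered,
// and draws a random complexity for every task.
func generateTasks(rng *rand.Rand, n int) []Task {
	tasks := make([]Task, n)
	for i := range tasks {
		tasks[i] = Task{Type: i % numTaskTypes, Complexity: rng.Float64()}
	}
	return tasks
}

// selectArm is a placeholder policy (an assumption, not the M4 heuristic):
// pick the cheapest arm whose quality covers the task complexity, and fall
// back to the highest-quality arm when none does.
func selectArm(arms []Arm, t Task) Arm {
	var pick *Arm
	for i := range arms {
		a := &arms[i]
		if a.Quality >= t.Complexity && (pick == nil || a.CostPerTask < pick.CostPerTask) {
			pick = a
		}
	}
	if pick != nil {
		return *pick
	}
	best := arms[0]
	for _, a := range arms[1:] {
		if a.Quality > best.Quality {
			best = a
		}
	}
	return best
}

// runBenchmark makes n routing decisions over seeded synthetic tasks.
func runBenchmark(arms []Arm, n int, seed int64) []Decision {
	rng := rand.New(rand.NewSource(seed))
	decisions := make([]Decision, 0, n)
	for _, t := range generateTasks(rng, n) {
		a := selectArm(arms, t)
		decisions = append(decisions, Decision{Arm: a.Name, Quality: a.Quality, Cost: a.CostPerTask})
	}
	return decisions
}

func main() {
	arms := []Arm{
		{Name: "local/small", CostPerTask: 0.000, Quality: 0.6},
		{Name: "cloud/large", CostPerTask: 0.010, Quality: 0.9},
	}
	decisions := runBenchmark(arms, 1000, 42)
	total := 0.0
	for _, d := range decisions {
		total += d.Cost
	}
	fmt.Printf("decisions=%d total_cost=%.2f\n", len(decisions), total)
}
```

Seeding the generator makes two runs with the same arms and seed directly comparable, which matters when comparing router versions.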

Metrics

| Metric | Description |
| --- | --- |
| Selection accuracy | % of tasks routed to the optimal arm (vs. oracle with perfect knowledge) |
| Cost efficiency | Total cost relative to always-cheapest and always-best-quality baselines |
| Convergence speed | Observations needed before bandit matches heuristic on quality (M9) |
| Pool utilization | % of rate limit budget consumed before exhaustion |
| Latency overhead | Time spent in Select() excluding provider round-trip |
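As an illustration of how the first two metrics might be computed, here is a small sketch. The `Outcome` record and both function names are hypothetical, and the normalization used for cost efficiency (0 means the always-cheapest baseline, 1 means the always-best-quality baseline) is one possible reading of "relative to" rather than a definition taken from the router.

```go
package main

import "fmt"

// Outcome is a hypothetical record of one routed task.
type Outcome struct {
	Selected string  // arm the router chose
	Optimal  string  // arm an oracle with perfect knowledge would choose
	Cost     float64 // cost actually incurred, in USD
}

// selectionAccuracy returns the percentage of tasks routed to the optimal arm.
func selectionAccuracy(outcomes []Outcome) float64 {
	if len(outcomes) == 0 {
		return 0
	}
	hits := 0
	for _, o := range outcomes {
		if o.Selected == o.Optimal {
			hits++
		}
	}
	return 100 * float64(hits) / float64(len(outcomes))
}

// costPosition places the total cost between the two baselines:
// 0 equals the always-cheapest total, 1 equals the always-best-quality total.
func costPosition(total, cheapest, bestQuality float64) float64 {
	if bestQuality == cheapest {
		return 0 // baselines coincide, so there is no range to normalize over
	}
	return (total - cheapest) / (bestQuality - cheapest)
}

func main() {
	outcomes := []Outcome{
		{Selected: "local", Optimal: "local", Cost: 0.00},
		{Selected: "cloud", Optimal: "cloud", Cost: 0.01},
		{Selected: "cloud", Optimal: "local", Cost: 0.01},
		{Selected: "local", Optimal: "local", Cost: 0.00},
	}
	total := 0.0
	for _, o := range outcomes {
		total += o.Cost
	}
	fmt.Printf("selection accuracy: %.1f%%\n", selectionAccuracy(outcomes))
	fmt.Printf("cost position: %.2f\n", costPosition(total, 0.00, 0.04))
}
```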

Running

# Go benchmarks (in-process, no real API calls)
go test -bench=. -benchmem ./internal/router/

# Synthetic routing simulation (when available)
go run ./cmd/gnoma-bench/ --arms=5 --tasks=1000 --seed=42
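For readers unfamiliar with Go benchmarks, the sketch below shows one way the latency overhead of a selection call could be measured in isolation. The `router` type and its `Select` method are stand-ins invented for this example; the real types in `./internal/router/` are not shown in this document. It uses `testing.Benchmark` so it can run as a plain program, whereas benchmarks picked up by `go test -bench` are written as `func BenchmarkXxx(b *testing.B)` in a `_test.go` file.

```go
package main

import (
	"fmt"
	"testing"
)

// arm is a hypothetical provider/model pair.
type arm struct {
	name string
	cost float64
}

// router is a stand-in for illustration only.
type router struct {
	arms []arm
}

// Select returns the cheapest arm. It performs no I/O, so timing it
// isolates routing overhead from any provider round-trip.
func (r *router) Select() arm {
	best := r.arms[0]
	for _, a := range r.arms[1:] {
		if a.cost < best.cost {
			best = a
		}
	}
	return best
}

// sink keeps the compiler from optimizing the Select call away.
var sink arm

// benchmarkSelect times Select alone; setup happens before ResetTimer.
func benchmarkSelect(b *testing.B) {
	r := &router{arms: []arm{
		{name: "a", cost: 0.03},
		{name: "b", cost: 0.01},
		{name: "c", cost: 0.02},
	}}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = r.Select()
	}
}

func main() {
	res := testing.Benchmark(benchmarkSelect)
	fmt.Printf("%d iterations, %d ns/op, %d allocs/op\n",
		res.N, res.NsPerOp(), res.AllocsPerOp())
}
```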

Results

No benchmark results yet. This scaffold will be populated as M9 (Router Advanced) lands.

Planned comparisons

  • Heuristic-only (M4) vs. bandit (M9) after 50, 200, 1000 observations
  • 2-arm (local + cloud) vs. 5-arm (mixed providers) scenarios
  • Cost-capped routing: $5/day budget with mixed task load
  • Quality degradation under rate limit pressure (pool scarcity)
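To make the first comparison concrete, the sketch below simulates a bandit and reports mean quality at fixed observation counts. It assumes an epsilon-greedy policy with Gaussian noise on the quality feedback; the document does not say which bandit algorithm M9 will use, so treat both as placeholders. All names here are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// banditArm tracks the running mean of observed quality for one arm.
type banditArm struct {
	trueQuality float64 // hidden from the policy; used only to simulate feedback
	pulls       int
	mean        float64
}

// pick implements epsilon-greedy: explore a random arm with probability eps,
// otherwise exploit the arm with the best observed mean. Arms that were never
// pulled are tried first.
func pick(rng *rand.Rand, arms []banditArm, eps float64) int {
	if rng.Float64() < eps {
		return rng.Intn(len(arms))
	}
	best := 0
	for i := range arms {
		if arms[i].pulls == 0 {
			return i
		}
		if arms[i].mean > arms[best].mean {
			best = i
		}
	}
	return best
}

// update folds one quality observation into the arm's running mean.
func update(a *banditArm, q float64) {
	a.pulls++
	a.mean += (q - a.mean) / float64(a.pulls)
}

// simulate returns the average true quality of the chosen arms at each
// checkpoint. Checkpoints must be positive and strictly ascending.
func simulate(trueQ []float64, checkpoints []int, eps float64, seed int64) map[int]float64 {
	rng := rand.New(rand.NewSource(seed))
	arms := make([]banditArm, len(trueQ))
	for i, q := range trueQ {
		arms[i].trueQuality = q
	}
	out := make(map[int]float64)
	sum := 0.0
	next := 0
	for n := 1; next < len(checkpoints); n++ {
		i := pick(rng, arms, eps)
		noisy := arms[i].trueQuality + 0.1*rng.NormFloat64() // simulated feedback
		update(&arms[i], noisy)
		sum += arms[i].trueQuality
		if n == checkpoints[next] {
			out[n] = sum / float64(n)
			next++
		}
	}
	return out
}

func main() {
	checkpoints := []int{50, 200, 1000}
	res := simulate([]float64{0.6, 0.9}, checkpoints, 0.1, 42)
	for _, c := range checkpoints {
		fmt.Printf("after %4d observations: mean quality %.3f\n", c, res[c])
	}
}
```

Running the same seeds against a fixed heuristic policy would give the baseline curve for the M4 vs. M9 comparison.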