Every additional tool increases prompt cost and reasoning overhead.
Kerf scores your catalog per turn and reduces it to a small per-turn working set before the call, in single-digit to low-double-digit milliseconds, with no external cloud dependency in the selection path.
Score, threshold, rationale, outcome. Pipe it to your logger, your dashboard, or your stdout.
Illustrative output. Actual results vary by catalog size and query complexity.
Methodology: adapted τ-bench workflows using custom tool catalogs with production-style agent policies. Each row represents a distinct workflow scenario with routing instrumentation enabled.
| Benchmark | Model | Catalog Tools | Avg Selected per Turn | Invalid Calls ↓ | Task Success ↑ | Token Reduction ↓ | Selection Latency |
|---|---|---|---|---|---|---|---|
| τ-bench* · airline workflows | gpt-4o | 14 | 2.4 | −41% | +3 pts | −38% | 8ms |
| τ-bench* · retail workflows | claude-sonnet-4 | 21 | 2.8 | −57% | +4 pts | −44% | 9ms |
| τ-bench* · support operations | gpt-4o-mini | 24 | 2.6 | −68% | +6 pts | −59% | 9ms |
| τ-bench* · developer agent | claude-opus-4 | 118 | 4.9 | −61% | +3 pts | −48% | 14ms |
| τ-bench* · ops runbooks | gpt-4o | 312 | 6.7 | −54% | +2 pts | −37% | 21ms |
*Results are representative examples for illustration purposes, not official τ-bench scores. We expect 40–60% token reduction in catalog-heavy workflows. We're looking for design partners to validate these results.
"Where is my order #4821? Also, can you cancel it?"
OpenAI-spec compatible · runs in your VPC · no data leaves your infra.
A selection model scores your full catalog and returns only the highest-scoring tools for the current turn. In-process, single-digit to low-double-digit ms latency. Kerf minimizes irrelevant tool exposure per turn.
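The score, threshold, and top-k flow described above can be sketched roughly as follows. This is an illustration only, not Kerf's selection model: the types, the function names, and the word-overlap scorer are all assumptions standing in for whatever the real reranker does.

```typescript
// Hypothetical types and names; none of them come from the Kerf SDK.
interface Tool {
  name: string
  description: string
}

interface ScoredTool {
  tool: Tool
  score: number
}

// Split text into lowercase alphanumeric tokens.
function tokens(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? []
}

// Stand-in scorer: fraction of query tokens that also appear in the
// tool's name or description. A real selection model would replace this.
function scoreTool(query: string, tool: Tool): number {
  const words = tokens(query)
  if (words.length === 0) return 0
  const vocab = new Set(tokens(`${tool.name} ${tool.description}`))
  return words.filter((w) => vocab.has(w)).length / words.length
}

// Score the whole catalog, drop tools under the threshold, keep the top k.
function selectTools(query: string, catalog: Tool[], threshold: number, k: number): ScoredTool[] {
  return catalog
    .map((tool) => ({ tool, score: scoreTool(query, tool) }))
    .filter((s) => s.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
}

const catalog: Tool[] = [
  { name: "get_order_status", description: "Look up where an order is" },
  { name: "cancel_order", description: "Cancel an existing order" },
  { name: "update_address", description: "Change the shipping address" },
]

const selected = selectTools("Where is my order #4821? Also, can you cancel it?", catalog, 0.15, 2)
console.log(selected.map((s) => s.tool.name))
```

With this toy scorer, the two order tools clear the threshold and the address tool is left out of the working set for the turn.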
Structured reasoning, scores, and an OTLP span per decision. Stream to Datadog, Honeycomb, or Grafana, or run kerf trace --tail in any shell.
Continuous monitoring of selection confidence across production traffic. When drift crosses your threshold, Kerf flags it. Designed for fast retraining cycles.
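One simple way to picture confidence drift detection is a rolling mean compared against a baseline. The sketch below assumes that approach; the function name, the parameters, and the flagging rule are hypothetical and may differ from how Kerf measures drift.

```typescript
// Hypothetical drift monitor: keeps a rolling window of selection
// confidences and flags when their mean falls too far below a baseline.
function createDriftMonitor(baseline: number, maxDrop: number, windowSize: number) {
  const recent: number[] = []

  // Record one decision's confidence; returns true when drift is flagged.
  return function record(confidence: number): boolean {
    recent.push(confidence)
    if (recent.length > windowSize) recent.shift()
    if (recent.length < windowSize) return false // not enough data yet
    const mean = recent.reduce((a, b) => a + b, 0) / recent.length
    return baseline - mean > maxDrop
  }
}

// Baseline confidence 0.8, tolerated drop 0.15, window of 3 decisions.
const record = createDriftMonitor(0.8, 0.15, 3)
const flags = [0.82, 0.78, 0.8, 0.5, 0.45].map((c) => record(c))
console.log(flags)
```

In this run only the last decision trips the flag, once the low-confidence values dominate the window.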
Planned eval suite: τ-bench workflows, custom scenarios, and regression checks, with every change run against held-out traces and regression scenarios. CI integration is also planned.
Optional kerf.run() drives the full loop — planning, tool calls, retries, fallback. Or stay surgical and only use the selection call.
Per-tool, per-route policy. Designed to support PII scrubbing, argument validation, allow-lists, and rate limits, and to surface policy failures in the same trace.
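As a rough picture of what a per-tool, per-route policy check could look like, here is a sketch covering only allow-lists and rate limits. The policy shape, the `checkCall` function, and the route and tool names are assumptions for illustration, not Kerf's configuration format.

```typescript
// Hypothetical policy shape; not Kerf's configuration format.
interface RoutePolicy {
  allow: string[] // tools permitted on this route
  maxCallsPerMinute: number // per-tool budget on this route
}

const policies: Record<string, RoutePolicy> = {
  "/support": { allow: ["get_order_status", "cancel_order"], maxCallsPerMinute: 2 },
}

type Verdict = { ok: true } | { ok: false; reason: string }

// "route:tool" -> timestamps (ms) of recently accepted calls
const callLog = new Map<string, number[]>()

function checkCall(route: string, tool: string, now: number): Verdict {
  const policy = policies[route]
  if (!policy) return { ok: false, reason: `no policy for route ${route}` }
  if (!policy.allow.includes(tool)) {
    return { ok: false, reason: `${tool} is not allow-listed on ${route}` }
  }
  const key = `${route}:${tool}`
  // Keep only calls from the last 60 seconds.
  const recent = (callLog.get(key) ?? []).filter((t) => now - t < 60000)
  if (recent.length >= policy.maxCallsPerMinute) {
    callLog.set(key, recent)
    return { ok: false, reason: `${tool} exceeded ${policy.maxCallsPerMinute} calls/min on ${route}` }
  }
  recent.push(now)
  callLog.set(key, recent)
  return { ok: true }
}

console.log(checkCall("/support", "delete_account", 0))
```

Returning a reason with every rejection is what would let a failure show up in the same trace as the selection decision.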
If Kerf adds more latency than a single token costs, it isn't worth running. Selection runs on a lightweight CPU-bound reranker with no external inference calls.
Bootstrapped from tool schemas and optionally refined with local runtime telemetry — never trained on inference traffic. No external inference calls in the hot path. SOC 2 Type II planned.
Reads the tools[] array you have already written. Returns the same tool schema shape your model provider expects. Compatible with OpenAI-style, Anthropic, Gemini, and MCP tool interfaces.
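To make "the same tool schema shape" concrete, the sketch below shows how an OpenAI-style function tool maps onto the Anthropic Messages API tool shape. The two interfaces reflect the providers' public formats as I understand them; the `toAnthropic` helper is a hypothetical name and is not part of Kerf.

```typescript
// OpenAI chat-completions tool shape.
interface OpenAITool {
  type: "function"
  function: {
    name: string
    description?: string
    parameters?: Record<string, unknown> // JSON Schema for the arguments
  }
}

// Anthropic Messages API tool shape.
interface AnthropicTool {
  name: string
  description?: string
  input_schema: Record<string, unknown> // JSON Schema for the arguments
}

// Hypothetical helper: same JSON Schema, different envelope.
function toAnthropic(tool: OpenAITool): AnthropicTool {
  const { name, description, parameters } = tool.function
  return {
    name,
    description,
    // Fall back to an empty object schema when no parameters are declared.
    input_schema: parameters ?? { type: "object", properties: {} },
  }
}

const cancelOrder: OpenAITool = {
  type: "function",
  function: {
    name: "cancel_order",
    description: "Cancel an existing order",
    parameters: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
  },
}

const converted = toAnthropic(cancelOrder)
console.log(JSON.stringify(converted, null, 2))
```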
Every selection emits an OpenTelemetry span with scores, threshold, and reasoning. Ships to Datadog, Honeycomb, Grafana, or stdout.
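OpenTelemetry span attributes are flat key-value pairs, so a selection decision has to be flattened before it can ride on a span. The sketch below shows one possible flattening written to stdout as JSON. The `SelectionDecision` shape and the `kerf.*` attribute names are assumptions, not Kerf's actual span schema, and the scores are made-up illustrative values.

```typescript
// Hypothetical decision record; field names are assumptions.
interface SelectionDecision {
  query: string
  threshold: number
  scores: Record<string, number> // tool name -> score
  selected: string[]
  rationale: string
}

// Flatten a decision into span-style attributes (flat keys, primitive values).
function toSpanAttributes(d: SelectionDecision): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "kerf.threshold": d.threshold,
    "kerf.selected": d.selected.join(","),
    "kerf.rationale": d.rationale,
  }
  for (const [tool, score] of Object.entries(d.scores)) {
    attrs[`kerf.score.${tool}`] = score
  }
  return attrs
}

const decision: SelectionDecision = {
  query: "Where is my order #4821? Also, can you cancel it?",
  threshold: 0.15,
  scores: { get_order_status: 0.3, cancel_order: 0.2, update_address: 0 },
  selected: ["get_order_status", "cancel_order"],
  rationale: "query mentions order lookup and cancellation",
}

const attrs = toSpanAttributes(decision)
// One JSON line per decision is easy to pipe to a logger or to stdout.
console.log(JSON.stringify(attrs))
```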
No human labels, no annotation pipeline. The selector bootstraps from your tool schemas alone.
Fallback behavior is configurable per project and route, and a tool is never dropped silently.
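A per-route fallback could be modeled along these lines. The `Fallback` modes, the `applyFallback` function, and the route names are hypothetical; the point of the sketch is that the result always records when and why a fallback was applied, so nothing changes silently.

```typescript
// Hypothetical fallback modes; not Kerf's configuration format.
type Fallback =
  | { mode: "full_catalog" } // expose every tool when selection comes up short
  | { mode: "pinned"; tools: string[] } // always add a fixed set of tools

interface SelectionResult {
  tools: string[]
  usedFallback: boolean
  reason?: string
}

const projectDefault: Fallback = { mode: "full_catalog" }
const routeOverrides: Record<string, Fallback> = {
  "/support": { mode: "pinned", tools: ["handoff_to_human"] },
}

function applyFallback(route: string, selected: string[], catalog: string[], minTools: number): SelectionResult {
  if (selected.length >= minTools) return { tools: selected, usedFallback: false }
  const fallback = routeOverrides[route] ?? projectDefault
  const extra = fallback.mode === "full_catalog" ? catalog : fallback.tools
  // Merge without duplicates, keeping the selector's picks first.
  const tools = Array.from(new Set([...selected, ...extra]))
  return {
    tools,
    usedFallback: true,
    reason: `selected ${selected.length} tool(s), below minimum ${minTools}; applied ${fallback.mode}`,
  }
}

const fullCatalog = ["get_order_status", "cancel_order", "update_address"]
console.log(applyFallback("/support", [], fullCatalog, 1))
console.log(applyFallback("/billing", [], fullCatalog, 1))
```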
kerf.select() call. Provides a drop-in replacement for custom tool routing. Returns the same tools JSON your model already expects. Bindings for Node, Python, and Go are in progress.
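```typescript
import { kerf } from "@kerf/sdk"
import OpenAI from "openai"

// The OpenAI client reads OPENAI_API_KEY from the environment.
const openai = new OpenAI()

await kerf.init({ projectId: "proj_x7k" })

// `message` holds the conversation so far as an array of chat messages.
const { tools, trace } = await kerf.select(message)

const res = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: message,
  tools, // ← only what this turn needs
})

// `logger` is your application's own logger.
logger.debug({ trace })
```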
Imports as a library. Runs locally with no additional infrastructure required.
Runs as a sidecar container next to your agent service. Same node, localhost only.
Self-hosted gateway compatible with OpenAI-style APIs. Zero SDK changes.

Kerf is in private beta. We're working with a small number of teams to validate selection quality on real catalogs. If this problem is live for you, we'd like to talk.