DanLevy.net

Don't Marry Your Model

LLM Routing, so hot right now

Most engineering teams pick a language model and stick with it. One provider, one model, all tasks. It’s like hiring one person to do your coding, your copywriting, and your taxes because they happened to be good at the first interview.

At any given moment, one model is better at code, another is better with long messy context, and another is the cheapest boring workhorse for classification. The names change. The shape of the problem does not. Treating one model like it excels at everything means you’re either overpaying for simple tasks or getting subpar results on specialized ones.

I watched a team burn through thousands of dollars running sentiment analysis through a $30-per-million-token model when a $0.50 model would’ve done the job just as well. Simple JSON formatting, basic classification tasks, all going through their premium provider. The only thing getting heated was their AWS bill.

There’s a better way, and it’s not particularly complicated.

Delegation Over Devotion

What if you could route requests to the model that’s actually best suited for that specific task? Use your expensive powerhouse for the hard stuff, but drop the simple parsing and formatting down to something cheaper. Get the benefits of multiple providers without having to manually juggle them in your codebase.

Mastra lets you build exactly this kind of system. You set up specialist agents for different types of work, then create a router agent that figures out which specialist should handle each request. The model IDs below are examples, not a leaderboard. Swap them for the current models that win your evals and fit your budget.

Think of it like this: you have three specialists on your team.

./src/mastra/index.ts
import { Mastra } from '@mastra/core';
import { Agent } from '@mastra/core/agent';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { google } from '@ai-sdk/google';
export const claudeAgent = new Agent({
id: 'claude-agent',
instructions: 'You are an expert engineer. Write bugs? You are fired.',
model: anthropic(process.env.CODE_MODEL ?? 'claude-sonnet-4-5'),
});
export const geminiAgent = new Agent({
id: 'gemini-agent',
instructions: 'You are a creative writer. Be weird.',
model: google(process.env.LONG_CONTEXT_MODEL ?? 'gemini-3-pro-preview'),
});
export const gptAgent = new Agent({
id: 'gpt-agent',
instructions: 'You are a helpful assistant. Be boring.',
model: openai(process.env.GENERAL_MODEL ?? 'gpt-5.2'),
});

Each one has a job. Your code agent should be the model that passes your repo-specific coding evals. Your long-context agent should be the one that survives your actual documents without turning the middle into soup. Your general agent should be cheap, reliable, and boring in the best possible way.

Here’s where it gets interesting. You add a router that acts as an intelligent proxy:

export const routerAgent = new Agent({
id: 'router-agent',
name: 'The Boss',
instructions: `You are an intelligent router.
- Coding -> Claude
- Poetry -> Gemini
- Facts -> GPT
Do not do the work yourself. Delegate.`,
model: openai(process.env.ROUTER_MODEL ?? 'gpt-5-mini'), // Use a cheap model for routing!
agents: {
claudeAgent,
geminiAgent,
gptAgent,
},
});
export const mastra = new Mastra({
agents: { routerAgent, claudeAgent, geminiAgent, gptAgent },
});

The router itself runs on a lightweight model because it’s just making decisions about where to send traffic. You’re not paying premium rates to figure out which other premium model to use. Measure this too; a bad router quietly turns savings into misroutes.

When someone asks for a bubble sort implementation, the router recognizes it as code work and hands it to your code specialist. Creative writing prompt? That goes to the model you’ve chosen for voice and range. Factual question about historical events? Route it to the general agent, ideally with retrieval when freshness or citation matters.

The Practical Benefits

Cost efficiency matters more than you think. A small routing model making delegation decisions costs a fraction of running every single request through your most expensive provider. Over time, especially at scale, this adds up to real money. You only pay for the heavy-duty intelligence when you actually need it.

Quality improves when you match models to tasks. The winner changes by month, task, and prompt shape. That is why the routing layer should depend on your evals, not on whatever model was winning Twitter the week you wrote the integration.

Resilience becomes a side benefit. When OpenAI has one of its periodic outages (and they do), your router can redirect traffic to other providers. You’re not dead in the water waiting for one specific API to come back online.

This isn’t about being clever for the sake of it. It’s about building systems that make sense both financially and technically. You wouldn’t use the same hammer for every construction task, and you probably shouldn’t use the same language model for every AI task either.

The beauty of this approach is that your application code doesn’t change. You still just call your router agent. The complexity of deciding which model to use for which task lives in one place, configured once, rather than scattered throughout your codebase in a bunch of conditional logic.

Resources

Read the Series

  1. LLM Routing (This Post)
  2. Security & Guardrails
  3. MCP & Tool Integrations
  4. Workflows & Memory