Reasoning Models in Production: When Extended Thinking Actually Pays Off
Reasoning models like Claude Opus and o3 are powerful but expensive. The engineering challenge is knowing when to use them, how to configure them, and how to build systems that get the value without paying the cost on every request.
Pranav Saji
Head of AI Security · ML Consultant at LinkedIn
Reasoning models have moved from research curiosity to production option in 2026. Claude Opus, o3, Gemini thinking variants - most organizations with serious AI deployments now have access to models that spend tokens thinking through a problem before responding. The question isn't whether to use them; it's when.
The answer isn't "always" and it isn't "never." It's a routing problem, and most teams are solving it poorly - either defaulting to reasoning models for everything (expensive, slow) or avoiding them because of cost (leaving capability on the table).
Here's how to think about this.
What Reasoning Models Actually Do Differently
Standard LLMs generate tokens left to right with no mechanism for backtracking or explicit intermediate reasoning. Reasoning models add a structured "thinking" phase - a block of generated tokens that the model uses to work through a problem before producing its final response.
This thinking block is not polished prose. It's closer to working memory - the model trying approaches, catching contradictions, exploring alternative framings, self-correcting. The final response quality improves because the model can identify and discard wrong paths before committing to an answer.
The practical benefit shows up in specific categories of task:
- Multi-step logical reasoning where each step depends on prior steps and errors compound
- Mathematical and quantitative problems requiring sequential calculation
- Code debugging and architecture decisions where the correct answer requires ruling out multiple plausible alternatives
- Complex planning tasks with many interacting constraints
- Security analysis where thorough adversarial thinking improves coverage
Standard tasks - summarization, extraction, classification, question answering from provided context - show minimal improvement and don't justify the cost.
The Cost Reality
Reasoning model costs are materially higher than standard model costs. The thinking tokens cost the same as output tokens, and a reasoning model might generate thousands of thinking tokens per request. A task that costs $0.002 on a standard model might cost $0.08-0.20 on a reasoning model, depending on thinking budget.
At low volumes, this doesn't matter. At production scale - tens of thousands of requests per day - it becomes the dominant line item in your inference budget. The teams successfully using reasoning models in production have solved this with routing.
Routing: The Core Engineering Problem
The design pattern that works: a lightweight classifier determines whether a given request warrants a reasoning model, then routes accordingly.
The classifier can be:
A small, fast model - a Haiku or similar low-latency model that takes the incoming request and predicts whether it's "reasoning-worthy." This adds a small latency overhead but saves orders of magnitude in cost on the requests that don't need it.
Rule-based routing - if you have well-defined task categories, explicit rules often outperform a classifier. Requests tagged as "code review," "architectural decision," or "security analysis" go to the reasoning model. Requests tagged as "summarize," "extract," or "translate" don't.
Confidence-based escalation - route to a standard model first. If the model's response falls below a confidence threshold (measured by perplexity, self-reported uncertainty, or output validation failure), escalate to a reasoning model. This works well when low-confidence cases are rare.
User-or-context-triggered - let the application layer signal when reasoning is needed. A legal document review tool might always use reasoning; a customer FAQ bot never does.
In practice, hybrid approaches work best: explicit rules for known high-value tasks, a classifier for ambiguous cases, confidence-based escalation as a fallback.
Configuring Extended Thinking
When you do route to a reasoning model, the thinking budget matters. More thinking tokens generally improve output quality up to a point, then plateau. Calibrating the budget for your task type is worth the investment.
In the Anthropic API:
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": 8000 # adjust based on task complexity
},
messages=[{
"role": "user",
"content": "Analyze this architecture for security vulnerabilities and propose mitigations..."
}]
)
# Thinking content is separate from the final response
for block in response.content:
if block.type == "thinking":
thinking_text = block.thinking # the chain of thought
elif block.type == "text":
final_answer = block.text
The thinking block is available in the response and is worth logging - it's a gold mine for debugging incorrect outputs and understanding why the model reached a particular conclusion.
Calibration guidelines:
- Simple reasoning tasks: 2,000-4,000 thinking tokens
- Complex multi-step reasoning: 8,000-16,000 thinking tokens
- Maximum complexity (adversarial security analysis, complex code architecture): 16,000-32,000 thinking tokens
Beyond 32,000 thinking tokens, most tasks show diminishing returns. The exception is extremely complex mathematical proofs or multi-step planning problems - test empirically for your specific use case.
What to Build Around Reasoning Models
Reasoning models change what's tractable to build, not just how good your existing products are.
The category of tasks that become viable with reasoning models but weren't viable before: complex single-shot decision making. Tasks that previously required multi-step pipelines with multiple model calls, intermediate validation, and human review can sometimes be collapsed into a single well-prompted reasoning model call.
A security audit that previously required: initial scan, categorization model, prioritization model, human review, report generation model - might be reducible to a single reasoning model call with a comprehensive system prompt. The reasoning process substitutes for some of the pipeline complexity.
This isn't always the right architectural choice - pipelines offer explainability, checkpointing, and per-step validation that a single model call doesn't. But for organizations that need fewer moving parts, reasoning models make collapsing pipeline steps viable.
The Observability Layer
Reasoning models generate more signal than standard models - specifically, the thinking trace. This trace is underutilized by most teams.
What you can extract from thinking traces:
- Uncertainty signals: the model hedging, considering multiple possibilities, or expressing low confidence in its reasoning
- Assumption identification: the model explicitly stating what it's assuming, which you can validate against ground truth
- Failure mode identification: when outputs are wrong, the thinking trace usually shows where the reasoning went off track
- Prompt improvement signals: recurring patterns in the thinking trace that suggest the prompt is underspecified
Logging and analyzing thinking traces - even sampled at 5-10% of production traffic - dramatically improves your ability to improve model outputs over time. Don't discard them.
The Right Mental Model
Reasoning models are not universally better models. They're models optimized for a different point on the quality/cost/latency tradeoff curve - trading speed and cost for quality on hard problems.
The teams using them well have two things in place: a routing layer that ensures reasoning models handle the requests where quality improvement is worth the cost, and an observability layer that uses the thinking trace to continuously improve their prompts and routing decisions.
The teams using them poorly are either running every request through a reasoning model (burning budget on simple tasks) or avoiding them entirely (leaving quality on the table for the complex tasks where they'd help).
The routing infrastructure is the product, not the model choice.
Want to talk about AI security or engineering?
I'm always open to conversations about AI strategy, security architecture, and enterprise AI deployments.
Get in touch