LLM providers compared — choosing the right model for your product

OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, Llama, and the open-source frontier. A practitioner’s guide to which model fits which job — with real pricing, benchmark data, and the decision framework I use with clients building AI-powered products.

The bottom line

Default to GPT-4.1 for general work and route simple tasks to a budget model like Gemini Flash-Lite or DeepSeek V3. Most production workloads should use two to four models, not one. This routing strategy alone cuts AI costs by 50–70% before you optimize a single prompt.

The 2026 landscape at a glance

The LLM market in 2026 looks nothing like it did eighteen months ago. The “just use GPT-4” era is over. There are now genuinely competitive models across four tiers — premium closed-source, cost-efficient closed-source, open-weight frontier, and open-weight efficient — and the performance gap between tiers has compressed dramatically.

The key market dynamic: enterprise AI spending hit $37 billion in 2025 — up from $11.5 billion in 2024. OpenAI’s enterprise market share eroded from 50 percent to roughly 25 percent. Anthropic gained the most enterprise ground. Open-source models went from “interesting research” to “production-ready alternatives.” Teams that match models to tasks do better than teams that bet everything on one provider.

This guide maps the full provider landscape, compares pricing at every tier, and gives you the decision framework I use when helping clients architect AI-powered products. I have preferences and I will tell you what they are and why.

| Tier | What it means | Key players | Typical output cost per 1M tokens |
|---|---|---|---|
| Premium closed-source | Frontier intelligence, highest cost | GPT-5.4, Claude Opus 4.6, Gemini 3 Pro, Grok 4 | $8–$25 |
| Cost-efficient closed-source | Production workhorses — 80–90% of frontier quality at a fraction of the cost | GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Pro, o4-mini | $1.60–$15 |
| Open-weight frontier | Comparable to closed-source on many tasks, self-hostable | DeepSeek V3.2, Qwen 3.5, Mistral Large 3, Kimi K2.5 | $0.42–$2 (API) or infrastructure cost if self-hosted |
| Open-weight efficient | Good enough for most simple tasks, dramatically cheaper | Llama 4 Scout/Maverick, Mistral Medium 3, smaller models | $0.20–$2 (API) or self-hosted |

The boundary between tiers is blurring. Open-weight models now match or exceed last year’s premium closed-source on most benchmarks. The real question is no longer “which model is best” — it is “which model is cheapest for each job in my pipeline.”

The major providers

OpenAI — the ecosystem giant

OpenAI’s advantage is no longer that they have the best model — it is that they have the widest ecosystem. More tutorials, more libraries, more third-party integrations, more production battle-testing than any other provider. If you value community support and ecosystem breadth, OpenAI is the safe default.

The current lineup splits into three families. GPT-5.x is the flagship generation — GPT-5.4 at $2.50/$15.00 per million tokens for input/output, with a Pro tier at $30/$180 for the hardest problems. GPT-4.1 is the production workhorse that replaced GPT-4o — better coding, better instruction following, a million-token context window, and priced at $2.00/$8.00 with a nano variant at $0.10/$0.40 that handles classification and extraction tasks at rock-bottom cost. The o-series reasoning models (o3, o4-mini) trade latency for accuracy on complex problems — o4-mini at $1.10/$4.40 retains roughly 90 percent of o3’s capability at a fraction of the cost.

The batch API gives 50 percent off everything with a 24-hour turnaround. Prompt caching drops repeated context costs by 75 percent. These two features alone make OpenAI dramatically cheaper in production than the headline prices suggest. EU data residency is available for API usage since February 2025, but only for new projects — existing projects cannot be migrated.
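To make the discount math concrete, here is a sketch of how batch pricing and prompt caching compound. Prices are the GPT-4.1 list prices quoted in this section; the traffic volumes are invented for the example, and caching is modeled as billing cached input at 25 percent of list price (the 75 percent discount described above).

```python
# Illustrative cost math for batch + prompt-caching discounts.
# GPT-4.1 list prices from this article ($ per 1M tokens);
# the workload numbers are made up for the example.

INPUT_PRICE = 2.00   # $ per 1M input tokens
OUTPUT_PRICE = 8.00  # $ per 1M output tokens

def monthly_cost(input_m, output_m, cached_input_m=0.0, batch=False):
    """Dollar cost for a month of traffic, in millions of tokens.

    cached_input_m: input tokens served from the prompt cache,
    billed at 25% of the input price (a 75% discount).
    batch: route everything through the batch API (50% off).
    """
    fresh = input_m - cached_input_m
    cost = (fresh * INPUT_PRICE
            + cached_input_m * INPUT_PRICE * 0.25
            + output_m * OUTPUT_PRICE)
    return cost * 0.5 if batch else cost

# 100M input tokens (80M of them a repeated system prompt), 20M output:
naive = monthly_cost(100, 20)                               # $360
optimized = monthly_cost(100, 20, cached_input_m=80, batch=True)  # $120
print(f"naive: ${naive:.2f}, optimized: ${optimized:.2f}")
```

Two-thirds of the bill disappears before touching a single prompt, which is why headline per-token prices understate the gap between providers with and without these features.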

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | Long context | Hardest problems |
| GPT-4.1 | $2.00 | $8.00 | 1M | Production workhorse |
| GPT-4.1 mini | $0.40 | $1.60 | 1M | Balanced cost/quality |
| GPT-4.1 nano | $0.10 | $0.40 | 1M | Classification, routing, extraction |
| o3 | $2.00 | $8.00 | 200K | Complex multi-step reasoning |
| o4-mini | $1.10 | $4.40 | 200K | Best-value reasoning |
Choose this when

Ecosystem breadth matters most — widest library support, most documentation, strongest parallel function calling. The GPT-4.1 family is the best all-around production lineup.

Anthropic Claude — structured output and reasoning

Claude’s strength is reliability on structured tasks. When you need JSON output that actually validates, multi-step reasoning that shows its work, or code generation that sustains quality across long sessions, Claude is typically the best option. The structured outputs feature — strict JSON schema validation combined with tool use — produces the most consistent machine-readable output of any provider I have tested.
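Whichever provider you use, the pattern that makes structured output dependable is the same: parse, validate against the fields you actually need, and retry or escalate on failure. A minimal sketch — `call_llm` is a hypothetical stand-in for your client, and the required-field schema is invented for illustration:

```python
import json

# Minimal guard for machine-readable LLM output: parse, check the
# schema essentials, retry on failure. `call_llm` is a hypothetical
# stand-in for whatever provider client you use.

REQUIRED = {"title": str, "year": int, "themes": list}  # example schema

def parse_structured(raw: str) -> dict:
    """Return the validated dict, or raise ValueError."""
    data = json.loads(raw)  # raises on malformed JSON
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

def extract_with_retry(call_llm, prompt: str, retries: int = 2) -> dict:
    for _ in range(retries + 1):
        try:
            return parse_structured(call_llm(prompt))
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            continue
    raise RuntimeError("model never produced valid JSON")
```

The point of the guard is that a provider with stronger schema adherence simply trips the retry path less often — which is exactly where Claude's validation-rate advantage shows up as lower cost and latency.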

The current lineup: Claude Opus 4.6 at $5/$25 is the most capable model for mission-critical work — 67 percent cheaper than previous Opus pricing. Claude Sonnet 4.6 at $3/$15 is the balanced general-purpose option with a million-token context window. Claude Haiku 4.5 at $1/$5 is the speed-optimized choice for high-volume work.

Claude’s extended thinking mode lets you dynamically control how much reasoning the model applies — near-instant responses for simple tasks, deeper thinking for complex ones, all within the same model. For coding, Opus 4 was “the world’s best coding model” when it launched, and the 4.6 generation sustains that strength. On creative writing with constraints (following word counts, structural requirements), Claude significantly outperforms OpenAI — one independent test showed a 93.9 percent validation rate for Anthropic versus 77.8 percent for OpenAI.

The limitation for European teams: Claude API has EU data residency since August 2025, but Claude.ai and Claude Desktop process everything in the US. If your team uses Claude interactively for work involving personal data, this is a real GDPR friction point.

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Mission-critical, complex reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Balanced quality/price, coding |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | High-volume, speed-critical |
Choose this when

Structured output reliability is critical — JSON schema validation, multi-step tool use, and code generation are genuinely stronger. Best for AI pipelines that need machine-readable output you can trust.

Google Gemini — the multimodal and long-context leader

Gemini’s differentiator is that it was designed from the ground up as a multimodal model — not a text model with vision bolted on. Text, images, video, and audio are first-class inputs within a single unified architecture. If your product processes documents with layout structure, analyzes video, or needs to understand audio, Gemini handles these natively in ways the competition still approximates.

The other standout: context windows. Gemini 3 Pro handles 2 million tokens. Gemini 2.5 Pro handles 1 million-plus. These are the industry’s largest. For tasks like multi-document summarization, large codebase analysis, or processing entire book-length inputs, Gemini has no practical equivalent.

Price-wise, Gemini is the most generous. Gemini 2.5 Flash at $0.30/$2.50 is competitive with models that cost five to ten times more on many tasks. Flash-Lite at $0.10/$0.40 is among the cheapest capable models available from any provider. The free tier is generous enough that prototyping is essentially free. Google’s TPU infrastructure advantage is real — even OpenAI began leasing Google TPU capacity in mid-2025.

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| Gemini 3 Pro | Premium | Premium | 2M | Frontier multimodal |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M+ | Long-context, reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | High-volume quality tasks |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget, fastest in lineup |
Choose this when

You need native multimodal understanding, massive context windows, or aggressive cost efficiency. The Flash models give you 80% of frontier quality at 10% of frontier price.

xAI Grok — the emerging challenger

Grok is the newest serious entrant. Grok 4 at $3/$15 offers frontier reasoning with native tool use and real-time search integration. The standout is Grok 4.1 Fast at $0.20/$0.50 with a 2 million token context window — potentially the best value proposition for long-context work in the entire market.

The trade-off: smaller ecosystem, less production battle-testing, fewer third-party integrations. Grok is worth evaluating for cost-sensitive long-context workloads, but I would not make it my primary provider for production systems without more track record.

Choose this when

You need long-context processing at rock-bottom prices. Grok 4.1 Fast’s 2M context at $0.20/$0.50 is hard to beat. Watch this space — but it is newer and less battle-tested.

The open-source frontier

The open-source story in 2026 is no longer “interesting for research, not ready for production.” Multiple open-weight models now match or exceed last year’s closed-source frontier on real-world benchmarks. The question has shifted from “can open-source compete?” to “when does the operational overhead of self-hosting justify the cost savings and control?”

DeepSeek — open-source, rock-bottom pricing

DeepSeek changed the LLM economics conversation. V3.2 at $0.28/$0.42 per million tokens is roughly one-tenth to one-twentieth of OpenAI for comparable quality, with a Speciale variant that rivals Gemini 3 Pro on reasoning benchmarks.

The architecture is clever: 685 billion total parameters organised as a Mixture of Experts (MoE) — a design in which only a fraction of the parameters activate per request, cutting inference cost. Fully open-source under an MIT license. The caveat: Chinese data processing. For European companies with GDPR obligations, this means either self-hosting on EU infrastructure or accepting the sovereignty risk of sending data to China.

Choose this when

Maximum cost efficiency matters and data sovereignty permits. If you self-host on EU infrastructure, you get MIT-licensed frontier capability at a fraction of API costs.

Meta Llama 4 — the community standard

Llama 4 brought two genuinely useful innovations: Mixture of Experts architecture for efficient inference, and an industry-leading 10 million token context window on the Scout model. Llama 4 Maverick exceeds GPT-4o on coding, reasoning, and multilingual benchmarks, though it falls short of the current top tier (Gemini 2.5 Pro, Claude Sonnet 4).

The real advantage: community. Llama is the most deployed open-weight model family. More fine-tuned variants, more deployment guides, more infrastructure tooling than any competitor. Available across every major cloud platform. Scout runs on a single H100 GPU.

Choose this when

You need self-hosted deployments where community support and deployment tooling matter. The 10M-token context on Scout is unmatched for open-weight long-context work.

Mistral — the European option

Mistral is the only major LLM provider that is a European company with EU-based infrastructure as the default. For European businesses where data sovereignty is non-negotiable, this matters more than benchmarks. Mistral Medium 3 at $0.40/$2.00 performs at roughly 90 percent of Claude Sonnet on benchmarks at significantly lower cost.

Mistral also offers enterprise fine-tuning, custom pre-training, and strong multilingual support (particularly French, German, Spanish, Italian). Their open-weight models (Apache 2.0) are self-hostable for maximum control. Devstral is positioned as the best open-source model for coding agents.

Choose this when

European data sovereignty is a requirement, not a preference. The only major provider where EU hosting is the default, not an add-on. Competitive quality at compelling prices.

Qwen and the Chinese open-source wave

Alibaba’s Qwen 3.5 and Zhipu’s GLM-4.7 lead a wave of Chinese open-source models that are genuinely world-class. Qwen 3.5 hits 88.4% on GPQA Diamond and 76.4% on SWE-bench Verified — frontier-tier results. The remarkable efficiency story: Qwen3-4B rivals the performance of Qwen2.5-72B, meaning each generation delivers comparable quality from a model a fraction of the size.

Same caveat as DeepSeek: Chinese origin means you either self-host on your own infrastructure or accept data sovereignty risk. But for teams that can self-host, these models offer strong capability per dollar.

For European teams

If GDPR compliance shapes your AI decisions, your shortlist looks different:

  • Mistral: EU company, EU servers by default, open-weight for self-hosting
  • OpenAI API: EU data residency since Feb 2025 (new projects only)
  • Claude API: EU data residency since Aug 2025 (API only — claude.ai remains US)
  • Self-hosted open-weight: Mistral, Llama, or DeepSeek on your own EU infrastructure

DeepSeek and Qwen APIs process data in China. Their models are MIT/Apache licensed for self-hosting, which solves the sovereignty problem if you run them yourself.

The pricing matrix

Pricing is where most teams make their first mistake. They pick one expensive model and run everything through it. The reality: a well-designed AI pipeline uses cheap models for simple tasks and reserves expensive models for hard ones. The difference between “$500/month” and “$5,000/month” is usually architecture, not model choice.
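The blended-cost arithmetic behind that claim is worth seeing once. This sketch uses output prices quoted in this article; the 70/20/10 traffic split is an assumed example, not a measurement:

```python
# Blended output-cost comparison: one premium model for everything
# vs routing by task tier. Prices ($ per 1M output tokens) are from
# this article; the traffic split is an assumed example.

PRICES = {"gpt-4.1-nano": 0.40, "gpt-4.1": 8.00, "claude-sonnet-4.6": 15.00}

def blended_price(split: dict) -> float:
    """Weighted $/1M output tokens for a traffic split {model: share}."""
    assert abs(sum(split.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(PRICES[m] * share for m, share in split.items())

all_premium = blended_price({"claude-sonnet-4.6": 1.0})   # $15.00 / 1M
routed = blended_price({"gpt-4.1-nano": 0.7,              # simple tasks
                        "gpt-4.1": 0.2,                   # standard tasks
                        "claude-sonnet-4.6": 0.1})        # hard tasks
print(f"routed: ${routed:.2f}/1M, saving {1 - routed / all_premium:.0%}")
```

Under these assumptions the routed pipeline lands around $3.38 per million output tokens — a roughly 77 percent saving over running everything through the premium model, which is why architecture dominates model choice.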

Which LLM is cheapest for production?

| Model | Input | Output | Context | Tier | Best for |
|---|---|---|---|---|---|
| GPT-4.1 nano | $0.10 | $0.40 | 1M | Budget | Classification, extraction, routing |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget | High-volume, low-latency |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | Budget | Long-context at lowest cost |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | Budget | Maximum cost efficiency |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Mid | High-volume quality tasks |
| Mistral Medium 3 | $0.40 | $2.00 | 131K | Mid | European hosting, balanced |
| GPT-4.1 mini | $0.40 | $1.60 | 1M | Mid | Balanced performance/cost |
| DeepSeek R1 | $0.55 | $2.19 | 128K | Mid | Budget reasoning |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Mid | Fast structured output |
| o4-mini | $1.10 | $4.40 | 200K | Mid | Best-value reasoning |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M+ | Premium | Multimodal, long context |
| GPT-4.1 | $2.00 | $8.00 | 1M | Premium | Production workhorse |
| o3 | $2.00 | $8.00 | 200K | Premium | Complex reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Premium | Structured output, coding |
| Grok 4 | $3.00 | $15.00 | 256K | Premium | Frontier reasoning |
| GPT-5.4 | $2.50 | $15.00 | Long | Frontier | Hardest problems |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Frontier | Mission-critical |

Per-million-token pricing, March 2026

Budget tier ($0.10–$0.50 output) handles most simple tasks — classification, extraction, routing, summarization. GPT-4.1 nano and Gemini Flash-Lite are interchangeable here. DeepSeek V3.2 offers the best raw value if data sovereignty is not a concern. Most production pipelines should route 60 to 80 percent of their requests to this tier.

Mid tier ($1–$5 output) is where production workloads live. o4-mini is the best-value reasoning model. Claude Haiku is the speed-structured-output sweet spot. Gemini Flash offers the broadest capability at this price point. Mistral Medium 3 is the European-first choice.

Premium tier ($8–$25 output) is for complex reasoning, coding, and mission-critical tasks. GPT-4.1 and Claude Sonnet are the two production defaults. Reserve these for the 10 to 20 percent of requests that actually need frontier intelligence.

Builder’s note

These prices change quarterly. Every new model generation is cheaper than the last. When I built the AI enrichment pipeline for PAJ by Imparato, we designed the architecture to be provider-agnostic from day one — versioned prompts with per-version model and provider configuration. When prices drop or a better model appears, we swap the config. No code changes. This is not premature abstraction — it pays for itself the first time you need to swap.

The decision framework

There is no single “best LLM.” The right model depends on five axes: what kind of task, how fast it needs to respond, what you can spend, where your data is allowed to go, and how you want to deploy. Here is the framework I use with clients.

What does your pipeline need most?

The first axis is the task itself. In short: structured extraction and complex reasoning point to Claude; multimodal inputs or very long context point to Gemini; ecosystem breadth and tooling point to OpenAI; EU data sovereignty points to Mistral or self-hosted open-weight models; pure cost efficiency points to the budget tier.

Additional decision axes

Latency requirements. Real-time chat under one second requires smaller models (Haiku, Flash, GPT-4.1 nano). Background processing should use the batch API for 50 percent savings.

Cost budget. Route 60–80 percent of requests to the budget tier (see pricing matrix above) and reserve premium models for the 10–20 percent that need frontier intelligence. Once monthly API spend grows into the thousands, evaluate self-hosting.
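The routing itself does not need to be clever to capture most of the savings. A minimal sketch of a task-tier router — the tier assignments and model names here are illustrative choices drawn from this article, not a recommendation engine:

```python
# Minimal task-tier router: send each request kind to the cheapest
# model that handles it. Assignments are illustrative examples.

ROUTES = {
    "classify": "gemini-2.5-flash-lite",   # budget tier
    "extract":  "gpt-4.1-nano",            # budget tier
    "generate": "gpt-4.1-mini",            # mid tier
    "reason":   "o4-mini",                 # reasoning model, mid tier
    "code":     "claude-sonnet-4.6",       # premium tier
}

def route(task_kind: str, needs_realtime: bool = False) -> str:
    model = ROUTES.get(task_kind, "gpt-4.1")  # sensible default
    if needs_realtime and model == "o4-mini":
        # reasoning models trade latency for accuracy; for sub-second
        # responses fall back to a fast standard model
        model = "gpt-4.1-mini"
    return model
```

Even a static lookup table like this, applied at the pipeline entry point, gets you most of the 50–70 percent cost reduction; dynamic complexity scoring can come later.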

Should I self-host an LLM or use an API?

Self-hosting is not free — it shifts the bill from API fees to engineering salaries, infrastructure, and maintenance, and the total cost of ownership of a production-grade open-source deployment is often 5 to 10 times the raw API bill once you factor everything in. As a rough rule, start evaluating self-hosting above $1,000/month in API spend — at sufficient volume, payback can drop under three months — but only if you have the engineering capacity to maintain it.
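The break-even question is simple arithmetic once you write it down. All the numbers in this sketch are assumptions to plug your own figures into — they are not benchmarks:

```python
# Back-of-envelope self-hosting break-even. Every input is an
# assumption you should replace with your own numbers.

def payback_months(monthly_api_spend: float,
                   monthly_infra: float,     # GPU rental, storage, egress
                   monthly_eng_cost: float,  # maintenance time, priced
                   setup_cost: float):       # migration + initial tuning
    """Months to recoup setup cost; None if self-hosting never saves."""
    monthly_saving = monthly_api_spend - (monthly_infra + monthly_eng_cost)
    if monthly_saving <= 0:
        return None  # ongoing costs already exceed the API bill
    return setup_cost / monthly_saving

# Example: $5k/month API bill, $1.2k infra, $1.8k engineering, $4k setup
print(payback_months(5000, 1200, 1800, 4000))   # 2.0 months
# Example: $800/month API bill never pays back at these overheads
print(payback_months(800, 1200, 1800, 4000))    # None
```

The second example is the common failure mode: below a certain spend, the engineering overhead alone exceeds the API bill, and self-hosting is a net loss no matter how cheap the model weights are.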

Reasoning models — when thinking time pays for itself

Reasoning models dynamically allocate computational resources during inference. They “think” before responding, trading latency for accuracy. The trade-off shifts: instead of “bigger model = better,” it becomes “more thinking time = better, on problems that reward it.”

| Benchmark | DeepSeek R1 | o3 | o4-mini | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2024 | 79.8% | 83.3% | ~77% | 81.5% |
| GPQA Diamond | 71.5% | 87.7% | — | 84.2% |
| ARC-AGI-2 | 41.2% | 52.8% | — | 56.4% |
| SWE-bench Verified | — | 69.1% | 68.1% | — |
| Output cost / 1M | $2.19 | $8.00 | $4.40 | $5–$10 |
| Open source | Yes (MIT) | No | No | No |

Use reasoning models for: complex multi-step logic, math, science, code debugging, contract analysis, financial model validation — anywhere the cost of errors is high and thinking time improves accuracy.

Use standard models for: classification, extraction, summarization, content generation, chat — where speed matters more than depth.

The key insight is that o4-mini at $1.10/$4.40 is both smarter and cheaper than the previous generation’s o1 at $15/$60. The frontier keeps moving downward in price.

European considerations — data sovereignty and the AI Act

EU AI Act timeline

| Date | What happens |
|---|---|
| Feb 2025 | Prohibited AI practices and literacy obligations in effect |
| Aug 2025 | GPAI governance rules (including LLMs) applicable |
| Aug 2026 | Full application for most operators including high-risk systems |
| Aug 2027 | Extended transition for high-risk systems in regulated products |

Which LLM provider has EU data residency?

| Provider | EU API processing | EU data at rest | Self-hostable | EU company |
|---|---|---|---|---|
| Mistral | ✓ default | ✓ | ✓ | ✓ France |
| OpenAI | ✓ since Feb 2025 (new projects) | ✓ Enterprise/Edu | ✗ | ✗ US |
| Anthropic | ✓ since Aug 2025 (API only) | ✗ (claude.ai = US) | ✗ | ✗ US |
| Google | ✓ Vertex AI regional | — | ✗ | ✗ US |
| DeepSeek | ✗ China | — | ✓ MIT | ✗ China |
| Llama (Meta) | Via cloud providers | Via cloud providers | ✓ | ✗ US |
| Qwen (Alibaba) | ✗ China | — | ✓ | ✗ China |

The US Cloud Act means American firms can be compelled to surrender data regardless of where it is stored. For maximum sovereignty, the practical answer is self-hosting open-weight models on EU infrastructure. Mistral is the only major provider where this is the default posture rather than an enterprise add-on.

For teams that need certainty — not just contractual assurance but actual technical guarantees — the practical option is self-hosted open-weight models (Mistral, Llama, or DeepSeek) on your own EU cloud infrastructure, with your own encryption keys and no third-party access to the model weights or your data.

Multi-provider strategy — how production teams actually do it

Most production setups work better with multiple providers. The idea is task-based routing: Claude for structured extraction and complex reasoning. Gemini Flash for high-volume processing where cost matters more than maximum quality. OpenAI for tasks where ecosystem integrations matter. Self-hosted models for privacy-critical operations or language-specific refinement.

LiteLLM makes multi-provider routing trivial — one unified interface across all providers. For enterprise scale, AI gateways (Bifrost, Cloudflare AI Gateway, Kong) add intelligent load balancing, automatic failover, and token-aware rate limiting.

The key design principle: build provider-agnostic from day one. Versioned prompts with per-version model and provider configuration. When a provider changes pricing, deprecates a model, or a competitor ships something better, you swap a config file — not rewrite code.
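To show what "swap a config file, not rewrite code" means concretely, here is a sketch of versioned prompts carrying their own model and provider configuration. The structure, task names, and prompt text are illustrative, not the actual PAJ implementation:

```python
# Sketch of provider-agnostic versioned prompt config: each prompt
# version pins its provider and model, so switching is a data change,
# not a code change. Names and templates are illustrative.

PROMPTS = {
    ("summarize_play", 3): {
        "provider": "mistral", "model": "mistral-medium-3",
        "template": "Summarize this play:\n{text}",
    },
    ("summarize_play", 4): {  # same task, routed to a cheaper provider
        "provider": "google", "model": "gemini-2.5-flash",
        "template": "Summarize this play:\n{text}",
    },
}

def render(task: str, version: int, **params) -> dict:
    """Resolve a versioned prompt into a ready-to-dispatch request."""
    cfg = PROMPTS[(task, version)]
    return {"provider": cfg["provider"],
            "model": cfg["model"],
            "prompt": cfg["template"].format(**params)}
```

Your dispatch layer reads `provider` and `model` from the returned dict and calls the matching client; bumping a task to a new version is an entry in `PROMPTS`, reviewable like any other config change.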

The practitioner angle — what I use and why

What we chose for PAJ by Imparato — and why

PAJ is an AI-powered theater play discovery platform — 559 plays enriched with 25,000 to 40,000 AI-generated data points. Built on a French Ministry of Culture grant. The LLM decisions had real budget and compliance consequences.

We chose Mistral as the primary LLM for the enrichment pipeline. The reasons: EU data hosting (non-negotiable for a government-funded cultural project), strong French language support (analyzing theatrical texts in French), competitive pricing, and reliable API. For daily title normalisation — a lightweight, repetitive task — we route to Gemini via an n8n workflow because it is the cheapest option that handles the task reliably.

The architecture is provider-agnostic by design. Our custom synthetics framework uses versioned prompts with ERB templates and JSON Schema output validation. Each prompt version stores its model, provider, and cost configuration. Swapping from Mistral to Claude for a specific enrichment type requires changing one config line.

The cost optimisation that mattered most was architectural, not model selection. We send the full play text to the LLM once, compress it into a structured interaction matrix, and reuse that matrix for seven-plus downstream enrichments. This preprocessing pattern saves roughly six redundant full-text LLM calls per play. At 559 plays, that is thousands of saved API calls — a cost reduction that no model switch could match.
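The input-token arithmetic behind the compress-once pattern is easy to check. The token counts here are assumed round numbers for illustration — only the play count (559) and enrichment count (7) come from the project description above:

```python
# Token arithmetic for the compress-once pattern: one full-text call
# produces a compact matrix that downstream enrichments reuse, instead
# of each enrichment re-reading the full text. Token sizes are assumed.

def input_tokens(items: int, enrichments: int,
                 full_text_tokens: int, matrix_tokens: int,
                 compress: bool = True) -> int:
    if not compress:
        # naive: every enrichment re-sends the full text
        return items * enrichments * full_text_tokens
    # one full-text pass per item, then enrichments read only the matrix
    return items * (full_text_tokens + enrichments * matrix_tokens)

# 559 plays, 7 enrichments, ~30k-token plays, ~1.5k-token matrices:
naive = input_tokens(559, 7, 30_000, 1_500, compress=False)
smart = input_tokens(559, 7, 30_000, 1_500)
print(naive, smart, f"{1 - smart / naive:.0%} fewer input tokens")
```

Under these assumptions the pattern cuts input tokens by roughly 80 percent — a reduction that applies at any provider's prices, which is why it dominates model selection as a cost lever.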

What I reach for on new projects

  • Structured output and complex reasoning — my default is Claude. The schema validation is genuinely more reliable.
  • Ecosystem breadth and general-purpose tasks — OpenAI. Widest library support, most community documentation.
  • Cost-sensitive high-volume work — Gemini Flash. Best capability-to-cost ratio for batch processing.
  • European data sovereignty requirements — Mistral first, then self-hosted open-weight models.

What works best for startups: start with one provider’s mid-tier model. Add a second provider when you have a specific task that the first handles poorly or expensively. Add model routing when your monthly spend crosses roughly $500. Before that, the engineering complexity of multi-model is not worth the cost savings.

Building something with AI?

I help founders and CTOs architect AI-powered products — from choosing the right model to building the pipeline that keeps it running in production. A 30-minute conversation can save months of wrong turns.

Frequently asked questions

Which LLM provider is best for European startups with GDPR requirements?

Mistral is the safest default — it is a French company with EU servers by default. Claude offers EU data residency since August 2025, and OpenAI since February 2025 for new projects. For maximum control, self-host an open model like DeepSeek V3 or Llama on EU infrastructure.

How much does it cost to run an LLM in production?

Costs vary 100x depending on model choice. Budget models (Gemini Flash-Lite, GPT-4.1 nano) cost $0.10–0.40 per million tokens. Mid-tier models (Claude Sonnet, GPT-4.1) cost $2–15 per million tokens. Premium models (Claude Opus, o3) cost $5–25. Most teams overspend by using one expensive model for everything instead of routing by task complexity.

Should I self-host an open-source LLM or use an API?

Use APIs until your monthly spend exceeds $1,000/month and you have engineering capacity to manage infrastructure. Above that threshold, self-hosting models like DeepSeek V3 or Llama typically pays back in under three months. Self-hosting also solves data sovereignty concerns completely.

What is the best LLM for structured JSON output?

Claude (Sonnet and Opus) consistently leads on structured output reliability — fewer malformed responses, better schema adherence. OpenAI’s structured output mode is also strong. For budget tasks, Gemini Flash with a strict schema prompt works well at a fraction of the cost.

Do I need a reasoning model like o4-mini or can I use a standard LLM?

Standard models handle 90% of production tasks — classification, extraction, summarization, generation. Use reasoning models only for multi-step logic, mathematical computation, or complex planning where you need the model to show its work. Reasoning models are 3–5x more expensive and significantly slower.

How do I reduce LLM costs without losing quality?

Route each request to the cheapest model that can handle it — use a budget model for classification and extraction, a mid-tier model for generation, and premium only for complex reasoning. Use batch processing for non-real-time tasks (50% discount on most providers). Cache frequent queries. This strategy cuts costs 50–70% with no quality loss on most workloads.

