LLM providers compared — choosing the right model for your product

OpenAI, Anthropic, Google Gemini, Mistral, DeepSeek, Llama, and the open-source frontier. A practitioner’s guide to which model fits which job — with real pricing, benchmark data, and the decision framework I use with clients building AI-powered products.

The bottom line

Default to GPT-4.1 for general work and route simple tasks to a budget model like Gemini Flash-Lite or DeepSeek V3. Most production workloads should use two to four models, not one. This routing strategy alone cuts AI costs by 50–70% before you optimize a single prompt.

The 2026 landscape at a glance

The LLM market in 2026 looks nothing like it did eighteen months ago. The “just use GPT-4” era is over. There are now genuinely competitive models across four tiers — premium closed-source, cost-efficient closed-source, open-weight frontier, and open-weight efficient — and the performance gap between tiers has compressed dramatically.

The key market dynamic: enterprise AI spending hit $37 billion in 2025 — up from $11.5 billion in 2024. OpenAI’s enterprise market share eroded from 50 percent to roughly 25 percent. Anthropic gained the most enterprise ground. Open-source models went from “interesting research” to “production-ready alternatives.” Teams that match models to tasks do better than teams that bet everything on one provider.

This guide maps the full provider landscape, compares pricing at every tier, and gives you the decision framework I use when helping clients architect AI-powered products. I have preferences and I will tell you what they are and why.

| Tier | What it means | Key players | Typical output cost per 1M tokens |
|---|---|---|---|
| Premium closed-source | Frontier intelligence, highest cost | GPT-5.4, Claude Opus 4.6, Gemini 3 Pro, Grok 4 | $8–$25 |
| Cost-efficient closed-source | Production workhorses — 80–90% of frontier quality at a fraction of the cost | GPT-4.1, Claude Sonnet 4.6, Gemini 2.5 Pro, o4-mini | $1.60–$15 |
| Open-weight frontier | Comparable to closed-source on many tasks, self-hostable | DeepSeek V3.2, Qwen 3.5, Mistral Large 3, Kimi K2.5 | $0.42–$2 (API) or infrastructure cost if self-hosted |
| Open-weight efficient | Good enough for most simple tasks, dramatically cheaper | Llama 4 Scout/Maverick, Mistral Medium 3, smaller models | $0.20–$2 (API) or self-hosted |

The boundary between tiers is blurring. Open-weight models now match or exceed last year’s premium closed-source on most benchmarks. The real question is no longer “which model is best” — it is “which model is cheapest for each job in my pipeline.”

The major providers

OpenAI — the ecosystem giant

OpenAI’s advantage is no longer that they have the best model — it is that they have the widest ecosystem. More tutorials, more libraries, more third-party integrations, more production battle-testing than any other provider. If you value community support and ecosystem breadth, OpenAI is the safe default.

The current lineup splits into three families. GPT-5.x is the flagship generation — GPT-5.4 at $2.50/$15.00 per million tokens for input/output, with a Pro tier at $30/$180 for the hardest problems. GPT-4.1 is the production workhorse that replaced GPT-4o — better coding, better instruction following, a million-token context window, and priced at $2.00/$8.00 with a nano variant at $0.10/$0.40 that handles classification and extraction tasks at rock-bottom cost. The o-series reasoning models (o3, o4-mini) trade latency for accuracy on complex problems — o4-mini at $1.10/$4.40 retains roughly 90 percent of o3’s capability at a fraction of the cost.

The batch API gives 50 percent off everything with a 24-hour turnaround. Prompt caching drops repeated context costs by 75 percent. These two features alone make OpenAI dramatically cheaper in production than the headline prices suggest. EU data residency is available for API usage since February 2025, but only for new projects — existing projects cannot be migrated.
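To make the discount math concrete, here is a sketch of how batch pricing and prompt caching compound. Prices are the GPT-4.1 list prices quoted in this section; the traffic volumes are invented for the example, and caching is modeled as billing cached input at 25 percent of list price (the 75 percent discount described above).

```python
# Illustrative cost math for batch + prompt-caching discounts.
# GPT-4.1 list prices from this article ($ per 1M tokens);
# the workload numbers are made up for the example.

INPUT_PRICE = 2.00   # $ per 1M input tokens
OUTPUT_PRICE = 8.00  # $ per 1M output tokens

def monthly_cost(input_m, output_m, cached_input_m=0.0, batch=False):
    """Dollar cost for a month of traffic, in millions of tokens.

    cached_input_m: input tokens served from the prompt cache,
    billed at 25% of the input price (a 75% discount).
    batch: route everything through the batch API (50% off).
    """
    fresh = input_m - cached_input_m
    cost = (fresh * INPUT_PRICE
            + cached_input_m * INPUT_PRICE * 0.25
            + output_m * OUTPUT_PRICE)
    return cost * 0.5 if batch else cost

# 100M input tokens (80M of them a repeated system prompt), 20M output:
naive = monthly_cost(100, 20)                               # $360
optimized = monthly_cost(100, 20, cached_input_m=80, batch=True)  # $120
print(f"naive: ${naive:.2f}, optimized: ${optimized:.2f}")
```

Two-thirds of the bill disappears before touching a single prompt, which is why headline per-token prices understate the gap between providers with and without these features.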

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | Long context | Hardest problems |
| GPT-4.1 | $2.00 | $8.00 | 1M | Production workhorse |
| GPT-4.1 mini | $0.40 | $1.60 | 1M | Balanced cost/quality |
| GPT-4.1 nano | $0.10 | $0.40 | 1M | Classification, routing, extraction |
| o3 | $2.00 | $8.00 | 200K | Complex multi-step reasoning |
| o4-mini | $1.10 | $4.40 | 200K | Best-value reasoning |
Choose this when

Ecosystem breadth matters most — widest library support, most documentation, strongest parallel function calling. The GPT-4.1 family is the best all-around production lineup.

Anthropic Claude — structured output and reasoning

Claude’s strength is reliability on structured tasks. When you need JSON output that actually validates, multi-step reasoning that shows its work, or code generation that sustains quality across long sessions, Claude is typically the best option. The structured outputs feature — strict JSON schema validation combined with tool use — produces the most consistent machine-readable output of any provider I have tested.
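Whichever provider you use, the pattern that makes structured output dependable is the same: parse, validate against the fields you actually need, and retry or escalate on failure. A minimal sketch — `call_llm` is a hypothetical stand-in for your client, and the required-field schema is invented for illustration:

```python
import json

# Minimal guard for machine-readable LLM output: parse, check the
# schema essentials, retry on failure. `call_llm` is a hypothetical
# stand-in for whatever provider client you use.

REQUIRED = {"title": str, "year": int, "themes": list}  # example schema

def parse_structured(raw: str) -> dict:
    """Return the validated dict, or raise ValueError."""
    data = json.loads(raw)  # raises on malformed JSON
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data

def extract_with_retry(call_llm, prompt: str, retries: int = 2) -> dict:
    for _ in range(retries + 1):
        try:
            return parse_structured(call_llm(prompt))
        except ValueError:  # json.JSONDecodeError subclasses ValueError
            continue
    raise RuntimeError("model never produced valid JSON")
```

The point of the guard is that a provider with stronger schema adherence simply trips the retry path less often — which is exactly where Claude's validation-rate advantage shows up as lower cost and latency.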

The current lineup: Claude Opus 4.6 at $5/$25 is the most capable model for mission-critical work — 67 percent cheaper than previous Opus pricing. Claude Sonnet 4.6 at $3/$15 is the balanced general-purpose option with a million-token context window. Claude Haiku 4.5 at $1/$5 is the speed-optimized choice for high-volume work.

Claude’s extended thinking mode lets you dynamically control how much reasoning the model applies — near-instant responses for simple tasks, deeper thinking for complex ones, all within the same model. For coding, Opus 4 was “the world’s best coding model” when it launched, and the 4.6 generation sustains that strength. On creative writing with constraints (following word counts, structural requirements), Claude significantly outperforms OpenAI — one independent test showed a 93.9 percent validation rate for Anthropic versus 77.8 percent for OpenAI.

The limitation for European teams: Claude API has EU data residency since August 2025, but Claude.ai and Claude Desktop process everything in the US. If your team uses Claude interactively for work involving personal data, this is a real GDPR friction point.

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Mission-critical, complex reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Balanced quality/price, coding |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | High-volume, speed-critical |
Choose this when

Structured output reliability is critical — JSON schema validation, multi-step tool use, and code generation are genuinely stronger. Best for AI pipelines that need machine-readable output you can trust.

Google Gemini — the multimodal and long-context leader

Gemini’s differentiator is that it was designed from the ground up as a multimodal model — not a text model with vision bolted on. Text, images, video, and audio are first-class inputs within a single unified architecture. If your product processes documents with layout structure, analyzes video, or needs to understand audio, Gemini handles these natively in ways the competition still approximates.

The other standout: context windows. Gemini 3 Pro handles 2 million tokens. Gemini 2.5 Pro handles 1 million-plus. These are the industry’s largest. For tasks like multi-document summarization, large codebase analysis, or processing entire book-length inputs, Gemini has no practical equivalent.

Price-wise, Gemini is the most generous. Gemini 2.5 Flash at $0.30/$2.50 is competitive with models that cost five to ten times more on many tasks. Flash-Lite at $0.10/$0.40 is among the cheapest capable models available from any provider. The free tier is generous enough that prototyping is essentially free. Google’s TPU infrastructure advantage is real — even OpenAI began leasing Google TPU capacity in mid-2025.

| Model | Input / 1M | Output / 1M | Context | Best for |
|---|---|---|---|---|
| Gemini 3 Pro | Premium | Premium | 2M | Frontier multimodal |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M+ | Long-context, reasoning |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | High-volume quality tasks |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget, fastest in lineup |
Choose this when

You need native multimodal understanding, massive context windows, or aggressive cost efficiency. The Flash models give you 80% of frontier quality at 10% of frontier price.

xAI Grok — the emerging challenger

Grok is the newest serious entrant. Grok 4 at $3/$15 offers frontier reasoning with native tool use and real-time search integration. The standout is Grok 4.1 Fast at $0.20/$0.50 with a 2 million token context window — potentially the best value proposition for long-context work in the entire market.

The trade-off: smaller ecosystem, less production battle-testing, fewer third-party integrations. Grok is worth evaluating for cost-sensitive long-context workloads, but I would not make it my primary provider for production systems without more track record.

Choose this when

You need long-context processing at rock-bottom prices. Grok 4.1 Fast’s 2M context at $0.20/$0.50 is hard to beat. Watch this space — but it is newer and less battle-tested.

The open-source frontier

The open-source story in 2026 is no longer “interesting for research, not ready for production.” Multiple open-weight models now match or exceed last year’s closed-source frontier on real-world benchmarks. The question has shifted from “can open-source compete?” to “when does the operational overhead of self-hosting justify the cost savings and control?”

DeepSeek — open-source, rock-bottom pricing

DeepSeek changed the LLM economics conversation. V3.2 at $0.28/$0.42 per million tokens is roughly one-tenth to one-twentieth of OpenAI for comparable quality, with a Speciale variant that rivals Gemini 3 Pro on reasoning benchmarks.

The architecture is clever: 685 billion total parameters organised as a Mixture of Experts (MoE) — a design in which only a fraction of the parameters activate per request, cutting inference cost. Fully open-source under an MIT license. The caveat: Chinese data processing. For European companies with GDPR obligations, this means either self-hosting on EU infrastructure or accepting the sovereignty risk of sending data to China.

Choose this when

Maximum cost efficiency matters and data sovereignty permits. If you self-host on EU infrastructure, you get MIT-licensed frontier capability at a fraction of API costs.

Meta Llama 4 — the community standard

Llama 4 brought two genuinely useful innovations: Mixture of Experts architecture for efficient inference, and an industry-leading 10 million token context window on the Scout model. Llama 4 Maverick exceeds GPT-4o on coding, reasoning, and multilingual benchmarks, though it falls short of the current top tier (Gemini 2.5 Pro, Claude Sonnet 4).

The real advantage: community. Llama is the most deployed open-weight model family. More fine-tuned variants, more deployment guides, more infrastructure tooling than any competitor. Available across every major cloud platform. Scout runs on a single H100 GPU.

Choose this when

You need self-hosted deployments where community support and deployment tooling matter. The 10M-token context on Scout is unmatched for open-weight long-context work.

Mistral — the European option

Mistral is the only major LLM provider that is a European company with EU-based infrastructure as the default. For European businesses where data sovereignty is non-negotiable, this matters more than benchmarks. Mistral Medium 3 at $0.40/$2.00 performs at roughly 90 percent of Claude Sonnet on benchmarks at significantly lower cost.

Mistral also offers enterprise fine-tuning, custom pre-training, and strong multilingual support (particularly French, German, Spanish, Italian). Their open-weight models (Apache 2.0) are self-hostable for maximum control. Devstral is positioned as the best open-source model for coding agents.

Choose this when

European data sovereignty is a requirement, not a preference. The only major provider where EU hosting is the default, not an add-on. Competitive quality at compelling prices.

Qwen and the Chinese open-source wave

Alibaba’s Qwen 3.5 and Zhipu’s GLM-4.7 lead a wave of Chinese open-source models that are genuinely world-class. Qwen 3.5 hits 88.4% on GPQA Diamond and 76.4% on SWE-bench Verified — frontier-tier results. The remarkable efficiency story: Qwen3-4B rivals the performance of Qwen2.5-72B, meaning each generation delivers comparable quality from a model a fraction of the size.

Same caveat as DeepSeek: Chinese origin means you either self-host on your own infrastructure or accept data sovereignty risk. But for teams that can self-host, these models offer strong capability per dollar.

For European teams

If GDPR compliance shapes your AI decisions, your shortlist looks different:

  • Mistral: EU company, EU servers by default, open-weight for self-hosting
  • OpenAI API: EU data residency since Feb 2025 (new projects only)
  • Claude API: EU data residency since Aug 2025 (API only — claude.ai remains US)
  • Self-hosted open-weight: Mistral, Llama, or DeepSeek on your own EU infrastructure

DeepSeek and Qwen APIs process data in China. Their models are MIT/Apache licensed for self-hosting, which solves the sovereignty problem if you run them yourself.

The pricing matrix

Pricing is where most teams make their first mistake. They pick one expensive model and run everything through it. The reality: a well-designed AI pipeline uses cheap models for simple tasks and reserves expensive models for hard ones. The difference between “$500/month” and “$5,000/month” is usually architecture, not model choice.
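The blended-cost arithmetic behind that claim is worth seeing once. This sketch uses output prices quoted in this article; the 70/20/10 traffic split is an assumed example, not a measurement:

```python
# Blended output-cost comparison: one premium model for everything
# vs routing by task tier. Prices ($ per 1M output tokens) are from
# this article; the traffic split is an assumed example.

PRICES = {"gpt-4.1-nano": 0.40, "gpt-4.1": 8.00, "claude-sonnet-4.6": 15.00}

def blended_price(split: dict) -> float:
    """Weighted $/1M output tokens for a traffic split {model: share}."""
    assert abs(sum(split.values()) - 1.0) < 1e-9, "shares must sum to 1"
    return sum(PRICES[m] * share for m, share in split.items())

all_premium = blended_price({"claude-sonnet-4.6": 1.0})   # $15.00 / 1M
routed = blended_price({"gpt-4.1-nano": 0.7,              # simple tasks
                        "gpt-4.1": 0.2,                   # standard tasks
                        "claude-sonnet-4.6": 0.1})        # hard tasks
print(f"routed: ${routed:.2f}/1M, saving {1 - routed / all_premium:.0%}")
```

Under these assumptions the routed pipeline lands around $3.38 per million output tokens — a roughly 77 percent saving over running everything through the premium model, which is why architecture dominates model choice.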

Which LLM is cheapest for production?

| Model | Input | Output | Context | Tier | Best for |
|---|---|---|---|---|---|
| GPT-4.1 nano | $0.10 | $0.40 | 1M | Budget | Classification, extraction, routing |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M | Budget | High-volume, low-latency |
| Grok 4.1 Fast | $0.20 | $0.50 | 2M | Budget | Long-context at lowest cost |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | Budget | Maximum cost efficiency |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M | Mid | High-volume quality tasks |
| Mistral Medium 3 | $0.40 | $2.00 | 131K | Mid | European hosting, balanced |
| GPT-4.1 mini | $0.40 | $1.60 | 1M | Mid | Balanced performance/cost |
| DeepSeek R1 | $0.55 | $2.19 | 128K | Mid | Budget reasoning |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Mid | Fast structured output |
| o4-mini | $1.10 | $4.40 | 200K | Mid | Best-value reasoning |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M+ | Premium | Multimodal, long context |
| GPT-4.1 | $2.00 | $8.00 | 1M | Premium | Production workhorse |
| o3 | $2.00 | $8.00 | 200K | Premium | Complex reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | Premium | Structured output, coding |
| Grok 4 | $3.00 | $15.00 | 256K | Premium | Frontier reasoning |
| GPT-5.4 | $2.50 | $15.00 | Long | Frontier | Hardest problems |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M | Frontier | Mission-critical |

Per-million-token pricing, March 2026

Budget tier ($0.10–$0.50 output) handles most simple tasks — classification, extraction, routing, summarization. GPT-4.1 nano and Gemini Flash-Lite are interchangeable here. DeepSeek V3.2 offers the best raw value if data sovereignty is not a concern. Most production pipelines should route 60 to 80 percent of their requests to this tier.

Mid tier ($1–$5 output) is where production workloads live. o4-mini is the best-value reasoning model. Claude Haiku is the speed-structured-output sweet spot. Gemini Flash offers the broadest capability at this price point. Mistral Medium 3 is the European-first choice.

Premium tier ($8–$25 output) is for complex reasoning, coding, and mission-critical tasks. GPT-4.1 and Claude Sonnet are the two production defaults. Reserve these for the 10 to 20 percent of requests that actually need frontier intelligence.

Builder’s note

These prices change quarterly. Every new model generation is cheaper than the last. When I built the AI enrichment pipeline for PAJ by Imparato, we designed the architecture to be provider-agnostic from day one — versioned prompts with per-version model and provider configuration. When prices drop or a better model appears, we swap the config. No code changes. This is not premature abstraction — it pays for itself the first time you need to swap.

The decision framework

There is no single “best LLM.” The right model depends on five axes: what kind of task, how fast it needs to respond, what you can spend, where your data is allowed to go, and how you want to deploy. Here is the framework I use with clients.

What does your pipeline need most?

The first axis is the task itself. In short: structured extraction and complex reasoning point to Claude; multimodal inputs or very long context point to Gemini; ecosystem breadth and tooling point to OpenAI; EU data sovereignty points to Mistral or self-hosted open-weight models; pure cost efficiency points to the budget tier.

Additional decision axes

Latency requirements. Real-time chat under one second requires smaller models (Haiku, Flash, GPT-4.1 nano). Background processing should use the batch API for 50 percent savings.

Cost budget. Route 60–80 percent of requests to the budget tier (see pricing matrix above) and reserve premium models for the 10–20 percent that need frontier intelligence. Once monthly API spend grows into the thousands, evaluate self-hosting.
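The routing itself does not need to be clever to capture most of the savings. A minimal sketch of a task-tier router — the tier assignments and model names here are illustrative choices drawn from this article, not a recommendation engine:

```python
# Minimal task-tier router: send each request kind to the cheapest
# model that handles it. Assignments are illustrative examples.

ROUTES = {
    "classify": "gemini-2.5-flash-lite",   # budget tier
    "extract":  "gpt-4.1-nano",            # budget tier
    "generate": "gpt-4.1-mini",            # mid tier
    "reason":   "o4-mini",                 # reasoning model, mid tier
    "code":     "claude-sonnet-4.6",       # premium tier
}

def route(task_kind: str, needs_realtime: bool = False) -> str:
    model = ROUTES.get(task_kind, "gpt-4.1")  # sensible default
    if needs_realtime and model == "o4-mini":
        # reasoning models trade latency for accuracy; for sub-second
        # responses fall back to a fast standard model
        model = "gpt-4.1-mini"
    return model
```

Even a static lookup table like this, applied at the pipeline entry point, gets you most of the 50–70 percent cost reduction; dynamic complexity scoring can come later.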

Should I self-host an LLM or use an API?

Self-hosting is not free — it shifts the bill from API fees to engineering salaries, infrastructure, and maintenance, and the total cost of ownership of a production-grade open-source deployment is often 5 to 10 times the raw API bill once you factor everything in. As a rough rule, start evaluating self-hosting above $1,000/month in API spend — at sufficient volume, payback can drop under three months — but only if you have the engineering capacity to maintain it.
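The break-even question is simple arithmetic once you write it down. All the numbers in this sketch are assumptions to plug your own figures into — they are not benchmarks:

```python
# Back-of-envelope self-hosting break-even. Every input is an
# assumption you should replace with your own numbers.

def payback_months(monthly_api_spend: float,
                   monthly_infra: float,     # GPU rental, storage, egress
                   monthly_eng_cost: float,  # maintenance time, priced
                   setup_cost: float):       # migration + initial tuning
    """Months to recoup setup cost; None if self-hosting never saves."""
    monthly_saving = monthly_api_spend - (monthly_infra + monthly_eng_cost)
    if monthly_saving <= 0:
        return None  # ongoing costs already exceed the API bill
    return setup_cost / monthly_saving

# Example: $5k/month API bill, $1.2k infra, $1.8k engineering, $4k setup
print(payback_months(5000, 1200, 1800, 4000))   # 2.0 months
# Example: $800/month API bill never pays back at these overheads
print(payback_months(800, 1200, 1800, 4000))    # None
```

The second example is the common failure mode: below a certain spend, the engineering overhead alone exceeds the API bill, and self-hosting is a net loss no matter how cheap the model weights are.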

Reasoning models — when thinking time pays for itself

Reasoning models dynamically allocate computational resources during inference. They “think” before responding, trading latency for accuracy. The trade-off shifts: instead of “bigger model = better,” it becomes “more thinking time = better, on problems that reward it.”

| Benchmark | DeepSeek R1 | o3 | o4-mini | Gemini 3 Pro |
|---|---|---|---|---|
| AIME 2024 | 79.8% | 83.3% | ~77% | 81.5% |
| GPQA Diamond | 71.5% | 87.7% | — | 84.2% |
| ARC-AGI-2 | 41.2% | 52.8% | — | 56.4% |
| SWE-bench Verified | — | 69.1% | 68.1% | — |
| Output cost / 1M | $2.19 | $8.00 | $4.40 | $5–$10 |
| Open source | Yes (MIT) | No | No | No |

Use reasoning models for: complex multi-step logic, math, science, code debugging, contract analysis, financial model validation — anywhere the cost of errors is high and thinking time improves accuracy.

Use standard models for: classification, extraction, summarization, content generation, chat — where speed matters more than depth.

The key insight is that o4-mini at $1.10/$4.40 is both smarter and cheaper than the previous generation’s o1 at $15/$60. The frontier keeps moving downward in price.

European considerations — data sovereignty and the AI Act

EU AI Act timeline

| Date | What happens |
|---|---|
| Feb 2025 | Prohibited AI practices and literacy obligations in effect |
| Aug 2025 | GPAI governance rules (including LLMs) applicable |
| Aug 2026 | Full application for most operators including high-risk systems |
| Aug 2027 | Extended transition for high-risk systems in regulated products |

Which LLM provider has EU data residency?

| Provider | EU API processing | EU data at rest | Self-hostable | EU company |
|---|---|---|---|---|
| Mistral | ✓ default | ✓ | ✓ | ✓ France |
| OpenAI | ✓ since Feb 2025 (new projects) | ✓ Enterprise/Edu | ✗ | ✗ US |
| Anthropic | ✓ since Aug 2025 (API only) | ✗ (claude.ai = US) | ✗ | ✗ US |
| Google | ✓ Vertex AI regional | — | ✗ | ✗ US |
| DeepSeek | ✗ China | — | ✓ MIT | ✗ China |
| Llama (Meta) | Via cloud providers | Via cloud providers | ✓ | ✗ US |
| Qwen (Alibaba) | ✗ China | — | ✓ | ✗ China |

The US Cloud Act means American firms can be compelled to surrender data regardless of where it is stored. For maximum sovereignty, the practical answer is self-hosting open-weight models on EU infrastructure. Mistral is the only major provider where this is the default posture rather than an enterprise add-on.

For teams that need certainty — not just contractual assurance but actual technical guarantees — the practical option is self-hosted open-weight models (Mistral, Llama, or DeepSeek) on your own EU cloud infrastructure, with your own encryption keys and no third-party access to the model weights or your data.

Multi-provider strategy — how production teams actually do it

Most production setups work better with multiple providers. The idea is task-based routing: Claude for structured extraction and complex reasoning. Gemini Flash for high-volume processing where cost matters more than maximum quality. OpenAI for tasks where ecosystem integrations matter. Self-hosted models for privacy-critical operations or language-specific refinement.

LiteLLM makes multi-provider routing trivial — one unified interface across all providers. For enterprise scale, AI gateways (Bifrost, Cloudflare AI Gateway, Kong) add intelligent load balancing, automatic failover, and token-aware rate limiting.

The key design principle: build provider-agnostic from day one. Versioned prompts with per-version model and provider configuration. When a provider changes pricing, deprecates a model, or a competitor ships something better, you swap a config file — not rewrite code.
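To show what "swap a config file, not rewrite code" means concretely, here is a sketch of versioned prompts carrying their own model and provider configuration. The structure, task names, and prompt text are illustrative, not the actual PAJ implementation:

```python
# Sketch of provider-agnostic versioned prompt config: each prompt
# version pins its provider and model, so switching is a data change,
# not a code change. Names and templates are illustrative.

PROMPTS = {
    ("summarize_play", 3): {
        "provider": "mistral", "model": "mistral-medium-3",
        "template": "Summarize this play:\n{text}",
    },
    ("summarize_play", 4): {  # same task, routed to a cheaper provider
        "provider": "google", "model": "gemini-2.5-flash",
        "template": "Summarize this play:\n{text}",
    },
}

def render(task: str, version: int, **params) -> dict:
    """Resolve a versioned prompt into a ready-to-dispatch request."""
    cfg = PROMPTS[(task, version)]
    return {"provider": cfg["provider"],
            "model": cfg["model"],
            "prompt": cfg["template"].format(**params)}
```

Your dispatch layer reads `provider` and `model` from the returned dict and calls the matching client; bumping a task to a new version is an entry in `PROMPTS`, reviewable like any other config change.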

The practitioner angle — what I use and why

What we chose for PAJ by Imparato — and why

PAJ is an AI-powered theater play discovery platform — 559 plays enriched with 25,000 to 40,000 AI-generated data points. Built on a French Ministry of Culture grant. The LLM decisions had real budget and compliance consequences.

We chose Mistral as the primary LLM for the enrichment pipeline. The reasons: EU data hosting (non-negotiable for a government-funded cultural project), strong French language support (analyzing theatrical texts in French), competitive pricing, and reliable API. For daily title normalisation — a lightweight, repetitive task — we route to Gemini via an n8n workflow because it is the cheapest option that handles the task reliably.

The architecture is provider-agnostic by design. Our custom synthetics framework uses versioned prompts with ERB templates and JSON Schema output validation. Each prompt version stores its model, provider, and cost configuration. Swapping from Mistral to Claude for a specific enrichment type requires changing one config line.

The cost optimisation that mattered most was architectural, not model selection. We send the full play text to the LLM once, compress it into a structured interaction matrix, and reuse that matrix for seven-plus downstream enrichments. This preprocessing pattern saves roughly six redundant full-text LLM calls per play. At 559 plays, that is thousands of saved API calls — a cost reduction that no model switch could match.
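The input-token arithmetic behind the compress-once pattern is easy to check. The token counts here are assumed round numbers for illustration — only the play count (559) and enrichment count (7) come from the project description above:

```python
# Token arithmetic for the compress-once pattern: one full-text call
# produces a compact matrix that downstream enrichments reuse, instead
# of each enrichment re-reading the full text. Token sizes are assumed.

def input_tokens(items: int, enrichments: int,
                 full_text_tokens: int, matrix_tokens: int,
                 compress: bool = True) -> int:
    if not compress:
        # naive: every enrichment re-sends the full text
        return items * enrichments * full_text_tokens
    # one full-text pass per item, then enrichments read only the matrix
    return items * (full_text_tokens + enrichments * matrix_tokens)

# 559 plays, 7 enrichments, ~30k-token plays, ~1.5k-token matrices:
naive = input_tokens(559, 7, 30_000, 1_500, compress=False)
smart = input_tokens(559, 7, 30_000, 1_500)
print(naive, smart, f"{1 - smart / naive:.0%} fewer input tokens")
```

Under these assumptions the pattern cuts input tokens by roughly 80 percent — a reduction that applies at any provider's prices, which is why it dominates model selection as a cost lever.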

What I reach for on new projects

  • Structured output and complex reasoning — my default is Claude. The schema validation is genuinely more reliable.
  • Ecosystem breadth and general-purpose tasks — OpenAI. Widest library support, most community documentation.
  • Cost-sensitive high-volume work — Gemini Flash. Best capability-to-cost ratio for batch processing.
  • European data sovereignty requirements — Mistral first, then self-hosted open-weight models.

What works best for startups: start with one provider’s mid-tier model. Add a second provider when you have a specific task that the first handles poorly or expensively. Add model routing when your monthly spend crosses roughly $500. Before that, the engineering complexity of multi-model is not worth the cost savings.

Building something with AI?

I help founders and CTOs architect AI-powered products — from choosing the right model to building the pipeline that keeps it running in production. A 30-minute conversation can save months of wrong turns.

Frequently asked questions

Which LLM provider is best for European startups with GDPR requirements?

Mistral is the safest default — it is a French company with EU servers by default. Claude offers EU data residency since August 2025, and OpenAI since February 2025 for new projects. For maximum control, self-host an open model like DeepSeek V3 or Llama on EU infrastructure.

How much does it cost to run an LLM in production?

Costs vary 100x depending on model choice. Budget models (Gemini Flash-Lite, GPT-4.1 nano) cost $0.10–0.40 per million tokens. Mid-tier models (Claude Sonnet, GPT-4.1) cost $2–15 per million tokens. Premium models (Claude Opus, o3) cost $5–25. Most teams overspend by using one expensive model for everything instead of routing by task complexity.

Should I self-host an open-source LLM or use an API?

Use APIs until your monthly spend exceeds $1,000/month and you have engineering capacity to manage infrastructure. Above that threshold, self-hosting models like DeepSeek V3 or Llama typically pays back in under three months. Self-hosting also solves data sovereignty concerns completely.

What is the best LLM for structured JSON output?

Claude (Sonnet and Opus) consistently leads on structured output reliability — fewer malformed responses, better schema adherence. OpenAI’s structured output mode is also strong. For budget tasks, Gemini Flash with a strict schema prompt works well at a fraction of the cost.

Do I need a reasoning model like o4-mini or can I use a standard LLM?

Standard models handle 90% of production tasks — classification, extraction, summarization, generation. Use reasoning models only for multi-step logic, mathematical computation, or complex planning where you need the model to show its work. Reasoning models are 3–5x more expensive and significantly slower.

How do I reduce LLM costs without losing quality?

Route each request to the cheapest model that can handle it — use a budget model for classification and extraction, a mid-tier model for generation, and premium only for complex reasoning. Use batch processing for non-real-time tasks (50% discount on most providers). Cache frequent queries. This strategy cuts costs 50–70% with no quality loss on most workloads.

