Key Takeaways
- One Model, Four Capabilities. Mistral Small 4 combines what were previously four separate Mistral products — Mistral Small (instruct), Magistral (reasoning), Pixtral (vision), and Devstral (coding) — into a single 119B-parameter Mixture-of-Experts model.
- Apache 2.0, No Strings. The most permissive open-source licence in widespread use for AI models. Unlike Meta’s LLaMA licence, Apache 2.0 places no restrictions on commercial use regardless of scale.
- 6.5B Active Parameters Per Token. The MoE architecture activates only 6.5 billion of the 119 billion parameters for each token it processes, meaning frontier-class knowledge at small-model compute cost.
- Not Yet on Ollama. llama.cpp and Ollama support were not finalised at launch. The recommended paths are the Mistral API ($0.15/1M input tokens) or self-hosted vLLM.
Introduction: One Endpoint to Replace Four
Mistral AI launched Mistral Small 4 on March 16, 2026, and it represents a meaningful architectural shift in how European open-source AI is packaged.
Until now, developers working with Mistral’s model portfolio faced a familiar enterprise AI problem: different tasks required different models. You routed general chat to Mistral Small, hard reasoning to Magistral, image analysis to Pixtral, and code generation to Devstral. Four models meant four API endpoints, four deployment targets, four sets of infrastructure, and four cost lines to manage.
Mistral Small 4 eliminates this entirely. A single model. A single endpoint. One deployment. The reasoning_effort parameter lets you dial the model’s behaviour per request — set it to none for fast chat-style responses, low for balanced reasoning, high for deep multi-step analysis. The model adapts accordingly.
Direct Answer: What is Mistral Small 4? Mistral Small 4 is a 119-billion-parameter Mixture-of-Experts language model released by Mistral AI on March 16, 2026, under the Apache 2.0 open-source licence. It combines reasoning, multimodal (text and image) input, and agentic coding capabilities into a single model. It activates only 6.5 billion parameters per token via a 128-expert MoE architecture, providing frontier-class reasoning at a fraction of the inference cost of a dense 119B model. It has a 256,000-token context window and is available via the Mistral API at $0.15/1M input tokens, on NVIDIA NIM, and for self-hosting via vLLM.
The Architecture: 128 Experts, 4 Active Per Token
Mixture-of-Experts is the architecture that makes Mistral Small 4’s numbers possible. The model contains 128 specialised neural networks — “experts” — each trained to handle different kinds of inputs. When a token arrives, a learned routing layer picks the 4 most relevant experts for that specific token and activates only those.
The result: the model contains the knowledge of 119 billion parameters, but computes like a 6.5 billion parameter model per token. For a user, this means:
- Speed approaching a small model
- Quality approaching a large model
- Infrastructure cost between the two
Mistral claims 40% lower end-to-end latency and 3× higher throughput compared to Mistral Small 3. Independent benchmarks show it generating 142 tokens per second via the Mistral API — more than twice the median for comparable open-weight models.
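To make the routing concrete, here is a minimal sketch of top-4-of-128 routing in plain Python with NumPy. It illustrates the mechanism only; the layer sizes are made up and this is not Mistral’s implementation:

import numpy as np

def moe_layer(token_hidden, router_weights, experts, top_k=4):
    """Route one token's hidden state to its top-k experts and mix their outputs."""
    logits = token_hidden @ router_weights        # one routing score per expert
    top_idx = np.argsort(logits)[-top_k:]         # indices of the top_k highest-scoring experts
    gate = np.exp(logits[top_idx] - logits[top_idx].max())
    gate /= gate.sum()                            # softmax over the selected experts only
    # Only the chosen experts run; the remaining experts do no work for this token
    return sum(g * experts[i](token_hidden) for g, i in zip(gate, top_idx))

# Illustrative sizes only: 128 tiny "experts", each a simple linear map
rng = np.random.default_rng(0)
hidden_dim, num_experts = 64, 128
experts = [
    (lambda W: (lambda x: x @ W))(rng.standard_normal((hidden_dim, hidden_dim)) * 0.01)
    for _ in range(num_experts)
]
router_weights = rng.standard_normal((hidden_dim, num_experts)) * 0.01
output = moe_layer(rng.standard_normal(hidden_dim), router_weights, experts)
print(output.shape)  # (64,) -- same hidden size, but only 4 of 128 experts were evaluated

The gating weights decide how much each selected expert contributes to the token’s output; in a full MoE model this routing happens at every MoE layer, for every token.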
What’s Actually New vs Small 3
| Feature | Mistral Small 3 | Mistral Small 4 |
|---|---|---|
| Parameters (total) | ~24B | 119B |
| Parameters (active) | ~24B | ~6.5B per token |
| Architecture | Dense | MoE (128 experts, 4 active) |
| Multimodal | Via separate Pixtral | Native (text + images) |
| Reasoning | Via separate Magistral | Built-in, configurable |
| Code/agents | Via separate Devstral | Built-in |
| Context window | 128k | 256k |
| Licence | Apache 2.0 | Apache 2.0 |
| Throughput | Baseline | +3× |
| Latency | Baseline | −40% |
The 256k context window is the practical standout for enterprise use. It directly reduces the need for aggressive chunking, retrieval orchestration, and context pruning in tasks like long-document analysis, codebase exploration, and multi-file agentic workflows.
The Sovereign Case: Apache 2.0 Matters
Apache 2.0 is the most permissive open-source AI licence in widespread use. For sovereignty-focused users and organisations, the distinction from Meta’s LLaMA licence matters:
Meta LLaMA licence: Commercial use is permitted, but companies with more than 700 million monthly active users must obtain a separate licence from Meta. It also carries attribution requirements (for example, displaying “Built with Llama”) and naming requirements for derivative models.
Apache 2.0: No restrictions on commercial scale. No user-count ceiling. Modifications can be kept private. Attribution obligations are limited to retaining the licence text and existing notices when you redistribute. You can fine-tune, modify, and deploy without reporting to Mistral.
For an EU company self-hosting Mistral Small 4 on European infrastructure, this is genuinely sovereign AI: open weights, open licence, no cloud dependency, no data leaving your jurisdiction.
How to Access and Deploy
Option 1: Mistral API (Easiest, Cheapest at Low Volume)
from mistralai import Mistral
client = Mistral(api_key="your_api_key")
# Standard inference (fast, no reasoning)
response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Analyse this contract clause..."}],
)

# With configurable reasoning (slower, deeper)
response = client.chat.complete(
    model="mistral-small-latest",
    messages=[{"role": "user", "content": "Solve this multi-step problem..."}],
    extra_body={"reasoning_effort": "high"},
)
Pricing: $0.15/1M input tokens, $0.60/1M output tokens. This is 5× cheaper than GPT-5.4 Mini on input and 7.5× cheaper on output. For teams currently paying frontier API costs, this is a meaningful reduction.
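As a rough illustration of what that pricing means at volume (the token counts below are hypothetical, not Mistral figures):

# Back-of-envelope monthly cost at the quoted Mistral API rates
INPUT_PER_M = 0.15   # USD per 1M input tokens
OUTPUT_PER_M = 0.60  # USD per 1M output tokens

# Hypothetical workload: 500M input tokens and 100M output tokens per month
input_tokens = 500_000_000
output_tokens = 100_000_000

monthly_cost = (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M
print(f"Estimated monthly cost: ${monthly_cost:,.2f}")  # -> $135.00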
Option 2: NVIDIA NIM (Self-Host with NVIDIA Infrastructure)
Available day-0 via NVIDIA’s NIM containers, with free prototype access at build.nvidia.com. Optimised checkpoints for H100, H200, and B200 GPUs in NVFP4 format.
# Pull and run via NVIDIA NIM
docker pull nvcr.io/nim/mistralai/mistral-small-4:latest
docker run --gpus all -p 8000:8000 nvcr.io/nim/mistralai/mistral-small-4:latest
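Once the container is running, it should expose an OpenAI-compatible API on the mapped port, as NIM LLM containers typically do. The /v1 path and the model id in the sketch below are assumptions; check the container’s /v1/models listing (or its documentation) for the exact name.

from openai import OpenAI

# Point the standard OpenAI client at the local NIM container
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# print(client.models.list())  # confirm the exact model id served by the container
response = client.chat.completions.create(
    model="mistralai/mistral-small-4",  # assumed model id, verify against /v1/models
    messages=[{"role": "user", "content": "Summarise this architecture decision record..."}],
)
print(response.choices[0].message.content)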
Option 3: Self-Hosted vLLM (Maximum Sovereignty)
For GDPR compliance, data sovereignty requirements, or any case where data must not leave your infrastructure, vLLM is the recommended self-hosting path.
Infrastructure requirement: Minimum 4× NVIDIA H100 GPUs (an HGX-class node). The full model in BF16 is approximately 242GB of weights; this is not consumer hardware. For most organisations, the Mistral API is more practical until TurboQuant-style compression makes large MoE models feasible on smaller rigs.
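A quick back-of-envelope check on that sizing, assuming BF16 weights at 2 bytes per parameter and 80GB of memory per H100:

# Why roughly 4× H100: BF16 weight footprint vs per-GPU memory
total_params = 119e9        # all 119B parameters stay resident in memory, even though only ~6.5B are active per token
bytes_per_param = 2         # BF16
weight_gb = total_params * bytes_per_param / 1e9    # ~238 GB of weights

h100_memory_gb = 80
gpus_for_weights = weight_gb / h100_memory_gb        # ~3.0 GPUs for the weights alone
# A 4th GPU leaves headroom for KV cache, activations, and the serving runtime.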
# Self-host via vLLM (requires 4× H100 minimum)
pip install vllm

# Replace the placeholder with the actual Mistral Small 4 checkpoint id from Hugging Face
vllm serve mistralai/<mistral-small-4-checkpoint> \
    --tensor-parallel-size 4 \
    --max-model-len 65536   # raise toward the full 262144-token window if memory allows
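For batch jobs that do not need an HTTP server, vLLM’s offline Python API can load the same checkpoint directly. This is a minimal sketch; the repository id is a placeholder, not a confirmed model name, and it needs the same 4-GPU node as the server above:

from vllm import LLM, SamplingParams

# Offline batch inference with the same tensor-parallel layout as the server above
llm = LLM(
    model="mistralai/<mistral-small-4-checkpoint>",  # placeholder repo id
    tensor_parallel_size=4,
    max_model_len=65536,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarise the key obligations in this contract: ..."], params)
print(outputs[0].outputs[0].text)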
Option 4: llama.cpp / Ollama (Not Ready at Launch)
A PR to add Mistral Small 4 support to llama.cpp is open at the time of writing. Once merged, Ollama will follow. This is the path that enables running it on consumer hardware with quantisation — but it is not available yet.
Watch: github.com/ggerganov/llama.cpp for the merge. Subscribe to the Sovereign Brief to be notified when it hits Ollama’s model library.
Honest Assessment: Strong But Not the Leader
Mistral’s internal benchmarks position Small 4 favourably. Independent testing tells a more nuanced story.
On coding and reasoning benchmarks, Mistral Small 4 matches or beats GPT-OSS 120B while producing significantly shorter outputs — a genuine efficiency win for production deployments where token costs scale.
However, CTOL’s independent evaluation showed Mistral Small 4 scoring below Qwen3.5 122B on several metrics. Performance at the full 256k context depth showed degradation in some tests. The model is strong — but it is not the unambiguous leader in its class.
For most enterprise use cases — document analysis, code generation, multi-step reasoning at scale — Mistral Small 4 is competitive and materially cheaper than frontier alternatives. For cutting-edge research requiring absolute top performance, Gemini 3.1 Pro or Claude Opus 4 retain advantages.
The honest sovereign recommendation: If your data must stay on European infrastructure and you need a capable, unrestricted open-weight model today, Mistral Small 4 is the best option available. If you can use the API and cost is the primary concern, it is also the cheapest multimodal model with configurable reasoning from a major provider.
The European AI Sovereignty Signal
Mistral Small 4 matters beyond its technical specifications. It is a signal about the state of European AI competitiveness in 2026.
When Mistral launched in 2023 with €105 million in funding, European AI was aspirational. Today, Mistral is shipping models that genuinely compete with US frontier lab offerings on a per-task, per-cost basis. The NVIDIA Nemotron Coalition membership announced alongside Small 4 brings Mistral into the same infrastructure optimisation networks as the dominant US labs.
The EU’s Cloud and AI Development Act, currently passing through the European Parliament, aims to mandate preference for EU-developed AI in public procurement. If that legislation passes in its current form, Mistral Small 4 — Apache 2.0, self-hostable, GDPR-compatible by architecture — is exactly the model that would benefit.
FAQ
Can I use Mistral Small 4 commercially without paying Mistral? Yes. The Apache 2.0 licence allows unrestricted commercial use, including self-hosting, fine-tuning for proprietary applications, and building commercial products. No royalties, no user-count ceiling, no reporting requirements.
Why is it called “Small” if it has 119B parameters? The name reflects active parameters per token (6.5B) rather than total parameters (119B). The MoE architecture means inference cost scales with active parameters, not total parameters. “Small” refers to inference cost, not model size.
When will Ollama support Mistral Small 4? A pull request is open on the llama.cpp repository. Once merged, Ollama will add support. Timeline is not confirmed — follow the llama.cpp releases on GitHub.
What hardware do I need to self-host it? Minimum: a server with 4× NVIDIA H100 GPUs (roughly $400,000 in hardware). This is data-centre territory, not consumer hardware. For most teams, the Mistral API or NVIDIA NIM cloud deployment is more practical until quantised versions are available.
How does the reasoning_effort parameter work? It is passed as a per-request parameter in the API call. "none" gives fast, low-reasoning responses comparable to Small 3.2. "low" adds light reasoning. "high" enables deep multi-step chain-of-thought reasoning. You pay for more output tokens when using higher reasoning effort.
Related Articles
- How to Run AI Locally With Ollama: Complete 2026 Guide
- TurboQuant Explained: Google’s Extreme Compression for Local AI
- Europe’s Digital Divorce: Why the EU Is Building Its Own Tech Stack
- ChatGPT vs Claude vs Gemini vs Local LLMs: 2026 Ranked
- Ollama Hits 52 Million Downloads: Local AI Is No Longer Niche