Local LLM with 1M Context: MiniMax M3 Goes Open Weight

minimax-m3 local-llm open-weight

For the past two years, the practical ceiling for a self-hosted language model has been around 128k tokens of context. That is long enough for most tasks, yet frustratingly short when you need to analyse a full contract archive, a year of customer support history, or an entire product codebase in a single pass. MiniMax M3, whose open weights were released on Hugging Face around June 13, 2026, raises that ceiling to 1 million tokens and makes the model available for local deployment for the first time.

What MiniMax M3 Is

M3 launched via API on June 1, 2026, with MiniMax committing to release open weights within ten days. Those weights are now available under MiniMaxAI/MiniMax-M3 on Hugging Face, along with GGUF quantizations via unsloth/MiniMax-M3-GGUF.

The model uses a Mixture-of-Experts (MoE) architecture. MiniMax's technical documentation describes approximately 428 billion total parameters with roughly 23 billion active per token. Because each token activates only a fraction of the experts, per-token compute stays closer to that of a 23B model than a 428B one.
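As a quick sanity check on those figures, the sketch below computes the share of parameters active for each token. The two constants are the approximate values quoted above; the function name is ours and purely illustrative.

```python
# Approximate figures quoted from MiniMax's documentation, in billions.
TOTAL_PARAMS_B = 428
ACTIVE_PARAMS_B = 23


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used to process a single token."""
    return active_b / total_b


fraction = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"Active per token: {fraction:.1%} of all parameters")
print(f"Total-to-active ratio: {TOTAL_PARAMS_B / ACTIVE_PARAMS_B:.1f}x")
```

The ratio describes compute only. All 428 billion parameters still have to be held in memory, which is why the hardware requirements further down are driven by the total count rather than the active one.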

The context window is 1 million tokens, powered by what MiniMax calls the MSA (MiniMax Sparse Attention) architecture. The model documentation states a guaranteed minimum of 512k tokens. The 1 million figure is roughly 8× the context of most models widely available through Ollama a year ago.

Native multimodality is built in from day one: M3 accepts image and video input alongside text. A single model call can handle a scanned invoice, a PDF with embedded diagrams, and a follow-up text question, with no separate vision model required.

On SWE-Bench Pro, MiniMax reports a score of 59%. That places M3 among the strongest open-weight coding models available in mid-2026, though practitioners across the community are still running independent validations of those figures.

API pricing, as listed on MiniMax's platform, is $0.60 per million input tokens and $2.40 per million output tokens. At that rate, a team making moderate use of the model (say, 200 requests per day averaging 10k tokens in and 2k out) would spend about $2.16 per day, or roughly $40–50 per month if requests run on working days only.
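The monthly estimate can be reproduced with a few lines of arithmetic. The sketch below uses the listed prices and the example workload from the paragraph above. The number of billable days per month is our assumption, and it matters: 22 working days gives a noticeably different total than 30 calendar days.

```python
# Listed API prices in USD per million tokens (from MiniMax's platform).
INPUT_PRICE_PER_M = 0.60
OUTPUT_PRICE_PER_M = 2.40


def monthly_api_cost(requests_per_day: int, input_tokens: int,
                     output_tokens: int, days_per_month: int) -> float:
    """Estimated monthly spend in USD for a fixed daily workload."""
    daily_input_m = requests_per_day * input_tokens / 1_000_000
    daily_output_m = requests_per_day * output_tokens / 1_000_000
    daily_cost = (daily_input_m * INPUT_PRICE_PER_M
                  + daily_output_m * OUTPUT_PRICE_PER_M)
    return daily_cost * days_per_month


# Example workload from the text: 200 requests/day, 10k tokens in, 2k out.
# The day counts (22 and 30) are assumptions, not figures from MiniMax.
for days in (22, 30):
    cost = monthly_api_cost(200, 10_000, 2_000, days)
    print(f"{days} days/month: ${cost:.2f}")
```

At 22 working days the example lands inside the $40–50 range. A team that runs the same load every calendar day should budget closer to $65.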

The Open-Weight Moment

The weight release matters for a specific reason that has little to do with benchmark leaderboards: when weights are public, no data needs to leave your infrastructure. Before June 13, using M3 meant sending prompts to MiniMax's API servers. Now, organisations with sufficient hardware can run the entire model inside their own network.

For European companies, this is not a small distinction. Under our reading of GDPR, sending prompts containing personal data (employee names, customer details, legal case content) to a third-party API creates a data-processing relationship that requires a Data Processing Agreement and, if the server is outside the EEA, potentially a transfer mechanism. Running local weights removes that third-party relationship entirely, although GDPR's other obligations for the data itself still apply.

MiniMax also offers M3 through Ollama's cloud layer (ollama run minimax-m3:cloud), where they state zero data retention on Ollama's infrastructure. That claim is relevant but insufficient for many regulated workloads: the servers are US-based, and US legal jurisdiction applies regardless of retention policies. Our data sovereignty guide covers this trade-off in detail.

Hardware: What Self-Hosting Actually Requires

MiniMax M3 is not a model you load on a developer workstation. Memory requirements, as reported by practitioners on the Hugging Face and NVIDIA developer forums:

  • FP16 (unquantized): approximately 931 GB VRAM
  • INT4 quantization: approximately 233 GB
  • UD-IQ1_M GGUF (most aggressive compression): approximately 128–133 GB RAM
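These figures can be cross-checked against a weight-only estimate: parameter count times bits per weight, divided by eight. The sketch below is a lower bound under that simple assumption, not MiniMax's own sizing method. It ignores the KV cache, quantization metadata, and runtime buffers, and the function name is ours.

```python
TOTAL_PARAMS_B = 428  # approximate total parameters in billions (from the article)

# Figures reported by practitioners, in GB (from the list above).
REPORTED_GB = {"FP16": 931, "INT4": 233}
BITS_PER_WEIGHT = {"FP16": 16, "INT4": 4}


def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Weight-only footprint in decimal GB: parameters x bits / 8."""
    return total_params_billion * bits_per_weight / 8


for name, bits in BITS_PER_WEIGHT.items():
    estimate = weight_memory_gb(TOTAL_PARAMS_B, bits)
    ratio = REPORTED_GB[name] / estimate
    print(f"{name}: weights alone ~{estimate:.0f} GB, "
          f"reported {REPORTED_GB[name]} GB ({ratio:.2f}x)")
```

The weights alone come to about 856 GB at FP16 and 214 GB at INT4, so both reported figures sit roughly 9% above the weight-only estimate. Runtime overhead is a plausible explanation for that gap, but the reports do not break it down.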

The UD-IQ1_M figure is the relevant one for Mac hardware buyers. A Mac Studio M3 Ultra with 192 GB of unified memory can load this variant via Ollama's llama.cpp layer. The model fits in memory, though quality is noticeably reduced compared to higher-precision variants. Community reports suggest 8–18 tokens/second at this quantization level on Apple Silicon hardware.

For better inference quality, the INT4 variant needs around 233 GB, which is beyond a single Mac Studio M3 Ultra but reachable with a multi-node exo cluster. An NVIDIA DGX Spark (128 GB of unified memory) sits at the boundary of the most-compressed variant; practitioners report it is feasible with careful memory management. Comfortable INT4 serving is documented for 4× H100 80 GB setups using vLLM with tensor parallelism.

The short version for most SMBs: self-hosting MiniMax M3 requires a high-spec Apple Silicon cluster, a DGX Spark, or a multi-GPU server, not the single-Mac-Studio setup that works for 70B models.
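The hardware discussion above boils down to comparing each variant's footprint against available memory. The sketch below does that with the figures quoted in this section, taking the upper end of the 128–133 GB range for UD-IQ1_M. It is deliberately naive: it leaves no headroom for the KV cache, which grows with context length, so a "fit" here is optimistic.

```python
# Memory needed per variant in GB, as reported in this section.
# UD-IQ1_M uses the upper end of the reported 128-133 GB range.
VARIANTS_GB = {"FP16": 931, "INT4": 233, "UD-IQ1_M": 133}

# Total memory per setup in GB, as described in this section.
HARDWARE_GB = {
    "Mac Studio M3 Ultra": 192,
    "DGX Spark": 128,
    "4x H100 80 GB": 4 * 80,
}


def variants_that_fit(available_gb: float) -> list[str]:
    """Names of the variants whose reported footprint fits in available_gb."""
    return [name for name, need in VARIANTS_GB.items() if need <= available_gb]


for hardware, memory in HARDWARE_GB.items():
    fitting = variants_that_fit(memory)
    print(f"{hardware} ({memory} GB): {', '.join(fitting) or 'none at these figures'}")
```

The DGX Spark shows up as fitting nothing because 133 GB exceeds 128 GB. At the lower end of the reported range the two numbers are equal, which matches the "at the boundary" description.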

When the API Wins

Despite the privacy argument, the API path is the right starting point for most organisations:

Validate before investing. Testing whether 1M context actually improves your legal review or codebase analysis workflow costs cents in API tokens. Hardware decisions should follow proven use cases, not precede them.

Moderate volume stays affordable. Below approximately €300–400 per month in token spend, the cloud API has better unit economics than a multi-GPU server once energy, maintenance, and capital cost are factored in.

Not all data is sensitive. Internal strategy documents, public regulatory texts, and open-source codebases can be processed via API without triggering GDPR obligations around personal data, as long as they contain none.

The decision point to move on-premise typically arrives when monthly API spend crosses €400–600, when the legal team confirms that prompt content is regularly personal data, or when latency from a cloud round-trip creates a user experience problem.
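Those three triggers can be written down as an explicit checklist. The sketch below is one way to do it; the class, field names, and the choice of €400 as the default threshold (the lower end of the range above) are our assumptions, not a formal decision model.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of one team's current API usage."""
    monthly_api_spend_eur: float
    prompts_contain_personal_data: bool
    cloud_latency_is_a_problem: bool


def on_premise_triggers(workload: Workload,
                        spend_threshold_eur: float = 400.0) -> list[str]:
    """Return the reasons, if any, to start evaluating on-premise hardware."""
    reasons = []
    if workload.monthly_api_spend_eur >= spend_threshold_eur:
        reasons.append("monthly API spend has reached the threshold")
    if workload.prompts_contain_personal_data:
        reasons.append("prompts regularly contain personal data")
    if workload.cloud_latency_is_a_problem:
        reasons.append("cloud round-trip latency hurts the user experience")
    return reasons


example = Workload(monthly_api_spend_eur=150.0,
                   prompts_contain_personal_data=True,
                   cloud_latency_is_a_problem=False)
print(on_premise_triggers(example) or "stay on the API for now")
```

An empty list means none of the triggers applies yet. Any single reason is enough to justify a closer look, which is why the function collects reasons rather than returning a yes or no.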

The Three Use Cases Worth Testing First

Full-Archive Document Analysis

The clearest argument for 1M context is the ability to eliminate chunking for large document corpora. A compliance team can load an entire regulatory filing history. A procurement department can analyse three years of supplier contracts in one session. Follow-up questions are answered against the whole corpus, with no retrieval step and no chunk-boundary artefacts.

Multimodal Document Intake

M3's vision capability means a single model handles scanned invoices, technical drawings, and handwritten annotations without preprocessing. For manufacturing or logistics companies dealing with mixed document types, this reduces pipeline complexity significantly.

Codebase-Wide Code Review

At 1M context, an entire mid-sized application (frontend, backend, tests, configuration) fits in one call. Code quality assessments, security reviews, and onboarding documentation can treat the codebase as a unified object rather than a series of file-by-file queries.
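Whether a given codebase actually fits is easy to estimate before sending anything. The sketch below walks a directory and converts character counts to tokens using a rough four-characters-per-token rule. That ratio is a common heuristic and an assumption on our part; it is not M3's real tokenizer, so treat the result as a ballpark. The file suffix list and the output reserve are also illustrative defaults.

```python
from pathlib import Path

CONTEXT_LIMIT_TOKENS = 1_000_000  # M3's stated context window
CHARS_PER_TOKEN = 4               # rough heuristic, NOT M3's actual tokenizer

# Illustrative set of source file types to include.
DEFAULT_SUFFIXES = (".py", ".ts", ".js", ".md", ".json", ".yaml")


def estimate_tokens(root: str, suffixes: tuple[str, ...] = DEFAULT_SUFFIXES) -> int:
    """Approximate token count of all matching text files under root."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total_chars += len(path.read_text(encoding="utf-8", errors="ignore"))
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(root: str, reserve_for_output: int = 8_000) -> bool:
    """True if the estimate plus a reserve for the answer stays under the limit."""
    return estimate_tokens(root) + reserve_for_output <= CONTEXT_LIMIT_TOKENS
```

Calling fits_in_context on a repository root gives a first answer in seconds. If the estimate lands anywhere near the limit, count with the model's real tokenizer before relying on it.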

A Starting Path

For most SMBs considering MiniMax M3:

  1. Start with ollama run minimax-m3:cloud to test your actual workflow: no hardware commitment, results within minutes
  2. Track monthly token spend over 30 days; if it approaches €400, run the TCO calculation for on-premise hardware
  3. For any workflow touching personal data under GDPR, plan for on-premise hardware from the start rather than as a later migration
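The TCO comparison in step 2 can start as simple arithmetic: amortise the hardware over its expected life, add monthly running costs, and compare the result with API spend. In the sketch below every input value is a hypothetical placeholder chosen for illustration, not a quote or a measured figure. Substitute your own hardware price, amortisation period, and energy and maintenance costs.

```python
def monthly_on_prem_cost(hardware_eur: float, amortization_months: int,
                         monthly_running_eur: float) -> float:
    """Straight-line hardware amortisation plus monthly running costs, in EUR."""
    return hardware_eur / amortization_months + monthly_running_eur


def cheaper_option(monthly_api_spend_eur: float, on_prem_monthly_eur: float) -> str:
    """Return 'on-premise' only when API spend exceeds the on-premise cost."""
    if monthly_api_spend_eur > on_prem_monthly_eur:
        return "on-premise"
    return "api"


# HYPOTHETICAL placeholder inputs, for illustration only.
on_prem = monthly_on_prem_cost(hardware_eur=18_000,
                               amortization_months=36,
                               monthly_running_eur=100)

for api_spend in (250, 400, 700):
    print(f"API spend EUR {api_spend}/month vs on-prem EUR {on_prem:.0f}/month: "
          f"{cheaper_option(api_spend, on_prem)}")
```

This leaves out staff time, financing costs, and the resale value of the hardware, all of which belong in a full calculation. It is still enough to show whether the two options are within reach of each other.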

Our local AI service page includes hardware sizing guidance for different model scales, including where M3 falls relative to Llama 4, Qwen 3, and DeepSeek V3.

If you want an independent assessment of whether MiniMax M3, or a smaller, more hardware-friendly model, fits your compliance requirements and budget, book a pilot scoping session with us.