Local AI Models on Your Own Hardware
Open-weights models like Qwen3.5 and Kimi 2.5 now run on hardware that fits under your desk. For businesses, that raises a concrete question: Is local inference a viable alternative or complement to the cloud?
What has changed
Just a year ago, local LLM inference was mostly frustrating. The models were noticeably worse than their commercial counterparts, the hardware expensive or loud, the setup cumbersome. Anyone serious about working with AI had no choice but to rely on OpenAI, Anthropic, or Google.
2026 looks different. Models like Qwen3.5-35B from Alibaba deliver results on many standard tasks that approach commercial cloud models. At the same time, dedicated inference devices like NVIDIA's DGX Spark or the Asus GX10 are available from around EUR 3,000. Small, quiet, with pre-installed Linux. Plug in, load a model, done.
Open-weights means: the model weights are freely available. No API key needed, no subscription, no third-party terms of service. The model runs on your own device. Data never leaves your own network.
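In practice, "runs on your own device" usually means a local runtime (llama.cpp's server, Ollama, or vLLM all work this way) exposing an OpenAI-compatible HTTP endpoint on your machine. A minimal sketch of such a call, assuming a server listening on localhost:8080 and a placeholder model name; both are assumptions, not fixed defaults:

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "qwen3.5-35b") -> dict:
    """Build an OpenAI-style chat completion payload.
    The model name is a placeholder; use whatever your local server has loaded."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def query_local(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """Send the request to a local OpenAI-compatible server and return the reply.
    No API key, no third-party terms: the request never leaves the machine."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same client code works against a cloud API by swapping the base URL, which is what makes the local/cloud split discussed below cheap to implement.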
Where local inference works
For a range of tasks, local inference is already practical for daily use:
Code analysis and refactoring
Source code reviews, refactoring suggestions, and documentation run reliably on local models. With proprietary code in particular, this is a clear advantage: nothing leaves your network.
Text work
Summaries, reviews, brainstorming, and drafts for internal documents. Many everyday tasks that previously went to ChatGPT or Claude can be handled locally.
Agentic workflows
Automated processes with clearly defined context deliver usable results. Particularly suited for recurring, structured tasks.
Everyday tasks
Email drafts, meeting summaries, research notes. The bulk of daily AI usage can be covered locally.
Where the cloud still leads
Local models hit their limits with highly complex reasoning tasks over long contexts. Multimodal applications (image analysis, video) are barely usable locally. And if you need the best available model for a specific task, you will still end up with the major cloud providers.
Claude, GPT, or Gemini in their strongest variants are still ahead of local alternatives for demanding tasks.
That is not a flaw. It describes the current state of the art. The question is not whether local models can fully replace the cloud. Rather, it is about what share of daily work can sensibly be handled locally.
The arguments for local inference
Data sovereignty
Every request to an external API transfers data to the provider. With coding assistants, that means your entire source code. With chat tools, the full conversation history. With agents, files and system contexts on top. Many users are not aware of how much data is actually transmitted. With local inference, the question does not arise.
Predictable costs
Cloud inference is billed per token. With heavy use, costs scale up. A local device has fixed acquisition costs and manageable operating costs. A GB10 device runs at about EUR 500 per year in electricity under full load. Significantly less in normal operation.
Availability
No rate limiting, no API outages, no unilateral price changes. Your model runs when you need it, as often as you need it.
GDPR compliance
Local processing eliminates third-country transfers and the complexity of data processing agreements (DPAs). This simplifies the data protection assessment considerably, especially for businesses in regulated industries.
Counterarguments and limitations
Local inference is not a turnkey solution. A few points that tend to get overlooked in the enthusiasm:
Maintenance and operations
A local device needs administration. OS updates, model changes, monitoring, network configuration. This requires expertise that not every company has in-house. Cloud APIs abstract away this complexity.
Model quality is a spectrum
That Qwen3.5-35B approaches commercial models on benchmarks does not mean the results are equivalent in every situation. In daily work with complex prompts or niche topics, the gaps can be more noticeable.
Pace of development
Local hardware is an investment in today's state of the art. Cloud providers continuously roll out new models without requiring you to swap hardware. In two years, the hardware may be outdated.
Scaling
A single device is enough for a team of two. Not for 50 concurrent users. Cluster solutions like exo exist but increase both cost and complexity significantly.
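The concurrency limit is simple arithmetic. A sketch, assuming a single device sustains roughly 30 tokens per second in aggregate for a 35B-class model; the figure is an assumption, and real throughput depends on model, quantisation, and batching:

```python
def tokens_per_user(aggregate_tps: float, concurrent_users: int) -> float:
    """Naive fair-share generation speed per user on one shared device."""
    return aggregate_tps / concurrent_users

# Assumed aggregate throughput of 30 tokens/s:
two_users = tokens_per_user(30, 2)     # 15 tokens/s each: comfortable
fifty_users = tokens_per_user(30, 50)  # 0.6 tokens/s each: unusable
```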
Geopolitical dimension
An argument gaining weight: the major AI providers are based in the US and China. Regulatory interventions, export restrictions, or political conflicts can affect the availability of AI services.
Anyone who builds their business processes on a single provider operating in a foreign jurisdiction takes on a risk that is not technical in nature.
The same logic cuts both ways: open-weights models currently come predominantly from China (Alibaba, Moonshot AI, MiniMax). Local inference reduces operational dependency, but the strategic dependency on the manufacturers of models and hardware (NVIDIA, Apple) remains.
Hardware overview
The market for local inference hardware is moving fast. An overview of the relevant options:
| Hardware | Price from | Memory | Suited for | Limitations |
|---|---|---|---|---|
| NVIDIA GB10 (DGX Spark, Asus GX10) | approx. EUR 3,000 | 128 GB | LLM inference, entry-level, teams | Linux knowledge helpful |
| Apple Mac (M-chip, 16-64 GB) | approx. EUR 1,500 | 16-64 GB | Smaller models up to 14B parameters | Limited to smaller models |
| Apple Mac Studio (256-512 GB) | approx. EUR 8,000 | 256-512 GB | Large models, high bandwidth | High price |
| AMD Strix Halo Mini-PCs | approx. EUR 2,000 | variable | Experimental, early adopters | No CUDA, immature ecosystem |
| Used RTX 3090 (2-3x) | approx. EUR 1,500 | 48-72 GB VRAM | Startups, Linux-experienced teams | Loud, power-hungry, high-maintenance |
| exo cluster | approx. EUR 15,000 | variable | Very large models, teams | High cost and complexity |
For getting started, NVIDIA GB10 devices are currently the most practical option: compact, quiet, optimised for LLM inference, and with a comparatively low entry barrier.
A sensible split
Local inference does not have to replace cloud AI. A pragmatic split based on data classification works better in practice than an either-or approach:
Process locally
HR data, contracts, customer data, internal strategy documents, proprietary code. Everything your company would not want in someone else's hands.
Keep in the cloud
Public research, marketing copy, generic code tasks without sensitive context. Tasks where the best available model makes the difference.
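This split can be enforced mechanically: tag each request with a data classification and route it to the matching endpoint. A minimal sketch; the category names and endpoint URLs are illustrative placeholders, not a fixed taxonomy:

```python
# Data classes the company decides must never leave its network (illustrative)
LOCAL_ONLY = {"hr", "contracts", "customer_data", "strategy", "proprietary_code"}

def choose_endpoint(data_category: str) -> str:
    """Route a request to local or cloud inference based on data classification.
    Everything sensitive stays on the local open-weights model; the rest may
    go to a cloud provider (placeholder URL)."""
    if data_category in LOCAL_ONLY:
        return "http://localhost:8080/v1"
    return "https://api.example-cloud.com/v1"
```

The value of this approach is that the policy lives in one place, with clear rules about which data goes where, rather than in each employee's judgment.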
European cloud alternatives
If you want cloud inference but prefer to avoid US providers: Nebius (data centres in France and Finland) or AKI.IO (German and European servers) offer open-weights models via API in full GDPR compliance.
Conclusion without euphoria
Local AI inference is practical in 2026. Not for everything, but for a relevant share of daily AI usage in businesses. The hardware is affordable, the models are good enough, and the arguments for data sovereignty and cost control are real.
At the same time, local inference is neither a turnkey solution nor a silver bullet. It requires technical expertise, ties up resources for operations and maintenance, and cannot match the best cloud models for complex tasks.
The strategically smart decision is probably not an either-or choice. Rather, it is a deliberate split: local inference for everyday work and sensitive data, cloud models for the heavy lifting. With clear rules about which data goes where.
If you are interested in getting started, do not wait for perfect hardware. The GB10 devices are a good starting point for gaining experience. The real work is not the setup. It is deciding which tasks and data should be processed locally going forward.
Frequently asked questions
What hardware do I need to get started?
For getting started, NVIDIA GB10 devices like the DGX Spark or Asus GX10 are available from around EUR 3,000. They offer 128 GB of memory and are optimised for LLM inference. Apple Macs with M-chips work for smaller models up to 14B parameters, while larger models require a Mac Studio with 256 or 512 GB of Unified Memory.
Are local models as good as cloud models?
For many standard tasks such as code analysis, text work, or brainstorming, open-weights models like Qwen3.5-35B deliver comparable results. However, for very complex reasoning, long contexts, or multimodal tasks, cloud models from OpenAI, Anthropic, and Google still lead the field.
What does local inference cost?
Electricity costs for a GB10 device run at about EUR 500 per year under full load, significantly less in normal operation. Add one-time acquisition costs starting from around EUR 3,000. Compared to cloud APIs that charge per token, costs are often lower with heavy use and, more importantly, much more predictable.
Does local inference simplify GDPR compliance?
Yes, local processing simplifies GDPR compliance considerably. It eliminates third-country transfers and the complexity of data processing agreements (DPAs). Data never leaves your own network, making the data protection assessment significantly easier.
Can local models fully replace the cloud?
No, a complete replacement is not practical at this point. The best strategy is a deliberate split: sensitive data such as HR records, contracts, or proprietary code is processed locally. For public research, marketing copy, or particularly demanding tasks, cloud models remain the better choice.
What are open-weights models?
Open-weights models are AI language models whose trained weights are freely available. You can download them and run them on your own hardware without an API key, subscription, or third-party terms of service. Well-known examples include Qwen3.5 from Alibaba, Kimi 2.5 from Moonshot AI, and Llama from Meta.