Local LLMs: What's Possible on Edge Hardware in 2026
The default assumption for running large language models is still "call an API." OpenAI, Anthropic, Google: pick your provider, send your tokens, get your response. For a lot of use cases, that's fine. But for defence, field operations, air-gapped networks, privacy-sensitive healthcare data, and any environment where internet connectivity is unreliable or forbidden, the cloud API model falls apart. The question becomes: what can you run locally in 2026?
The answer, as of early 2026, is "a lot more than you think, but less than the hype suggests." Let's get specific.
The Hardware Landscape
The meaningful options for local LLM inference fall into a few categories. NVIDIA's Jetson Orin series remains the workhorse for edge AI. The AGX Orin with 64GB unified memory can run quantised 13B-parameter models at usable speeds. Apple Silicon (M3 Pro/Max/Ultra and the newer M4 line) offers strong inference performance thanks to its unified memory and high memory bandwidth, with an M4 Max running quantised 30B models at around 20 tokens per second.
For more constrained deployments (ruggedised tablets, embedded systems, IoT gateways) the Intel Meteor Lake NPU and Qualcomm Cloud AI 100 provide AI acceleration in lower power envelopes. You're not running a 70B model on these, but 3B-7B models quantised to 4-bit work well for specific tasks.
The real breakthrough has been in quantisation. GGUF-format models run via llama.cpp, combined with Q4_K_M or Q5_K_M quantisation, deliver 90-95% of full-precision quality at a fraction of the memory footprint. A 70B-parameter model that needs roughly 140GB of VRAM in FP16 fits in about 40GB at Q4, making it runnable on hardware you can buy and deploy.
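To make that concrete, here is a minimal sketch of loading and prompting a Q4_K_M GGUF model with the llama-cpp-python bindings. The model file, context size, and prompt are placeholders, not recommendations.

# Minimal sketch: running a Q4_K_M GGUF model via llama-cpp-python.
# The model path and parameter values are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct.Q4_K_M.gguf",
    n_ctx=4096,       # context window; larger windows cost more memory
    n_gpu_layers=-1,  # offload all layers to the GPU/accelerator if one is present
)

out = llm(
    "Summarise the following report in three bullet points:\n<report text>",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])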
"The question isn't whether you can run an LLM on edge hardware. It's whether you can run the right LLM for your task on the hardware you're allowed to deploy."
Proven Use Cases
Document Processing and Classification
This is the sweet spot for local LLMs. Models like Mistral 7B and Llama 3.1 8B, fine-tuned for specific document types, run comfortably on a Jetson Orin or a modern laptop. We've deployed systems that classify incoming intelligence reports, extract key entities, and generate structured summaries, all running on hardware under a desk in a secure facility, with no data leaving the room.
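With Ollama as the runtime (the same runtime we use in the deployment architecture below), the classification call itself is only a few lines. A minimal sketch, with a hypothetical model name, label set, and prompt:

# Minimal sketch: classifying a report with a local model served by Ollama.
# The model name, labels, and prompt wording are hypothetical placeholders.
import ollama

LABELS = ["logistics", "sighting", "infrastructure", "other"]

def classify_report(text: str) -> str:
    response = ollama.chat(
        model="report-classifier-7b",  # a fine-tuned model registered with Ollama
        messages=[{
            "role": "user",
            "content": f"Classify this report as one of {LABELS}. "
                       f"Reply with the label only.\n\n{text}",
        }],
        options={"temperature": 0.0},
    )
    label = response["message"]["content"].strip().lower()
    return label if label in LABELS else "other"  # guard against off-list output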
The key is task-specific fine-tuning. A general-purpose 7B model is mediocre at most things. A 7B model fine-tuned with LoRA on 10,000 examples of your specific document type performs well at that one thing. We use Unsloth for efficient fine-tuning. It's fast enough that you can iterate on fine-tuning runs in hours, not days.
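The shape of an Unsloth LoRA run is roughly as follows. This is a sketch, not our training script: the base model, dataset path, and hyperparameters are placeholders, and the exact trainer arguments depend on the trl version installed.

# Rough sketch of a LoRA fine-tune with Unsloth. Names and hyperparameters
# are placeholders; exact SFTTrainer arguments vary between trl versions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-instruct-v0.3-bnb-4bit",  # 4-bit base model
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# One formatted prompt-plus-answer string per example, e.g. 10,000 labelled documents.
dataset = load_dataset("json", data_files="document_examples.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()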
Code Assistance in Air-Gapped Environments
Defence software development often happens on air-gapped networks where GitHub Copilot isn't an option. Local code models fill this gap. DeepSeek Coder V2 at 16B or Code Llama at 13B runs on a workstation with a single NVIDIA RTX 4090 and provides useful code completion and generation. The output is not GPT-4-level, but it's a massive productivity boost over no assistance at all.
We integrate these with VS Code using the Continue extension, which supports local model backends via Ollama. The developer experience is close to Copilot for common languages and frameworks.
Conversational Interfaces for Field Operations
Field teams need to query operational data without learning SQL or navigating complex GIS interfaces. A local LLM that converts natural language to structured queries against a local database is a capability multiplier. "Show me all sensor readings above threshold in the last 6 hours within 50km of checkpoint alpha" is a lot easier to type than the equivalent database query.
We implement this with a RAG (Retrieval-Augmented Generation) pattern: the LLM has access to a local vector database (ChromaDB or LanceDB, both of which run well on edge hardware) containing operational documentation, SOPs, and structured data schemas. The model doesn't need to know everything. It needs to know how to find things in the local knowledge base.
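Stripped down, the retrieval step looks something like the sketch below. The collection contents, schema text, and table names are hypothetical; the point is that the schema snippets the model needs are fetched from the local store and placed into the prompt.

# Minimal sketch of the retrieval side of the RAG pattern, using ChromaDB.
# Collection contents, schema text, and table names are hypothetical.
import chromadb

client = chromadb.PersistentClient(path="/data/kb")
collection = client.get_or_create_collection("ops_knowledge")

# Indexed once, offline: SOP excerpts, schema descriptions, and so on.
collection.add(
    ids=["schema-sensor-readings"],
    documents=["Table sensor_readings(sensor_id, value, recorded_at, lat, lon) ..."],
)

question = ("Show me all sensor readings above threshold in the last 6 hours "
            "within 50km of checkpoint alpha")

# Pull the most relevant schema/SOP snippets into the prompt.
hits = collection.query(query_texts=[question], n_results=3)
context = "\n".join(hits["documents"][0])

prompt = (
    f"Using only these schemas and procedures:\n{context}\n\n"
    f"Write a SQL query that answers: {question}"
)
# `prompt` then goes to the local model (via Ollama, as in the deployment
# section below), and the generated query is validated before it is executed.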
Current Limits
Some tasks don't work well locally yet. Complex multi-step reasoning tasks that GPT-4 or Claude handle well require 70B+ parameter models that need serious hardware: multiple GPUs or specialised inference accelerators. Open-ended creative generation at cloud-model quality isn't there yet at the 7-13B scale. And real-time conversational AI with the fluency of ChatGPT requires models that are still too large for most edge deployments.
Multimodal inference (vision + language) is improving fast but still demands more compute than pure text. LLaVA-based models at 7B can do basic image description and visual question answering on a Jetson Orin, but detailed image analysis or document understanding with complex layouts still needs larger models.
Deployment Architecture
Our standard edge LLM deployment uses Ollama as the model runtime (it handles model management, quantisation selection, and provides a clean API), behind a thin application layer that handles prompt templating, guardrails, and response validation. We containerise the whole stack with Docker and deploy via k3s for environments that need orchestration, or a simple systemd service for single-node deployments.
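In outline, that application layer is small. Here's a sketch of the templating-and-validation path for one task; the endpoint, model name, template, and validation rules are illustrative.

# Sketch of the thin layer in front of Ollama: prompt templating, a constrained
# output format, and response validation. Endpoint, model, and rules are illustrative.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

SUMMARY_TEMPLATE = (
    "You are a report summariser. Return JSON with the keys "
    "'summary' and 'entities'.\n\nReport:\n{report}"
)

def summarise(report: str) -> dict:
    resp = requests.post(OLLAMA_URL, json={
        "model": "llama3.1:8b",
        "prompt": SUMMARY_TEMPLATE.format(report=report),
        "format": "json",   # ask Ollama to constrain the output to valid JSON
        "stream": False,
    }, timeout=120)
    resp.raise_for_status()

    data = json.loads(resp.json()["response"])
    # Response validation: reject anything missing the expected fields.
    if not {"summary", "entities"} <= data.keys():
        raise ValueError("model response failed validation")
    return data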
The critical design decision is the model selection matrix. We don't run one model for everything. A 3B model handles fast classification tasks (sub-second latency). A 7-8B model handles document processing and summarisation. A 13B model handles complex query generation and analysis. The application routes requests to the appropriate model based on the task, optimising for the latency-quality tradeoff.
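A sketch of that routing logic, with hypothetical task names and model tags:

# Sketch of task-based model routing; task names and model tags are hypothetical.
MODEL_MATRIX = {
    "classify":  "classifier-3b",    # fast, sub-second classification
    "summarise": "docproc-8b",       # document processing and summarisation
    "query_gen": "analyst-13b",      # complex query generation and analysis
}

def pick_model(task: str) -> str:
    # Unrecognised tasks fall back to the mid-sized general model.
    return MODEL_MATRIX.get(task, "docproc-8b")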
Local LLMs fill the gap where cloud APIs aren't an option. The capability is real, it's growing fast, and organisations that start building expertise now will have a significant advantage as the models continue to improve. The hardware you can buy today will run the models that come out in six months even better. It's a capability investment that compounds.