Edge AI on Raspberry Pi: local LLMs with Ollama and llama.cpp

In 2024, quantised LLM inference became practical on the Raspberry Pi 5 and Jetson Orin Nano, thanks to llama.cpp and Ollama. Typical uses: offline voice assistants, NLP-equipped sensors, local classification, and document RAG.


The 2024 context

Between 2023 and 2024, aggressive quantisation of LLMs (Q4, Q3, Q2 in the GGUF format) and CPU/GPU optimisations in llama.cpp made 1-7B-parameter models runnable on unexpected hardware: the Raspberry Pi 5, the Jetson Orin Nano, and entry-level laptops. Ollama provides a user-friendly runtime that hides the complexity of model download, quantisation, and HTTP serving.
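Once `ollama serve` is running and a model has been pulled (e.g. `ollama pull llama3.2:1b`), any local process can talk to it over HTTP on port 11434. A minimal sketch in Python using only the standard library; the model name is an example and the request shape follows Ollama's `/api/generate` endpoint:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send a prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance with the model already pulled:
# print(generate("llama3.2:1b", "Reply with one word: what colour is the sky?"))
```

The same endpoint works identically on a Pi 5, a Jetson, or a laptop, which is what makes it easy to prototype on a workstation and deploy to the edge device unchanged.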

Raspberry Pi 5 + Ollama

Relevant specs of Raspberry Pi 5 (October 2023):

  • Cortex-A76 quad-core at 2.4 GHz (ARMv8.2)
  • 4 or 8 GB LPDDR4X RAM
  • NVMe PCIe available via HAT
  • Neon SIMD instruction support

Typical performance with Ollama + llama.cpp:

  • Llama 3.2 1B Q4: 15-20 tokens/s
  • Phi-3 mini 3.8B Q4: 5-8 tokens/s
  • Qwen 2.5 3B Q4: 4-6 tokens/s
  • Llama 3.1 8B Q4: 1-2 tokens/s (usability threshold)

1-3B models are usable in real time for simple tasks (classification, extraction, conversational routing); 7-8B models are slow but functional for batch processing.
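Throughput figures like those above can be measured from Ollama's own response metadata: a non-streaming `/api/generate` reply includes `eval_count` (tokens generated) and `eval_duration` (decode time in nanoseconds). A small helper, with illustrative numbers in the Llama 3.2 1B range:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Decode throughput from Ollama's eval_count / eval_duration fields."""
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative: 180 tokens generated in 10 s of decode time.
print(tokens_per_second(180, 10_000_000_000))  # 18.0
```

Note that `eval_duration` covers generation only; prompt processing is reported separately (`prompt_eval_duration`), which matters on the Pi where long prompts dominate latency.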

Jetson Orin Nano for more serious edge AI

The Jetson Orin Nano (2023) with 40 TOPS INT8 is a step up:

  • Llama 3.2 3B Q4 via CUDA: 30-40 tokens/s
  • Phi-3 mini 3.8B with TensorRT-LLM: 50+ tokens/s
  • Llama 3.1 8B Q4: 10-15 tokens/s usable for chat

Higher cost (~€500), but performance sufficient for commercial devices.

Practical use cases

Local edge AI on SBCs enables scenarios that weren’t practical before:

  • Offline voice assistants — speech-to-text (Whisper) + NLU (small LLM) + text-to-speech, no cloud
  • IoT with linguistic reasoning — sensors that explain readings in natural language
  • Local document classification — sorting emails, invoices, tickets without sending data out
  • Assistants in private contexts — medical, legal, financial firms where data can’t leave
  • Offline document RAG — conversational search over local archives (manuals, procedures, policies)
  • Tech support chatbots for physical products without connectivity
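For the local-classification case, small models work best when the output space is constrained to a fixed label set in the prompt. A hedged sketch of the prompt-building and reply-parsing side (the labels and wording are illustrative, not a tested recipe):

```python
LABELS = ["invoice", "support ticket", "newsletter", "other"]  # illustrative set

def build_classification_prompt(text: str, labels: list[str]) -> str:
    """Constrain a small local model to answer with exactly one known label."""
    options = ", ".join(labels)
    return (
        f"Classify the following document as one of: {options}.\n"
        f"Answer with the label only.\n\nDocument:\n{text}"
    )

def parse_label(reply: str, labels: list[str]) -> str:
    """Map the model's free-text reply back onto the closed label set."""
    reply = reply.strip().lower()
    for label in labels:
        if label in reply:
            return label
    return "other"

# parse_label(" Invoice.\n", LABELS) -> "invoice"
```

The parsing step is what makes a 1-3B model reliable enough for routing: even when the model adds punctuation or a sentence around the answer, the result still lands on a known label.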

For standard edge AI projects:

  • OS: Raspberry Pi OS 64-bit or Ubuntu 24.04 for Pi 5; JetPack for Jetson
  • LLM runtime: Ollama (simpler) or llama.cpp direct (more efficient)
  • Models: Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5 1.5B/3B, Gemma 2 2B — all open and quantisable
  • Embeddings: nomic-embed-text, bge-small (Ollama)
  • RAG: ChromaDB, LanceDB, SQLite-vec (lightweight)
  • Speech: whisper.cpp, Piper TTS
  • Orchestration: LangChain, LlamaIndex (somewhat heavy for the Pi)
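The retrieval core of an offline RAG pipeline is small enough to sketch without a vector database: embed chunks (e.g. with nomic-embed-text via Ollama), then rank by cosine similarity. A minimal pure-Python version with toy 2-D vectors standing in for real embeddings; in practice ChromaDB or SQLite-vec would replace the brute-force scan:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts whose embeddings are closest to the query."""
    scored = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in scored[:k]]

# Toy "embeddings" standing in for real model output:
chunks = [
    {"text": "reset procedure", "vec": [1.0, 0.0]},
    {"text": "warranty terms", "vec": [0.0, 1.0]},
    {"text": "factory reset steps", "vec": [0.9, 0.1]},
]
print(top_k([1.0, 0.0], chunks, k=2))
# → ['reset procedure', 'factory reset steps']
```

The retrieved chunks are then pasted into the prompt of the small local model; on a Pi 5, keeping k small matters because prompt processing dominates latency.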

noze support

noze uses the Raspberry Pi 5 and Jetson Orin Nano in R&D projects in digital health (offline voice assistants for nursing homes, NLP-equipped sensors for home care), industry (machine monitoring with linguistic diagnostics), and public administration (local chatbots for territorial services). The devices integrate with Admina for local model governance and with AIHealth / CyberScan for domain-specific needs.

For heavier workloads, the pattern is: Pi / Jetson as the local edge for simple and privacy-critical tasks, an NVIDIA GB10 workstation for mid-size models (30-70B), and a DGX server for training and heavy batch processing.


References: Raspberry Pi 5 (October 2023). Ollama. llama.cpp with CPU SIMD Neon backend. GGUF format (Q4/Q3/Q2 quantisation). Jetson Orin Nano (2023). Models: Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5, Gemma 2 2B. Runtime stack: Whisper.cpp, Piper TTS, ChromaDB.
