The 2024 context
Between 2023 and 2024, aggressive quantisation of LLMs (Q4, Q3, Q2 in GGUF format) and CPU/GPU optimisations in llama.cpp have made 1-7B parameter models runnable on unexpected hardware: the Raspberry Pi 5, the Jetson Orin Nano, entry-level laptops. Ollama provides a user-friendly runtime that hides the complexity of model download, quantisation and HTTP serving.
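As a sketch of what that HTTP serving looks like in practice: a minimal non-streaming call to Ollama's local REST endpoint (`/api/generate`, default port 11434). The model tag `llama3.2:1b` is only an example; any model already pulled locally works.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    # POST the prompt to the local Ollama server and return the completion text.
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # "llama3.2:1b" is an example tag; any model pulled with `ollama pull` works.
    print(generate("llama3.2:1b", "In one sentence, what is edge AI?"))
```

The same endpoint works unchanged on a Pi 5 or a Jetson: only the model tag and the resulting tokens/s differ.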
Raspberry Pi 5 + Ollama
Relevant specs of the Raspberry Pi 5 (October 2023):
- Cortex-A76 quad-core at 2.4 GHz (ARMv8.2)
- 4 or 8 GB LPDDR4X RAM
- NVMe PCIe available via HAT
- Support for Neon SIMD instructions
Typical performance with Ollama + llama.cpp:
- Llama 3.2 1B Q4: 15-20 tokens/s
- Phi-3 mini 3.8B Q4: 5-8 tokens/s
- Qwen 2.5 3B Q4: 4-6 tokens/s
- Llama 3.1 8B Q4: 1-2 tokens/s (usability threshold)
1-3B models are usable in real time for simple tasks (classification, extraction, conversational routing); 7-8B models are slow but workable for batch processing.
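Figures like these are easy to reproduce: every non-streaming Ollama response carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), so tokens/s falls out directly. A minimal sketch with illustrative numbers, not a real response:

```python
def tokens_per_second(resp: dict) -> float:
    # Ollama reports eval_count (tokens generated) and eval_duration
    # (nanoseconds spent generating) in each non-streaming response.
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative numbers in the range measured on a Pi 5 (not a real response):
sample = {"eval_count": 128, "eval_duration": 8_000_000_000}  # 128 tokens in 8 s
print(f"{tokens_per_second(sample):.1f} tokens/s")  # → 16.0 tokens/s
```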
Jetson Orin Nano for more serious edge AI
The Jetson Orin Nano (2023), with 40 TOPS of INT8 compute, is a step up:
- Llama 3.2 3B Q4 via CUDA: 30-40 tokens/s
- Phi-3 mini 3.8B with TensorRT-LLM: 50+ tokens/s
- Llama 3.1 8B Q4: 10-15 tokens/s, usable for chat
Higher cost (~€500), but performance sufficient for commercial devices.
Practical use cases
Local edge AI on SBCs enables scenarios that weren’t practical before:
- Offline voice assistants — speech-to-text (Whisper) + NLU (small LLM) + text-to-speech, no cloud
- IoT with linguistic reasoning — sensors that explain readings in natural language
- Local document classification — sorting emails, invoices, tickets without sending data out
- Assistants in private contexts — medical, legal, financial firms where data can’t leave
- Offline document RAG — conversational search over local archives (manuals, procedures, policies)
- Tech support chatbots for physical products without connectivity
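For a task like local document classification, a small on-device model needs little more than a constrained prompt and a tolerant parser. A minimal sketch — the labels and helper names are hypothetical, not from any specific project:

```python
LABELS = ["invoice", "support", "spam", "other"]  # hypothetical categories

def build_prompt(text: str) -> str:
    # Constrained prompt: small models stay reliable when the output
    # space is reduced to a fixed set of labels.
    return (
        "Classify the following message into exactly one of these labels: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nMessage:\n"
        + text
    )

def parse_label(raw: str) -> str:
    # Small models sometimes add punctuation or extra words, so map the
    # reply back onto a known label instead of trusting it verbatim.
    reply = raw.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "other"

# The prompt from build_prompt() would be sent to a local model
# (e.g. via Ollama's /api/generate) and the reply passed to parse_label().
```

Because both prompt and reply stay on the device, nothing leaves the machine — the property that makes the medical, legal and financial scenarios above viable.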
Recommended stack
For standard edge AI projects:
- OS: Raspberry Pi OS 64-bit or Ubuntu 24.04 for Pi 5; JetPack for Jetson
- LLM runtime: Ollama (simpler) or llama.cpp direct (more efficient)
- Models: Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5 1.5B/3B, Gemma 2 2B — all open and quantisable
- Embeddings: nomic-embed-text, bge-small (Ollama)
- RAG: ChromaDB, LanceDB, SQLite-vec (lightweight)
- Speech: whisper.cpp, Piper TTS
- Orchestration: LangChain, LlamaIndex (somewhat heavy for the Pi)
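To make the RAG pieces of the stack concrete: retrieval over a small local archive reduces to cosine similarity between embeddings (e.g. from nomic-embed-text via Ollama). A pure-Python sketch standing in for what ChromaDB or SQLite-vec do at scale:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: dict, k: int = 2) -> list[str]:
    # `index` maps chunk id -> embedding; on a Pi, a few thousand vectors
    # fit comfortably in a plain dict or an SQLite-vec table.
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The retrieved chunks are then pasted into the prompt of the small LLM — the whole loop (embed, retrieve, generate) runs offline on the SBC.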
noze support
noze uses the Raspberry Pi 5 and Jetson Orin Nano in R&D projects in digital health (offline voice assistants for nursing homes, NLP-equipped sensors for home care), industry (machine monitoring with linguistic diagnostics) and public administration (local chatbots for territorial services). The devices integrate with Admina for local model governance and with AIHealth / CyberScan for domain-specific needs.
For heavier workloads the pattern is tiered: Pi / Jetson as the local edge for simple, privacy-critical tasks; an NVIDIA GB10 workstation for mid-size models (30-70B); a DGX server for training and heavy batch jobs.
References: Raspberry Pi 5 (October 2023); Ollama; llama.cpp with Neon SIMD CPU backend; GGUF format (Q4/Q3/Q2 quantisation); Jetson Orin Nano (2023); models Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5, Gemma 2 2B; runtime stack whisper.cpp, Piper TTS, ChromaDB.
