The 2024 context
Between 2023 and 2024, aggressive quantisation of LLMs (Q4, Q3, Q2 in GGUF format) and CPU/GPU optimisations in llama.cpp have made 1-7B parameter models runnable on unexpected hardware: the Raspberry Pi 5, the Jetson Orin Nano, entry-level laptops. Ollama provides a user-friendly runtime that hides the complexity of model download, quantisation and HTTP serving.
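As a sketch of what that HTTP serving looks like in practice: a minimal non-streaming call to Ollama's local REST endpoint (`/api/generate`, default port 11434). The model tag `llama3.2:1b` is only an example; any model already pulled locally works.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    # POST the prompt to the local Ollama server and return the completion text.
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # "llama3.2:1b" is an example tag; any model pulled with `ollama pull` works.
    print(generate("llama3.2:1b", "In one sentence, what is edge AI?"))
```

The same endpoint works unchanged on a Pi 5 or a Jetson: only the model tag and the resulting tokens/s differ.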
Raspberry Pi 5 + Ollama
Relevant specs of the Raspberry Pi 5 (October 2023):
- Cortex-A76 quad-core at 2.4 GHz (ARMv8.2)
- 4 or 8 GB LPDDR4X RAM
- NVMe PCIe available via HAT
- Support for Neon SIMD instructions
Typical performance with Ollama + llama.cpp:
- Llama 3.2 1B Q4: 15-20 tokens/s
- Phi-3 mini 3.8B Q4: 5-8 tokens/s
- Qwen 2.5 3B Q4: 4-6 tokens/s
- Llama 3.1 8B Q4: 1-2 tokens/s (usability threshold)
1-3B models are usable in real time for simple tasks (classification, extraction, conversational routing); 7-8B models are slow but workable for batch processing.
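Figures like these are easy to reproduce: every non-streaming Ollama response carries `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating), so tokens/s falls out directly. A minimal sketch with illustrative numbers, not a real response:

```python
def tokens_per_second(resp: dict) -> float:
    # Ollama reports eval_count (tokens generated) and eval_duration
    # (nanoseconds spent generating) in each non-streaming response.
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Illustrative numbers in the range measured on a Pi 5 (not a real response):
sample = {"eval_count": 128, "eval_duration": 8_000_000_000}  # 128 tokens in 8 s
print(f"{tokens_per_second(sample):.1f} tokens/s")  # → 16.0 tokens/s
```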
Jetson Orin Nano for more serious edge AI
The Jetson Orin Nano (2023), with 40 TOPS of INT8 compute, is a step up:
- Llama 3.2 3B Q4 via CUDA: 30-40 tokens/s
- Phi-3 mini 3.8B with TensorRT-LLM: 50+ tokens/s
- Llama 3.1 8B Q4: 10-15 tokens/s, usable for chat
Higher cost (~€500), but performance sufficient for commercial devices.
Practical use cases
Local edge AI on SBCs enables scenarios that weren’t practical before:
- Offline voice assistants — speech-to-text (Whisper) + NLU (small LLM) + text-to-speech, no cloud
- IoT with linguistic reasoning — sensors that explain readings in natural language
- Local document classification — sorting emails, invoices, tickets without sending data out
- Assistants in private contexts — medical, legal, financial firms where data can’t leave
- Offline document RAG — conversational search over local archives (manuals, procedures, policies)
- Tech support chatbots for physical products without connectivity
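For a task like local document classification, a small on-device model needs little more than a constrained prompt and a tolerant parser. A minimal sketch — the labels and helper names are hypothetical, not from any specific project:

```python
LABELS = ["invoice", "support", "spam", "other"]  # hypothetical categories

def build_prompt(text: str) -> str:
    # Constrained prompt: small models stay reliable when the output
    # space is reduced to a fixed set of labels.
    return (
        "Classify the following message into exactly one of these labels: "
        + ", ".join(LABELS)
        + ".\nReply with the label only.\n\nMessage:\n"
        + text
    )

def parse_label(raw: str) -> str:
    # Small models sometimes add punctuation or extra words, so map the
    # reply back onto a known label instead of trusting it verbatim.
    reply = raw.strip().lower()
    for label in LABELS:
        if label in reply:
            return label
    return "other"

# The prompt from build_prompt() would be sent to a local model
# (e.g. via Ollama's /api/generate) and the reply passed to parse_label().
```

Because both prompt and reply stay on the device, nothing leaves the machine — the property that makes the medical, legal and financial scenarios above viable.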
Recommended stack
For standard edge AI projects:
- OS: Raspberry Pi OS 64-bit or Ubuntu 24.04 for Pi 5; JetPack for Jetson
- LLM runtime: Ollama (simpler) or llama.cpp direct (more efficient)
- Models: Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5 1.5B/3B, Gemma 2 2B — all open and quantisable
- Embeddings: nomic-embed-text, bge-small (Ollama)
- RAG: ChromaDB, LanceDB, SQLite-vec (lightweight)
- Speech: whisper.cpp, Piper TTS
- Orchestration: LangChain, LlamaIndex (somewhat heavy for the Pi)
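To make the RAG pieces of the stack concrete: retrieval over a small local archive reduces to cosine similarity between embeddings (e.g. from nomic-embed-text via Ollama). A pure-Python sketch standing in for what ChromaDB or SQLite-vec do at scale:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec: list[float], index: dict, k: int = 2) -> list[str]:
    # `index` maps chunk id -> embedding; on a Pi, a few thousand vectors
    # fit comfortably in a plain dict or an SQLite-vec table.
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

The retrieved chunks are then pasted into the prompt of the small LLM — the whole loop (embed, retrieve, generate) runs offline on the SBC.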
noze support
noze uses the Raspberry Pi 5 and Jetson Orin Nano in R&D projects in digital health (offline voice assistants for nursing homes, NLP-equipped sensors for home care), industry (machine monitoring with linguistic diagnostics) and public administration (local chatbots for territorial services). The devices integrate with Admina for local model governance and with AIHealth / CyberScan for domain-specific needs.
For heavier workloads the pattern is tiered: Pi / Jetson as the local edge for simple, privacy-critical tasks; an NVIDIA GB10 workstation for mid-size models (30-70B); a DGX server for training and heavy batch jobs.
References: Raspberry Pi 5 (October 2023); Ollama; llama.cpp with Neon SIMD CPU backend; GGUF format (Q4/Q3/Q2 quantisation); Jetson Orin Nano (2023); models Llama 3.2 1B/3B, Phi-3 mini, Qwen 2.5, Gemma 2 2B; runtime stack whisper.cpp, Piper TTS, ChromaDB.
