Clinical RAG with open source LLMs: on-premise architectures for diagnostic support

Retrieval-Augmented Generation applied to diagnostic support with open source LLMs (Llama, Mistral, Gemma, DeepSeek, BioMistral, Meditron), integration with FHIR and DICOM, on-premise deployment, EU AI Act and MDR compliance.

Digital Health · R&D · Open Source · AI · RAG · LLM · Llama · Mistral · BioMistral · Meditron · FHIR · DICOM

LLMs in medicine: from ChatGPT to open models

Since the release of ChatGPT (November 2022) and GPT-4 (March 2023), large language models have attracted intense attention in medicine, both for their promise (literature summarisation, clinical question answering, documentation support, patient assistance) and for their limits (hallucinations, absence of verifiable citations, data privacy). Healthcare use of proprietary, cloud-only LLMs is hindered by structural constraints:

  • Privacy: clinical data cannot leave the controller’s perimeter without rigorous legal bases (GDPR art. 9)
  • Knowledge update: a model with a 2023 training cutoff does not know 2024-2025 guidelines
  • Citability: a clinician needs to know from which source a recommendation comes
  • Auditability: every recommendation must be reconstructible and contestable
  • EU AI Act compliance (Regulation 2024/1689, in force since August 2024): AI-based medical devices are high-risk systems with specific obligations

The emerging response in 2024-2025 is a combination: open source LLMs (run on-premise on controlled infrastructure) + RAG (Retrieval-Augmented Generation) (grounding responses on verifiable sources, dynamically retrieved) + structured healthcare data (FHIR, DICOM, OMOP) as contextual input. As of July 2025 this stack is mature.

Available open source LLMs

The open-weights ecosystem as of 2025 offers credible choices:

Llama family (Meta)

  • Llama 3.1 (July 2024): 8B, 70B, 405B parameter variants — Llama 3.1 Community License (open-weights with some scale-use restrictions)
  • Llama 3.2 (September 2024): 1B, 3B, 11B, 90B models, with multi-modal variants
  • Llama 3.3 (December 2024): 70B version with optimised performance
  • Llama 4 (2025): variants with Mixture-of-Experts architectures, superior performance

Mistral AI (France)

  • Mistral 7B (September 2023): entry baseline
  • Mixtral 8x7B and Mixtral 8x22B (2023-2024): MoE
  • Mistral Small, Mistral Large (2024-2025): flagship models
  • Apache 2.0 licence for open-weights versions

Gemma (Google)

  • Gemma 2 (June 2024): 2B, 9B, 27B versions
  • Gemma 3 (2025): multi-modal variants, substantial improvements
  • Gemma terms of use licence (open-weights)

DeepSeek

  • DeepSeek-V3 (December 2024): 671B parameter MoE, ~37B active
  • DeepSeek-R1 (January 2025): step-by-step reasoning model, MIT-licensed public release
  • Reasoning quality comparable to top proprietary models

Qwen (Alibaba)

  • Qwen 2.5 (late 2024): 0.5B-72B with specialist variants (math, coder)
  • Apache 2.0

Specialised biomedical models

  • Meditron (EPFL, 2023): Llama-2 fine-tuned on PubMed + international clinical guidelines
  • Med42 (M42 AI, 2024): Llama-based, trained on clinical datasets
  • BioMistral (2024): Mistral fine-tuning on biomedical literature
  • MedGemma (2025): announced by Google, Gemma fine-tuning on clinical tasks

RAG: Retrieval-Augmented Generation

The RAG paradigm — formalised in 2020 by Lewis et al. (FAIR/Meta) in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — separates knowledge and reasoning:

  • An external knowledge base (documents, guidelines, literature, clinical records) is indexed in a vector database
  • On user question, a retriever fetches the most relevant documents (embedding-based semantic similarity)
  • The LLM generates the answer conditioned on retrieved documents, citing them explicitly
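The three steps above — index, retrieve, generate-with-citations — can be sketched end-to-end. This is a toy illustration with a hash-based bag-of-words embedding and an in-memory index (all document IDs and texts are hypothetical), standing in for a real embedding model and vector database:

```python
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashing embedding (stand-in for a real sentence-embedding model)."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # vectors are unit-normalised, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

# 1. Index a small knowledge base (guideline snippets with source IDs)
docs = [
    ("EAU-2024-p12", "first-line therapy for metastatic prostate cancer ..."),
    ("AIOM-2024-p7", "cardiovascular comorbidity requires dose adjustment ..."),
]
index = [(doc_id, text, embed(text)) for doc_id, text in docs]

# 2. Retrieve: rank indexed documents by similarity to the question
def retrieve(question: str, k: int = 2):
    q = embed(question)
    return sorted(index, key=lambda d: cosine(q, d[2]), reverse=True)[:k]

# 3. Generate: condition the LLM on retrieved passages, asking for citations
def build_prompt(question: str) -> str:
    context = "\n".join(f"[{d}] {t}" for d, t, _ in retrieve(question))
    return (f"Answer using ONLY the sources below, citing their IDs.\n"
            f"{context}\nQuestion: {question}\nAnswer:")

print(build_prompt("first-line therapy for metastatic prostate cancer"))
```

In production the `embed` function is replaced by a real model and the list by a vector database, but the control flow — retrieve, assemble a source-tagged context, prompt for cited answers — stays the same.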

For the clinical context, RAG offers fundamental advantages:

  • Update without retraining: adding documents to the knowledge base is immediate
  • Citability: every answer can be backed by identifiable sources
  • Hallucination reduction: the model is conditioned on real documents rather than pure parametric prior
  • Auditability: every step is logged and verifiable

Technical stack

A typical on-premise clinical RAG implementation uses open source components:

Local LLM inference

  • vLLM (UC Berkeley, 2023+) — high-throughput inference server with PagedAttention memory management
  • llama.cpp (Georgi Gerganov, 2023+) — C++ inference with aggressive quantisation (GGUF), runs on CPU and consumer GPUs
  • Ollama — user-friendly packaging of llama.cpp with REST API
  • Text Generation Inference (Hugging Face) — HF server with multi-GPU, tensor parallelism support
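Most of these servers (vLLM, Ollama, TGI) can expose an OpenAI-compatible `/v1/chat/completions` endpoint, which keeps application code server-agnostic. A minimal sketch of building such a request — the endpoint URL and model name are placeholders, not values from this article:

```python
import json
from urllib import request  # used in the commented-out send step below

def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.0) -> dict:
    """Payload for an OpenAI-compatible chat endpoint (vLLM, Ollama, TGI)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,  # deterministic output aids auditability
    }

payload = build_chat_request(
    model="llama-3.3-70b-instruct",  # placeholder model name
    system="Answer only from the provided sources, citing them.",
    user="What is the first-line therapy for metastatic prostate cancer?",
)

# Sending it to a local server (not executed here; URL is an assumption):
# req = request.Request("http://localhost:8000/v1/chat/completions",
#                       data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# resp = json.load(request.urlopen(req))

print(json.dumps(payload, indent=2))
```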

Embedding

  • sentence-transformers — Python library with embedding models
  • BGE (Beijing Academy of AI), E5 (Microsoft), Jina Embeddings — competitive open source embeddings
  • BioLORD, MedEmbed, SapBERT — specialist biomedical embeddings
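Before embedding, source documents are normally split into overlapping chunks so that each vector covers one retrievable unit. A minimal sketch (chunk sizes here are illustrative, not recommendations):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into overlapping character windows for embedding.
    Overlap reduces the chance a relevant sentence is cut in half."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

guideline = "Section 4.2 Treatment. " * 100  # stand-in for a long document
chunks = chunk_text(guideline)
print(len(chunks), "chunks; first 40 chars:", chunks[0][:40])
```

Real pipelines usually chunk on sentence or section boundaries rather than raw characters, but the trade-off — larger chunks carry more context, smaller chunks retrieve more precisely — is the same.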

Vector database

  • Qdrant (Rust, MIT) — performant vector DB with advanced filtering
  • Chroma (Python, Apache 2.0) — simplicity-oriented for prototyping
  • Milvus (Apache 2.0) — scale-out production
  • pgvector (PostgreSQL extension) — SQL + vector hybrid, suited to existing systems
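With pgvector, the vector index can live next to relational clinical data in the same PostgreSQL instance. A minimal schema and nearest-neighbour query might look like this (table and column names are illustrative; `<=>` is pgvector's cosine-distance operator):

```sql
-- Enable the extension and store chunk embeddings next to their metadata
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE guideline_chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,      -- e.g. guideline name and page reference
    content   text NOT NULL,
    embedding vector(768)         -- must match the embedding model dimension
);

-- Top-5 chunks by cosine distance to a query embedding bound as :q
SELECT source, content
FROM guideline_chunks
ORDER BY embedding <=> :q
LIMIT 5;
```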

Orchestration

  • LangChain (Python/JS, MIT) — LLM pipeline framework
  • LlamaIndex (Python, MIT) — RAG and indexing focus
  • Haystack (Apache 2.0) — production-grade NLP framework with RAG

Healthcare use patterns

Question answering on guidelines

A clinician asks: “What is the first-line therapy for metastatic prostate carcinoma in a 65-year-old patient with cardiovascular comorbidity?”. The system retrieves the relevant pages from guidelines (AIOM, EAU, NCCN); the LLM generates an anchored response with pinpoint citations. The clinician sees both the answer and the sources.

Patient clinical history summary

Given the patient’s FSE (Fascicolo Sanitario Elettronico, the Italian national EHR, as CDA/FHIR documents), the system generates a structured summary: active problems, therapies, recent significant events, parameter trends. The summary is presented to the treating clinician before the visit.
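Turning a FHIR Bundle into summarisation context can be as simple as extracting the fields the prompt needs. A minimal sketch over a toy Bundle (the resource fields follow FHIR R4, but which fields to select is an illustrative choice):

```python
def extract_active_problems(bundle: dict) -> list[str]:
    """Collect display text of active Condition resources from a FHIR Bundle."""
    problems = []
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Condition":
            continue
        codings = res.get("clinicalStatus", {}).get("coding", [])
        if any(c.get("code") == "active" for c in codings):
            problems.append(res.get("code", {}).get("text", "unknown"))
    return problems

bundle = {  # toy FHIR R4 Bundle with two Condition resources
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Condition",
                      "code": {"text": "Type 2 diabetes mellitus"},
                      "clinicalStatus": {"coding": [{"code": "active"}]}}},
        {"resource": {"resourceType": "Condition",
                      "code": {"text": "Pneumonia"},
                      "clinicalStatus": {"coding": [{"code": "resolved"}]}}},
    ],
}
print(extract_active_problems(bundle))  # → ['Type 2 diabetes mellitus']
```

The extracted lists (problems, medications, observations) then become the structured context the LLM summarises, with each item traceable back to its source resource.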

DICOM interpretation + text

Multimodal pipeline: a vision model (such as Gemma 3 multi-modal or MedGemma) analyses DICOM images; the LLM integrates the output with the textual report and clinical history to produce an integrated opinion, always with citations.

Coding assistance

On a discharge letter, the system suggests most appropriate ICD-10/ICD-9-CM codes, retrieving similar cases and coding guidelines.

Active pharmacovigilance

Monitoring of clinical notes for adverse drug event signals; RAG on AIFA Gazzetta/FDA FAERS to contextualise.

Patient Q&A (controlled)

Chatbot on educational health information, with responses anchored to institutional sources (Ministry of Health, ISS, AIFA); strictly non-diagnostic, educational.

Literature review

Synthesis of recent clinical trials for a therapeutic area, with pinpoint PubMed/Cochrane citations.

On-premise deployment: typical architecture

A clinical on-premise implementation typically has:

  • Enterprise GPU cluster — 4-8 A100/H100 80GB GPUs or similar. Cost between €80K-€400K depending on scale, amortisable over deployment lifetime
  • Storage — large volumes for vector indices (hundreds of GB) + document archive
  • Segmented networking — dedicated GDPR-compliant network, separated from general network
  • FSE/EHR integration — via Sogei FSE Gateway FHIR APIs or enterprise record APIs
  • Logging/audit trail — EU AI Act-compliant for high-risk systems: every interaction tracked with input, output, citations, model version
  • Knowledge base update — periodic ETL pipeline reindexing updated guidelines, literature, internal documents
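The audit-trail bullet above can be implemented as append-only structured logs recording input, output, citations and model version per interaction. A minimal JSON-lines sketch (the field names are an assumption for illustration, not an AI Act-mandated schema):

```python
import json, hashlib, datetime, io

def audit_record(question: str, answer: str, citations: list[str],
                 model_version: str) -> dict:
    """One traceable interaction: input, output, sources, model identity."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        "input_sha256": hashlib.sha256(question.encode()).hexdigest(),
        "question": question,
        "answer": answer,
        "citations": citations,
    }

log = io.StringIO()  # stands in for an append-only log file
rec = audit_record(
    question="What is the first-line therapy for metastatic prostate cancer?",
    answer="According to [EAU-2024-p12] ...",
    citations=["EAU-2024-p12"],
    model_version="llama-3.3-70b-instruct-q4",  # placeholder version tag
)
log.write(json.dumps(rec) + "\n")
print(rec["model_version"], rec["citations"])
```

Recording the exact model version alongside each answer is what makes a recommendation reconstructible later, when the deployed model has since been updated.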

Regulatory compliance

As of July 2025 the landscape is:

EU AI Act

  • Entered into force on 1 August 2024
  • For medical device AI: the high-risk requirements for AI covered by Art. 6(1) (AI that is, or is a safety component of, a product subject to Notified Body certification, including medical devices) apply from 2 August 2027
  • Obligations: risk management, quality management, transparency, human oversight, post-market monitoring
  • Important: GPAI models (General-Purpose AI) like Llama/Mistral have their own obligations; a clinical product integrating them inherits obligations both as downstream GPAI integrator and as medical device

MDR (EU Regulation 2017/745)

  • Fully applicable since May 2021
  • Clinical decision-support software is typically Class IIa under Rule 11, or higher where malfunction could cause death or serious deterioration of health
  • A certified product with LLM components requires full qualification of the model: risk management, clinical validation, post-market surveillance

EHDS (EU Regulation 2025/327)

  • In force since 26 March 2025
  • Mandatory primary use from 26 March 2027, secondary use from 26 March 2029
  • Clinical LLM systems integrated with FSE must be compatible with interoperability and access requirements

GDPR

  • Processing of health data typically relies on art. 9(2)(h) (provision of healthcare) as its legal basis
  • DPIA mandatory for large-scale deployment of clinical AI systems
  • Extra-EU transfer naturally avoided with on-premise deployment

Italy

  • Ministry of Health Decree of 7 September 2023 on FSE 2.0
  • Garante guidelines on healthcare AI evolving

Advantages of open source + on-premise

The open source LLM + RAG on-premise pattern solves many of the problems preventing proprietary cloud models from entering healthcare:

  • Privacy: no data leaves enterprise perimeter
  • Lifecycle control: the model is under the organisation’s control; managed updates
  • Full auditability: all behaviour is inspectable
  • Customisation: fine-tuning on local data possible
  • Economics: no per-token cost; initial GPU investment amortised
  • Certification: easier to qualify a locally controlled system

Limits and challenges

  • Performance — top open-weights models (Llama 3.3 70B, DeepSeek V3) are now competitive with GPT-4, but a residual gap remains on some very hard tasks
  • Infrastructure — requires non-trivial GPU systems expertise
  • Knowledge base quality — RAG’s value depends on indexing curation; garbage in, garbage out
  • Clinical evaluation — measuring medical RAG system quality requires metrics (accuracy, hallucination rate, citation correctness) and clinical studies
  • Bias management — GPAI models can reflect general-training-corpus bias; mitigation requires evaluation and possibly fine-tuning
  • Complex clinical reasoning — LLMs excel at information retrieval; multi-step reasoning with complex clinical constraints is still a development area. DeepSeek-R1 and reasoning models are shifting this limit

The Italian context

As of 2025 Italian adoption:

  • IRCCS and large hospital organisations — first experimental clinical RAG deployments
  • Italian healthcare software companies — LLM integration in management products
  • Research projects — clinical validation studies on RAG systems, with accuracy control and clinician comparison
  • Universities — Turin, Milan, Pavia, Bologna, Pisa active on medical LLM evaluation
  • Partnership with Sogei — FSE 2.0 as structured data source for RAG pipelines

The topic of Italian LLMs — models pre-trained specifically on Italian clinical language — is still open. Initiatives like IT5 (Italian generalist), BioBIT and LLaMAntino (Italian Llama) are starting points, but production-quality Italian clinical models are still in development.

Outlook

Expected directions in the coming months/years:

  • Reasoning models more integrated with RAG — structured clinical reasoning with retrieval
  • Full multimodality — image + text + signals in a single pipeline
  • Clinical agents — LLMs orchestrating multiple queries (FHIR, DICOM, guidelines) to answer complex questions
  • Prospective clinical evaluation — first randomised prospective studies with standard-practice control to measure clinical outcome impact
  • Certification — first CE and FDA-certified medical device AI with LLM components (some emerging 2025 announcements)
  • MONAI + LLM — clinical LLM integration in the MONAI framework for imaging intelligence
  • Italian contributions — production-quality Italian LLMs for Italian clinical language

The pattern RAG + open LLM + on-premise represents in 2025 the viable path to bring large-scale LLMs into healthcare without breaching privacy, governance and regulatory constraints. It is an architecture growing in maturity each month and that will shape the next layer of diagnostic support and clinical knowledge management for years to come.


References: Llama 3.x (Meta AI, 2024-2025), Mistral (Mistral AI), Gemma (Google DeepMind), DeepSeek-V3/R1 (DeepSeek AI, 2024-2025), Qwen 2.5 (Alibaba). Meditron (EPFL, 2023), BioMistral (2024). Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020. vLLM, llama.cpp, Ollama, Hugging Face TGI. LangChain, LlamaIndex, Haystack. Qdrant, Chroma, Milvus, pgvector. Regulation (EU) 2024/1689 (EU AI Act). Regulation (EU) 2025/327 (EHDS).
