Clinical RAG with open source LLMs: on-premise architectures for diagnostic support

Retrieval-Augmented Generation applied to diagnostic support with open source LLMs (Llama, Mistral, Gemma, DeepSeek, BioMistral, Meditron), integration with FHIR and DICOM, on-premise deployment, EU AI Act and MDR compliance.

Digital Health · R&D · Open Source · AI · RAG · LLM · Llama · Mistral · BioMistral · Meditron · FHIR · DICOM

LLMs in medicine: from ChatGPT to open models

Since the release of ChatGPT (November 2022) and GPT-4 (March 2023), large language models have attracted intense attention in medicine, both for their promise (literature summarisation, clinical question answering, documentation support, patient assistance) and for their limits (hallucinations, absence of verifiable citations, data privacy). Healthcare use of proprietary, cloud-only LLMs is hindered by structural constraints:

  • Privacy: clinical data cannot leave the controller’s perimeter without rigorous legal bases (GDPR art. 9)
  • Knowledge update: a model with a 2023 training cutoff does not know 2024-2025 guidelines
  • Citability: a clinician needs to know from which source a recommendation comes
  • Auditability: every recommendation must be reconstructible and contestable
  • EU AI Act compliance (Regulation 2024/1689, in force since August 2024): AI-based medical devices are high-risk systems with specific obligations

The emerging response in 2024-2025 is a combination: open source LLMs (run on-premise on controlled infrastructure) + RAG (Retrieval-Augmented Generation) (grounding responses on verifiable sources, dynamically retrieved) + structured healthcare data (FHIR, DICOM, OMOP) as contextual input. As of July 2025 this stack is mature.

Available open source LLMs

The open-weights ecosystem as of 2025 offers credible choices:

Llama family (Meta)

  • Llama 3.1 (July 2024): 8B, 70B, 405B parameter variants — Llama 3.1 Community License (open-weights with some scale-use restrictions)
  • Llama 3.2 (September 2024): 1B, 3B, 11B, 90B models, with multi-modal variants
  • Llama 3.3 (December 2024): 70B version with optimised performance
  • Llama 4 (2025): variants with Mixture-of-Experts architectures, superior performance

Mistral AI (France)

  • Mistral 7B (September 2023): entry baseline
  • Mixtral 8x7B and Mixtral 8x22B (2023-2024): MoE
  • Mistral Small, Mistral Large (2024-2025): flagship models
  • Apache 2.0 licence for open-weights versions

Gemma (Google)

  • Gemma 2 (June 2024): 2B, 9B, 27B versions
  • Gemma 3 (2025): multi-modal variants, substantial improvements
  • Gemma terms of use licence (open-weights)

DeepSeek

  • DeepSeek-V3 (December 2024): 671B parameter MoE, ~37B active
  • DeepSeek-R1 (January 2025): step-by-step reasoning model, MIT-licensed public release
  • Reasoning quality comparable to top proprietary models

Qwen (Alibaba)

  • Qwen 2.5 (late 2024): 0.5B-72B with specialist variants (math, coder)
  • Apache 2.0

Specialised biomedical models

  • Meditron (EPFL, 2023): Llama-2 fine-tuned on PubMed + international clinical guidelines
  • Med42 (M42 AI, 2024): Llama-based, trained on clinical datasets
  • BioMistral (2024): Mistral fine-tuning on biomedical literature
  • MedGemma (2025): announced by Google, Gemma fine-tuning on clinical tasks

RAG: Retrieval-Augmented Generation

The RAG paradigm — formalised in 2020 by Lewis et al. (FAIR/Meta) in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — separates knowledge and reasoning:

  • An external knowledge base (documents, guidelines, literature, clinical records) is indexed in a vector database
  • On user question, a retriever fetches the most relevant documents (embedding-based semantic similarity)
  • The LLM generates the answer conditioned on retrieved documents, citing them explicitly
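The three steps above — index, retrieve, generate-with-citations — can be sketched end-to-end. This is a toy illustration with a hash-based bag-of-words embedding and an in-memory index (all document IDs and texts are hypothetical), standing in for a real embedding model and vector database:

```python
import math
from collections import Counter

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashing embedding (stand-in for a real sentence-embedding model)."""
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        vec[hash(token) % dim] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # vectors are unit-normalised, so the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))

# 1. Index a small knowledge base (guideline snippets with source IDs)
docs = [
    ("EAU-2024-p12", "first-line therapy for metastatic prostate cancer ..."),
    ("AIOM-2024-p7", "cardiovascular comorbidity requires dose adjustment ..."),
]
index = [(doc_id, text, embed(text)) for doc_id, text in docs]

# 2. Retrieve: rank indexed documents by similarity to the question
def retrieve(question: str, k: int = 2):
    q = embed(question)
    return sorted(index, key=lambda d: cosine(q, d[2]), reverse=True)[:k]

# 3. Generate: condition the LLM on retrieved passages, asking for citations
def build_prompt(question: str) -> str:
    context = "\n".join(f"[{d}] {t}" for d, t, _ in retrieve(question))
    return (f"Answer using ONLY the sources below, citing their IDs.\n"
            f"{context}\nQuestion: {question}\nAnswer:")

print(build_prompt("first-line therapy for metastatic prostate cancer"))
```

In production the `embed` function is replaced by a real model and the list by a vector database, but the control flow — retrieve, assemble a source-tagged context, prompt for cited answers — stays the same.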

For the clinical context, RAG offers fundamental advantages:

  • Update without retraining: adding documents to the knowledge base is immediate
  • Citability: every answer can be backed by identifiable sources
  • Hallucination reduction: the model is conditioned on real documents rather than pure parametric prior
  • Auditability: every step is logged and verifiable

Technical stack

A typical on-premise clinical RAG implementation uses open source components:

Local LLM inference

  • vLLM (UC Berkeley, 2023+) — high-throughput inference server with PagedAttention memory management
  • llama.cpp (Georgi Gerganov, 2023+) — C++ inference with aggressive quantisation (GGUF), runs on CPU and consumer GPUs
  • Ollama — user-friendly packaging of llama.cpp with REST API
  • Text Generation Inference (Hugging Face) — HF server with multi-GPU, tensor parallelism support
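Most of these servers (vLLM, Ollama, TGI) can expose an OpenAI-compatible `/v1/chat/completions` endpoint, which keeps application code server-agnostic. A minimal sketch of building such a request — the endpoint URL and model name are placeholders, not values from this article:

```python
import json
from urllib import request  # used in the commented-out send step below

def build_chat_request(model: str, system: str, user: str,
                       temperature: float = 0.0) -> dict:
    """Payload for an OpenAI-compatible chat endpoint (vLLM, Ollama, TGI)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,  # deterministic output aids auditability
    }

payload = build_chat_request(
    model="llama-3.3-70b-instruct",  # placeholder model name
    system="Answer only from the provided sources, citing them.",
    user="What is the first-line therapy for metastatic prostate cancer?",
)

# Sending it to a local server (not executed here; URL is an assumption):
# req = request.Request("http://localhost:8000/v1/chat/completions",
#                       data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# resp = json.load(request.urlopen(req))

print(json.dumps(payload, indent=2))
```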

Embedding

  • sentence-transformers — Python library with embedding models
  • BGE (Beijing Academy of AI), E5 (Microsoft), Jina Embeddings — competitive open source embeddings
  • BioLORD, MedEmbed, SapBERT — specialist biomedical embeddings
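Before embedding, source documents are normally split into overlapping chunks so that each vector covers one retrievable unit. A minimal sketch (chunk sizes here are illustrative, not recommendations):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into overlapping character windows for embedding.
    Overlap reduces the chance a relevant sentence is cut in half."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

guideline = "Section 4.2 Treatment. " * 100  # stand-in for a long document
chunks = chunk_text(guideline)
print(len(chunks), "chunks; first 40 chars:", chunks[0][:40])
```

Real pipelines usually chunk on sentence or section boundaries rather than raw characters, but the trade-off — larger chunks carry more context, smaller chunks retrieve more precisely — is the same.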

Vector database

  • Qdrant (Rust, MIT) — performant vector DB with advanced filtering
  • Chroma (Python, Apache 2.0) — simplicity-oriented for prototyping
  • Milvus (Apache 2.0) — scale-out production
  • pgvector (PostgreSQL extension) — SQL + vector hybrid, suited to existing systems
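With pgvector, the vector index can live next to relational clinical data in the same PostgreSQL instance. A minimal schema and nearest-neighbour query might look like this (table and column names are illustrative; `<=>` is pgvector's cosine-distance operator):

```sql
-- Enable the extension and store chunk embeddings next to their metadata
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE guideline_chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,      -- e.g. guideline name and page reference
    content   text NOT NULL,
    embedding vector(768)         -- must match the embedding model dimension
);

-- Top-5 chunks by cosine distance to a query embedding bound as :q
SELECT source, content
FROM guideline_chunks
ORDER BY embedding <=> :q
LIMIT 5;
```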

Orchestration

  • LangChain (Python/JS, MIT) — LLM pipeline framework
  • LlamaIndex (Python, MIT) — RAG and indexing focus
  • Haystack (Apache 2.0) — production-grade NLP framework with RAG

Healthcare use patterns

Question answering on guidelines

A clinician asks: “What is the first-line therapy for metastatic prostate carcinoma in a 65-year-old patient with cardiovascular comorbidity?”. The system retrieves the relevant pages from guidelines (AIOM, EAU, NCCN); the LLM generates an anchored response with pinpoint citations. The clinician sees both the answer and the sources.

Patient clinical history summary

Given the patient’s FSE (Fascicolo Sanitario Elettronico, the Italian national EHR, as CDA/FHIR documents), the system generates a structured summary: active problems, therapies, recent significant events, parameter trends. The summary is presented to the treating clinician before the visit.
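Turning a FHIR Bundle into summarisation context can be as simple as extracting the fields the prompt needs. A minimal sketch over a toy Bundle (the resource fields follow FHIR R4, but which fields to select is an illustrative choice):

```python
def extract_active_problems(bundle: dict) -> list[str]:
    """Collect display text of active Condition resources from a FHIR Bundle."""
    problems = []
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Condition":
            continue
        codings = res.get("clinicalStatus", {}).get("coding", [])
        if any(c.get("code") == "active" for c in codings):
            problems.append(res.get("code", {}).get("text", "unknown"))
    return problems

bundle = {  # toy FHIR R4 Bundle with two Condition resources
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Condition",
                      "code": {"text": "Type 2 diabetes mellitus"},
                      "clinicalStatus": {"coding": [{"code": "active"}]}}},
        {"resource": {"resourceType": "Condition",
                      "code": {"text": "Pneumonia"},
                      "clinicalStatus": {"coding": [{"code": "resolved"}]}}},
    ],
}
print(extract_active_problems(bundle))  # → ['Type 2 diabetes mellitus']
```

The extracted lists (problems, medications, observations) then become the structured context the LLM summarises, with each item traceable back to its source resource.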

DICOM interpretation + text

Multimodal pipeline: a vision model (such as Gemma 3 multi-modal or MedGemma) analyses DICOM images; the LLM integrates the output with the textual report and clinical history to produce an integrated opinion, always with citations.

Coding assistance

On a discharge letter, the system suggests most appropriate ICD-10/ICD-9-CM codes, retrieving similar cases and coding guidelines.

Active pharmacovigilance

Monitoring of clinical notes for adverse drug event signals; RAG on AIFA Gazzetta/FDA FAERS to contextualise.

Patient Q&A (controlled)

Chatbot on educational health information, with responses anchored to institutional sources (Ministry of Health, ISS, AIFA); strictly non-diagnostic, educational.

Literature review

Synthesis of recent clinical trials for a therapeutic area, with pinpoint PubMed/Cochrane citations.

On-premise deployment: typical architecture

A clinical on-premise implementation typically has:

  • Enterprise GPU cluster — 4-8 A100/H100 80GB GPUs or similar. Cost between €80K-€400K depending on scale, amortisable over deployment lifetime
  • Storage — large volumes for vector indices (hundreds of GB) + document archive
  • Segmented networking — dedicated GDPR-compliant network, separated from general network
  • FSE/EHR integration — via Sogei FSE Gateway FHIR APIs or enterprise record APIs
  • Logging/audit trail — EU AI Act-compliant for high-risk systems: every interaction tracked with input, output, citations, model version
  • Knowledge base update — periodic ETL pipeline reindexing updated guidelines, literature, internal documents
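The audit-trail bullet above can be implemented as append-only structured logs recording input, output, citations and model version per interaction. A minimal JSON-lines sketch (the field names are an assumption for illustration, not an AI Act-mandated schema):

```python
import json, hashlib, datetime, io

def audit_record(question: str, answer: str, citations: list[str],
                 model_version: str) -> dict:
    """One traceable interaction: input, output, sources, model identity."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model_version": model_version,
        "input_sha256": hashlib.sha256(question.encode()).hexdigest(),
        "question": question,
        "answer": answer,
        "citations": citations,
    }

log = io.StringIO()  # stands in for an append-only log file
rec = audit_record(
    question="What is the first-line therapy for metastatic prostate cancer?",
    answer="According to [EAU-2024-p12] ...",
    citations=["EAU-2024-p12"],
    model_version="llama-3.3-70b-instruct-q4",  # placeholder version tag
)
log.write(json.dumps(rec) + "\n")
print(rec["model_version"], rec["citations"])
```

Recording the exact model version alongside each answer is what makes a recommendation reconstructible later, when the deployed model has since been updated.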

Regulatory compliance

As of July 2025 the landscape is:

EU AI Act

  • Entered into force on 1 August 2024
  • For medical device AI: the high-risk requirements for AI covered by Art. 6(1) (AI that is, or is a safety component of, a product subject to Notified Body certification, including medical devices) apply from 2 August 2027
  • Obligations: risk management, quality management, transparency, human oversight, post-market monitoring
  • Important: GPAI models (General-Purpose AI) like Llama/Mistral have their own obligations; a clinical product integrating them inherits obligations both as downstream GPAI integrator and as medical device

MDR (EU Regulation 2017/745)

  • Fully applicable since May 2021
  • Clinical decision-support software is typically Class IIa under Rule 11, or higher where malfunction could cause death or serious deterioration of health
  • A certified product with LLM components requires full qualification of the model: risk management, clinical validation, post-market surveillance

EHDS (EU Regulation 2025/327)

  • In force since 26 March 2025
  • Mandatory primary use from 26 March 2027, secondary use from 26 March 2029
  • Clinical LLM systems integrated with FSE must be compatible with interoperability and access requirements

GDPR

  • Processing of health data typically relies on art. 9(2)(h) (provision of healthcare) as its legal basis
  • DPIA mandatory for large-scale deployment of clinical AI systems
  • Extra-EU transfer naturally avoided with on-premise deployment

Italy

  • Ministry of Health Decree of 7 September 2023 on FSE 2.0
  • Garante guidelines on healthcare AI evolving

Advantages of open source + on-premise

The open source LLM + RAG on-premise pattern solves many of the problems preventing proprietary cloud models from entering healthcare:

  • Privacy: no data leaves enterprise perimeter
  • Lifecycle control: the model is under the organisation’s control; managed updates
  • Full auditability: all behaviour is inspectable
  • Customisation: fine-tuning on local data possible
  • Economics: no per-token cost; initial GPU investment amortised
  • Certification: easier to qualify a locally controlled system

Limits and challenges

  • Performance — top open-weights models (Llama 3.3 70B, DeepSeek V3) are now competitive with GPT-4, but a residual gap remains on some very hard tasks
  • Infrastructure — requires non-trivial GPU systems expertise
  • Knowledge base quality — RAG’s value depends on indexing curation; garbage in, garbage out
  • Clinical evaluation — measuring medical RAG system quality requires metrics (accuracy, hallucination rate, citation correctness) and clinical studies
  • Bias management — GPAI models can reflect general-training-corpus bias; mitigation requires evaluation and possibly fine-tuning
  • Complex clinical reasoning — LLMs excel at information retrieval; multi-step reasoning with complex clinical constraints is still a development area. DeepSeek-R1 and reasoning models are shifting this limit

The Italian context

As of 2025 Italian adoption:

  • IRCCS and large hospital organisations — first experimental clinical RAG deployments
  • Italian healthcare software companies — LLM integration in management products
  • Research projects — clinical validation studies on RAG systems, with accuracy control and clinician comparison
  • Universities — Turin, Milan, Pavia, Bologna, Pisa active on medical LLM evaluation
  • Partnership with Sogei — FSE 2.0 as structured data source for RAG pipelines

The topic of Italian LLMs — models pre-trained specifically on Italian clinical language — is still open. Initiatives like IT5 (Italian generalist), BioBIT and LLaMAntino (Italian Llama) are starting points, but production-quality Italian clinical models are still in development.

Outlook

Expected directions in the coming months/years:

  • Reasoning models more integrated with RAG — structured clinical reasoning with retrieval
  • Full multimodality — image + text + signals in a single pipeline
  • Clinical agents — LLMs orchestrating multiple queries (FHIR, DICOM, guidelines) to answer complex questions
  • Prospective clinical evaluation — first randomised prospective studies with standard-practice control to measure clinical outcome impact
  • Certification — first CE and FDA-certified medical device AI with LLM components (some emerging 2025 announcements)
  • MONAI + LLM — clinical LLM integration in the MONAI framework for imaging intelligence
  • Italian contributions — production-quality Italian LLMs for Italian clinical language

The pattern RAG + open LLM + on-premise represents in 2025 the viable path to bring large-scale LLMs into healthcare without breaching privacy, governance and regulatory constraints. It is an architecture growing in maturity each month and that will shape the next layer of diagnostic support and clinical knowledge management for years to come.


References: Llama 3.x (Meta AI, 2024-2025), Mistral (Mistral AI), Gemma (Google DeepMind), DeepSeek-V3/R1 (DeepSeek AI, 2024-2025), Qwen 2.5 (Alibaba). Meditron (EPFL, 2023), BioMistral (2024). Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020. vLLM, llama.cpp, Ollama, Hugging Face TGI. LangChain, LlamaIndex, Haystack. Qdrant, Chroma, Milvus, pgvector. Regulation (EU) 2024/1689 (EU AI Act). Regulation (EU) 2025/327 (EHDS).
