LLMs in medicine: from ChatGPT to open models
Since the release of ChatGPT (November 2022) and GPT-4 (March 2023), large language models have attracted intense medical attention — both for their promise (literature summarisation, clinical question answering, documentation support, patient assistance) and for their limits (hallucinations, absence of verifiable citations, data privacy). Healthcare use of proprietary cloud-only LLMs is hindered by structural constraints:
- Privacy: clinical data cannot leave the controller’s perimeter without rigorous legal bases (GDPR art. 9)
- Knowledge update: a model trained with 2023 data cutoff doesn’t know 2024-2025 guidelines
- Citability: a clinician needs to know from which source a recommendation comes
- Auditability: every recommendation must be reconstructible and contestable
- EU AI Act compliance (Regulation (EU) 2024/1689, in force since August 2024): AI in medical devices qualifies as high-risk, with specific obligations
The emerging response in 2024-2025 is a combination: open source LLMs (run on-premise on controlled infrastructure) + RAG (Retrieval-Augmented Generation) (grounding responses on verifiable sources, dynamically retrieved) + structured healthcare data (FHIR, DICOM, OMOP) as contextual input. As of July 2025 this stack is mature.
Available open source LLMs
The open-weights ecosystem as of 2025 offers credible choices:
Llama family (Meta)
- Llama 3.1 (July 2024): 8B, 70B, 405B parameter variants — Llama 3.1 Community License (open weights with some restrictions on large-scale use)
- Llama 3.2 (September 2024): 1B, 3B, 11B, 90B models, with multi-modal variants
- Llama 3.3 (December 2024): 70B version with optimised performance
- Llama 4 (2025): variants with Mixture-of-Experts architectures, superior performance
Mistral AI (France)
- Mistral 7B (September 2023): entry baseline
- Mixtral 8x7B and Mixtral 8x22B (2023-2024): MoE
- Mistral Small, Mistral Large (2024-2025): flagship models
- Apache 2.0 licence for open-weights versions
Gemma (Google)
- Gemma 2 (June 2024): 2B, 9B, 27B versions
- Gemma 3 (2025): multi-modal variants, substantial improvements
- Released under the Gemma Terms of Use (open weights)
DeepSeek
- DeepSeek-V3 (December 2024): 671B parameter MoE, ~37B active
- DeepSeek-R1 (January 2025): step-by-step reasoning model, MIT-licensed public release
- Reasoning quality comparable to top proprietary models
Qwen (Alibaba)
- Qwen 2.5 (late 2024): 0.5B-72B with specialist variants (math, coder)
- Apache 2.0
Specialised biomedical models
- Meditron (EPFL, 2023): Llama-2 fine-tuned on PubMed + international clinical guidelines
- Med42 (M42 AI, 2024): Llama-based, trained on clinical datasets
- BioMistral (2024): Mistral fine-tuning on biomedical literature
- MedGemma (2025): announced by Google, Gemma fine-tuning on clinical tasks
RAG: Retrieval-Augmented Generation
The RAG paradigm — formalised in 2020 by Lewis et al. (FAIR/Meta) in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — separates knowledge and reasoning:
- An external knowledge base (documents, guidelines, literature, clinical records) is indexed in a vector database
- On user question, a retriever fetches the most relevant documents (embedding-based semantic similarity)
- The LLM generates the answer conditioned on retrieved documents, citing them explicitly
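The three steps above can be sketched in a few lines of pure Python. The bag-of-words "embedding" below is a toy stand-in for a real embedding model (e.g. a sentence-transformers model), and the document IDs and guideline texts are invented for illustration:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a production system would use a
    # trained embedding model instead (this is an assumption, not a fixed choice).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    # Rank documents by semantic similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(corpus[d])), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: dict[str, str], doc_ids: list[str]) -> str:
    # The LLM is conditioned on the retrieved passages and asked to cite them.
    context = "\n".join(f"[{d}] {corpus[d]}" for d in doc_ids)
    return (
        "Answer using ONLY the sources below, citing them as [id].\n"
        f"{context}\n\nQuestion: {query}"
    )

# Invented mini-corpus of guideline snippets, keyed by citable IDs.
corpus = {
    "eau-2024-s5": "EAU guideline: first-line therapy for metastatic prostate cancer ...",
    "aiom-2024-s2": "AIOM guideline: cardiovascular comorbidity management in oncology ...",
    "nccn-breast": "NCCN guideline: breast cancer screening intervals ...",
}
top = retrieve("first-line therapy metastatic prostate cancer", corpus)
prompt = build_prompt("first-line therapy?", corpus, top)
```

The separation is the point: `retrieve` can be swapped for a vector database, `build_prompt` for a templating framework, without touching the model itself.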
For the clinical context, RAG offers fundamental advantages:
- Update without retraining: adding documents to the knowledge base is immediate
- Citability: every answer can be backed by identifiable sources
- Hallucination reduction: the model is conditioned on real documents rather than pure parametric prior
- Auditability: every step is logged and verifiable
Technical stack
A typical on-premise clinical RAG implementation uses open source components:
Local LLM inference
- vLLM (UC Berkeley, 2023+) — high-throughput inference server with PagedAttention
- llama.cpp (Georgi Gerganov, 2023+) — C++ inference with aggressive quantisation (GGUF), runs on CPU and consumer GPUs
- Ollama — user-friendly packaging of llama.cpp with REST API
- Text Generation Inference (Hugging Face) — HF server with multi-GPU, tensor parallelism support
Embedding
- sentence-transformers — Python library with embedding models
- BGE (Beijing Academy of AI), E5 (Microsoft), Jina Embeddings — competitive open source embeddings
- BioLORD, MedEmbed, SapBERT — specialist biomedical embeddings
Vector database
- Qdrant (Rust, MIT) — performant vector DB with advanced filtering
- Chroma (Python, Apache 2.0) — simplicity-oriented for prototyping
- Milvus (Apache 2.0) — scale-out production
- pgvector (PostgreSQL extension) — SQL + vector hybrid, suited to existing systems
Orchestration
- LangChain (Python/JS, MIT) — LLM pipeline framework
- LlamaIndex (Python, MIT) — RAG and indexing focus
- Haystack (Apache 2.0) — production-grade NLP framework with RAG
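Tying the layers together: Ollama exposes a REST endpoint at `/api/generate` that takes a JSON body with `model`, `prompt` and `stream` fields. The sketch below only builds the request, so it can be inspected without a running server; the model tag and prompt wording are assumptions, not fixed choices:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def grounded_request(model: str, question: str,
                     passages: list[tuple[str, str]]) -> urllib.request.Request:
    """Build (but do not send) a request that grounds the model on retrieved passages."""
    context = "\n".join(f"[{src}] {text}" for src, text in passages)
    prompt = (
        "You are a clinical assistant. Answer ONLY from the sources below, "
        "citing each claim as [source].\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = grounded_request(
    "llama3.1:8b",  # assumed model tag; any locally pulled model works
    "First-line therapy for metastatic prostate cancer?",
    [("eau-2024", "EAU guideline: ...")],
)
# urllib.request.urlopen(req) would send it to the local server (not executed here)
payload = json.loads(req.data)
```

In practice an orchestration framework (LangChain, LlamaIndex, Haystack) replaces this hand-rolled glue, but the request shape stays the same.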
Healthcare use patterns
Question answering on guidelines
A clinician asks “what is the first-line therapy for metastatic prostate carcinoma in a 65-year-old patient with cardiovascular comorbidity?”. The system retrieves the relevant pages from guidelines (AIOM, EAU, NCCN), and the LLM generates an anchored response with pinpoint citations. The clinician sees both the answer and the sources.
Patient clinical history summary
Given the patient’s FSE (Fascicolo Sanitario Elettronico, the Italian national EHR; CDA/FHIR documents), the system generates a structured summary: active problems, therapies, recent significant events, parameter trends. The summary is presented to the treating clinician before the visit.
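Before any text reaches the LLM, the FHIR Bundle has to be reduced to a structured scaffold. A minimal sketch handling just two R4 resource types (`Condition.code.text` and `MedicationStatement.medicationCodeableConcept.text`); a real pipeline would validate against FHIR profiles and cover far more resources:

```python
def summarise_bundle(bundle: dict) -> dict:
    """Group FHIR Bundle entries into a pre-visit summary scaffold.

    Only two resource types and their display-text fields are handled here;
    coded values (SNOMED CT, ATC) would be resolved in a real system.
    """
    summary = {"problems": [], "medications": []}
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        rtype = res.get("resourceType")
        if rtype == "Condition":
            summary["problems"].append(res.get("code", {}).get("text", "unknown"))
        elif rtype == "MedicationStatement":
            med = res.get("medicationCodeableConcept", {}).get("text", "unknown")
            summary["medications"].append(med)
    return summary

# Invented toy bundle with one condition and one active therapy.
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {"resourceType": "Condition",
                      "code": {"text": "Type 2 diabetes"}}},
        {"resource": {"resourceType": "MedicationStatement",
                      "medicationCodeableConcept": {"text": "Metformin 850 mg"}}},
    ],
}
summary = summarise_bundle(bundle)
```

The structured scaffold, not the raw bundle, is what gets serialised into the LLM prompt: it keeps the context short and the summary auditable field by field.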
DICOM interpretation + text
Multimodal pipeline: the vision model (like Gemma 3 multi-modal or MedGemma) analyses DICOM images; the LLM integrates the output with textual report and clinical history to produce an integrated opinion. Always with citations.
Coding assistance
On a discharge letter, the system suggests most appropriate ICD-10/ICD-9-CM codes, retrieving similar cases and coding guidelines.
Active pharmacovigilance
Monitoring of clinical notes for adverse drug event signals; RAG on AIFA Gazzetta/FDA FAERS to contextualise.
Patient Q&A (controlled)
Chatbot on educational health information, with responses anchored to institutional sources (Ministry of Health, ISS, AIFA); strictly non-diagnostic, educational.
Literature review
Synthesis of recent clinical trials for a therapeutic area, with pinpoint PubMed/Cochrane citations.
On-premise deployment: typical architecture
A clinical on-premise implementation typically has:
- Enterprise GPU cluster — 4-8 A100/H100 80 GB GPUs or similar; indicative cost €80K-€400K depending on scale, amortised over the deployment lifetime
- Storage — large volumes for vector indices (hundreds of GB) + document archive
- Segmented networking — dedicated GDPR-compliant network, separated from general network
- FSE/EHR integration — via Sogei FSE Gateway FHIR APIs or enterprise record APIs
- Logging/audit trail — EU AI Act-compliant for high-risk systems: every interaction tracked with input, output, citations, model version
- Knowledge base update — periodic ETL pipeline reindexing updated guidelines, literature, internal documents
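The audit-trail requirement in the list above maps naturally onto one append-only JSON line per interaction. A sketch of such a record, with enough fields to reconstruct and contest a recommendation; the field names are illustrative, not a normative AI Act schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user_input: str, output: str, citations: list[str],
                 model_version: str) -> str:
    """Serialise one interaction as a JSON line for an append-only audit log.

    The input is stored as a hash so the log itself does not duplicate
    clinical data; the original input lives in the (access-controlled) EHR.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "input_sha256": hashlib.sha256(user_input.encode()).hexdigest(),
        "output": output,
        "citations": citations,       # which sources backed the answer
        "model_version": model_version,  # which weights produced it
    }
    return json.dumps(record, ensure_ascii=False)

line = audit_record(
    "first-line therapy metastatic prostate cancer?",
    "Per [eau-2024-s5], ...",
    ["eau-2024-s5"],
    "llama-3.3-70b-q4@2025-07",  # invented version tag for illustration
)
```

Pinning the model version in every record is what makes a recommendation reconstructible after a model update — without it, post-market monitoring cannot attribute behaviour to specific weights.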
Regulatory compliance
As of July 2025 the landscape is:
EU AI Act
- Entered into force on 1 August 2024
- For medical device AI: high-risk requirements for AI systems under Art. 6(1) (devices subject to Notified Body certification) apply from 2 August 2027
- Obligations: risk management, quality management, transparency, human oversight, post-market monitoring
- Important: GPAI models (General-Purpose AI) like Llama/Mistral carry their own obligations; a clinical product integrating them inherits obligations both as a downstream GPAI integrator and as a medical device
MDR (EU Regulation 2017/745)
- Fully applicable since May 2021
- Clinical AI software typically falls in Class IIa or higher under Rule 11 (software providing information for diagnostic or therapeutic decisions)
- A certified product with LLM components requires full qualification of the model: risk management, clinical validation, post-market surveillance
EHDS (EU Regulation 2025/327)
- In force since 26 March 2025
- Mandatory primary use from 26 March 2027, secondary use from 26 March 2029
- Clinical LLM systems integrated with FSE must be compatible with interoperability and access requirements
GDPR
- Healthcare data remain under art. 9(2)(h)
- DPIA mandatory for large-scale deployment of clinical AI systems
- Extra-EU transfer naturally avoided with on-premise deployment
Italy
- Ministry of Health Decree of 7 September 2023 on FSE 2.0
- Garante guidelines on healthcare AI evolving
Advantages of open source + on-premise
The open source LLM + RAG on-premise pattern solves many of the problems preventing proprietary cloud models from entering healthcare:
- Privacy: no data leaves enterprise perimeter
- Lifecycle control: the model is under the organisation’s control; managed updates
- Full auditability: all behaviour is inspectable
- Customisation: fine-tuning on local data possible
- Economics: no per-token cost; initial GPU investment amortised
- Certification: easier to qualify a locally controlled system
Limits and challenges
- Performance — top open-weights models (Llama 3.3 70B, DeepSeek V3) are now competitive with GPT-4, but residual gap on some very hard tasks
- Infrastructure — requires non-trivial GPU systems expertise
- Knowledge base quality — RAG’s value depends on indexing curation; garbage in, garbage out
- Clinical evaluation — measuring medical RAG system quality requires metrics (accuracy, hallucination rate, citation correctness) and clinical studies
- Bias management — GPAI models can reflect general-training-corpus bias; mitigation requires evaluation and possibly fine-tuning
- Complex clinical reasoning — LLMs are strong at retrieving and synthesising information; multi-step reasoning under complex clinical constraints is still a development area. DeepSeek-R1 and other reasoning models are pushing this limit
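The evaluation metrics mentioned above can be made concrete. A toy scorer for citation correctness and a derived hallucination rate; here "supported" is a gold set, whereas in a real evaluation that judgement would come from clinician review or an entailment check:

```python
def citation_precision(cited: list[str], supported: set[str]) -> float:
    """Fraction of cited sources that actually support the answer."""
    if not cited:
        return 0.0  # an uncited clinical answer scores zero by convention
    return sum(1 for c in cited if c in supported) / len(cited)

def hallucination_rate(answers: list[dict], threshold: float = 1.0) -> float:
    """Share of answers whose citations are not fully supported."""
    flagged = [
        a for a in answers
        if citation_precision(a["cited"], a["supported"]) < threshold
    ]
    return len(flagged) / len(answers)

# Invented evaluation set: one fully supported answer, one with a bad citation.
answers = [
    {"cited": ["eau-2024"], "supported": {"eau-2024"}},
    {"cited": ["nccn-x", "eau-2024"], "supported": {"eau-2024"}},
]
rate = hallucination_rate(answers)
```

Automated scores like these are screening tools; regulatory-grade evidence still requires clinical studies with expert adjudication.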
The Italian context
As of 2025 Italian adoption:
- IRCCS and large hospital organisations — first experimental clinical RAG deployments
- Italian healthcare software companies — LLM integration in management products
- Research projects — clinical validation studies on RAG systems, with accuracy control and clinician comparison
- Universities — Turin, Milan, Pavia, Bologna, Pisa active on medical LLM evaluation
- Partnership with Sogei — FSE 2.0 as structured data source for RAG pipelines
The question of Italian clinical LLMs — models pre-trained specifically on Italian clinical language — remains open. Initiatives such as IT5 (Italian generalist), BioBIT and LLaMAntino (Italian Llama adaptations) are starting points, but production-quality Italian clinical models are still in development.
Outlook
Expected directions in the coming months/years:
- Reasoning models more integrated with RAG — structured clinical reasoning with retrieval
- Full multimodality — image + text + signals in a single pipeline
- Clinical agents — LLMs orchestrating multiple queries (FHIR, DICOM, guidelines) to answer complex questions
- Prospective clinical evaluation — first randomised prospective studies with standard-practice control to measure clinical outcome impact
- Certification — first CE and FDA-certified medical device AI with LLM components (some emerging 2025 announcements)
- MONAI + LLM — clinical LLM integration into the MONAI medical-imaging framework
- Italian contributions — production-quality Italian LLMs for Italian clinical language
In 2025, the RAG + open LLM + on-premise pattern represents the viable path to bringing LLMs into healthcare at scale without breaching privacy, governance or regulatory constraints. It is an architecture that matures month by month and will shape the next layer of diagnostic support and clinical knowledge management for years to come.
References: Llama 3.x (Meta AI, 2024-2025), Mistral (Mistral AI), Gemma (Google DeepMind), DeepSeek-V3/R1 (DeepSeek AI, 2024-2025), Qwen 2.5 (Alibaba). Meditron (EPFL, 2023), BioMistral (2024). Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, NeurIPS 2020. vLLM, llama.cpp, Ollama, Hugging Face TGI. LangChain, LlamaIndex, Haystack. Qdrant, Chroma, Milvus, pgvector. Regulation (EU) 2024/1689 (EU AI Act). Regulation (EU) 2025/327 (EHDS).