From experiment tracking to MLOps platform
When Databricks open-sourced MLflow in 2018, the focus was clear: track machine learning experiments with a simple API that recorded parameters, metrics and artefacts. Seven years later, MLflow 2.x is a platform covering the entire ML lifecycle, from first experiment to model in production. The transformation did not happen through feature accumulation but through a logical progression: those who track experiments need a model registry; those with a registry need deployment; those who deploy need systematic evaluation.
Tracking and Model Registry
The tracking server remains the core: each run records parameters, metrics, tags and artefacts in a configurable backend storage — relational database for metadata, object storage for artefacts. The Model Registry adds a governance layer: each registered model has numbered versions, transition stages (Staging, Production, Archived) and descriptive metadata. Teams can promote a model from staging to production with a tracked, reversible operation.
Integration with major frameworks — PyTorch, TensorFlow, scikit-learn, Hugging Face Transformers, LangChain — is native: MLflow serialises models in the appropriate format and automatically records dependencies.
MLflow Deployments and model serving
MLflow Deployments (formerly MLflow AI Gateway) extends the platform to serving. Registered models can be exposed as REST endpoints with a single command, served locally for testing or deployed on Kubernetes via official Helm charts. Integration with major cloud providers — AWS SageMaker, Azure ML, Databricks Model Serving — allows moving from registry to deployment without changing tooling.
Serving handles endpoint versioning, traffic routing between different versions and monitoring of inference metrics — latency, throughput, prediction distribution.
Evaluation and LLMs
The evaluation component of MLflow 2.x introduces standardised metrics for assessing models before and after deployment. For traditional models it computes classic metrics such as accuracy, F1 and RMSE; for large language models, specific metrics such as toxicity, relevance and faithfulness, calculated automatically on evaluation datasets.
Integration with LangChain and LLM application frameworks positions MLflow as a management layer for the new generation of AI applications as well, where the “model” is a chain of prompts, retrieval and generation.
A de facto standard
The open source MLOps ecosystem has consolidated around a few tools. MLflow, with its modular approach — one can use only tracking, or only the registry, without adopting the entire platform — has established itself as the standard for organisations that want to manage the ML lifecycle without depending on a single cloud provider.
Link: mlflow.org
