scikit-learn: accessible machine learning in Python

scikit-learn 0.1 brings classification, regression and clustering to Python with a uniform fit/predict/transform API, built on NumPy and SciPy.


From Google Summer of Code to the reference library

In 2007 David Cournapeau starts a Google Summer of Code project with the goal of creating a machine learning library for Python. The project is called scikit-learn, where "scikit" (short for SciPy toolkit) denotes an add-on package built around the SciPy project. In 2010 it reaches version 0.1, its first public release, led by Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel at INRIA, the French national institute for research in computer science and automation.

The objective is to make machine learning accessible to anyone who can program in Python, without requiring complex enterprise frameworks or advanced mathematical knowledge to get started.

A uniform API

The most influential architectural choice of scikit-learn is the uniform API that spans all algorithms. Every estimator, whether for classification, regression or clustering, is trained on data with fit. Models that make predictions expose predict to generate predictions on new data, while preprocessing and dimensionality reduction objects expose transform to convert data (normalisation, dimensionality reduction, encoding). Not every object offers all three methods, but each method behaves the same way wherever it appears.
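The fit/predict/transform convention described above can be sketched as follows. This is a minimal illustration that assumes a recent scikit-learn release (imported as `sklearn`); import paths and class names in version 0.1 may differ, and the toy data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Toy data invented for illustration: y = 2*x0 + 3*x1
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0]])
y = 2 * X[:, 0] + 3 * X[:, 1]

# A transformer learns its parameters in fit() and applies them in transform().
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# A predictor learns in fit() and answers queries in predict().
model = Ridge(alpha=1.0)
model.fit(X_scaled, y)
y_pred = model.predict(X_scaled)
```

The scaler has no predict method and the regressor has no transform method, yet both are trained through the same fit call.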

This uniformity means that switching algorithms often requires just one line of code. A decision tree and a Support Vector Machine are trained and queried in the same way. The same holds for preprocessing and feature extraction methods: each transformation composes with others in a sequential pipeline, where the output of one step becomes the input of the next.
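As a rough sketch of both points, the snippet below trains a decision tree and an SVM through the same code path and chains a scaling step in front of each. It assumes a recent scikit-learn release with the `Pipeline` class; `build_pipeline` is a hypothetical helper introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def build_pipeline(classifier):
    """Hypothetical helper: scale the features, then classify."""
    return Pipeline([("scale", StandardScaler()), ("clf", classifier)])


results = {}
# Swapping the algorithm only changes the object passed in.
for name, clf in [("tree", DecisionTreeClassifier(random_state=0)), ("svm", SVC())]:
    pipe = build_pipeline(clf)
    pipe.fit(X, y)                    # the scaler's output feeds the classifier
    results[name] = pipe.predict(X[:5])
```

The pipeline itself exposes fit and predict, so it can be used anywhere a single model would be.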

Algorithms and foundations

Version 0.1 includes classification algorithms (SVM, k-nearest neighbours, decision trees), regression (linear, Ridge, Lasso), clustering (k-means, hierarchical clustering) and dimensionality reduction methods such as PCA. The algorithms are written in Python, with C and Cython extensions for the performance-critical sections.
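Unsupervised methods follow the same pattern as the supervised ones. The sketch below runs k-means and PCA on synthetic data; it assumes a recent scikit-learn release, and the two blobs are generated only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Two well-separated synthetic blobs in three dimensions.
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.3, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: fit() finds the centroids, predict() assigns points to them.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)
labels = kmeans.predict(X)

# Dimensionality reduction: fit() learns the components, transform() projects.
pca = PCA(n_components=2)
X_2d = pca.fit(X).transform(X)
```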

The library is built on NumPy for multidimensional array handling and SciPy for linear algebra and optimisation routines. It does not reinvent these foundations but composes them, focusing on learning algorithms and their usability.
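In practice this means data goes in as NumPy arrays and learned parameters come back as NumPy arrays. A minimal sketch, assuming a recent scikit-learn release and made-up data lying on the line y = 3x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Inputs are plain NumPy arrays: one row per sample, one column per feature.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 1.0

model = LinearRegression().fit(X, y)

# Learned parameters are exposed as NumPy values on the fitted model.
slope = model.coef_[0]
intercept = model.intercept_
pred = model.predict(np.array([[10.0]]))
```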

Documentation includes executable examples for every algorithm, with reference datasets bundled in the library. Scikit-learn is released under the BSD licence, compatible with commercial use and integration into proprietary products.
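The bundled datasets can be loaded without any download. The sketch below uses the handwritten digits dataset and assumes a recent scikit-learn release; the loader available in version 0.1 may differ.

```python
from sklearn.datasets import load_digits

# Load a reference dataset that ships with the library.
digits = load_digits()
X, y = digits.data, digits.target

n_samples, n_features = X.shape   # each 8x8 image is flattened into 64 features
n_classes = len(digits.target_names)
```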

Link: scikit-learn.org
