WEKA: open source data mining and applications in biomedical research

The release of WEKA 3.4 (June 2003), the University of Waikato, the Witten and Frank book, the included classifiers (J48, naive Bayes, SMO) and typical uses in biomedical research — sample classification, gene expression, outcome prediction.


An integrated data mining environment

WEKA (Waikato Environment for Knowledge Analysis) is an open source data mining and machine learning environment developed at the University of Waikato (New Zealand) by the group of Ian H. Witten and Eibe Frank. Born in the first half of the 1990s as a Tcl/Tk project and rewritten in Java at the end of the decade, WEKA is today one of the most widely used tools for applied machine learning in teaching, research and industry.

Version 3.4, released in June 2003, is an important release: it stabilises the code base, consolidates the API and will become the community's reference version for years to come. WEKA is distributed under the GNU General Public License.

The environment is paired with the book “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations” by Witten and Frank (Morgan Kaufmann, 2000) — a reference text describing the algorithms implemented in WEKA, which has become standard in university machine learning courses.

The interface

WEKA offers multiple ways to use it:

  • Explorer — graphical interface for interactive dataset exploration, algorithm application and result evaluation, with tabular views, per-attribute histograms, scatter plots and graphically rendered decision trees
  • Experimenter — environment for rigorous comparative experiments: repeated cross-validation, statistical tests, classifier comparisons with significance
  • KnowledgeFlow — graph-based interface to build processing pipelines visually, a modern alternative to scripting
  • Command line — direct algorithm invocation, useful for automation and batch pipelines
  • Java API — WEKA classes can be used as a library within custom Java applications
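
As a rough illustration of the command-line route, the following Python sketch assembles one WEKA invocation per classifier for a batch run. The jar location (weka.jar), the file name samples.arff and the helper weka_command are hypothetical. The class names assume the package layout weka.classifiers.trees / bayes / functions, while -t (training file) and -x (number of cross-validation folds) are the usual WEKA classifier options. The commands are only built and printed, not executed.

```python
def weka_command(classifier, arff_path, folds=10, weka_jar="weka.jar", options=()):
    """Build the argument list for one cross-validated WEKA run.

    weka_jar and arff_path are placeholders; adjust them to the local setup.
    """
    return ["java", "-cp", weka_jar, classifier,
            "-t", arff_path,      # training data in ARFF format
            "-x", str(folds),     # number of cross-validation folds
            *options]


# Assumed class names for the three classifiers named in this article.
classifiers = [
    "weka.classifiers.trees.J48",
    "weka.classifiers.bayes.NaiveBayes",
    "weka.classifiers.functions.SMO",
]

commands = [weka_command(name, "samples.arff") for name in classifiers]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to a process launcher such as subprocess.run once the paths point at a real installation. Keeping the commands as data makes the batch easy to log and repeat.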

The canonical data format is ARFF (Attribute-Relation File Format), a CSV extended with attribute type and category metadata — which has become a shared format in the ML literature for reproducible dataset exchange.
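
To make the format concrete, here is a minimal Python sketch that renders a small table as ARFF text. The helper to_arff, the relation name and the two-gene dataset are invented for illustration. The sketch covers only numeric and nominal attributes and does not quote names or values containing spaces.

```python
def to_arff(relation, attributes, rows):
    """Render a dataset as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of category labels (a nominal attribute).
    rows: list of value lists, one per instance, in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # Missing values are written as '?' in ARFF.
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical expression values for two genes, with an ALL/AML class label.
attributes = [("gene_a", "numeric"), ("gene_b", "numeric"), ("class", ["ALL", "AML"])]
rows = [[0.82, 1.40, "ALL"], [2.10, None, "AML"]]
arff_text = to_arff("leukaemia_demo", attributes, rows)
print(arff_text)
```

The header carries what a plain CSV lacks: the type of each column and, for nominal attributes, the full set of categories, so a class absent from one file is still declared.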

Included algorithms

Version 3.4 includes a broad catalogue of algorithms, grouped into families:

  • Decision trees — J48 (Java implementation of Ross Quinlan’s C4.5), ID3, RandomTree, REPTree
  • Bayesian — NaiveBayes, BayesNet
  • Instance-based — IBk (k-nearest neighbours), IB1
  • Functions — SMO (Sequential Minimal Optimization for Support Vector Machines), Logistic, LeastMedSq
  • Neural networks — MultilayerPerceptron with backpropagation
  • Meta-classifiers — Bagging, AdaBoost, Stacking, Vote, RandomCommittee
  • Regression — LinearRegression, M5P (model trees)
  • Clustering — SimpleKMeans, EM, DBSCAN, Cobweb
  • Association rules — Apriori, Tertius
  • Attribute selection — CfsSubsetEval, InfoGainAttributeEval, Wrapper, ChiSquaredAttributeEval

For each classifier WEKA provides the trained model, per-class probability outputs (where applicable) and performance metrics (accuracy, precision, recall, F1, ROC AUC, kappa), estimated by stratified cross-validation or a configurable holdout split.
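
For readers less familiar with these metrics, the sketch below computes several of them from the four cells of a binary confusion matrix. It is an independent illustration in Python, not WEKA's own code. The function name binary_metrics and the counts are invented, and AUC is left out because it needs ranked scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Agreement expected by chance, from the row and column marginals.
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_pos + chance_neg
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical counts: 45 malignant and 55 benign cases, malignant as positive.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Kappa is worth reading next to accuracy on clinical data: it discounts the agreement a classifier would reach by chance given the class marginals.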

Uses in biomedical research

WEKA has become a recurring tool in quantitative biomedical literature. The most common use cases:

  • Classification of clinical samples on molecular data — in particular gene expression microarray data. The Golub leukaemia dataset (1999) — with 72 samples and 7129 genes — is one of the classic benchmarks on which several published works have compared WEKA algorithms to distinguish AML from ALL
  • Differential diagnosis from clinical parameters — datasets like Wisconsin Breast Cancer Diagnostic (UCI Machine Learning Repository) are used to compare malignant/benign classification models from cellular measurements
  • Clinical outcome prediction — ICU mortality, hospital readmission, treatment response, progression of chronic disease; datasets built from enterprise clinical records
  • Physiological signal analysis — ECG classification (arrhythmias), EEG (sleep stages, epilepsy), EMG. WEKA typically runs downstream of a signal feature extraction phase
  • Pharmacological data mining — adverse effect analysis, drug-drug interactions, dose optimisation
  • Imaging diagnostics — after extraction of features (texture, moments, shape descriptors) from ROIs in CT/MR/histological images, WEKA classifies benign/malignant, tumour subtypes, treatment response

The typical pattern is: pre-processing in a specialist tool (R/Bioconductor for microarrays, MATLAB for signals, ITK/VTK for imaging) → feature extraction → export to ARFF → classification in WEKA → comparative algorithm evaluation.

Systematic comparison

One of WEKA’s most relevant contributions to scientific practice is not a specific algorithm, but the standardisation of comparison. The Experimenter allows one to:

  • Define a set of datasets
  • Define a set of classifiers with hyperparameters
  • Run N repeated cross-validations with the same random splits for all classifiers
  • Apply statistical tests (corrected paired t-tests, Friedman test, Nemenyi post-hoc) to determine whether an accuracy difference is significant
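
Two of these steps can be sketched outside WEKA. The Python example below is an illustration under stated assumptions, not the Experimenter's implementation. It builds stratified folds from a fixed seed, so every classifier is evaluated on identical splits. It then computes a corrected resampled t statistic using the variance correction usually attributed to Nadeau and Bengio, assumed here to be what "corrected paired t-test" refers to. The labels and the per-fold accuracy differences are invented.

```python
import math
import random
import statistics


def stratified_folds(labels, k, seed):
    """Split instance indices into k folds with similar class proportions."""
    rng = random.Random(seed)  # fixed seed: every classifier sees the same splits
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:  # deal instances round-robin across the folds
            folds[position % k].append(index)
            position += 1
    return folds


def corrected_resampled_t(diffs, n_train, n_test):
    """t statistic for paired per-fold differences, with the variance term
    inflated by n_test / n_train to account for overlapping training sets."""
    k = len(diffs)
    mean = statistics.mean(diffs)
    variance = statistics.variance(diffs)  # sample variance (k - 1 denominator)
    return mean / math.sqrt((1.0 / k + n_test / n_train) * variance)


# Hypothetical labels and per-fold accuracy differences (classifier A minus B).
labels = ["ALL"] * 6 + ["AML"] * 4
folds = stratified_folds(labels, k=2, seed=1)
diffs = [0.02, 0.04, 0.01, 0.03, 0.05, 0.03, 0.02, 0.04, 0.03, 0.03]
t_corrected = corrected_resampled_t(diffs, n_train=90, n_test=10)
t_plain = statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / len(diffs))
print(folds)
print(round(t_corrected, 3), round(t_plain, 3))
```

The corrected statistic is compared against a Student t distribution with k - 1 degrees of freedom. It is always smaller in magnitude than the uncorrected one, which makes the test more conservative when folds share most of their training data.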

Before WEKA, published comparisons were often performed with ad-hoc, not always replicable protocols. The availability of a common validation framework has contributed to professionalising experimental practice in machine learning, including in biomedical publications.

Limits

WEKA in 2003 has recognised limits:

  • Scalability — the Java implementation, designed for didactic clarity, is not optimised for datasets of millions of records; as size grows, algorithms like SMO and MultilayerPerceptron become slow
  • No deep learning — available neural networks are shallow MLPs trained with classic backpropagation; deep learning as a paradigm has not yet emerged (AlexNet will not be published until 2012)
  • Pre-computed features — WEKA requires attributes to be already extracted; it does not offer complex representation transformations (embedding, automatic feature learning)
  • Imbalanced classes — support for imbalanced datasets (frequent in medicine, where the class of interest is in the minority) requires classifiers and metrics to be chosen manually (cost-sensitive learning is supported but not automatic)
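
The idea behind cost-sensitive learning can be shown in a few lines. This Python sketch illustrates the general minimum-expected-cost decision rule, not WEKA's implementation. The function name, the probabilities and the cost values are assumptions chosen for the example.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Pick the predicted class with the lowest expected misclassification cost.

    probabilities: dict mapping class label -> estimated probability.
    cost_matrix: dict mapping (actual, predicted) -> cost; missing pairs cost 0.
    """
    def expected_cost(predicted):
        return sum(p * cost_matrix.get((actual, predicted), 0.0)
                   for actual, p in probabilities.items())
    return min(probabilities, key=expected_cost)


# Hypothetical costs: missing a malignant case is ten times worse than a false alarm.
costs = {("malignant", "benign"): 10.0, ("benign", "malignant"): 1.0}
probs = {"benign": 0.8, "malignant": 0.2}

most_probable = max(probs, key=probs.get)
decision = min_expected_cost_class(probs, costs)
print(most_probable, decision)
```

With these numbers the most probable class is benign, but predicting benign carries an expected cost of 0.2 × 10 = 2.0 against 0.8 × 1 = 0.8 for predicting malignant, so the cost-sensitive decision flips. Choosing the cost matrix remains a clinical judgement, which is why this step is manual.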

For moderate-sized datasets with well-designed features, the most common situation in quantitative clinical research, these limits are largely manageable.

The context of open source ML tools

In mid-2003 WEKA sits in a still limited landscape of open source machine learning tools:

  • R with packages e1071 (SVM), rpart (trees), randomForest (random forests just published by Breiman in 2001) — main environment for statisticians
  • Torch (original version from IDIAP, Switzerland) — C++ machine learning library, including neural networks
  • Shogun — C++ library for kernel methods, in development
  • PyML — early Python SVM library

WEKA stands out for maturity, completeness, graphical interface and didactic documentation. For a biomedical researcher approaching data mining without strong computer science training, it is the most natural entry point.

Derivative applications and tools

Alongside direct use, WEKA is incorporated as a library into other tools:

  • KNIME (Konstanz Information Miner) — data mining platform with visual workflows, which will include WEKA nodes in subsequent releases
  • RapidMiner (then YALE) — another visual system integrating WEKA algorithms
  • Integration with Pentaho BI Suite — for business intelligence scenarios

Various clinical research projects build small applicative interfaces on top of the WEKA API, specialised for specific pathologies — a recurring reuse pattern.

Outlook

WEKA will continue to develop over the coming years, with improvements to core algorithms, new classifiers (Random Forest, advanced ensemble methods) and better scalability support. The advent of deep learning frameworks, which will start to change the field from around 2010, will require a repositioning: WEKA will remain fundamental for “classical” learning scenarios with tabular datasets, which still represent the majority of quantitative clinical applications. For deep learning on “raw” images and physiological signals, other tools will emerge.

For quantitative clinical research today, WEKA 3.4 is a robust, open source, well-documented tool. Compatibility with R (via ARFF and export scripts), with MATLAB (interop plugins) and with Java applications (via the API) makes it suited to integration in broader systems.


References: WEKA 3.4, University of Waikato (www.cs.waikato.ac.nz/ml/weka), released June 2003. Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (Morgan Kaufmann, 2000). GNU GPL licence. ARFF format. UCI Machine Learning Repository. Golub leukaemia dataset (1999).
