From Berkeley’s lab to Apache top-level
Apache Spark originated at UC Berkeley's AMPLab as a research project aimed at overcoming the limitations of Hadoop MapReduce in iterative processing of large datasets. In February 2014 the project was promoted to a top-level project of the Apache Software Foundation, a signal of its maturity and growing adoption across the Big Data community.
The founding idea is straightforward: if an algorithm must traverse the same data multiple times — as happens in machine learning, graph analysis and interactive queries — rewriting that data to disk at every iteration is wasteful. Spark keeps data in RAM between iterations, drastically reducing latency.
RDDs and lazy transformations
Spark’s central abstraction is the Resilient Distributed Dataset (RDD): an immutable, fault-tolerant collection of elements partitioned across a cluster. RDDs are created from data on HDFS, from local files, or from transformations of other RDDs.
Operations on RDDs fall into two categories: transformations — such as map, filter, flatMap, groupByKey — and actions — such as count, collect, saveAsTextFile. Transformations are lazy: they are not executed immediately but recorded in a directed acyclic graph (DAG) of dependencies. Only when an action is invoked does Spark’s DAG scheduler analyse the graph, optimise it and split it into stages executed in parallel across cluster nodes.
This lazy evaluation enables optimisations that MapReduce cannot perform: Spark fuses consecutive transformations, avoids unnecessary shuffles and plans execution globally rather than operation by operation.
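The lazy model described above can be sketched in a few lines of plain Python. This is a conceptual illustration, not the real Spark API: the `LazyDataset` class and its internals are assumptions made for the sketch, and real RDDs add partitioning, fault tolerance and a far richer operator set. What it does show is the core mechanism: transformations only record nodes in a dependency graph, and an action walks that graph and executes the whole chain in a single pass.

```python
# Conceptual sketch of Spark-style lazy evaluation (plain Python,
# not the actual RDD API). Transformations build a lineage graph;
# actions traverse it and trigger the computation.

class LazyDataset:
    def __init__(self, data=None, parent=None, op=None):
        self._data = data      # only set for a source dataset
        self._parent = parent  # upstream node in the lineage graph
        self._op = op          # deferred function, applied on execution

    # --- transformations: return a new node, compute nothing ---
    def map(self, f):
        return LazyDataset(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return LazyDataset(parent=self, op=lambda it: (x for x in it if pred(x)))

    # --- internal: chain the whole lineage as nested generators ---
    def _iterate(self):
        if self._parent is None:
            return iter(self._data)
        return self._op(self._parent._iterate())

    # --- actions: force execution of the recorded graph ---
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


nums = LazyDataset(data=range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has executed yet: only two graph nodes were recorded.
# collect() runs filter and map fused in one pass, with no
# intermediate collection materialised between them.
print(evens_squared.collect())  # → [0, 4, 16, 36, 64]
```

Because the chained generators pull elements one at a time, the `filter` and `map` steps are effectively fused, which mirrors (in miniature) how Spark's DAG scheduler pipelines consecutive narrow transformations inside a single stage.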
A unified framework
Beyond the core engine, Spark integrates specialised libraries within the same framework:
- Spark SQL: querying structured data with SQL syntax, integration with Hive
- MLlib: a distributed machine learning library with classification, regression, clustering and recommendation algorithms
- GraphX: graph processing and parallel computation on graph structures
- Spark Streaming: stream processing via micro-batches
The advantage of having everything in a single framework is that data can flow between SQL, machine learning and graph analysis without intermediate serialisation. For iterative workloads, benchmarks show speedups of up to 100 times over MapReduce, thanks to the combination of in-memory execution and DAG optimisation.
Link: spark.apache.org
