Apache Spark: in-memory distributed processing

Apache Spark introduces the RDD model and in-memory execution to distributed processing, achieving speeds up to 100x faster than Hadoop MapReduce for iterative workloads.

Tags: Open Source · R&D · Apache Spark · Big Data · In-Memory · Distributed Computing

From Berkeley’s lab to Apache top-level

Apache Spark originated at UC Berkeley's AMPLab as a research project aimed at overcoming the limitations of Hadoop MapReduce in the iterative processing of large datasets. In February 2014 the project was promoted to a top-level project of the Apache Software Foundation, a signal of maturity and growing adoption across the Big Data community.

The founding idea is straightforward: if an algorithm must traverse the same data multiple times — as happens in machine learning, graph analysis and interactive queries — rewriting that data to disk at every iteration is wasteful. Spark keeps data in RAM between iterations, drastically reducing latency.
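The cost difference can be sketched in plain Python (a conceptual illustration, not Spark code; the `load_from_storage` function and the iteration count are hypothetical): re-reading the data on every iteration, MapReduce-style, versus loading it once and keeping it in RAM.

```python
# Conceptual sketch (plain Python, not Spark): why keeping data in
# memory between iterations matters for iterative algorithms.

storage_reads = 0

def load_from_storage():
    """Stand-in for an expensive read from disk/HDFS."""
    global storage_reads
    storage_reads += 1
    return list(range(1_000))

# MapReduce-style: the dataset is re-read on every iteration.
for _ in range(10):
    data = load_from_storage()
    total = sum(data)
reads_without_cache = storage_reads

# Spark-style: load once, keep the dataset in RAM, iterate over it.
storage_reads = 0
cached = load_from_storage()
for _ in range(10):
    total = sum(cached)
reads_with_cache = storage_reads

print(reads_without_cache, reads_with_cache)  # 10 1
```

Ten iterations cost ten storage reads without caching, but only one with the dataset held in memory; the gap widens with every additional pass over the data.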

RDDs and lazy transformations

Spark’s central abstraction is the Resilient Distributed Dataset (RDD): an immutable, fault-tolerant collection of elements partitioned across a cluster. RDDs are created from data on HDFS, from local files, or from transformations of other RDDs.
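Partitioning is the part of the abstraction that makes parallelism possible: each partition can be processed independently on a different node. A minimal sketch in plain Python (the `partition` helper is hypothetical, loosely mirroring what `SparkContext.parallelize(data, numPartitions)` does behind the scenes):

```python
# Conceptual sketch (plain Python, not Spark): an RDD is one logical
# collection split into partitions that live on different cluster nodes.

def partition(data, num_partitions):
    """Split a list into roughly equal contiguous partitions."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one element more.
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

parts = partition(list(range(10)), 3)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```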

Operations on RDDs fall into two categories: transformations — such as map, filter, flatMap, groupByKey — and actions — such as count, collect, saveAsTextFile. Transformations are lazy: they are not executed immediately but recorded in a directed acyclic graph (DAG) of dependencies. Only when an action is invoked does Spark’s DAG scheduler analyse the graph, optimise it and split it into stages executed in parallel across cluster nodes.
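The transformation/action split can be mimicked in a few lines of plain Python (a conceptual sketch, not Spark; the `LazyDataset` class is hypothetical and only imitates the shape of the RDD API): transformations append to a recorded lineage, and only an action walks that lineage and computes.

```python
# Conceptual sketch (plain Python, not Spark): transformations are
# recorded lazily; only an action triggers execution.

class LazyDataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []  # recorded lineage of transformations

    # --- transformations: return a new dataset, run nothing yet ---
    def map(self, f):
        return LazyDataset(self.data, self.ops + [("map", f)])

    def filter(self, pred):
        return LazyDataset(self.data, self.ops + [("filter", pred)])

    # --- actions: walk the lineage and actually compute ---
    def collect(self):
        items = iter(self.data)
        for kind, f in self.ops:
            items = (map if kind == "map" else filter)(f, items)
        return list(items)

    def count(self):
        return len(self.collect())

ds = LazyDataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; ds.ops merely records the plan.
result = ds.collect()  # the action runs the whole pipeline
print(result)  # [0, 4, 16, 36, 64]
```

Because the lineage is kept rather than the intermediate results, a lost partition can be recomputed from it, which is also how real RDDs achieve fault tolerance without replication.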

This lazy evaluation enables optimisations that MapReduce cannot perform: Spark fuses consecutive transformations, avoids unnecessary shuffles and plans execution globally rather than operation by operation.
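The fusion idea can be illustrated with Python generators (again a conceptual sketch, not Spark): chained lazy operations form one streaming pass, so no intermediate collection is materialised between the map and the filter, and elements are only pulled as the consumer demands them.

```python
# Conceptual sketch (plain Python, not Spark): fused transformations
# process each element through the whole pipeline in a single pass.

touched = []

def tag(x):
    touched.append(x)  # record that the element entered the pipeline
    return x

data = range(5)
# Generator chaining: filter and map are fused into one streaming pass.
pipeline = (x * 2 for x in (tag(x) for x in data) if x % 2 == 0)

first = next(pipeline)  # pulls elements only until the first match
print(first, touched)   # 0 [0]

rest = list(pipeline)   # consuming the rest pulls the remaining elements
print(rest, touched)    # [4, 8] [0, 1, 2, 3, 4]
```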

A unified framework

Beyond the core engine, Spark integrates specialised libraries within the same framework:

  • Spark SQL: querying structured data with SQL syntax, integration with Hive
  • MLlib: a distributed machine learning library with classification, regression, clustering and recommendation algorithms
  • GraphX: graph processing and parallel computation on graph structures
  • Spark Streaming: stream processing via micro-batches

The advantage of having everything in a single framework is that data can flow between SQL, machine learning and graph analysis without intermediate serialisation. For iterative workloads, benchmarks show speeds up to 100 times faster than MapReduce, thanks to the combination of in-memory execution and DAG optimisation.

Link: spark.apache.org
