Hadoop: the framework for large-scale distributed processing

Inspired by Google's GFS and MapReduce papers, Hadoop offers distributed storage (HDFS) and parallel computation (MapReduce) on commodity hardware with native fault tolerance.

Tags: Open Source, R&D, Hadoop, MapReduce, HDFS, Big Data, Java

From Google’s papers to open source software

In 2003 and 2004 Google published two academic papers describing the internal architecture of its own systems: the Google File System (GFS) for distributed storage and MapReduce for parallel data processing. The papers are public, but the code is not. Hadoop was born as an open source implementation of these ideas, initially developed by Doug Cutting and Mike Cafarella as a subproject of Nutch, an open source web search engine.

In 2006 the project entered the Apache Software Foundation as an independent project. It is written in Java and designed to run on clusters of commodity hardware — inexpensive servers with no specialised components.

HDFS: distributed storage

The Hadoop Distributed File System (HDFS) is a file system designed to store large files — on the order of gigabytes or terabytes — distributing them across tens or hundreds of machines. The architecture follows the master/worker model:

  • The NameNode is the master node that manages file system metadata: directory structure, block locations, replicas. It does not store data, but knows where every block resides
  • DataNodes are the worker nodes that physically store data blocks. Each file is split into fixed-size blocks (typically 64 MB) and each block is replicated across three different nodes by default
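The block and replication figures above translate into simple arithmetic. A minimal Java sketch (the 1 GiB file size is a hypothetical example; the 64 MB block size and replication factor of 3 are the defaults mentioned above):

```java
public class BlockMath {
    // Number of fixed-size blocks needed to store a file (ceiling division).
    public static long blocksFor(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB default block size
        int replication = 3;                  // default replication factor
        long fileSize = 1024L * 1024 * 1024;  // hypothetical 1 GiB file

        long blocks = blocksFor(fileSize, blockSize);
        System.out.println(blocks + " blocks, "
                + blocks * replication + " block replicas cluster-wide");
        // → 16 blocks, 48 block replicas cluster-wide
    }
}
```

Each of those 48 replicas lands on a different DataNode than its siblings, which is what allows the NameNode to rebuild the full replica count when a node disappears.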

Fault tolerance is an architectural property, not an exception. HDFS assumes that hardware failures are the norm: when a DataNode becomes unreachable, the NameNode automatically initiates re-replication of the lost blocks onto other available nodes.

MapReduce: the computation model

The MapReduce framework enables parallel processing of large data volumes across an HDFS cluster. The programmer defines two functions:

  • Map: receives a key-value pair as input and produces a set of intermediate key-value pairs. Each node in the cluster runs the map function on its local data blocks, avoiding data movement across the network
  • Reduce: aggregates all intermediate pairs sharing the same key, producing the final result

The framework automatically handles work distribution, intermediate data transfer (shuffle), task monitoring and restart of failed tasks. The programmer focuses on processing logic; the complexity of distribution is hidden by the infrastructure.
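The canonical illustration of this model is word counting. The sketch below simulates the map, shuffle (group-by-key), and reduce phases using plain Java collections — it deliberately does not use Hadoop's actual `Mapper`/`Reducer` API, and the class and method names are illustrative only:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Map phase: each input line yields one (word, 1) pair per word.
    public static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: aggregate all values that share the same key.
    public static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> lines = List.of("the quick brown fox", "the lazy dog");

        // Shuffle: group intermediate pairs by key
        // (in Hadoop, the framework does this between map and reduce).
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(l -> map(l).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue,
                                Collectors.toList())));

        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        System.out.println(result);
        // → {brown=1, dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```

In a real Hadoop job, each `map` call runs on the DataNode holding the input block, and only the intermediate pairs cross the network during the shuffle.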

An architecture for data at scale

Hadoop is not a database and does not replace relational systems. It is an infrastructure for batch processing of datasets too large to be handled by a single server. The approach is simple but effective: bring computation to where the data resides, rather than moving data to the computation. On a cluster of hundreds of commodity nodes, Hadoop enables the processing of petabytes of data at modest hardware cost, with fault tolerance built into the design.

Link: hadoop.apache.org
