Hadoop vs Berkeley

The big-data ecosystem is currently very “collaborative”: each project is ready to run on almost every available platform, and this is good. Recently, however, two factions have been forming: people who use the Hadoop 2.0 stack and people who use the BDAS.

Hadoop 2.0 Stack

[Image: Hadoop 2.0 stack diagram]

The most important difference between Hadoop 2.0 and previous versions is YARN, the new cluster resource manager and next-generation MapReduce. On top of YARN you can run almost every kind of big-data project:

  • Traditional MapReduce (the new version is backward compatible), with Hive, Pig or Cascading for queries
  • Interactive, near-real-time MapReduce (using Tez)
  • HBase and Accumulo
  • Storm and S4 for stream processing
  • Giraph for graph processing
  • OpenMPI for message passing
  • Spark as in-memory MapReduce
  • HDFS as distributed filesystem
  • more…
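
For readers new to the model that most of the items above build on, here is a minimal sketch of the MapReduce idea in plain Python. It is not the Hadoop or Spark API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group values by key, as a framework would do between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values collected for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run the three phases one after another on an in-memory list of lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop runs MapReduce", "Spark runs on Mesos"]))
```

On a real cluster the map and reduce steps run in parallel on many machines, and the resource manager decides where each task is placed.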

The most interesting companies here are Intel, Cloudera, MapR and Hortonworks.

BDAS (Berkeley Data Analytics Stack)

[Image: Berkeley Data Analytics Stack diagram]

In the BDAS everything is built around Mesos, the cluster resource manager. It is a relatively new project, but it is already widely used. Traditional HDFS is accelerated by Tachyon, an in-memory file system. The main integration is around Spark, which is the base for the higher-level components of the stack.

Mesos can also run a traditional Hadoop environment and other projects (such as Storm and OpenMPI). You can even run traditional applications (including Rails apps) using Marathon.
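
To give an idea of what running a Rails app through Marathon could look like, here is a rough sketch that builds a Marathon-style app definition as JSON. The app id, command and resource numbers are hypothetical values chosen for illustration, and the helper marathon_app is not part of any library. The sketch only builds the payload; in a real setup it would typically be submitted to Marathon's REST API (the /v2/apps endpoint in recent versions).

```python
import json


def marathon_app(app_id, cmd, cpus=0.5, mem=256, instances=1):
    # Assemble the basic fields of an app definition:
    # what to run, how many resources per instance, and how many copies.
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,
        "instances": instances,
    }


# Hypothetical Rails app; $PORT is expected to be filled in by Marathon at launch.
app = marathon_app("/rails-demo", "bundle exec rails server -p $PORT", instances=2)
payload = json.dumps(app, indent=2)

if __name__ == "__main__":
    print(payload)
```

Marathon then asks Mesos for matching resources and keeps the requested number of instances running.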

The most interesting companies here are Databricks and Mesosphere.

Who will win? 😀