data_science

Over the last year I refined my RSS collection about big data, data science and analytics. I usually check it every day to discover a ton of cool new technologies and have fun. Here is the updated list.

Bloggers

News about emerging technologies, scalability and data

Data companies, social networks and search engines

Companies supporting and distributing big-data processing products

I recently discovered the awesome data science list, which includes many interesting bloggers I haven’t had time to check yet. You can surely find something more in it. I’ll try to publish an update once I have checked it.

[UPDATE 2014-09-22 11:35]

Thanks to @onurakpolat for correcting my link to the awesome data science list. The previous link pointed to his fork; the original repo is https://github.com/okulbilisim/awesome-datascience by @okulbilisim.

The big-data environment is really “collaborative” at the moment. Each project is ready to run on almost every available platform, and this is good. Recently, however, two factions have been forming: people who use the Hadoop 2.0 stack and people who use the BDAS.

Hadoop 2.0 Stack

hadoop_stack

The most important difference between Hadoop 2.0 and previous versions is YARN, the new cluster resource manager and next generation MapReduce. It can run almost every kind of big-data project:

  • Traditional MapReduce (the new version is backward compatible) with Hive, Pig or Cascading for queries
  • Interactive near real-time MapReduce (using Tez)
  • HBase and Accumulo
  • Storm and S4 for stream processing
  • Giraph for graph processing
  • OpenMPI for message passing
  • Spark as in-memory MapReduce
  • HDFS as distributed filesystem
  • more…
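For readers new to the programming model behind the first item in the list above, here is a minimal sketch of a MapReduce-style word count in plain Python. It does not use Hadoop or YARN at all; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this illustration, and everything runs in a single process, whereas a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence (illustrative only, single process)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science is fun"]
    print(word_count(docs))
```

The point of the split is that the map and reduce steps only ever see one record or one key at a time, which is what lets a resource manager schedule them on whatever nodes are free.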

The most interesting companies here are Intel, Cloudera, MapR and Hortonworks.

BDAS (Berkeley Data Analytics Stack)

berkeley_stack

On the BDAS everything is built around Mesos, the cluster resource manager. It’s a relatively new project, but it is already widely used. Traditional HDFS is accelerated by Tachyon (an in-memory file system). The main integration is around Spark, which is the base for:

Mesos can also run a traditional Hadoop environment and other projects (such as Storm and OpenMPI). You can also run traditional applications (even Rails apps) using Marathon.
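To give an idea of what running a traditional application through Marathon might look like, here is a rough sketch that builds an app definition as JSON. The field names (`id`, `cmd`, `cpus`, `mem`, `instances`) follow the Marathon REST API as I remember it, so check them against the official docs before relying on them. The helper `build_marathon_app`, the app id `/rails-demo`, the resource values and the Rails command are all hypothetical. The sketch only prints the definition and does not contact any cluster.

```python
import json


def build_marathon_app(app_id, command, cpus=0.5, mem_mb=256, instances=1):
    """Build a dict shaped like a Marathon app definition (hypothetical helper)."""
    return {
        "id": app_id,            # unique application id
        "cmd": command,          # shell command run for each instance
        "cpus": cpus,            # CPU share requested per instance
        "mem": mem_mb,           # memory in MB requested per instance
        "instances": instances,  # how many copies to keep running
    }


if __name__ == "__main__":
    # Assumption: the port assigned by Marathon is available to the command as $PORT.
    app = build_marathon_app(
        "/rails-demo",
        "bundle exec rails server -p $PORT",
        instances=2,
    )
    print(json.dumps(app, indent=2))
```

In practice a definition like this would be submitted to a running Marathon instance, which then asks Mesos for the resources and keeps the requested number of instances alive.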

The most interesting companies here are Databricks and Mesosphere.

Who will win? 😀