A few days ago I was at Codemotion in Milan and I had the opportunity to discover some insights about the technologies used by two of our main competitors in Italy: BlogMeter and Datalytics. It’s quite interesting because, even though the technical challenges are almost the same, each company uses a different approach with a different stack.

datalytics_logo

Datalytics is a relatively new company, founded 4 months ago. They had a desk at Codemotion to show their products and recruit new people. I chatted with Marco Caruso, the CTO (who probably didn’t know who I was, sorry Marco, I just wanted to avoid hostility 😉 ), about the technologies they use and the developer profile they were looking for. The required skills were:

Their tech team is composed of 4 developers (including the CTO) and their main products are Datalytics Monitoring™ (a sort of statistical dashboard that shows buzz stats in real time) and Datalytics Engage™ (a real-time analytics dashboard for live events). I have no technical insights about how their systems work, but I can guess some details by inferring them from the buzzwords they use.

Supported sources are Twitter, Facebook (public data only), Instagram, YouTube, Vine (the logos are on their website) and probably Pinterest.

They use DataSift as a data source in addition to the standard APIs. I suppose their processing pipeline uses Storm to manage the streaming input, maybe with an import layer in front of it. Data is crunched using Hadoop and Java, and results are stored in MongoDB (Massimo Brignoli, the Italian MongoDB evangelist, advertised their company during his presentation, so I suppose they use it heavily).

Node.js is probably used for the frontend. It is fast enough for near real-time applications (also using websockets) and plays really well with both Angular.js and MongoDB (the MEAN stack). D3.js is obviously the only choice for complex dynamic charts.

I’m never happy when I discover a new competitor in our market segment. Competition gets harder and that is not fun. Anyway, the guys at Datalytics seem smart (and nice), and competing with them will be a pleasure and will push me to do my best.

Now I’m curious to know whether Datalytics is monitoring the buzz on the web around its own company name. I’m going to tweet about this article using the #Datalytics hashtag. If you find this article, please tweet me “Yes, we found it bwahaha” 😛

[UPDATE 2014-12-27 21:18 CET]

@DatalyticsIT favorited my tweet on December 1st. This probably means they found my article but didn’t read it! 😀

After GFS and MapReduce, Google solved yet another big data problem by designing Bigtable: a compressed, high-performance, proprietary data storage system that forms the basis for many of its projects. HBase and Cassandra are inspired by it.


google_logo

Title: Bigtable: A Distributed Storage System for Structured Data (PDF), November 2006
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
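
The data model the paper describes is essentially a sparse, distributed, persistent, multi-dimensional sorted map indexed by (row key, column key, timestamp). Here is a toy in-memory Ruby sketch of that idea (my own illustration, not a Bigtable client), reusing the “webtable” row and column names from the paper:

# Toy sketch of the Bigtable data model: {[row, column] => {timestamp => value}}
table = Hash.new { |hash, key| hash[key] = {} }

def put(table, row, column, timestamp, value)
  table[[row, column]][timestamp] = value
end

def get_latest(table, row, column)
  versions = table[[row, column]]
  versions.max_by { |timestamp, _value| timestamp }&.last   # newest version wins
end

put(table, 'com.cnn.www', 'contents:', 3, '<html>…')
put(table, 'com.cnn.www', 'anchor:cnnsi.com', 9, 'CNN')
puts get_latest(table, 'com.cnn.www', 'anchor:cnnsi.com')   # => CNN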


Check out the list of interesting papers and projects (Github).

For several years I thought MapReduce was the only paradigm for distributed data processing. Only a few months ago, watching “Clash of the Titans Beyond MapReduce” (a really interesting talk at The Hive meetup), Matei Zaharia (@matei_zaharia), CTO of Databricks and co-creator of Spark, cited Dryad, a programming paradigm developed by Microsoft. I don’t know of any open source project that implements it, but it was widely used inside Microsoft Research.


microsoft_research_logo

Title: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (PDF), March 2007
Authors: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly

Abstract

Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational “vertices” with communication “channels” to form a dataflow graph.

Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources.

Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
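
The vertex/channel model is easier to picture with a tiny example. This is a purely conceptual Ruby sketch (my own illustration, nothing to do with Dryad’s real API): vertices are sequential programs, channels connect them into a dataflow graph, and a trivial single-machine “scheduler” runs them in topological order.

Vertex  = Struct.new(:name, :program)
Channel = Struct.new(:from, :to)

read   = Vertex.new('read',   ->(_inputs) { %w[3 1 2] })             # produce some records
sort   = Vertex.new('sort',   ->(inputs)  { inputs.sort })           # a plain sequential program
output = Vertex.new('output', ->(inputs)  { puts inputs.join(',') })

vertices = [read, sort, output]                                      # listed in topological order
channels = [Channel.new(read, sort), Channel.new(sort, output)]

results = {}
vertices.each do |vertex|
  inputs = channels.select { |c| c.to == vertex }.flat_map { |c| results[c.from] }
  results[vertex] = vertex.program.call(inputs)
end
# prints "1,2,3"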


Check out the list of interesting papers and projects (Github).

Hadoop would not exist without this paper. MapReduce is the most famous and still the most widely used processing paradigm for big data. It is not suitable for everything and there are several evolutions of it (Dryad, Spark, …), but Google, Facebook, Twitter and many others have millions of lines of MapReduce code deployed in their systems.


google_logo

Title: MapReduce: Simplified Data Processing on Large Clusters (PDF), December 2004
Authors: Jeffrey Dean and Sanjay Ghemawat

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
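
Word count is the canonical example of the model. This is a minimal single-process Ruby sketch of the idea (my illustration, obviously not Google’s implementation: in a real cluster the framework does the grouping, partitioning and distribution for you):

documents = ['the quick brown fox', 'the lazy dog', 'the fox']

# map: each input record emits intermediate (key, value) pairs
intermediate = documents.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# shuffle: group intermediate pairs by key (done by the framework in a real cluster)
grouped = intermediate.group_by { |word, _count| word }

# reduce: merge all values associated with the same intermediate key
counts = grouped.map { |word, pairs| [word, pairs.sum { |_word, count| count }] }.to_h

puts counts.inspect  # => {"the"=>3, "quick"=>1, "brown"=>1, "fox"=>2, "lazy"=>1, "dog"=>1}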


Check out the list of interesting papers and projects (Github).

crate_logo

I usually don’t trust cutting-edge datastores. They promise a lot of stunning features (and use a lot of superlatives to describe them), but almost every time they are too young and have too many problems to run in production. I thought the same about Crate Data.

“Massively scalable data store. It requires zero administration”

The first time I read these words (taken from the home page of Crate Data) I wasn’t impressed. I simply didn’t think they were true. Some months later I read a few articles and the overview of the project, and I found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I have used both Lucene and Elasticsearch in production for several years and I really like Presto. Combining production-ready components can definitely be a smart way to create something great, so I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L try.crate.io)"

But I don’t like self-install scripts, so I decided to download it and run it from bin. It only requires a JVM. I unpacked it on my desktop on OS X and launched ./bin/crate. The process binds port 4200 (or the first available port between 4200 and 4300) and if you go to http://127.0.0.1:4200/admin you find the admin interface (there is no authentication). There is also a command line interface: ./bin/crash. It is similar to the MySQL client, and if you are familiar with any other SQL client you will be familiar with crash too.

I created a simple table with semi-standard SQL code (data types are a bit different):

create table items (id integer, title string)

Then I searched for a Ruby client and found crate_ruby, the official Ruby client. I started to fill the table using a Ruby script with a million-record CSV as input. Inserts ran at about 5K per second and, in the meantime, I ran some aggregation queries on the database using standard SQL (GROUP BY, ORDER BY and so on) to test performance; responses were quite fast.

require 'csv'
require 'crate_ruby'
client = CrateRuby::Client.new  # defaults to localhost:4200 (assumption, check the gem README)
CSV.foreach("data.csv", col_sep: ";") do |row|
  client.execute("INSERT INTO items (id, title) VALUES ($1, $2)", [row[0], row[9]])
end
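
An aggregation of the kind I mean would look more or less like this (a hypothetical query on the same items table; I’m also assuming crate_ruby’s result set is enumerable, so check the gem’s README):

result = client.execute(
  "SELECT title, COUNT(*) AS cnt FROM items GROUP BY title ORDER BY cnt DESC LIMIT 10"
)
result.each { |row| puts row.inspect }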

Finally I decided to check out the cluster features by running another process on the same machine. After a couple of seconds the admin interface showed a new node, and after a dozen more it informed me that the data was fully replicated. I also tried shutting down both processes to see what would happen, and the data seems fine. I was impressed.

crate_admin

I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster, and I don’t know how difficult it is to use the advanced features (like full-text search or blob upload). But at the moment I’m impressed, because both administration and scaling seem really easy.

The next step will be to test it in production behind a Rails application (I found an interesting activerecord-crate-adapter) and to try the advanced features to implement a real-time search. I don’t know if I’ll end up using it, but the beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discovering more about the project.

The big-data environment is really “collaborative” at the moment. Each project is ready to run on almost every available platform, and this is good. However, two factions have recently been forming: people who use the Hadoop 2.0 stack and people who use the BDAS.

Hadoop 2.0 Stack

hadoop_stack

The most important difference between Hadoop 2.0 and previous versions is YARN, the new cluster resource manager and next-generation MapReduce runtime. It can run almost every kind of big-data workload:

  • Traditional MapReduce (the new version is backward compatible) with Hive, Pig or Cascading for querying
  • Interactive, near real-time MapReduce (using Tez)
  • HBase and Accumulo
  • Storm and S4 for stream processing
  • Giraph for graph processing
  • OpenMPI for message passing
  • Spark as in-memory MapReduce
  • HDFS as the distributed filesystem
  • more…

The most interesting companies here are Intel, Cloudera, MapR and Hortonworks.

BDAS (Berkeley Data Analytics Stack)

berkeley_stack

In the BDAS everything is built around Mesos, the cluster resource manager. It’s a relatively new project but it is already widely used. The traditional HDFS is accelerated by Tachyon (an in-memory file system). The main integration is around Spark, which is the base for:

Mesos can also run a traditional Hadoop environment and other projects (such as Storm and OpenMPI). You can even run traditional applications (Rails apps included) using Marathon.

The most interesting companies here are Databricks and Mesosphere.

Who will win? 😀

A few hours after I posted about the DataSift architecture, @choult, one of the roughly 25 ninjas who develop the DataSift platform, tweeted me.

The following SlideShare presentation by @stuherbert, another ninja, talks about the use of PHP at DataSift. Contrary to what you may think, PHP is widely used in their data processing.

datasift_repo_languges

The system can be decomposed into three major data pipelines:

  • Data Archiving (adds new data to the Historic Archive)
  • Filtering Pipeline (filtering and delivering data in real time)
  • Playback Pipeline (filtering and delivering data from the Historic Archive)

And PHP is used for many parts of these.

datasift_php

They use a custom build of PHP 5.3.latest with several optimizations and compiled-in extensions (ZeroMQ, APC, XHProf, Redis, XDebug). They also developed some internal components:

  • Frink: TweetMeme’s framework
  • Stone: the foundation of their in-house test tools, Hornet and Storyteller (they probably open-sourced a fork named Storyplayer).

Unfortunately I wasn’t able to find more details about these. Anyway, here is the presentation:


Everything started while I was writing my first post about the Hadoop ecosystem. I was relatively new to Hadoop and wanted to discover all the useful projects. I spent about 9 months collecting projects and building a simple index.

About a month ago I found an interesting thread on the Hadoop Users Group on LinkedIn, written by Javi Roman, High Performance Computing Manager at CEDIANT (UAX). He talks about a table that maps the Hadoop ecosystem much like I did with my list.

He published his list on GitHub a couple of days later and called it the Hadoop Ecosystem Table. It was an HTML table, really interesting but really hard to reuse for other purposes. I wanted to merge my list with this table, so I decided to fork it and add more abstraction.

I wrote a couple of Ruby scripts (thanks, Nokogiri) to extract data from my list and Javi’s table and put it into a format-agnostic container. After a couple of days spent hacking on these parsers I settled on a simple but elegant solution: JSON.

Information about each project is stored in a separate JSON file:

{
  "name": "Apache HDFS",
  "description": "The Hadoop Distributed File System (HDFS) offers a way to store large files across \nmultiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. \nPrior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. \nWith Zookeeper the HDFS High Availability feature addresses this problem by providing \nthe option of running two redundant NameNodes in the same cluster in an Active/Passive \nconfiguration with a hot standby. ",
  "abstract": "a way to store large files across multiple machines",
  "category": "Distributed Filesystem",
  "tags": [],
  "links": [
    {
      "text": "hadoop.apache.org",
      "url": "http://hadoop.apache.org/"
    },
    {
      "text": "Google FileSystem - GFS Paper",
      "url": "http://research.google.com/archive/gfs.html"
    },
    {
      "text": "Cloudera Why HDFS",
      "url": "http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/"
    },
    {
      "text": "Hortonworks Why HDFS",
      "url": "http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/"
    }
  ]
}

It includes the project name, long and short descriptions, category, tags and links.

I merged the data into these files and wrote a couple of generators to render the data through different templates. Now I can generate the code for my WordPress page and an updated version of Javi’s table.
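
One of the generators, stripped down to the idea (paths and markup here are illustrative, not the actual scripts):

require 'json'

rows = Dir.glob('projects/*.json').sort.map do |path|
  project = JSON.parse(File.read(path))
  "<tr><td>#{project['category']}</td><td>#{project['name']}</td><td>#{project['abstract']}</td></tr>"
end

File.write('big-data-table.html', "<table>\n#{rows.join("\n")}\n</table>\n")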

Finally I added more data for generic categories not strictly related to Hadoop (like MySQL forks, memcached forks and search engine platforms) and built a new version of the table: the Big Data Ecosystem Table. The JSON files are available to everyone and will be served directly from a CDN under the same domain as the table.

This is how I built an open source big data map 🙂

vkontakte_logo

Information about VKontakte, the largest European social network, and its infrastructure is scarce and fragmented. The only recent insight in English about its technology is a BTI press release which talks about VK’s migration onto their infrastructure. Everything else is top secret.

Only in 2011, at Moscow HighLoad++, did Pavel Durov and Oleg Illarionov say something about the architecture of the social network; the insights are collected in this post (in Russian).

VK doesn’t seem different from any other popular social network: it runs on a LAMP stack and uses many other open source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx manages load balancing in front of Apache, which runs PHP using mod_php with XCache as the opcode cacher.
  • MySQL is the main datastore, but a custom DBMS (written in C and based on the memcached protocol) is used for some of the magic. memcached also helps with page caching.
  • XMPP is used for messages and chats and runs on node.js. Availability is guaranteed by HAProxy, which handles the nodes’ fragility.
  • Multimedia files are stored on XFS and media encoding is done with ffmpeg.
  • Everything is distributed over more than 4 datacenters.

vk_logo

The main difference between VK and other social networks is how server roles are assigned: VK servers are multifunctional. There is no clear distinction between database servers and file servers; they are used simultaneously in several roles.

Load balancing between servers happens in layers: it includes balancing at the DNS level as well as request routing within the system, where different servers are used for different types of requests.

For example, microblogging works through a tricky scheme that exploits the memcached protocol’s ability to send parallel requests for data on a large number of keys. If the data is not in the cache, the same request is sent to the storage system, and the results are sorted, filtered and trimmed of excess at the PHP code level.
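
My reading of that flow, as a rough Ruby sketch (illustrative only: the real system is custom PHP talking to a custom datastore; here I use the Dalli gem against a local memcached and an invented key scheme):

require 'dalli'

cache      = Dalli::Client.new('localhost:11211')   # assumes a local memcached is running
friend_ids = [1, 2, 3]
keys       = friend_ids.map { |id| "feed:#{id}" }

hits    = cache.get_multi(*keys)                    # one request for many keys in parallel
missing = keys - hits.keys

# cache misses fall back to the storage system (stubbed as empty lists here)
missing.each { |key| hits[key] = [] }

# sorting, filtering and discarding the excess happen in application code
feed = hits.values.flatten.sort_by { |post| post[:time].to_i }.reverse.first(50)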

The custom database is still a secret and is widely used at VKontakte. Many services rely on it: private messages, wall messages, statuses, search, privacy, friends lists and probably more. It uses a non-relational data model, and most operations are performed in memory. The access interface is an extended memcached protocol: specially crafted keys return the results of complex queries. They say it was developed by the “best minds” of Russia.

I wasn’t able to find any other insight into VK’s infrastructure after this talk. They are like the KGB 😀

From the home page:

“Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.”

Introduction

Storm lets you define a Topology (an abstraction of a cluster computation) to describe how to handle the data flow. In a topology you define Spouts (entry points for your data, with basic preprocessing) and Bolts (single steps of data manipulation). This simple strategy enables you to define complex processing of streams of data.
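
To make the abstraction concrete, here is a purely conceptual Ruby sketch (not Storm’s actual API, which is Java/multi-lang): a spout emits tuples, bolts transform them, and the “topology” is just the wiring between them, driven here by a trivial local loop.

class RandomSentenceSpout
  SENTENCES = ['the cow jumped over the moon', 'an apple a day keeps the doctor away']
  def next_tuple
    SENTENCES.sample                     # entry point: emit one raw tuple
  end
end

class SplitSentenceBolt
  def execute(sentence)
    sentence.split                       # one step of manipulation: tokenize
  end
end

class WordCountBolt
  def initialize
    @counts = Hash.new(0)
  end
  def execute(words)
    words.each { |word| @counts[word] += 1 }
    @counts                              # running word count
  end
end

spout, split, count = RandomSentenceSpout.new, SplitSentenceBolt.new, WordCountBolt.new
5.times { puts count.execute(split.execute(spout.next_tuple)).inspect }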

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it. All coordination is done through ZooKeeper.

Libraries

Resources

Books