How it Works: Datalytics

A few days ago I was at Codemotion in Milan and had the opportunity to discover some insights about the technologies used by two of our main competitors in Italy: BlogMeter and Datalytics. It’s quite interesting because, even if the technical challenges are almost the same, each company uses a different approach with a different stack.

[image: datalytics_logo]

Datalytics is a relatively new company, founded 4 months ago. They had a desk at Codemotion to show their products and recruit new people. I chatted with Marco Caruso, the CTO (who probably didn’t know who I was, sorry Marco, I just wanted to avoid hostility 😉), about the technologies they use and the developer profile they were looking for. The required skills were:

Their tech team is composed of 4 developers (including the CTO) and their main products are Datalytics Monitoring™ (a sort of statistical dashboard that shows buzz stats in real time) and Datalytics Engage™ (a real-time analytics dashboard for live events). I have no technical insights about how their systems work, but I can guess some details by inferring them from the buzzwords they use.

Supported sources are Twitter, Facebook (public data only), Instagram, YouTube and Vine (their logos are on the website), and probably Pinterest.

They use DataSift as a data source in addition to the standard APIs. I suppose their processing pipeline uses Storm to manage the streaming input, maybe with an importing layer in front of it. Data is crunched using Hadoop and Java and results are stored in MongoDB (Massimo Brignoli, the Italian MongoDB evangelist, advertised their company during his presentation, so I suppose they use it heavily).

Node.js is probably used for the frontend. It’s fast enough for near real-time applications (also using websockets) and plays really well with both Angular.js and MongoDB (the MEAN stack). D3.js is the obvious choice for complex dynamic charts.

I’m not so happy when I discover a new competitor in our market segment. Competition gets harder and this is not fun. Anyway, the guys at Datalytics seem smart (and nice), and competing with them will be a pleasure and will push me to do my best.

Now I’m curious to know if Datalytics is monitoring the buzz on the web around its own company name. I’m going to tweet about this article using the #Datalytics hashtag. If you find this article please tweet me “Yes, we found it, bwahaha” 😛

[UPDATE 2014-12-27 21:18 CET]

@DatalyticsIT favorited my tweet on December 1st. This probably means they found my article but didn’t read it! 😀

Was I wrong about Crate Data?

[image: crate_logo]

I usually don’t trust cutting-edge datastores. They promise a lot of stunning features (and use a lot of superlatives to describe them), but almost every time they are too young and have so many problems in production that they end up being useless. I thought the same about Crate Data.

“Massively scalable data store. It requires zero administration”

The first time I read these words (taken from the home page of Crate Data) I wasn’t impressed. I simply didn’t think they were true. Some months later I read some articles and the overview of the project and found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I have used both Lucene and Elasticsearch in production for several years and I really like Presto. Combining production-ready components can definitely be a smart way to create something great. I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L try.crate.io)"

But I don’t like self-install scripts, so I decided to download it and run it from bin. It simply requires a JVM. I unpacked it on my desktop on OS X and launched ./bin/crate. The process binds port 4200 (or the first available port between 4200 and 4300) and if you go to http://127.0.0.1:4200/admin you find the admin interface (there is no authentication). You also get a command line interface: ./bin/crash. It’s similar to the MySQL client, and if you are familiar with any other SQL client you will be familiar with crash too.
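The same port also serves Crate’s SQL-over-HTTP endpoint (the one the client libraries talk to), so you can smoke test a node with nothing but Ruby’s standard library. A minimal sketch, assuming the default /_sql endpoint and the built-in sys.nodes table:

require 'net/http'
require 'json'

# Assumes a local Crate node listening on the default port 4200.
uri = URI("http://127.0.0.1:4200/_sql")
payload = { stmt: "SELECT name FROM sys.nodes" }.to_json
response = Net::HTTP.post(uri, payload, "Content-Type" => "application/json")

# The endpoint answers with JSON: column names in "cols", data in "rows".
puts JSON.parse(response.body)["rows"]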

I created a simple table with semi-standard SQL (the data types are a bit different):

create table items (id integer, title string)

Then I searched for a Ruby client and found crate_ruby, the official Ruby client. I started to fill the table using a Ruby script and a million-record CSV as input. Inserts ran at about 5K per second, and in the meantime I ran some aggregation queries on the database using standard SQL (GROUP BY, ORDER BY and so on) to test performance; responses were quite fast.

CSV.foreach("data.csv", col_sep: ";").each do |row|
  client.execute("INSERT INTO items (id, title) VALUES (\, \)", [row[0], row[9]])
end
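The aggregation queries I mention above aren’t part of that script; here is a sketch of the kind of query I ran, reusing the same crate_ruby client and the items table (the grouping column is just an example):

# Most frequent titles in the freshly imported data; any GROUP BY / ORDER BY
# combination goes through the same execute call.
result = client.execute("SELECT title, count(*) FROM items GROUP BY title ORDER BY count(*) DESC LIMIT 10")
result.each { |row| puts row.inspect }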

Finally, I decided to inspect the cluster features by running another process on the same machine. After a couple of seconds the admin interface showed a new node, and after a dozen more it informed me that the data was fully replicated. I also tried to shut down both processes to see what happened, and the data seems fine. I was impressed.
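The admin interface is enough to watch replication happen, but you can also check it from SQL. A small sketch, assuming Crate’s built-in sys.shards table behaves as documented:

# Count the shards of the items table grouped by state: with a second node up,
# the replica shards should reach the STARTED state next to the primaries.
result = client.execute("SELECT state, count(*) FROM sys.shards WHERE table_name = 'items' GROUP BY state")
result.each { |row| puts row.inspect }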

[image: crate_admin]

I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster, and I don’t know how difficult it is to use the advanced features (like full-text search or blob upload). But at the moment I’m impressed, because administration seems really easy and scalability seems easy too.

The next step will be to test it in production under a Rails application (I found an interesting activerecord-crate-adapter) and to try the advanced features by implementing a real-time search. I don’t know if I’ll end up using it, but the beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discovering more about the project.

How it Works: DataSift – PHP details

A few hours after I posted about the DataSift architecture, @choult, one of the roughly 25 ninjas who develop the DataSift platform, tweeted me.

The following SlideShare presentation by @stuherbert, another ninja, talks about the use of PHP at DataSift. Contrary to what you may think, PHP is widely used in their data processing.

[image: datasift_repo_languages]

The system can be decomposed into three major data pipelines:

  • Data Archiving (adds new data to the Historic Archive)
  • Filtering Pipeline (filters and delivers data in real time)
  • Playback Pipeline (filters and delivers data from the Historic Archive)

And PHP is used in many parts of all three.

[image: datasift_php]

They use a custom build of the latest PHP 5.3 with several optimizations and compiled-in extensions (ZeroMQ, APC, XHProf, Redis, XDebug). They also developed some internal components:

  • Frink: TweetMeme’s framework
  • Stone: the foundation of their in-house test tools, Hornet and Storyteller (they probably open-sourced a fork named Storyplayer).

Unfortunately I wasn’t able to find more details about these. Anyway, here is the presentation:

[slideshare id=17557205&doc=phpandthefirehose2013-130323121454-phpapp01]

How it works: VKontakte

[image: vkontakte_logo]

Information about VKontakte, the largest European social network, and its infrastructure is scarce and fragmented. The only recent insight in English about its technology is a press release from BTI which talks about VK’s migration onto their infrastructure. Everything else is top secret.

Only in 2011, at HighLoad++ in Moscow, did Pavel Durov and Oleg Illarionov say something about the architecture of the social network, and those insights are collected in this post (in Russian).

VK doesn’t seem different from any other popular social network: it runs on a LAMP stack and uses many other open source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx manages load balancing in front of Apache, which runs PHP using mod_php with XCache as the opcode cacher.
  • MySQL is the main datastore, but a custom DBMS (written in C and based on the memcached protocol) is used for some magic. memcached also helps with page caching.
  • XMPP is used for messages and chats and runs on node.js. Availability is ensured by HAProxy, which handles the nodes’ fragility.
  • Multimedia files are stored on xfs and media encoding is done with ffmpeg.
  • Everything is distributed across more than 4 datacenters.

The main difference between VK and other social networks is in server functions: VK servers are multifunctional. There is no clear distinction between database servers and file servers; the same machines are used simultaneously in several roles.

Load balancing between servers happens in layers: it includes balancing at the DNS level as well as request routing within the system, where different servers are used for different types of requests.

For example, microblogging works through a tricky scheme that uses the memcached protocol’s ability to request data for a large number of keys in parallel. If the data is not in the cache, the same request is sent to the storage system, and the results are sorted, filtered and trimmed of the excess at the PHP level.
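Neither their engine nor their PHP code is public, but the pattern itself is easy to sketch. Here is a minimal illustration in Ruby, using the Dalli memcached client as a stand-in for their custom memcached-protocol store; the key names, the placeholder storage lambda and the trimming logic are all invented for the example:

require 'dalli'

cache    = Dalli::Client.new('127.0.0.1:11211')
user_ids = [42, 43, 44]                       # hypothetical list of friend ids

# One round trip for many keys, using the memcached protocol's multi-get.
keys  = user_ids.map { |id| "timeline:#{id}" }
found = cache.get_multi(*keys)                # => { "timeline:42" => [...], ... }

# Keys missing from the cache are re-requested from the storage system
# (here just a placeholder standing in for VK's custom engine).
storage = ->(key) { [] }
(keys - found.keys).each { |key| found[key] = storage.call(key) }

# Sorting, filtering and discarding the excess happens in application code.
timeline = found.values.flatten.compact.sort_by { |post| -post[:time] }.first(20)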

The custom database is still a secret and is widely used inside VKontakte. Many services rely on it: private messages, wall posts, statuses, search, privacy, friends lists and probably more. It uses a non-relational data model, and most operations are performed in memory. The access interface is an extended memcached protocol: specially crafted keys return the results of complex queries. They say it was developed by the “best minds” of Russia.

I wasn’t able to find any other insight about VK’s infrastructure after this talk. They are like the KGB 😀

How it works: Klout

[image: klout_logo]

According to Wikipedia, Klout is “a website and mobile app that uses social media analytics to rank its users according to online social influence via the ‘Klout Score’, which is a numerical value between 1 and 100”.

This is not so different from what I try to do every day. They get signals from social networks, process them to extract relevant data, and show some diagrams and a synthetic index of user influence. It’s really interesting for me to observe how their data is stored and processed.

At Hadoop Summit 2012, Dave Mariani (from Klout) and Denny Lee (from Microsoft) presented the Klout architecture and showed the following diagram:

[image: klout_architecture]

It shows many different technologies, a great example of polyglot persistence 🙂

Klout uses Hadoop a lot. It’s used to collect the signals coming from the different Signal Collectors (one for each social network, I suppose). The procedures that enhance the data are written in Pig, and Hive is also used for the data warehouse.

Currently MySQL is used only to collect user registrations, which are then ingested into the data warehouse system. In the past they used it as a bridge between the data warehouse and their “Cube”, a Microsoft SQL Server Analysis Services (SSAS) instance used for Business Intelligence with Excel and other custom apps. In 2011 the data was migrated using Sqoop. Now they can leverage Microsoft’s Hive ODBC driver, and MySQL isn’t used for this anymore.

The website and the mobile app are based on the Klout API. Data is collected from the data warehouse and stored in HBase (user profiles and scores) and MongoDB (interactions between users). ElasticSearch is used as the search index.

Most of the custom components are written in Scala. The only exception is the website, written in JavaScript/Node.js.

In the end, Klout is probably the biggest company using both open source tools from the Hadoop ecosystem and Microsoft tools. The Hadoop distribution for Windows Azure, developed together with Hortonworks, is probably the first product of this collaboration.

Sources