How it Works: BlogMeter

BlogMeter is our main competitor in the Social Reputation segment in Italy. It was founded in 2007, has offices in Milan, Turin, Rome and Madrid, and its developer team consists of about 10 coders. At Codemotion, Vittorio Di Tomaso (CEO at BlogMeter) and Roberto Franchini (Chief Architect at BlogMeter) talked about the infrastructure behind the company. The talks were very interesting and, as I said for Datalytics, it's fun to discover how different the approaches can be when you face the same problem.


The first talk was by Roberto Franchini about GlusterFS. I have never used this distributed filesystem, but it seems really interesting and completely different from HDFS.

They use it to store a daily production of more than 10 GB of Lucene inverted indexes (more than 200 GB/month). Their platform searches the stored indexes to extract a different set of documents for every customer. It seems crazy, but they open the indexes directly on the storage. Hardware grew from 4 TB on 8 non-dedicated servers in 2010 to 28 TB on 2 dedicated servers in 2014, and they plan to grow further. Outages were caused by a misconfiguration of storage limits, but there was no data loss.
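Just to give an idea of what "opening the indexes directly on the storage" means in practice, here is a minimal Lucene sketch in Java; the mount point, index layout and field name are my assumptions, not details from the talk:

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class GlusterIndexSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical GlusterFS mount point and daily index layout (my assumption).
        FSDirectory dir = FSDirectory.open(Paths.get("/mnt/gluster/indexes/2014-12-01"));
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Extract the subset of documents belonging to one customer's query.
            TopDocs hits = searcher.search(new TermQuery(new Term("customer", "acme")), 10);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}
```

The interesting part is that the `FSDirectory` points straight at the shared GlusterFS mount: no copy to local disk, no dedicated search cluster in front of it.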

Here are the slides:


 

The second talk was by Vittorio Di Tomaso about BlogMeter's infrastructure (with a bit of advertising and marketing about his company). Here is the overall schema (taken from his presentation and cropped to remove the Italian title):

blogmeter_infrastructure

The platform leverages PostgreSQL, Java and GlusterFS. Stream data comes mostly from Twitter (they use both the Streaming API and Gnip as data providers) and is processed on a Hazelcast data grid, using Kestrel to manage incoming data, Redis to deduplicate it and Drools to route it (and avoid unnecessary processing). They optimized their process by moving from batch to near real time, skipping the processing of duplicated content and evolving the processing pipeline from a linear flow into a DAG (directed acyclic graph).
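As a rough illustration of the deduplication step, this is how a Redis-based "have we already seen this tweet?" check could look in Java with the Jedis client; the key naming, the TTL and the client itself are my guesses, not details from the talk:

```java
import redis.clients.jedis.Jedis;

public class Deduplicator {
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if this tweet id has not been processed before. */
    public boolean isNew(String tweetId) {
        String key = "seen:" + tweetId;          // hypothetical key schema
        long created = jedis.setnx(key, "1");    // SETNX writes only if the key is absent
        if (created == 1) {
            jedis.expire(key, 7 * 24 * 3600);    // keep the marker around for a week
            return true;
        }
        return false;                            // already seen: skip further processing
    }
}
```

Dropping duplicates this early, together with the Drools routing rules, is what keeps unnecessary work out of the more expensive processing stages downstream.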

To process data they use a Spring-based application that relies on Apache UIMA and their closed-source Sophia Semantic Engine, and they store data using Lucene. A few more products are used: Ubuntu as the operating system, Jenkins for Continuous Integration and Jasig for authentication and security.
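On the storage side, writing the processed documents into a daily Lucene index could look roughly like this; the path and field names are invented for the example, the real schema is theirs, not mine:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class DailyIndexWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical daily index on the GlusterFS mount (path and fields are assumptions).
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/mnt/gluster/indexes/2014-12-01")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("customer", "acme", Store.YES));            // exact-match key
            doc.add(new TextField("text", "annotated tweet text", Store.YES));  // analyzed body
            writer.addDocument(doc);
        }
    }
}
```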

The visualization layer uses de-facto standard libraries like D3.js, jQuery and Fusion Charts. As for the hardware: 300 cores, 1.2 TB of RAM and 29 TB of storage.

Here are the slides:


 

In the end, their process is not that different from ours: they handle incoming data, process it, store it and visualize it. Their system is probably more oriented towards quantity than quality, but the logic is similar and everything seems cool 🙂

As with Datalytics, I'm really curious to know whether they monitor their name on Twitter, so I'm going to tweet about this article using the #BlogMeter hashtag. If you find this article, please tweet me "Yes, we found it bwahaha" 😛

[UPDATE 2014-12-27 21:22 CET]

About 4 weeks after I tweeted about this post, I still haven't received any answer. As I said before, they probably focus on quantity over quality 😉