Today’s challenge: write my first program using Haskell. Let’s start!


Searching “Hello World Haskell” on Google gives me the following tutorial: Haskell in 5 steps.

Install Haskell

The first step is to install the Haskell Platform. Its main components are GHC (the Glasgow Haskell Compiler) and Cabal (Common Architecture for Building Applications and Libraries). I decided to use Homebrew for simplicity.

brew update
brew install ghc
brew install cabal-install

Using the REPL

You can run the Haskell REPL using the ghci command:

$ ghci
GHCi, version 7.10.1: http://www.haskell.org/ghc/  :? for help
Prelude>

Here you can evaluate any expression or run any GHCi command:

Prelude> "Hello, World!"
"Hello, World!"
Prelude> putStrLn "Hello World"
Hello World
Prelude> 3 ^ 5
243
Prelude> :quit
Leaving GHCi.

Compile

Create hello.hs.

main = putStrLn "Hello, World!"

Then run the GHC compiler.

$ ghc hello.hs
[1 of 1] Compiling Main             ( hello.hs, hello.o )
Linking hello ...

The output executable is named hello. You can run it like any other executable.

$ ./hello
Hello, World!

Real code

Now a bit more code: a factorial calculator. The first step is to define the factorial function. You can do it in a single line (the let is needed inside GHCi):

let fac n = if n == 0 then 1 else n * fac (n-1)

Or split the definition across multiple lines:

fac 0 = 1
fac n = n * fac (n-1)

Put the multi-line definition into factorial.hs. Now you can load the function inside the console:

Prelude> :load factorial.hs
[1 of 1] Compiling Main             ( factorial.hs, interpreted )
Ok, modules loaded: Main.
*Main> fac 42
1405006117752879898543142606244511569936384000000000

Or add a main function, then compile and run your executable:

fac 0 = 1
fac n = n * fac (n-1)
main :: IO ()
main = print (fac 42)

N.B. The first time I compiled the source code I needed to add main :: IO () to avoid the compile error: The IO action 'main' is not defined in module 'Main'.

Now your executable runs fine:

./factorial
1405006117752879898543142606244511569936384000000000

Going parallel

Haskell is cool because of its pure functional nature and its parallel/multicore vocation, so this beginner's tutorial adds some tips about that.

First of all you need to get the parallel library:

$ cabal update
Downloading the latest package list from hackage.haskell.org
$ cabal install parallel
Resolving dependencies...
Downloading parallel-3.2.1.0...
Configuring parallel-3.2.1.0...
Building parallel-3.2.1.0...
Installed parallel-3.2.1.0

Then you can write your parallel program using the `par` combinator.

import Control.Parallel

main = a `par` b `par` c `pseq` print (a + b + c)
  where
    a = ack 3 10
    b = fac 42
    c = fib 34

fac 0 = 1
fac n = n * fac (n-1)

ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

Compile it with the -threaded flag.

$ ghc -O2 --make parallel.hs -threaded -rtsopts
[1 of 1] Compiling Main             ( parallel.hs, parallel.o )
Linking parallel ...

And run it:

./parallel +RTS -N2
1405006117752879898543142606244511569936384005711076

Now, many details aren't clear to me, from the meaning of `pseq` to the level of parallelism used. Moreover, I have no idea what the compiler options and the executable's runtime options actually do. There is definitely more here than "Hello, World!".

The next step would be Haskell in 10 minutes, another nice introduction to the language with links to more complex topics, or Real World Haskell (by Bryan O'Sullivan, Don Stewart, and John Goerzen), or any other useful tutorial listed in the 5 steps guide.

See you at the next language 🙂


Yesterday, after a beautiful (and greedy) Christmas, I decided to start learning the basics of some new tech stuff. The first choice was the Python programming language because it seems quite similar to Ruby, several friends are skilled in it, and it is widely used at Google and for Spark, so it is probably one of the best languages to learn right now.

I'm quite fluent in Ruby, PHP and JavaScript but I have no Python skills. I chose CodeAcademy for a basic introduction and I'm at about 50% of the course. Here is what I have learned so far.

Python is dynamically typed (you don't have to declare types) and the types are almost the same as in Ruby and PHP:

my_int = 42
my_float = 108.0
my_string = "The answer to life, the universe and everything"

There are no braces and no begin/end blocks; everything is based on indentation (I absolutely love it). Functions are defined as follows:

def my_function(argument):  # the colon starts a new indentation level
    return argument * 2

Control structures are the usual ones. Conditionals:

if condition1:
    return 1
# elif is like elsif in Ruby and elseif in PHP
elif condition2:
    return 2
else:
    return 3

Loops:

for item in my_list:
    print item

Instead of Hashes and Arrays, Python uses Dictionaries and Lists, but they are almost the same:

my_list = ["daniele", "luca", "michele"]
my_dictionary = {"marco": 1, "matteo": 7, "michele": 4}
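
For example (reusing the variables above and sticking to the Python 2 syntax the course teaches), this is how you access and iterate over them:

print my_list[0]               # "daniele": lists are indexed from 0
print my_dictionary["matteo"]  # 7: dictionaries are accessed by key

for name in my_list:
    print name

for name, score in my_dictionary.items():
    print name, score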

I started a few hours ago and I'm just a newbie. I hope the rest of the CodeAcademy course will teach me more complex topics and, as far as real-world usage goes, I probably need to learn how to handle version management and discover the most powerful libraries. Anyway, I can already say it's pretty cool to program in Python 🙂

BlogMeter is our main competitor in the social reputation segment in Italy. It was founded in 2007, has offices in Milan, Turin, Rome and Madrid, and its developer team consists of about 10 coders. At Codemotion, Vittorio Di Tomaso (CEO at BlogMeter) and Roberto Franchini (Chief Architect at BlogMeter) talked about the infrastructure behind the company. The talks were very interesting and, as I said for Datalytics, it's fun to discover how different the approaches can be when you face the same problem.


The first talk was by Roberto Franchini about GlusterFS. I have never used this distributed filesystem, but it seems really interesting and quite different from HDFS.

They use it to store a daily production of more than 10GB of Lucene inverted indexes (more than 200GB/month). Their platform searches the stored indexes to extract different sets of documents for every customer. It seems crazy, but they open the indexes directly on the storage. The hardware grew from 4TB on 8 non-dedicated servers in 2010 to 28TB on 2 dedicated servers in 2014, and they plan to grow more. Outages were caused by misconfiguration of storage limits, but there was no data loss.

Here are the slides:


 

The second talk was by Vittorio Di Tomaso about BlogMeter's infrastructure (with a bit of advertising and marketing about his company). Here is the overall schema (taken from his presentation and cropped to remove the Italian title):

blogmeter_infrastructure

The platform leverages PostgreSQL, Java and GlusterFS. Stream data comes mostly from Twitter (they use both the Streaming API and Gnip as data providers) and is processed on a Hazelcast data grid, using Kestrel to manage incoming data, Redis to deduplicate it and Drools to route it (and avoid unnecessary processing). They optimized their process by moving from batch to near real time, avoiding processing of duplicated content and turning the processing pipeline from a linear flow into a DAG (directed acyclic graph) flow.
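
I have no idea how their deduplication actually works, but a minimal sketch of the Redis-based idea (using redis-py, with made-up key names) could look like this:

import redis

r = redis.Redis(host="localhost", port=6379)

def is_duplicate(content_id, ttl_seconds=86400):
    # SET with nx=True writes the key only if it does not exist yet, so the
    # first call for a given id returns False and every later call True.
    first_time = r.set("seen:%s" % content_id, 1, nx=True, ex=ttl_seconds)
    return not first_time

# usage: drop duplicated items before sending them down the pipeline
for tweet_id in ["42", "43", "42"]:
    if is_duplicate(tweet_id):
        continue  # already processed, skip it
    print("processing tweet %s" % tweet_id)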

To process the data they use a Spring-based application that uses Apache UIMA and their closed-source Sophia Semantic Engine, storing data with Lucene. A few more products are used: Ubuntu as the operating system, Jenkins for continuous integration and Jasig for authentication and security.

The visualization layer uses de-facto standard libraries like D3.js, jQuery and Fusion Charts. About the hardware: 300 cores, 1.2TB of RAM and 29TB of storage.

Here are the slides:


 

In the end their process is not that different from ours. They handle incoming data, process it, store it and visualize it. Their system is probably more oriented to quantity than quality, but the logic is similar and everything seems cool 🙂

As with Datalytics, I'm really interested to know whether they monitor their name on Twitter, so I'm going to tweet about this article using the #BlogMeter hashtag. If you find this article please tweet me "Yes, we found it bwahaha" 😛

[UPDATE 2014-12-27 21:22 CET]

About 4 weeks after I tweeted about this post, I still haven't received any answer. As I said before, they probably focus on quantity over quality 😉

A few days ago I was at Codemotion in Milan and had the opportunity to gather some insights about the technologies used by two of our main competitors in Italy: BlogMeter and Datalytics. It's quite interesting because, even if the technical challenges are almost the same, each company uses a different approach with a different stack.


Datalytics is a relatively new company, founded 4 months ago. They had a desk at Codemotion to show their products and recruit new people. I chatted with Marco Caruso, the CTO (who probably didn't know who I am; sorry Marco, I just wanted to avoid hostility 😉), about the technologies they use and the developer profile they were looking for. The required skills were:

Their tech team is composed of 4 developers (including the CTO) and their main products are Datalytics Monitoring™ (a sort of statistical dashboard that shows buzz stats in real time) and Datalytics Engage™ (a real-time analytics dashboard for live events). I have no technical insight into how their systems work, but I can guess some details by inferring them from the buzzwords they use.

Supported sources are Twitter, Facebook (only public data), Instagram, Youtube, Vine (logos are on their website) and probably Pinterest.

They use DataSift as a data source in addition to the standard APIs. I suppose their processing pipeline uses Storm to manage the streaming input, maybe with an importing layer in front. Data is crunched using Hadoop and Java, and results are stored in MongoDB (Massimo Brignoli, the Italian MongoDB evangelist, advertised their company during his presentation, so I suppose they use it extensively).

Node.js is probably used for the frontend. It is fast enough for near real-time applications (also using websockets) and plays really well both with Angular.js and MongoDB (the MEAN stack). D3.js is obviously the only choice for complex dynamic charts.

I'm not so happy when I discover a new competitor in our market segment. Competition gets harder, and this is not fun. Anyway, the guys at Datalytics seem smart (and nice), so competing with them will be a pleasure and will push me to do my best.

Now I'm curious to know if Datalytics is monitoring the buzz around its company name on the web. I'm going to tweet about this article using the #Datalytics hashtag. If you find this article please tweet me "Yes, we found it bwahaha" 😛

[UPDATE 2014-12-27 21:18 CET]

@DatalyticsIT favorited my tweet on December 1st. This probably means they found my article but they didn't read it! 😀

A few hours after I posted about the DataSift architecture, @choult, one of the roughly 25 ninjas who develop the DataSift platform, tweeted me.

The following SlideShare presentation by @stuherbert, another ninja, talks about the use of PHP at DataSift. Contrary to what you may think, PHP is widely used in their data processing.

datasift_repo_languges

The system can be decomposed into three major data pipelines:

  • Data Archiving (adds new data to the Historic Archive)
  • Filtering Pipeline (filters and delivers data in realtime)
  • Playback Pipeline (filters and delivers data from the Historic Archive)

And PHP is used for many parts of these.

datasift_php

They use a custom build of PHP 5.3 (latest point release) with several optimizations and compiled-in extensions (ZeroMQ, APC, XHProf, Redis, Xdebug). They also developed some internal components:

  • Frink, TweetMeme's framework
  • Stone, the foundation of their in-house test tools Hornet and Storyteller (they probably open-sourced a fork named Storyplayer).

Unfortunately I wasn’t able to find more details about these. Anyway, here is the presentation:



DataSift's home page says they "aggregate, process and deliver social data". It is one of the oldest Twitter certified partners and offers data coming from almost every existing social network. I use it every day to "listen to" the net and find the data I need for my analyses.

It's impressive to watch how fast they collect data from external sources and deliver it to your chosen destination. When I tweet, a couple of minutes later a JSON file lands in my S3 bucket.

Creating Internet-scale filtering is not easy. Their infrastructure is really complex and optimized. This is a 2011 diagram of their workflow.

datasift_infrastructure

Twitter generates more than 500 million tweets per day, and it is only one of the available sources. The DataSift system performs 250+ million sentiment analyses with sub-100ms latency, and several TB of augmented data (including gender, sentiment, etc.) transit the platform daily. Data filtering nodes can process up to 10,000 unique streams and can do data lookups on 10,000,000+ username lists in real time. Link augmentation performs 27 million link resolves and lookups plus 15+ million full web page aggregations per day.

C++ is used for the performance-critical components, like the core filtering engine; PHP powers the site, the external API server, most of the internal web services, and a custom-built, high-performance job queue manager. Java and Scala are used for batch processing with HBase and MapReduce jobs. Kafka is used as the queuing system and Ruby for deploys and provisioning. Thrift is widely used.

MySQL (Percona Server) on SSD drives is used as primary storage, an HBase cluster over more than 30 Hadoop nodes stores historical data, and Memcached and Redis are used for caching.

Here is a schema of the processing unit which builds the historical database.

datasift_historical

Message queues are another critical component of the infrastructure. 0mq (a custom build from the latest alpha branch, with some stability fixes, to use publisher-side filtering) is used in different configurations (a minimal sketch of the PUB-SUB case follows the list):

  • PUB-SUB for replication / message broadcasting;
  • PUSH-PULL for round-robin workload distribution;
  • REQ-REP for health checks of different components.
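
I haven't seen their code, so this is only a minimal sketch of the PUB-SUB configuration using pyzmq, with made-up endpoints and topics (PUSH-PULL and REQ-REP just swap the socket types):

import time
import zmq

ctx = zmq.Context()

# Publisher side: broadcasts messages prefixed with a topic.
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://127.0.0.1:5556")   # endpoint is invented for the example

# Subscriber side (normally a different process or host).
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://127.0.0.1:5556")
sub.setsockopt_string(zmq.SUBSCRIBE, "tweets")  # prefix-based filtering

time.sleep(0.5)  # give the subscription time to propagate (slow joiner)

pub.send_string('tweets {"id": 42, "text": "hello"}')
print(sub.recv_string())           # -> tweets {"id": 42, "text": "hello"}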

Kafka is used for high-performance persistent queues. In both cases they work with the upstream developers, contributing bug reports, traces, fixes and client libraries.

All code is pulled from the repo by Jenkins every 5 minutes, automatically tested and verified with several QA tools, packaged as an RPM and moved to the dev package repo. Chef is used to automate deployments and manage configuration. All services emit StatsD events, which are combined with other system-level checks, added to Zenoss and displayed with Graphite.
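
Just to give an idea of what "emitting StatsD events" means in practice, here is a small sketch using the Python statsd client (the metric names are invented):

import statsd

# StatsD speaks a tiny UDP protocol, so emitting metrics is fire-and-forget.
stats = statsd.StatsClient("localhost", 8125, prefix="filtering_pipeline")

stats.incr("tweets.processed")             # a counter
stats.gauge("queue.depth", 1234)           # a gauge
with stats.timer("augmentation.latency"):  # a timer around a block of work
    pass  # the actual work would go here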

The biggest challenge, IMHO, is filtering. Filtering at this scale requires a different approach. They started from the work they did at TweetMeme. The core filter engine is written in C++ and is called the Pickle Matrix. Over three years they have developed a compiler and their own virtual machine. We don't know exactly what their technology is, but it might be something like distributed complex event processing with query rewriting.

Sources

Almost all of the content of this post comes from the wonderful article "DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second" posted on HighScalability. Some details also come from "Historical Architecture – Data Mining Billions of Tweets" on the DataSift blog.


Information about VKontakte, the largest European social network, and its infrastructure is scarce and fragmented. The only recent insight in English about its technology is a BTI press release which talks about VK's migration onto their infrastructure. Everything else is top secret.

Only in 2011, at Moscow HighLoad++, did Pavel Durov and Oleg Illarionov say something about the architecture of the social network; those insights are collected in this post (in Russian).

VK doesn't seem different from any other popular social network: it runs on a LAMP stack and uses many other open source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx manages load balancing in front of Apache, which runs PHP using mod_php with XCache as opcode cacher.
  • MySQL is the main datastore, but a custom DBMS (written in C and based on the memcached protocol) is used for some magic. memcached also helps with page caching.
  • XMPP is used for messages and chats and runs over Node.js. Availability is ensured by HAProxy, which handles the nodes' fragility.
  • Multimedia files are stored on xfs and media encoding is done with ffmpeg.
  • Everything is distributed over more than 4 datacenters.

The main difference between VK and other social networks is in the server roles: VK servers are multifunctional. There is no clear distinction between database servers and file servers; they are used simultaneously in several roles.

Load balancing between servers happens in a layered scheme that includes balancing at the DNS level as well as request routing within the system, where different servers are used for different types of requests.

For example, microblogging works through a tricky scheme that uses the memcached protocol's ability to send parallel requests for data on a large number of keys. If some data is missing from the cache, the same request is sent to the storage system, and the results are sorted, filtered and trimmed at the PHP code level.
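
This is only my reading of that description, but a minimal sketch of the multi-key pattern (using pymemcache, with invented key names and a fake storage fallback) might look like this:

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

def load_from_storage(keys):
    # stand-in for the real storage system (their custom DBMS)
    return {k: '{"text": "body of %s"}' % k for k in keys}

def fetch_posts(post_ids):
    # One multi-key request instead of one round trip per key.
    keys = ["post:%d" % pid for pid in post_ids]
    cached = mc.get_many(keys)

    missing = [k for k in keys if k not in cached]
    if missing:
        fetched = load_from_storage(missing)
        mc.set_many(fetched)   # refill the cache for next time
        cached.update(fetched)

    # sorting, filtering and trimming would happen here, in application code
    return [cached[k] for k in keys if k in cached]

print(fetch_posts([1, 2, 3]))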

The custom database is still a secret and is widely used inside VKontakte. Many services rely on it: private messages, wall posts, statuses, search, privacy, friends lists and probably more. It uses a non-relational data model, and most operations are performed in memory. The access interface is an extended memcached protocol: specially crafted keys return the results of complex queries. They said it was developed by the "best minds" of Russia.

I wasn't able to find any other insight about the VK infrastructure after this talk. They are like the KGB 😀

From the home page:

"Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate."

Introduction

Storm lets you define a Topology (an abstraction of the cluster computation) to describe how to handle the data flow. In a topology you define Spouts (the entry points for your data, with basic preprocessing) and Bolts (single steps of data manipulation). This simple strategy enables you to describe complex processing of streams of data.
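
Real Storm topologies are written in Java (or through multi-lang adapters), so the following is just a toy, dependency-free Python sketch of the spout/bolt idea, not the actual Storm API:

# Toy model of a topology: a spout feeding two bolts. NOT the real Storm API.
def sentence_spout():
    # Spout: the entry point, emits a stream of tuples.
    for sentence in ["the quick brown fox", "jumps over the lazy dog"]:
        yield sentence

def split_bolt(sentences):
    # Bolt: a single manipulation step, here splitting sentences into words.
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    # Bolt: another step, counting word occurrences.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring spout -> bolt -> bolt is what a topology describes.
print(count_bolt(split_bolt(sentence_spout())))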

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as needed based on what Nimbus has assigned to it. Coordination between them is done through ZooKeeper.

Libraries

Resources

Books


According to Wikipedia, Klout is "a website and mobile app that uses social media analytics to rank its users according to online social influence via the 'Klout Score', which is a numerical value between 1 and 100".

This is not so different from what I try to do every day. They get signals from social networks, process them in order to extract relevant data, and show some charts and a synthetic index of user influence. It's really interesting for me to observe how their data is stored and processed.

At the Hadoop Summit 2012, Dave Mariani (of Klout) and Denny Lee (of Microsoft) presented the Klout architecture and showed the following diagram:

klout_architecture

It shows many different technologies, a great example of polyglot persistence 🙂

Klout uses Hadoop a lot. It's used to collect the signals coming from the different Signal Collectors (one for each social network, I suppose). The procedures that enhance the data are written using Pig and Hive, which is also used for the data warehouse.

Currently MySQL is used only to collect user registrations, which are then ingested into the data warehouse system. In the past they used it as a bridge between the data warehouse and their "Cube", a Microsoft SQL Server Analysis Services (SSAS) cube used for business intelligence with Excel and other custom apps. In 2011 the data was migrated using Sqoop. Now they can leverage Microsoft's Hive ODBC driver and MySQL is no longer needed for that.

The website and the mobile app are based on the Klout API. Data is collected from the data warehouse and stored in HBase (user profiles and scores) and MongoDB (interactions between users). ElasticSearch is used as the search index.

Most of the custom components are written in Scala. The only exception is the website, written in JavaScript/Node.js.

In the end, Klout is probably the biggest company working with both open source tools from the Hadoop ecosystem and Microsoft tools. The Hadoop version for Windows Azure, developed together with Hortonworks, is probably the first product of this collaboration.

Sources

YouPorn is one of the most visited porn sites on the web. I applied for a job there as a developer in 2009 because of a post that talked about its technology stack.

I heard about its infrastructure again because, about a year ago, @ErikPickupYP spoke about the great switchover at ConFoo and @antirez tweeted some details about the datastore. The development team rewrote the entire site using Redis as the primary database.

The original stack was based on Perl and Catalyst and powered the site from 2006 to 2011. After the acquisition, they rewrote the site using a well-designed LAMP stack.

The chosen framework is Symfony2 (which uses Doctrine as ORM), running on nginx with PHP-FPM, helped by Varnish (to speed up requests, manage the cache and check server status) and HAProxy (for load balancing and health checks of the servers). Syslog-ng handles logs. They maintain two pools of servers: a write pool with a fail-over to a backup master, and a read pool with all servers except the master.

The datastore is the most interesting part. Initially they used MySQL, but more than 200 million pageviews and 300K queries per second are too much to be handled by MySQL alone. The first try was to add ActiveMQ to enqueue writes, but a separate Java infrastructure was too expensive to maintain. Finally they added Redis in front of MySQL and now use it as the main datastore.

Now all reads come from Redis. MySQL is used to build new sorted sets as requirements change, and it's highly normalized because it's not used directly by the site. After the switchover, additional Redis nodes were added, not because Redis was overworked, but because the network cards couldn't keep up with Redis 😀

Lists are stored in sorted sets, and MySQL is used as the source to rebuild them when needed. Pipelining makes Redis faster, and the append-only file (AOF) is an efficient strategy to easily back up data.
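
I don't know their actual data model, but a minimal redis-py sketch of the sorted-set-plus-pipelining idea (key and member names are invented) could look like this:

import redis

r = redis.Redis(host="localhost", port=6379)

# Rebuild a ranking from the source of truth (MySQL in their case, a plain
# list here), batching all the writes in a single pipeline round trip.
videos_from_mysql = [("video:1", 15000), ("video:2", 98000), ("video:3", 42000)]

pipe = r.pipeline()
pipe.delete("top_videos")
for video_id, views in videos_from_mysql:
    pipe.zadd("top_videos", {video_id: views})  # sorted set: member -> score
pipe.execute()

# Reads then come straight from Redis: the two most viewed videos.
print(r.zrevrange("top_videos", 0, 1, withscores=True))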

In the end, YouPorn uses a LAMP stack "on steroids" which smartly uses Redis and other modern middleware.

Sources: