
During the last year I refined my RSS collection about big data, data science and analytics. I usually check it every day to discover a ton of new cool technologies and have fun. Here is the updated list.

Bloggers

News about emerging technologies, scalability and data

Data companies, social networks and search engines

Companies supporting and distributing big-data processing products

Recently I discovered the awesome data science list, which contains a list of interesting bloggers I haven’t had time to check yet. You can surely find something more in it. I’ll try to publish an update once I’ve checked it.

[UPDATE 2014-09-22 11:35]

Thanks to @onurakpolat for correcting my link to the awesome data science list. The previous link pointed to his fork; the original repo is https://github.com/okulbilisim/awesome-datascience by @okulbilisim

A couple of weeks ago I was playing with Hack and looking for online resources. On YouTube I found the playlist of the Hack Dev Day 2014, where Hack was officially presented to the world.

The introduction to the language by Julien Verlaguet is really interesting: it shows the advantages of static typing and how HHVM is able to preserve the rapid development cycle of PHP.

Josh Watzman’s talk is also interesting. He explains how to convert PHP code to Hack code, and his years of experience at Facebook are extremely useful.

Other talks in the conference cover how to run HHVM on Heroku, give an overview of the libraries and common use cases of Hack, and discuss HHVM’s strong optimizations.

If you are playing with Hack I absolutely recommend these videos.

HipHop was one of the most notable things to come out of Facebook’s labs for PHP development. PHP is slow and limited. They can’t rewrite their entire codebase, so they decided to make PHP better. HipHop is simply a PHP to C++ compiler (HPHPc). Converted code is compiled into a binary, and the performance improvement is about 6x.

Unfortunately HipHop has several downsides. For all the performance gains that HPHPc provided, the curve for further performance improvements had flattened. HPHPc did not fully support the PHP language, including the create_function() and eval() constructs. HPHPc also required a very different push process: a larger-than-1 GB binary had to be compiled and distributed to many machines in short order.

To overcome these problems Facebook has been developing HHVM, a PHP virtual machine, since early 2010. HHVM builds on top of HPHPc, using the same runtime and extension function implementations. HHVM converts PHP code into a high-level bytecode. This bytecode is then translated into x64 machine code dynamically at runtime by a just-in-time (JIT) compiler, similar to C#/CLR or Java/JVM.

Facebook also released Hack, a programming language for HHVM that can be seen as a new version of PHP which allows programmers to use both dynamic and static typing.

HHVM supports major PHP open source projects like WordPress. Running this project on it seems really easy. A small modification used to be needed, but the latest version (3.9) no longer needs it. HHVM can also run on Heroku using a custom buildpack available here: https://github.com/hhvm/heroku-buildpack-hhvm.

My first experiment was to run WordPress on Heroku using HHVM. The first step is to create a Heroku app using the HHVM buildpack:

heroku create --buildpack https://github.com/hhvm/heroku-buildpack-hhvm

Then you can deploy a standard WordPress installation, adding the following config.hdf (the HHVM configuration file):

Server {
  DefaultDocument = index.php
}
Eval {
  Jit = true
}
VirtualHost {
  * {
    Pattern = .*
    RewriteRules {
      dirindex {
        pattern = ^/(.*)/$
        to = $1/index.php
        qsa = true
      }
    }
  }
}
StaticFile {
  FilesMatch {
    * {
      pattern = .*.(dll|exe)
      headers {
        * = Content-Disposition: attachment
      }
    }
  }
  Extensions {
    css = text/css
    gif = image/gif
    html = text/html
    jpe = image/jpeg
    jpeg = image/jpeg
    jpg = image/jpeg
    png = image/png
    tif = image/tiff
    tiff = image/tiff
    txt = text/plain
  }
}

Warning: don’t miss the newline character on the last line, or the linter will fail and you are going to hate this project 😉

Everything works fine. You can add your favorite hosted MySQL service and run the WordPress 5-minute installation. Almost every plugin seems 100% compatible: I tested the most popular ones with no problems. Performance is better and you also have the opportunity to use Hack to develop new custom plugins.

Now I’m curious about how HHVM can improve my production installations of WordPress. For this I’m looking for an OpenShift cartridge for HHVM, or for someone who wants to collaborate on creating a new one (the only one I found on GitHub seems “young”). Anyone interested? Let me know!

Recently I had to analyze interactions on a Facebook page. I needed to fetch all the content from the stream and analyze user actions. Retrieving the interaction count for each post can be hard because the Facebook APIs are like hell: they change very fast, return a lot of errors, have understandable limits and give you many headaches.

Anyway, after a lot of attempts I found a way to fetch quantitative information about posts and photos on the stream. First of all you need the content.

Get the contents

The Graph endpoint is https://graph.facebook.com/. You can fetch page data (I use the BBCNews page as an example) at:

https://graph.facebook.com/bbcnews/posts?access_token=your_access_token

There are different ways to get a valid access token, and Facebook lets you choose among many different kinds of access tokens, each one with a different rate limit.

The returned data is a JSON array of elements. Each element has a lot of properties that describe items on the timeline. The elements included in the stream have just a subset of these properties (last comments, last likes, some counters). Here you can find text content, pictures and links. To get more data you need three more properties: id, type and object_id.

Status updates are identified by type “status”, photos by type “photo” and videos by type “video”. The id field is used as the identifier of the entry on the stream. The object_id instead is used to identify the object inside the Facebook graph.
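As an example, here is a minimal PHP sketch (mine, not from the Facebook docs) that fetches the latest posts of a page and prints these three properties. It assumes allow_url_fopen is enabled and that the access token is valid.

<?php
// Minimal sketch: fetch the latest posts of a page and print id, type and object_id.
$accessToken = 'your_access_token';
$url = 'https://graph.facebook.com/bbcnews/posts?access_token=' . $accessToken;
$feed = json_decode(file_get_contents($url), true);
foreach ($feed['data'] as $post) {
    // object_id is present only for photos and videos
    $objectId = isset($post['object_id']) ? $post['object_id'] : '-';
    echo $post['id'], ' ', $post['type'], ' ', $objectId, PHP_EOL;
}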

Actions: comments, likes and shares

Comments are returned paginated, and sometimes the API doesn’t return the entire list. To get the total count you need to specify the parameter summary=true.

https://graph.facebook.com/228735667216_10151700273382217/comments?summary=true&access_token=your_access_token

At the end of the response you can find additional information about the comments feed; total_count holds the count.

"summary": {
"order": "ranked",
"total_count": 100
}

Likes are similar to comments. They have similar limitations and a similar endpoint to retrieve data, with the same parameter summary=true.

https://graph.facebook.com/228735667216_10151700273382217/likes?summary=true&access_token=your_access_token

This time the summary shows only the total count:

"summary": {
"total_count": 949
}

The shares count can be found as part of the object detail.

https://graph.facebook.com/228735667216_10151700273382217/?access_token=your_access_token

After the created and updated dates you find the shares property:

"shares": {
"count": 238
}
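Putting the three calls together, a minimal PHP sketch that reads the counters of a single post could look like this (illustrative only; buildUrl() is just a hypothetical helper of mine, not part of any Facebook SDK):

<?php
// Illustrative sketch: read comment, like and share counters for one post.
function buildUrl($path, array $params) {
    return 'https://graph.facebook.com/' . $path . '?' . http_build_query($params);
}
$postId = '228735667216_10151700273382217';
$auth = ['access_token' => 'your_access_token'];
$comments = json_decode(file_get_contents(buildUrl($postId . '/comments', $auth + ['summary' => 'true'])), true);
$likes = json_decode(file_get_contents(buildUrl($postId . '/likes', $auth + ['summary' => 'true'])), true);
$post = json_decode(file_get_contents(buildUrl($postId, $auth)), true);
echo 'comments: ', $comments['summary']['total_count'], PHP_EOL;
echo 'likes: ', $likes['summary']['total_count'], PHP_EOL;
// posts with no shares may omit the property entirely
echo 'shares: ', isset($post['shares']['count']) ? $post['shares']['count'] : 0, PHP_EOL;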

Convert the object_id

Depending on your data feed, sometimes the id is not available and you have to handle the object_id. To be able to use the previous methods you need to query the Facebook database using FQL, looking for the story_id.

SELECT page_story_id
FROM photo
WHERE object_id = '10151700273362217'

https://api.facebook.com/method/fql.query?format=json&access_token=your_access_token&query=SELECT%20page_story_id%20FROM%20photo%20WHERE%20object_id%20%3D%20%2710151700273362217%27

The result is the page_story_id (the id of the post on the feed) of the object.

"data": [
{
"page_story_id": "228735667216_10151700273382217"
}
]

Now you can use this to retrieve counters and data.
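The same conversion can be scripted; here is a minimal PHP sketch (mine, assuming the fql.query endpoint shown above):

<?php
// Illustrative sketch: resolve an object_id to its page_story_id via FQL.
$objectId = '10151700273362217';
$url = 'https://api.facebook.com/method/fql.query?' . http_build_query([
    'format'       => 'json',
    'access_token' => 'your_access_token',
    'query'        => "SELECT page_story_id FROM photo WHERE object_id = '$objectId'",
]);
$rows = json_decode(file_get_contents($url), true);
// Depending on the endpoint version the rows may be wrapped in a "data" key.
$rows = isset($rows['data']) ? $rows['data'] : $rows;
echo $rows[0]['page_story_id'], PHP_EOL; // e.g. "228735667216_10151700273382217"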

Facebook has one of the biggest hardware infrastructures in the world, with more than 60,000 servers. But servers are worthless if you can’t use them efficiently. So how does Facebook work?

Facebook was born in 2004 on a LAMP stack: a PHP website with a MySQL datastore. Now it is powered by a complex set of interconnected systems that grew on top of the original stack. The main components are:

  1. Frontend
  2. Main persistence layer
  3. Data warehouse layer
  4. Facebook Images
  5. Facebook Messages and Chat
  6. Facebook Search

1. Frontend

The frontend is written in PHP and compiled using HipHop (open sourced on GitHub) and g++. HipHop converts PHP into C++ that you can compile. The Facebook frontend is a single 1.5 GB binary file deployed using BitTorrent. It provides the logic and the template system. Some parts are external and use the HipHop Interpreter (to develop and debug) and the HipHop Virtual Machine to run HipHop “ByteCode”.

It is not clear which web server they are using. Apache was used in the beginning and is probably still used for the main frontend. In 2009 they acquired (and open sourced) Tornado from FriendFeed and are using it for the real-time update features of the frontend. Static content (such as frontend images) is served through Akamai.

Everything not in the main binary (or in the ByteCode source) is served using the Thrift protocol. Java services are served without Tomcat or Jetty: the overhead required by these application servers is not worth it for Facebook’s architecture. Varnish is used to accelerate responses to HTTP requests.

To speed up frontend rendering they use BigPipe, a sort of mod_pagespeed. After an HTTP request, servers fetch data and build the HTML structure. The HTML is sent before data retrieval and is rendered by the browser. After rendering, JavaScript retrieves and displays the data (already available) asynchronously.

The Memcached infrastructure is based on more than 800 servers with 28 TB of memory. They use a modded version of Memcached which reimplements the connection pool buffer (and relies on some mods to the Linux network layer) to better manage hundreds of thousands of connections.

Insights and Sources:

2. Main Persistence layer

Facebook relies on MySQL as its main datastore. I know, I can’t believe it either…

They have one of the largest MySQL cluster deployments in the world and use the standard InnoDB engine. Like many other people, they designed the system to easily handle MySQL failures: they simply enforce backups. Recently Facebook engineers published a post about their 3-level backup stack:

  • Stage 1: each node (the whole replica set) of the cluster has 2 rack backups: one with the binary log (to speed up slave promotion) and one with a mysqldump snapshot taken every night (to handle node failure).
  • Stage 2: each binlog and backup is copied to a larger Hadoop cluster mapped using HDFS. This layer provides a fast and distributed source of data, useful to restore nodes even in different locations.
  • Stage 3: provides long-term storage, copying data from the Hadoop cluster to discrete storage in a separate region.

Facebook collects several statistics about failures, traffic and backups. These statistics contribute to a “score” for each node. If a node fails or its score is too low, an automatic system restores it from Stage 1 or Stage 2. They don’t try to avoid problems, they simply handle them as fast as possible.

Insights and Sources:

3. Data warehouse layer

MySQL data is also moved to a “cold” storage to be analyzed. This storage is based on Hadoop HDFS (which leverages HBase) and is queried using Hive. Data such as logs, clicks and feeds transit through Scribe and are aggregated and stored in Hadoop HDFS using Scribe-HDFS.

The standard Hadoop HDFS implementation has some limitations. Facebook adds a different kind of NameNode, called the AvatarNode, which uses Apache ZooKeeper to handle node failure. They also add a RaidNode that uses XOR-parity files to decrease storage usage.


Analyses are made using MapReduce. Anyway, analyzing several petabytes of data is hard, and the standard Hadoop MapReduce framework became a limit in 2011. So they developed Corona, a new scheduling framework that separates cluster resource management from job coordination. Results are stored in Oracle RAC (Real Application Cluster).

Insights and Sources:

In the next post I’m going to analyze the other Facebook services: Images, Messages and Search.

Recently I helped @olinicola match URLs with a regular expression, in order to find links shared on a Facebook page using the APIs. After several attempts we found one on Daring Fireball with almost perfect coverage of the Facebook behavior (look at the post for a better explanation of how it works).

This is the “extended” version that matches all valid URIs.

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

There is also a “simpler” version which matches only http://*, https://* and www.* links:

(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
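If you want to try it in PHP, here is a quick usage sketch (mine): the pattern is wrapped in ~ delimiters, which never appear in it, and the u modifier is added because the character class contains non-ASCII quote characters.

<?php
// Illustrative sketch: extract URLs from a text using the "simpler" pattern above.
$pattern = '~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)'
         . '(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+'
         . '(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))~u';
$text = 'Read more on www.bbc.co.uk/news or http://daringfireball.net/';
if (preg_match_all($pattern, $text, $matches)) {
    print_r($matches[1]); // the URLs found in the text
}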

Fortunately, both are in the public domain. Many thanks to John Gruber.