How it works: VKontakte

vkontakte_logo

Informations about VKontakte, the largest european social network, and its infrastructure are very few and fragmented. The only recent insights, in english, about its technology is a BTI’s press release which talks about VK migration on their infrastructure. Everything was top secret.

Only on 2011 at Moscow HighLoad++Pavel Durov and Oleg Illarionov told something about the architecture of the social network and insights are collected into this post (in russian). 

VK seems not different than any other popular social network: is over a LAMP stack and uses many other open source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx mange load balancing in front of Apache who runs PHP using mod_php and XCache as opcode cacher.
  • MySQL is the main datastore but a custom DBMS (written using C and based on memcached protocol) is used for some magics. memcached helps also page caching.
  • XMPP is used for messages and chats and runs over node.js. Availability is granted by HAProxy who handle the node’s fragility.
  • Multimedia files are stored using xfs and media encoding is made using ffmpeg.
  • Everything is distributed over more than 4 datacenters

vk_logoThe main difference betweek VK and other social network is about server functions: VK servers are multifunctional. There is no clear distinction between database servers or file servers, they are used simultaneously in several roles.

Load balancing between servers occurs on a layered circuit which includes at balancing DNS, as well as routing requests within the system, wherein the different servers are used for different types of requests. 

For example, microblogging is working on a tricky circuit using memcached protocol capability for parallel sending requests for data on a large number of keys. In the absence of data in the cache, the same request is sent to the storage system, and the results are subjected to sorting, filtering and discarding the excess at the level of PHP-code.

The custom database is still a secret and is widely used in VKontakte. Many services use it: private messages, messages on the walls, statuses, search, privacy, friends lists and probably more. It uses a non-relational data model, and most operations are performed in memory. Access interface is an advanced protocol memcached. Specially compiled keys return the results of complex queries. They said is developed “best minds” of Russia.

I wasn’t able to find any other insight about VK infrastructure after this speech. They are like KGB 😀

How it works: Facebook – Part 1

Facebook has one of the biggest hardware infrastructure in the world with more than 60.000 servers. But servers are worthless if you can’t efficiently use them. So how does Facebook works?

Facebook born in 2004 on a LAMP stack. A PHP website with MySQL datastore. Now is powered by a complex set of interconnected systems grown over the original stack. Main components are:

  1. Frontend
  2. Main persistence layer
  3. Data warehouse layer
  4. Facebook Images
  5. Facebook Messages and Chat
  6. Facebook Search

1. Frontend

hiphopThe frontend is written in PHP and compiled using HipHop (open sourced on github) and g++. HipHop converts PHP in C++ you can compile. Facebook frontend is a single 1,5 GB binary file deployed using BitTorrent. It provide logic and template system. Some parts are external and use HipHop Interpreter (to develop and debug) and HipHop Virtual Machine to run HipHop “ByteCode”.

Is not clear which web server they are using. Apache was used in the beginning and probably is still used for main fronted. In 2009 they acquired (and open sourced) Tornado from FriendFeed and are using it for Real-Time updates feature of the frontend. Static contents (such as fronted images) are served through Akamai.

Everything not in the main binary (or in the ByteCode source) is served using using Thrift protocol. Java services are served without Tomcat or Jetty. Overhead required by these application server is worthless for Facebook architecture. Varnish is used to accelerate responses to HTTP requests.

To speed up frontend rendering they use BigPipe, a sort of mod_pagespeed. After a HTTP request servers fetch data and build HTML structure. HTML is sent before data retrieval and is rendered by the browser. After render, Javascript retrieve and display data (already available) in asynchronous way.

memcachedThe Memcached infrastructure is based on more than 800 servers with 28TB of memory. They use a modded version of Memcached which reimplement the connection pool buffer (and rely on some mods on the network layer of Linux) to better manage hundred of thousand of connections.

Insights and Sources:

2. Main Persistence layer

Facebook rely on MySQL as main datastore. I know, I can’t believe it too…

mysql_clusterThey have one of the largest MySQL Cluster deploy in the world and use the standard InnoDB engine. As many other people they designd the system to easìly handle MySQL failure, they simply enforce backup. Recently Facebook Engineers published a post about their 3-level backup stack:

  • Stage 1: each node (all the replica set) of cluster has 2 rack backup. One of these backup binary log (to speed up slave promotion) and one with mysqldump snapshot taken every night (to handle node failure).
  • Stage 2: each binlog and backup are copied in a larger Hadoop cluster mapped using HDFS. This layer provide a fast and distributed source of data useful to restore nodes also in differenti locations.
  • Stage 3: provide a long-term storage copying data from Hadoop cluster to a discrete storage in a separate region.

Facebook collect several statistics about failure, traffic and backups. These statistics contribute to build a “score” of a node. If a node fails or has a score too low an automatic system provide to restore it from Stage 1 or Stage 2. They don’t try to avoid problem, they simply handle them as fast as possible.

Insights and Sources:

3. Data warehouse layer

hadoopMySQL data is also moved to a “cold” storage to be analyzed. This storage is based on Hadoop HDFS (which leverage on HBase) and is queried using Hive. Data such as logging, clicks and feeds transit using Scribe and are aggregating and stored in Hadoop HDFS using Scribe-HDFS.

Standard Hadoop HDFS implementation has some limitations. Facebook adds a different kind of NameNodes called AvatarNodes which use Apache ZooKeeper to handle node failure. They also adds RaidNode that use XOR-parity files to decrease storage usage.

hadoop_cluster

Analysis are made using MapReduce. Anyway analyze several petabytes of data is hard and standard Hadoop MapReduce framework became a limit on 2011. So they developed Corona, a new scheduling framework that separates cluster resource management from job coordination. Results are stored into Oracle RAC (Real Application Cluster).

Insights and Sources:

In the next post I’m going to analyze the other Facebook services: Images, Message and Search.