About a month ago I wrote about how cool it was to migrate to HHVM on OpenShift. The custom cartridge for Nginx + HHVM and MariaDB was running fast and I was really excited about the new stack. I was landing in a new, beautiful world.

About a week later I faced some problems with disk space. My files took only 560 MB out of 1 GB, but the OpenShift shell gave me an error for 100% disk usage (and a nice blank page on the home page because the cache couldn't be written). I wasn't able to understand why it was giving me that error. It probably depended on log files written by the custom cartridge somewhere else in the filesystem. No idea. Anyway, I had no time to dig deeper, so I bought 1 more GB of storage.
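If you hit something similar, a quick du from your home directory usually points at the culprit. This is a generic check, not an OpenShift-specific one, and the path is whatever you want to inspect:

# list the biggest directories first
du -sh * | sort -rh | head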

The day after I bought the storage, blog speed went down. It was almost impossible to open the blog, and CloudFlare returned a timeout for half of the requests. Visits started to fall and I had no idea how to fix it. Some weeks later I discovered some troubles with My Corderwall Badges and Simple Sharer Button Adder but, in the OpenShift environment, I had no external caching system to handle this kind of problem.

I didn't want to go back to MySQL and Apache, but trashing all my articles wasn't fun either, so I chose something I had rejected 3 years ago: a standalone server.


My first choice was Scaleway. It's trendy and bare metal: €3.5 for a server with a 4-core ARM CPU (a very hipster choice), 2 GB of RAM and a 50 GB SSD. The new interface is cool, better than Linode and DigitalOcean, and servers and resources are easy to manage. Unfortunately HHVM is still experimental on ARM, and the SSDs are on a SAN and aren't that fast (100 MB/s).

My next choice was OVH. The new 2016 VPS SSD plans (available in Canadian datacenters) are cheap enough ($3.5) and offer a virtual Xeon core with 2 GB of RAM and a 10 GB SSD. Multicore performance is lower and you get a lot less storage, but it's an x86-64 architecture and the SSD is faster (250 MB/s). I took this one!
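If you want to check raw disk throughput yourself, a rough sequential-write test with dd is enough to spot the difference between these offers. The file path and size below are just placeholders:

# write 1 GB with a final fdatasync and read the reported MB/s
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
rm /tmp/ddtest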

Unfortunately, my preferences haven't changed since my first post. I'm still a developer, not a sysadmin. I'm no master of Linux configuration, my stack has several moving parts, and my blog was still unavailable. My beautiful migration to cutting-edge technologies became an emergency landing.

Luckily I found several online tutorials explaining how to master the WordPress stack. Over the next few days I completed the migration, and my new stack now runs on Pound, Varnish, Nginx, HHVM, PHP-FPM and MariaDB. I hope to have enough time in the coming days to publish all the useful configuration stuff I used.

For the moment, I'm proud to share the average response time of the home page: 342 ms 🙂
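If you want to measure something similar on your own site, timing a few requests with curl is enough for a quick check (the URL is obviously a placeholder):

# total time for a single request
curl -s -o /dev/null -w "total: %{time_total}s\n" https://example.com/

# repeat 10 times to eyeball an average
for i in $(seq 1 10); do curl -s -o /dev/null -w "%{time_total}\n" https://example.com/; done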

YouPorn is one of the most visited porn sites on the web. I applied for a job there as a developer in 2009 because of a post about its technology stack.

I heard about its infrastructure again because, about a year ago, @ErikPickupYP spoke about the great switchover at ConFoo and @antirez tweeted some details about the datastore. The development team rewrote the entire site using Redis as the primary database.

The original stack was based on Perl and Catalyst and powered the site from 2006 to 2011. After the acquisition, they rewrote the site using a well-designed LAMP stack.

The chosen framework is Symfony2 (which uses Doctrine as its ORM) running on nginx with PHP-FPM, helped by Varnish (to speed up requests, manage cache and check server status) and HAProxy (for load balancing and health checks). Syslog-ng handles logs. They maintain two pools of servers: a write pool with failover to a backup master and a read pool with all servers except the master.

The datastore is the most interesting part. Initially they used MySQL, but more than 200 million pageviews and 300K queries per second are too much to handle with MySQL alone. The first attempt was to add ActiveMQ to enqueue writes, but a separate Java infrastructure was too expensive to maintain. Finally they put Redis in front of MySQL and use it as the main datastore.

Now all reads come from Redis. MySQL is used to build new sorted sets as requirements change, and it's highly normalized because it's not used directly by the site. After the switchover, additional Redis nodes were added, not because Redis was overworked, but because the network cards couldn't keep up with Redis 😀

Lists are stored in sorted sets, and MySQL is used as the source to rebuild them when needed. Pipelining makes Redis faster, and the append-only file (AOF) is an efficient strategy for easy data backup.
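Just to give an idea of how a sorted-set-based listing might work, here is a minimal sketch with redis-cli. Key names, scores and the generate-zadds.sh script are my own invention, not YouPorn's actual schema:

# add a few videos to a ranking, scored by view count
redis-cli ZADD videos:most_viewed 15000 video:42 9800 video:77 22000 video:13

# read the top 10 for a listing page
redis-cli ZREVRANGE videos:most_viewed 0 9 WITHSCORES

# bulk rebuilds from MySQL can be fed through pipe mode for speed
# (generate-zadds.sh would emit ZADD commands in the Redis protocol format)
./generate-zadds.sh | redis-cli --pipe

# AOF persistence can be enabled at runtime
redis-cli CONFIG SET appendonly yes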

In the end, YouPorn runs a LAMP stack "on steroids" that makes smart use of Redis and other modern middleware.


Facebook has one of the biggest hardware infrastructures in the world, with more than 60,000 servers. But servers are worthless if you can't use them efficiently. So how does Facebook work?

Facebook was born in 2004 on a LAMP stack: a PHP website with a MySQL datastore. It's now powered by a complex set of interconnected systems grown on top of the original stack. The main components are:

  1. Frontend
  2. Main persistence layer
  3. Data warehouse layer
  4. Facebook Images
  5. Facebook Messages and Chat
  6. Facebook Search

1. Frontend

The frontend is written in PHP and compiled using HipHop (open sourced on GitHub) and g++. HipHop converts PHP into C++ that you can compile. The Facebook frontend is a single 1.5 GB binary deployed using BitTorrent. It provides the logic and the template system. Some parts are external and use the HipHop Interpreter (for development and debugging) and the HipHop Virtual Machine to run HipHop "bytecode".

It's not clear which web server they are using. Apache was used in the beginning and is probably still used for the main frontend. In 2009 they acquired (and open sourced) Tornado from FriendFeed and use it for the real-time update features of the frontend. Static content (such as frontend images) is served through Akamai.

Everything not in the main binary (or in the bytecode source) is served using the Thrift protocol. Java services run without Tomcat or Jetty: the overhead of these application servers isn't worth it at Facebook's scale. Varnish is used to accelerate responses to HTTP requests.

To speed up frontend rendering they use BigPipe, a sort of mod_pagespeed. After an HTTP request, servers fetch data and build the HTML structure. The HTML skeleton is sent before data retrieval and is rendered by the browser. After rendering, JavaScript retrieves and displays the data (already available) asynchronously.

The Memcached infrastructure is based on more than 800 servers with 28 TB of memory. They use a modded version of Memcached that reimplements the connection pool buffer (and relies on some mods to the Linux network layer) to better manage hundreds of thousands of connections.


2. Main persistence layer

Facebook relies on MySQL as its main datastore. I know, I can't believe it either…

They have one of the largest MySQL cluster deployments in the world and use the standard InnoDB engine. Like many others, they designed the system to easily handle MySQL failures; they simply enforce backups. Recently Facebook engineers published a post about their 3-level backup stack:

  • Stage 1: each node of the cluster (the whole replica set) has two rack backups: one holds the binary logs (to speed up slave promotion) and one holds a mysqldump snapshot taken every night (to handle node failure).
  • Stage 2: every binlog and backup is copied to a larger Hadoop cluster mapped using HDFS. This layer provides a fast and distributed source of data, useful to restore nodes even in different locations (see the sketch after this list).
  • Stage 3: provides long-term storage by copying data from the Hadoop cluster to discrete storage in a separate region.
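Conceptually, the first two stages boil down to something like this. Paths, host and database names are placeholders, not Facebook's actual tooling:

# Stage 1 (sketch): nightly logical snapshot of a node, stored on a rack backup host
mysqldump --single-transaction --all-databases | gzip > /backups/nightly/db42-$(date +%F).sql.gz

# Stage 2 (sketch): copy snapshots and binlogs into HDFS for distributed, off-node storage
hdfs dfs -put /backups/nightly/db42-$(date +%F).sql.gz /mysql-backups/db42/
hdfs dfs -put /var/lib/mysql/binlog.000123 /mysql-binlogs/db42/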

Facebook collects several statistics about failures, traffic and backups. These statistics contribute to building a "score" for each node. If a node fails or its score drops too low, an automatic system restores it from Stage 1 or Stage 2. They don't try to avoid problems, they simply handle them as fast as possible.


3. Data warehouse layer

MySQL data is also moved to "cold" storage to be analyzed. This storage is based on Hadoop HDFS (which leverages HBase) and is queried using Hive. Data such as logs, clicks and feeds transit through Scribe and are aggregated and stored in Hadoop HDFS using Scribe-HDFS.
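Just to give an idea of what querying this layer looks like, here is a trivial Hive query run from the CLI. The table and column names are invented for the example:

# count one day of click events per country (hypothetical table and columns)
hive -e "SELECT country, COUNT(*) AS clicks FROM click_events WHERE dt = '2013-04-01' GROUP BY country;"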

The standard Hadoop HDFS implementation has some limitations. Facebook adds a different kind of NameNode called the AvatarNode, which uses Apache ZooKeeper to handle node failure. They also add a RaidNode that uses XOR-parity files to decrease storage usage.


Analysis is done using MapReduce. However, analyzing several petabytes of data is hard, and the standard Hadoop MapReduce framework became a limit in 2011. So they developed Corona, a new scheduling framework that separates cluster resource management from job coordination. Results are stored in Oracle RAC (Real Application Cluster).


In the next post I'm going to analyze the other Facebook services: Images, Messages and Search.

Many high-traffic web applications take advantage of caching systems. HTML caching is easy and powerful. IMHO the best solution is serving it using something like nginx and Varnish. Many people use custom solutions (for example a WordPress plugin) which produce an HTML snapshot of the page and save it to disk.

If your website is quite large, in a flash you get a huge amount of small files stored on your disk. Cleaning the cache is not as easy as you would hope.
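Before cleaning anything, it's worth measuring how big the problem is. The path here is just an example:

# number of cached files and total space used
find /var/www/cache -type f | wc -l
du -sh /var/www/cache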

The first solution is to use:

rm -Rf [path]

But this is dangerous: the IO wait is terrible (even with an SSD) and the load average of your machine rises up to 50 in a minute.

find can help you:

find [path] -type f -print -delete

Unfortunately, after a while, the load average rises again. The rise is slower, but it's still dangerous.

The solution is to use ionice. It enables you to limit the I/O priority of your process and avoid the load average spike.

ionice -c 3 find [path] -type f -print -delete
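Class 3 is the "idle" scheduling class, so the deletion only gets disk time when nothing else needs it. If the machine is also CPU-bound you can combine it with nice; the path here is a placeholder:

# idle I/O priority plus lowest CPU priority for the cache purge
ionice -c 3 nice -n 19 find /var/www/cache -type f -print -delete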

Thanks to @dani_viga for the tip! He saved my day 🙂

More about ionice