How it works: DataSift – PHP details

A few hours after I posted about the DataSift architecture, @choult, one of the roughly 25 ninjas who develop the DataSift platform, tweeted me.

The following SlideShare presentation by @stuherbert, another ninja, talks about the use of PHP in DataSift. Contrary to what you may think, PHP is widely used for data processing.

datasift_repo_languges

The system breaks down into three major data pipelines:

  • Data Archiving (adds new data to the Historic Archive)
  • Filtering Pipeline (filters and delivers data in realtime)
  • Playback Pipeline (filters and delivers data from the Historic Archive)

And PHP is used for many parts of these.

datasift_php

They use a custom build of PHP 5.3.latest with several optimizations and compiled-in extensions (ZeroMQ, APC, XHProf, Redis, XDebug). They also develop some internal components:

  • Frink, TweetMeme's framework
  • Stone, the foundation of their in-house test tools, Hornet and Storyteller (they have probably open-sourced a fork named Storyplayer)

Unfortunately I wasn’t able to find more details about these. Anyway, here is the presentation:

[slideshare id=17557205&doc=phpandthefirehose2013-130323121454-phpapp01]

How it works: DataSift

datasift_logo

DataSift, as they say on their home page, "aggregates, processes and delivers social data". It is one of the oldest Twitter certified partners and offers data coming from almost every existing social network. I use it every day to "listen" to the net and find the data I need for my analyses.

It's impressive to watch how fast they collect data from external sources and deliver it to your chosen destination. A couple of minutes after I tweet, a JSON file lands in my S3 bucket.

Creating Internet-scale filtering is not easy. Their infrastructure is really complex and heavily optimized. This is a 2011 diagram of their workflow.

datasift_infrastructure

Twitter generates more than 500 million tweets per day, and it is only one of the available sources. The DataSift system performs 250+ million sentiment analyses with sub-100ms latency, and several TB of augmented data (including gender, sentiment, etc.) transit the platform daily. Data filtering nodes can process up to 10,000 unique streams and can do data lookups against 10,000,000+ username lists in real time. Link augmentation performs 27 million link resolves and lookups plus 15+ million full web page aggregations per day.

C++ is used for the performance-critical components, like the core filtering engine, while PHP powers the site, the external API server, most of the internal web services, and a custom-built, high-performance job queue manager. Java/Scala are used for batch processing with HBase and MapReduce jobs. Kafka serves as the queuing system, and Ruby is used for deploys and provisioning. Thrift is widely used.

MySQL (Percona Server) on SSD drives is the primary storage, an HBase cluster over more than 30 Hadoop nodes provides a place to store historical data, and Memcached and Redis are used for caching.

Here is a schema of the processing unit that builds the historical database.

datasift_historical

Message queues are another critical component of the infrastructure. ZeroMQ (a custom build from the latest alpha branch, with some stability fixes, chosen for its publisher-side filtering) is used in different configurations:

  • PUB-SUB for replication / message broadcasting;
  • PUSH-PULL for round-robin workload distribution;
  • REQ-REP for health checks of different components.
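To give a flavor of these patterns in PHP, here is a minimal PUB-SUB sketch using the php-zmq extension mentioned earlier. The endpoint and topic are invented, and this is of course not DataSift's actual code.

```php
<?php
// Minimal PUB-SUB sketch with the php-zmq extension.
// Endpoint and topic are invented; this is not DataSift's code.
$context = new ZMQContext();

// Publisher side: broadcasts messages under a topic prefix.
$pub = $context->getSocket(ZMQ::SOCKET_PUB);
$pub->bind('tcp://*:5556');

// Subscriber side (normally a separate process). On ZeroMQ 3.x the
// subscription prefix is filtered on the publisher, so non-matching
// messages never cross the wire -- the reason for the custom build.
$sub = $context->getSocket(ZMQ::SOCKET_SUB);
$sub->connect('tcp://localhost:5556');
$sub->setSockOpt(ZMQ::SOCKOPT_SUBSCRIBE, 'tweets');

usleep(100000); // let the subscription propagate (slow-joiner caveat)

$pub->send('tweets {"id":1,"text":"hello"}');
echo $sub->recv(), PHP_EOL; // tweets {"id":1,"text":"hello"}
```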

Kafka is used for high-performance persistent queues. In both cases they are working with the developers, contributing bug reports, traces, fixes, and client libraries.

All code is pulled from the repo by Jenkins every 5 minutes, automatically tested and verified with several QA tools, packaged as an RPM and moved to the dev package repo. Chef is used to automate deployments and manage configuration. All services emit StatsD events, which are combined with other system-level checks, added to Zenoss and displayed with Graphite.
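Emitting a StatsD event is trivial, because the protocol is a plain-text UDP datagram of the form bucket:value|type. A minimal sketch in PHP (the metric name and host are invented):

```php
<?php
// Minimal StatsD counter emitter. The protocol is plain text over UDP:
// "<bucket>:<value>|<type>". Metric name and host are invented.
function statsd_increment($bucket, $host = '127.0.0.1')
{
    $socket = @fsockopen("udp://{$host}", 8125, $errno, $errstr);
    if ($socket === false) {
        return; // stats are best-effort; never break the app for them
    }
    fwrite($socket, "{$bucket}:1|c"); // "c" means counter
    fclose($socket);
}

statsd_increment('deploys.packages.built');
```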

The biggest challenge IMHO is filtering. Filtering at this scale requires a different approach. They started with work they did at TweetMeme. The core filter engine is in C++ and is called the Pickle Matrix. Over three years they’ve developed a compiler and their own virtual machine. We don’t know what their technology is exactly, but it might be something like Distributed Complex Event Processing with Query Rewriting.

Sources

Almost all the content of this post comes from the wonderful article "DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second" posted on HighScalability. Some details also come from "Historical Architecture – Data Mining Billions of Tweets" on the DataSift blog.

How it works: VKontakte

vkontakte_logo

Information about VKontakte, the largest European social network, and its infrastructure is scarce and fragmented. The only recent insight in English about its technology is a press release by BTI about VK's migration onto their infrastructure. Everything else was top secret.

Only in 2011, at HighLoad++ in Moscow, did Pavel Durov and Oleg Illarionov say something about the architecture of the social network; the insights are collected in this post (in Russian).

VK doesn't seem much different from any other popular social network: it runs on a LAMP stack and uses many other open-source technologies.

  • Debian is the base for their custom Linux distro.
  • nginx manages load balancing in front of Apache, which runs PHP using mod_php with XCache as the opcode cacher.
  • MySQL is the main datastore, but a custom DBMS (written in C and based on the memcached protocol) is used for some magic. memcached also helps with page caching.
  • XMPP, running over node.js, is used for messages and chat. Availability is ensured by HAProxy, which handles the nodes' fragility.
  • Multimedia files are stored on xfs, and media encoding is done with ffmpeg.
  • Everything is distributed over more than 4 datacenters.

vk_logo

The main difference between VK and other social networks concerns server roles: VK servers are multifunctional. There is no clear distinction between database servers and file servers; they are used simultaneously in several roles.

Load balancing between servers happens in layers: it starts with DNS balancing and continues with request routing inside the system, where different types of requests are handled by different servers.

For example, microblogging works through a tricky scheme that exploits the memcached protocol's ability to request data for a large number of keys in parallel. When the data is not in the cache, the same request is sent to the storage system, and the results are sorted, filtered, and trimmed of excess at the PHP level.
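That pattern is easy to sketch with PHP's Memcached client. The key scheme, the IDs and the storage fallback below are invented for illustration:

```php
<?php
// Sketch of the parallel multi-get pattern described above.
// Key scheme, IDs and fetchFromStorage() are hypothetical.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$itemIds = [101, 102, 103]; // e.g. post IDs from a timeline index
$keys = array_map(function ($id) { return "feed:item:{$id}"; }, $itemIds);

// A single request fetches many keys in parallel.
$items = $mc->getMulti($keys) ?: [];

// Cache misses fall through to the storage system.
foreach (array_diff($keys, array_keys($items)) as $key) {
    $items[$key] = fetchFromStorage($key); // hypothetical helper
}

// Sorting, filtering and trimming happen at the PHP level.
usort($items, function ($a, $b) {
    return $b['created_at'] - $a['created_at']; // newest first (timestamps)
});
$feed = array_slice($items, 0, 20);
```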

The custom database is still a secret and is widely used inside VKontakte. Many services rely on it: private messages, wall posts, statuses, search, privacy, friend lists and probably more. It uses a non-relational data model, and most operations are performed in memory. The access interface is an extended memcached protocol: specially crafted keys return the results of complex queries. It is said to have been developed by the "best minds" of Russia.
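Since the engine speaks the memcached protocol, any memcached client can query it, with the "query" encoded in the key itself. A purely hypothetical illustration (the real key grammar is not public):

```php
<?php
// Hypothetical "query by key" over the memcached protocol.
// Host and key grammar are invented; VK's real ones are not public.
$db = new Memcached();
$db->addServer('custom-db.internal', 11211);

// A specially crafted key encodes a complex query and its parameters.
$onlineFriends = $db->get('friends:online:user:42:limit:20');
```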

I wasn't able to find any other insight into VK's infrastructure after this talk. They are like the KGB 😀

A week with Jawbone UP

A couple of weeks ago @lastknight lent me his Jawbone UP because I want to buy an activity tracker to log my sleep "performance" and my workouts, and I was undecided between the Jawbone UP24 and the Fitbit Force.

The UP is really interesting from a technical point of view. It is a small set of sensors (mostly accelerometers) with a battery that lasts up to 10 days.

jawbone_up_tech

jawbone_up_app_logo

First impressions are good: it is light enough that you don't notice it while wearing it. You can track steps, sleep and a large set of activities (which give a different meaning to the step data). There is no web interface yet; the only way to sync your device and view your data is your smartphone.

I started with the new 3.0 version of the app, so I don't know what the previous versions were like.

I added my friends @gigisommo, @luka_bernardi and @dieguitoweb and started to track all my activities. The dashboard shows your everyday progress:

IMG_6799

And there are details about walking, sleep and workouts:

IMG_6797

IMG_6796

IMG_6798

After a week I'm quite happy with this device. Step counts are probably overestimated and sleep tracking is not perfect, but the data seems close enough to reality to be useful. Since I started testing it I have tried to sleep more and walk more, and challenging my friends was really fun. I really miss a web interface; it would be the best addition to this ecosystem.

Anyway, I'm probably going to buy an UP24 in January. If you are part of this network, add me as a friend 😉

Are three deploy environments enough?

When I used Rails for the first time I was impressed by the use of multiple environments to handle different setups, policies and behaviors of an application. A few years ago the use of environments wasn't so common, and switching between development, test and production was an innovation for me.

Anyway, big projects built on custom frameworks introduced this structure several years earlier. I have had the pleasure of working on a lot of legacy code implementing different environment setups. For example, the classified ads channel of Repubblica.it (built by my mentor @FabiolousMate) uses dev, demo and production. Other projects I worked on use staging. After listening to a lot of opinions, I asked myself which environments are the most important, and whether three are enough.

I'm mostly a Ruby developer, and I know the Rails ecosystem, which uses 3 basic environments:

  • development is used while you code. Source code is reloaded on each request. Logging is EXTREMELY verbose. Libraries include debugging and error-reporting features. The database is full of garbage data.
  • test is for automated testing. Data is loaded and cleaned automatically every time you run the tests. Everything can be mocked (database, APIs, external services, …). Libraries include testing frameworks, and the log contains only test output.
  • production is meant to be safe. Logging is for errors only. Sometimes there is a caching layer. Libraries are loaded once. Data is replicated. Everything is set up to improve both performance and robustness.

These environments are really useful for managing application development. Unfortunately, they are not enough to handle every situation. For example, production is not appropriate for testing new features, because of its sparse logging and heavy optimization (and its precious production data), nor for demo purposes, because it has to be used by customers. Development is likewise not appropriate for finding bottlenecks, because of its messy data and debug code.

In my experience, I usually add three more environments to my applications, trying to cover every situation. In most cases these are enough:

  • staging is for deep testing of new features. Production data with development logging and libraries. It lets you test the side effects of your new features on real-world data. If a change works here, it will probably work in production too.
  • demo is for showtime. A production environment with sandboxed features and demo data. You can open this environment to anyone, and they can play with whatever they want without danger.
  • profile is for finding bottlenecks. A development environment with specific libraries for profiling and fine-tuning your processes. You can shape the data to stress your system without worrying about data coherence.
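To make the setup concrete, here is a minimal sketch of six-environment switching in plain PHP (the idea is framework-agnostic; Rails does the equivalent with config/environments). The file layout and settings are invented:

```php
<?php
// Minimal six-environment switch in plain PHP -- a framework-agnostic
// sketch. File layout and settings below are invented.
$env = getenv('APP_ENV') ?: 'development';

$allowed = ['development', 'test', 'production', 'staging', 'demo', 'profile'];
if (!in_array($env, $allowed, true)) {
    throw new RuntimeException("Unknown environment: {$env}");
}

// Each environment gets its own config file, e.g. config/staging.php,
// returning an array of settings (log level, DB credentials, caching...).
$config = require __DIR__ . "/config/{$env}.php";

// Environment-driven behavior: verbose logging everywhere except
// production and demo; the profiler only in the profile environment.
define('VERBOSE_LOG', !in_array($env, ['production', 'demo'], true));
define('PROFILER_ON', $env === 'profile');
```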

This is, IMHO, a good setup for your deploy environments. Depending on the project some of these may not be useful, but in a large project each one can save your life.