I always like The Setup. Discovering what kind of technologies, hardware and software other skilled people are using is extremely useful and really fun for me. This time I'd like to share some tips from the complete reboot of my personal ecosystem after switching to my new MacBook.

macbook_pro_13_retina

On the hardware side it is a simple high-end 2015 MacBook Pro 13″ Retina with a dual-core Intel Core i7 Haswell at 3.4 GHz, 16 GB of RAM and a 1 TB PCI Express 3.0 SSD. It is fast, solid, lightweight and flexible. The only required accessory is the Be.eZ LArobe Second Skin.

On the software side I decided to skip the Time Machine restore in order to set up a completely new environment. I started from a fresh OS X 10.10 Yosemite installation.

As a polyglot developer I usually deal with a lot of different applications, programming languages and tools. To decide what to install, a list of what I had on the previous machine and what else I needed was really useful.

Here is a list of useful software and some tips about the installation process.

Applications

paid_apps

Paid software worth having: Evernote (with a Premium subscription and Skitch) and Todoist (with a Premium subscription), both available on the Mac App Store. 1Password, Fantastical 2, OmniGraffle, Carbon Copy Cloner, Backblaze and ExpanDrive are available on their own websites.

Free software worth having: Google Chrome and Mozilla Firefox as browsers, Apache OpenOffice, Skype and Slack for chat, VLC for multimedia and Transmission for torrents.

app_from_suites

Suites, or parts of them: Adobe Photoshop CC, Adobe Illustrator CC and Adobe Acrobat Pro DC are part of the Adobe Creative Cloud. Microsoft Word 2016 and Microsoft Excel 2016 are part of Microsoft Office 2016 for Mac (now in free preview). Apple Pages and Apple Keynote come preinstalled as part of the Apple iWork suite, as do Apple Calendar and Apple Contacts.

Development tools

Utilities for power users: Caffeine, Growl and HardwareGrowler, iStat Menus Pro, Disk Inventory X, Tor Browser and TrueCrypt 7.1a (you need to fix a little installation bug on OS X 10.10), Kitematic and Boot2Docker for Docker, Sublime Text 3 (with some additions like the Spacegray theme, the Soda theme, a new icon and the Source Code Pro font), Tower, Visual Studio Code, Android SDK (for the Android emulator) and Xcode (for the iOS simulator), VirtualBox (with some useful Linux virtual images), iTerm 2.

CLI: Oh My Zsh, Homebrew, GPG (installed using brew), Xcode Command Line Tools (from the Apple Developer website), Git (with git-flow, installed using brew), AWS CLI (installed via pip), PhantomJS, s3cmd and the faster s4cmd, Heroku Toolbelt and OpenShift Client Tools (installed via gem).

daemons

Servers: MariaDB 10.0 (brew), MongoDB 3.0 (brew), Redis 3.0 (brew), Elasticsearch 1.6 (brew), Nginx 1.8.0 (brew), PostgreSQL 9.4.2 (via Postgres.app), Hadoop 2.7.0 (brew), Spark 1.4 (download from official website), Neo4j 2.2 (brew), Accumulo 1.7.0 (download from official website), Crate 0.49 (download from official website), Mesos 0.22 (download from official website), Riak 2.1.1 (brew), Storm 0.9.5 (download from official website), Zookeeper 3.4.6 (brew), Sphinx 2.2 (brew), Cassandra 2.1.5 (brew).

languages

Programming languages: RVM, Ruby (MRI 2.2, 2.1, 2.0, 1.9.3, 1.8.7, REE 2012.02 and JRuby 1.7.19, installed using RVM), PHP 5.6 with PHP-FPM (installed using brew), HHVM 3.7.2 (installed using brew after adding an additional repo; it has some issues on 10.10), Python 2.7 (brew python) and Python 3.4 (brew python3), Pip 7.1 (shipped with Python), NVM, Node.js 0.12 and io.js 2.3 (both installed using NVM), Go 1.4.2 (from the Golang website), Java 8 JVM (from the Oracle website), Java 8 SE JDK (from the Oracle website), Scala 2.11 (from the Scala website), Clojure 1.6 (from the Clojure website), Erlang 17.0 (brew), Haskell GHC 7.10 (brew), Haskell Cabal 1.22 (brew), OCaml 4.02.1 (brew), R 3.2.1 (from the R for Mac OS X website), .NET Core and ASP.NET (installed using DNVM via brew), GPU Ocelot (compiled with a lot of libraries).

The full reboot took about two days. Some software is still missing, but I was able to resume my work almost completely. I hope this list will be helpful to someone 🙂

After GFS and MapReduce, Google solved big data problems again by designing BigTable: a compressed, high-performance, proprietary data storage system that forms the basis for most of its projects. HBase and Cassandra are inspired by it.


google_logo

Title: Bigtable: A Distributed Storage System for Structured Data (PDF), November 2006
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
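The data model the paper describes is, in short, a sparse, distributed, persistent multi-dimensional sorted map indexed by (row key, column key, timestamp). As a rough, single-machine Python sketch of that abstraction (the class and example values are mine, not Bigtable code):

    # Toy sketch of the Bigtable data model:
    # (row key, column key, timestamp) -> value, with rows kept sorted
    # so prefix scans are cheap. Illustration only, not Bigtable code.
    import time
    from collections import defaultdict

    class TinyTable:
        def __init__(self):
            # row -> column -> list of (timestamp, value), newest first
            self.rows = defaultdict(lambda: defaultdict(list))

        def put(self, row, column, value, ts=None):
            ts = ts if ts is not None else time.time()
            cells = self.rows[row][column]
            cells.append((ts, value))
            cells.sort(key=lambda c: c[0], reverse=True)

        def get(self, row, column, ts=None):
            """Return the newest value at or before `ts` (latest if ts is None)."""
            for cell_ts, value in self.rows[row][column]:
                if ts is None or cell_ts <= ts:
                    return value
            return None

        def scan(self, row_prefix):
            # Rows are kept in lexicographic order, so prefix scans are cheap.
            for row in sorted(self.rows):
                if row.startswith(row_prefix):
                    yield row, dict(self.rows[row])

    # Example in the spirit of the paper: web pages keyed by reversed URL.
    t = TinyTable()
    t.put("com.cnn.www", "contents:", "<html>...</html>")
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
    print(t.get("com.cnn.www", "anchor:cnnsi.com"))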


Check out the list of interesting papers and projects (Github).

Recently I needed to select the best hosted services for some datastores to use in a large and complex project. Starting from Heroku and AppFog's add-ons, I found many free plans useful for testing a service and/or for production use if your app is small enough (for example, this blog runs on Heroku PostgreSQL's Dev plan; a connection sketch follows the list). Here is the list:

MySQL

  • Xeround (Starter plan): 5 connections and 10 MB of storage
  • ClearDB (Ignite plan): 10 connections and 5 MB of storage

MongoDB

  • MongoHQ (Sandbox): 50MB of memory, 512MB of data
  • MongoLab (Starter plan): 496 MB of storage

Redis

  • RedisToGo (Nano plan): 5MB, 1 DB, 10 connections and no backups.
  • RedisCloud by Garantia Data: 20MB, 1 DB, 10 connections and no backups.
  • MyRedis (Gratis plan): 5MB, 1 DB, 3 connections and no backups.

Memcache

CouchDB

  • IrisCouch (up to $5): no limits, usage fees for HTTP requests and storage.
  • Cloudant (Oxygen plan): 150,000 requests, 250 MB of storage.

PostgreSQL – Heroku PostgreSQL (Dev plan): 20 connections, 10,000 rows of data
Cassandra – Cassandra.io (Beta on Heroku): 500 MB and 50 transactions per second
Riak – RiakOn! (Sandbox): 512 MB of memory
Hadoop – Treasure Data (Nano plan): 100 MB (compressed), data retention for 90 days
Neo4j – Heroku Neo4j (Heroku add-on beta): 256 MB of memory and 512 MB of data
OrientDB – NuvolaBase (Free): 100 MB of storage and 100,000 records
TempoDB – TempoDB Hosted (Development plan): 50,000,000 data points, 50 series
JustOneDB – Heroku JustOneDB (Lambda plan): 50 MB of data
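Almost all of these hosted plans expose a single connection URL through an environment variable. As a minimal sketch (the DATABASE_URL convention and the psycopg2 driver are my choice for the example), connecting from Python to something like the Heroku PostgreSQL Dev plan could look like this:

    # Hypothetical sketch: connect to a hosted PostgreSQL plan (e.g. Heroku's
    # Dev plan) using the DATABASE_URL the platform injects at runtime.
    import os
    from urllib.parse import urlparse

    import psycopg2

    url = urlparse(os.environ["DATABASE_URL"])

    conn = psycopg2.connect(
        dbname=url.path.lstrip("/"),
        user=url.username,
        password=url.password,
        host=url.hostname,
        port=url.port,
    )

    with conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone())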

In the previous post I analyzed Facebook's main infrastructure. Now I'm going deeper into its services.

Facebook Images

Facebook is the biggest photo sharing service in the world and grows by several million images every week. The pre-2009 infrastructure used three NFS tiers. Even with some optimizations, that solution couldn't easily scale beyond a few billion images.

So in 2009 Facebook developed Haystack, an HTTP-based photo server. It is composed of five layers: HTTP server, Photo Store, Haystack Object Store, Filesystem and Storage.

Storage is built on storage blades using a RAID-6 configuration, which provides adequate redundancy and excellent read performance. The poor write performance is partially mitigated by the RAID controller's NVRAM write-back cache. The filesystem used is XFS, and it manages only storage-blade-local files; no NFS is used.

Haystack Object Store is a simple log-structured (append-only) object store containing needles that represent the stored objects. A haystack consists of two files:

the actual haystack store file containing the needles

haystack_content

plus an index file

haystack_index

The Photo Store server is responsible for accepting HTTP requests and translating them into the corresponding Haystack store operations. It keeps an in-memory index of all photo offsets in the haystack store file. The HTTP framework used is the simple evhttp server provided with the open-source libevent library.
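To make the idea concrete, here is a toy Python sketch of the Haystack approach: an append-only store file holding needles plus an in-memory index of offsets, so a read costs a single seek. The on-disk layout and field names are simplified guesses for illustration, not Facebook's actual format.

    import struct

    class TinyHaystack:
        HEADER = struct.Struct("<QI")  # needle key, payload size (simplified layout)

        def __init__(self, path):
            self.path = path
            self.index = {}            # key -> (offset, size), kept in memory
            open(path, "ab").close()   # make sure the store file exists

        def put(self, key, payload):
            with open(self.path, "ab") as f:   # append-only writes
                offset = f.tell()
                f.write(self.HEADER.pack(key, len(payload)))
                f.write(payload)
            self.index[key] = (offset, len(payload))

        def get(self, key):
            offset, size = self.index[key]
            with open(self.path, "rb") as f:
                f.seek(offset + self.HEADER.size)  # one seek, then one read
                return f.read(size)

    store = TinyHaystack("photos.haystack")
    store.put(42, b"raw image bytes...")
    print(len(store.get(42)))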

Insights and Sources

Facebook Messages and Chat

The Facebook messaging system is powered by a system called Cell. The entire messaging system (email, SMS, Facebook Chat and the Facebook Inbox) is divided into cells, and each cell contains only a subset of users. Every cell is composed of a cluster of application servers (where the business logic is defined) monitored by different ZooKeeper instances.

Application servers use a data access layer to communicate with the metadata storage, an HBase-based system (the old messaging infrastructure relied on Cassandra) which contains all the information related to messages and users.

Cells are the “core” of the system. To connect them to the frontend there are different “entry points”. An MTA proxy parses mail and redirects data to the correct application. Emails are stored in the same structure as photos: Haystack. There are also discovery services to map user-to-cell (based on hashing) and service-to-cell (based on ZooKeeper notifications), and everything exposes an API.

There is a “dirty” cache based on Memcached to serve messages (from a datacenter-local cache) and social information about the users (such as social indexes).
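A plausible reading of this is a classic look-aside cache: check Memcached first, fall back to the metadata store on a miss, and repopulate the cache while tolerating some staleness. A small Python sketch (the client library, key scheme and TTL are my assumptions):

    import json
    import memcache  # python-memcached client

    cache = memcache.Client(["127.0.0.1:11211"])

    def load_thread_from_store(thread_id):
        # Placeholder for the real data access layer in front of HBase.
        return {"thread_id": thread_id, "messages": []}

    def get_thread(thread_id, ttl=60):
        key = "thread:%d" % thread_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)          # hit: possibly slightly stale ("dirty")
        thread = load_thread_from_store(thread_id)
        cache.set(key, json.dumps(thread), time=ttl)
        return thread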

facebook_messages_architecture

The search engine for messages is built using an inverted index stored in HBase.

Chat is based on an epoll server developed in Erlang and accessed using Thrift, and there is also a subsystem for logging chat messages (written in C++). Both subsystems are clustered and partitioned for reliability and efficient failover.

The most resource-intensive operation performed is not sending messages but real-time presence notification: keeping each online user aware of the online/idle/offline states of their friends. Real-time messaging is done using a variation of Comet, specifically XHR long polling, and/or BOSH.
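The long-polling part of that is easy to picture: the client issues a request that the server holds open until an event arrives or a timeout fires, then immediately reconnects. A framework-agnostic Python sketch of the server side (the queue layout and timeout are illustrative only):

    import queue

    user_queues = {}   # one pending-events queue per connected user (assumption)

    def long_poll(user_id, timeout=25):
        """Block up to `timeout` seconds waiting for an event for this user."""
        q = user_queues.setdefault(user_id, queue.Queue())
        try:
            return {"events": [q.get(timeout=timeout)]}  # held-open request
        except queue.Empty:
            return {"events": []}                        # client re-polls right away

    def notify(user_id, event):
        """Called when a friend's presence changes or a message arrives."""
        user_queues.setdefault(user_id, queue.Queue()).put(event)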

Insights and Sources

Facebook Search

The original Facebook search engine simply searched cached user information: friends lists, like lists and so on. Typeahead search (the search box at the top of the Facebook frontend) arrived in 2009. It tries to suggest the most interesting results. Performance is really important and results must be available within 100 ms. It has to be fast and scalable, and the system is structured as follows:

typeahead_search

The first attempt is made against the browser cache, where information about the user (friends, likes, pages) is stored. If the cache misses, an AJAX request starts.

Many leaf services search for results inside their indexes (stored in an inverted index called Unicorn). When result references are fetched from the indexes, they are merged and loaded from the global datastore. An aggregator provides a single channel to send data to the client. Obviously, queries are cached.

In 2012 Facebook started from the core of typeahead search to build a new search tool. Unicorn is the core of the new Graph Search. Formally, it is an in-memory inverted index which maps Facebook content the way a graph database would, and you can query it as a graph-traversal tool. To be used for Graph Search, Unicorn was updated to be more than a traversal tool: it now supports nested queries, scoring and different kinds of resources. Results are aggregated at different levels.
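As a toy illustration of the idea (all term names and ids below are invented), an inverted index whose terms encode graph edges can answer graph-flavoured and nested queries by intersecting and expanding posting lists:

    from collections import defaultdict

    index = defaultdict(set)   # term -> set of entity ids (posting list)

    def add_edge(term, entity_id):
        index[term].add(entity_id)

    def and_query(*terms):
        """Entities matching all terms (AND of posting lists)."""
        postings = [index[t] for t in terms]
        return set.intersection(*postings) if postings else set()

    def expand(prefix, inner_results):
        """Nested query: expand inner results through an operator, e.g. liked-by."""
        out = set()
        for entity in inner_results:
            out |= index[prefix + entity]
        return out

    # "Restaurants liked by friends of Alice" (made-up data):
    add_edge("friend:alice", "bob")
    add_edge("likes:restaurant", "pizza_place")
    add_edge("liked_by:bob", "pizza_place")

    friends_of_alice = and_query("friend:alice")
    print(expand("liked_by:", friends_of_alice) & index["likes:restaurant"])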

unicorn

Query lifecycle is usually made by 2 steps: the Suggestion Phase and the Search Phase.

The Suggestion Phase works like “autocomplete” and is powered by a Natural Language Processing (NLP) module that attempts to parse text based on a grammar. It identifies parts of the query as potential entities and passes these parts down to Unicorn to search for them.

The Search Phase begins when the searcher has made a selection from the suggestions. The parse tree, along with the fbids of the matched entities, is sent back to the Top Aggregator. A user-readable version of this query is displayed as part of the URL.

Currently Graph Search is still in beta.

Insights and Sources

Resources and insights

This post and the previous one are based on and inspired by Michaël Figuière's answer to the following question on Quora: What is Facebook's architecture?

Additional stuff:

Yesterday @lastknight was looking for something to store and query a huge graph dataset. He found Titan, developed by Aurelius and released last August. It is a distributed graph database which can rely on Apache Cassandra, Apache HBase or Oracle BerkeleyDB for storage. It promises to be fully distributed and horizontally scalable; it's really ambitious and the presentation seems really interesting 🙂


Aurelius also develops Faunus, an Apache Hadoop-based graph analytics engine for analyzing massive-scale graphs.

More about Titan and Faunus:

[UPDATE 2013-03-10] The subscription page describes details about the software (developed and planned) by Aurelius. The Fulgora processor seems really interesting: http://thinkaurelius.com/subscription/