Everything started while I was writing my first post about the Hadoop Ecosystem. I was relatively new to Hadoop and wanted to discover all the useful projects. I spent about 9 months collecting projects and building a simple index.

About a month ago I found an interesting thread posted on the Hadoop Users Group on LinkedIn, written by Javi Roman, High Performance Computing Manager at CEDIANT (UAX). He talks about a table that maps the Hadoop ecosystem, much like I did with my list.

He published his list on GitHub a couple of days later and called it the Hadoop Ecosystem Table. It was an HTML table, really interesting but really hard to reuse for other purposes. I wanted to merge my list with this table, so I decided to fork it and add more abstractions.

I wrote a couple of Ruby scripts (thanks Nokogiri) to extract data from my list and Javi's table and put it into an agnostic container. After a couple of days spent hacking on these parsers I found a simple but elegant solution: JSON.

Information about each project is stored in a separate JSON file:

{
  "name": "Apache HDFS",
  "description": "The Hadoop Distributed File System (HDFS) offers a way to store large files across \nmultiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. \nPrior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. \nWith Zookeeper the HDFS High Availability feature addresses this problem by providing \nthe option of running two redundant NameNodes in the same cluster in an Active/Passive \nconfiguration with a hot standby. ",
  "abstract": "a way to store large files across multiple machines",
  "category": "Distributed Filesystem",
  "tags": [
  ],
  "links": [
    {
      "text": "hadoop.apache.org",
      "url": "http://hadoop.apache.org/"
    },
    {
      "text": "Google FileSystem - GFS Paper",
      "url": "http://research.google.com/archive/gfs.html"
    },
    {
      "text": "Cloudera Why HDFS",
      "url": "http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/"
    },
    {
      "text": "Hortonworks Why HDFS",
      "url": "http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/"
    }
  ]
}

It includes: project name, long and short description, category, tags and links.

I merged the data into these files and wrote a couple of generators to put the data into different templates. Now I can generate code for my WordPress page and an updated version of Javi's table.
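Just to give an idea, here is a minimal sketch of such a generator (not my actual scripts): it reads every project JSON file and renders an HTML table row through ERB. The directory layout, field names and template are assumptions made for the example.

# Minimal generator sketch: JSON project files in, HTML table rows out.
# Paths, field names and the template are illustrative assumptions.
require "json"
require "erb"

ROW_TEMPLATE = ERB.new(<<~HTML)
  <tr>
    <td><%= project["name"] %></td>
    <td><%= project["category"] %></td>
    <td><%= project["abstract"] %></td>
  </tr>
HTML

rows = Dir.glob("projects/*.json").map do |path|
  project = JSON.parse(File.read(path))
  ROW_TEMPLATE.result_with_hash(project: project)
end

File.write("table.html", "<table>\n#{rows.join}</table>\n")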

Finally I added more data for more generic categories not strictly related to Hadoop (like MySQL forks, Memcached forks and search engine platforms) and built a new version of the table: the Big Data Ecosystem Table. The JSON files are available to everyone and will be served directly from a CDN located under the same domain as the table.

This is how I built an open source big data map 🙂

[Image: database Venn diagram]

Last week I found this diagram on @al3xandru's MyNoSQL blog and I was surprised by how many pieces of software I had never heard of before.

The diagram is missing many other systems, such as NuoDB (NewSQL), Aerospike (Key-Value), Titan (Graph), FoundationDB (Key-Value), Apache Accumulo (Key-Value), Apache Giraph (Graph) and more, and it includes some companies (like Cloudera, MapR and Xeround) even though they didn't develop a custom version but just fork and maintain the main one.

Anyway, it seems to be one of the best visual representations of the current database world, and I'm going to use it as a base for an updated and more detailed version 😉

Sources: 

YouPorn is one of the most visited porn sites on the web. I applied for a job there as a developer in 2009 because of a post that talked about its technology stack.

I heard about its infrastructure again because, about a year ago, @ErikPickupYP spoke about the great switchover at ConFoo and @antirez tweeted some details regarding the datastore. The development team rewrote the entire site using Redis as the primary database.

The original stack was based on Perl and Catalyst and powered the site from 2006 to 2011. After the acquisition they rewrote the site using a well-designed LAMP stack.

The chosen framework is Symfony2 (which uses Doctrine as its ORM) running on nginx with PHP-FPM, helped by Varnish (to speed up requests, manage the cache and check server status) and HAProxy (for load balancing and health checks of the servers). Syslog-ng handles logs. They maintain two pools of servers: a write pool with a fail-over to a backup master, and a read pool with all servers except the master.

The datastore is the most interesting part. Initially they used MySQL, but more than 200 million pageviews and 300K queries per second are too much to handle with MySQL alone. The first attempt was to add ActiveMQ to enqueue writes, but a separate Java infrastructure was too expensive to maintain. Finally they added Redis in front of MySQL and now use it as the main datastore.

Now all reads come from Redis. MySQL is used to allow building new sorted sets as requirements change, and it's highly normalized because it's not used directly by the site. After the switchover additional Redis nodes were added, not because Redis was overworked, but because the network cards couldn't keep up with Redis 😀

Lists are stored in sorted sets and MySQL is used as the source to rebuild them when needed. Pipelining makes Redis faster, and the append-only file (AOF) is an efficient strategy to easily back up data.
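To make the pattern concrete, here is a minimal sketch (not YouPorn's actual code) of rebuilding a ranked list stored as a Redis sorted set from rows coming out of MySQL, using redis-rb pipelining to batch the writes. Key names and data are invented, and the pipeline block-argument syntax assumes a recent redis-rb.

require "redis"

redis = Redis.new

# rows would come from a normalized MySQL query, e.g. [[video_id, views], ...]
rows = [[101, 50_000], [102, 42_000], [103, 39_500]]

# Rebuild the sorted set in a single pipelined round trip
redis.pipelined do |pipeline|
  pipeline.del("videos:by_views")
  rows.each do |id, views|
    pipeline.zadd("videos:by_views", views, id)   # score = view count
  end
end

# Top 10 most viewed videos, highest score first
puts redis.zrevrange("videos:by_views", 0, 9, with_scores: true).inspect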

In the end YouPorn uses an “on-steroids” LAMP stack which smartly takes advantage of Redis and other modern middleware.

Sources:

In the beginning was RediSQL, a “Hybrid Relational-Database/NOSQL-Datastore written in ANSI C”. After a while they changed the name to AlchemyDB.

Everything is built on top of Redis, adding a Lua interpreter to implement a really interesting technique they call Datastore-Side Scripting. It's like using stored procedures, putting logic into the datastore. They achieve many different goals using this technique (see the sketch after the list):

  • Implement an SQL-like language, using Lua to decode SQL requests
  • Implement datatypes not natively supported by Redis, using Lua to fit the new structures into common Redis types
  • Serve content (like web pages or JSON data) directly from the datastore using a REST API
  • Implement a GraphDB, using SQL for the index and Lua for the graph-traversal logic
  • Implement a document-oriented model using Lua
  • Implement an ObjectDB using Lua
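Redis itself later got server-side Lua scripting through EVAL (from version 2.6), which gives an idea of what datastore-side scripting looks like. This is not AlchemyDB code, just a minimal sketch using redis-rb; the key names and the script are invented for the example.

require "redis"

redis = Redis.new

# The Lua script runs inside the server: it updates a global counter and a
# per-day counter atomically, in a single round trip.
script = <<~LUA
  local total = redis.call("INCR", KEYS[1])
  redis.call("INCR", KEYS[2])
  return total
LUA

total_hits = redis.eval(script, keys: ["hits:total", "hits:2013-03-09"])
puts total_hits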

Last year Citrusleaf acquired AlchemyDB, and Russ Sullivan (the guy behind AlchemyDB) is incrementally porting its functionality to run on top of Citrusleaf's proven distributed, high-availability, linearly scalable key-value store: Aerospike. It is a distributed NoSQL database, the first solution to claim ACID support, with an extremely fast architecture optimized to run on SSDs.

I haven't tested it yet, but as far as I can see they provide an SDK for the most popular programming languages. The Ruby one requires a native library. To get started you need to add a node:

require "Citrusleaf"
c = Citrusleaf.new
c.add_node "10.1.1.6", 3000

Set, get and delete operations are done as follows:

# Writing Values
c.put 'namespace', 'myset', 'mykey', 'bin_name', value
# Reading Values
rv, gen, value = c.get 'namespace', 'myset', 'mykey', 'binx'
# Deleting Values
c.delete 'namespace', 'myset', 'mykey'

The documentation isn't very useful yet. The only way to understand whether it's cool or not is to test it. That's what I'll do.

Redis is a RAM-based key-value store. RAM is expensive. Hard disks (even SSDs) are slow. It's the truth, we know.

A few months ago people tried to use Redis instead of MySQL (or similar SQL databases) as their main datastore. When you do, it's easy to run into the memory limit. As we learned from operating systems, the first solution is to use disk space to "enlarge" your RAM. Redis versions from 2.0 to 2.4 offer a Virtual Memory implementation.

Virtual Memory seems really useful in many cases. If only a small part of your keys gets the vast majority of accesses, you can efficiently keep just that part of the keys in RAM and leave the rest on disk.

To enable Virtual Memory you switch it on with vm-enabled yes and set the memory limit with vm-max-memory. Additionally, you can fine-tune the configuration using vm-pages and vm-page-size for the swap file and vm-max-threads for concurrency.
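For reference, this is roughly how those settings looked in redis.conf for the 2.0-2.4 series (the values below are illustrative):

vm-enabled yes
vm-max-memory 1073741824   # bytes of values kept in RAM before swapping (here ~1 GB)
vm-page-size 32            # bytes per swap page
vm-pages 134217728         # number of pages in the swap file
vm-max-threads 4           # threads used for blocking swap I/O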

Anyway, since version 2.4 Virtual Memory has been deprecated. This is the official note about it:

Redis VM is now deprecated. Redis 2.4 will be the latest Redis version featuring Virtual Memory (but it also warns you that Virtual Memory usage is discouraged). We found that using VM has several disadvantages and problems. In the future of Redis we want to simply provide the best in-memory database (but persistent on disk as usual) ever, without considering at least for now the support for databases bigger than RAM. Our future efforts are focused into providing scripting, cluster, and better persistence.

The alternative is Redis Cluster. It will be a "distributed and fault tolerant implementation of a subset of the features available in the Redis stand alone server". At the moment it's a work in progress. There are some client-side implementations (for Node.js, for Ruby and more) but not yet an official, standalone version.

The deprecation of Virtual Memory and the long development time of Redis Cluster lead me to a simple conclusion:

Redis is not ready to be the main datastore for a huge dataset, not yet. 

More about Redis scaling

[2013-03-09 UPDATE] @olinicola pointed me to a post by @antirez about using Redis in memory and swapping to SSD. His conclusion is the same:

TL;DR: the outcome of this test was expected and Redis is an in-memory system 🙂

Redis is widely used in the projects I work on every day at @thefool_it. My knowledge of it is really poor, so I decided to bring my experience up to a PRO level. I understand basic Redis concepts because I worked with memcached in the past, and the differences were clearly explained in "Seven Databases in Seven Weeks". My weaknesses are in everyday use: setup, administration, querying 🙁

Introduction and setup.

Installation is really easy because you can compile from source. On OS X you also have brew or port with an up-to-date package. Updating isn't so easy: the standard way is to start the updated version on another port and migrate the data.

Data types are: Strings, Lists (ordered lists of strings), Hashes, Sets (no duplicate values) and Sorted Sets (sets sorted by a score).

The standard distribution comes with a command-line interface: redis-cli. There is a standard library for the most common environments and programming languages, such as Node.js (node_redis), Python (redis-py), Ruby (redis-rb) and more.
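As a quick warm-up, here is a minimal sketch touching each data type through redis-rb (key names are invented):

require "redis"

redis = Redis.new

redis.set("page:title", "Redis types")              # String
redis.rpush("recent:visits", ["home", "about"])     # List (ordered)
redis.hset("user:1", "name", "alice")               # Hash
redis.sadd("tags", ["nosql", "redis", "nosql"])     # Set (duplicates ignored)
redis.zadd("rank", 42, "item:1")                    # Sorted Set (score 42)

puts redis.zrange("rank", 0, -1, with_scores: true).inspect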

In the coming weeks I'm going to practice commands and admin techniques using the following resources.

Redis Cookbook
by Tiago Macedo, Fred Oliveira


Other interesting sources

Redis's SET and ZSET (sorted set) are really powerful structures. The only limits are the set operations you can perform. With plain Redis commands you can't directly obtain the intersection (or the union) of two sorted sets, or of a SET and a ZSET: you can use SINTER to intersect a group of SETs, or SUNION for the union, but unfortunately there is no equivalent direct command for ZSETs.

In our use case, we had to intersect a ZSET (a sorted rank) and a SET (a group of categorized items) to find the rank of the elements inside the selected category.

After a successful search on Google I found a way on StackOverflow (see the link below): use ZINTERSTORE. It's really simple: it acts like SINTER but stores the result in a new ZSET. It has a fairly expensive memory footprint, but that's fine if you frequently reuse the result (it works like a cache, and you can set an expiration time using EXPIRE).
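Here is a minimal sketch of the trick with redis-rb (key names are invented). When a plain SET is passed to ZINTERSTORE its members count with score 1, so weights of [1, 0] keep only the scores coming from the ranked ZSET:

require "redis"

redis = Redis.new

redis.zadd("rank:global", [[100, "item:1"], [80, "item:2"], [60, "item:3"]])
redis.sadd("category:sport", ["item:1", "item:3"])

# Intersect the rank with the category and store the result as a new ZSET
redis.zinterstore("rank:sport", ["rank:global", "category:sport"], weights: [1, 0])
redis.expire("rank:sport", 300)   # keep the cached result for 5 minutes

puts redis.zrevrange("rank:sport", 0, -1, with_scores: true).inspect
# => [["item:1", 100.0], ["item:3", 60.0]]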

Source
http://stackoverflow.com/questions/10500695/redis-how-to-intersect-a-normal-set-with-a-sorted-set

Over the last few years I had to develop projects containing up to hundreds of millions of objects. Now I need to move ahead and scale up to several billions of objects, reaching the edge of the "big data" definition. The common implementation of the relational model I have always used isn't enough anymore.

We know that a standard single-machine MySQL instance (which every web developer has used at least once) shows its limits above 100 million rows. I need to scale horizontally, and I also need more specific features to easily manage a huge amount of data.

This is not a limit of the relational model itself. Other implementations (like PostgreSQL or Oracle) can easily scale beyond that limit. Unfortunately, many operations you usually perform on data (like joins and set operations) aren't that fast to run with billions of records. I need something else.

So-called "NoSQL databases" offer more data models (document-oriented, columnar, key-value, graph and more) so you can store your data in a more efficient way. They also offer features like sharding, replication, caching and indexing out of the box.

I'm not a NoSQL expert, so I can't tell you whether choosing one DBMS over another is a good choice or not. I'm entering this world just now, like many other developers, but I think polyglot persistence is the future. Storing your data in more than one DBMS to fit your requirements, and taking advantage of each one's features, is a smart choice.

Big data and polyglot persistence are interesting topics. I found some interesting books about them; they can be a high-quality introduction.

Seven Databases in Seven Weeks
by Eric Redmond and Jim R. Wilson

Contains an overview of different kinds of data models, with a real-world example for each one: PostgreSQL (RDBMS), Riak and Redis (Key-Value), HBase (Column-oriented), MongoDB and CouchDB (Document-oriented) and Neo4j (Graph).

NoSQL Distilled
by Pramod J. Sadalage and Martin Fowler

Similarly to the previous one, this book starts with an overview of the NoSQL world. The first part analyzes how different systems implement the key features: data modeling, distribution (to scale horizontally) and replication (to keep data safe and analyzable).

The second part focuses on each type of DBMS and analyzes how they implement the concepts presented in the first part.

Big Data Glossary
by Pete Warden

Big data is more than persistence. There are many other operations you can perform on your data and many ways to analyze the results. If you aren't familiar with concepts like MapReduce, Natural Language Processing and Machine Learning, this book explains the basics.

The first 5 chapters are about storing big data; the other 6 are about processing and refining data, with a focus on very specific topics.