Polyglot persistence for big-data analysis: an introduction

During last years I had to develop projects containing up to hundred of million of objects. Now I need to move ahead and scale up to several billions of objects, reaching the limit of “big-data” definition. The common implementation of relational model which I always used isn’t enough anymore.

We know that a standard single-machine instance of MySQL (which all web developers have used at least once) show its limit over the 100 millions of rows. I need to scale horizontally and also need most specific features to easily manage a huge amount of data.

This is not a limit of relational model. Other implementations (like PostgreSQL or Oracle) can easily scale over that limit. Unfortunately many operations you usually do on data (like joins and set operations) aren’t so fast to run with billion of records. I need something else.

So called “NoSQL databases” offer you more data model (document-oriented, columnar, key-value, graph and more) where you can store your data in an more efficient way. They also offer features like sharding, replication, caching and indexing out of the box.

I’m not a NoSQL expert so I can’t advise you if choose a DBMS instead of another is a good choice or not. I’m entering this world just now like many other developers but I think that polyglot persistence is the future. Store your data using more than one DBMS to fit your requirements and take advantage of features of each one is a smart choice.

Big-data and polyglot persistence are interesting topics. I found some interest books about these topics. They can be a high quality introduction.

Seven Databases in Seven Weeks
by Eric Redmond and Jim R. Wilson

Contains an overview about different kinds of data model with real-world example for each one: PostgreSQL (RDBMS), Riak and Redis (Key-Value), HBase (Column-oriented), MongoDB and CouchDB (Document-oriented) and Neo4j (Graph).

NoSQL Distilled
by Pramod J.Sadalage and Martin Fowler

Similarly to the previous one this book starts with overview about the NoSQL world. The first part analyze how different softwares implement key-features: data-modeling, distribution (to scaling horizontally) and replication (to keep if safe and analyzable) of data.

The second part focus on each different typology of DBMS and analyze how they implement concepts exposed in previous part.

Big Data Glossary
by Pete Warden

Big data is more that persistence. There are many other operations you can do on your data and many way to analyze results. If you aren’t familiar with concepts like MapReduce, Natural Language Processing and Machine Learning this book explain you the basics.

First 5 chapters are about storing big-data, other 6 chapters are about processing and refining data with focus on high-specific topics.