Redis is a RAM-based key-value store. RAM is expensive. Hard disks (even SSDs) are slow. It’s the truth, we know.

A few months ago people started trying to use Redis as their main datastore instead of MySQL (or similar SQL databases). When you do, it is easy to hit the memory limit. As we learned from operating systems, the first solution is to use disk space to “enlarge” your RAM. Redis versions from 2.0 to 2.4 offer a Virtual Memory implementation.

Virtual Memory seems really useful in many cases. If only a small subset of your keys receives the vast majority of accesses, you can efficiently keep just those keys in RAM and leave the rest on disk.

To enable Virtual Memory you switch it on with vm-enabled yes and set the memory limit with vm-max-memory. You can also fine-tune the configuration using vm-pages and vm-page-size for the swap file and vm-max-threads for concurrency.
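A minimal redis.conf sketch of those directives (the swap file path and all the values here are only illustrative, tune them to your dataset):

vm-enabled yes
vm-swap-file /tmp/redis.swap
vm-max-memory 1073741824
vm-page-size 32
vm-pages 134217728
vm-max-threads 4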

Anyway, since version 2.4 Virtual Memory has been deprecated. This is the official note about it:

Redis VM is now deprecated. Redis 2.4 will be the latest Redis version featuring Virtual Memory (but it also warns you that Virtual Memory usage is discouraged). We found that using VM has several disadvantages and problems. In the future of Redis we want to simply provide the best in-memory database (but persistent on disk as usual) ever, without considering at least for now the support for databases bigger than RAM. Our future efforts are focused into providing scripting, cluster, and better persistence.

The alternative is Redis Cluster. It will be a “distributed and fault tolerant implementation of a subset of the features available in the Redis stand alone server”. At the moment it is a work in progress. There are some client-side implementations (for Node.js, for Ruby and more) but not yet an official, standalone version.
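While waiting for the official implementation, redis-rb already ships a client-side sharding helper, Redis::Distributed, which hashes keys across a group of standalone nodes. A minimal sketch (the node URLs are placeholders):

require "redis"
require "redis/distributed"

# keys are partitioned across the nodes by consistent hashing
redis = Redis::Distributed.new([
  "redis://redis-node-1:6379",
  "redis://redis-node-2:6379"
])

redis.set("user:1:name", "Andrea")
redis.get("user:1:name") # => "Andrea"

This is not Redis Cluster: there is no fault tolerance, and operations involving keys stored on different nodes are not supported.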

Virtual Memory’s deprecation and Redis Cluster’s long development time lead me to a simple conclusion:

Redis is not ready to be the main datastore for a huge dataset, not yet. 

More about Redis scaling

[2013-03-09 UPDATE] @olinicola pointed me to a post by @antirez about using Redis in memory while swapping to SSD. His conclusion is the same:

TL;DR: the outcome of this test was expected and Redis is an in-memory system 🙂

When you work on a single machine everything is easy. Unfortunately, when you have to scale and be fault tolerant you must rely on multiple hosts and manage a structure usually called a “cluster”.

MongoDB enables you to create a replica set for fault tolerance and to use sharding to scale horizontally. Sharding is not transparent to DBAs: you have to choose a shard key and add or remove capacity when the system needs it.

Structure

In a MongoDB cluster you have 3 fundamental “pieces”:

  • Servers: usually called mongod
  • Routers: usually called mongos
  • Config servers

Servers are where you actually store your data: when you start the mongod command on your machine you are running a server. In a cluster you usually have multiple shards distributed over multiple servers.
Every shard is usually a replica set (2 or more servers), so if one of its servers goes down your cluster remains up and running.

Routers are the interaction point between users and the cluster. If you want to interact with your cluster you have to go through a mongos. The process routes your request to the correct shard and gives you back the answer.

Config servers hold all the information about the cluster configuration. They are very sensitive nodes and the real point of failure of the system.

mongodb_cluster

Shard Key

Choosing the shard key is the most important decision when you create a cluster. There are a few important rules to follow, learned after several mistakes:

  1. Don’t choose a shard key with low cardinality. If one of its possible values grows too much you can’t split it any further.
  2. Don’t use an ascending shard key. The only shard that grows is the last one, and redistributing load to the other servers always requires a lot of traffic.
  3. Don’t use a random shard key. It’s quite efficient but you have to add an index to use it.

A good choice is a coarsely ascending key combined with a search key (something you commonly query on). This won’t work well for everything, but it’s a good starting point.
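As a sketch of that pattern, assuming a hypothetical logs collection and the Ruby mongo gem, you could shard on a coarsely ascending month field combined with a user_id search key (database, collection and field names are only examples):

require "mongo"

# sharding commands go to a mongos router, not to a single shard
admin = Mongo::Connection.new("mongos-host", 27017)["admin"]

admin.command("enablesharding" => "myapp")
admin.command(
  "shardcollection" => "myapp.logs",
  "key" => { "month" => 1, "user_id" => 1 }
)

The same commands can be run directly from the mongo shell.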

N.B. All the information and the image of the cluster structure come from the book below. I read it last week and found it really interesting 🙂

Scaling MongoDB
by Kristina Chodorow

scaling_mongodb

I’m developing a new project which requires a data structure that is not yet well defined. We are evaluating different solutions for persistence, and Amazon AWS is one of the providers we are considering. I’m trying to recap the solutions it offers.

Amazon Relational Database Service (RDS)

A relational database service similar to a managed MySQL or PostgreSQL. It offers 3 different engines (with different costs), and each one should be fully compatible with the protocol of the corresponding DBMS: Oracle, MySQL and Microsoft SQL Server.

You can use it easily with ActiveRecord (with the MySQL adapter) on Rails or Sinatra. Simply replace your database.yml with the given parameters:

production:
  adapter: mysql2
  host: myprojectname.somestuff.amazonaws.com
  database: myprojectname
  username: myusername
  password: mypass

Amazon DynamoDB

A key/value store similar to Riak and Cassandra. It is still in beta, but Amazon released a paper (PDF) about its structure a few years ago which inspired many other products.

You can access it using Ruby and the aws-sdk gem. I’m not an expert, but this code should work for basic interaction (not tested yet).

require "aws"
# set connection parameters
AWS.config(
access_key_id: ENV["AWS_KEY"],
secret_access_key: ENV["AWS_SECRET"]
)
# open connection to DB
DB = AWS::DynamoDB.new
# create a table
TABLES["your_table_name"] = DB.tables["your_table_name"].load_schema
rescue AWS::DynamoDB::Errors::ResourceNotFoundException
table = DB.tables.create("your_table_name", 10, 5, schema)
# it takes time to be created
sleep 1 while table.status == :creating
TABLES["your_table_name"] = table.load_schema
end
end

After that you can interact with the table:

# create a new item
record = TABLES["your_table_name"].items.create(id: "andrea-mostosi")
record.attributes.add(name: ["Andrea"])
record.attributes.add(surname: ["Mostosi"])
# search for the item with hash key "andrea-mostosi"
TABLES["your_table_name"].items.query(
  hash_value: "andrea-mostosi"
)

Amazon Redshift

A relational DBMS based on PostgreSQL, structured for petabyte-scale amounts of data (data warehousing). It was released to the public a few days ago and the SDK isn’t well documented yet. It seems very interesting for big-data processing on a relational structure.

Amazon ElastiCache

An in-RAM caching system based on the Memcached protocol. It should be used to cache any kind of object, like Memcached. It is different from (and IMHO worse than) Redis because it doesn’t offer persistence. I prefer a different kind of caching, but it may be a good choice if your application already uses Memcached.
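Since it speaks the Memcached protocol, any Memcached client works. A minimal sketch with the dalli gem (the cluster endpoint is a placeholder):

require "dalli"

# point the client at the ElastiCache cluster endpoint
cache = Dalli::Client.new("myproject.abc123.cfg.use1.cache.amazonaws.com:11211")

cache.set("user:andrea-mostosi", { name: "Andrea" }, 300) # cache for 300 seconds
cache.get("user:andrea-mostosi")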

Amazon SimpleDB

A RESTful key/value store using only strings as data types. You can use it with any REST ORM like ActiveResource, dm-rest-adapter or, my favorite, Her (read the previous article). If you prefer, you can use it with any HTTP client like Faraday or HTTParty.
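If you prefer to skip request signing altogether, the same aws-sdk gem used above also exposes SimpleDB. A minimal sketch (the domain and attribute names are only examples):

# reuses the AWS.config credentials set earlier
sdb = AWS::SimpleDB.new
domain = sdb.domains.create("your_domain_name")

# every value is stored as a string
item = domain.items["andrea-mostosi"]
item.attributes.set("name" => "Andrea", "surname" => "Mostosi")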

[UPDATE 2013-02-19] SimpleDB isn’t listed in the “Database” menu anymore and seems to be no longer available for activation.

Other DBMSs on the marketplace

Many companies offer support for their software deployed on EC2 instances. Engines include MongoDB, CouchDB, MySQL, PostgreSQL, Couchbase Server, DB2, Riak, Memcache and Redis.


Redis is widely used in the projects I work on every day at @thefool_it. My knowledge of it is really poor, so I decided to improve my experience up to a PRO level. I understand basic Redis concepts because I worked with memcached in the past and the differences were clearly explained in “Seven Databases in Seven Weeks“. My weaknesses are about everyday use: setup, administration, querying 🙁

Introduction and setup

Installation is really easy: you can compile from source. On OS X you also have brew or port with an up-to-date package. Upgrading isn’t so easy: the standard way is to start the updated version on another port and migrate the data.

Data types are: Strings, Lists (ordered lists of strings), Hashes, Sets (no duplicated values) and Sorted Sets (sets sorted by a score).
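A quick tour of those types from Ruby with the redis-rb gem (keys and values are only examples):

require "redis"

redis = Redis.new

redis.set("user:1:name", "Andrea")             # String
redis.rpush("user:1:posts", "post:42")         # List (ordered)
redis.hset("user:1", "surname", "Mostosi")     # Hash
redis.sadd("tags", "redis")                    # Set (no duplicates)
redis.zadd("rank", 10, "user:1")               # Sorted Set (sorted by score)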

The standard distribution comes with a command line interface: redis-cli. There is a client library for most common environments and programming languages, such as Node.js (node_redis), Python (redis-py), Ruby (redis-rb) and more.

In the coming weeks I’m going to practice commands and admin techniques using the following resources.

Redis Cookbook
by Tiago Macedo, Fred Oliveira

redis_cookbook


Redis’s SET and ZSET (sorted set) are really powerful structures. The only limits concern the set operations you can perform. Using plain Redis commands you can’t directly obtain the intersection (or the union) between two sorted sets, or between a SET and a ZSET. You can use SINTER to intersect a group of SETs, or SUNION for the union. Unfortunately there is no direct equivalent for ZSETs.

In our use case, we had to intersect a ZSET (a sorted rank) and a SET (a group of categorized items) to find the rank of an element inside the selected category.

After a successful search on Google I found a way on Stack Overflow (see the link below): use ZINTERSTORE. It’s really simple: it acts like SINTER but stores the result in a new ZSET. It has a rather expensive memory footprint, but that’s OK if you frequently reuse the result (it works like a cache, and you can set an expire time using EXPIRE).
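A minimal sketch of that pattern with redis-rb (key names are only examples):

require "redis"

redis = Redis.new

redis.zadd("rank", 10, "item:1")
redis.zadd("rank", 20, "item:2")
redis.sadd("category:books", "item:2")

# intersect the ZSET with the SET; weight 0 on the SET keeps the original scores
redis.zinterstore("rank:books", ["rank", "category:books"], weights: [1, 0])
# treat the result as a cache
redis.expire("rank:books", 300)

redis.zrevrank("rank:books", "item:2") # => 0, the rank inside the category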

Source
http://stackoverflow.com/questions/10500695/redis-how-to-intersect-a-normal-set-with-a-sorted-set

Things I learned this week about MongoDB:

  • Duplicated data and a denormalized schema are good for speed, especially if your data doesn’t change often.
  • Allocating more space for objects on disk is hard, even for MongoDB. Do it early.
  • Preprocessing your data and creating indexes to speed up the queries you actually run is a damn good idea (see the sketch after this list).
  • The query order (with AND and OR operators) is really important.
  • Use journaling and replication to keep your data safe.
  • Use JavaScript to define your own functions.
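For the indexing tip, a minimal sketch with the Ruby mongo gem (database, collection and field names are only examples):

require "mongo"

collection = Mongo::Connection.new["myapp"]["logs"]

# compound index matching the query we actually run
collection.create_index([["month", Mongo::ASCENDING], ["user_id", Mongo::ASCENDING]])

# this query can now use the index instead of scanning the whole collection
collection.find("month" => "2013-03", "user_id" => "andrea-mostosi").to_a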

50 Tips and Tricks for MongoDB Developers
by Kristina Chodorow

mongodb_developers

During the last few years I had to develop projects containing up to hundreds of millions of objects. Now I need to move ahead and scale up to several billions of objects, approaching the “big data” range. The common implementation of the relational model which I have always used isn’t enough anymore.

We know that a standard single-machine instance of MySQL (which all web developers have used at least once) shows its limits beyond 100 million rows. I need to scale horizontally, and I also need more specific features to easily manage a huge amount of data.

This is not a limit of the relational model. Other implementations (like PostgreSQL or Oracle) can easily scale past that point. Unfortunately, many operations you usually perform on data (like joins and set operations) aren’t fast with billions of records. I need something else.

So-called “NoSQL databases” offer more data models (document-oriented, columnar, key-value, graph and more) so you can store your data in a more efficient way. They also offer features like sharding, replication, caching and indexing out of the box.

I’m not a NoSQL expert, so I can’t tell you whether choosing one DBMS over another is a good idea or not. I’m entering this world just now, like many other developers, but I think that polyglot persistence is the future. Storing your data in more than one DBMS to fit your requirements, and taking advantage of the features of each one, is a smart choice.

Big data and polyglot persistence are interesting topics. I found some interesting books about them which can serve as a high-quality introduction.

Seven Databases in Seven Weeks
by Eric Redmond and Jim R. Wilson

It contains an overview of different kinds of data models with a real-world example for each one: PostgreSQL (RDBMS), Riak and Redis (key-value), HBase (column-oriented), MongoDB and CouchDB (document-oriented) and Neo4j (graph).

NoSQL Distilled
by Pramod J.Sadalage and Martin Fowler

Similarly to the previous one, this book starts with an overview of the NoSQL world. The first part analyzes how different products implement the key features: data modeling, distribution (to scale horizontally) and replication (to keep data safe and analyzable).

The second part focuses on each type of DBMS and analyzes how they implement the concepts exposed in the first part.

Big Data Glossary
by Pete Warden

Big data is more than persistence. There are many other operations you can perform on your data and many ways to analyze the results. If you aren’t familiar with concepts like MapReduce, Natural Language Processing and Machine Learning, this book explains the basics.

The first 5 chapters are about storing big data; the other 6 chapters are about processing and refining data, with a focus on very specific topics.