Recently I needed to select the best hosted service for a datastore to use in a large and complex project. Starting from Heroku's and AppFog's add-ons, I found many free plans useful for testing a service and/or for running it in production if your app is small enough (for example, this blog runs on Heroku PostgreSQL's Dev plan). Here is the list:

MySQL

  • Xeround (Starter plan): 5 connections and 10 MB of storage
  • ClearDB (Ignite plan): 10 connections and 5 MB of storage

MongoDB

  • MongoHQ (Sandbox): 50 MB of memory, 512 MB of data
  • MongoLab (Starter plan): 496 MB of storage

Redis

  • RedisToGo (Nano plan): 5 MB, 1 DB, 10 connections and no backups.
  • RedisCloud by Garantia Data: 20 MB, 1 DB, 10 connections and no backups.
  • MyRedis (Gratis plan): 5 MB, 1 DB, 3 connections and no backups.

Memcache

CouchDB

  • IrisCouch (up to $5): no limits; usage fees for HTTP requests and storage.
  • Cloudant (Oxygen plan): 150,000 requests, 250 MB of storage.

PostgreSQL – Heroku PostgreSQL (Dev plan): 20 connections, 10,000 rows of data
Cassandra – Cassandra.io (Beta on Heroku): 500 MB and 50 transactions per second
Riak – RiakOn! (Sandbox): 512 MB of memory
Hadoop – Treasure Data (Nano plan): 100 MB (compressed), data retention for 90 days
Neo4j – Heroku Neo4j (Heroku add-on beta): 256 MB of memory and 512 MB of data
OrientDB – NuvolaBase (Free): 100 MB of storage and 100,000 records
TempoDB – TempoDB Hosted (Development plan): 50,000,000 data points, 50 series
JustOneDB – Heroku JustOneDB (Lambda plan): 50 MB of data

This week my problem was to model a semi-relational structure. We decided to use MongoDB because (so people say) it is fast, scalable and schema-less. Unfortunately I'm not a good MongoDB designer yet. Data modeling was mostly easy because I could copy the relational part of the schema. The biggest data modeling problem was many-to-many relations: how do you decide whether or not to embed the relation keys into the documents? To make the right choice I decided to test the different designs.

Foreign keys embedded:

# Many-to-many with the relation keys embedded in the documents:
# Mongoid stores an array of B ids inside each A, and vice versa.
class A
  include Mongoid::Document
  field :name, type: String
  has_and_belongs_to_many :bs
end

class B
  include Mongoid::Document
  field :name, type: String
  has_and_belongs_to_many :as
end

# Create `small` As, each one related to `large` Bs.
def direct(small, large)
  small.times do |i|
    a = A.new
    a.name = "A#{i}"
    large.times do |j|
      b = B.create(name: "B#{j}")
      a.bs << b # pushes b's id into a.b_ids (and a's id into b.a_ids)
    end
    a.save
  end
end
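
With this design each A document carries the keys of all its related Bs in an array, so the document itself grows with every relation. Roughly (a sketch, not actual console output):

# What an embedded-keys document looks like (sketch): the b_ids
# array gains one ObjectId for every relation.
A.first.attributes
# => { "_id" => <ObjectId>, "name" => "A0",
#      "b_ids" => [<ObjectId>, <ObjectId>, ...] }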

Foreign keys in an external relation document:

# Many-to-many through an external relation document:
# each C-D pair gets its own Rel document holding the two keys.
class C
  include Mongoid::Document
  field :name, type: String
  has_many :rels
end

class D
  include Mongoid::Document
  field :name, type: String
  has_many :rels
end

class Rel
  include Mongoid::Document
  belongs_to :c
  belongs_to :d
end

# Create `small` Cs, each one related to `large` Ds through Rel documents.
def with_rel(small, large)
  small.times do |i|
    c = C.new
    c.name = "C#{i}"
    large.times do |j|
      d = D.create(name: "D#{j}")
      Rel.create(c: c, d: d)
    end
    c.save # persist c too (mirrors a.save in direct)
  end
end
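
The price of this design is paid on reads: to go from a C to its Ds you have to pass through Rel. A minimal sketch of that lookup (assuming the models above):

# Fetch the Ds related to a C by walking the join documents.
c = C.first
d_ids = Rel.where(c_id: c.id).map(&:d_id) # one extra query on Rel
ds = D.find(d_ids)                        # then load the Ds by id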

I tested insert time for a database with 10 objects, each one related to a growing number of other objects (from 100 to 5000, increasing at each iteration).

# Measure the wall-clock time of a block and print it with a label.
def measure(message)
  cleanup # empty the collections between runs (see the sketch below)
  start = Time.now.to_f
  yield
  finish = Time.now.to_f - start
  puts "#{message}: #{'%0.3f' % finish}"
end

(1..50).each do |e|
  measure "10 A embeds #{e * 100} B each one" do
    direct(10, e * 100)
  end
  measure "10 A linked to #{e * 100} B with external relation" do
    with_rel(10, e * 100)
  end
end
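
The cleanup helper is not shown in the snippet above; a minimal version (an assumption on my part, just to make the harness self-contained) simply empties every collection so each measurement starts from scratch:

# Hypothetical cleanup: empty all the collections between measurements.
def cleanup
  [A, B, C, D, Rel].each(&:delete_all)
end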

Results are really interesting:

Relations per element   Insert time, embedded keys (s)   Insert time, external relation (s)
100                     0.693                            1.021
200                     1.435                            2.006
300                     1.959                            2.720
400                     2.711                            3.587
500                     3.477                            4.531
600                     4.295                            5.414
700                     5.106                            6.369
800                     5.985                            7.305
900                     6.941                            8.221
1000                    7.822                            8.970
1200                    12.350                           13.946
1400                    14.820                           15.532
1600                    15.806                           17.344
1800                    18.722                           18.372
2000                    21.552                           20.732
3000                    36.151                           29.818
4000                    56.060                           38.154
5000                    82.996                           47.658

As you can see, embedding is faster up to about 1600 relations per element, but once the number of embedded relation keys goes over 2000 the insert time starts to grow much faster than linearly, while the external relation keeps growing roughly linearly.

I know, this is not a real-world test, so we can't conclude that embedded relations are always worse than external ones. Anyway, it is really interesting to observe that the limits are the same in both the SQL and NoSQL worlds: when you hit a memory limit and need to go to disk, performance degrades.

In a coming post I'm going to analyze read performance.

When you work on a single machine everything is easy. Unfortunately, when you have to scale and be fault tolerant, you must rely on multiple hosts and manage a structure usually called a “cluster”.

MongoDB enables you to create a replica set to be fault tolerant and to use sharding to scale horizontally. Sharding is not transparent to DBAs: you have to choose a shard key and add or remove capacity when the system needs it.

Structure

In a MongoDB cluster you have 3 fundamental “pieces”:

  • Servers: usually called mongod
  • Routers: usually called mongos
  • Config servers

Servers are where you actually store your data; when you start the mongod command on your machine you are running a server. In a cluster you usually have multiple shards distributed over multiple servers. Every shard is usually a replica set (2 or more servers), so if one of its servers goes down your cluster remains up and running.

Routers are the interaction point between users and the cluster. If you want to interact with your cluster, you have to go through a mongos: the process routes your request to the correct shard and gives you back the answer.
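
From the application's point of view, a mongos behaves like a single mongod: you just point the driver at the router(s). A hypothetical Mongoid 3 style mongoid.yml, with made-up hostnames:

production:
  sessions:
    default:
      hosts:
        - mongos1.example.com:27017
        - mongos2.example.com:27017
      database: myapp_production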

Config servers hold all the information about cluster configuration. They are very sensitive nodes and they are the real point of failure of the system.

[image: MongoDB cluster structure]

Shard Key

Choosing the shard key is the most important step when you create a cluster. There are a few important rules to follow, learned after several errors:

  1. Don’t choose a shard key with a low cardinality. If one of its few possible values grows too much, you can’t split that chunk any further.
  2. Don’t use an ascending shard key. The only shard that grows is the last one, and redistributing the load to the other servers always requires a lot of traffic.
  3. Don’t use a random shard key. It is quite efficient, but you have to add an index to use it.

A good choice is a coarsely ascending key combined with a search key (something you commonly query). This choice won’t work well for everything, but it’s a good way to start thinking about it.
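
For example, for per-user events you could combine a coarse month field (ascending) with the user id (the search key). A sketch with Mongoid, using a hypothetical Event model (`shard_key` only declares the key; the collection still has to be sharded on the cluster):

# Hypothetical model with a compound shard key:
# a coarsely ascending field plus a commonly queried search key.
class Event
  include Mongoid::Document
  field :month, type: String     # coarsely ascending, e.g. "2013-02"
  field :user_id, type: Integer  # the search key you commonly query
  shard_key :month, :user_id
end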

N.B. All the information and the image of the cluster structure come from the book below. I read it last week and found it really interesting 🙂

Scaling MongoDB
by Kristina Chodorow


Things I learned this week about MongoDB:

  • Duplicated data and a denormalized schema are good for speed, especially if your data doesn’t change often.
  • Allocating more space for objects on disk is hard, even for MongoDB. Do it early.
  • Preprocessing your data and creating indexes for the queries you actually run is a damn good idea (see the sketch at the end of this post).
  • The order of query terms (with AND and OR operators) is really important.
  • Use journaling and replication to keep your data safe.
  • Use JavaScript to define your own functions.

50 Tips and Tricks for MongoDB Developers
by Kristina Chodorow

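As an example of the indexing tip above: with Mongoid you can declare an index on the fields a frequent query filters and sorts on. A sketch with a made-up Article model (Mongoid 3 style index syntax):

# Hypothetical model: index the fields used by a frequent query,
# so MongoDB doesn't have to scan the whole collection.
class Article
  include Mongoid::Document
  field :author, type: String
  field :published_at, type: Time
  index({ author: 1, published_at: -1 })
end

# Article.create_indexes builds the index; afterwards a query like
# Article.where(author: "kris").desc(:published_at) can use it.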