A few days ago I was at Codemotion in Milan and I had the opportunity to get some insights into the technologies used by two of our main competitors in Italy: BlogMeter and Datalytics. It’s quite interesting because, even if the technical challenges are almost the same, each company uses a different approach with a different stack.

datalytics_logo

Datalytics is a relatively new company founded 4 months ago. They had a desk at Codemotion to show their products and recruit new people. I chatted with Marco Caruso, the CTO (who probably didn’t know who I am; sorry Marco, I just wanted to avoid hostility 😉 ), about the technologies they use and the developer profile they were looking for. The required skills were:

Their tech team is composed of 4 developers (including the CTO) and their main products are Datalytics Monitoring™ (a sort of statistical dashboard that shows buzz stats in real time) and Datalytics Engage™ (a real-time analytics dashboard for live events). I have no technical insight into how their systems work but I can guess some details by inferring them from the buzzwords they use.

Supported sources are Twitter, Facebook (only public data), Instagram, YouTube, Vine (their logos are on the website) and probably Pinterest.

They use DataSift as a data source in addition to the standard APIs. I suppose their processing pipeline uses Storm to manage the streaming input, maybe with an importing layer in front of it. Data is crunched using Hadoop and Java and results are stored in MongoDB (Massimo Brignoli, the Italian MongoDB evangelist, advertised their company during his presentation, so I suppose they use it heavily).

Node.js is probably used for the frontend. It’s fast enough for near real-time applications (also using websockets) and it plays really well with both Angular.js and MongoDB (the MEAN stack). D3.js is obviously the only choice for complex dynamic charts.

I’m not so happy when I discover a new competitor in our market segment. Competition gets harder and that’s not fun. Anyway, the guys at Datalytics seem smart (and nice); competing with them will be a pleasure and will push me to do my best.

Now I’m curious to know whether Datalytics is monitoring the buzz around its own company name. I’m going to tweet about this article using the #Datalytics hashtag. If you find this article, please tweet me “Yes, we found it bwahaha” 😛

[UPDATE 2014-12-27 21:18 CET]

@DatalyticsIT favorited my tweet on December 1st. This probably means they found my article but didn’t read it! 😀

crate_logo

I usually don’t trust cutting-edge datastores. They promise a lot of stunning features (and use a lot of superlatives to describe them), but almost every time they are too young and have so many problems in production that they are useless. I thought the same about Crate Data.

“Massively scalable data store. It requires zero administration”

The first time I read these words (taken from the home page of Crate Data) I wasn’t impressed. I simply didn’t think they were true. Some months later I read a few articles and the overview of the project and found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I have used both Lucene and Elasticsearch in production for several years and I really like Presto. Combining production-ready components can definitely be a smart way to create something great, so I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L try.crate.io)"

But I don’t like self-install scripts, so I decided to download it and run it from bin. It simply requires a JVM. I unpacked it on my desktop on OS X and launched ./bin/crate. The process binds port 4200 (or the first available port between 4200 and 4300) and if you go to http://127.0.0.1:4200/admin you find the admin interface (there is no authentication). You also have a command line interface: ./bin/crash. It’s similar to the MySQL client, and if you are familiar with any other SQL client you will be familiar with crash too.
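
If you prefer scripting it instead of using crash, the same port also speaks SQL over HTTP. Here is a minimal Ruby sketch that queries the local node through the /_sql endpoint (a single node listening on 4200 is assumed):

require "net/http"
require "json"

# POST a SQL statement to the local Crate node; the reply is JSON with "cols" and "rows".
uri = URI("http://127.0.0.1:4200/_sql")
payload = { stmt: "SELECT name FROM sys.nodes" }.to_json
response = Net::HTTP.post(uri, payload, "Content-Type" => "application/json")
puts JSON.parse(response.body)["rows"]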

I created a simple table with semi-standard SQL (data types are a bit different):

create table items (id integer, title string)

Then I searched for a Ruby client and found crate_ruby, the official Ruby client. I started to fill the table using a Ruby script with a million-record CSV as input. Inserts went at about 5K per second, and in the meantime I ran some aggregation queries on the database using standard SQL (GROUP BY, ORDER BY and so on) to test performance; responses were quite fast. The import loop and a sample query are shown below.

require "csv"
require "crate_ruby"
client = CrateRuby::Client.new  # talks to localhost:4200 by default
CSV.foreach("data.csv", col_sep: ";") do |row|
  client.execute("INSERT INTO items (id, title) VALUES ($1, $2)", [row[0].to_i, row[9]])
end
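
A sample aggregation through the same client might look like this (the grouping column is just an example):

# Count duplicate titles and show the ten most frequent ones.
result = client.execute(
  "SELECT title, COUNT(*) AS cnt FROM items GROUP BY title ORDER BY cnt DESC LIMIT 10"
)
result.each { |row| puts row.inspect }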

Finally I decided to inspect the cluster features by running another process on the same machine. After a couple of seconds the admin interface showed a new node, and after a dozen more it informed me that the data was fully replicated. I also tried to shut down both processes to see what would happen, and the data was still fine. I was impressed.

crate_admin

I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster and I don’t know how difficult it is to use the advanced features (like full-text search or blob upload). But at the moment I’m impressed, because administration seems really easy and so does scalability.

The next step will be to test it in production under a Rails application (I found an interesting activerecord-crate-adapter) and to try the advanced features to implement a real-time search. I don’t know if I’ll end up using it, but the beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discovering more about the project.

Everything started while I was writing my first post about the Hadoop ecosystem. I was relatively new to Hadoop and I wanted to discover all the useful projects. I spent about 9 months collecting projects and building a simple index.

About a month ago I found an interesting thread on the Hadoop Users Group on LinkedIn, posted by Javi Roman, High Performance Computing Manager at CEDIANT (UAX). He talked about a table that maps the Hadoop ecosystem, much like I did with my list.

He published his list on GitHub a couple of days later and called it the Hadoop Ecosystem Table. It was an HTML table, really interesting but really hard to reuse for other purposes. I wanted to merge my list with this table, so I decided to fork it and add more abstraction.

I wrote a couple of Ruby scripts (thanks, Nokogiri) to extract the data from my list and from Javi’s table and put it into an agnostic container. After a couple of days spent hacking on these parsers I settled on a simple but elegant format: JSON.
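
The extraction step looked roughly like this (a simplified sketch: the input file name and the column layout are assumptions):

require "nokogiri"
require "json"

# Parse the HTML table and dump each row as a small JSON hash.
doc = Nokogiri::HTML(File.read("hadoop-ecosystem-table.html"))
projects = doc.css("table tr").drop(1).map do |tr|
  name, description = tr.css("td").map { |td| td.text.strip }
  { "name" => name, "description" => description }
end
File.write("projects.json", JSON.pretty_generate(projects))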

Information about each project is stored in a separate JSON file:

{
  "name": "Apache HDFS",
  "description": "The Hadoop Distributed File System (HDFS) offers a way to store large files across \nmultiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. \nPrior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. \nWith Zookeeper the HDFS High Availability feature addresses this problem by providing \nthe option of running two redundant NameNodes in the same cluster in an Active/Passive \nconfiguration with a hot standby. ",
  "abstract": "a way to store large files across multiple machines",
  "category": "Distributed Filesystem",
  "tags": [],
  "links": [
    {
      "text": "hadoop.apache.org",
      "url": "http://hadoop.apache.org/"
    },
    {
      "text": "Google FileSystem - GFS Paper",
      "url": "http://research.google.com/archive/gfs.html"
    },
    {
      "text": "Cloudera Why HDFS",
      "url": "http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/"
    },
    {
      "text": "Hortonworks Why HDFS",
      "url": "http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/"
    }
  ]
}

It includes the project name, long and short descriptions, category, tags and links.

I merged the data into these files and wrote a couple of generators to put the data into different templates. Now I can generate the code for my WordPress page and an updated version of Javi’s table.
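
Each generator is little more than “load every JSON file, sort, render a template”. A minimal sketch (the paths and the ERB template are hypothetical):

require "json"
require "erb"

# Load all per-project JSON files and render them through an HTML template.
projects = Dir.glob("projects/*.json").map { |path| JSON.parse(File.read(path)) }
projects.sort_by! { |project| project["name"] }

template = ERB.new(File.read("templates/table.html.erb"))
File.write("big-data-table.html", template.result(binding))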

Finally I added more data in more generic categories not strictly related to Hadoop (like MySQL forks, Memcached forks and search engine platforms) and built a new version of the table: the Big Data Ecosystem Table. The JSON files are available to everyone and will be served directly from a CDN under the same domain as the table.

This is how I built an open source big data map 🙂

database_venn_diagram

Last week I found this diagram on @al3xandru‘s MyNoSQL blog and I was surprised by how many pieces of software I had never heard of before.

The diagram is missing many other products, such as NuoDB (NewSQL), Aerospike (Key-Value), Titan (Graph), FoundationDB (Key-Value), Apache Accumulo (Key-Value), Apache Giraph (Graph) and more, and it includes some companies (like Cloudera, MapR and Xeround) even though they didn’t develop a custom version but just fork and maintain the main one.

Anyway it seems to be one of the best visual representations of the current database world and I’m going to use it as the base for an updated and more detailed version 😉

When you work on a single machine everything is easy. Unfortunately, when you have to scale and be fault tolerant you must rely on multiple hosts and manage a structure usually called a “cluster”.

MongoDB enables you to create a replica set to be fault tolerant and to use sharding to scale horizontally. Sharding is not transparent to DBAs: you have to choose a shard key and add or remove capacity when the system needs it.

Structure

In a MongoDB cluster you have 3 fundamental “pieces”:

  • Servers: usually called mongod
  • Routers: usually called mongos
  • Config servers

Servers are the places where you actually store your data: when you start the mongod command on your machine you are running a server. In a cluster you usually have multiple shards distributed over multiple servers.
Every shard is usually a replica set (2 or more servers), so if one of the servers goes down your cluster remains up and running.

Routers are the interaction point between users and the cluster. If you want to interact with your cluster you have to go through a mongos. The process routes your request to the correct shard and gives you back the answer.

Config servers hold all the information about the cluster configuration. They are very sensitive nodes and they are the real point of failure of the system.
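
To make the three pieces concrete, here is a minimal Ruby sketch (using the mongo gem; the router host name is made up) that connects to a mongos and asks it which shards it knows about:

require "mongo"

# Clients connect to a mongos router, never to the shards directly.
router = Mongo::Client.new(["mongos1.example.com:27017"], database: "admin")

# listShards is answered by the router using the config servers' metadata.
reply = router.database.command(listShards: 1).first
reply["shards"].each { |shard| puts "#{shard['_id']} -> #{shard['host']}" }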

mongodb_cluster

Shard Key

Choosing the shard key is the most important step when you create a cluster. There are a few important rules to follow, learned after several mistakes:

  1. Don’t choose a shard key with low cardinality. If one of its possible values grows too much you can’t split it anymore.
  2. Don’t use an ascending shard key. The only shard that grows is the last one, and redistributing the load to the other servers always requires a lot of traffic.
  3. Don’t use a random shard key. It’s quite efficient but you have to add an index to use it.

A good choice is to use a coarsely ascending key combined with a search key (something you commonly query on). This choice won’t work well for everything, but it’s a good way to start thinking about it.
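
As a sketch, sharding a collection on a coarsely ascending field combined with a commonly queried field could look like this (the database, collection and field names are made up):

require "mongo"

admin = Mongo::Client.new(["mongos1.example.com:27017"], database: "admin")

# Shard key: coarsely ascending (month) + commonly queried field (user_id).
admin.database.command(enableSharding: "analytics")
admin.database.command(shardCollection: "analytics.events",
                       key: { month: 1, user_id: 1 })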

N.B. All the information and the image of the cluster structure come from the book below. I read it last week and found it really interesting 🙂

Scaling MongoDB
by Kristina Chodorow

scaling_mongodb

Things I learned this week about MongoDB:

  • Duplicated data and a denormalized schema are good for speed, especially if your data doesn’t change often.
  • Allocating more space for objects on disk is hard, even for MongoDB. Do it early.
  • Preprocessing your data and creating indexes to speed up the queries you actually run is a damn good idea (a sketch follows after this list).
  • The query order (with AND and OR operators) is really important.
  • Use journaling and replication to keep your data safe.
  • Use JavaScript to define your own functions.
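
The point about preprocessing and indexing, as a minimal sketch (the collection and field names are made up):

require "mongo"

items = Mongo::Client.new(["localhost:27017"], database: "shop")[:items]

# Index the fields the common query actually filters and sorts on,
# so it doesn't scan the whole collection.
items.indexes.create_one({ category: 1, created_at: -1 })
items.find(category: "books").sort(created_at: -1).limit(10).each { |doc| puts doc["title"] }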

50 Tips and Tricks for MongoDB Developers
by Kristina Chodorow

mongodb_developers

In recent years I had to develop projects containing up to hundreds of millions of objects. Now I need to move ahead and scale up to several billions of objects, approaching the limits of the “big data” definition. The common implementation of the relational model which I have always used isn’t enough anymore.

We know that a standard single-machine instance of MySQL (which every web developer has used at least once) shows its limits above 100 million rows. I need to scale horizontally, and I also need more specific features to easily manage a huge amount of data.

This is not a limit of the relational model itself. Other implementations (like PostgreSQL or Oracle) can easily scale beyond that point. Unfortunately many operations you usually perform on data (like joins and set operations) aren’t fast with billions of records. I need something else.

So-called “NoSQL databases” offer more data models (document-oriented, columnar, key-value, graph and more) where you can store your data in a more efficient way. They also offer features like sharding, replication, caching and indexing out of the box.

I’m not a NoSQL expert, so I can’t advise you whether choosing one DBMS over another is a good idea or not. I’m entering this world just now, like many other developers, but I think polyglot persistence is the future. Storing your data in more than one DBMS to fit your requirements and to take advantage of the features of each one is a smart choice.

Big data and polyglot persistence are interesting topics. I found some interesting books about them that can serve as a high-quality introduction.

Seven Databases in Seven Weeks
by Eric Redmond and Jim R. Wilson

It contains an overview of the different kinds of data models with a real-world example for each one: PostgreSQL (RDBMS), Riak and Redis (Key-Value), HBase (Column-oriented), MongoDB and CouchDB (Document-oriented) and Neo4j (Graph).

NoSQL Distilled
by Pramod J. Sadalage and Martin Fowler

Similarly to the previous one, this book starts with an overview of the NoSQL world. The first part analyzes how different products implement the key features: data modeling, distribution (to scale horizontally) and replication (to keep data safe and analyzable).

The second part focuses on each type of DBMS and analyzes how they implement the concepts presented in the first part.

Big Data Glossary
by Pete Warden

Big data is more than persistence. There are many other operations you can perform on your data and many ways to analyze the results. If you aren’t familiar with concepts like MapReduce, Natural Language Processing and Machine Learning, this book explains the basics.

The first 5 chapters are about storing big data; the other 6 chapters are about processing and refining data, with a focus on very specific topics.