I usually don’t trust cutting edge datastore. They promise a lot of stunning features (and use a lot of superlatives to describe them) but almost every time they are too young and have so much problems to run in production to be useless. I thought the same also about Crate Data.

“Massively scalable data store. It requires zero administration”

First time I read these words (take from the home page of Crate Data) I wasn’t impressed. I simply didn’t think was true. Some months later I read some articles and the overview of the project and I found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I used both Lucene and Elasticsearch in production for several years and I really like Presto. Combine some production-ready components can definitely be a smart way to create something great. I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L"

But I don’t like self install scripts so I decided to download it a run from bin. It simply require JVM. I unpacked it on my desktop on OS X and I launched ./bin/crate. The process bind the port 4200 (or first available between 4200 and 4300) and if you go to you found the admin interface (there is no authentication). You also had a command line interface: ./bin/crash. Is similar to MySQL client and you are familiar with any other SQL client you will be familiar with crash too.

I created a simple table with semi-standard SQL code (data types are a bit different)

create table items (id integer, title string)

Then I search for a Ruby client and I found crate_ruby, the official Ruby client. I started to fill the table using a Ruby script and a million record CSV as input. Inserts go by 5K per second and the meantime I did some aggregation query on database using standard SQL (GROUP BY, ORDER BY and so on) to test performances and response was quite fast.

CSV.foreach("data.csv", col_sep: ";").each do |row|
client.execute("INSERT INTO items (id, title) VALUES (\$1, \$2)", [row[0], row[9]])

Finally I decided to inspect cluster features by running another process on the same machine. After a couple of seconds the admin interface shows a new node and after a dozen informs me data was fully replicated. I also tried to shut down both process to see what happen and data seems ok. I was impressed.


I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster and I don’t know how difficult is to use advanced features (like full text search or blob upload). But at the moment I’m impressed because administration seems really easy and scalability seems easy too.

Next step will be test it in production under a Rails application (I found an interesting activerecord-crate-adapter) and test advanced features to implement a real time search. I don’t know if I’ll use it but beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discover more about the project.


In the beginning was Apache Lucene. Written in 1999, Lucene is an “information retrieval software library” built to index documents containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Almost everything can be indexed as long as its textual information can be extracted.


Formally Lucene is an inverted full-text index. The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files.

Scaling is done by distributing indexes into multiple servers. One server ‘shard’ will get a query request and then search itself, as well as the other shards in the configuration, and return the combined results from each shard.


Apache Solr is a search platform, part of the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling. It provide a REST-like API supporting XML and JSON format. It’s used by many notable sites to index theirs contents, here is the public list.

There are many well-tested way to interact with Solr. If you use Ruby Sunspot can be a good choice. Here is a small example (from the official website). Indexing is made within a model:

class Post < ActiveRecord::Base   searchable do     text :title, :body     text :comments do { |comment| comment.body }     end     integer :blog_id     integer :author_id     integer :category_ids, :multiple => true
time :published_at
string :sort_title do
title.downcase.gsub(/^(an?|the)\b/, '')

And when you search something you can specify many different conditions. do
fulltext 'best pizza'
with :blog_id, 1
order_by :published_at, :desc
paginate :page => 2, :per_page => 15
facet :category_ids, :author_id

solrcloudVersion 4.0 start supporting high availability through sharding using SolrCloud. It is a way to shard and scale indexes. Shards and replicas are distributed across nodes and nodes are monitored by ZooKeeper. Any node can receive query request and propagate it to the correct place. Image on the side (coming from an interesting blog post about SolrCloud) describe an example of setup.


ElasticSearch is a search platform (written by Shay Banon the creator of Compass, another search platform). It provide a JSON API and supports almost every feature of Solr.

There are many way to use it, many also with Ruby. Tire seems a good choice. A small example (from the Github page). Define what attribute to index and index them:

Tire.index 'articles' do
create :mappings => {
:article => {
:properties => {
:id       => { :type => 'string', :index => 'not_analyzed', :include_in_all => false },
:title    => { :type => 'string', :boost => 2.0,            :analyzer => 'snowball'  },
:tags     => { :type => 'string', :analyzer => 'keyword'                             },
:content  => { :type => 'string', :analyzer => 'snowball'                            }
store :title => 'One',   :tags => ['ruby']
store :title => 'Two',   :tags => ['ruby', 'python']
store :title => 'Three', :tags => ['java']
store :title => 'Four',  :tags => ['ruby', 'php']

Then search them:

s = 'articles' do
query do
string 'title:T*'
filter :terms, :tags => ['ruby']
sort { by :title, 'desc' }
facet 'global-tags', :global => true do
terms :tags
facet 'current-tags' do
terms :tags


Sphinx is the only real alternative to Lucene. Differently than Lucene, Sphinx is designed to index content coming from a database. It supports native protocols of MySQL, MariaDB and PostgreSQL or standard ODBC protocol. You can also run Sphinx as standalone server and communicating with it using the SphinxAPI.

Sphinx also offer a storage engine called SphinxSE. It’s compatible with MySQL and integrated into MariaDB. Querying is possible using SphinxQL, a subset of SQL.

To use it in Ruby the official gem is Thinking Sphinx. Below some example of usage directly from the github page. Defining indexs:

ThinkingSphinx::Index.define :article, :with => :active_record do
indexes title, content
indexes, :as => :user
indexes user.articles.title, :as => :related_titles
has published

and querying
select: '@weight * 10 + document_boost as custom_weight',
order: :custom_weight

Others libraries

There are many other software and library designed to index and search stuff.

  • Amazon CloudSearch is a fully-managed search service in the cloud. It’s part of the AWS cloud and should be “fast and highly scalable” as Amazon says.
  • Lemur Project is a kind of information retrieval framework. It integrates the Indri search engine, a C and C++ library who can easily index HTML and XML stuff and be distributed across cluster’s nodes.
  • Xaplan is probabilistic information retrieval library. Is written in C++ and can be used with many popular languages. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.