Realtime full-text search

lucene

In the beginning was Apache Lucene. Written in 1999, Lucene is an “information retrieval software library” built to index documents containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Almost everything can be indexed as long as its textual information can be extracted.

lucene_structure

Formally Lucene is an inverted full-text index. The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files.

Scaling is done by distributing indexes into multiple servers. One server ‘shard’ will get a query request and then search itself, as well as the other shards in the configuration, and return the combined results from each shard.

solr

Apache Solr is a search platform, part of the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling. It provide a REST-like API supporting XML and JSON format. It’s used by many notable sites to index theirs contents, here is the public list.

There are many well-tested way to interact with Solr. If you use Ruby Sunspot can be a good choice. Here is a small example (from the official website). Indexing is made within a model:

class Post < ActiveRecord::Base   searchable do     text :title, :body     text :comments do       comments.map { |comment| comment.body }     end     integer :blog_id     integer :author_id     integer :category_ids, :multiple => true
time :published_at
string :sort_title do
title.downcase.gsub(/^(an?|the)\b/, '')
end
end
end

And when you search something you can specify many different conditions.

Post.search do
fulltext 'best pizza'
with :blog_id, 1
with(:published_at).less_than Time.now
order_by :published_at, :desc
paginate :page => 2, :per_page => 15
facet :category_ids, :author_id
end

solrcloudVersion 4.0 start supporting high availability through sharding using SolrCloud. It is a way to shard and scale indexes. Shards and replicas are distributed across nodes and nodes are monitored by ZooKeeper. Any node can receive query request and propagate it to the correct place. Image on the side (coming from an interesting blog post about SolrCloud) describe an example of setup.

elasticsearch

ElasticSearch is a search platform (written by Shay Banon the creator of Compass, another search platform). It provide a JSON API and supports almost every feature of Solr.

There are many way to use it, many also with Ruby. Tire seems a good choice. A small example (from the Github page). Define what attribute to index and index them:

Tire.index 'articles' do
delete
create :mappings => {
:article => {
:properties => {
:id       => { :type => 'string', :index => 'not_analyzed', :include_in_all => false },
:title    => { :type => 'string', :boost => 2.0,            :analyzer => 'snowball'  },
:tags     => { :type => 'string', :analyzer => 'keyword'                             },
:content  => { :type => 'string', :analyzer => 'snowball'                            }
}
}
}
store :title => 'One',   :tags => ['ruby']
store :title => 'Two',   :tags => ['ruby', 'python']
store :title => 'Three', :tags => ['java']
store :title => 'Four',  :tags => ['ruby', 'php']
refresh
end

Then search them:

s = Tire.search 'articles' do
query do
string 'title:T*'
end
filter :terms, :tags => ['ruby']
sort { by :title, 'desc' }
facet 'global-tags', :global => true do
terms :tags
end
facet 'current-tags' do
terms :tags
end
end

sphinx

Sphinx is the only real alternative to Lucene. Differently than Lucene, Sphinx is designed to index content coming from a database. It supports native protocols of MySQL, MariaDB and PostgreSQL or standard ODBC protocol. You can also run Sphinx as standalone server and communicating with it using the SphinxAPI.

Sphinx also offer a storage engine called SphinxSE. It’s compatible with MySQL and integrated into MariaDB. Querying is possible using SphinxQL, a subset of SQL.

To use it in Ruby the official gem is Thinking Sphinx. Below some example of usage directly from the github page. Defining indexs:

ThinkingSphinx::Index.define :article, :with => :active_record do
indexes title, content
indexes user.name, :as => :user
indexes user.articles.title, :as => :related_titles
has published
end

and querying

ThinkingSphinx.search(
select: '@weight * 10 + document_boost as custom_weight',
order: :custom_weight
)

Others libraries

There are many other software and library designed to index and search stuff.

  • Amazon CloudSearch is a fully-managed search service in the cloud. It’s part of the AWS cloud and should be “fast and highly scalable” as Amazon says.
  • Lemur Project is a kind of information retrieval framework. It integrates the Indri search engine, a C and C++ library who can easily index HTML and XML stuff and be distributed across cluster’s nodes.
  • Xaplan is probabilistic information retrieval library. Is written in C++ and can be used with many popular languages. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators.

Sources: