I always like The Setup. Discovering what technologies, hardware and software other skilled people use is extremely useful and really fun for me. This time I’d like to share some tips from the complete reboot of my personal ecosystem after switching to my new MacBook.

[image: macbook_pro_13_retina]

On the hardware side it is a simple high-end 2015 MacBook Pro 13″ Retina with an Intel Core i7 Haswell dual-core at 3.4GHz, 16GB of RAM and a 1TB PCI Express 3.0 SSD. It is fast, solid, lightweight and flexible. The only required accessory is the Be.eZ LArobe Second Skin.

On the software side I decided to avoid a Time Machine restore in order to set up a completely new environment. I started from a fresh installation of OS X 10.10 Yosemite.

As a polyglot developer I usually deal with a lot of different applications, programming languages and tools. In order to decide what to install, a list of what I had on the previous machine and what else I needed was really useful.

Here is a list of useful software and some tips about the installation process.

Applications

[image: paid_apps]

Paid software worth having: Evernote (with a Premium subscription and Skitch) and Todoist (with a Premium subscription), both available on the Mac App Store; 1Password, Fantastical 2, OmniGraffle, Carbon Copy Cloner, Backblaze and ExpanDrive, available on their own websites.

Free software worth having: Google Chrome and Mozilla Firefox as browsers, Apache OpenOffice, Skype and Slack for chat, VLC for multimedia and Transmission for torrents.

[image: app_from_suites]

Suites (or parts of them): Adobe Photoshop CC, Adobe Illustrator CC and Adobe Acrobat Pro DC are part of Adobe Creative Cloud. Microsoft Word 2016 and Microsoft Excel 2016 are part of Microsoft Office 2016 for Mac (now in free preview). Apple Pages and Apple Keynote come preinstalled as part of the Apple iWork suite, as do Apple Calendar and Apple Contacts.

Development tools

Utilities for power users: Caffeine, Growl and HardwareGrowler, iStat Menus, Disk Inventory X, Tor Browser and TrueCrypt 7.1a (you need to fix a little installation bug on OS X 10.10), Kitematic and Boot2Docker for Docker, Sublime Text 3 (with some additions: the Spacegray and Soda themes, a new icon and the Source Code Pro font), Tower, Visual Studio Code, the Android SDK (for the Android emulator) and Xcode (for the iOS simulator), VirtualBox (with some useful Linux virtual images) and iTerm 2.

CLI tools: Oh My Zsh, Homebrew, GPG (installed using brew), the Xcode Command Line Tools (from the Apple Developer website), Git with git-flow (installed using brew), the AWS CLI (installed via pip), PhantomJS, s3cmd and the faster s4cmd, the Heroku Toolbelt and the OpenShift Client Tools (installed via gem).
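
As a side note, most of the brew-installed tools above can be captured in a Brewfile (the small Ruby DSL read by the homebrew-bundle tap) and replayed on the next machine with a single brew bundle. A minimal sketch, with formula names that should be double-checked against your own list:

# Brewfile: restore the CLI tools with "brew tap Homebrew/bundle && brew bundle"
brew "git"
brew "git-flow"
brew "gnupg"
brew "phantomjs"
brew "s3cmd"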

[image: daemons]

Servers: MariaDB 10.0 (brew), MongoDB 3.0 (brew), Redis 3.0 (brew), Elasticsearch 1.6 (brew), Nginx 1.8.0 (brew), PostgreSQL 9.4.2 (via Postgres.app), Hadoop 2.7.0 (brew), Spark 1.4 (download from official website), Neo4j 2.2 (brew), Accumulo 1.7.0 (download from official website), Crate 0.49 (download from official website), Mesos 0.22 (download from official website), Riak 2.1.1 (brew), Storm 0.9.5 (download from official website), Zookeeper 3.4.6 (brew), Sphinx 2.2 (brew), Cassandra 2.1.5 (brew).

[image: languages]

Programming languages: RVM, Ruby (MRI 2.2, 2.1, 2.0, 1.9.3, 1.8.7, REE 2012.02 and JRuby 1.7.19, all installed using RVM), PHP 5.6 with PHP-FPM (installed using brew), HHVM 3.7.2 (installed using brew after adding an additional repo; it has some issues on 10.10), Python 2.7 (brew python) and Python 3.4 (brew python3), pip 7.1 (shipped with Python), NVM, Node.js 0.12 and io.js 2.3 (both installed using NVM), Go 1.4.2 (from the Golang website), Java 8 JVM (from the Oracle website), Java 8 SE JDK (from the Oracle website), Scala 2.11 (from the Scala website), Clojure 1.6 (from the Clojure website), Erlang 17.0 (brew), Haskell GHC 7.10 (brew), Haskell Cabal 1.22 (brew), OCaml 4.02.1 (brew), R 3.2.1 (from the R for Mac OS X website), .NET Core and ASP.NET (via DNVM, installed with brew), GPU Ocelot (compiled with a lot of libraries).

The full reboot took about 2 days. Some software is still missing but I was able to get back to work almost completely. I hope this list will be helpful to someone 🙂

[image: crate_logo]

I usually don’t trust cutting-edge datastores. They promise a lot of stunning features (and use a lot of superlatives to describe them) but almost every time they are too young and have too many problems to be used in production. I thought the same about Crate Data.

“Massively scalable data store. It requires zero administration”

The first time I read these words (taken from the Crate Data home page) I wasn’t impressed. I simply didn’t think they were true. Some months later I read some articles and the project overview, and I found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I have used both Lucene and Elasticsearch in production for several years and I really like Presto. Combining solid production-ready components can definitely be a smart way to create something great. I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L try.crate.io)"

But I don’t like self-install scripts, so I decided to download it and run it from bin. It simply requires a JVM. I unpacked it on my desktop on OS X and launched ./bin/crate. The process binds port 4200 (or the first available port between 4200 and 4300) and if you go to http://127.0.0.1:4200/admin you find the admin interface (there is no authentication). There is also a command line interface: ./bin/crash. It is similar to the MySQL client, and if you are familiar with any other SQL client you will be familiar with crash too.

I created a simple table with semi-standard SQL code (data types are a bit different):

create table items (id integer, title string)

Then I searched for a Ruby client and found crate_ruby, the official Ruby client. I started filling the table using a Ruby script with a million-record CSV as input. Inserts ran at about 5K per second, and in the meantime I ran some aggregation queries on the database using standard SQL (GROUP BY, ORDER BY and so on) to test performance; responses were quite fast.

require 'csv'
require 'crate_ruby'
client = CrateRuby::Client.new # connects to localhost:4200 by default
CSV.foreach("data.csv", col_sep: ";") do |row|
  client.execute("INSERT INTO items (id, title) VALUES ($1, $2)", [row[0].to_i, row[9]])
end
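
The aggregations were nothing fancy, just plain SQL sent through the same client. A minimal sketch of the kind of query I mean (the exact result set API of crate_ruby may differ slightly between versions):

# Count rows per title and keep the ten most frequent ones
result = client.execute("SELECT title, count(*) AS total FROM items GROUP BY title ORDER BY total DESC LIMIT 10")
result.each { |row| puts row.inspect }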

Finally I decided to inspect the cluster features by running another process on the same machine. After a couple of seconds the admin interface showed a new node, and after a dozen more it informed me that the data was fully replicated. I also tried shutting down both processes to see what would happen, and the data looked fine. I was impressed.

[image: crate_admin]

I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster, and I don’t know how difficult it is to use the advanced features (like full-text search or blob upload). But at the moment I’m impressed, because administration seems really easy and scalability seems easy too.

The next step will be to test it in production behind a Rails application (I found an interesting activerecord-crate-adapter) and to try the advanced features to implement real-time search. I don’t know yet if I’ll use it, but the beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discovering more about the project.

[image: klout_logo]

According to Wikipedia, Klout is “a website and mobile app that uses social media analytics to rank its users according to online social influence via the ‘Klout Score’, which is a numerical value between 1 and 100”.

This is not so different from what I try to do every day. They get signals from social networks, process them in order to extract relevant data, and show some diagrams and a synthetic index of user influence. It’s really interesting for me to observe how their data is stored and processed.

At Hadoop Summit 2012, Dave Mariani (from Klout) and Denny Lee (from Microsoft) presented the Klout architecture and showed the following diagram:

[image: klout_architecture]

It shows many different technologies, a great example of polyglot persistence 🙂

Klout uses Hadoop a lot. It’s used to collect signals coming from the different Signal Collectors (one for each social network, I suppose). The procedures that enhance the data are written using Pig, and Hive is also used for the data warehouse.

Currently MySQL is used only to collect user registrations, which are ingested into the data warehouse system. In the past they used it as a bridge between the data warehouse and their “Cube”, a Microsoft SQL Server Analysis Services (SSAS) instance used for Business Intelligence with Excel and other custom apps. In 2011 the data was migrated using Sqoop. Now they can rely on Microsoft’s Hive ODBC driver and MySQL isn’t used anymore.

The website and mobile app are based on the Klout API. Data is collected from the data warehouse and stored in HBase (user profiles and scores) and MongoDB (interactions between users). Elasticsearch is used as the search index.

Most of the custom components are written in Scala. The only exception is the website, written in JavaScript/Node.js.

In the end, Klout is probably the biggest company working with both open source tools from the Hadoop ecosystem and Microsoft tools. The Hadoop version for Windows Azure, developed together with Hortonworks, is probably the first product of this collaboration.

[image: lucene]

In the beginning was Apache Lucene. Written in 1999, Lucene is an “information retrieval software library” built to index documents containing fields of text. This flexibility allows Lucene’s API to be independent of the file format. Almost everything can be indexed as long as its textual information can be extracted.

[image: lucene_structure]

Formally Lucene is an inverted full-text index. The core elements of such an index are segments, documents, fields, and terms. Every index consists of one or more segments. Each segment contains one or more documents. Each document has one or more fields, and each field contains one or more terms. Each term is a pair of Strings representing a field name and a value. A segment consists of a series of files.
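
To make the idea concrete, here is a toy sketch in plain Ruby (it has nothing to do with Lucene’s real segment format, it just shows the term-to-documents mapping at the heart of an inverted index):

documents = {
  1 => "the quick brown fox",
  2 => "the lazy dog",
  3 => "the quick dog"
}

# Build a term => [document ids] mapping
inverted_index = Hash.new { |hash, term| hash[term] = [] }
documents.each do |id, text|
  text.split.uniq.each { |term| inverted_index[term] << id }
end

inverted_index["quick"] # => [1, 3]
inverted_index["dog"]   # => [2, 3]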

Scaling is done by distributing indexes across multiple servers. One server ‘shard’ gets a query request, searches itself as well as the other shards in the configuration, and returns the combined results from each shard.

[image: solr]

Apache Solr is a search platform, part of the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration and rich document handling. It provides a REST-like API supporting XML and JSON formats. It’s used by many notable sites to index their content; here is the public list.
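
Since the API is plain HTTP, you can also query Solr without any client library. A minimal sketch in Ruby, assuming a local instance with the default example core (collection1 in Solr 4.x) and some indexed posts:

require 'net/http'
require 'json'

# Ask Solr for ten documents matching the title, in JSON format
uri = URI('http://localhost:8983/solr/collection1/select')
uri.query = URI.encode_www_form(q: 'title:pizza', wt: 'json', rows: 10)

response = JSON.parse(Net::HTTP.get(uri))
response['response']['docs'].each { |doc| puts doc['title'] }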

There are many well-tested ways to interact with Solr. If you use Ruby, Sunspot can be a good choice. Here is a small example (from the official website). Indexing is defined within a model:

class Post < ActiveRecord::Base
  searchable do
    text :title, :body
    text :comments do
      comments.map { |comment| comment.body }
    end
    integer :blog_id
    integer :author_id
    integer :category_ids, :multiple => true
    time :published_at
    string :sort_title do
      title.downcase.gsub(/^(an?|the)\b/, '')
    end
  end
end

And when you search for something you can specify many different conditions:

Post.search do
  fulltext 'best pizza'
  with :blog_id, 1
  with(:published_at).less_than Time.now
  order_by :published_at, :desc
  paginate :page => 2, :per_page => 15
  facet :category_ids, :author_id
end

[image: solrcloud]

Version 4.0 started supporting high availability through sharding using SolrCloud. It is a way to shard and scale indexes. Shards and replicas are distributed across nodes, and the nodes are monitored by ZooKeeper. Any node can receive a query request and propagate it to the correct place. The image on the side (coming from an interesting blog post about SolrCloud) describes an example setup.

[image: elasticsearch]

Elasticsearch is a search platform (written by Shay Banon, the creator of Compass, another search platform). It provides a JSON API and supports almost every feature of Solr.
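
The JSON API can be used directly over HTTP too. A minimal sketch, assuming a local node on the default port 9200 and an existing articles index:

require 'net/http'
require 'json'

# Run a match query against the 'articles' index
uri  = URI('http://localhost:9200/articles/_search')
body = { query: { match: { title: 'pizza' } }, size: 10 }.to_json

response = Net::HTTP.start(uri.host, uri.port) do |http|
  http.post(uri.path, body, 'Content-Type' => 'application/json')
end

JSON.parse(response.body)['hits']['hits'].each do |hit|
  puts hit['_source']['title']
end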

There are many ways to use it, several of them also from Ruby. Tire seems a good choice. A small example (from the GitHub page). Define which attributes to index and index them:

Tire.index 'articles' do
  delete
  create :mappings => {
    :article => {
      :properties => {
        :id       => { :type => 'string', :index => 'not_analyzed', :include_in_all => false },
        :title    => { :type => 'string', :boost => 2.0,            :analyzer => 'snowball'  },
        :tags     => { :type => 'string', :analyzer => 'keyword'                             },
        :content  => { :type => 'string', :analyzer => 'snowball'                            }
      }
    }
  }
  store :title => 'One',   :tags => ['ruby']
  store :title => 'Two',   :tags => ['ruby', 'python']
  store :title => 'Three', :tags => ['java']
  store :title => 'Four',  :tags => ['ruby', 'php']
  refresh
end

Then search them:

s = Tire.search 'articles' do
  query do
    string 'title:T*'
  end
  filter :terms, :tags => ['ruby']
  sort { by :title, 'desc' }
  facet 'global-tags', :global => true do
    terms :tags
  end
  facet 'current-tags' do
    terms :tags
  end
end

[image: sphinx]

Sphinx is the only real alternative to Lucene. Unlike Lucene, Sphinx is designed to index content coming from a database. It supports the native protocols of MySQL, MariaDB and PostgreSQL, or the standard ODBC protocol. You can also run Sphinx as a standalone server and communicate with it using the SphinxAPI.

Sphinx also offers a storage engine called SphinxSE. It’s compatible with MySQL and integrated into MariaDB. Querying is possible using SphinxQL, a subset of SQL.
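
Because SphinxQL speaks the MySQL wire protocol, any MySQL client library can talk to it. A small sketch using the mysql2 gem, assuming searchd is listening on the default SphinxQL port (9306) and an index named articles exists:

require 'mysql2'

# Connect to searchd's SphinxQL listener instead of a real MySQL server
sphinx = Mysql2::Client.new(host: '127.0.0.1', port: 9306)

results = sphinx.query("SELECT id, WEIGHT() AS w FROM articles WHERE MATCH('pizza') ORDER BY w DESC LIMIT 10")
results.each { |row| puts row.inspect }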

To use it from Ruby the official gem is Thinking Sphinx. Below are some usage examples taken directly from the GitHub page. Defining indexes:

ThinkingSphinx::Index.define :article, :with => :active_record do
  indexes title, content
  indexes user.name, :as => :user
  indexes user.articles.title, :as => :related_titles
  has published
end

and querying:

ThinkingSphinx.search(
select: '@weight * 10 + document_boost as custom_weight',
order: :custom_weight
)

Other libraries

There are many other programs and libraries designed to index and search stuff.

  • Amazon CloudSearch is a fully-managed search service in the cloud. It’s part of the AWS cloud and should be “fast and highly scalable” as Amazon says.
  • Lemur Project is a kind of information retrieval framework. It integrates the Indri search engine, a C and C++ library which can easily index HTML and XML content and can be distributed across a cluster’s nodes.
  • Xapian is a probabilistic information retrieval library. It is written in C++ and can be used from many popular languages. It supports the probabilistic information retrieval model and also a rich set of boolean query operators.
