After almost 4 years as CTO at The Fool, it's time for me to search for new adventures. Starting from May 1st, 2016 I'll join the Curcuma team. Below, a picture from the new office:

curcuma

Just joking 😛 It will be a big challenge for me because I'll move from a specific field, web and social network analysis, to general purpose development where projects are really varied: custom CMSs, integration with IoT devices, mobile applications and many others. In the end, a good challenge!

solidus

eCommerce solutions are quite popular in Curcuma's portfolio, and my last experience in the field was in 2008 with an early version of Magento. I've worked on similar products, but I'm quite "rusty" on this topic. Starting from the Ruby ecosystem, the default at Curcuma, only two realistic options are available: Spree (acquired by First Data and no longer supported) and Solidus (a Spree fork, quite young but already interesting).

I searched for tutorials about Solidus, but version 1.0.0 was shipped only last August (and is based on Spree 2.4) and the community is still young. I found only beginner's tutorials, so I decided to follow the GitHub README instructions on the master branch.

Install

Start with a fresh installation of Rails 4.2 (the Rails 5.0 beta doesn't seem to be supported yet), add the gems and run bundle install:

gem 'solidus'
gem 'solidus_auth_devise'

Inspecting Gemfile.lock you can find the solidus dependencies:

solidus (1.2.2)
  solidus_api (= 1.2.2)
  solidus_backend (= 1.2.2)
  solidus_core (= 1.2.2)
  solidus_frontend (= 1.2.2)
  solidus_sample (= 1.2.2)
solidus_auth_devise (1.3.0)

The solidus package seems to be a container for these modules. I really like this approach: it's clean, encourages isolation and masks complexity. Its gemspec is also the cleanest I've seen yet.

# encoding: UTF-8
require_relative 'core/lib/spree/core/version.rb'

Gem::Specification.new do |s|
  s.platform    = Gem::Platform::RUBY
  s.name        = 'solidus'
  s.version     = Spree.solidus_version
  s.summary     = 'Full-stack e-commerce framework for Ruby on Rails.'
  s.description = 'Solidus is an open source e-commerce framework for Ruby on Rails.'

  s.files        = Dir['README.md', 'lib/**/*']
  s.require_path = 'lib'
  s.requirements << 'none'

  s.required_ruby_version     = '>= 2.1.0'
  s.required_rubygems_version = '>= 1.8.23'

  s.author       = 'Solidus Team'
  s.email        = 'contact@solidus.io'
  s.homepage     = 'http://solidus.io'
  s.license      = 'BSD-3'

  s.add_dependency 'solidus_core', s.version
  s.add_dependency 'solidus_api', s.version
  s.add_dependency 'solidus_backend', s.version
  s.add_dependency 'solidus_frontend', s.version
  s.add_dependency 'solidus_sample', s.version
end

Setup and config

Anyway, the next step in the README is to run the following commands:

bundle exec rails g spree:install
bundle exec rake railties:install:migrations

The first one gives me a warning:

[WARNING] You are not setting Devise.secret_key within your application!
You must set this in config/initializers/devise.rb. Here's an example:
Devise.secret_key = "7eaa914b11299876c503eca74af..."

It then fires some actions related to assets, migrations and seeds, and asks me for a username and password. Standard install.

About the warning, I found another post that recommends running this task:

rails g solidus:auth:install

It's not clear to me what it does, but it seems to work: after running it the warning is gone.
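If it hadn't worked, a manual fallback (my assumption, not something from the Solidus README) would be to set the key yourself in the Devise initializer, reusing the secret_key_base that Rails 4.2 already keeps in config/secrets.yml:

# config/initializers/devise.rb
# Assumption: reusing Rails' secret_key_base keeps the secret out of the code.
Devise.secret_key = Rails.application.secrets.secret_key_base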

The migration task (bundle exec rake railties:install:migrations) gives no output. I suppose the migrations were already installed during the first step. No idea.

Anyway, the last step listed in the README is to run the migrations (bundle exec rake db:migrate); it gives no output either, so everything seems ok.

Now we can fire up rails s and enjoy our brand new store 😀

solidus-screen

A bit more control

These steps are cool but do a lot of things we probably don't want, like installing demo products and demo users. Following the README, the installation can be run without any automatic step:

rails g spree:install --migrate=false --sample=false --seed=false

and then you are free to run any of the available steps with your own customizations:

bundle exec rake railties:install:migrations
bundle exec rake db:migrate
bundle exec rake db:seed
bundle exec rake spree_sample:load

Now our new store is ready. It's time to dig deeper into the Solidus structure. See ya in another post bro 😉

I've always liked The Setup. Discovering what kind of technologies, hardware and software other skilled people are using is extremely useful and really fun for me. This time I'd like to share some tips from the complete reboot of my personal ecosystem I did after switching to my new MacBook.

macbook_pro_13_retina

On the hardware side it's simply a high-end 2015 MacBook Pro 13″ Retina with a dual-core Intel Core i7 Haswell at 3.4GHz, 16GB of RAM and a 1TB PCI Express 3.0 SSD. It's fast, solid, lightweight and flexible. The only required accessory is the Be.eZ LArobe Second Skin.

On the software side I decided to avoid a Time Machine restore in order to set up a completely new environment. I started from a fresh installation of OS X 10.10 Yosemite.

As a polyglot developer I usually deal with a lot of different applications, programming languages and tools. In order to decide what to install, a list of what I had on the previous machine and what else I needed was really useful.

Here is a list of useful software and some tips about the installation process.

Applications

paid_apps

Paid software worth having: Evernote (with a Premium subscription and Skitch) and Todoist (with a Premium subscription), both available on the Mac App Store. 1Password, Fantastical 2, OmniGraffle, Carbon Copy Cloner, Backblaze and ExpanDrive are available on their own websites.

Free software worth having: Google Chrome and Mozilla Firefox as browsers, Apache OpenOffice, Skype and Slack for chat, VLC for multimedia and Transmission for torrents.

app_from_suites

Suites, or parts of them: Adobe Photoshop CC, Adobe Illustrator CC and Adobe Acrobat Pro DC are part of the Adobe Creative Cloud. Microsoft Word 2016 and Microsoft Excel 2016 are part of Microsoft Office 2016 for Mac (now in free preview). Apple Pages and Apple Keynote come preinstalled as part of the Apple iWork suite, as do Apple Calendar and Apple Contacts.

Development tools

Utilities for power users: Caffeine, Growl and HardwareGrowler, iStat Menus Pro, Disk Inventory X, Tor Browser and TrueCrypt 7.1a (you need to fix a little installation bug on OS X 10.10), Kitematic and Boot2Docker for Docker, Sublime Text 3 (with some additions like the Spacegray theme, the Soda theme, a new icon and the Source Code Pro font), Tower, Visual Studio Code, Android SDK (for the Android emulator) and Xcode (for the iOS simulator), VirtualBox (with some useful Linux virtual images), iTerm 2.

CLI: Oh My Zsh, Homebrew, GPG (installed using brew), Xcode Command Line Tools (from the Apple Developer website), Git (with git-flow, installed using brew), AWS CLI (installed via pip), PhantomJS, s3cmd and the faster s4cmd, Heroku Toolbelt and OpenShift Client Tools (installed via gem).

daemons

Servers: MariaDB 10.0 (brew), MongoDB 3.0 (brew), Redis 3.0 (brew), Elasticsearch 1.6 (brew), Nginx 1.8.0 (brew), PostgreSQL 9.4.2 (via Postgres.app), Hadoop 2.7.0 (brew), Spark 1.4 (download from official website), Neo4j 2.2 (brew), Accumulo 1.7.0 (download from official website), Crate 0.49 (download from official website), Mesos 0.22 (download from official website), Riak 2.1.1 (brew), Storm 0.9.5 (download from official website), Zookeeper 3.4.6 (brew), Sphinx 2.2 (brew), Cassandra 2.1.5 (brew).

languages

Programming languages: RVM, Ruby (MRI 2.2, 2.1, 2.0, 1.9.3, 1.8.7, REE 2012.02, JRuby 1.7.19 installed using RVM), PHP 5.6 with PHP-FPM (installed using brew), HHVM 3.7.2 (installed using brew with adding additional repo, has some issues on 10.10), Python 2.7 (brew python) and Python 3.4 (brew python3), Pip 7.1 (shipped with Python), NVM, Node.js 0.12 and IO.js 2.3 (both installed using NVM), Go 1.4.2 (from Golang website), Java 8 JVM (from Oracle website), Java 8 SE JDK (from Oracle website), Scala 2.11 (from Scala website), Clojure 1.6 (from Clojure website), Erlang 17.0 (brew), Haskell GHC 7.10 (brew), Haskell Cabal 1.22 (brew), OCaml 4.02.1 (brew), R 3.2.1 (from R for Mac OS X website), .NET Core and ASP.NET (brew using DNVM), GPU Ocelot (compiled with a lot of libraries).

The full reboot took about 2 days. Some software is still missing, but I was able to restart my work almost completely. I hope this list will be helpful for someone 🙂

Everything started while I was writing my first post about the Hadoop ecosystem. I was relatively new to Hadoop and I wanted to discover all the useful projects. I spent about 9 months collecting projects and building a simple index.

About a month ago I found an interesting thread posted in the Hadoop Users Group on LinkedIn, written by Javi Roman, High Performance Computing Manager at CEDIANT (UAX). He talks about a table which maps the Hadoop ecosystem much like I did with my list.

He published his list on GitHub a couple of days later and called it the Hadoop Ecosystem Table. It was an HTML table, really interesting but really hard to reuse for other purposes. I wanted to merge my list with this table, so I decided to fork it and add more abstraction.

I wrote a couple of Ruby scripts (thanks, Nokogiri) to extract data from my list and Javi's table and put it into an agnostic container. After a couple of days spent hacking on these parsers I settled on a simple but elegant solution: JSON.
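The extraction part was something along these lines (a simplified sketch with made-up selectors and file names, not the actual scripts):

require 'nokogiri'
require 'json'

# Assumption: each project is a row of the HTML table, with the name in the
# first cell and the description plus links in the second one.
doc = Nokogiri::HTML(File.read('hadoop-ecosystem-table.html'))

projects = doc.css('table tr').drop(1).map do |row|
  cells = row.css('td')
  {
    'name'        => cells[0].text.strip,
    'description' => cells[1].text.strip,
    'links'       => cells[1].css('a').map { |a| { 'text' => a.text, 'url' => a['href'] } }
  }
end

# One JSON file per project.
projects.each do |project|
  filename = project['name'].downcase.gsub(/\W+/, '-') + '.json'
  File.write(filename, JSON.pretty_generate(project))
end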

Information about each project is stored in a separate JSON file:

{
  "name": "Apache HDFS",
  "description": "The Hadoop Distributed File System (HDFS) offers a way to store large files across \nmultiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. \nPrior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. \nWith Zookeeper the HDFS High Availability feature addresses this problem by providing \nthe option of running two redundant NameNodes in the same cluster in an Active/Passive \nconfiguration with a hot standby. ",
  "abstract": "a way to store large files across multiple machines",
  "category": "Distributed Filesystem",
  "tags": [],
  "links": [
    { "text": "hadoop.apache.org", "url": "http://hadoop.apache.org/" },
    { "text": "Google FileSystem - GFS Paper", "url": "http://research.google.com/archive/gfs.html" },
    { "text": "Cloudera Why HDFS", "url": "http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/" },
    { "text": "Hortonworks Why HDFS", "url": "http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/" }
  ]
}

It includes: project name, long and short description, category, tags and links.

I merged the data into these files and wrote a couple of generators in order to put the data into different templates. Now I can generate code for my WordPress page and an updated version of Javi's table.
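The generators are just as simple (again a sketch: the template path and output file are hypothetical, the real ones target the WordPress page and Javi's HTML table):

require 'json'
require 'erb'

# Assumption: one JSON file per project and a hypothetical ERB template
# that iterates over the `projects` array.
projects = Dir['projects/*.json'].map { |path| JSON.parse(File.read(path)) }
projects.sort_by! { |project| project['name'] }

template = ERB.new(File.read('templates/table.html.erb'))
File.write('output/table.html', template.result(binding))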

Finally I added more data in more generic categories not strictly related to Hadoop (like MySQL forks, Memcached forks and search engine platforms) and built a new version of the table: the Big Data Ecosystem Table. The JSON files are available to everyone and will be served directly from a CDN under the same domain as the table.

This is how I built an open source big data map 🙂

datasift_logo

DataSift, as they say on their home page, will "aggregate, process and deliver social data". It is one of the oldest Twitter certified partners and offers data coming from almost every existing social network. I use it every day to "listen to" the net and find the data I need for my analysis.

It's impressive to watch how fast they collect data from external sources and deliver it to your chosen destination. When I tweet, a couple of minutes later a JSON file lands in my S3 bucket.

Creating Internet-scale filtering is not easy. Their infrastructure is really complex and optimized. This is a 2011 diagram of their workflow.

datasift_infrastructure

Twitter generates more than 500 million tweets per day, and it is only one of the available sources. The DataSift system performs 250+ million sentiment analyses with sub-100ms latency, and several TB of augmented data (including gender, sentiment, etc.) transit the platform daily. Data filtering nodes can process up to 10,000 unique streams and can do data lookups on lists of 10,000,000+ usernames in real time. Link augmentation performs 27 million link resolves and lookups plus 15+ million full web page aggregations per day.

C++ is used for the performance-critical components, like the core filtering engine, while PHP powers the site, the external API server, most of the internal web services and a custom-built, high performance job queue manager. Java and Scala are used for batch processing with HBase and MapReduce jobs. Kafka is used as the queuing system and Ruby is used for deploys and provisioning. Thrift is widely used.

MySQL (Percona Server) on SSD drives is used as the primary storage, an HBase cluster spread over more than 30 Hadoop nodes provides a place to store historical data, and Memcached and Redis are used for caching.

Here is a schema of the processing unit which builds the historical database.

datasift_historical

Message queues are another critical component of the infrastructure. 0mq (a custom build from the latest alpha branch, with some stability fixes, to use publisher-side filtering) is used in different configurations:

  • PUB-SUB for replication / message broadcasting;
  • PUSH-PULL for round-robin workload distribution;
  • REQ-REP for health checks of different components.

Kafka is used for high-performance persistent queues. In both cases they're working with the developers and contributing bug reports, traces, fixes and client libraries.

All code is pulled from the repo by Jenkins every 5 minutes, automatically tested and verified with several QA tools, packaged as an RPM and moved to the dev package repo. Chef is used to automate deployments and manage configuration. All services emit StatsD events, which are combined with other system-level checks, added to Zenoss and displayed with Graphite.

The biggest challenge, IMHO, is filtering. Filtering at this scale requires a different approach. They started from the work they did at TweetMeme. The core filter engine is written in C++ and is called the Pickle Matrix. Over three years they've developed a compiler and their own virtual machine. We don't know exactly what their technology is, but it might be something like distributed complex event processing with query rewriting.

Sources

Almost all the content of this post comes from the wonderful article "DataSift Architecture: Realtime Datamining At 120,000 Tweets Per Second" posted on HighScalability. Some details also come from "Historical Architecture – Data Mining Billions of Tweets" on the DataSift blog.

From the home page

"Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate."

Introduction

Storm enables you to define a Topology (an abstraction of the cluster computation) in order to describe how to handle the data flow. In a topology you can define some Spouts (entry points for your data, with basic preprocessing) and some Bolts (single steps of data manipulation). This simple strategy enables you to define complex processing of streams of data.
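Storm itself runs on the JVM, so the snippet below is not Storm code: it's just a toy Ruby sketch of the idea, with an enumerator playing the spout and plain callables playing the bolts in the classic word-count example.

# Toy illustration of a topology: a "spout" emits tuples and each "bolt"
# applies a single manipulation step.
sentence_spout = Enumerator.new do |yielder|
  ["the quick brown fox", "jumps over the lazy dog"].each { |s| yielder << s }
end

split_bolt  = ->(sentence) { sentence.split }  # first bolt: sentence -> words
word_counts = Hash.new(0)                      # second bolt: stateful counting

sentence_spout.each do |sentence|
  split_bolt.call(sentence).each { |word| word_counts[word] += 1 }
end

puts word_counts.inspect  # => {"the"=>2, "quick"=>1, ...}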

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it. Coordination between the two is done through ZooKeeper.

Libraries

Resources

Books

When I used Rails for the first time I was impressed by the use of multiple environments to handle different setups, policies and behaviors of an application. A few years ago the use of environments wasn't so common, and switching between development, test and production was an innovation for me.

Anyway, big projects that use custom frameworks introduced this structure several years earlier. I had the pleasure of working on a lot of legacy code that implements different environment setups. For example, the classified ads channel of Repubblica.it (built by my mentor @FabiolousMate) uses dev, demo and production. Other projects I worked on use staging. After listening to a lot of opinions, I asked myself which environments are the most important and whether three are enough.

I'm mostly a Ruby developer and I know the Rails ecosystem, which uses 3 basic environments:

  • development is used when you code. Source code is reloaded on each request. Logging is EXTREMELY verbose. Libraries include debug and error logging features. The database is full of garbage data.
  • test is for automated testing. Data is loaded and cleaned automatically every time you run the tests. Everything can be mocked (database, APIs, external services, …). Libraries include testing frameworks and the log contains just the test output.
  • production is meant to be safe. Logging is just for errors. Sometimes there is a caching layer. Libraries are loaded once. Data is replicated. Everything is set up to improve both performance and robustness.

These environments are really useful for managing application development. Unfortunately they are not enough to handle every situation. For example, production is not appropriate for testing new features because of the minimal logging and the heavy optimization (and the precious production data), and it isn't appropriate for demo purposes either because it has to be used by customers. Development is likewise not appropriate for finding bottlenecks because of the messy data and the debug code.

In my experience, I usually add three more environments to my applications to cover every situation. In most cases these are enough.

  • staging is for deep testing of new features. Production data with development logging and libraries. It enables you to test the side effects of your new features on real-world data. If a change works here, it will probably work in production too.
  • demo is for showtime. A production environment with sandboxed features and demo data. You can open this environment to anyone and they can play with whatever they want without danger.
  • profile is for finding bottlenecks. A development environment with specific libraries for profiling and fine-tuning your processes. You can adjust the data to stress your system without worrying about data coherence.

This is, IMHO, a good setup for your deploy environments. Depending on the project some of these may not be useful, but on a large project each one can save your life.
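In Rails, adding one of these environments just means creating a new file under config/environments (plus the matching entries in config/database.yml and secrets). A minimal sketch for staging, with settings that are my assumptions rather than a canonical recipe:

# config/environments/staging.rb
# Assumption: staging behaves like production but keeps verbose logging.
Rails.application.configure do
  config.cache_classes = true                    # load code once, like production
  config.eager_load = true
  config.consider_all_requests_local = false     # real error pages, like production
  config.log_level = :debug                      # development-style logging
end

Then you boot the application with RAILS_ENV=staging against a copy of the production data.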

Ruby doesn't like strings which are not UTF-8 encoded. CSV files are usually a bunch of data coming from somewhere else, and most of the time they are not UTF-8 encoded. When you try to read them you can expect problems. I fought against encoding issues for a long time, and I've now found how to avoid the major ones, and I'm very proud of this (because of the many headaches… :-/ ).

If you read a CSV file you can specify the :encoding option to set the source and destination encodings (format: "source:destination") so the data is passed to the CSV engine already converted:

CSV.foreach("file.csv", encoding: "iso-8859-1:UTF-8") do |row|
  # use row here...
end

If your resource is not a file but a String or a file handle, you need to convert it before using the CSV engine. The standard String#force_encoding method doesn't seem to work as expected:

a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding?
# => returns false
a =~ /x/
# => provokes ArgumentError: invalid byte sequence in UTF-8

You must use the String#encode! method to get things done:

a = "\xff"
a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding?
# => returns true now
a =~ /x/
# => works now

So using an external resource:

handler = open("http://www.example.com/file.csv")
csv_string = handler.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)

CSV.parse(csv_string) do |row|
  # use row here...
end

Sources:

OpenURI is a really useful part of the Ruby standard library. I had never used it with basic authentication, but I thought that specifying the credentials in the URL would be enough. I was wrong. It returns an error:

ArgumentError: userinfo not supported. [RFC3986]

The right way to pass auth params is a bit hidden in the documentation. You can find it as an option of the OpenURI::OpenRead#open method:

open("http://www.your-website.net",
     http_basic_authentication: ["user", "password"])

Source:

This week my problem was modeling a semi-relational structure. We decided to use MongoDB because (so they say) it is fast, scalable and schema-less. Unfortunately I'm not a good MongoDB designer yet. Data modeling was mostly easy because I could copy the relational part of the schema. The biggest data modeling problem is about m-to-m relations. How do you decide whether to embed the m-to-m relation keys into the documents or not? To make the right choice I decided to test different design solutions.

Foreign keys embedded:

class A
  include Mongoid::Document
  field :name, type: String
  has_and_belongs_to_many :bs
end

class B
  include Mongoid::Document
  field :name, type: String
  has_and_belongs_to_many :as
end

def direct(small, large)
  small.times do |i|
    a = A.new
    a.name = "A#{i}"
    large.times do |j|
      b = B.create(name: "B#{j}")
      a.bs << b
    end
    a.save
  end
end

Foreign keys in an external document:

class C
  include Mongoid::Document
  field :name, type: String
  has_many :rels
end

class D
  include Mongoid::Document
  field :name, type: String
  has_many :rels
end

class Rel
  include Mongoid::Document
  belongs_to :c
  belongs_to :d
end

def with_rel(small, large)
  small.times do |i|
    c = C.new
    c.name = "C#{i}"
    large.times do |j|
      d = D.create(name: "D#{j}")
      Rel.create(c: c, d: d)
    end
    c.save # persist C as well, mirroring a.save in direct
  end
end

I tested insert times for a database with 10 objects, each related to a growing number of other objects at every iteration (from 100 to 5000).

def measure(message, &block)
  cleanup
  start = Time.now.to_f
  yield
  finish = Time.now.to_f - start
  puts "#{message}: #{"%0.3f" % finish}"
end

(1..50).each do |e|
  measure "10 A embeds #{e*100} B each one" do
    direct(10, e*100)
  end
  measure "10 A linked to #{e*100} B with external relation" do
    with_rel(10, e*100)
  end
end

Results are really interesting:

Relations per element    Insert time, embedded keys (s)    Insert time, external relation (s)
100                      0.693                             1.021
200                      1.435                             2.006
300                      1.959                             2.720
400                      2.711                             3.587
500                      3.477                             4.531
600                      4.295                             5.414
700                      5.106                             6.369
800                      5.985                             7.305
900                      6.941                             8.221
1000                     7.822                             8.970
1200                     12.350                            13.946
1400                     14.820                            15.532
1600                     15.806                            17.344
1800                     18.722                            18.372
2000                     21.552                            20.732
3000                     36.151                            29.818
4000                     56.060                            38.154
5000                     82.996                            47.658

As you can see, when the number of embedded relation keys goes over 2000, the time grows geometrically.

I know this is not a real-world test, so we can't conclude that embedding the relation keys is worse than using an external relation. Anyway, it is really interesting to observe that the limits are the same in both the SQL and NoSQL worlds: when you hit a memory limit and need to go to disk, performance degrades.

In a coming post I'm going to analyze read performance.
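As a preview, the read test can be timed the same way (a sketch, assuming the collections above are still populated; I time it directly instead of reusing measure because that helper calls cleanup first):

# Embedded keys: load every B related to each A.
start = Time.now.to_f
A.all.each { |a| a.bs.to_a }
puts "embedded read: #{"%0.3f" % (Time.now.to_f - start)}"

# External relation: walk the Rel documents and load every D.
start = Time.now.to_f
C.all.each { |c| c.rels.map(&:d) }
puts "external read: #{"%0.3f" % (Time.now.to_f - start)}"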

Two important updates were recently released in the Ruby world (informally named ROR24):

  1. Ruby 2.0.0-p0
    http://www.ruby-lang.org/en/news/2013/02/24/ruby-2-0-0-p0-is-released/
  2. Rails 4.0.beta1
    http://weblog.rubyonrails.org/2013/2/25/Rails-4-0-beta1/

Following these releases, PragProg has published new updates for two of the most popular books about these topics.

Programming Ruby (the pickaxe book)
by Dave Thomas, with Chad Fowler and Andy Hunt

programming_ruby_2

Agile Web Development with Rails
by Sam Ruby, Dave Thomas and David Heinemeier Hansson

agile_web_devlopment_with_rails_4

I bought them yesterday. At first look the updates seem good, even if they contain only minor changes. In the coming days I'm going to play with this new stuff and write some posts about it 😉