After almost 4 years as CTO at The Fool, it is time for me to look for new adventures. Starting from May 1st 2016 I’ll join the Curcuma team. Below, a picture of the new office:

curcuma

Just joking 😛 It will be a big challenge for me because I’ll move from a specific field, web and social network analysis, to general-purpose development where projects are really varied: custom CMSs, integrations with IoT devices, mobile applications and many others. In the end, a good challenge!

solidus

eCommerce solutions are quite popular in Curcuma’s portfolio, and my last experience in this area was in 2008 with an early version of Magento. I have worked on similar products but I’m quite “rusty” on this topic. Starting from the Ruby ecosystem, the default at Curcuma, only two realistic options are available: Spree (acquired by First Data and no longer supported) and Solidus (a quite young Spree fork, but already interesting).

I searched for tutorials about Solidus, but version 1.0.0 was shipped only last August (and is based on Spree 2.4) and the community is still young. I found only beginner’s tutorials, so I decided to follow the GitHub README instructions on the master branch.

Install

Start with a fresh installation of Rails 4.2 (the Rails 5.0 beta doesn’t seem to be supported yet), add the gems and run bundle install:

gem 'solidus'
gem 'solidus_auth_devise'

Inspecting Gemfile.lock you can find the solidus dependencies:

solidus (1.2.2)
solidus_api (= 1.2.2)
solidus_backend (= 1.2.2)
solidus_core (= 1.2.2)
solidus_frontend (= 1.2.2)
solidus_sample (= 1.2.2)
solidus_auth_devise (1.3.0)

The solidus package seems to be a container for these modules. I really like this approach: it’s clean, encourages isolation and masks complexity. The gemspec is also the cleanest I’ve seen yet.

# encoding: UTF-8
require_relative 'core/lib/spree/core/version.rb'
Gem::Specification.new do |s|
s.platform    = Gem::Platform::RUBY
s.name        = 'solidus'
s.version     = Spree.solidus_version
s.summary     = 'Full-stack e-commerce framework for Ruby on Rails.'
s.description = 'Solidus is an open source e-commerce framework for Ruby on Rails.'
s.files        = Dir['README.md', 'lib/**/*']
s.require_path = 'lib'
s.requirements << 'none'
s.required_ruby_version = '>= 2.1.0'
s.required_rubygems_version = '>= 1.8.23'
s.author       = 'Solidus Team'
s.email        = 'contact@solidus.io'
s.homepage     = 'http://solidus.io'
s.license      = 'BSD-3'
s.add_dependency 'solidus_core', s.version
s.add_dependency 'solidus_api', s.version
s.add_dependency 'solidus_backend', s.version
s.add_dependency 'solidus_frontend', s.version
s.add_dependency 'solidus_sample', s.version
end

Setup and config

Anyway, the next step in the README is to run the following commands:

bundle exec rails g spree:install
bundle exec rake railties:install:migrations

The first one gives me a warning:

[WARNING] You are not setting Devise.secret_key within your application!
You must set this in config/initializers/devise.rb. Here's an example:
Devise.secret_key = "7eaa914b11299876c503eca74af..."

Then it fires some actions related to assets, migrations and seeds, and asks me for a username and password. A standard install.

About the warning, I found another post that recommends running this task:

rails g solidus:auth:install

It’s not clear to me what it does, but it seems to work: after running it the warning is gone.
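
If you prefer to fix it by hand instead, the warning itself shows the way: set the key in config/initializers/devise.rb (the value below is just the truncated placeholder from the warning; generate your own, for example with rake secret):

# config/initializers/devise.rb
Devise.secret_key = "7eaa914b11299876c503eca74af..."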

The migrations task (bundle exec rake railties:install:migrations) gives no output. I suppose the migrations were already installed by the first step. No idea.

Anyway, the last step listed in the README is to run the migrations (bundle exec rake db:migrate); it gives no output either, so everything seems OK.

Now we can fire up rails s and enjoy our brand new store 😀

solidus-screen

A bit more control

These steps are cool but do a lot of things we probably don’t want, like installing demo products and demo users. Following the README, the installation can be run without any automatic step:

rails g spree:install --migrate=false --sample=false --seed=false

and then you are free to run any of the available steps with your customizations:

bundle exec rake railties:install:migrations
bundle exec rake db:migrate
bundle exec rake db:seed
bundle exec rake spree_sample:load

Now our new store is ready. It’s time to dig deeper into the Solidus structure. See ya in another post bro 😉

bedrock_big_logo

During the last couple of weeks I had to work on a PHP project with a custom WordPress stack I had never used before: Bedrock.

The home page says: “Bedrock is a modern WordPress stack that gets you started with the best development tools, practices, and project structure.”

What Bedrock really is

It is a regular WordPress installation with a different folder structure, integrated with Composer for dependency management and Capistrano for deploys. The structure is reminiscent of Rails and similar frameworks, but it contains the usual WordPress components and runs on the same web stack.

├── composer.json
├── config
│   ├── application.php
│   └── environments
│       ├── development.php
│       ├── staging.php
│       └── production.php
├── vendor
└── web
    ├── app
    │   ├── mu-plugins
    │   ├── plugins
    │   ├── themes
    │   └── uploads
    ├── wp-config.php
    ├── index.php
    └── wp

Server configuration

The project used to run on Apache with mod_php, but I personally don’t like this stack. I’d like to test it on HHVM, but for the moment I preferred to run it on nginx with PHP-FPM. Starting with an empty Ubuntu 14.04 installation, I set up a LEMP stack with Memcached and Redis using apt-get:

apt-get update
apt-get install build-essential tcl8.5 curl screen bootchart git mailutils munin-node vim nmap tcpdump nginx mysql-server mysql-client memcached redis-server php5-fpm php5-curl php5-mysql php5-mcrypt php5-memcache php5-redis php5-gd

Everything works fine except the Redis extension (used for custom functions unrelated to WordPress). I don’t know why, but its config file wasn’t copied into the configuration directory /etc/php5/fpm/conf.d/. You can find it among the available mods in /etc/php5/mods-available/.

PHP-FPM uses a standard configuration placed in /etc/php5/fpm/pool.d/example.conf. It listens on 127.0.0.1:9000 or on the Unix socket /var/run/php5-fpm-example.sock (I assume the configured pool name was “example”).

Memcached should be configured to be used for session sharing among multiple servers. To activate it you need to set the following parameters in the php.ini configuration file at /etc/php5/fpm/php.ini:

session.save_handler = memcache
session.save_path = 'tcp://192.168.0.1:11211,tcp://192.168.0.2:11211'

The nginx configuration is placed in /etc/nginx/sites-available/ and linked from /etc/nginx/sites-enabled/ as usual, and forwards requests for PHP files to PHP-FPM.

server {
    listen 80 default deferred;
    root /var/www/example/htdocs/current/web/;
    index index.html index.htm index.php;
    server_name www.example.com;

    access_log /var/www/example/logs/access.log;
    error_log /var/www/example/logs/error.log;

    location / {
        try_files $uri $uri/ /index.php?q=$uri&$args;
    }

    location ~ \.php$ {
        try_files $uri =404;
        # fastcgi_pass 127.0.0.1:9000;
        fastcgi_pass unix:/var/run/php5-fpm-example.sock;
        fastcgi_index index.php;
        fastcgi_buffers 16 16k;
        fastcgi_buffer_size 32k;
        include fastcgi_params;
    }

    location ~ /\.ht {
        deny all;
    }
}

The root directory is Bedrock’s web/ directory, prefixed with current/ to support the Capistrano directory structure displayed below.

├── current -> /var/www/example/htdocs/releases/20150120114500/
├── releases
│   ├── 20150080072500
│   ├── 20150090083000
│   ├── 20150100093500
│   ├── 20150110104000
│   └── 20150120114500
├── repo
│   └── <VCS related data>
├── revisions.log
└── shared
    └── <linked_files and linked_dirs>

Local configuration

I’m quite familiar with Capistrano because of my recent Ruby background. You need a Ruby version greater than 1.9.3 to run it (RVM helps). The first step is to download dependencies; Ruby uses Bundler.

# run it to install bundler gem the first time
gem install bundler
# run it to install dependencies
bundle install

Bundler reads the Gemfile (and Gemfile.lock) and downloads all the required gems (Ruby libraries).
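
For reference, the Gemfile of a Bedrock project is tiny and mostly deploy-related; roughly something like this (gems, grouping and versions are indicative, not the exact upstream file):

source 'https://rubygems.org'

group :deployment do
  gem 'capistrano', '~> 3.1'
  gem 'capistrano-composer'
end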

Now the technology stack is ready both locally and on the server 🙂
I’ll probably describe how to run a LEMP stack on OS X in a future post. For the moment I’m assuming you are able to run it locally. Here are useful guides by Jonas Friedmann and rtCamp.

Anyway, Bedrock can run on any LAMP/LEMP stack. The only “special” feature is the Composer integration. Composer is to PHP what Bundler is to Ruby: it helps developers manage the dependencies of a project. Here it is used to manage plugins, themes and WordPress core updates.

You can run composer install to install the libraries. If you update the library configuration or want to force downloading them again (maybe after a fresh install), run composer update.
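
To give an idea of what Composer manages here, plugins and themes are declared as regular packages, typically pulled from the wpackagist.org mirror; a minimal sketch of a composer.json (package names and versions are purely illustrative):

{
  "repositories": [
    { "type": "composer", "url": "http://wpackagist.org" }
  ],
  "require": {
    "composer/installers": "~1.0",
    "wpackagist-plugin/wordpress-seo": "*"
  }
}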

Deploy

Capistrano enables you to set up different deploy environments. A global configuration is defined, and you only need to specify the custom configuration for each environment. An example of /config/deploy/production.rb:

set :application, 'example'
set :stage, :production
set :branch, "master"
server '192.168.0.1', user: 'user', roles: %w{web app db}

Everything else is inherited from the global config, where all the other deploy properties are defined. It’s important to say that the Capistrano deploy script on Bedrock only downloads the source code from the Git repo and runs composer install for the main project. If you need to run it for any plugin, you need to define a custom Capistrano task and run it at the end of the deploy. For instance, you can add the following lines to the global configuration in order to install dependencies for a specific plugin:

namespace :deploy do
  desc 'Rebuild Plugin Libraries'
  task :updateplugin do
    on roles(:app), in: :sequence, wait: 5 do
      execute "cd /var/www/#{fetch(:application)}/htdocs/current/web/app/plugins/anything/ && composer install"
    end
  end
end

after 'deploy:publishing', 'deploy:updateplugin'
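
For completeness, the shared config/deploy.rb is where the rest of the properties live; a rough sketch (repository URL, paths and linked dirs are illustrative and just follow the example layout above):

# config/deploy.rb: settings inherited by every stage (values are illustrative)
set :repo_url, 'git@example.com:example/example.git'
set :deploy_to, -> { "/var/www/#{fetch(:application)}/htdocs" }
set :linked_dirs, fetch(:linked_dirs, []).push('web/app/uploads')
set :keep_releases, 5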

Now you are ready to deploy your Bedrock installation to the server!
Simply run cap production deploy, restart PHP-FPM (service php5-fpm restart) and enjoy it 😀

Many thanks to Giuseppe, a great sysadmin and friend, for his support during the development and deployment of this @#@?!?@# application.

A few weeks ago I faced a performance issue with a lookup query on a 5GB MySQL database. About 10% of these requests took 100x longer than the others. The syntax was really simple:

SELECT  `authors`.* FROM `authors`  WHERE `authors`.`remote_id` = 415418856 LIMIT 1;

Actually it wasn’t a real “slow query”: the table is quite large (3M+ records) and a response time of 300ms is reasonable. Unfortunately my application runs this query thousands of times each minute, so performance was degraded. The remote_id column is a standard varchar field with a unique hash index, and I couldn’t understand why these queries took so much time.

After a long search I found a note in the MySQL reference manual related to type conversion:
MySQL 5.7 Reference Manual > 12.2 Type Conversion in Expression Evaluation


For comparisons of a string column with a number, MySQL cannot use an index on the column to look up the value quickly. If str_col is an indexed string column, the index cannot be used when performing the lookup in the following statement:

SELECT * FROM tbl_name WHERE str_col=1;

Everything was clear: my Ruby application doesn’t implicitly cast remote_id to a string when the value is an integer. This is quite strange behavior, because ActiveRecord, the application’s ORM, usually handles casts transparently. Usually.

The fixed version of the query runs in 3ms:

SELECT  `authors`.* FROM `authors`  WHERE `authors`.`remote_id` = '415418856' LIMIT 1;
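
On the application side the fix was just as small: make sure the value is a string before it reaches the query. A sketch of the idea (the Author model matches the query above; the call site itself is hypothetical):

# Cast the lookup value to a string so MySQL can use the index on the
# varchar column instead of converting every row to a number.
remote_id = 415418856
Author.where(remote_id: remote_id.to_s).first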

Small changes, big headaches, great results 🙂

crate_logo

I usually don’t trust cutting-edge datastores. They promise a lot of stunning features (and use a lot of superlatives to describe them), but almost every time they are too young and have so many problems running in production that they are useless. I thought the same about Crate Data.

“Massively scalable data store. It requires zero administration”

The first time I read these words (taken from the home page of Crate Data) I wasn’t impressed. I simply didn’t think they were true. Some months later I read some articles and the overview of the project and I found something more interesting:

It includes solid established open source components (Presto, Elasticsearch, Lucene, Netty)

I used both Lucene and Elasticsearch in production for several years, and I really like Presto. Combining production-ready components can definitely be a smart way to create something great. I decided to give it a try.

They offer a quick way to test it:

bash -c "$(curl -L try.crate.io)"

But I don’t like self-installing scripts, so I decided to download it and run it from bin. It simply requires a JVM. I unpacked it on my desktop on OS X and launched ./bin/crate. The process binds port 4200 (or the first available between 4200 and 4300), and if you go to http://127.0.0.1:4200/admin you find the admin interface (there is no authentication). You also have a command line interface: ./bin/crash. It is similar to the MySQL client, and if you are familiar with any other SQL client you will be familiar with crash too.

I created a simple table with semi-standard SQL code (data types are a bit different):

create table items (id integer, title string)

Then I searched for a Ruby client and found crate_ruby, the official Ruby client. I started to fill the table using a Ruby script and a million-record CSV as input. Inserts went at about 5K per second, and in the meantime I ran some aggregation queries on the database using standard SQL (GROUP BY, ORDER BY and so on) to test performance; responses were quite fast.

require 'csv'

# `client` is a crate_ruby client instance created earlier
CSV.foreach("data.csv", col_sep: ";") do |row|
  client.execute("INSERT INTO items (id, title) VALUES ($1, $2)", [row[0], row[9]])
end
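
The aggregations went through the same client; an illustrative example (the query and the iteration are a sketch, assuming the crate_ruby result set is enumerable):

# Top 10 most frequent titles (illustrative query)
top = client.execute("SELECT title, COUNT(*) AS total FROM items GROUP BY title ORDER BY total DESC LIMIT 10")
top.each { |row| puts row.inspect }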

Finally I decided to inspect the cluster features by running another process on the same machine. After a couple of seconds the admin interface showed a new node, and after a dozen more it informed me the data was fully replicated. I also tried to shut down both processes to see what happens, and the data seems fine. I was impressed.

crate_admin

I still have many doubts about Crate. I don’t know how to manage users and privileges, I don’t know how to create a custom topology for a cluster, and I don’t know how difficult it is to use advanced features (like full-text search or blob upload). But at the moment I’m impressed, because administration seems really easy and scalability seems easy too.

The next step will be to test it in production under a Rails application (I found an interesting activerecord-crate-adapter) and to test advanced features to implement real-time search. I don’t know if I’ll use it, but the beginning looks very good.

Next week O’Reilly will host a webcast about Crate. I’m really looking forward to discovering more about the project.

Everything started while I was writing my first post about the Hadoop Ecosystem. I was relatively new to Hadoop and I wanted to discover all the useful projects, so I spent about 9 months collecting projects and building a simple index.

About a month ago I found an interesting thread posted on the Hadoop Users Group on LinkedIn, written by Javi Roman, High Performance Computing Manager at CEDIANT (UAX). He talked about a table which maps the Hadoop ecosystem much like I did with my list.

He published his list on GitHub a couple of days later and called it the Hadoop Ecosystem Table. It was an HTML table, really interesting but really hard to use for other purposes. I wanted to merge my list with this table, so I decided to fork it and add more abstraction.

I wrote a couple of Ruby scripts (thanks Nokogiri) to extract data from my list and Javi’s table and put it in an agnostic container. After a couple of days spent hacking on these parsers I found a simple but elegant solution: JSON.

Information about each project is stored in a separate JSON file:

{
  "name": "Apache HDFS",
  "description": "The Hadoop Distributed File System (HDFS) offers a way to store large files across \nmultiple machines. Hadoop and HDFS was derived from Google File System (GFS) paper. \nPrior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. \nWith Zookeeper the HDFS High Availability feature addresses this problem by providing \nthe option of running two redundant NameNodes in the same cluster in an Active/Passive \nconfiguration with a hot standby. ",
  "abstract": "a way to store large files across multiple machines",
  "category": "Distributed Filesystem",
  "tags": [
  ],
  "links": [
    {
      "text": "hadoop.apache.org",
      "url": "http://hadoop.apache.org/"
    },
    {
      "text": "Google FileSystem - GFS Paper",
      "url": "http://research.google.com/archive/gfs.html"
    },
    {
      "text": "Cloudera Why HDFS",
      "url": "http://blog.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/"
    },
    {
      "text": "Hortonworks Why HDFS",
      "url": "http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/"
    }
  ]
}

It includes: project name, long and short description, category, tags and links.

I merged the data into these files and wrote a couple of generators in order to put the data into different templates. Now I can generate code for my WordPress page and an updated version of Javi’s table.
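
The generators are nothing fancy; a sketch of the idea (file layout and output format are hypothetical):

require 'json'

# Read every project file and emit one HTML table row per project
rows = Dir['projects/*.json'].map do |path|
  project = JSON.parse(File.read(path))
  links = project['links'].map { |l| "<a href=\"#{l['url']}\">#{l['text']}</a>" }.join(', ')
  "<tr><td>#{project['name']}</td><td>#{project['category']}</td><td>#{links}</td></tr>"
end

File.write('table.html', "<table>\n#{rows.join("\n")}\n</table>")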

Finally I added more data for more generic categories not strictly related to Hadoop (like MySQL forks, Memcached forks and search-engine platforms) and built a new version of the table: the Big Data Ecosystem table. The JSON files are available to everyone and are served directly from a CDN located under the same domain as the table.

This is how I built an open source big data map 🙂

When I used Rails for the first time I was impressed by the use of multiple environments to handle different setups, policies and behaviors of an application. A few years ago the use of environments wasn’t so common, and switching between development, test and production was an innovation for me.

Anyway, big projects using custom frameworks introduced this structure several years earlier. I had the pleasure of working on a lot of legacy code implementing different environment setups. For example, the classified ads channel of Repubblica.it (built by my mentor @FabiolousMate) uses dev, demo and production. Other projects I worked on use staging. After listening to a lot of opinions I asked myself which are the most important environments and whether three are enough.

I’m mostly a Ruby developer and I know the Rails ecosystem, which uses 3 basic environments:

  • development is used when you code. Source code is reloaded each time. Logging is EXTREMELY verbose. Libraries include debug and error logging features. The database is full of garbage data.
  • test is for automated testing. Data is loaded and cleaned automatically every time you run the tests. Everything can be mocked (database, APIs, external services, …). Libraries include testing frameworks, and logging is just for test output.
  • production is for being safe. Logging is just for errors. Sometimes there is a caching layer. Libraries are loaded once. Data is replicated. Everything is set up to improve both performance and robustness.

These environments are really useful for managing application development. Unfortunately they are not enough to handle every situation. For example, production is not appropriate for testing new features because of the poor logging and the heavy optimization (and the precious production data), and it is not appropriate for demos either because it has to be used by customers. Development is likewise not appropriate for finding bottlenecks because of the messy data and the debug code.

In my experience I usually add three more environments to my applications, trying to fit every situation. In most cases these are enough; a sketch of how to wire one of them up in Rails follows the list.

  • staging is for deep testing of new features. Production data with development logging and libraries. It enables you to test the side effects of your new features in the real world. If a change works here, it will probably also work in production.
  • demo is for showtime. A production environment with sandboxed features and demo data. You can open this environment to anyone and they can play however they want without danger.
  • profile is for finding bottlenecks. A development environment with specific libraries for profiling and fine-tuning your processes. You can adjust data to stress your system without worrying about data coherence.
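
In Rails, adding one of these is just a matter of creating a new file under config/environments/. A minimal sketch for staging (assuming a Rails 4-style app; starting from the production settings and relaxing logging is a common pattern, and the exact tweaks are illustrative):

# config/environments/staging.rb
# Inherit everything from production, then override what staging needs.
require Rails.root.join("config/environments/production").to_s

Rails.application.configure do
  config.log_level = :debug                      # development-style logging on production-like data
  config.action_mailer.delivery_method = :test   # don't email real users from staging
end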

This is IMHO a good setup for your deploy environments. Depending on the project some of these aren’t useful, but in a large project each one can save your life.

Serialized fields in Rails are a really useful feature to store structured data related to a single element of your application. Performance usually isn’t stunning because the data is stored in a text field.

Recently, to overcome this limit, hstore on PostgreSQL and similar structures on other DBMSs have gained popularity.

Anyway, editing data using a form still requires a lot of code. Last week I was working on a form to edit the options of an element stored in a serialized field and I found this question on StackOverflow. It contains a really interesting solution. For a serialized field called properties:

class Element < ActiveRecord::Base
  serialize :properties
end

I can dynamically define accessor methods for any field I need:

class Element < ActiveRecord::Base
  serialize :properties

  def self.serialized_attr_accessor(*args)
    args.each do |method_name|
      eval "
        def #{method_name}
          (self.properties || {})[:#{method_name}]
        end

        def #{method_name}=(value)
          self.properties ||= {}
          self.properties[:#{method_name}] = value
        end

        attr_accessible :#{method_name}
      "
    end
  end

  serialized_attr_accessor :field1, :field2, :field3
end

And then you can easily access the fields in a view:

-# haml
- form_for @element do |f|
  = f.text_field :field1
  = f.text_field :field2
  = f.text_field :field3
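
Behind the scenes the generated accessors simply read and write keys of the serialized hash; a quick console sketch of the behavior you get:

e = Element.new
e.field1 = "red"
e.field1      # => "red"
e.properties  # => { :field1 => "red" }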

IMHO it’s a really clean way to improve the quality of accessor code.

Ruby doesn’t like strings which are not UTF-8 encoded. CSV files are usually a bunch of data coming from somewhere, and most of the time they are not UTF-8 encoded. When you try to read them you can expect problems. I fought against encoding problems for a long time and I finally found how to avoid the major ones, and I’m very proud of this (because of the many headaches… :-/ ).

If you read a CSV file you can specify the :encoding option to set the source and destination encodings (format: “source:destination”) so the data is passed to the CSV engine already converted:

CSV.foreach("file.csv", encoding: "iso-8859-1:UTF-8") do |row|
  # use row here...
end

If your resource is not a file but a String or a file handler, you need to convert it before using the CSV engine. The standard String#force_encoding method doesn’t seem to work as expected:

a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding?
# => returns false
a =~ /x/
# => provokes ArgumentError: invalid byte sequence in UTF-8

You must use String#encode! method to get things done:

a = "\xff"
a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding?
# => returns true now
a =~ /x/
# => works now

So using an external resource:

require 'csv'
require 'open-uri'

handler = open("http://www.example.com/file.csv")
csv_string = handler.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
CSV.parse(csv_string) do |row|
  # use row here...
end


OpenURI is a really useful part of the Ruby standard library. I had never used it with basic authentication, but I thought that specifying credentials in the URL was enough. I was wrong. It returns an error:

ArgumentError: userinfo not supported. [RFC3986]

The right way to pass auth params is a bit hidden in the documentation page. You can find it among the options of the OpenRead open method:

open("http://www.your-website.net", 
http_basic_authentication: ["user", "password"])


heroku_appfog

A few months ago I migrated my personal website (a single page with a bunch of links, served by Sinatra) from Heroku to AppFog. I did it because the Heroku service has one big limitation: you can’t point a root domain to a Heroku app. Actually there are some workarounds to do it, but Heroku discourages them.

AppFog is quite different from Heroku. It allows you to point your root domain to their app (see the blog post and docs about that). It doesn’t use Git for deploys; it uses a command line tool called af which transfers your code to the cloud and performs management operations (actually really similar to the Heroku Toolbelt).

Setup is done using one of the available Jumpstarts, ready-to-go scaffolds for your needs. There are Jumpstarts for Rails, Sinatra, Django, WordPress and more. In addition you can choose several add-ons (many of which are also available on Heroku). The free plan offers you 2GB of memory with no limits. It’s enough for personal use.

Another advantage is the lack of idling. Heroku dynos go idle if your app isn’t accessed for a while. My site gets 5-6 visits per day, so it’s idle for almost every request… AppFog doesn’t seem affected by this problem; it probably works in a different way.

If your website is small and has low traffic, my advice is to use AppFog.