From the home page

“Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.”

Introduction

Storm enables you to define a Topology (an abstraction of the cluster computation) that describes how to handle the data flow. In a topology you can define some Spouts (entry points for your data, with basic preprocessing) and some Bolts (single steps of data manipulation). This simple strategy enables you to define complex processing of streams of data.

Storm nodes come in two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor. The Supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. All coordination between Nimbus and the Supervisors is done through ZooKeeper.

Libraries

Resources

Books

When I used Rails for the first time I was impressed by the use of multiple environments to handle different setups, policies and behaviors of an application. A few years ago the use of environments wasn’t so common, and switching between development, test and production was an innovation for me.

Anyway, big projects built on custom frameworks introduced this structure several years earlier. I have had the pleasure of working on a lot of legacy code that implements different environment setups. For example, the classified ads channel of Repubblica.it (built by my mentor @FabiolousMate) uses dev, demo and production. Other projects I worked on use staging. After listening to a lot of opinions I asked myself which environments are the most important and whether three are enough.

I’m mostly a Ruby developer and I know the Rails ecosystem, which uses three basic environments.

  • development is used when you code. Source code is reloaded on every request. Logging is EXTREMELY verbose. Libraries include debugging and error-reporting features. The database is full of garbage data.
  • test is for automated testing. Data is loaded and cleaned automatically every time you run the tests. Everything can be mocked (database, APIs, external services, …). Libraries include testing frameworks and the log contains just the test output.
  • production is meant to be safe. Logging is just for errors. Sometimes there is a caching layer. Libraries are loaded once. Data is replicated. Everything is set up to improve both performance and robustness.

These environments are really useful for managing application development. Unfortunately they are not enough to handle every situation. For example, production is not appropriate for testing new features because of the sparse logging and the heavy optimization (and the precious production data), and it is not appropriate for demos either because it has to be used by customers. Development is likewise not appropriate for finding bottlenecks because of the messy data and the debug code.

In my experience I usually add three more environments to my applications, trying to cover every situation. In most cases these are enough (a minimal sketch of how to add one in Rails follows the list below).

  • staging is for deep testing of new features: production data with development logging and libraries. It enables you to test the side effects of your new features in the real world. If a change works here, it will probably also work in production.
  • demo is for showtime: a production environment with sandboxed features and demo data. You can open this environment to anyone and they can play with whatever they want without danger.
  • profile is for finding bottlenecks: a development environment with specific libraries for profiling and fine-tuning your processes. You can adjust the data to stress your system without worrying about data coherence.
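How you add one of these depends on your stack; in Rails it boils down to a new file under config/environments, a matching entry in config/database.yml, and booting with RAILS_ENV=staging. The snippet below is only a sketch of a possible staging setup for a Rails 3 application (YourApp and every setting are illustrative, not a prescription):

# config/environments/staging.rb — hypothetical staging environment:
# production-like code loading with development-like visibility.
YourApp::Application.configure do
  config.cache_classes = true                  # don't reload code, as in production
  config.consider_all_requests_local = true    # full error reports, as in development
  config.log_level = :debug                    # verbose logging to inspect side effects
  config.action_mailer.delivery_method = :test # never email real customers from staging
end

The same pattern works for demo and profile: start from the production file and relax or instrument only the settings you need.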

This is IMHO a good setup for your deploy environments. Depending on the project some of these aren’t useful, but in a large project each one can save your life.

Last summer I had the pleasure of reviewing a really interesting book about Spark written by Holden Karau for PacktPub. She is a really smart woman, currently a software development engineer at Google and active in the Spark developer community. In the past she worked for Microsoft, Amazon and Foursquare.

Spark is a framework for writing fast, distributed programs. It’s similar to Hadoop MapReduce but uses a fast in-memory approach. The Spark ecosystem incorporates built-in tools for interactive query analysis (Shark), a large-scale graph processing and analysis framework (Bagel), and a real-time analysis framework (Spark Streaming). I discovered them a few months ago while exploring the extended Hadoop ecosystem.

The book covers how to write distributed MapReduce-style programs. You can find everything you need: setting up your Spark cluster, using the interactive shell, and writing and deploying distributed jobs in Scala, Java and Python. The last chapters look at how to use Hive with Spark to get a SQL-like query syntax with Shark, and at manipulating resilient distributed datasets (RDDs).

Have fun reading it! 😀

Fast data processing with Spark
by Holden Karau


The title is also listed in the Research Areas & Publications section of the Google Research portal: http://research.google.com/pubs/pub41431.html

Recently I had to analyze interactions on a Facebook page. I needed to fetch all the content from the stream and analyze user actions. Retrieving the interaction counts for each post can be hard because the Facebook APIs are like hell: they change very fast, return a lot of errors, have understandable limits and give you many headaches.

Anyway, after a lot of tries I found a way to fetch quantitative information about posts and photos on the stream. First of all you need the content.

Get the contents

The Graph endpoint is https://graph.facebook.com/. You can fetch page data (I use the BBCNews page as an example) at:

https://graph.facebook.com/bbcnews/posts?access_token=your_access_token

There are different ways to get a valid access token, and Facebook lets you choose among many different kinds of access tokens, each one with a different rate limit.
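As an illustration (not necessarily the token type you will need), an app access token can be requested with the client_credentials grant; in the Ruby sketch below, APP_ID and APP_SECRET are placeholders for your own app credentials:

require 'open-uri'
require 'cgi'

APP_ID     = 'your_app_id'     # placeholder
APP_SECRET = 'your_app_secret' # placeholder

# The endpoint answers with a URL-encoded body such as "access_token=..."
response = open("https://graph.facebook.com/oauth/access_token" \
                "?client_id=#{APP_ID}&client_secret=#{APP_SECRET}" \
                "&grant_type=client_credentials").read
app_access_token = CGI.parse(response)['access_token'].first
puts app_access_token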

The data returned by the posts endpoint is a JSON array of elements. Each element has a lot of properties which describe an item on the timeline. The elements included in the stream have just a subset of these properties (last comments, last likes, some counters). Here you can find the text content, pictures and links. To get more data you need three more properties: id, type and object_id.

Status updates are identified by type “status”, photos by type “photo” and videos by type “video”. The id field is used as the identifier of the entry on the stream. The object_id instead is used to identify the object inside the Facebook graph.

Actions: comments, likes and shares

Comments are returned paginated and sometimes the API doesn’t return the entire list. To get the total count you need to specify the parameter summary=true.

https://graph.facebook.com/228735667216_10151700273382217/comments?summary=true&access_token=your_access_token

At the end of the response you can find additional information about the comments feed. total_count holds the count.

"summary": {
"order": "ranked",
"total_count": 100
}

Likes are similar to comments. They have similar limitations and a similar endpoint to retrieve data, with the same parameter summary=true.

https://graph.facebook.com/228735667216_10151700273382217/likes?summary=true&access_token=your_access_token

This time the summary shows only the total count:

"summary": {
"total_count": 949
}

Shares count can be found as part of the object detail.

https://graph.facebook.com/228735667216_10151700273382217/?access_token=your_access_token

After the created and updated dates you find the shares property:

"shares": {
"count": 238
}
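Putting the three requests together, here is a minimal Ruby sketch (plain open-uri and JSON, no SDK; the access token and the post id are the placeholders used in the examples above, and the response fields are the ones shown):

require 'open-uri'
require 'json'
require 'cgi'

ACCESS_TOKEN = 'your_access_token'              # placeholder
POST_ID      = '228735667216_10151700273382217' # the example post above

# Small helper: GET a Graph API path with query parameters and parse the JSON.
def graph_get(path, params = {})
  query = params.merge(access_token: ACCESS_TOKEN)
                .map { |k, v| "#{k}=#{CGI.escape(v.to_s)}" }
                .join('&')
  JSON.parse(open("https://graph.facebook.com/#{path}?#{query}").read)
end

comments = graph_get("#{POST_ID}/comments", summary: true)
likes    = graph_get("#{POST_ID}/likes",    summary: true)
post     = graph_get(POST_ID)

puts "comments: #{comments['summary']['total_count']}"
puts "likes:    #{likes['summary']['total_count']}"
puts "shares:   #{post['shares'] ? post['shares']['count'] : 0}"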

Convert the object_id

Depending on your data feed, sometimes the id is not available and you have to handle the object_id. To be able to use the previous methods you need to query the Facebook database using FQL, looking for the page_story_id.

SELECT page_story_id
FROM photo
WHERE object_id = '10151700273362217'

https://api.facebook.com/method/fql.query?format=json&access_token=#your_access_token&query=SELECT%20page_story_id%20FROM%20photo%20WHERE%20object_id%20%3D%20%2710151700273362217%27

The result is the page_story_id (the id of the post on the feed) of the object.

"data": [
{
"page_story_id": "228735667216_10151700273382217"
}
]

Now you can use this to retrieve counters and data.
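If you want to script this conversion step as well, here is a rough Ruby sketch of the same FQL lookup (placeholder access token; it tries to handle both a bare JSON array and the "data" wrapper shown above):

require 'open-uri'
require 'json'
require 'cgi'

fql = "SELECT page_story_id FROM photo WHERE object_id = '10151700273362217'"
url = "https://api.facebook.com/method/fql.query" \
      "?format=json&access_token=your_access_token&query=#{CGI.escape(fql)}"

result = JSON.parse(open(url).read)
rows   = result.is_a?(Hash) ? result['data'] : result # both response shapes
puts rows.first['page_story_id']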

I tend to spend a lot of time playing with data in order to import, export and aggregate it. MySQL is one of my primary sources of data because it sits behind many popular projects and is usually the first choice for custom solutions as well.

Recently I discovered some really useful, little-known functions which help me export complex data.

GROUP_CONCAT

This function returns a string result with the concatenated non-NULL values from a group. It returns NULL if there are no non-NULL values.

SELECT student_name,
GROUP_CONCAT(test_score SEPARATOR ',')
FROM student
GROUP BY student_name;

SELECT INTO

The SELECT * INTO OUTFILE statement is intended primarily to let you very quickly dump a table to a text file on the server machine. It is really useful for exporting data as CSV directly from your master server. You can also use DUMPFILE if you need raw output.

SELECT a,b,a+b INTO OUTFILE '/tmp/result.txt'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
FROM test_table;

If you plan to use it with a standard CSV library you should refer to RFC 4180 for the correct format, in order to avoid reading errors.
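For example, assuming the export above ends up RFC 4180 compatible (comma separator, double-quote enclosure), Ruby's standard CSV library can read it back directly; a minimal sketch:

require 'csv'

CSV.foreach('/tmp/result.txt', col_sep: ',', quote_char: '"') do |row|
  # each row is an array of the selected columns: [a, b, a+b]
  puts row.inspect
end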

Apache Sqoop

If your database is bigger than you are able to manage, you probably need Sqoop. It is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

You can import data from your MySQL database to a CSV file stored on HDFS and access it from anywhere in your cluster.

sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table cities


Nowadays almost all user-generated content is shared on social networks.

The information which Facebook, Twitter, LinkedIn and Google can retrieve about your page content isn’t always complete. To help crawlers better understand what they are reading, you can use different kinds of meta tags to add information about you and your content.

Last week I added these to my WordPress blog.

Google Authorship

It allows an author to connect his Google+ profile to his blog. Every article contains some meta tags with information about the author and the URL.

<link href="http://usefulstuff.io/2013/07/more-about-big-data-ecosystem/" rel="canonical" />
<link href="http://usefulstuff.io/?p=890" rel="shortlink" />
<link href="https://plus.google.com/104663370650182235263" rel="author" />
<link href="https://plus.google.com/104663370650182235263" rel="publisher" />

The best plugin I found is Google Author Link.

[screenshot: Google Authorship]

Twitter Cards

Twitter Cards generate inline content previews on Twitter.com and in Twitter clients. They make it possible to attach media experiences to tweets that link to your content.

<meta name="twitter:card" content="summary" />
<meta name="twitter:creator" content="@zenkay" />
<meta name="twitter:site" content="@zenkay" />
<meta name="twitter:title" content="More about big-data ecosystem" />
<meta name="twitter:description" content="Last month while I was inspecting the Hadoop ecosystem I found many other software related to big-data. Below the (incomplete again) list.N.B. Informations and texts are taken from official websites or sources referenced at the …" />
<meta name="twitter:image" content="http://usefulstuff.io/default.png" />

The best plugin I found is JM Twitter Cards, but it is not flexible enough and I’m looking for something else.

[screenshot: Twitter Card preview]

Open Graph

The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.

The best plugin is WP Facebook Open Graph Protocol. Despite its name it works with every website that supports the Open Graph protocol.

<meta content="http://usefulstuff.io/2013/07/more-about-big-data-ecosystem/" />
<meta content="More about big-data ecosystem" />
<meta content="Useful Stuff" />
<meta content="Last month while I was inspecting the Hadoop ecosystem I found many other software related to big-data. Below the (incomplete again) list.  N.B. Informations " />
<meta content="article" />
<meta content="en_us" />

How it looks on Facebook:

[screenshot: Open Graph preview on Facebook]

How it looks on LinkedIn:

[screenshot: Open Graph preview on LinkedIn]

For about half a year I’ve wanted to move my blog away from Heroku. It’s the best PaaS I’ve ever used, but the free plan has a huge limit: dynos idle. In a previous post I talked about how to use Heroku to build a reverse proxy in front of AppFog to avoid their custom domain limit, but the idle problem is still there. My blog has fewer than 100 visits per day and almost every visitor has to wait 5-10 seconds to view the home page because the dynos are always idle.

Today I decided to move to another platform suggested by my friend @dani_viga: OpenShift. It’s a PaaS similar to Heroku which uses Git for revision control and has a similar scaling system. And the free plan doesn’t have the idle problem and it’s 10 times faster!

I created a new application using the following cartridges: PHP 5.3, MySQL 5.1 (I’d like to use MariaDB but that cartridge is still in development and I couldn’t install it) and phpMyAdmin 3.4. OpenShift requires a Git repo to set up the application and provides a WordPress template to start from. I used it as a template, moving my blog’s code into the /php directory.

The hard part was migrating my PostgreSQL database to the new MySQL one. To start, I removed the PG4WP plugin by following its installation instructions in reverse order.

Then I exported my PostgreSQL database using the heroku db:pull command. It’s based on taps and is really useful. I had some problems with my local installation of MySQL because taps has no options for packet size and character set, so you must set them as defaults. I added a few lines to my my.cnf configuration:

# enlarged, before was 1M
max_allowed_packet = 10M
# default to utf-8
skip-character-set-client-handshake
character_set_client=utf8
character_set_server=utf8

At the end of the pull my local database contained an exact copy of the Heroku one, and I could dump it to a SQL file and import it into the new MySQL cartridge using phpMyAdmin.

The only problem I had was with the SSL certificate. The free plan doesn’t offer an SSL certificate for custom domains, so I had to remove the use of HTTPS for the login. You can do it in wp-config.php by setting:

define('FORCE_SSL_ADMIN', false);

Now my blog runs on OpenShift and so far it seems incredibly faster 😀

Serialized fields in Rails are a really useful feature for storing structured data related to a single element of your application. Performance usually isn’t stunning because they are stored in a text field.

Recently, to overcome this limit, hstore on PostgreSQL and similar structures on other DBMSs have gained popularity.

Anyway, editing that data using a form still requires a lot of code. Last week I was working on a form to edit the options of an element stored in a serialized field and I found this question on StackOverflow. It offers a really interesting solution. For a serialized field called properties:

class Element < ActiveRecord::Base
  serialize :properties
end

I can dynamically define accessor methods for any field I need.

class Element < ActiveRecord::Base
  serialize :properties

  def self.serialized_attr_accessor(*args)
    args.each do |method_name|
      eval "
        def #{method_name}
          (self.properties || {})[:#{method_name}]
        end

        def #{method_name}=(value)
          self.properties ||= {}
          self.properties[:#{method_name}] = value
        end

        attr_accessible :#{method_name}
      "
    end
  end

  serialized_attr_accessor :field1, :field2, :field3
end
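Just to make the effect concrete, here is a hypothetical console session using the placeholder field names from the snippet above:

element = Element.new
element.field1 = "some value"
element.properties # => { :field1 => "some value" }
element.field1     # => "some value"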

And then you can easily access the fields in a view:

-# haml
= form_for @element do |f|
  = f.text_field :field1
  = f.text_field :field2
  = f.text_field :field3

IMHO it’s a really clean way to improve the quality of accessor code.

Yesterday I had to re-deploy the WordPress installation of PrimeGap.net on a new server and, looking for some tips about configuration, I found a new strange buzzword: LEMP Stack. 

We all know the LAMP Stack and we all know it’s old, slow and hard to scale. It includes any distribution of Linux, Apache with PHP as a module and MySQL 5.x.

A LEMP Stack is a bit different. First of all it uses nginx (pronounced “engine x”), and this explains the “E”. Then you can replace MySQL with any of its forks; I personally use MariaDB 10.0. Many people also use Percona.

You can also replace PHP with another language such as Python or Ruby, but if you still use PHP, choose PHP-FPM.

Many hosting providers publish useful guides to set up your server:

Linode is a bit different and uses PHP-FastCGI. Both use MySQL. If you, like me, prefer MariaDB, the following guides should help you:

The current version of WordPress is easy to run on this stack. The WordPress Codex provides a custom configuration for nginx. There are many optimizations you can do; this Gist seems well done: https://gist.github.com/tjstein/902803

Welcome to the next-gen 🙂

Ruby doesn’t like strings which are not UTF-8 encoded. CSV files are usually a bunch of data coming from somewhere, and most of the time they are not UTF-8 encoded. When you try to read them you can expect problems. I fought against encoding problems for a long time, and now that I’ve found how to avoid the major ones I’m very proud of it (because of the many headaches… :-/ ).

If you read a CSV file you can specify the :encoding option to set the source and destination encodings (format: “source:destination”) so the data is passed to the CSV engine already converted:

CSV.foreach("file.csv", encoding: "iso-8859-1:UTF-8") do |row|
# use row here...
end

If your resource is not a file but a String or a file handle, you need to convert it before using the CSV engine. The standard String#force_encoding method doesn’t seem to work as expected:

a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding?
# => returns false
a =~ /x/
# => provokes ArgumentError: invalid byte sequence in UTF-8

You must use the String#encode! method to get things done:

a = "\xff"
a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding?
# => returns true now
a =~ /x/
# => works now

So using an external resource:

require 'open-uri'
require 'csv'

handler = open("http://www.example.com/file.csv")
csv_string = handler.read.encode!("UTF-8", "iso-8859-1", invalid: :replace)
CSV.parse(csv_string) do |row|
  # use row here...
end
