Hadoop would not exist without this paper. MapReduce is the most famous and still most used processing paradigm for big data. It is not suitable for everything and there are several improvements (Dryad, Spark, …), but Google, Facebook, Twitter and many others have millions of lines of MapReduce code deployed in their systems.


Title: MapReduce: Simplified Data Processing on Large Clusters (PDF), December 2004
Authors: Jeffrey Dean and Sanjay Ghemawat

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
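
To make the model concrete, here is a minimal word-count sketch in plain Python that mimics the map and reduce functions described above; the runner is only a toy, single-machine stand-in for the distributed runtime the paper describes.

    from collections import defaultdict

    def map_fn(doc_id, text):
        # map: emit an intermediate (word, 1) pair for every word in the document
        for word in text.split():
            yield (word.lower(), 1)

    def reduce_fn(word, counts):
        # reduce: merge all intermediate values associated with the same key
        yield (word, sum(counts))

    def run_mapreduce(documents):
        # toy single-process runner: group intermediate pairs by key
        # (the "shuffle"), then apply the reduce function to every group
        groups = defaultdict(list)
        for doc_id, text in documents.items():
            for key, value in map_fn(doc_id, text):
                groups[key].append(value)
        results = {}
        for key, values in groups.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    docs = {"d1": "the quick brown fox", "d2": "the lazy dog"}
    print(run_mapreduce(docs))  # {'the': 2, 'quick': 1, ...}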

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.


Check out the list of interesting papers and projects (Github).

This is “the paper” that started everything in the big data environment. In 2003 Google already faced problems of data size and availability that most of us still haven’t met. They developed a proprietary distributed filesystem called GFS. A couple of years later Yahoo created HDFS, the distributed filesystem that is part of the Hadoop framework, inspired by this paper. As Hadoop co-creator Doug Cutting (@cutting) put it: “Google is living a few years in the future and sending the rest of us messages”.



Title: The Google File System (PDF), October 2003
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.


Check out the list of interesting papers and projects (Github).

A couple of weeks ago I was playing with Hack and looking for online resources. I found on YouTube the playlist of Hack Dev Day 2014, where Hack was officially presented to the world.

The introduction to the language by Julien Verlaguet is really interesting: it shows the advantages of static typing and how HHVM is able to preserve the rapid development cycle of PHP.

The talk by Josh Watzman is also interesting. He explains how to convert PHP code to Hack code, and his years of experience at Facebook are extremely useful.

The conference also covers how to run HHVM on Heroku, gives an overview of the libraries and common use cases of Hack, and discusses HHVM’s strong optimizations.

If you are playing with Hack I absolutely recommend these videos.

From the home page:

“Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.”

Introduction

Storm enables you to define a Topology (an abstraction of cluster computation) in order to describe how to handle data flow. In a topology you can define some Spouts (entry points for your data, with basic preprocessing) and some Bolts (each a single step of data manipulation). This simple strategy enables you to define complex processing of streams of data.
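
To give an idea of the abstraction, here is a toy, single-process sketch in Python (this is not the real Storm API, which is JVM-based; class and method names are invented for illustration):

    import random

    class SentenceSpout:
        # Spout: entry point that emits raw tuples into the topology
        SENTENCES = ["the cow jumped over the moon", "an apple a day"]

        def next_tuple(self):
            return random.choice(self.SENTENCES)

    class SplitBolt:
        # Bolt: a single processing step, here splitting sentences into words
        def process(self, sentence):
            return sentence.split()

    class CountBolt:
        # Bolt: another step, keeping a running count per word
        def __init__(self):
            self.counts = {}

        def process(self, words):
            for w in words:
                self.counts[w] = self.counts.get(w, 0) + 1
            return self.counts

    # the "topology" is just the wiring: spout -> split bolt -> count bolt
    spout, split_bolt, count_bolt = SentenceSpout(), SplitBolt(), CountBolt()
    for _ in range(10):
        counts = count_bolt.process(split_bolt.process(spout.next_tuple()))
    print(counts)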

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Coordination between Nimbus and the Supervisors is done through ZooKeeper.

Libraries

Resources

Books

Last summer I had the pleasure of reviewing a really interesting book about Spark written by Holden Karau for PacktPub. She is a really smart woman, currently a software development engineer at Google and active in Spark’s developer community. In the past she worked for Microsoft, Amazon and Foursquare.

Spark is a framework for writing fast, distributed programs. It’s similar to Hadoop MapReduce but uses a fast in-memory approach. The Spark ecosystem incorporates built-in tools for interactive query analysis (Shark), a large-scale graph processing and analysis framework (Bagel), and a real-time analysis framework (Spark Streaming). I discovered them a few months ago while exploring the extended Hadoop ecosystem.

The book covers how to write distributed MapReduce-style programs. You can find everything you need: setting up your Spark cluster, using the interactive shell, and writing and deploying distributed jobs in Scala, Java and Python. The last chapters look at how to use Hive with Spark to get a SQL-like query syntax with Shark, and at manipulating resilient distributed datasets (RDDs).
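
As a small taste of what the book walks you through, here is a tiny PySpark job; it assumes a local Spark installation, and the log path is just a placeholder:

    from pyspark import SparkContext

    # run against a local "cluster" using all available cores
    sc = SparkContext("local[*]", "log-errors")

    # build an RDD from a text file, keep only error lines and count them
    lines = sc.textFile("hdfs:///logs/app.log")  # placeholder path
    errors = lines.filter(lambda line: "ERROR" in line)

    print(errors.count())
    sc.stop()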

Have fun reading it! 😀

Fast data processing with Spark
by Holden Karau


The title is also listed in the Research Areas & Publications section of the Google Research portal: http://research.google.com/pubs/pub41431.html

Since I was 16 I have tried to learn “everything” about computer science. I was an enthusiastic boy with huge ambitions, I know 🙂 Anyway, learning everything implies knowing what “everything” means. Computer science is a wide field: it goes from math to physics, from languages to software development, and it is young, mostly unknown and rapidly evolving.

For years I searched for a well-done “map” of computer science, something able to describe the structure of the discipline and give the right place to every single element related to it, and I failed: this map doesn’t exist yet. People have tried to describe it using different taxonomies, but none of them does it properly.

Some attempts:

So I did it by myself, starting with the knowledge I had and trying to derive “The Map” using a bottom-up approach. It took several years and in the end I came to a sentence:

Computer Science is an Implementation of a set of logical-mathematical Algorithms, made using a descriptive Language, which runs on a Physical environment.

This sentence should describe every “object” related to computer science. There are four main elements:

  • Implementations,
  • Algorithms,
  • Languages,
  • Physical environment.

Implementation is software. You can classify it by abstraction level (firmware, kernels, drivers, middleware, software, distributed software) and, inside each level, by category.

Algorithms are the logical approach to the problem. You can classify them by complexity and then by kind (descriptive for data structures, deterministic, non-deterministic, probabilistic, parallel, quantum, infinite).

Languages are used to describe the algorithms. You can classify them by abstraction from the physical level (transmission codes, protocols, assembly, compiled, semi-compiled, interpreted, metalanguages).

Physical environment is the hardware. You can classify it by the complexity of the implemented logic (combinatorial, sequential, programmed and microprogrammed, complex, distributed).

Each of these “aspects” can be enumerated, and every “object” related to computer science can be described using these four variables, a quadruplet: (Physical environment, Language, Software, Algorithm).

Some examples:

  • The Ruby programming language is:
    (complex, interpreted, [middleware|software], *)
    an interpreted language which runs on a complex electronic environment (CPU + RAM), used to write middleware or software implementing any kind of algorithm.
  • An Apple iPhone is:
    (complex, compiled, *, *)
    a bunch of every kind of software (from firmware to remote services), mostly made using a compiled language (Objective-C), which runs on a complex environment (the iPhone hardware) and implements almost any kind of algorithm and data structure.
  • Mobile apps are:
    (complex, [compiled, semi-compiled, interpreted], [middleware|software], *)
    they run on complex hardware (smartphones), are made using Objective-C, Java, JavaScript or similar languages, and are standalone software or clients for a bigger application.
  • Amazon Cloud is:
    (distributed, *, distributed software, *)
    distributed software which runs on distributed hardware, using everything from simple protocols to scripting languages to implement algorithms.
  • The “quicksort” is:
    (programmed, *, *, deterministic)
    a deterministic algorithm implementable on a programmed system.
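
Just to fix the idea, the quadruplet can be sketched as a tiny Python record; the values are only the informal labels used above:

    from collections import namedtuple

    # (Physical environment, Language, Software, Algorithm) -- "*" means "any"
    CSObject = namedtuple("CSObject", ["physical", "language", "software", "algorithm"])

    ruby = CSObject("complex", "interpreted", "middleware|software", "*")
    quicksort = CSObject("programmed", "*", "*", "deterministic")

    print(ruby.language, quicksort.algorithm)  # interpreted deterministic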

I know, in real life this “index” is completely worthless. Anyway, while I was building it I discovered many different things and many connections I hadn’t imagined before, and now my knowledge has a strong, ordered base.

Two important updates were recently released in the Ruby world (informally named ROR24):

  1. Ruby 2.0.0-p0
    http://www.ruby-lang.org/en/news/2013/02/24/ruby-2-0-0-p0-is-released/
  2. Rails 4.0.beta1
    http://weblog.rubyonrails.org/2013/2/25/Rails-4-0-beta1/

Following these releases, PragProg has published updates to two of the most popular books about these topics.

Programming Ruby (the pickaxe book)
by Dave Thomas, with Chad Fowler and Andy Hunt


Agile Web Development with Rails
by Sam Ruby, Dave Thomas and David Heinemeier Hansson


I bought them yesterday. At first look the updates seem cool, even if they are only minor. In the coming days I’m going to practice with this new stuff and write some posts about it 😉

When you work on a single machine everything is easy. Unfortunately, when you have to scale and be fault tolerant you must rely on multiple hosts and manage a structure usually called a “cluster”.

MongoDB enables you to create a replica set to be fault tolerant and to use sharding to scale horizontally. Sharding is not transparent to DBAs: you have to choose a shard key and add or remove capacity when the system needs it.

Structure

In a MongoDB cluster you have 3 fundamental “pieces”:

  • Servers: usually called mongod
  • Routers: usually called mongos.
  • Config servers

Servers are where you actually store your data: when you start the mongod command on your machine you are running a server. In a cluster you usually have multiple shards distributed over multiple servers. Every shard is usually a replica set (2 or more servers), so if one of the servers goes down your cluster remains up and running.

Routers are the interaction point between users and the cluster. If you want to interact with your cluster you have to go through a mongos. The process routes your request to the correct shard and gives you back the answer.

Config servers hold all the information about cluster configuration. They are very sensitive nodes and they are the real point of failure of the system.

[Image: MongoDB cluster structure]

Shard Key

Choosing the shard key is the most important part of creating a cluster. There are a few important rules to follow, learned after several errors:

  1. Don’t choose a shard key with a low cardinality. If one of its possible values grows too much, you can’t split it anymore.
  2. Don’t use an ascending shard key. The only shard that grows is the last one, and redistributing load to the other servers always requires a lot of traffic.
  3. Don’t use a random shard key. It is quite efficient, but you have to add an index to use it.

A good choice is to use a coarsely ascending key combined with a search key (something you commonly query on). This choice won’t work well for everything, but it’s a good way to start thinking about it.
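
For example, with pymongo you could shard a hypothetical events collection on a coarsely ascending month field combined with a commonly queried user_id, connecting through a mongos (database, collection and field names are invented for illustration):

    from pymongo import MongoClient

    # connect to a mongos router, not directly to a shard
    client = MongoClient("mongodb://mongos-host:27017")

    # enable sharding on the database, then shard the collection on a
    # compound key: coarsely ascending (month) + search key (user_id)
    client.admin.command("enableSharding", "analytics")
    client.admin.command("shardCollection", "analytics.events",
                         key={"month": 1, "user_id": 1})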

N.B. All the information and the image of the cluster structure come from the book below. I read it last week and I found it really interesting 🙂

Scaling MongoDB
by Kristina Chodorow


Redis is widely used in the projects I work on every day at @thefool_it. My knowledge of it is really poor, so I decided to improve my skills up to a PRO level. I understand basic Redis concepts because I worked with memcached in the past, and the differences were clearly explained in “Seven Databases in Seven Weeks“. My weaknesses are about everyday use: setup, administration, querying 🙁

Introduction and setup.

Installation is really easy because you can compile from source. On OS X you also have brew or port with an up-to-date package. Upgrading isn’t so easy: the standard way is to start the updated version on another port and migrate the data.

Data types are: Strings, Lists (ordered lists of strings), Hashes, Sets (no duplicated values) and Sorted Sets (sets sorted by a score).

The standard distribution comes with a command line interface: redis-cli. There are client libraries for the most common environments and programming languages, such as Node.js (node_redis), Python (redis-py), Ruby (redis-rb) and more.
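
A quick taste of the data types with redis-py, assuming a Redis server running on localhost and a recent redis-py (3.x or later, where zadd takes a mapping):

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    r.set("page:home:title", "Welcome")        # String
    r.lpush("queue:jobs", "job1", "job2")      # List (ordered)
    r.hset("user:1", "name", "alice")          # Hash
    r.sadd("user:1:tags", "redis", "nosql")    # Set (no duplicates)
    r.zadd("leaderboard", {"alice": 42})       # Sorted Set (sorted by score)

    print(r.get("page:home:title"))            # b'Welcome'
    print(r.zscore("leaderboard", "alice"))    # 42.0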

In the coming weeks I’m going to practice commands and admin techniques using the following resources.

Redis Cookbook
by Tiago Macedo, Fred Oliveira


Other interesting sources

Things I learned this week about MongoDB:

  • Duplicated data and a denormalized schema are good for speed, especially if your data doesn’t change often.
  • Allocating more space for objects on disk is hard, even for MongoDB. Do it early.
  • Preprocessing your data and creating indexes to speed up the queries you actually run is a damn good idea (see the sketch after this list).
  • The query order (with AND and OR operators) is really important.
  • Use journaling and replication to keep your data safe.
  • Use JavaScript to define your own functions.
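
For instance, creating a compound index with pymongo on the fields you actually query (collection and field names are invented) is a one-liner:

    from pymongo import MongoClient, ASCENDING, DESCENDING

    db = MongoClient("mongodb://localhost:27017")["mydb"]

    # compound index matching a common query: filter by user_id, sort by ts desc
    db.events.create_index([("user_id", ASCENDING), ("ts", DESCENDING)])

    # this query can now be served by the index
    cursor = db.events.find({"user_id": 42}).sort("ts", DESCENDING).limit(10)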

50 Tips and Tricks for MongoDB Developers
by Kristina Chodorow
