Deep Learning is a trending buzzword in the Machine Learning world. All the major players in Silicon Valley are heavily investing in the field, and US universities are improving their course offerings.

I'm really interested in artificial intelligence, both for fun and for work, and in the last few weeks I spent some hours searching for the best MOOCs on the topic. I found only a few courses, but they are taught by some of the most notable figures in the Deep Learning and Neural Networks community.

Machine Learning
Stanford University on Coursera, Andrew Ng

Andrew Ng has been Chief Scientist at Baidu Research since 2015; he is a co-founder of Coursera and a Machine Learning lecturer at Stanford University. He also founded the Google Brain project in 2011. His Machine Learning course (CS229a) at Stanford is almost legendary and, obviously, was my starting point.

Machine Learning, Coursera

Neural Networks for Machine Learning
University of Toronto on Coursera, Geoffrey Hinton

Geoffrey Hinton has been working at Google (probably on Google Brain) since 2013, when Google acquired his company DNNresearch Inc. He is a cognitive psychologist best known for his work on artificial neural networks. His Coursera course on Neural Networks dates back to 2012 but still seems to be one of the best resources on these topics.

Neural Networks for Machine Learning, Coursera

Deep Learning (2015)
New York University on TechTalks, Yann LeCun (videos on techtalks.tv)

In 2013 LeCun became the first director of Facebook AI Research. He is well known for his work on optical character recognition and computer vision using convolutional neural networks (CNNs), and is a founding father of convolutional nets. The 2015 Deep Learning course at NYU is the most recent course he has held on the topic.

Yann LeCun. CIFAR NCAP pre-NIPS’ Workshop. Photo: Josh Valcarcel/WIRED

Big Data, Large Scale Machine Learning
New York University on TechTalks, John Langford and Yann LeCun

Another interesting course about Machine Learning, held by LeCun and John Langford, a researcher who has worked at Yahoo! Research, Microsoft Research and IBM's Watson Research Center.

John Langford, NYU

Deep Learning Courses
NVIDIA Accelerated Computing

This is not a college course. NVIDIA was one of the most important graphics board manufacturers in the early 2000s and now, with its experience in massively parallel computing on GPUs, is heavily investing in Deep Learning. This course focuses on using GPUs with the most common deep learning frameworks: DIGITS, Caffe, Theano and Torch.

Deep Learning Courses, NVIDIA

Mastering Apache Spark
Mike Frampton, Packt Publishing

Last summer I had the opportunity to collaborate on the review of this title. The chapter about MLlib contains a useful introduction to Artificial Neural Networks on Spark. The implementation still seems young, but it is already possible to distribute the network over a Spark cluster.
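
To give a rough idea of what this looks like in code, here is a minimal sketch using Spark ML's MultilayerPerceptronClassifier (a feed-forward network trained on the cluster). It assumes a recent Spark with the Python API; the file path and the layer sizes are illustrative and not taken from the book.

```python
# Minimal sketch: a feed-forward network trained with Spark ML
# (assumes Spark 2.x+ with PySpark; path and layer sizes are illustrative)
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("ann-on-spark").getOrCreate()

# LibSVM-formatted data with "label" and "features" columns
data = spark.read.format("libsvm").load("data/multiclass_sample.txt")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# layers = [input features, hidden, hidden, output classes]
mlp = MultilayerPerceptronClassifier(layers=[4, 8, 8, 3], maxIter=100, seed=42)
model = mlp.fit(train)

model.transform(test).select("label", "prediction").show(5)
spark.stop()
```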

Mastering Apache Spark

[UPDATE 2016-01-31]

Deep Learning 
Vincent Vanhoucke, Google, Udacity

A few days ago Google released on Udacity a Deep Learning course focused on TensorFlow, its deep learning tool. It's the first course officially sponsored by a big company; it is free and seems a great introduction. Thanks to Piotr Chromiec for pointing it out 🙂

From the home page:

"Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. […] Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate."

Introduction

Storm enables you to define a Topology (an abstraction of the cluster computation) in order to describe how to handle a data flow. In a topology you can define Spouts (entry points for your data, with basic preprocessing) and Bolts (single steps of data manipulation). This simple strategy enables you to define complex processing of streams of data.
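
To make the spout/bolt idea concrete, here is a conceptual sketch in plain Python. It is not the Storm API (topologies are normally written in Java, or in other languages through Storm's multilang protocol); it only illustrates how data flows from a spout through a chain of bolts.

```python
# Conceptual sketch only: the spout -> bolt data flow in plain Python.
# This is NOT the Storm API; it just illustrates the idea of a topology.

def sentence_spout():
    """Spout: entry point that emits a stream of raw tuples."""
    for line in ["the quick brown fox", "jumps over the lazy dog"]:
        yield line

def split_bolt(stream):
    """Bolt: one processing step; splits sentences into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: another step; keeps a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the spout and the bolts together is, conceptually, a topology.
print(count_bolt(split_bolt(sentence_spout())))
```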

Storm nodes are of two kinds: master and worker. The master node runs Nimbus, which is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Worker nodes run the Supervisor, which listens for work assigned to its machine and starts and stops worker processes as necessary, based on what Nimbus has assigned to it. Everything is coordinated through ZooKeeper.

Libraries

Resources

Books

Last summer I had the pleasure of reviewing a really interesting book about Spark written by Holden Karau for PacktPub. She is a really smart woman, currently a software development engineer at Google and active in the Spark developer community. In the past she worked for Microsoft, Amazon and Foursquare.

Spark is a framework for writing fast, distributed programs. It's similar to Hadoop MapReduce but uses a fast in-memory approach. The Spark ecosystem includes built-in tools for interactive query analysis (Shark), a large-scale graph processing and analysis framework (Bagel), and a real-time analysis framework (Spark Streaming). I discovered them a few months ago while exploring the extended Hadoop ecosystem.

The book covers how to write distributed MapReduce-style programs. You can find everything you need: setting up your Spark cluster, using the interactive shell, and writing and deploying distributed jobs in Scala, Java and Python. The last chapters look at how to use Hive with Spark for a SQL-like query syntax with Shark, and at manipulating resilient distributed datasets (RDDs).
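
To give a flavour of the RDD style the book teaches, here is a minimal PySpark word count. It's a sketch assuming a local Spark installation; the input path is illustrative.

```python
# Minimal word-count sketch with the PySpark RDD API (input path is illustrative)
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

counts = (sc.textFile("data/sample.txt")          # load a text file as an RDD of lines
            .flatMap(lambda line: line.split())   # split lines into words
            .map(lambda word: (word, 1))          # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):
    print(word, count)

sc.stop()
```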

Have fun reading it! 😀

Fast data processing with Spark
by Holden Karau

The title is also listed in the Research Areas & Publications section of the Google Research portal: http://research.google.com/pubs/pub41431.html

Two important updates were recently released in the Ruby world (informally named ROR24):

  1. Ruby 2.0.0-p0
    http://www.ruby-lang.org/en/news/2013/02/24/ruby-2-0-0-p0-is-released/
  2. Rails 4.0.beta1
    http://weblog.rubyonrails.org/2013/2/25/Rails-4-0-beta1/

Following these releases, PragProg has published updated editions of two of the most popular books on these topics.

Programming Ruby (the pickaxe book)
by Dave Thomas, with Chad Fowler and Andy Hunt

Agile Web Development with Rails
by Sam Ruby, Dave Thomas and David Heinemeier Hansson

I bought them yesterday. At first glance the updates look cool, even if they are only minor. In the coming days I'm going to practice with this new stuff and write some posts about it 😉

When you work on a single machine everything is easy. Unfortunately, when you have to scale and be fault tolerant, you must rely on multiple hosts and manage a structure usually called a "cluster".

MongoDB enables you to create a replica set to be fault tolerant and to use sharding to scale horizontally. Sharding is not transparent to DBAs: you have to choose a shard key and add or remove capacity when the system needs it.

Structure

In a MongoDB cluster you have 3 fundamental “pieces”:

  • Servers: usually called mongod
  • Routers: usually called mongos
  • Config servers

Servers are where you actually store your data: when you start the mongod command on your machine, you are running a server. In a cluster you usually have multiple shards distributed over multiple servers. Every shard is usually a replica set (2 or more servers), so if one of its servers goes down your cluster remains up and running.

Routers are the interaction point between users and the cluster. If you want to interact with your cluster, you have to go through a mongos. The process routes your request to the correct shard and gives you back the answer.

Config servers hold all the information about the cluster configuration. They are very sensitive nodes and the real point of failure of the system.

MongoDB cluster structure
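
As a quick way to look at this structure from code, you can ask a mongos which shards it knows about. Here is a minimal pymongo sketch, assuming a mongos listening on localhost:27017.

```python
# Minimal sketch: inspecting a sharded cluster through a mongos router
# (assumes pymongo is installed and a mongos is listening on localhost:27017)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connect to the router, not a mongod

# listShards is answered using the metadata kept on the config servers
for shard in client.admin.command("listShards")["shards"]:
    print(shard["_id"], shard["host"])
```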

Shard Key

Choosing the shard key is the most important part of creating a cluster. There are a few important rules to follow, learned after several mistakes:

  1. Don't choose a shard key with low cardinality. If one of its possible values grows too much, you can't split it any further.
  2. Don't use an ascending shard key. The only shard that grows is the last one, and redistributing load to the other servers always requires a lot of traffic.
  3. Don't use a random shard key. It's quite efficient, but you have to add an index to use it.

A good choice is to use a coarsely ascending key combined with a search key (something you commonly query on). This choice won't work well for everything, but it's a good way to start thinking about it.
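
As a concrete illustration of that advice, here is a minimal pymongo sketch that shards a collection on a compound key: a coarsely ascending field plus a commonly queried one. The database, collection and field names are illustrative, and the commands must be sent to a mongos.

```python
# Minimal sketch: sharding a collection on a compound shard key
# (assumes pymongo and a mongos on localhost:27017; names are illustrative)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # must point to a mongos

client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection", "mydb.events",
    key={"month": 1, "user_id": 1},  # coarsely ascending + commonly queried field
)
```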

N.B. All the information and the image of the cluster structure come from the book below. I read it last week and found it really interesting 🙂

Scaling MongoDB
by Kristina Chodorow

Redis is widely used in the projects I work on every day at @thefool_it. My knowledge of it is really poor, so I decided to improve my skills up to a PRO level. I understand basic Redis concepts because I worked with memcached in the past, and the differences were clearly explained in "Seven Databases in Seven Weeks". My weaknesses are in everyday use: setup, administration, querying 🙁

Introduction and setup

Installation is really easy: you can compile from source, and on OS X you also have brew or port with an up-to-date package. Updating isn't so easy: the standard way is to start the updated version on another port and migrate the data.

Data types are: Strings, Lists (ordered lists of strings), Hashes, Sets (no duplicate values) and Sorted Sets (sets sorted by a score).

The standard distribution comes with a command line interface, redis-cli. There are client libraries for the most common environments and programming languages, such as Node.js (node_redis), Python (redis-py), Ruby (redis-rb) and more.
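
To fix these concepts, here is a quick tour of the data types through redis-py; a minimal sketch assuming redis-py 3+ and a local Redis on the default port (key names are illustrative).

```python
# Quick tour of the basic Redis data types with redis-py
# (assumes redis-py 3+ and a local Redis on the default port)
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

r.set("greeting", "hello")                 # String
r.rpush("queue", "job1", "job2")           # List (ordered)
r.hset("user:1", "name", "Ada")            # Hash (field -> value)
r.sadd("tags", "redis", "nosql", "redis")  # Set (duplicates ignored)
r.zadd("scores", {"alice": 10, "bob": 5})  # Sorted Set (sorted by score)

print(r.get("greeting"))                   # b'hello'
print(r.lrange("queue", 0, -1))            # [b'job1', b'job2']
print(r.zrange("scores", 0, -1, withscores=True))
```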

In the coming weeks I'm going to practice commands and admin techniques using the following resources.

Redis Cookbook
by Tiago Macedo, Fred Oliveira

Other interesting sources

Things I learned this week about MongoDB:

  • Duplicated data and a denormalized schema are good for speed, especially if your data doesn't change often.
  • Allocating more space for objects on disk is hard, even for MongoDB. Do it early.
  • Preprocessing your data and creating indexes to speed up the queries you actually run is a damn good idea (see the sketch after this list).
  • The query order (with AND and OR operators) is really important.
  • Use journaling and replication to keep your data safe.
  • Use JavaScript to define your own functions.
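
As a minimal illustration of the indexing tip, here is a pymongo sketch; it assumes a local mongod, and the database, collection and field names are illustrative.

```python
# Sketch: pre-creating an index that matches a query you actually run
# (assumes pymongo and a local mongod; names are illustrative)
from pymongo import MongoClient, ASCENDING, DESCENDING

events = MongoClient()["mydb"]["events"]

# Compound index matching the query below
events.create_index([("user_id", ASCENDING), ("created_at", DESCENDING)])

cursor = (events.find({"user_id": 42})
                .sort("created_at", DESCENDING)
                .limit(10))
for doc in cursor:
    print(doc)
```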

50 Tips and Tricks for MongoDB Developers
by Kristina Chodorow

Code readability is one of the most undervalued skills in programming.
Writing good code is not easy; writing readable code is fucking hard.

This is one of the best books about code readability, with dozens of useful tips & tricks.

The Art of Readable Code 
by Dustin Boswell and Trevor Foucher

Over the last few years I have developed projects containing up to hundreds of millions of objects. Now I need to move ahead and scale up to several billions of objects, approaching the boundary of the "big data" definition. The common implementation of the relational model, which I have always used, isn't enough anymore.

We know that a standard single-machine instance of MySQL (which every web developer has used at least once) shows its limits above 100 million rows. I need to scale horizontally, and I also need more specific features to easily manage a huge amount of data.

This is not a limit of the relational model itself: other implementations (like PostgreSQL or Oracle) can easily scale beyond that. Unfortunately, many operations you usually run on your data (like joins and set operations) aren't so fast with billions of records. I need something else.

So-called "NoSQL databases" offer more data models (document-oriented, columnar, key-value, graph and more) in which you can store your data more efficiently. They also offer features like sharding, replication, caching and indexing out of the box.

I'm not a NoSQL expert, so I can't advise you whether choosing one DBMS over another is a good idea or not. I'm entering this world just now, like many other developers, but I think polyglot persistence is the future. Storing your data in more than one DBMS to fit your requirements, taking advantage of the features of each one, is a smart choice.

Big data and polyglot persistence are interesting topics. I found some interesting books about them; they can be a high-quality introduction.

Seven Databases in Seven Weeks
by Eric Redmond and Jim R. Wilson

Contains an overview of different kinds of data models, with a real-world example for each one: PostgreSQL (RDBMS), Riak and Redis (Key-Value), HBase (Column-oriented), MongoDB and CouchDB (Document-oriented) and Neo4j (Graph).

NoSQL Distilled
by Pramod J. Sadalage and Martin Fowler

Similarly to the previous one, this book starts with an overview of the NoSQL world. The first part analyzes how different systems implement key features: data modeling, distribution (to scale horizontally) and replication (to keep data safe and analyzable).

The second part focuses on each type of DBMS and analyzes how they implement the concepts presented in the first part.

Big Data Glossary
by Pete Warden

Big data is more than persistence. There are many other operations you can perform on your data and many ways to analyze the results. If you aren't familiar with concepts like MapReduce, Natural Language Processing and Machine Learning, this book explains the basics.

The first 5 chapters are about storing big data; the other 6 are about processing and refining data, with a focus on very specific topics.