After GFS and MapReduce, Google tackled big data problems once again by designing BigTable: a compressed, high-performance, proprietary data storage system that forms the basis for most of its projects. HBase and Cassandra were inspired by it.



Title: Bigtable: A Distributed Storage System for Structured Data (PDF), November 2006
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
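To make the "simple data model" a bit more concrete: the paper describes a Bigtable as a sparse, sorted map indexed by row key, column key, and timestamp, where each value is an uninterpreted array of bytes. The snippet below is only a toy, single-process sketch of that idea. The class name `ToyTable` and its `put`/`get`/`scan` methods are made up for illustration and are not Bigtable's (or HBase's) API.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory stand-in for a (row, column, timestamp) -> bytes map.

    Hypothetical names; nothing here is distributed or persistent.
    """

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp`
        (the newest overall if `timestamp` is None), or None if absent."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, end_row):
        """Yield (row, column, newest value) for start_row <= row < end_row,
        in sorted row order, which is what makes range scans cheap."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)


# Row keys are reversed host names, so pages of one domain sort together.
table = ToyTable()
table.put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
table.put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 1, b"CNN")

latest = table.get("com.cnn.www", "contents:")
older = table.get("com.cnn.www", "contents:", timestamp=1)
domain_cells = list(table.scan("com.", "com.z"))
```

Keeping several timestamped versions per cell and sorting by row key are the two properties that let clients control layout: related rows end up next to each other, and old versions stay readable.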


Check out the list of interesting papers and projects (Github).

This is “the paper” that started everything in the big data world. In 2003 Google already had problems, in terms of data size and availability, that most of us still don’t have. It developed a proprietary distributed filesystem called GFS. A couple of years later Yahoo created HDFS, the distributed filesystem of the Hadoop framework, inspired by this paper. As Hadoop co-creator Doug Cutting (@cutting) put it: “Google is living a few years in the future and sending the rest of us messages”.



Title: The Google File System (PDF), October 2003
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.




I always underestimated Google’s contribution to the evolution of big-data processing. I used to think that Google only managed and displayed some search results. Not that much data. Not as much as Facebook or Twitter, at least…

Obviously I was wrong. Google has to manage a HUGE amount of data, and big-data processing was already a problem in 2002! Its contribution to current processing technologies such as Hadoop, its filesystem HDFS, and HBase was fundamental.

We can split this contribution into two periods. The first (from 2003 to 2008) influenced the technologies we are using today. The second (from 2009 until today) is influencing the products we are going to use in the near future.

The first period gave us:

  • GFS, the Google File System (PDF paper), a scalable distributed file system for large distributed data-intensive applications, which later inspired HDFS
  • BigTable (PDF paper), a column-oriented database designed to store petabytes of data across large clusters, which later inspired HBase
  • the concept of MapReduce (PDF paper), a programming model for processing large datasets distributed across large clusters. Hadoop implements this programming model on top of HDFS or similar filesystems.
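To give a feel for the MapReduce programming model mentioned above, here is a minimal single-process sketch of the classic word-count example. It assumes everything fits in memory and runs on one machine, so it shows only the shape of the model (map, group by key, reduce), not the distribution or fault tolerance that make it interesting. The function names are illustrative and are not the Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "big papers"])
```

In a real cluster the map calls run in parallel close to the data, and the framework takes care of the shuffle between machines; the user still only writes the two small functions.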

This series of papers revolutionized the strategies behind data warehousing, and now all the largest companies use the products we all know that were inspired by these papers.

The second period is less well known at the moment. Google ran into many limits in its previous infrastructure, tried to fix them, and moved ahead. This effort gave us many other technologies, some of them not yet completely public:

  • Caffeine, a new search infrastructure that uses GFS2, next-generation MapReduce, and next-generation BigTable.
  • Colossus, formerly known as Google FileSystem 2, the next-generation GFS.
  • Spanner (PDF paper), a scalable, multi-version, globally-distributed, and synchronously-replicated database, the NewSQL evolution of BigTable.
  • Dremel (PDF paper), a scalable, near-real-time ad-hoc query system for analysis of read-only nested data, and its implementation for the GAE, BigQuery.
  • Percolator (PDF paper), a platform for incremental processing that continually updates the search index.
  • Pregel (PDF paper), a system for large-scale graph processing, similar to MapReduce for columnar data.

The market is now different from 2002. Many companies, such as Cloudera and MapR, are working hard on big data, and so is the Apache Foundation. Still, Google has a 10-year head start and its technologies are still stunning.

Many of these papers will probably influence the next 10 years. The first results are already here: Apache Drill and Cloudera Impala implement the Dremel paper’s specification, Apache Giraph implements the Pregel one, and HBase Coprocessors the Percolator one.

And these are just some examples; a Google search can show you more 😉

Insights: