This is the first public paper, published by NASA, that mention the word “Big Data“. Actually it’s not related to data processing but is like the beginning for this funny buzz word ūüôā


Title: Application-Controlled Demand Paging for Out-of-Core Visualization (PDF), 1997
Authors: Michael Cox, David Ellsworth


In the area of scientific visualization, input data sets are often very large. In visualization of Computational Fluid Dynamics (CFD) in particular, input data sets today can surpass 100 Gbytes, and are expected to scale with the ability of supercomputers to generate them. Some visualization tools already partition large data sets into segments, and load appropriate segments as they are needed.

However, this does not remove the problem for two reasons: 1) there are data sets for which even the individual segments are too large for the largest graphics workstations, 2) many practitioners do not have access to workstations with the memory capacity required to load even a segment, especially since the state-of-the-art visualization tools tend to be developed by researchers with much more powerful machines. When the size of the data that must be accessed is larger than the size of memory, some form of virtual memory is simply required. This may be by segmentation, paging, or by paged segments.

In this paper we demonstrate that complete reliance on operating system virtual memory for out-of-core visualization leads to egregious performance. We then describe a paged segment system that we have implemented, and explore the principles of memory management that can be employed by the application for out-of-core visualization.

We show that application control over some of these can significantly improve performance. We show that sparse traversal can be exploited by loading only those data actually required. We show also that application control over data loading can be exploited by 1) loading data from alternative storage format (in particular 3-dimensional data stored in subcubes), 2) controlling the page size.

Both of these techniques effectively reduce the total memory required by visualization at run-time. We also describe experiments we have done on remote out-of-core visualization (when pages are read by demand from remote disk) whose results are promising.

Check out the list of interesting papers and projects (Github).

After GFS and MapReduce, Google solve again big data problems designing BigTable: a compressed, high performance, and proprietary data storage system that forms the basis for most of its projects. HBase and Cassandra are inspired from it.


Title: Bigtable: A Distributed Storage System for Structured Data (PDF), November 2006
Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

Check out the list of interesting papers and projects (Github).

For several years I thought MapReduce was the only paradigm for distributed data processing. Only a few month ago, watching “Clash of the Titans Beyond MapReduce” (a really interesting talk at The Hive meetup)¬†Matei Zaharia (@matei_zaharia), CTO of Databricks and co-creator of Spark, cited Dryad, a programming paradigm developed by Microsoft. I don’t know any open project which implement it but was widely used by Microsoft Research.


Title: Dryad: Distributed Data-Parallel Programs from Sequential
Building Blocks (PDF), March 2007
Authors: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly


Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational ‚Äúvertices‚ÄĚ with communication ‚Äúchannels‚ÄĚ to form a dataflow graph.

Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources.

Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.

Check out the list of interesting papers and projects (Github).

Hadoop would not be real without this paper. MapReduce is the most famous and still most used processing paradigm for big data. It is not suitable for everything and there are several improvements (Dryad, Spark, …) but Google, Facebook, Twitter and many other has million rows of code deployed into their systems.

google_logoTitle: MapReduce: Simplified Data Processing on Large Clusters (PDF), December 2004
Authors: Jeffrey Dean and Sanjay Ghemawat

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

Check out the list of interesting papers and projects (Github).

This is “the paper”¬†which started everything in big data environment. During 2003 Google already had problems most of our still haven’t in terms of size and availability of data. They developed a proprietary distributed filesystem called GFS. After a couple of years Yahoo creates HDFS, the distributed filesystem, part of Hadoop framework inspire by this paper. As The Hadoop co-creator Doug Cutting (@cutting): “Google is living a few years in the future and sending the rest of us messages”.


Title: The Google File System (PDF), October 2003
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Check out the list of interesting papers and projects (Github).