Researchers are always publishing new papers about data processing and scalable architectures. Here is an incomplete but useful list of relevant papers published in the VLDB/big-data community, starting from 1997 with the NASA paper that mentioned the term “big-data” for the first time.
Papers of 2016
- 2016 – Learning to Simplify: Fully Convolutional Networks for Rough Sketch Cleanup
- 2016 – Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples
- 2016 – Understanding Deep Convolutional Networks
Papers of 2015
- 2015 – A Neural Algorithm of Artistic Style
- 2015 – Deep Image: Scaling up Image Recognition
- 2015 – Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- 2015 – Deep Speech: Scaling up end-to-end speech recognition
- 2015 – Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
- 2015 – G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
- 2015 – Giraffe: Using Deep Reinforcement Learning to Play Chess
- 2015 – Hidden Technical Debt in Machine Learning Systems
- 2015 – Klout Score: Measuring Influence Across Multiple Social Networks
- 2015 – Large-scale cluster management at Google with Borg
- 2015 – Machine Learning Classification over Encrypted Data
- 2015 – Machine Learning Methods for Computer Security
- 2015 – Neural Networks with Few Multiplications
- 2015 – Self-Repairing Disk Arrays
- 2015 – Spark SQL: Relational Data Processing in Spark
- 2015 – SparkNet: Training Deep Networks in Spark
- 2015 – Succinct: Enabling Queries on Compressed Data
- 2015 – Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms
- 2015 – The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox
- 2015 – Trill: A High-Performance Incremental Query Processor for Diverse Analytics
- 2015 – Twitter Heron: Stream Processing at Scale
Papers of 2014
- 2014 – 3D Object Manipulation in a Single Photograph using Stock 3D Models
- 2014 – A Partitioning Framework for Aggressive Data Skipping
- 2014 – A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data
- 2014 – A Self-Configurable Geo-Replicated Cloud Storage System
- 2014 – All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
- 2014 – Arrakis: The Operating System is the Control Plane
- 2014 – Automatic Construction of Inference-Supporting Knowledge Bases
- 2014 – Bayesian group latent factor analysis with structured sparse priors
- 2014 – Chinese Open Relation Extraction for Knowledge Acquisition
- 2014 – Coordination Avoidance in Database Systems
- 2014 – DeepFace: Closing the Gap to Human-Level Performance in Face Verification
- 2014 – Diagram Understanding in Geometry Questions
- 2014 – Discourse Complements Lexical Semantics for Non-factoid Answer Reranking
- 2014 – Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- 2014 – Eidetic Systems
- 2014 – Execution Primitives for Scalable Joins and Aggregations in Map Reduce
- 2014 – Extracting More Concurrency from Distributed Transactions
- 2014 – f4: Facebook’s Warm BLOB Storage System
- 2014 – Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
- 2014 – Fastpass: A Centralized “Zero-Queue” Datacenter Network
- 2014 – First-person Hyper-lapse Videos
- 2014 – GloVe: Global Vectors for Word Representation
- 2014 – GraphX: Graph Processing in a Distributed Dataflow Framework
- 2014 – Guess Who Rated This Movie: Identifying Users Through Subspace Clustering
- 2014 – In Search of an Understandable Consensus Algorithm
- 2014 – Learning Everything about Anything: Webly-Supervised Visual Concept Learning
- 2014 – Learning to Solve Arithmetic Word Problems with Verb Categorization
- 2014 – Log-structured Memory for DRAM-based Storage
- 2014 – Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
- 2014 – MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs
- 2014 – Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing
- 2014 – Modeling Biological Processes for Reading Comprehension
- 2014 – Orca: A Modular Query Optimizer Architecture for Big Data
- 2014 – Pigeon: A Spatial MapReduce Language
- 2014 – Project Adam: Building an Efficient and Scalable Deep Learning Training System
- 2014 – Quantum Deep Learning
- 2014 – R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
- 2014 – Salt: Combining ACID and BASE in a Distributed Database
- 2014 – Scalable Object Detection using Deep Neural Networks
- 2014 – Sequence to Sequence Learning with Neural Networks
- 2014 – Show and Tell: A Neural Image Caption Generator
- 2014 – Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
- 2014 – The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services
- 2014 – The Trill Incremental Analytics Engine
Papers of 2013
- 2013 – A Demonstration of SpatialHadoop: An Efficient MapReduce Framework for Spatial Data
- 2013 – A Lightweight and High Performance Monolingual Word Aligner
- 2013 – Answer Extraction as Sequence Tagging with Tree Edit Distance
- 2013 – Automatic Coupling of Answer Extraction and Information Retrieval
- 2013 – CG_Hadoop: Computational Geometry in MapReduce
- 2013 – Consistency-Based Service Level Agreements for Cloud Storage
- 2013 – Dimension Independent Matrix Square using MapReduce
- 2013 – Druid: A Real-time Analytical Data Store
- 2013 – Efficient Estimation of Word Representations in Vector Space
- 2013 – Event labeling combining ensemble detectors and background knowledge
- 2013 – Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
- 2013 – F1: A Distributed SQL Database That Scales
- 2013 – Fast Training of Convolutional Networks through FFTs
- 2013 – GraphX: A Resilient Distributed Graph System on Spark
- 2013 – HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm
- 2013 – MillWheel: Fault-Tolerant Stream Processing at Internet Scale
- 2013 – MLbase: A Distributed Machine-learning System
- 2013 – Naiad: A Timely Dataflow System
- 2013 – Omega: flexible, scalable schedulers for large compute clusters
- 2013 – Online, Asynchronous Schema Change in F1
- 2013 – Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
- 2013 – Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- 2013 – Rich feature hierarchies for accurate object detection and semantic segmentation
- 2013 – Scalable Progressive Analytics on Big Data in the Cloud
- 2013 – Scaling Memcache at Facebook
- 2013 – Scuba: Diving into Data at Facebook
- 2013 – Semi-Markov Phrase-based Monolingual Alignment
- 2013 – Shark: SQL and Rich Analytics at Scale
- 2013 – Some Improvements on Deep Convolutional Neural Network Based Image Classification
- 2013 – Sparrow: Distributed, Low Latency Scheduling
- 2013 – Sparrow: Scalable Scheduling for Sub-Second Parallel Jobs
- 2013 – TAO: Facebook’s Distributed Data Store for the Social Graph
- 2013 – Toward Common Patterns for Distributed, Concurrent, Fault-Tolerant Code
- 2013 – Unicorn: A System for Searching the Social Graph
- 2013 – Warp: Lightweight Multi-Key Transactions for Key-Value Stores
Papers of 2012
- 2012 – A Few Useful Things to Know about Machine Learning
- 2012 – A Sublinear Time Algorithm for PageRank Computations
- 2012 – Avatara: OLAP for Web-scale Analytics Products
- 2012 – Blink and It’s Done. Interactive Queries on Very Large Data
- 2012 – BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
- 2012 – Building high-level features using large scale unsupervised learning
- 2012 – Dimension Independent Similarity Computation
- 2012 – Earlybird: Real-Time Search at Twitter
- 2012 – Fast and Interactive Analytics over Hadoop Data with Spark
- 2012 – HyperDex: A Distributed, Searchable Key-Value Store
- 2012 – ImageNet Classification with Deep Convolutional Neural Networks
- 2012 – Large Scale Distributed Deep Networks
- 2012 – Large-Scale Machine Learning at Twitter
- 2012 – Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
- 2012 – Paxos Made Parallel
- 2012 – Paxos Replicated State Machines as the Basis of a High-Performance Data Store
- 2012 – Perspectives on the CAP Theorem
- 2012 – Processing a Trillion Cells per Mouse Click
- 2012 – Shark: Fast Data Analysis Using Coarse-grained Distributed Memory
- 2012 – Spanner: Google’s Globally-Distributed Database
- 2012 – Temporal Analytics on Big Data for Web Advertising
- 2012 – The Unified Logging Infrastructure for Data Analytics at Twitter
- 2012 – The Vertica Analytic Database: C-Store 7 Years Later
Papers of 2011
- 2011 – Consistency, Availability, and Convergence
- 2011 – CrowdDB: Answering Queries with Crowdsourcing
- 2011 – CrowdDB: Query Processing with the VLDB Crowd
- 2011 – Fast Crash Recovery in RAMCloud
- 2011 – Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- 2011 – It’s Time for Low Latency
- 2011 – Matching Unstructured Product Offers to Structured Product Specifications
- 2011 – Megastore: Providing Scalable, Highly Available Storage for Interactive Services
- 2011 – Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- 2011 – Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters
Papers of 2010
- 2010 – A Method of Automated Nonparametric Content Analysis for Social Science
- 2010 – Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
- 2010 – Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
- 2010 – Dremel: Interactive Analysis of Web-Scale Datasets
- 2010 – Finding a needle in Haystack: Facebook’s photo storage
- 2010 – FlumeJava: Easy, Efficient Data-Parallel Pipelines
- 2010 – Large-scale Incremental Processing Using Distributed Transactions and Notifications
- 2010 – Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- 2010 – Pregel: A System for Large-Scale Graph Processing
- 2010 – S4: Distributed Stream Computing Platform
- 2010 – Spark: Cluster Computing with Working Sets
- 2010 – The Learning Behind Gmail Priority Inbox
- 2010 – ZooKeeper: Wait-free coordination for Internet-scale systems
Papers of 2009
- 2009 – Cassandra – A Decentralized Structured Storage System
- 2009 – Feature Hashing for Large Scale Multitask Learning
- 2009 – HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
- 2009 – Vertical Paxos and Primary-Backup Replication
Papers of 2008
- 2008 – Chukwa: A large-scale monitoring system
- 2008 – Column-Stores vs. Row-Stores: How Different Are They Really?
- 2008 – PNUTS: Yahoo!’s Hosted Data Serving Platform
- 2008 – Top 10 algorithms in data mining
Papers of 2007
- 2007 – Architecture of a Database System
- 2007 – Consistent Streaming Through Time: A Vision for Event Stream Processing
- 2007 – Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- 2007 – Dynamo: Amazon’s Highly Available Key-value Store
- 2007 – Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
- 2007 – Life beyond Distributed Transactions: an Apostate’s Opinion
- 2007 – Paxos Made Live – An Engineering Perspective
Papers of 2006
- 2006 – Bigtable: A Distributed Storage System for Structured Data
- 2006 – Ceph: A Scalable, High-Performance Distributed File System
- 2006 – Map-Reduce for Machine Learning on Multicore
- 2006 – The Chubby lock service for loosely-coupled distributed systems
Papers of 2005
- 2005 – Fast Paxos
Papers of 2004
Papers of 2003
Papers of 2002
- 2002 – Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
Papers of 2001
- 2001 – Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
- 2001 – Paxos Made Simple
- 2001 – Random Forests
Papers of 1999
- 1999 – Pasting Small Votes for Classification in Large Databases and On-Line
- 1999 – The PageRank Citation Ranking: Bringing Order to the Web
Papers of 1997
If you like this list you are probably interested in my list of big-data related projects.