Competence: Data Science Workflow
Level: Foundation
“Deriving truth and insight from a pile of data is a powerful but error-prone job. The best data analysts and data-minded engineers develop a reputation for making credible pronouncements from data. But what are they doing that gives them credibility? I often hear adjectives like careful and methodical, but what do the most careful and methodical analysts actually do?
This is not a trivial question, especially given the type of data that we regularly gather at Google. Not only do we typically work with very large data sets, but those data sets are extremely rich: each row of data typically has many, many attributes. When you combine this with the temporal sequences of events for a given user, there is an enormous number of ways of looking at the data. Contrast this with a typical academic psychology experiment, where it’s trivial for the researcher to look at every single data point. The problems posed by our large, high-dimensional data sets are very different from those encountered throughout most of the history of scientific work.
This document summarizes the ideas and techniques that careful, methodical analysts use on large, high-dimensional data sets. Although this document focuses on data from logs and experimental analysis, many of these techniques are more widely applicable.”