Will We Ever Have Clean Data?

Metadata

Author: Benn Stancil
Full Title: Will We Ever Have Clean Data?
URL: https://benn.substack.com/p/will-we-ever-have-clean-data

Highlights

SDF is a compiler and build system that leverages static analysis to comprehensively examine SQL code at warehouse scale. By considering all queries in any dialect simultaneously, SDF builds a rich dependency graph and provides a holistic view of your data assets, empowering you to uncover problems proactively and optimize your data infrastructure like never before. (View Highlight)
The standout feature of SDF is its ability to annotate your SQL sources with rich metadata and reason about them together. SDF metadata can range from simple types and classifiers (PII) to table visibility and privacy policies (Anonymize). When SDF performs its static analysis, it takes this metadata into account, propagates it throughout your SQL sources with Information Flow Theory, and enforces built-in and user-defined rules. We call these Checks. Here are some simple examples of powerful SDF Checks: (View Highlight)
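The highlight stops before the examples it promises, and they are not reproduced here. As a rough illustration of the idea it describes (labels attached to source columns, propagated through the dependency graph, then checked against a rule), here is a minimal Python sketch. It is not SDF syntax or SDF's API: the lineage map, the table names, and the functions `propagate_labels` and `check_no_pii_in_public` are all hypothetical, and a real tool would derive lineage from the SQL itself rather than from a hand-written dictionary.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage: each derived column maps to the columns it reads.
# In a real system this would come from static analysis of the SQL, not be hand-written.
LINEAGE = {
    "staging.users.email_clean": ["raw.users.email"],
    "staging.users.signup_day": ["raw.users.created_at"],
    "marts.contacts.contact": ["staging.users.email_clean"],
    "marts.signups.day": ["staging.users.signup_day"],
}

# Hypothetical annotations on source columns.
SOURCE_LABELS = {"raw.users.email": {"PII"}}

# Hypothetical visibility metadata: tables that anyone may query.
PUBLIC_TABLES = {"marts.contacts"}


def propagate_labels(lineage, source_labels):
    """Push labels from annotated columns to every column derived from them."""
    labels = defaultdict(set)
    for column, tags in source_labels.items():
        labels[column] |= tags

    # Invert the lineage so we can walk from a column to its consumers.
    downstream = defaultdict(list)
    for column, parents in lineage.items():
        for parent in parents:
            downstream[parent].append(column)

    queue = deque(source_labels)
    while queue:
        column = queue.popleft()
        for child in downstream[column]:
            size_before = len(labels[child])
            labels[child] |= labels[column]
            # Revisit the child's consumers only if it gained a new label.
            if len(labels[child]) > size_before:
                queue.append(child)
    return dict(labels)


def check_no_pii_in_public(labels, public_tables):
    """Return the public columns that carry a PII label, sorted by name."""
    violations = []
    for column, tags in labels.items():
        table = column.rsplit(".", 1)[0]
        if table in public_tables and "PII" in tags:
            violations.append(column)
    return sorted(violations)


if __name__ == "__main__":
    labels = propagate_labels(LINEAGE, SOURCE_LABELS)
    print(check_no_pii_in_public(labels, PUBLIC_TABLES))
```

In this toy graph the `PII` label on `raw.users.email` reaches `marts.contacts.contact` two hops later, so the check flags that column even though nobody annotated it directly. That inherited label is the kind of result the highlight attributes to propagating metadata before enforcing rules.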
Because no company is the same, measuring a business, as was the case for us when we were measuring our win rates, is a creative process. Inevitably, even the best laid reporting plans give way to a lot of exploratory messes. Each potential metric produces a bunch of analyses to assess it; each analysis produces more questions and ad hoc offshoots. Multiply this by all the metrics and dashboards on your blueprint, and complicate it by constantly shifting the business underneath it, and the development process looks less like an organized construction site and more like an artist’s studio or a writer’s desk. (View Highlight)
This dynamic actively works against a lot of our existing data quality tools. Those tools typically encourage a slow march towards stability—over time, data teams should gradually add more models, tests, policies, and contracts. But data and the things people create with it are often more dynamic than that. (View Highlight)
A high-quality dataset is one that is consistent with the business concept it represents. Excluding the datasets that sit behind legally-defined financial metrics, those business concepts are often fluid. New OKRs get spun up every quarter; projects take off or wind down; new data sources become urgent requirements as business initiatives change. Data teams have to absorb every change from every department they serve. (View Highlight)
In this sense, dbt models can be akin to functions in a piece of software. You want them to be short and non-repeating. To judge a dbt project by how many models it has is like judging a program by how many functions it has: More isn’t better, but fewer isn’t better either. It’s about what the functions do, not how many there are. (View Highlight)
