Beyond Data Modeling

rw-book-cover

Metadata

Author: David Jayatillake
Full Title: Beyond Data Modeling
URL: https://davidsj.substack.com/p/beyond-data-modeling

Highlights

Why do we need a semantic layer? If we have a great data model built in dbt core (or otherwise), isn’t this enough to provide data access? (View Highlight)
Data modeling is the process of creating a diagram that represents your data system and defines the structure, attributes, and relationships of your data entities. Data modeling organizes and simplifies your data in a way that makes it easy to understand, manage, and query, while also ensuring data integrity and consistency. Data models inform your data architecture, database design, and restructuring legacy systems. (View Highlight)
The quote here is from the Qlik link the image above is also taken from - it talks about data modeling being a visual process. I don’t feel this is necessarily true. You can easily think of it as a code-first exercise, which then generates a visual artefact instead of the other way round. However, the end result is the same and I agree with the rest of the quote. (View Highlight)
In the quote above, it talks about making data “easy to query”. I feel this is talking about data analysts as the people who would be doing the querying. Yes, good data modeling massively simplifies the SQL a data analyst would have to write to answer a question or produce the input for a graph/data table, but they still have to write SQL. (View Highlight)
A data model describes how to join tables together, but this still requires an analyst to interpret the diagram and write the corresponding join, based on a request. Depending on the direction of the join for the query, a different join has to be written. For example, if you were counting the number of order lines by order month in the data model above, you might write a query like this: (View Highlight)
Organisations increasingly offer data products on top of their data models. These products are regularly extended and changed to meet new customer needs and wants. Having extra overhead in development time and higher risk in deployment, due to lack of consistency and governance in semantic definition, is unacceptable. (View Highlight)
Reliable retrieval needs to be independent of being able to write SQL. If you think that a well-documented data model is a kind of knowledge graph in terms of its metadata, you can still only access it by writing SQL manually to the data warehouse, using the knowledge graph as context. All semantic layers are knowledge graphs, but not all knowledge graphs are semantic layers. It would be possible in a knowledge graph to say how to calculate any given metric, define an entity or derive a dimension on top of a data model. The key addition the semantic layer offers is the compiler and interface/s to it. (View Highlight)
As the semantic layer can compile a more simple request into complex but consistent raw SQL, it can offer much more constrained interfaces than SQL, like a REST API. The compiler abstracts away the complexity, and risk of inconsistency, that a human writing SQL drives. (View Highlight)
Many engineers who build customer-facing applications aren’t great at writing SQL - they are much more used to using interfaces like REST to get the data they need. Simply asking them to write raw SQL, which then has to live in their applications and services, results in slower development time and higher risk of bugs being shipped. (View Highlight)
Semantic layers provide all of the context benefits of a knowledge graph and additionally provide the constrained interface and compiler to request data and execute queries - abstracting this from the requestor. The requestor need only know what the semantic layer contains and how to ask for it in a simple format - really a list of objects and possible associated values for filtering. (View Highlight)
Semantic layers are in prod, people know what they contain and serve. Where they are minimal and well-defined, the entities they contain are clear and obvious. This is the best place to enforce security policies. I have written about this recently, so won’t spend long here. I’m also interested in how this relates to ABAC and will co-author a post soon in that regard. (View Highlight)
TLDR; security policies are harder, slower and therefore more expensive and risky to maintain at the data warehouse level than at the semantic layer. (View Highlight)
The consistency of raw SQL generated by the semantic layer compiler mentioned above, means that the probability of hitting your data warehouse cache is much higher with a semantic layer, than by using a data model alone. It’s too easy for humans writing SQL on your data model to write it in slightly different ways for the same goal. Where the data warehouse cache is missed, then the query is effectively re-run at high cost and poor latency. (View Highlight)
Semantic layers can also allow for known query patterns to be pre-aggregated into their cache at regular intervals (eg after transformation job runs), offering low latency for expected incoming queries. Their caches can also be more sophisticated than data warehouse caches, providing more functionality to increase cache hit ratio and allow for partial hits, where only additional data is pulled and added to the cached data. Cloud data warehouse vendors and providers are disincentivised to invest much in the sophistication of their caching, as it reduces their revenue, which is usually based on compute usage. (View Highlight)
If you’ve invested the time to architect a data model, chances are it’s for an important use case you want to put into production. Oftentimes, this data model will need to be accessed with a high volume of requests that need to be served with low latency and low cost per request. Many companies now offer data to their customers as a data product in this way. Optimising cost and latency is essential here. The total cost of ownership of a semantic layer can easily be net negative when all other costs and customer experience improvements are considered. (View Highlight)
As you know if you’ve been reading this blog a while, this is something I’m well placed to answer! I think there are a few aspects to this: • Consistency and governance • Interface • Security • Performance and cost (View Highlight)

Pelayo Arbués

Explorer

Recent Notes

2025 Reading Wrapped

Self-proclaimed experts

My failure resume

Beyond Data Modeling

Metadata

Highlights

Graph View

Table of Contents

Now Reading

Beyond Data Modeling