Espresso #3: Data quality, unbundle to rebundle, and navigating data content
Make yourself an espresso and join me for a short break on a Wednesday afternoon ☕
Hello fellow data enthusiasts,
In this edition, we will talk about data quality, the future of orchestration in a fragmented modern data stack, and how to navigate the endless stream of data content. So without further ado, let’s talk data engineering while the espresso is still hot.
Having trust issues with your data?
A statement I keep running into online, in various forms, is that “you can’t create value with ML if you don’t have good data to begin with” (i.e., garbage in, garbage out).
I personally believe that neglecting data quality and building data products without first ensuring robust data validation is one of the key issues that eventually lead to failed data initiatives. But why has data quality only gained attention so recently?
For many years (think 2010 to 2016), data engineers were building pipelines without software engineering best practices - the goal was to deliver as much data as possible, as fast as possible. This didn’t cause any immediate issues at the time, because “big data” was still a secondary factor for decision-making at most companies. But now that data is a first-class citizen everywhere and metrics are consumed by a wide range of users and teams, we frequently find ourselves trying to understand why two dashboards present different values for the same metric, or how a failure in one pipeline would impact our end users. We’re now paying off the debt we accrued by building data pipelines without data quality in mind. So how should you tackle data trust issues?
Within the Modern Data Stack, you’ll find dedicated companies that concentrate on solving this particular issue. While this may be the way to go if you don’t actually have any data engineers within your company (you should, though!), I find that it adds unnecessary complexity and cost for what is, in most cases, a rather simple problem.
Data quality, at its core, can be achieved by being able to answer three main questions:
What types of checks do you want to implement for your pipelines? (schema checks, data checks, etc.)
Which operations should you implement to ensure that your data meets the standards defined by answering the first question? (row count, handling nulls, deduplicating the data, recasting fields, ensuring default values, etc.)
What action should happen when one of your checks fails? (raising a warning or an error based on the type of the check and the severity of the issue, sending alerts, etc.)
For example, if your architecture relies on Spark-based pipelines, you can implement the checks and their associated actions as Spark-based nodes within your orchestrated pipelines, with the aim of running these checks as close to the source as possible (to minimize the impact on downstream processes). Or better yet, you can leverage a tool like Soda SQL or Great Expectations - open-source projects that simplify your data quality checks (you’ll have plenty of pre-defined checks out of the box) without adding much complexity to your stack.
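To make this concrete, here’s a minimal sketch of what such a Spark-based check node could look like. The dataset, column names, and alerting hook are hypothetical placeholders, and the severities are just one possible convention - it simply walks through the three questions above: which checks to run (schema and key-column checks), which operations enforce the standards (deduplication, null handling, default values), and what happens when a check fails (a warning-level alert versus a blocking error).

```python
# A minimal sketch of a Spark-based data quality node.
# Column names, the input path, and the alerting hook are hypothetical;
# adapt the checks and severities to your own standards.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


class DataQualityError(Exception):
    """Raised when a blocking check fails."""


def send_alert(message: str) -> None:
    # Placeholder: plug in Slack, PagerDuty, email, etc.
    print(f"[data-quality alert] {message}")


def check_orders(df: DataFrame) -> DataFrame:
    # 1. Schema check (blocking): columns used downstream must exist.
    required_columns = {"order_id", "customer_id", "amount"}
    missing = required_columns - set(df.columns)
    if missing:
        send_alert(f"Missing columns: {missing}")
        raise DataQualityError(f"Schema check failed: {missing}")

    # 2. Data checks (warning-level): nulls and duplicates on the key column.
    null_keys = df.filter(F.col("order_id").isNull()).count()
    duplicate_keys = df.count() - df.dropDuplicates(["order_id"]).count()
    if null_keys or duplicate_keys:
        send_alert(f"{null_keys} null and {duplicate_keys} duplicate order_ids")

    # 3. Operations: deduplicate, drop null keys, enforce a default value.
    return (
        df.dropDuplicates(["order_id"])
          .filter(F.col("order_id").isNotNull())
          .fillna({"amount": 0.0})
    )


if __name__ == "__main__":
    spark = SparkSession.builder.appName("dq-sketch").getOrCreate()
    orders = spark.read.parquet("s3://my-bucket/raw/orders/")  # hypothetical path
    clean_orders = check_orders(orders)
```

If you’d rather not maintain checks like these by hand, Soda SQL and Great Expectations let you express similar expectations declaratively, which is where the “pre-defined checks out of the box” come in.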
Data quality is a key pillar of a successful data-driven strategy, and yet as a problem, it’s actually not that complex. Adding yet another vendor to your stack solely for data quality is unnecessary in most cases, because there’s no hidden complexity behind the core problem.
Fresh off the press
A couple of weeks ago, the unbundle vs. rebundle debate took Data Twitter by storm.
First, Gorkem Yurtseven published The Unbundling of Airflow on Features & Labels’ blog, arguing that Airflow (and workflow engines in general) is being unbundled into separate tools, each focusing on one part of the data stack, which would eventually make the orchestration engine redundant.
Then, Nick Schrock published Rebundling the Data Platform on Dagster’s blog, countering with the point that the “next thing” wouldn’t be giving up on workflow engines, but instead evolving them so that they can orchestrate software-defined assets - which Dagster now supports.
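For readers who haven’t come across software-defined assets yet, the gist is that you declare the data assets you want to exist and let the orchestrator derive the execution graph from their dependencies. Here’s a minimal, purely illustrative sketch using Dagster’s @asset decorator - the asset names and logic are made up:

```python
# A minimal sketch of software-defined assets with Dagster's @asset decorator.
# Asset names and transformations are illustrative only.
import pandas as pd
from dagster import asset


@asset
def raw_orders() -> pd.DataFrame:
    # In practice this would load from a warehouse, lake, or API.
    return pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, 25.0, 25.0]})


@asset
def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Dagster infers the dependency on raw_orders from the argument name,
    # so the graph is derived from the assets rather than defined as tasks.
    return raw_orders.drop_duplicates(subset=["order_id"])
```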
With the modern data stack being extremely fragmented in its current state, the debate about how to connect all these tools is just getting started.
Out of the comfort zone: navigating the endless stream of data content
A few weeks ago I came across a tweet about how it’s extremely hard to keep up with all the topics and discussions in the data community - and I immediately related to it.

This is an extremely beneficial and healthy environment for the field itself, since it boosts innovation and opens new possibilities every single day - but it can also get very overwhelming for us humans of data.
Personally, I try to stick to two rules when it comes to navigating data content:
Streamline your sources: Many people in the data space try to post content on a daily basis - and while this is done with good intentions, I genuinely believe that as humans we simply can’t deliver thoughtful insights every single day. If you follow many data “influencers” on LinkedIn, for example, you’ll find yourself frequently consuming short pieces of content throughout the day that are written mostly because the writer wants to share something, not because there’s a genuine thought or reflection behind it. Consuming these “bites” of content drains your precious energy and points you in dozens of directions in a single scrolling session. Instead of merely following people who post “daily”, I recommend checking what the community is discussing on Twitter once or twice per day (personally, I usually try to avoid threads), where the 280-character limit forces everyone to get straight to the point - and then you can make your own deep dives into the topics that interest you.
Streamline your thoughts: This second rule is a direct result of the first one and of the fact that data engineering is a vast and ever-evolving field. By limiting your sources and consuming long-form content only when the topic genuinely interests you, you’ll be able to build knowledge in the areas that matter to you. Instead of knowing a bit of everything, the aim should be to know a bit of everything and a lot about a few things. This will allow you to formulate your own thoughts and opinions instead of only consuming content, and will help you get a better understanding of the bigger picture. If you’re interested in the concept of a “metrics layer”, for example, you can spend time going through articles and white papers by major tech companies that built their own internal metrics platforms, to learn the “why”s, the “how”s, and the lessons they learned along the way.
My point is that trying to stay up to date with everything happening within data engineering / the data stack is a futile effort that won’t give you long-term knowledge, whereas focusing on a few areas that interest you and building meaningful and deep knowledge in them is a path towards widening your expertise and potential.
But again, this is the approach that works for me - and it won’t necessarily work for everyone.
A sound to code to
During the last few weeks I’ve been most productive when playing Polo & Pan’s latest album, Cyclorama, with Attrape-rêve as a definite standout.
If you enjoyed this issue of Data Espresso, feel free to recommend the newsletter to the people around you.
Your feedback is also very welcome, and I’d be happy to discuss one of this issue’s topics in detail and hear your thoughts on it.
Stay safe and caffeinated ☕

