Espresso #4: Data orchestration philosophies, and why Coalesce matters

Make yourself an espresso and join me for a short break on a Friday afternoon ☕

Oct 14, 2022

Hello fellow data enthusiasts,

It has been a while since our last espresso, and the data space has evolved a lot in the meantime - but that gives us yet more topics to talk about.

In this edition, we will talk about the philosophies behind two different data orchestration approaches and why Coalesce (dbt Labs’ yearly conference) matters. So without further ado, let’s talk data engineering while the espresso is still hot.

Airflow vs. Dagster: it’s all about philosophy

Since Airbnb open-sourced it back in 2015, Apache Airflow has established itself as the de facto standard for orchestration within the data space. Thanks to its feature-rich User Interface (UI), its ability to manage a wide range of operations, and particularly its no-nonsense and intuitive approach to organizing workflows via DAGs and tasks, it quickly eclipsed existing orchestrators like Spotify’s Luigi (which was open-sourced in 2012) and the Hadoop ecosystem’s Oozie.

Airflow is used today by data engineering teams around the world for an ever-expanding list of use cases, supported by custom operators, in-house abstractions, and a myriad of hacks to leverage some of its aging features. On the other hand, the way we interact with data today is very different from how things were seven years ago:

We no longer just want to run Spark jobs. Instead, we think about the quality, state, and lineage of our heterogeneous data assets.
We can no longer tolerate weeks-long development cycles to generate new data assets. Instead, data pipelines should be written, tested, and deployed as efficiently and as fast as possible.
We can no longer rely on a small centralized data engineering team that builds and maintains all the DAGs. Instead, we aim for self-service capabilities and automation that would allow a larger set of contributors to build data assets and push them to production.
Finally, the data stack we built for the above is moving fast and changing old patterns. We no longer think in terms of tasks. Instead, we think about dbt models, Airbyte connectors, metrics, and a whole ecosystem of capabilities that are the modern incarnation of the Airflow operators we once had to write from scratch back in 2015.

With the above in mind, it’s definitely time to ask the question: Is Airflow still the undisputed go-to orchestrator for data pipelines, or is Dagster, the new orchestrator that’s built with all the previous points in mind, the better option?

In my latest article, published on Restack’s blog, I go into the details of how Dagster differs from Airflow and whether data teams that have already invested a lot in Airflow should make the jump. But most importantly, I explain why “Airflow vs. Dagster” is not a technical question at all - it's just a matter of philosophy.
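To make the difference in philosophy a bit more concrete, here is a toy sketch in plain Python. It deliberately does not use Airflow's or Dagster's actual APIs: the `asset` decorator, the `ASSETS` registry, and all function names are hypothetical and exist only for illustration. The first half is task-centric (you write the steps and wire their order by hand), while the second half is asset-centric (you declare the data you want to exist, and the run order is inferred from what each asset depends on).

```python
import inspect
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# --- Task-centric style: list the steps and wire the order by hand ---
def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 10 for r in rows]

def run_tasks():
    # The pipeline only encodes "run extract, then transform".
    # Which data each step produces stays implicit.
    rows = extract()
    return transform(rows)

# --- Asset-centric style: declare data assets, let the order be inferred ---
ASSETS = {}  # hypothetical registry: asset name -> function that builds it

def asset(fn):
    """Register a function as a named data asset (hypothetical decorator)."""
    ASSETS[fn.__name__] = fn
    return fn

@asset
def raw_orders():
    return [1, 2, 3]

@asset
def cleaned_orders(raw_orders):
    return [r * 10 for r in raw_orders]

@asset
def order_count(cleaned_orders):
    return len(cleaned_orders)

def materialize():
    """Build every registered asset in dependency order."""
    # Assumption of this sketch: each parameter name is the name
    # of an upstream asset.
    deps = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in ASSETS.items()
    }
    values = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {dep: values[dep] for dep in deps[name]}
        values[name] = ASSETS[name](**upstream)
    return values
```

Calling `run_tasks()` executes the two steps in the order you wrote them, whereas `materialize()` returns a dictionary with one entry per asset. In the second style, the dependency graph (and therefore the lineage) falls out of the declarations themselves, which is the shift in mindset the points above describe.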

From binge-reading to binge-watching

I personally believe that the year’s most important Modern Data Stack conference isn’t Databricks’ Data + AI Summit or Snowflake Summit - instead, it’s dbt Labs’ Coalesce. This is not only because dbt is the tool that opened the door to the third wave of data technologies, but also because at Coalesce everyone within the data community belongs.

If you have never attended Coalesce before, I highly recommend doing so this year. You’ll learn quite a lot about the Modern Data Stack and how fellow data practitioners are doing more with third-wave data technologies. You’ll see how fun and engaging the dbt Slack is. You’ll experience how welcoming and diverse the data community is. And most importantly, you’ll feel that you belong - because you do.

Throughout the next couple of weeks, let’s binge-watch Coalesce talks (there are way too many interesting ones) instead of binge-reading data content. And hey - it’s our yearly opportunity to tackle the endless debate: is Kimball's dimensional data modeling approach still relevant? (even though we all know that the only right answer is “mostly yes, but it depends”).

A sound to code to

After watching Cyberpunk: Edgerunners, I’ve been listening to most of its soundtrack on repeat for the past couple of weeks (it’s really that good) - and I can confirm that it offers a productivity boost similar to Night City’s finest cyberware upgrades. My personal favorite is - unsurprisingly - “I Really Want to Stay at Your House”.

If you enjoyed this issue of Data Espresso, feel free to recommend the newsletter to people in your circle.

Your feedback is also very welcome, and I’d be happy to discuss one of this issue’s topics in detail and hear your thoughts on it.

Stay safe and caffeinated ☕

Data Espresso
