Our first Espresso: Open-source, Modern Data Stack, and Range
Make yourself an espresso and join me for a short break on a Wednesday afternoon ☕
Hello fellow data enthusiasts,
First of all, I’m glad that you’re joining me on this ride - hopefully it will be a mutual learning experience. The newsletter is still going through the first cycles of its life, so this is probably not its final format, and certain sections may change. Throughout the first few issues, you’ll essentially watch the newsletter go through adolescence.
As stated in my initial post, I want this newsletter to be more of an exchange of ideas, so you’re definitely encouraged to reply to these emails with your thoughts and feedback.
Let’s talk data engineering
Can open-source survive in the modern data stack?
If we look back at the previous wave of data technologies - the wave of the Hadoop ecosystem, NoSQL databases, and the term “Big Data” - we’ll notice that open-source was the standard. Data technologies were one of the first battlegrounds where open-source won, and that victory was one of the most important factors behind the rapid pace of innovation during the Hadoop era. The whole ecosystem thrived on the open-source model, with multiple teams of engineers from the biggest tech giants contributing to the same projects under the umbrella of open-source foundations.
In contrast, if we look at the ecosystem of the modern data stack, we’ll notice that things are quite different. We’ve taken a step back and started rethinking open-source-based business models. This is the result of multiple factors, one of which is cloud providers offering managed versions of open-source projects, effectively rendering the companies built on top of those projects redundant.
The consequence is either closed-source business models (which are back in style) or stricter licenses along the lines of what Elastic adopted in 2021 - licenses that give open-source-based companies a fighting chance against cloud providers.
With the tweaked licensing and the continued support of the tech communities, open-source remains a viable option for data companies. The best proof of that is Airbyte, which disrupted the data integration space with its open-source approach and recently announced a $150 million Series B funding round at a $1.5 billion valuation (after also opting for a stricter license).
So, yes, open-source will survive (and thrive) in this era, but we may witness a slower rate of innovation due to the fragmented ecosystem and the closed-source model that many modern data stack companies have opted for.
Fresh off the press
Life with dbt
When considering adding a new tool to your stack, nothing is more helpful than feedback from engineering teams already using it. Well, if you’re considering dbt (and you should be), the data engineering team at Devoted published a very insightful piece on their first year using it and on how powerful it can be when integrated with the other components of your stack. (The post was written by Adam Boscarino and Jason Brownstein.)
A look back at 2021
2021 was an eventful year when it comes to the Modern Data Stack, and in an excellent article published on Towards Data Science, Salma Bakouk (co-founder & CEO of Sifflet) goes through the trends that shaped it.
… And a look ahead to 2022
In an article also on Towards Data Science, Prukalpa Sankar (co-founder of Atlan) discusses six ideas that will continue to shape the Modern Data Stack in 2022.
Out of the comfort zone
One of the best books I’ve ever read is Range: Why Generalists Triumph in a Specialized World by David Epstein, for a very simple reason: it proved something I’ve always believed - you should start by gathering knowledge, expertise, and skills horizontally before specializing in a particular subfield.
If you’ve just graduated from university and want to start a career in a technology field (not necessarily data engineering), it might be tempting to jump headfirst into learning that field’s most in-demand technology or tool. While that might indeed help you land your first job, your long-term strategy should be built around widening your knowledge and expertise horizontally as much as possible before making vertical deep dives into a specific technology or trend.
The reasoning behind this is that by first familiarizing yourself with concepts and paradigms, you’ll be able to:
Learn and master the technologies themselves faster
Recognize patterns more easily and know when and how to leverage the right technology for a specific use case
Have the flexibility to switch between technologies without a significant learning curve
For example, if you want to get into data engineering, instead of starting with a deep dive into writing Apache Spark jobs, focus first on concepts like distributed computing, massively parallel processing (MPP), MapReduce, ETL/ELT, and data modeling. This will help you learn Spark in much less time (because you’ll already be comfortable with the concepts behind it), better understand how Spark implements those concepts, and, more importantly, recognize when Spark isn’t the right tool for a given problem. In the long term, this approach will give you one (very important) additional possibility throughout your career:
Yes, you’ll be able to pivot from one technology or field to another without much trouble - and in a field that moves as fast as data engineering, that’s a very valuable asset.
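To make the “concepts before tools” point concrete, here’s a minimal sketch of the MapReduce pattern in plain Python - a toy example of my own, not taken from any of the articles above. Once the map → shuffle → reduce flow clicks, Spark’s flatMap/reduceByKey API reads like a distributed version of the same idea:

    from collections import defaultdict
    from itertools import chain

    # Map phase: turn each input record (here, a line of text) into (key, value) pairs.
    def mapper(line):
        for word in line.split():
            yield (word.lower(), 1)

    # Shuffle phase: group all values by key. Spark does this across machines;
    # here it's a single in-memory dictionary.
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce phase: collapse each key's values into a single result.
    def reducer(key, values):
        return key, sum(values)

    lines = ["the quick brown fox", "the lazy dog", "the quick dog"]
    pairs = chain.from_iterable(mapper(line) for line in lines)
    counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
    print(counts)
    # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}

A rough PySpark equivalent would chain the same three phases as rdd.flatMap(mapper).reduceByKey(lambda a, b: a + b), with the shuffle handled transparently by the cluster - which is exactly why learning the concept first makes the tool feel familiar.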
A great book that can help you understand a wide range of data-related concepts is Designing Data-Intensive Applications by Martin Kleppmann.
A sound to code to
Asking a developer whether they like to listen to music while coding is like asking someone whether they’re a cat person or a dog person: it’s something that feels like a personality tell, but you’re never quite sure what it actually means.
Personally, if I’m not working on a complex topic, I find myself more productive when listening to certain types of music - and the aim of this section is to share with you some of the records that I find the most suitable for coding sessions.
Since this is the first issue of the newsletter, my recommendation is actually the artist I listen to the most when working: Jamie xx (yes, the xx’s discreet DJ). I won’t burden you with unnecessary details - I’d just recommend that you give his 2015 album, In Colour, a listen during your next coding session. It’ll either turn into a dance session or you’ll finish what you’re working on in half the expected time: it’s a win-win.
If you enjoyed this issue of Data Espresso, feel free to recommend the newsletter to the people around you.
Your feedback is also very welcome, and I’d be happy to discuss one of this issue’s topics in detail and hear your thoughts on it.
Stay safe and caffeinated ☕