Espresso #2: Open table formats, real-time at scale, and what happens behind closed doors
Make yourself an espresso and join me for a short break on a Wednesday afternoon ☕
Hello fellow data enthusiasts,
If this is your first time reading Data Espresso, I recommend going through the first two posts (post #0 and post #1) to get familiar with the newsletter’s concept and the motivation behind it.
As mentioned in the previous edition, the newsletter hasn't settled into its final format yet, so some sections may change in future editions - but enough with the introductions, let's talk data engineering while the espresso is still hot.
Why open table formats are a game-changer
One word that characterized the second wave of data systems (the Hadoop ecosystem and co.) is compromise - to have horizontal scalability in storage and compute we had to let go of certain features that were a given in existing systems, like ACID transactions and schema enforcement. And although most of the compromises were necessary (as explained by the CAP theorem for example), some of them weren’t.
The de-facto metadata store of the Hadoop ecosystem, the Hive metastore, is where most of the compromises happened. To manage distributed and partitioned file-based tables, the Hive metastore was designed to be completely disconnected from the data itself - it tracks directories (for both tables and partitions) instead of data files, so most of the features offered by a typical RDBMS are effectively unattainable, since the metastore doesn't directly manage the data files.
This design choice in the Hive table format meant that Hive tables were problematic to manage (since the schema isn’t verified for every data file), hard to trust (since there’s no table transaction log), and also inefficient when they reach a certain scale (since we can’t optimize the queries below the partition level and each query necessitates listing the data files).
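To make the partition-level limitation concrete, here is a minimal sketch (not any real engine's API - the table, paths, and stats are made up) contrasting how a Hive-style planner and a file-tracking table format decide which files to scan. Hive can only narrow the query down to a partition directory and must list and read every file in it, while a format that keeps per-file column statistics (as Iceberg, Delta Lake, and Hudi do) can skip individual files whose min/max ranges can't match the predicate:

```python
# Hive metastore's view of a table: partition name -> directory of files.
# Pruning stops at the partition; every file inside still gets scanned.
hive_partitions = {
    "date=2022-01-01": ["s3://bucket/t/date=2022-01-01/f1.parquet",
                        "s3://bucket/t/date=2022-01-01/f2.parquet"],
    "date=2022-01-02": ["s3://bucket/t/date=2022-01-02/f3.parquet"],
}

# A table format's view: per-file metadata with (min, max) column stats.
file_stats = {
    "s3://bucket/t/date=2022-01-01/f1.parquet": {"user_id": (1, 500)},
    "s3://bucket/t/date=2022-01-01/f2.parquet": {"user_id": (501, 900)},
    "s3://bucket/t/date=2022-01-02/f3.parquet": {"user_id": (100, 300)},
}

def hive_plan(partition):
    # Best case for Hive: the predicate matches a partition key,
    # but all files under that directory must still be listed and read.
    return hive_partitions[partition]

def format_plan(col, value):
    # File-level pruning: skip any file whose (min, max) range for the
    # column can't possibly contain the predicate value.
    return [f for f, stats in file_stats.items()
            if stats[col][0] <= value <= stats[col][1]]

print(hive_plan("date=2022-01-01"))   # scans both files in the partition
print(format_plan("user_id", 250))    # scans only the files whose range covers 250
```

The second plan touches two files out of three across *all* partitions, and the saving grows with the number of files per partition - which is exactly the gap the new table formats close.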
Due to such issues, many companies struggled with file-based tables. Additionally, advancements in distributed query engines and the metadata management space were slowed down by the inefficiencies of the Hive metastore - pushing companies to opt for a scalable data warehouse for their analytics architecture instead of relying on a file-based system.
But that all changed in 2019. In one year, Databricks open-sourced its Delta project under the name Delta Lake, and both Uber and Netflix submitted their own table formats, Hudi and Iceberg respectively, to the Apache Software Foundation.
Each one of these new table formats, in its own way, solved most of the issues related to the Hive metastore with a set of features and key strengths that differentiate it from the other formats - but most importantly, their introduction meant that data lakes are no longer inferior to data warehouses when it comes to table management and query optimization. The data lake is dead, long live the lakehouse!
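The feature all three formats share is a table transaction log: each commit is recorded as an ordered metadata entry, and the table's current state is whatever you get by replaying the log. Here is a toy reconstruction of that idea - the file naming and "add"/"remove" action shapes loosely mimic Delta Lake's `_delta_log` layout, but this is an illustrative sketch, not Delta's actual implementation:

```python
import json
import os
import tempfile

def write_commit(log_dir, version, actions):
    # One commit = one numbered JSON file; zero-padding keeps
    # lexicographic order equal to commit order.
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

def current_files(log_dir):
    # Replay commits in order to reconstruct the live set of data files.
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

log_dir = tempfile.mkdtemp()
# Commit 0: write an initial data file.
write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
# Commit 1: compact/rewrite - add the new file, remove the old one atomically.
write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}},
                          {"remove": {"path": "part-0.parquet"}}])
print(sorted(current_files(log_dir)))  # ['part-1.parquet']
```

Because readers derive the table state from the log rather than from a directory listing, a half-finished write is simply invisible until its commit file lands - that's where the ACID guarantees and time travel that Hive tables lacked come from.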
To familiarize yourself with open table formats, and determine which one suits your needs the most, the following blog posts are a great place to start:
Iceberg:
Metadata Indexing in Iceberg by Ryan Blue (Iceberg creator)
A Short Introduction to Apache Iceberg by Christine Mathiesen (Expedia Group)
Apache Iceberg – An Architectural Look Under the Covers by Jason Hughes
Delta Lake:
Massive Data Processing in Adobe Experience Platform Using Delta Lake by the Adobe Experience Platform team
Engagement Activity Delta Lake by the Salesforce Engineering team
Diving Into Delta Lake: Unpacking The Transaction Log on the Databricks blog
Hudi:
Apache Hudi - The Data Lake Platform on the Apache Hudi blog
Data Lakehouse: Building the Next Generation of Data Lakes using Apache Hudi by Ryan D'Souza & Brandon Stanley
Fresh off the press
The Four Innovation Phases of Netflix’s Trillions Scale Real-time Data Infrastructure
Zhenzhong Xu, who led the Stream Processing Platform team at Netflix, dives deep into the different phases that the Netflix real-time data infrastructure went through, with their respective challenges and learnings.
Data to engineers ratio: A deep dive into 50 top European tech companies
A very interesting analysis by Mikkel Dengsøe, the head of data science at Monzo, in which he compares the data-to-engineers ratio at 50 tech companies across different sectors.
Out of the comfort zone
From the outside, it’s easy to romanticize Silicon Valley and the idea of launching startups that turn into tech giants making the world a better place - but we all know that things don’t happen that way.
If you watched The Social Network then you probably already know that "you don't get to 500 million friends without making a few enemies", and the power struggle we witness in the movie isn't an exception or something specific to Facebook (um, Meta) - it's a story that's all too familiar in Silicon Valley. Some of these stories that happen behind closed doors were documented by award-winning journalists in page-turner books. Below are my recommendations:
Super Pumped: The Battle for Uber by Mike Isaac: this one is my personal favorite because it not only compellingly tells the story behind Uber, but also delves into the dark side of Silicon Valley unicorns and the “hustlin’” that happens behind the scenes.
No Filter: The Inside Story of Instagram by Sarah Frier: even though Kevin Systrom and Mike Krieger (Instagram’s founders) agreed to sell the company to Facebook early on, life didn’t get any easier for them and their small team. Sarah Frier masterfully tells the story of a company within a company, and how the decisions of a small group of engineers can change the world we live in.
An Ugly Truth: Inside Facebook's Battle for Domination by Sheera Frenkel and Cecilia Kang: Yes, it’s Facebook/Meta again, and this time you get to discover the dynamics between Mark Zuckerberg and Sheryl Sandberg, how the Cambridge Analytica scandal unfolded, and what led to the company’s fall from grace.
Bad Blood: Secrets and Lies in a Silicon Valley Startup by John Carreyrou: Last but not least is the story of the unicorn that never was. Theranos was probably one of the biggest scams of the 21st century, and the story of Elizabeth Holmes’ startup is filled with lessons about what can go wrong when aiming to change the world - but most importantly, it shows that most people aren’t as smart as they think they are.
A sound to code to
Bonobo’s new album, Fragments, is all you need for a productive coding session.
If you enjoyed this issue of Data Espresso, feel free to recommend the newsletter to the people around you.
Your feedback is also very welcome, and I’d be happy to discuss one of this issue’s topics in detail and hear your thoughts on it.
Stay safe and caffeinated ☕