Espresso #10: A new ice(berg) age, revisiting old designs, and thriving on constraints
The inevitable Apache Iceberg era, the often-overlooked benefits of frequently revisiting design decisions, and the upsides of working in constraint-heavy environments.
Hello data friends,
This month, we’re diving into the inevitable Apache Iceberg era, the often-overlooked benefits of frequently revisiting design decisions, and the upsides of working in constraint-heavy environments. So, without further ado, let’s talk data engineering while the espresso (or cappuccino if you’re feeling adventurous) is still hot.
How to navigate an Ice(berg) age
Apache Iceberg is all the rage in the data space. After a half-decade showdown with Hudi and Delta Lake, it has emerged as the de facto modern table format, becoming the central piece of the increasingly composable data stack. While there’s been plenty of great content about Iceberg’s ascent, in this edition of Data Espresso, I want to offer my perspective on two questions many data teams are grappling with:
1. “I bet on Hudi or Delta Lake, should I migrate to Iceberg?”
If you built your data platform around Apache Hudi or Delta Lake, you might be wondering how Iceberg’s dominance will affect it in the coming years, and whether a migration is warranted.
Here’s a heuristic I’ve observed: once a data technology gets its own dedicated conference, you know it’s going to dominate the discourse (and feature prominently in data stack diagrams) for at least a few years. This "de facto standard" status translates directly into ecosystem focus. While Hudi and Delta Lake aren’t disappearing overnight, the reality is that new tools and integrations across the data landscape will likely prioritize Iceberg compatibility first.
This shift will result in some friction for Hudi/Delta Lake shops. It’s not necessarily that existing Hudi/Delta integrations will break, but rather that the Iceberg path will likely receive new features, performance enhancements, and smoother integrations more rapidly. You might find yourself waiting longer for support or missing out on optimizations readily available to Iceberg users — and this gap is likely to widen over the next few years. A good historical parallel is the Parquet vs. ORC situation: While ORC remained in use, Parquet consistently had broader and earlier support across the ecosystem, making it the smoother choice for many.
So, does migration make sense for you? I highly recommend a structured evaluation:
Assess the tangible risks of sticking with your current format (e.g., integration limitations, potentially slower adoption of new query engines, developer friction).
Estimate the potential benefits of switching to Iceberg (e.g., access to specific features, performance gains, simplified architecture, broader community support).
Factor in the cost and effort of migration.
Crucially, run a Proof of Concept (POC) with Iceberg on your specific workloads (a minimal setup sketch follows this list).
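To make that last step concrete, here’s a minimal sketch of what a local POC setup could look like with PySpark, assuming you pull in the Iceberg Spark runtime at session start. The catalog name, warehouse path, and table below are illustrative placeholders, not a recommended production configuration:

```python
from pyspark.sql import SparkSession

# Minimal local Iceberg setup for a POC. The package coordinates must match
# your Spark/Scala versions; 3.5_2.12 and Iceberg 1.5.0 are just an example.
spark = (
    SparkSession.builder.appName("iceberg-poc")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # A simple file-based (Hadoop) catalog writing to a local warehouse directory.
    .config("spark.sql.catalog.poc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.poc.type", "hadoop")
    .config("spark.sql.catalog.poc.warehouse", "file:///tmp/iceberg_poc_warehouse")
    .getOrCreate()
)

# Model one of your real workloads as an Iceberg table, partitioned the way
# your queries actually filter.
spark.sql("""
    CREATE TABLE IF NOT EXISTS poc.db.events (
        event_id   BIGINT,
        event_type STRING,
        event_ts   TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Replay a representative slice of data, then compare latency, file layout,
# and compute cost against your current setup.
spark.sql("""
    INSERT INTO poc.db.events
    VALUES (1, 'click', TIMESTAMP '2024-05-01 10:00:00')
""")

# Iceberg's metadata tables are handy for exercising time travel and
# maintenance paths during the POC.
spark.sql("SELECT * FROM poc.db.events.snapshots").show(truncate=False)
```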
It’s also worth noting that Iceberg has rapidly closed the feature gap with Hudi and Delta Lake, thanks to significant contributions from numerous “big tech” companies. A use case where Hudi/Delta might have been the clear winner a couple of years ago might be well-served, or even better served, by Iceberg today.
Migrating now, if the evaluation points that way, means doing it on your terms – defining a timeline that minimizes disruption while positioning your platform to leverage Iceberg’s momentum. However, if your assessment clearly shows minimal risk and limited benefit in switching right now, your engineering resources are likely better invested elsewhere — don’t migrate just for the sake of it.
2. “I’m not currently using a modern table format, should I add Iceberg to my stack?”
This really breaks down into two sub-questions:
Should I add any modern table format to my platform right now? (either Iceberg or Hudi/Delta Lake)
This is the classic "it depends", and the answer needs to be based on the value the tool would bring to your specific use cases. Considering the capabilities of modern table formats (and their increasingly available managed offerings like Iceberg Tables on AWS, GCP BigLake, etc.):
How specifically would it benefit your platform and increase the business value derived from your data?
Can it streamline your architecture, perhaps removing redundant layers or simplifying data pipelines?
Does it unlock new use cases by allowing different compute engines (Spark, Trino, etc.) to seamlessly work with the same data?
Does it solve specific pain points you have today (e.g., schema evolution issues, partition management overhead, concurrent write conflicts)? (See the short sketch after this list.)
If I decide to adopt a modern table format, should it be Iceberg?
In most cases today, the answer is likely yes. Iceberg has achieved feature parity (or near enough) for many common use cases, and its widespread adoption means it’s the safest bet for future-proofing and ensuring broad compatibility across the data ecosystem.
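To make a couple of the pain points above concrete, here’s a hedged sketch continuing the illustrative poc.db.events table from the earlier snippet. Both operations are metadata-only in Iceberg, so no data files get rewritten:

```python
# Continuing the illustrative poc.db.events table and `spark` session from the
# earlier sketch (placeholders, not a real setup).

# Schema evolution: adding a column is a metadata-only change; existing files
# are not rewritten, and rows written before the change read as NULL for it.
spark.sql("ALTER TABLE poc.db.events ADD COLUMN user_id BIGINT")

# Partition evolution: change the partition spec going forward without
# rewriting or re-ingesting historical data.
spark.sql("ALTER TABLE poc.db.events ADD PARTITION FIELD bucket(16, user_id)")

# Concurrent writes are handled with optimistic concurrency at the snapshot
# level, so two writers committing at once don't silently corrupt the table.
```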
Work through these kinds of questions honestly. The goal is to ensure your decision stems from a genuine, explainable need that Iceberg addresses, not just because it’s the technology du jour.
If you decide to adopt Iceberg:
It’s a fantastic technology that genuinely enables a more composable architecture (think true separation of storage and compute). However, be prepared for a journey. Expect some rough edges, especially with tooling maturity around specific engines or complex migration scenarios. Plan for a multi-phase rollout to de-risk the transition and gradually realize the benefits within your specific environment.
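And to show what that composability looks like in practice, here’s a small sketch (again using the illustrative table from the earlier snippets) of reading the same data from a completely different engine via PyIceberg, with no Spark cluster involved. The metadata path is a placeholder and will differ in your warehouse:

```python
# Reading the table Spark wrote, from a different engine entirely (PyIceberg
# into pandas/Arrow). With a catalog service (REST, Glue, Nessie, ...) you'd
# load the table by name; here we point straight at a metadata file for
# simplicity. The exact file name below will differ in your warehouse.
from pyiceberg.table import StaticTable

table = StaticTable.from_metadata(
    "file:///tmp/iceberg_poc_warehouse/db/events/metadata/v1.metadata.json"
)

# Same files on disk, no Spark involved: storage and compute are decoupled.
df = table.scan().to_pandas()
print(df.head())
```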
An exercise that always pays off: Revisiting design decisions
Within the data space, the focus is always on the new cool thing — be it a paradigm like data mesh or a technology like Iceberg. This focus, combined with a natural engineering tendency to build cool things with cool new tech (worrying about business value is never fun, after all), makes it easy to overlook significant improvement opportunities (in cost, maintenance, or performance) that aren’t necessarily "cool."
I recently came across a great article by GumGum detailing their switch from Snowpipe to a Data Lake ingestion pattern using Snowflake External Tables over S3 data. The result? A staggering 60% cut in their data ingestion costs. This might sound counterintuitive, since Snowpipe is often positioned as the go-to for easy and efficient Snowflake ingestion. But for GumGum, the combination of their scale, data structure (already nicely partitioned in S3!), and access patterns meant the Snowpipe approach – which involved copying data into raw, internal Snowflake tables – created unnecessary overhead and cost (think expensive table scans for processing, plus retention of unpartitioned raw data). By switching to querying External Tables directly (and thus leveraging the existing S3 partitioning), they eliminated Snowpipe’s compute costs, avoided data duplication in Snowflake storage, and drastically cut down query times. It’s a prime example of how the ‘best’ approach is highly contextual and definitely not static.
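To give a flavor of the pattern, here’s a hedged sketch of exposing already-partitioned S3 data through a Snowflake external table via the Python connector. This isn’t GumGum’s actual setup; the stage, storage integration, columns, and path parsing are illustrative placeholders you’d adapt to your own layout:

```python
# Rough sketch of the external-table-over-S3 pattern. Requires the
# snowflake-connector-python package; all names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",
    database="RAW",
    schema="EVENTS",
)
cur = conn.cursor()

# An external stage pointing at the S3 prefix where data already lands,
# nicely partitioned by date.
cur.execute("""
    CREATE STAGE IF NOT EXISTS events_stage
    URL = 's3://your-bucket/events/'
    STORAGE_INTEGRATION = your_s3_integration
""")

# An external table exposes those files to queries directly. The partition
# column is derived from the file path (adjust the parsing to your layout),
# so date-filtered queries only scan the relevant S3 prefixes instead of a
# big unpartitioned raw table.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events_ext (
        event_date DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 2), 'YYYY-MM-DD'),
        payload VARIANT AS (VALUE)
    )
    PARTITION BY (event_date)
    LOCATION = @events_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = TRUE
""")
```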
As we chase the cutting edge, the tools and platforms we already use are constantly evolving: New features, pricing models, or even entirely new tools might make previously optimal design patterns suboptimal today. That’s why I believe every data team should schedule regular review sessions (every six months, for example) to revisit past design decisions and identify areas for improvement based on tech progress or changing business needs. Such sessions are a great opportunity to take a hard look at your current architecture:
What are the major cost drivers?
Where are the performance bottlenecks?
What are the biggest maintenance headaches?
Have the capabilities of your core tools (or viable alternatives) changed significantly? Are you leveraging their new features?
Have your business requirements or data volumes shifted?
These deliberate reviews often uncover areas where significant gains can be made, sometimes with surprisingly simple changes.
Out of the comfort zone: The joy of working in a constraint-heavy environment
I recently watched a fantastic talk by John Crepezzi from Jane Street about how they built their AI coding assistant. What struck me was the sheer ingenuity required to operate within their unique context: a codebase predominantly in a niche functional language (OCaml) and an environment lacking some tools many engineers take for granted (for instance, an in-house version control system rather than a Git-based one).
On the surface, this type of environment could frustrate many engineers (myself included), especially if constraints feel arbitrary or lack clear business or technical justification. (As in: "Is this limitation truly necessary, or just historical baggage?")
However, if you get the chance to operate within well-justified, albeit tight, constraints, you might find that it pushes you to elevate your game by having to be creative and solve complex problems. You’re forced to:
Think deeply about the core problem.
Navigate ambiguity with more rigor.
Be incredibly deliberate about architectural choices, minimizing irreversible "one-way door" decisions.
Often find simpler, more fundamental solutions when fancy off-the-shelf options aren’t available or suitable.
I’ve personally worked in a couple of similar high-constraint environments (my investment banking days come to mind - so fun). While the roadblocks were certainly frustrating in the moment, looking back, those experiences were incredibly valuable learning grounds. They force a level of resourcefulness, experimentation, and rapid iteration that you don’t always encounter in less-constrained settings.
Hope you enjoyed this edition of Data Espresso! If you found it useful, feel free to share it with fellow data folks in your network.
I always appreciate hearing your feedback. What’s your take on the Iceberg wave, revisiting design decisions, or working with constraints? Share your thoughts in the comments or reach out directly – I’d love to discuss!
Until next time, stay safe and caffeinated ☕