The Next Era of Data at Instacart

Published in tech-at-instacart, Aug 16, 2023 (8 min read)

Data is an integral part of how we do business at Instacart: for informing decisions, providing insights into how our users interact with the product, supporting ML/AI use, and much more. Over the years, we’ve accumulated a tremendous wealth of data to support all of these activities, but for a long time, deriving value from our data in most of these domains has been increasingly challenging, as the complexity of our business, product, and engineering systems has exploded.

This complexity ultimately produces a tax on using data effectively. Instead, we want to be in a place where everyone at Instacart has simple access to timely, correct, and reliable data — and is fully empowered to self-serve for all of their data needs.

We’ve managed to find many creative ways to circumvent these challenges and use data productively, despite the scale and complexity. But this looked increasingly like the Monty Python sketch: we were building castles in the data swamp, reaching for value from our data and instead standing on the parapets, watching the horizon lurch.

This year, recognizing the importance of addressing these challenges given the criticality of robust data for Instacart’s long-term success, we launched an engineering initiative called “Ground Truth” (or GT). As the name implies, Ground Truth has been all about getting our foundations right so that we can unlock all the ways we’d like to use our data — without getting bogged down by scale and complexity. Throughout the first half of 2023, the Ground Truth team has been hard at work on establishing these foundations with a sweeping set of projects spanning our most critical data tools and systems.

As we lower the curtain on Ground Truth after a successful half, we’re now at the next inflection point in our data journey. We’ve replaced swamp with bedrock, and now it’s time to get down to the real business of castle-building.

Here, we review the key accomplishments of the Ground Truth team, and lay out our strategy for the next era of data at Instacart, with an eye toward how recent developments in AI / LLMs provide us with a unique opportunity to capitalize on our GT investments to provide a much richer self-serve experience with data.

Ground Truth: Instacart’s Modern Data Stack

Over the course of H1 2023, we had a few dozen engineers come alongside the Data Infrastructure team (collectively, the Ground Truth program), working together to make a step-change improvement in Instacart’s data platforms. We’ve overhauled the end-to-end lifecycle of data at Instacart: from instrumentation to extraction to transformation and consumption. Whereas prior to GT, Instacart had a fragmented mess of data tools with different systems in use by each engineering pillar, our post-GT data stack is consolidated around a shared set of tools, described below.

With these tools in place, data producers can now:

Ingest all production database tables to the data warehouse using change data capture (CDC). We now continuously ingest data across thousands of tables from our production systems to Snowflake with this system built using Debezium and Kafka.
Instrument and validate the correctness of events data using the Snowgoose framework to better understand user flows / interactions throughout our product. We previously had several different event frameworks in use, including Snowplow, Segment, and an internal framework called Mongoose. We’ve consolidated these into a single events system, Snowgoose, which now handles billions of events totaling hundreds of terabytes per day across thousands of event types.
Build and maintain data pipelines using dbt + Airflow. We previously had a huge number of different scheduling and orchestration systems, including at least a dozen Airflow instances. We’ve now consolidated all of these on a single dbt + Airflow deployment for all of Instacart’s data pipelines. On dbt alone, we now have thousands of data models built by hundreds of authors across dozens of engineering, DS, and other data teams, comprising both net-new data models and those migrated from legacy systems that have now been deprecated.
Secure access to data and implement fine-grained data access policies with Immuta. By separating policies from individual platforms and leveraging Immuta’s attribute-based access controls, we can dynamically enforce policies that are easy to scale and require far less manual management than static, role-based controls.
Implement data validations and quality checks with Declarative Data Checks (DDC). With DDC, we’ve built an internal company-wide platform for data quality; DDC now has hundreds of critical data validation checks in place, built by teams from every corner of Instacart.
Build sophisticated, low-cost dashboards in Mode Analytics. This year, we launched Mode at Instacart to replace our previous BI tools. In just 6 months, we’ve launched more than 8,000 dashboards used by almost everyone at Instacart. Adopting Mode also reduced our failed dashboard runs by 7x and reduced Snowflake costs by 3x vs. our previous BI tool.
Use the new Annotations framework to document and automate data governance controls on PI and sensitive data. As part of GT, we annotated tens of thousands of columns across dozens of databases. These annotations are used extensively by our data tools to enforce governance everywhere across Instacart.
Support data discovery using our data catalog, Amundsen; certify data as bronze/silver/gold to assist data consumers in dataset selection.
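To make the change data capture item in the list above more concrete, here is a minimal sketch of how a stream of row-level change events can be replayed to keep a replica of a table up to date. It assumes a simplified Debezium-style event envelope with an `op` code (`c` create, `r` snapshot read, `u` update, `d` delete) and `before` / `after` row images. The `apply_change_event` helper, the `id` key, and the sample order rows are hypothetical, and this is an illustration of the idea rather than a description of Instacart’s actual pipeline, which lands data in Snowflake via Kafka.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change_event(table: Dict[Any, Row], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one simplified Debezium-style change event to an in-memory replica.

    `table` maps primary-key values to rows; `key` names the primary-key column
    (hypothetical: a real connector reports the key separately).
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, or update: upsert the new row image.
        after = event["after"]
        table[after[key]] = after
    elif op == "d":
        # Delete: drop the row identified by the old row image.
        before = event["before"]
        table.pop(before[key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical change events for an orders table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "placed"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "placed"}},
    {"op": "u", "before": {"id": 1, "status": "placed"},
     "after": {"id": 1, "status": "delivered"}},
    {"op": "d", "before": {"id": 2, "status": "placed"}, "after": None},
]

replica: Dict[Any, Row] = {}
for event in events:
    apply_change_event(replica, event)

print(replica)
```

Because each event carries the full new row image, replaying the stream in order is enough to reconstruct the current state of the table without re-reading the source database.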

In aggregate, this work has resulted in a big leap forward in our ability to work with our data productively. Moreover, we now have robust, end-to-end governance for our data, and the ability to create a chain of trust from product databases all the way through to critical reports.

GT has also netted us a number of improvements to day-to-day data workflows. Data producers can more easily instrument behavioral events in the product and write, test, and maintain transformations of data. Data consumers can more readily understand the “nutrition facts” of data and make informed choices about using it. Most importantly, as part of GT, these tools have been tightly integrated to provide a seamless experience across the full lifecycle of data production and consumption.

What’s Next: Time to Build

With the GT chapter now concluded, we’re at an inflection point in Instacart’s data journey. We’ve established the foundations for a future where data is easy and self-serve for everyone at Instacart, with great tools in place to facilitate production of high-quality, ergonomic datasets. But in many areas, we still need to use those tools to build: creating or rebuilding missing datasets, adding data quality checks, and annotating / documenting data to support compliance and discoverability.

So what’s next?

In addressing these data gaps, we have a unique opportunity to reconsider how we approach resourcing data work. Over the past decade, most tech companies have built large central data engineering teams, matrixed into the product engineering organization alongside Data Science to handle the business of building, supporting, and maintaining data pipelines. This made a lot of sense in the past, when data tools were extremely challenging to use and required specialized, arcane expertise to operate successfully. However, this approach has serious scaling challenges (companies of our size typically have huge teams of 100+ data engineers). It also introduces an organizational “toss it over the fence” boundary for data work, decoupling domain expertise from data expertise.

Today, there are two trends that make this the right time to revisit the approach of large centralized DE teams. First, the “modern data stack” has brought the advent of much better systems and tools, which require substantially less specialized expertise to operate. Second, LLMs are creating new ways of interacting with data that have never existed before, lowering the barrier to entry — especially for non-technical consumers.

For Instacart especially, one of the outcomes of GT that we’re really excited about is the creation of data expertise in a broad group of engineers from everywhere across Instacart. Previously, knowledge of our data tools and systems was localized to the data team and a few others. With the GT program concluded, we now have many folks returning to their home teams, newly equipped with the skills needed to act as local ambassadors for data work.

The Modern Data Stack: Streamlined Tools For Data Producers & Consumers

When it comes to the tech stack for data systems and tools, we’ve come a very long way from the Hadoop days of highly complex distributed systems requiring huge teams to operate. Building on the substrate of modern tools like Snowflake, dbt, Amundsen, etc. makes data engineering work easier than it’s ever been, and accessible to any engineer who can write a bit of SQL.

As we increase adoption of this tech stack across Instacart in concert with the decentralization of DE responsibilities, we expect to unlock productivity improvements for data consumers like Data Science and analyst teams as well — particularly around the experience of data discovery. For our datasets, we now aggregate and expose documentation, lineage, ownership, freshness, dashboard usage, common queries, data check pass/fail status, and much more.

Future Innovation with AI: LLMs for Data

While we are still early in the journey, text-to-SQL generative AI models and similar forays into AI + data are being worked on by many companies, and we are quite confident that LLMs will dramatically simplify day-to-day data workflows for data producers and consumers. Preparing for this future was not explicitly a goal of GT, but is readily supported by the work; the rich metadata that we’ve created around our data in GT directly unlocks working with it using LLMs.
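As a rough illustration of why rich metadata helps here, the sketch below assembles a text-to-SQL prompt from table descriptions, column documentation, certification level, and sensitivity annotations. Everything in it is an assumption made for the example: the `ColumnMeta` and `TableMeta` classes, the `build_prompt` function, the sample `analytics.orders` table, and the prompt wording are hypothetical, and no model is called. It only shows how documented, annotated datasets could be turned into context an LLM can work with.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnMeta:
    name: str
    dtype: str
    description: str
    sensitive: bool = False  # hypothetical flag, e.g. set by a governance annotation


@dataclass
class TableMeta:
    name: str
    description: str
    certification: str  # "bronze", "silver", or "gold"
    columns: List[ColumnMeta] = field(default_factory=list)


def build_prompt(question: str, tables: List[TableMeta]) -> str:
    """Render catalog metadata plus a user question into a text-to-SQL prompt."""
    lines = ["You write Snowflake SQL. Use only the tables and columns listed.", ""]
    for table in tables:
        lines.append(f"Table {table.name} ({table.certification}): {table.description}")
        for col in table.columns:
            if col.sensitive:
                continue  # keep sensitive columns out of the prompt entirely
            lines.append(f"  - {col.name} {col.dtype}: {col.description}")
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


# Hypothetical catalog entry for illustration.
orders = TableMeta(
    name="analytics.orders",
    description="One row per completed order.",
    certification="gold",
    columns=[
        ColumnMeta("order_id", "NUMBER", "Unique order identifier."),
        ColumnMeta("order_total", "NUMBER", "Order total in US dollars."),
        ColumnMeta("customer_email", "VARCHAR", "Customer email address.", sensitive=True),
    ],
)

prompt = build_prompt("What was the average order total last week?", [orders])
print(prompt)
```

The design choice worth noting is that the governance annotations do double duty: the same flag that drives access policy can also decide what is safe to include in model context.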

Conclusion

We believe this is the right time to make this change for Instacart, and announced the transition to a decentralized model at the beginning of H2 2023 — our expectation is that all product engineering teams will be self-serve with DE work from H2 onwards. In the coming months, we will be investing in the training and support needed to execute this change, and will continue to improve our data tools and systems to support engineering teams using them.

We’re very excited about the road ahead and the combined energy created by our Ground Truth systems and tools, the decentralization of data work, and the emerging opportunity of AI for data. The next decade is going to be a golden age for data, and we’re eager to take a leadership position in the industry to show the way — ultimately realizing a future where everyone at Instacart has easy, self-serve access to data.

Acknowledgements

This work was a joint effort across many teams, but I’d especially like to thank the GT leads Alex Charlton, Osman Khwaja, and Ritu Kothari for their leadership on this program. This work would not have been possible without the extensive technical and leadership contributions of the GT project leads Kieran Taylor, Simon Jenkins, Sean Cashin, Doug Hyde, Yiwen Luo, Nick Dujay, Sen Sivakumar, Ayush Kaul, Praveen Burgu, Anant Agarwal, and Sebastian Soto, along with the entire GT working team. Finally, I’d also like to thank the entire Data Infrastructure team for their support.