This blog post is a summary of a talk by Girish Baliga, Director of Engineering at Uber, at Open Source Data Summit 2023. Visit the event page to learn more or watch the talk here.
Uber is a global brand that operates in more than 10,000 cities worldwide. The business has a large operational footprint, servicing over 137 million monthly users and 25 million trips every day. And data drives — no pun intended — every action taken by passengers, drivers, and those who run the business. At this scale of data, the technical challenge of turning the raw data from all this activity into business insights is particularly difficult — especially doing so in a performant and reliable way.
Uber is also central to the Onehouse origin story. The Hudi project, which was then called a “transactional data lake,” was started at Uber by Onehouse Founder and CEO Vinoth Chandar. The project was then taken on by the Apache Software Foundation in 2019, and Vinoth went on to found and run Onehouse.
In his talk, Girish Baliga, Director of Engineering at Uber, shared how the company has built and evolved its data infrastructure to fulfill Uber's mission to help people go anywhere and get anything. For context, the Data Infrastructure team at Uber supports a broad base of internal customers with different data needs:
All these internal customers have the same basic goal: they want to get business insights from enormous troves of company data. But they don't have the same expectations in terms of data freshness, scale, or software integrations. Some customers need insights in real-time or near real-time, with data that's updated frequently (e.g. data freshness of less than one minute). Others can accommodate a longer wait, up to a day, for instance when running scheduled Uber Eats reports for restaurant owners.
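To make the freshness requirement concrete, here is a minimal sketch of how such a target could be checked. The two thresholds come from the examples above; the use-case labels, function names, and timestamps are assumptions made for illustration and are not taken from the talk.

```python
from datetime import datetime, timedelta, timezone

# Freshness targets from the text; the dictionary keys are illustrative labels.
FRESHNESS_TARGETS = {
    "near_real_time": timedelta(minutes=1),   # "less than one minute"
    "scheduled_report": timedelta(days=1),    # "up to a day"
}


def data_age(latest_event_time: datetime, now: datetime) -> timedelta:
    """Age of the newest record that a query can currently see."""
    return now - latest_event_time


def meets_target(use_case: str, latest_event_time: datetime, now: datetime) -> bool:
    """True if the newest queryable record is within the use case's target."""
    return data_age(latest_event_time, now) <= FRESHNESS_TARGETS[use_case]


now = datetime(2023, 12, 1, 12, 0, tzinfo=timezone.utc)
latest = now - timedelta(minutes=45)  # newest record landed 45 minutes ago

print(meets_target("near_real_time", latest, now))    # False
print(meets_target("scheduled_report", latest, now))  # True
```

The same 45-minute-old data is unusable for one customer and perfectly adequate for another, which is why a single freshness guarantee cannot serve everyone.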
Uber's data infrastructure team receives four main types of analytics requests. Each one represents a distinct use case with different implications from a technical standpoint. To illustrate, Girish shared the following graphic.
Streaming Analytics
This category demands extremely fresh data, typically requiring updates in less than a minute. A prime example at Uber is addressing surge pricing imbalances, where immediate adjustments to pricing algorithms are necessary. These applications are often integrated with real-time systems, such as Kafka topics, to facilitate the rapid processing and circulation of data.
Real-time Analytics
Uber Eats offers a feature that displays popular orders in a user's vicinity. This feature is a key example of real-time analytics, where data is refreshed rapidly but not as frequently as streaming analytics requires. However, this category of applications is more traffic-intensive, with queries sometimes reaching 2,000 per second. These applications typically interact with a backend through an RPC (Remote Procedure Call) interface that queries an analytics engine.
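The talk summary does not describe how the popular-orders feature is computed. As a rough illustration of the kind of query such a backend answers, the sketch below keeps per-area order counts over a sliding time window in memory. The class name, the window length, and the sample data are all hypothetical, and the sketch assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class PopularOrders:
    """Tracks order counts per area over a sliding time window (toy example)."""

    def __init__(self, window_seconds: int):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, area, item), oldest first
        self.counts = {}        # area -> Counter mapping item -> count

    def record(self, timestamp: int, area: str, item: str) -> None:
        """Register one order; assumes timestamps are non-decreasing."""
        self.events.append((timestamp, area, item))
        self.counts.setdefault(area, Counter())[item] += 1

    def _expire(self, now: int) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, area, item = self.events.popleft()
            self.counts[area][item] -= 1
            if self.counts[area][item] == 0:
                del self.counts[area][item]

    def top(self, area: str, k: int, now: int) -> list:
        """Return the k most-ordered items in an area within the window."""
        self._expire(now)
        return self.counts.get(area, Counter()).most_common(k)


tracker = PopularOrders(window_seconds=600)
tracker.record(0, "downtown", "pad thai")
tracker.record(100, "downtown", "burrito")
tracker.record(200, "downtown", "burrito")
tracker.record(300, "airport", "ramen")

print(tracker.top("downtown", k=2, now=400))  # [('burrito', 2), ('pad thai', 1)]
print(tracker.top("downtown", k=2, now=650))  # [('burrito', 2)]
```

In the second call the order placed at time 0 has aged out of the 10-minute window, so only the burrito orders remain. A production system would run this kind of aggregation in a dedicated analytics engine rather than in application memory.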
Batch Analytics
A typical use case here involves generating daily summary reports for restaurant managers, detailing their Uber Eats orders and revenue. For batch analytics, the data freshness is about a day, and the scale is on the order of multiple reports per restaurant owner. These applications run automated queries on a predefined schedule, executed by an orchestrator or query runner that handles the automation.
Interactive Analytics
Interactive analytics is used to examine historical data, such as order trends over the past year. Interactive tools such as query builders enable users to explore and analyze data with ease.
To accommodate the range of internal customer needs, Uber built a data platform that follows the Lambda architecture, a widely adopted architecture in the data analytics space. This approach handles both low-latency streaming workloads and batch workloads. Consequently, Uber's data infrastructure platform can manage all four primary analytics use cases (streaming, real-time, batch, and interactive analytics) with a single design.
In this architecture, an incoming data stream serves both real-time and batch cases. For the real-time case, a streaming analytics engine transports data from the data stream into a real-time data store. The data is then exposed to end-users through a query interface. For the batch case, the same data stream is ingested, but it goes into a data lake, where custom analytics and transforms are executed on it. Then, the engine creates data models from this data pipeline. The data is then exposed to users for reporting and further analysis.
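The two paths described above can be sketched in a few lines. This is a toy, in-memory illustration of the Lambda pattern, not Uber's implementation: the class names, event fields, and revenue aggregate are assumptions chosen for the example.

```python
from collections import defaultdict

# A tiny stand-in for the incoming data stream (hypothetical order events).
events = [
    {"order_id": 1, "restaurant": "r1", "amount": 12.0},
    {"order_id": 2, "restaurant": "r2", "amount": 8.5},
    {"order_id": 3, "restaurant": "r1", "amount": 20.0},
]


class RealTimeStore:
    """Real-time path: aggregates are updated as each event arrives."""

    def __init__(self):
        self.revenue = defaultdict(float)

    def ingest(self, event: dict) -> None:
        self.revenue[event["restaurant"]] += event["amount"]

    def query(self, restaurant: str) -> float:
        return self.revenue.get(restaurant, 0.0)


class DataLake:
    """Batch path: raw events are appended, and models are built later."""

    def __init__(self):
        self.raw = []

    def ingest(self, event: dict) -> None:
        self.raw.append(event)

    def build_revenue_model(self) -> dict:
        # A batch job scans everything in the lake and derives a data model.
        model = defaultdict(float)
        for event in self.raw:
            model[event["restaurant"]] += event["amount"]
        return dict(model)


store, lake = RealTimeStore(), DataLake()
for event in events:      # the same stream feeds both paths
    store.ingest(event)
    lake.ingest(event)

print(store.query("r1"))           # 32.0
print(lake.build_revenue_model())  # {'r1': 32.0, 'r2': 8.5}
```

The real-time store can answer immediately because it maintains the aggregate on every event, while the data lake keeps the raw history so that new models can be derived from it later.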
Uber used open-source technology as the foundation of its Lambda architecture. Apache Hudi is a prominent pillar of this setup. Developed to address the challenge of managing data at scale, Hudi can reduce upsert times to 10 minutes and shrink end-to-end data freshness from 24 hours to just one hour. Before Hudi existed, the company was limited by how quickly it could re-ingest data, which was often slow. Hudi improved efficiency by allowing the team to incrementally process new data with low latency.
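To show why upserts and incremental processing avoid full re-ingestion, here is a minimal in-memory sketch of the idea. It is not Hudi's API: the `UpsertTable` class, its methods, and the `trip_id` record key are hypothetical names used only for illustration.

```python
class UpsertTable:
    """Toy keyed table with commit times (illustrative, not Hudi's API)."""

    def __init__(self):
        self.rows = {}      # record key -> (commit_time, row)
        self.commits = []   # commit times in the order they were applied

    def upsert(self, commit_time: int, records: list) -> None:
        """Insert new keys and overwrite existing ones in a single commit."""
        for record in records:
            self.rows[record["trip_id"]] = (commit_time, record)
        self.commits.append(commit_time)

    def snapshot(self) -> list:
        """Return the latest version of every row."""
        return [row for _, row in self.rows.values()]

    def incremental_read(self, since: int) -> list:
        """Return only rows written by commits later than `since`."""
        return [row for ts, row in self.rows.values() if ts > since]


table = UpsertTable()
table.upsert(1, [{"trip_id": "a", "fare": 10}, {"trip_id": "b", "fare": 7}])
table.upsert(2, [{"trip_id": "b", "fare": 9}, {"trip_id": "c", "fare": 4}])

print(len(table.snapshot()))            # 3
print(table.incremental_read(since=1))  # [{'trip_id': 'b', 'fare': 9}, {'trip_id': 'c', 'fare': 4}]
```

A downstream job that remembers the last commit it processed only has to read the two rows changed by the second commit, instead of scanning the whole table again.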
For batch workloads, Uber runs ingestion jobs on Spark, with Parquet as the file format and Hadoop as the storage layer. Hive jobs ingest data from the data lake and build the data model using a very similar stack.
On the streaming front, Uber uses Apache Kafka for the data stream and Flink for analytics. Real-time data is served on Pinot. And on top of Pinot, the team built a custom Presto query interface that allows users to write Presto SQL and run their queries on Pinot in real time as if it were a conventional production backend system.
The Lambda architecture describes how data is transported through different analytics engines. But once the appropriate data is available, how do internal customers query the data to gain valuable business insights?
The Data Infrastructure team supports three query languages to meet customer needs — from a high-level, common SQL approach to more customizable, low-level support for advanced users:
Girish highlighted several technical customizations implemented by the Uber Data Infrastructure team to boost reliability, enhance performance, and facilitate multi-cloud deployments.
Examining a subset of the data infrastructure stack — specifically, the HDFS clusters (data storage) and Presto clusters (data reads and data writes) — Girish highlighted a few innovations that help Uber achieve reliability and scale:
Uber's Data Infrastructure team designed and executed the following optimizations to enhance performance:
Uber operates in a hybrid data environment. Traditionally, the team ran its stack in on-premises deployments. It is now building out its cloud data platform on Google Cloud, using HiveSync to copy data from HDFS to Google Cloud Storage.
Two key improvements stand out here:
Collectively, these innovations highlight Uber's commitment to leveraging open-source technology to provide a robust data platform that is scalable and reliable across its global data infrastructure.
To watch the talk directly, you can access the recording here.