This article was originally published on the Confluent Blog and can be found here.
Since its beginnings just five years ago, Apna has become the leading jobs site for tens of millions of workers in India, the largest labor market in the world. Today, Apna has more than 50 million registered users, resulting in more than 5 million interviews and 100,000 jobs activated per month.
To support the platform’s continuous growth and cutting-edge innovation—including the use of artificial intelligence (AI) and machine learning (ML) for job matching—Apna needed to re-architect their site to be faster and more flexible, while maintaining reliability and gaining operational efficiencies. This change took Apna from a monolithic software architecture with batch updates and a data warehouse to microservices, streaming data pipelines, and a universal data lakehouse architecture for data integration.
Migrating from batch updates to real-time data streaming has allowed Apna to outperform competitors, creating an outstanding customer experience, while also reducing costs and delivering a 2x improvement in time to market for new business solutions.
The organization’s new data infrastructure is built on two core technologies:
The fact that both Confluent Cloud and Onehouse’s universal data lakehouse are built on open standards made these technologies a perfect fit for Apna’s technical strategy. “We generally prefer open source vendors to avoid lock-in,” said Subham Todi, lead data engineer at Apna. Using managed services from Confluent and Onehouse enables the company to stay “lean and mean,” allowing technical talent to focus on building and improving features that directly bring value to the business.
Apna began with a monolithic back-end software architecture. Structured transactional data came in via change data capture (CDC) from PostgreSQL and MongoDB databases. That data was ingested using Hevo and integrated in a Google BigQuery data warehouse. Clickstream data came in through Mixpanel. Updates were batched at intervals ranging from every hour to once a day.
This architecture came with numerous issues related to cost, efficiency, and maintenance. Examples include:
These challenges caused latency throughout the system, resulting in delayed data delivery, reduced developer productivity, and the inability to scale to meet user growth for Apna’s existing applications. New functionality, such as the use of AI/ML for job matching, was simply out of reach with this legacy architecture.
Apna’s new data platform, shown in Figure 2, is called Apna Analytics Ground (AAG)—the acronym AAG means “fire” in Hindi. AAG uses Confluent’s data streaming platform for software deployment and software operations, the Onehouse-powered universal data lakehouse for data storage and data access, and Apache Spark® to drive delta streams between data tables.
The use of managed services saves time and developer resources, which allows Apna to stay focused on “the core mission of connecting people to opportunities.” Additionally, Confluent Cloud provides 120+ pre-built connectors that save development time and deliver high availability on an elastically scaling platform with a 99.99% uptime SLA.
The data lakehouse at the core of the Onehouse managed service, kept up to date by data streaming from Confluent Cloud, serves as a “single source of truth” for all analytics needs, including AI/ML use cases. Confluent Schema Registry validates messages from Postgres, clickstream and CDC sources.
Transactional data is ingested using CDC: from Postgres via an open source connector, and from MongoDB via the MongoDB Kafka Source Connector. After migrating to this new architecture, Apna wrote a logging service to bring clickstream data from web and mobile applications into Kafka topics, one per data table. Ingested data is now stored in bronze tables, which form an immutable, append-only data log that can be used to re-create downstream tables to meet new needs or deal with errors.
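To make the idea of an append-only bronze log more concrete, here is a minimal, self-contained sketch in plain Python. It is not Apna's or Onehouse's implementation: `ChangeEvent`, `BronzeLog`, and `rebuild_downstream` are hypothetical names, and an in-memory list stands in for the real bronze tables. It only illustrates why an immutable log lets a downstream table be re-created at any time by replaying the events.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class ChangeEvent:
    """One CDC-style change record (hypothetical shape, for illustration)."""
    table: str  # source table name, e.g. "jobs"
    key: str    # primary key of the changed row
    op: str     # "upsert" or "delete"
    row: dict   # full row image for upserts, empty for deletes


class BronzeLog:
    """In-memory stand-in for a bronze table: events are only ever appended."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def append(self, event: ChangeEvent) -> None:
        self._events.append(event)

    def replay(self, table: str) -> Iterator[ChangeEvent]:
        """Yield every event for one table, in arrival order."""
        return (e for e in self._events if e.table == table)


def rebuild_downstream(bronze: BronzeLog, table: str) -> Dict[str, dict]:
    """Re-create the latest state of a table by replaying the bronze log."""
    current: Dict[str, dict] = {}
    for event in bronze.replay(table):
        if event.op == "delete":
            current.pop(event.key, None)
        else:
            current[event.key] = event.row
    return current
```

Because the log itself is never modified, a downstream table that was built with a faulty transformation can simply be dropped and rebuilt by calling `rebuild_downstream` again with corrected logic.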
The company implements advanced Onehouse functionality such as a schema validation check for updates. Records that fail the check are streamed and saved to a quarantine table for review. With the robust medallion architecture in place, as supported by the universal data lakehouse, Apna can always re-run a problematic job after it’s repaired.
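The validate-then-quarantine pattern described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Onehouse feature itself: `EXPECTED_SCHEMA`, `validate`, and `route` are hypothetical names, the schema is a simple field-to-type mapping, and Python lists stand in for the target and quarantine tables.

```python
from typing import Dict, List, Tuple

# Hypothetical schema for a clickstream record: field name -> expected type.
EXPECTED_SCHEMA: Dict[str, type] = {"user_id": int, "event": str, "ts": float}


def validate(record: dict, schema: Dict[str, type]) -> List[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors: List[str] = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(
                f"bad type for {name}: expected {expected.__name__}, got {actual}"
            )
    return errors


def route(records: List[dict], schema: Dict[str, type]) -> Tuple[List[dict], List[dict]]:
    """Split records into accepted rows and quarantined rows with their errors."""
    accepted: List[dict] = []
    quarantined: List[dict] = []
    for record in records:
        errors = validate(record, schema)
        if errors:
            # Keep the original record next to the reasons it failed,
            # so it can be reviewed and re-processed later.
            quarantined.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined
```

Storing the failure reasons alongside each quarantined record is a design choice in this sketch: it lets a reviewer see why a record was rejected without re-running the check.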
Apna has cultivated a high level of engagement with both Confluent and Onehouse, giving the company a seat at the table to share insights on the evolution of these key technologies. At the same time, its technical staff stay focused on innovation relevant to their industry and business needs rather than on building, scaling, or maintaining infrastructure.
Adopting Confluent Cloud and Onehouse has not only helped Apna solve problems with the previous infrastructure but also enabled new gains. For example, Apna worked with Confluent to completely revamp their software development, delivery, and operations processes.
Additionally, combining data streaming on Confluent Cloud with Onehouse’s universal data lakehouse delivers shared benefits across the entire infrastructure, including:
Apna now describes Confluent Cloud as the central nervous system of its DevOps and data infrastructure. The Kafka-based data streaming platform is used as an event manager, for communication between microservices, for ephemeral storage, for distributed task scheduling, and (of course) as the workhorse for any other application that requires data in motion.
Confluent provides Apna with many benefits:
Adoption of Confluent Cloud has also paved the way for building AAG on the Onehouse lakehouse by:
This fast and flexible infrastructure allows Apna to simultaneously manage costs, maintain uptime, and work at the cutting edge of innovation. Today, Apna uses an AI-powered matching service to optimize results for both job seekers and recruiters, all built on top of the data platform enabled by Confluent and Onehouse.
Through a great deal of hard work, including not only technical effort but also strategic planning, partnering, and relationship-building, Apna has established a leadership position both in business and in the creative use of advanced technology. Through the use of managed services, they’re able to improve scalability and time to market while containing costs.
Apna’s deployment of data streaming and the universal data lakehouse has enabled the company to establish a “new normal” in data integration, data management, and data analytics. Apna has created a fast, easy to use, easy to manage, efficient, and affordable architecture, on which they can now innovate further.
The company now has their eye on new goals such as an enhanced job recommendation engine that supports multiple overlapping objectives, a community feed powered by deep learning algorithms, and data democratization with enhanced transparency and greater end-user control of data. Though still young, Apna has established a benchmark for the deft use of technology to achieve its business goals and raise its standards for data management.
Interested in bringing the benefits of Confluent Cloud and Onehouse to your business? Start your free trial of Confluent Cloud today. New signups receive $400 to spend during their first 30 days.
You can also schedule a test-drive of Onehouse. You will receive $1,000 in free credit for use in your first 30 days.