August 22, 2024

Announcing: AI Vector Embeddings Generator for the Lakehouse

Written by: Chandra Krishnan and Ryan Garrett


Some technology shifts occur slowly, over years. The rise of the data warehouse and the shift to the cloud are good examples. Other shifts appear to take off overnight. Generative AI is a clear example of the latter: organizations across industries are rapidly adapting their data architectures, moving to a data lakehouse to support new objects such as vector embeddings.

Today we are excited to announce a new AI vector embeddings generator, which automates the creation and management of vector embeddings for the data lakehouse. The AI vector embeddings generator helps data practitioners scale their AI projects quickly and easily.

What are vector embeddings?

Vector embeddings allow unstructured data to be represented as a mathematical construct that captures the content and relationships to other data in the relevant domain. These embeddings are used in a variety of AI-related applications, including semantic search, recommendation engines, anomaly detection, and more.
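To make the idea concrete, here is a minimal sketch of how embeddings support semantic search: texts with similar meaning get vectors that point in similar directions, and cosine similarity measures that. The three-dimensional vectors below are made up for illustration and do not come from any real model; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for three documents (illustrative values only).
embeddings = {
    "refund policy": [0.9, 0.1, 0.0],
    "return an item": [0.8, 0.2, 0.1],
    "gpu pricing": [0.0, 0.2, 0.9],
}

def nearest(query_vec, table):
    """Return the key whose embedding is most similar to query_vec."""
    return max(table, key=lambda k: cosine_similarity(query_vec, table[k]))

# Hypothetical embedding of a user query about sending a product back.
query = [0.7, 0.3, 0.0]
best = nearest(query, embeddings)
print(best)
```

The same nearest-neighbor comparison underlies recommendation and anomaly detection: the first looks for the closest vectors, the second for vectors that are far from everything else.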

The data lakehouse-based solution from Onehouse

For step-by-step instructions on how to create your own vector embeddings with Onehouse, see our previous blog post.

With Onehouse, it has never been easier to create and manage the foundation of your data lakehouse. With continuous data ingestion and multi-stage ETL pipelines, Onehouse offers a fully-managed experience to help organizations rapidly adopt and efficiently use the lakehouse architecture.

With the new AI vector embeddings generator, our customers can now generate embeddings directly as they ingest new data or transform existing data on Onehouse Cloud. Users simply select the column(s) that they want to encode as embeddings and the embedding model that they want to use. Onehouse supports a variety of models from OpenAI, Voyage AI, and others, with an extensible framework to support additional models.

Onehouse transfers unstructured data to the models, and the models return new vector embeddings. The embeddings land in Onehouse data lakehouse tables, enabling Onehouse customers to take advantage of the scale and cost efficiencies of the data lake.
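The flow described above (select columns, send their text to a model, store the returned vector alongside the row) can be sketched in a few lines. This is an illustration only, not the Onehouse API: `toy_embed`, `add_embeddings`, and the sample rows are hypothetical names invented here, and `toy_embed` is a deterministic placeholder that carries no semantic meaning. In practice a real embedding model would take its place.

```python
import hashlib

def toy_embed(text, dim=8):
    """Placeholder for a real embedding model (assumption for this sketch).
    Maps text to `dim` floats in [0, 1); deterministic, but not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 256 for i in range(dim)]

def add_embeddings(rows, columns, embed=toy_embed):
    """Return copies of `rows` with an `embedding` field built from `columns`."""
    out = []
    for row in rows:
        # Concatenate the selected columns into the text that gets encoded.
        text = " ".join(str(row[c]) for c in columns)
        out.append({**row, "embedding": embed(text)})
    return out

# Hypothetical incoming records.
incoming = [
    {"id": 1, "title": "Refund policy", "body": "Refunds are issued within 30 days."},
    {"id": 2, "title": "Shipping", "body": "Orders ship in two business days."},
]

# The resulting rows stand in for records landing in a lakehouse table.
table = add_embeddings(incoming, columns=["title", "body"])
```

Passing the model in as the `embed` parameter mirrors the idea of an extensible framework: swapping models means swapping one function, while the pipeline shape stays the same.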

Capabilities such as the extensible indexing system of Apache Hudi™ enable customers to query their vectors from the data lake. After landing the vector embeddings in their data lake, Onehouse customers have freedom of choice to serve these embeddings to their chosen downstream system(s). The workflows can look similar to what Notion built to serve embeddings to a vector database such as Pinecone, or to what NielsenIQ built, using Databricks to create vector search directly against embeddings from Hudi tables.

Onehouse provides powerful incremental processing capabilities. This unique advantage enables continuous updates to vectors as data changes, ensuring data freshness for your AI models. Customers do not need a separate “freshness layer.”
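The core idea of incremental processing for embeddings is to recompute a vector only when its source record is new or has changed. The sketch below illustrates that idea with a content hash per record. It is a simplified assumption for illustration and not a description of how Onehouse or Apache Hudi implement incremental processing; `incremental_embed`, `counting_embed`, and the in-memory `store` are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a record's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_embed(store, batch, embed):
    """Upsert `batch` into `store`, calling `embed` only for new or changed text.

    store: dict mapping id -> {"hash": str, "embedding": list}
    batch: iterable of (id, text) pairs
    Returns the ids that were embedded or re-embedded.
    """
    touched = []
    for record_id, text in batch:
        h = content_hash(text)
        current = store.get(record_id)
        if current is not None and current["hash"] == h:
            continue  # unchanged record: keep the existing embedding
        store[record_id] = {"hash": h, "embedding": embed(text)}
        touched.append(record_id)
    return touched

# Placeholder model that records how often it is called.
calls = []
def counting_embed(text):
    calls.append(text)
    return [float(len(text))]

store = {}
first = incremental_embed(store, [(1, "alpha"), (2, "beta")], counting_embed)
second = incremental_embed(
    store, [(1, "alpha"), (2, "beta v2"), (3, "gamma")], counting_embed
)
```

In the second run only the changed record and the new record reach the model, which is what keeps embedding costs proportional to the volume of change rather than to the size of the table.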

*Onehouse as the source of truth for vector embeddings. For more details on this workflow, see the blog post “Managing AI Vector Embeddings with Onehouse.”*

“Data processing and storage are foundational for AI projects,” said Prashant Wason, Staff Software Engineer at Uber and Apache Hudi Project Management Committee (PMC) member. “Hudi, and lakehouses more broadly, should be a key part of this journey as companies build AI applications on their large datasets. The scale, openness, and extensible indexing that Hudi offers make this approach of bridging the lakehouse and operational vector databases a prime opportunity for value creation in the coming years.”

Bridging vector databases and the data lakehouse

Vector databases are ideal for delivering real-time responses for generative AI. The data lakehouse complements them with unique efficiencies of scale and cost for creating and managing vector embeddings, and it provides a single source of truth for AI initiatives. As NielsenIQ noted, a properly tuned data lakehouse can deliver query latency of seconds for text-based search applications, a common use case for generative AI.

Vector databases provide significant value for use cases that require low-latency serving of embeddings. At Onehouse, we see the combination of the data lakehouse for batch serving and the vector database for low-latency serving becoming the de facto architecture for building AI applications, such as those based on large language models (LLMs), given the benefits that each component provides. The open and interoperable vision and capabilities of Onehouse mean that Onehouse customers can easily “reverse ETL” their embeddings from their data lake to their operational vector database, and vice versa.
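As a rough illustration of the “reverse ETL” direction, the sketch below copies embeddings from lake rows into a serving index in batches. Everything here is hypothetical: `InMemoryVectorIndex` stands in for whichever operational vector database client is actually used, and `reverse_etl` and the sample rows are invented for this example.

```python
class InMemoryVectorIndex:
    """Hypothetical stand-in for an operational vector database client."""

    def __init__(self):
        self.vectors = {}

    def upsert(self, items):
        """Insert or overwrite (id, vector) pairs."""
        for item_id, vector in items:
            self.vectors[item_id] = vector

def batches(rows, size):
    """Yield consecutive slices of `rows` with at most `size` elements."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def reverse_etl(lake_rows, index, batch_size=2):
    """Copy (id, embedding) pairs from lake rows into the serving index."""
    sent = 0
    for chunk in batches(lake_rows, batch_size):
        index.upsert([(row["id"], row["embedding"]) for row in chunk])
        sent += len(chunk)
    return sent

# Hypothetical rows read from a lakehouse table.
lake_rows = [
    {"id": "a", "embedding": [0.1, 0.2]},
    {"id": "b", "embedding": [0.3, 0.4]},
    {"id": "c", "embedding": [0.5, 0.6]},
]

index = InMemoryVectorIndex()
sent = reverse_etl(lake_rows, index)
```

Because the lake table remains the source of truth, the serving index can be rebuilt or repopulated from it at any time, and upserts keyed by id make repeated runs safe.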

With native vector embeddings generated and managed in Onehouse, customers can streamline their vector embedding pipelines to land embeddings directly on the lakehouse. This provides all of the lakehouse’s unique capabilities around update management, late-arriving data, concurrency control and more, at the data volumes needed to power large-scale AI applications.

*The Onehouse Cloud UI enables customers to generate vector embeddings during data ingestion and processing.*

Use cases for vector embeddings in the data lakehouse

The Onehouse vector embeddings solution is well suited to a wide range of AI use cases, including:

Retrieval Augmented Generation (RAG)
Generative AI
Intelligent search
Content generation
Natural language processing (NLP)

If you would like to learn more about how you can leverage Onehouse to generate and manage vector embeddings for AI, we are hosting a webinar with NielsenIQ next week, “Vector Embeddings in the Lakehouse: Bridging AI and Data Lake Technologies.” We invite you to sign up today or, after the webinar happens, to watch the session on demand.

Authors

Chandra Krishnan

Solutions Engineer

Chandra is a Solutions Engineer at Onehouse, building large scale data products. Prior to this, he worked as a Data and ML Engineer at Amazon Web Services, focusing on data platforms and AI enabled applications. He holds degrees in Computer Science and Business from the University of Michigan.

Ryan Garrett

Director, Product Marketing

Director of product marketing. Experience includes Upsolver, Thoughtspot, Teradata, Treasure Data, Aster Data. Education: University of Kentucky; Boston University. Onehouse author and speaker.
