The launch of OpenAI’s ChatGPT in the fall of 2022 led to a rapid rise in the popularity of large language models (LLMs) and generative AI applications. Since then, organizations have placed increasing emphasis on leveraging their data to serve downstream AI use cases.
Onehouse was developed to implement the vision of the Universal Data Lakehouse, where enterprise data is centralized and stored in a single source of truth. This provides organizations with a myriad of benefits, and is already helping Onehouse customers extract significant additional value from their data - not least for use in AI.
To create AI applications, especially those that work with unstructured data such as text or images, you will often need to create and use vector embeddings. Vector embeddings encode data as numerical vectors, and querying these vectors via similarity search is a core part of GenAI applications: the vectors allow an application to characterize an incoming question and retrieve the relevant context with which to answer it.
Several operational databases, such as Pinecone, Milvus, and LanceDB, are specialized for this kind of vector search. These special-purpose databases are expensive to use, and costs compound because developers often store a large number of vector embeddings - yet may only need some of them for a given use case.
This causes two problems. The obvious one is that you are paying to store vectors you don’t need for the current use case. The more subtle one is that you may feel compelled to delete some vectors to save money in the short term, only to have to re-create them when you need them again later - an even more expensive proposition.
In this blog post, we suggest that you follow an alternative, two-part approach:
1. Generate all of your vector embeddings and store them in your data lakehouse, where storage is inexpensive and the data serves as a single source of truth.
2. Load only the embeddings required for a given use case into your specialized vector database.
This approach allows you to create all the vectors you may need, and keep them for as long as you need them, without undue concern about cost. And then, you only pay vector database costs for the specific vectors you need for a given use case.
Vector embeddings are the method used to represent unstructured data (words, sentences, images, videos, etc.) as numerical vectors while capturing the relationships within that data. As you may recall from linear algebra courses, a vector is simply a list of numbers that can be plotted in a space of N dimensions, where N is the number of elements in the vector.
Here, each phrase is encoded as an individual vector and plotted so that related phrases sit near one another. An example, plotting the words boy, girl, man, and woman in a two-dimensional space, is shown in Figure 1 below.
As phrases get longer and more complex, with deeper interlocking relationships, vectors are extended with more dimensions to encode additional attributes.
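To make this concrete, here is an illustrative sketch with made-up two-dimensional coordinates, in the spirit of Figure 1, showing how vector arithmetic can capture the boy/girl/man/woman relationships:

import numpy as np

# Hypothetical 2D embeddings: the first element roughly encodes gender,
# the second roughly encodes age (illustrative values only)
vectors = {
    "boy":   np.array([0.2, 0.2]),
    "girl":  np.array([0.8, 0.2]),
    "man":   np.array([0.2, 0.9]),
    "woman": np.array([0.8, 0.9]),
}

# The "gender direction" is the same for both age groups
print(vectors["woman"] - vectors["man"])  # approximately [0.6, 0.0]
print(vectors["girl"] - vectors["boy"])   # approximately [0.6, 0.0]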
Many widely used machine learning (ML) models receive an input and generate embeddings of this type. Such models can also be used to encode a variety of different data types. Listed below are some common types of embeddings used in the industry:
- Word embeddings (e.g., Word2Vec, GloVe)
- Sentence and document embeddings
- Image embeddings
- Audio embeddings
... and many more!
In this blog, we will generate sentence and phrase embeddings using OpenAI’s text-embedding-3-small model.
In RAG (Retrieval Augmented Generation) applications for GenAI, these vectors are how the base dataset - the content searched to generate relevant answers - is stored. The application searches the vectors and uses the most relevant ones to generate responses. This is where the process of vector search - currently powered by vector databases - becomes critical.
Vector databases support efficient algorithms that compute similarity between vectors, using measures such as cosine similarity or Euclidean distance. When a query is made, it too is converted into a vector, using the same process that was used to encode the stored data. The database then rapidly retrieves the most similar vectors from this high-dimensional space, making it highly efficient for tasks that require similarity search.
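As a minimal illustration of the measure itself (assuming NumPy; real vector databases use approximate nearest neighbor indexes rather than this brute-force scan), cosine similarity can be computed as follows:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means the vectors
    # point in the same direction, 0.0 means they are orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.1, 0.9])
stored = [np.array([0.2, 0.8]), np.array([0.9, 0.1])]

# Rank stored vectors by similarity to the query vector
scores = [cosine_similarity(query, v) for v in stored]
print(scores)  # the first stored vector scores higher (closer match)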
One prominent example of this architecture comes from the productivity company Notion. Notion lands data in its Apache Hudi-powered data lake and uses it for AI-enabled Q&A in its applications. From the Hudi data lake, Notion creates vector embeddings, which are loaded into Pinecone for querying and feeding into RAG applications. The results from Pinecone are then served back to the user via Notion’s application, as shown in Figure 2.
This process is fast, but expensive. And vectors accumulate in the database over time, where they are included in subsequent searches. Unneeded vectors generate extra storage costs, plus extra compute costs for queries where they aren’t relevant.
This leads us to Onehouse’s vision of using the lakehouse as the source of truth for GenAI data.
Since its founding in 2021, Onehouse has pioneered the vision of the universal data lakehouse - a world where data is open and interoperable: write once, query anywhere. With previous technology, businesses had to make many copies of their data to serve reporting, business intelligence, analytics, and data science use cases. Onehouse aims to provide a single place where enterprises can land their data and have it available for all downstream use cases - including GenAI.
Now we will explore how Onehouse helps to serve as that source of truth for vector embedding use cases. We have developed a pattern where users perform the following operations, as shown in Figure 3:
1. Ingest raw data into lakehouse tables.
2. Generate vector embeddings with a custom transformer, enriching the tables.
3. Query the enriched tables and load only the embeddings needed for a given use case into a vector database.
4. Serve similarity searches from the vector database to downstream GenAI applications.
In this use case, we will load data from the IMDB reviews dataset and generate vector embeddings from it. This dataset contains thousands of movie and television reviews from the movie review site IMDB, and is widely used for natural language processing development and testing; its opinionated language makes it well suited to training sentiment analysis models. Selected embeddings from this dataset will then be loaded into Pinecone based on each specific use case.
Ingest raw data into a Onehouse raw table. In this case, we stored the training data in an S3 bucket and used Onehouse’s managed S3 ingestion to load the data into raw tables, as shown in Figure 4.
Once the raw data is landed in your lakehouse, you can use Onehouse custom transformers to generate vector embeddings from your chosen embedding model. In this case, we are using OpenAI’s text-embedding-3-small model and OpenAI’s REST API to generate the embeddings.
The steps for this process are as follows:
1. Write a custom transformer that calls the OpenAI REST API to generate an embedding for each record.
2. Compile the transformer, packaging it with its OpenAI integration.
3. Load the compiled transformer into your Onehouse project.
Once the compiled custom transformer with OpenAI integration is loaded into your Onehouse project, you can use it to enrich any data that you have landed raw in your lakehouse tables. To generate the vectors, create a stream capture from an existing Onehouse table and add the custom transformation as part of the stream capture, as shown in Figure 6.
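The transformer itself is written against Onehouse’s transformer interface, but the core enrichment step is simply a call to OpenAI’s embeddings endpoint. Here is a minimal Python sketch of that REST call (assuming the requests package and an OPENAI_API_KEY environment variable):

import os
import requests

def embed(texts):
    # Call OpenAI's embeddings endpoint for a batch of texts
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "text-embedding-3-small", "input": texts},
    )
    resp.raise_for_status()
    # Embeddings are returned in the same order as the inputs
    return [item["embedding"] for item in resp.json()["data"]]

embeddings = embed(["A tense, beautifully shot thriller.",
                    "Two hours of my life I will never get back."])
print(len(embeddings[0]))  # 1536 dimensions for text-embedding-3-small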
After creating the pipeline in step 3, you should have a table enriched with vector embeddings. Now you can use any engine that can query lakehouse tables to pull this data out of your lake and load it into your vector database.
In this example, we will use pyathena, running on a SageMaker instance, to query our enriched lakehouse table.
from pyathena import connect
import pandas as pd

# Connect to Athena, staging query results in the specified S3 location
conn = connect(
    s3_staging_dir='s3://kw-athena-queries/Athena/',
    region_name='us-west-1'
)

# Pull the enriched table (read-optimized view) into a DataFrame
df = pd.read_sql("SELECT * FROM demos.imdb_reviews_w_vectors_ro;", conn)
From there, you can create an index in your Pinecone instance to upload your vectors to.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="API_KEY")

index_name = "hudi-pinecone-demo"

# Create a serverless index sized to the embedding model's output
pc.create_index(
    name=index_name,
    dimension=1536,  # Replace with your model dimensions
    metric="cosine",  # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

index = pc.Index(index_name)
From here, you can create your API requests for loading the data into the vector database and upload the data in chunks according to Pinecone’s best practices.
from ast import literal_eval

# Build the list of vectors to upsert, parsing the stringified
# embeddings back into lists of floats
vectorList = []
for i in range(df['embeddings'].size):
    vector = {
        "id": str(i),
        "values": literal_eval(df['embeddings'][i])
    }
    vectorList.append(vector)

def divide_chunks(l, n):
    # Yield successive n-sized chunks from l
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Upsert the vectors in batches of 20, per Pinecone's best practices
for ids_vectors_chunk in divide_chunks(vectorList, 20):
    index.upsert(vectors=ids_vectors_chunk, async_req=True)
You can now query your vector database when given a new input.
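Note that query_vector must be produced by the same model (text-embedding-3-small) that generated the stored embeddings; for example, using an embed() helper like the REST sketch shown earlier:

# Embed the incoming question with the same model used for the stored
# vectors (see the embed() sketch above)
query_vector = embed(["Which reviews praise the cinematography?"])[0]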
similar_vecs = index.query(
    vector=query_vector,
    top_k=3
)
Output:
{'matches': [{'id': '10657', 'score': 0.680035114, 'values': []},
{'id': '11442', 'score': 0.614261627, 'values': []},
{'id': '1171', 'score': 0.595179379, 'values': []}],
'namespace': '',
'usage': {'read_units': 5}}
From the above results, you can see the three stored vectors that are most similar to the input query vector. The vectors returned from the vector database can now be fed to your foundation model as part of your RAG application.
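To close the loop, here is a sketch of how those matches might feed a RAG prompt. It assumes a hypothetical review_text column in the enriched table; since the ids assigned during the upsert were the DataFrame row positions, they map straight back to the source rows:

# Map the matched vector ids back to their source rows and build a
# RAG prompt (assumes a hypothetical 'review_text' column in df)
question = "Which reviews praise the cinematography?"

context = "\n\n".join(
    df['review_text'][int(match.id)] for match in similar_vecs.matches
)

prompt = (
    "Answer the question using only the reviews below.\n\n"
    f"Reviews:\n{context}\n\n"
    f"Question: {question}"
)
# 'prompt' can now be sent to your foundation model of choice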
Onehouse provides a robust and scalable solution for enterprises looking to integrate vector embeddings into their data strategies, particularly for generative AI applications. By creating a single, centralized source of truth within a Universal Data Lakehouse, Onehouse enables efficient management and utilization of data across various use cases, from business analytics to AI-powered applications.
The ability to generate, store, and selectively load vector embeddings into specialized vector databases addresses both performance and cost-efficiency challenges, making it an indispensable tool in today’s data-driven landscape. This architecture not only streamlines the workflow from data ingestion to actionable insights, but also optimizes resource allocation by ensuring that only necessary data vectors are actively maintained in costly operational databases.
By leveraging Onehouse, organizations can harness the full potential of their data, driving innovation and maintaining a competitive edge in the rapidly evolving field of artificial intelligence. If you would like to know more, try Onehouse today.