AI_EMBEDDING_VECTOR
This document provides an overview of the ai_embedding_vector function in Databend and demonstrates how to create document embeddings using this function.
The main code implementation can be found here.
By default, Databend leverages the text-embedding-ada-002 model for generating embeddings.
Starting from version 1.1.47, Databend supports the Azure OpenAI service, which offers improved data privacy.
To use Azure OpenAI, add the following configurations to the [query] section:
# Azure OpenAI
openai_api_chat_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_embedding_base_url = "https://<name>.openai.azure.com/openai/deployments/<name>/"
openai_api_version = "2023-03-15-preview"
Databend relies on (Azure) OpenAI for AI_EMBEDDING_VECTOR and sends the embedding column data to (Azure) OpenAI. The function only works when the Databend configuration includes the openai_api_key setting; otherwise, it remains inactive.
This function is available by default on Databend Cloud using our Azure OpenAI key. If you use it, you acknowledge that your data will be sent to Azure OpenAI by us.
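For a self-hosted deployment, this means setting the key in the Databend configuration file, typically in the same [query] section as the settings shown above. The snippet below is a minimal sketch; the key value is a placeholder you must replace with your own:
# OpenAI
openai_api_key = "<your-openai-api-key>"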
Overview of ai_embedding_vector
The ai_embedding_vector function is a built-in Databend function that generates vector embeddings for text data. It is useful for natural language processing tasks such as document similarity, clustering, and recommendation systems.
The function takes a text input and returns a high-dimensional vector that represents the input text's semantic meaning and context. The embeddings are created using pre-trained models on large text corpora, capturing the relationships between words and phrases in a continuous space.
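As a quick illustration (assuming a valid OpenAI API key is configured), you can call the function directly on a string literal. The text below is arbitrary sample input; the values in the returned array vary, but its length equals the model's embedding dimension:
SELECT length(ai_embedding_vector('Databend is a cloud-native data warehouse'));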
Creating embeddings using ai_embedding_vector
To create embeddings for a text document using the ai_embedding_vector function, follow the steps below.
- Create a table to store the documents:
CREATE TABLE documents (
id INT,
title VARCHAR,
content VARCHAR,
embedding ARRAY(FLOAT32)
);
- Insert example documents into the table:
INSERT INTO documents(id, title, content)
VALUES
(1, 'A Brief History of AI', 'Artificial intelligence (AI) has been a fascinating concept of science fiction for decades...'),
(2, 'Machine Learning vs. Deep Learning', 'Machine learning and deep learning are two subsets of artificial intelligence...'),
    (3, 'Neural Networks Explained', 'A neural network is a series of algorithms that endeavors to recognize underlying relationships...');
- Generate the embeddings:
UPDATE documents SET embedding = ai_embedding_vector(content) WHERE length(embedding) = 0;
After running the query, the embedding column in the table will contain the generated embeddings.
The embeddings are stored as an array of FLOAT32 values in the embedding column, which has the ARRAY(FLOAT32) column type.
You can now use these embeddings for various natural language processing tasks, such as finding similar documents or clustering documents based on their content.
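For example, to find the documents most similar to a search phrase, you can compare the stored embeddings with the embedding of the search text. The query below is a sketch that assumes Databend's cosine_distance function is available and uses an arbitrary sample phrase; smaller distances indicate more similar documents:
SELECT id, title,
       cosine_distance(embedding, ai_embedding_vector('neural networks and deep learning')) AS distance
FROM documents
ORDER BY distance ASC
LIMIT 3;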
- Inspect the embeddings:
SELECT length(embedding) FROM documents;
+-------------------+
| length(embedding) |
+-------------------+
|              1536 |
|              1536 |
|              1536 |
+-------------------+
The query above shows that the generated embeddings have a length of 1536 dimensions for each document.