Handling Vectors in AI Context via PostgreSQL pgVector

An introduction to the PostgreSQL pgVector extension and its supported metrics: L2 (Euclidean) distance, dot product, and cosine distance/similarity.

By Horatiu Dan · Apr. 25, 24 · Tutorial


Relational databases are optimized for storing and querying structured data, yet most of today's data is unstructured. Artificial Intelligence and Machine Learning are now able to "structure" pieces of unstructured data without altering their semantics: the data is first transformed into arrays of numbers, called vectors, which are then stored in dedicated databases and worked upon as needed, so that the initial data becomes useful and meaningful as part of a high-dimensional space.

In the context of AI, these numerical arrays are called vector embeddings and can be seen as sets of "characteristics" of the entities (objects) they represent. Their role is to allow AI models to infer from them and, consequently, from the initial input data.

This article is an introduction to turning PostgreSQL into a vector database using the pgVector extension. It also briefly presents a few general vector similarity concepts that are particularly relevant in AI applications.

Concepts

As developers become more and more interested in constructing (or exploring) AI functionalities, it is helpful to have a basic understanding of the concepts around vectors in a multi-dimensional space. Moreover, when using techniques such as Retrieval-Augmented Generation (RAG), where vector embeddings enrich the AI context to fulfill the user's inquiries, this basic understanding becomes a prerequisite.

In general, Large Language Models (LLMs) are stateless, and the responses they give to prompts are not always satisfactory. In order to enhance their capabilities, they are improved with "state" represented as vector embeddings. It is said that this state provides LLMs with "a long-term memory" so that the possibility of them hallucinating decreases. Basically, when a prompt is provided to a model and the results are not satisfactory, one may "help" it with relevant pieces of information – embeddings – retrieved from one's own vector database.

To accomplish this, at least two preconditions are needed:

  • Gather the data, transform it into vector embeddings, and have it stored in the database
  • When needed, quickly find the vectors related to the ones in the prompt and "hand" them to the model to use

The former step represents the "magical" operation through which an ML model accepts text as input and, through vectorization, transforms it into vectors. The latter is accomplished by leveraging the features of the chosen vector database.

When it comes to vector databases, there are quite a few implementations available. Some are mentioned below:

  • PostgreSQL pgVector – in this article
  • AzureVector
  • ChromaVector
  • MilvusVector
  • Neo4jVector
  • PineconeVector
  • QdrantVector
  • RedisVector
  • WeaviateVector

According to OpenAI, “the distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.”

With this definition in mind, the similarity of two objects is the distance between their vector representations, computed using a specific metric (method).

The “theory” defines several such metrics:

  • L1 distance (Manhattan)
  • L2 distance (Euclidean)
  • dot product
  • cosine distance/similarity
  • Hamming distance

Choosing a certain metric depends on several factors, an aspect that won't be detailed as part of this article. Nevertheless, as the pgVector extension currently supports the L2, dot product, and cosine distance/similarity metrics, these are the ones analyzed next.

Installing pgVector

As of now, PostgreSQL does not support vector similarity search natively. In order to accommodate such a feature, the pgVector extension may be added. According to the documentation, one needs to perform a few simple steps [Resource 1].

Personally, I used an existing PostgreSQL 13.11 server running on Windows OS and I enhanced it with vector support by compiling, installing, and enabling the extension afterward.

Steps:

  • Install Visual Studio 2022 with C++ support
  • Run vcvars64.bat, located in Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\ directory
  • Set PGROOT to point to the current PostgreSQL
Shell
 
>set "PGROOT=C:\Program Files\PostgreSQL13"


  • Get the pgVector source code
Shell
 
>git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git


  • Go to the pgvector directory and run the make file using nmake
Shell
 
>nmake /F Makefile.win
>nmake /F Makefile.win install


  • Enable the extension using the following SQL command
SQL
 
create extension vector;


At this point, one shall be able to use a new data type in their tables – vector – to store data as vectors and to perform similarity search queries.
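
As a quick sanity check, one may confirm the installed version and build a vector value directly. A minimal sketch (the pg_extension catalog and the vector cast are standard PostgreSQL/pgVector features):

SQL
 
-- confirm that the extension is installed and inspect its version
select extname, extversion from pg_extension where extname = 'vector';
 
-- construct a vector value from its textual representation
select '[1,2,3]'::vector;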

Querying Vectors in PostgreSQL

The way a Machine Learning model transforms text into vectors ("the magic") is not covered in this article. Thus, it is assumed that the vector embeddings exist and are ready to be stored in the database.

Let the following three sentences have the representations below. For convenience, a space with three dimensions is used.

Plain Text
 
'Raccoons are silly and funny.'   - [3,0,4]
'Dogs are friendly and helpful.'  - [5,0,2]
'Cats are funny and foxy.'        - [4,0,3]


Since the vectors are available, they can be stored. Prior to that, a new schema and a minimal table are created. 

SQL
 
create schema vectors;
 
create table if not exists document (
    id integer primary key,
    name text not null,
    content text not null,
    embedding vector(3) not null
);


One may notice the fourth column – embedding – and its vector data type. The parameter represents the number of dimensions, here 3.

SQL
 
insert into document
    values (1, 'Raccoons', 'Raccoons are silly and funny.', '[3,0,4]');
 
insert into document
    values (2, 'Dogs', 'Dogs are friendly and helpful.', '[5,0,2]');
 
insert into document
    values (3, 'Cats', 'Cats are funny and foxy.', '[4,0,3]');
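
As a side note, pgVector enforces the declared dimensionality. A minimal sketch of a mismatch, using a hypothetical fourth row (the exact error wording may differ between versions):

SQL
 
-- fails: the embedding column was declared as vector(3)
insert into document
    values (4, 'Birds', 'Birds are chirpy and lively.', '[1,2]');
-- ERROR: expected 3 dimensions, not 2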


The data is in; one may start launching queries.

SQL
 
select * from document;
Plain Text
 
+--+--------+------------------------------+---------+
|id|name    |content                       |embedding|
+--+--------+------------------------------+---------+
|1 |Raccoons|Raccoons are silly and funny. |[3,0,4]  |
|2 |Dogs    |Dogs are friendly and helpful.|[5,0,2]  |
|3 |Cats    |Cats are funny and foxy.      |[4,0,3]  |
+--+--------+------------------------------+---------+


Metrics

Next, each of the metrics pgVector currently supports is analyzed. The context is simple.

Let's consider, in addition to the three vector embeddings, a fourth one – [1,2,3]. As the focus is not on how a vector is produced, it may mean anything. The aim is to find out how similar the stored vectors are to this one.

Graphically, the four vectors can be represented as below.

[Figure: the four vectors – [3,0,4], [5,0,2], [4,0,3], and [1,2,3] – represented in the three-dimensional space]

L2

The L2 (Euclidean) distance between two vectors represents the straight-line distance between them.

d_{L2}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}

The lower the distance, the more similar the vectors.

E.g. the L2 distance between [3,0,4] (“Raccoons are silly and funny.”) and [1,2,3] (our query) is

d_{L2}([3,0,4], [1,2,3]) = \sqrt{(3-1)^2 + (0-2)^2 + (4-3)^2} = \sqrt{9} = 3

In PostgreSQL, the <-> operator computes the L2 distance.

SQL
 
select *, embedding <-> '[1,2,3]' as l2_distance
from document
order by l2_distance;
Plain Text
 
+--+--------+------------------------------+---------+-----------------+
|id|name    |content                       |embedding|l2_distance      |
+--+--------+------------------------------+---------+-----------------+
|1 |Raccoons|Raccoons are silly and funny. |[3,0,4]  |3                |
|3 |Cats    |Cats are funny and foxy.      |[4,0,3]  |3.605551275463989|
|2 |Dogs    |Dogs are friendly and helpful.|[5,0,2]  |4.58257569495584 |
+--+--------+------------------------------+---------+-----------------+


Dot Product

The dot product of two vectors is nothing but the scalar product between the two. The dot product distance between the vectors is the negative of their dot product. 

d_{dot}(a, b) = -(a \cdot b) = -\sum_{i=1}^{n} a_i b_i

The lower the distance, the more similar the vectors.

E.g. the dot product distance between [3,0,4] (“Raccoons are silly and funny.”) and [1,2,3] (our query) is:

d_{dot}([3,0,4], [1,2,3]) = -(3 \cdot 1 + 0 \cdot 2 + 4 \cdot 3) = -15

In PostgreSQL, the <#> operator computes the dot product distance. 

SQL
 
select *, embedding <#> '[1,2,3]' as dp_distance
from document
order by dp_distance;
Plain Text
 
+--+--------+------------------------------+---------+-----------+
|id|name    |content                       |embedding|dp_distance|
+--+--------+------------------------------+---------+-----------+
|1 |Raccoons|Raccoons are silly and funny. |[3,0,4]  |-15        |
|3 |Cats    |Cats are funny and foxy.      |[4,0,3]  |-13        |
|2 |Dogs    |Dogs are friendly and helpful.|[5,0,2]  |-11        |
+--+--------+------------------------------+---------+-----------+


Cosine Similarity / Distance

The cosine similarity between two vectors is the cosine of the angle between the two in the considered space. 

\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}

The smaller the angle, the more similar the vectors. Also, it is said that similar vectors point in the same direction.

The cosine distance is defined as the “complementary” of the cosine similarity.

d_{cos}(a, b) = 1 - \cos(\theta)

The lower the distance (the bigger the cosine similarity), the more similar the vectors.

E.g. the cosine similarity and the cosine distance between [3,0,4] (“Raccoons are silly and funny.”) and [1,2,3] (our query) are:

\cos(\theta) = \frac{15}{\sqrt{25} \cdot \sqrt{14}} = \frac{3}{\sqrt{14}} \approx 0.8018, \qquad d_{cos} = 1 - \cos(\theta) \approx 0.1982

In PostgreSQL, the <=> operator computes the cosine distance. 

SQL
 
select *, embedding <=> '[1,2,3]' as cosine_distance
from document
order by cosine_distance;
Plain Text
 
+--+--------+------------------------------+---------+-------------------+
|id|name    |content                       |embedding|cosine_distance    |
+--+--------+------------------------------+---------+-------------------+
|1 |Raccoons|Raccoons are silly and funny. |[3,0,4]  |0.19821627426272692|
|3 |Cats    |Cats are funny and foxy.      |[4,0,3]  |0.30512077102769664|
|2 |Dogs    |Dogs are friendly and helpful.|[5,0,2]  |0.45407916631598844|
+--+--------+------------------------------+---------+-------------------+


As cosine distance and cosine similarity are complementary, when working with cosine similarity the results need to be ordered descending if the aim is to find the most similar vectors.

SQL
 
select *, 1 - (embedding <=> '[1,2,3]') as cosine_similarity
from document
order by cosine_similarity desc;
Plain Text
 
+--+--------+------------------------------+---------+------------------+
|id|name    |content                       |embedding|cosine_similarity |
+--+--------+------------------------------+---------+------------------+
|1 |Raccoons|Raccoons are silly and funny. |[3,0,4]  |0.8017837257372731|
|3 |Cats    |Cats are funny and foxy.      |[4,0,3]  |0.6948792289723034|
|2 |Dogs    |Dogs are friendly and helpful.|[5,0,2]  |0.5459208336840116|
+--+--------+------------------------------+---------+------------------+


To conclude, the cosine similarity measures how similar two vectors are, while the cosine distance, how different they are.
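
As a recap, the three operators may also be compared side by side in a single query over the same table; a sketch:

SQL
 
-- all three pgVector distance operators against the same query vector
select name,
       embedding <-> '[1,2,3]' as l2_distance,
       embedding <#> '[1,2,3]' as dp_distance,
       embedding <=> '[1,2,3]' as cosine_distance
from document
order by l2_distance;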

Enhancing the Search Speed

In the previous section, the aim was to find the vector embeddings that are closer to the chosen query vector – [1,2,3] – by using a specified distance metric. In order for such queries to be usable, they need to be fast. According to the documentation [Resource 1], pgVector uses by default the exact nearest neighbor search (kNN algorithms), which retrieves the nearest k vectors after comparing the queried one with each embedding in the database.

The complexity of kNN algorithms is O(n); thus, scanning vectors with 200-300 dimensions against a large number of embeddings is computationally very expensive and not scalable.
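
This behavior can be observed directly: without an index on the embedding column, the planner resorts to a full sequential scan followed by a sort. A sketch (the actual plan depends on the data and configuration):

SQL
 
-- without an ANN index, every row is compared with the query vector
explain
select * from document
order by embedding <-> '[1,2,3]'
limit 1;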

The alternative is the Approximate Nearest Neighbor (ANN) approach which trades accuracy for a great deal of improvement in speed. pgVector supports two implementations:

  • Hierarchical Navigable Small World (HNSW) – creates an index based on a proximity graph (related vectors are stored next to one another)
  • IVFFlat – divides the vectors into lists and searches subsets of the lists that are closest to the queried vector

There are plenty of details in the documentation [Resource 1] on how to parameterize, adjust, and monitor “the progress” of the execution when creating and using indexes that work based on these algorithms.
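
For comparison, an IVFFlat index might be declared as sketched below; per the documentation, lists sets the number of partitions, and the index is best created after the table contains data:

SQL
 
-- IVFFlat index for the L2 distance operator
create index if not exists idx_ivf_document_embedding on document
    using ivfflat (embedding vector_l2_ops)
    with (lists = 100);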

In the case of this analysis, HNSW indexes were created, one for each of the distance metrics used.

SQL
 
create index if not exists idx_l2_document_embedding on document
    using hnsw (embedding vector_l2_ops)
    with (m = 16, ef_construction = 64);
 
create index if not exists idx_ip_document_embedding on document
    using hnsw (embedding vector_ip_ops);
 
create index if not exists idx_cosine_document_embedding on document
    using hnsw (embedding vector_cosine_ops);


Each of them may be parameterized with the maximum number of connections per layer (m) and with the size of the dynamic candidate list used when constructing the graph (ef_construction).
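
The speed/recall trade-off may also be adjusted at query time; for HNSW, pgVector exposes the hnsw.ef_search setting (40 by default, per the documentation). A sketch:

SQL
 
-- enlarge the dynamic candidate list used during search to improve recall
set hnsw.ef_search = 100;
 
select * from document
order by embedding <-> '[1,2,3]'
limit 1;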

Takeaways

Vector databases play an important role in the AI context, as they are used to integrate one's own data with the actual AI model and, consequently, to make it less stateless.

For thorough and reliable results, there are a few important factors to take into account when choosing a metric. The actual data, the model used, and the application type should be primarily considered. Then, the meaning of the metric in the concrete context is important.

Concerning the metrics analyzed in this article, from a cosine similarity point of view, two vectors are considered related if the angle between them is small, irrespective of their magnitude. On the other hand, the L2 distance between them might still be substantial, and discussions around data normalization, loss functions, and fine-tuning might additionally arise in this case.

All in all, it is advisable to use the same (distance) metric as the one the ML model (the vectorizer, the entity that produces the embeddings) uses.

Resources

  1. pgVector
  2. Spring AI Reference
  3. OpenAI Platform Docs
  4. The formulae were created using Mathcha
  5. The picture was taken in Cluj, Romania