WTF


Published on 04/08/2023 by Kostas Pardalis

Hey, since SaaS is not in vogue anymore, and free tiers for mailing list software are not VC subsidized, for updates just follow us on Twitter: Kostas & Demetrios

This is a series of articles explaining WTF Vector Databases are.

Articles in this series

Let’s build a Vector Database

Remember what we talked about previously. Our aim is to build a database and not a DBMS. To do that, we first have to decide what part of the world we want to model.

To keep things interesting and meaningful, let’s find a practical problem to go after.

The problem we will go after is how to perform semantic search over the audio of a podcast episode.

More specifically, given a podcast episode as an audio file and a text query, we want to find the segments in the audio where the speakers talk about topics related to our query.

As a concrete example, we will be using the interview Sam Altman gave to Lex Fridman.

This is a pretty lengthy podcast episode, lasting almost two and a half hours. Sam and Lex discuss many different topics, including AGI and the meaning of life.

Our goal here is to search for a question and get back a number of timestamps in the episode where Sam and Lex discuss something similar.

To do that, we are going to work with the transcripts of the interview. How we got the transcripts from the raw audio is a whole different topic, but for now you can assume that the transcripts are available and that they include timestamps from the audio.

The transcript files look like this:

{
    "text": " but I'm happy we did it this way.",
    "segments": [
        {
            "id": 0,
            "seek": 0,
            "start": 0.0,
            "end": 1.12,
            "text": " but I'm happy we did it this way.",
            "tokens": [
                50364,
                457,
                286,
                478,
                2055,
                321,
                630,
                309,
                341,
                636,
                13,
                50444
            ],
            "temperature": 0.0,
            "avg_logprob": -0.3862730906559871,
            "compression_ratio": 0.8048780487804879,
            "no_speech_prob": 0.0015712809981778264,
            "words": [
                {
                    "word": " but",
                    "start": 0.0,
                    "end": 0.16,
                    "probability": 0.4283820390701294
                },
                {
                    "word": " I'm",
                    "start": 0.16,
                    "end": 0.3,
                    "probability": 0.9783893525600433
                },
                {
                    "word": " happy",
                    "start": 0.3,
                    "end": 0.48,
                    "probability": 0.8216957449913025
                },
                {
                    "word": " we",
                    "start": 0.48,
                    "end": 0.62,
                    "probability": 0.9063420295715332
                },
                {
                    "word": " did",
                    "start": 0.62,
                    "end": 0.74,
                    "probability": 0.8946512341499329
                },
                {
                    "word": " it",
                    "start": 0.74,
                    "end": 0.82,
                    "probability": 0.9060916900634766
                },
                {
                    "word": " this",
                    "start": 0.82,
                    "end": 0.98,
                    "probability": 0.8845305442810059
                },
                {
                    "word": " way.",
                    "start": 0.98,
                    "end": 1.12,
                    "probability": 0.8896676898002625
                }
            ]
        }
    ],
    "language": "en"
}

There’s a lot going on here, but what we care about is the text together with the start and end fields.

Our plan is to extract the text, start, and end values, use the text to create embeddings (vectors), and then build a database with all the generated vectors.

When a query comes in, we will compute an embedding for it and compare that against the database to find the most similar transcript segments.

Preparing the data

The first part of our adventure involves some data preparation. We want to parse all the transcripts and create a small database. We will keep things extremely simple: our database will be based on a plain list data structure. Each element of the list will be a tuple holding an id, the start and end timestamps, the transcript text, and a speaker label:

('2',
 232326,
 256423,
 " Also, if you want to work with our team, we're always hiring, go to lexfriedman.com slash hiring. And now onto the full ad reads. As always, no ads in the middle. I try to make these interesting, but if you skip them, please still check out our sponsors. I enjoy their stuff. Maybe you will too. This show is brought to you by NetSuite, an all-in-one cloud business management system.",
 'SPEAKER_01')
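
The post doesn’t show the parsing step itself (the enrich_db helper used later), but a minimal sketch of building such a list from transcript JSON files like the one above could look like this. build_db and the speakers mapping are illustrative names, and the speaker labels would come from a separate diarization step that this post doesn’t cover.

import json
from pathlib import Path

def build_db(transcript_dir, speakers):
    # build the in-memory "database": a list of
    # (id, start, end, text, speaker) tuples
    db = []
    for path in sorted(Path(transcript_dir).glob("*.json")):
        with open(path) as f:
            transcript = json.load(f)
        for seg in transcript["segments"]:
            seg_id = str(seg["id"])
            db.append((
                seg_id,
                seg["start"],  # segment start time
                seg["end"],    # segment end time
                seg["text"],
                # `speakers` is an assumed id -> label mapping
                speakers.get(seg_id, "UNKNOWN"),
            ))
    return db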

With a list of tuples like the above, we are now ready to start enriching our data with embeddings.

There are plenty of ways to generate embeddings. OpenAI offers a popular API to create them using their models.

As the OpenAI API is a paid service, we’ll be using Sentence Transformers, a popular alternative that you can run on your own laptop (it installs with pip install sentence-transformers). You can find more information about the package on their webpage and on HuggingFace 🤗.

The actual code to calculate the embedding for each transcript is pretty simple, and it looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# res is a list of tuples (id, from, to, text, speaker);
# enrich_db is the parsing helper (its code isn't shown in this post)
res = enrich_db("../data/transcripts", segments, times)
with_embeddings = []  # the initial db enriched with an embedding per entry
for r in res:
    embedding = model.encode(r[3])  # embed the transcript text
    with_embeddings.append((r[0], r[1], r[2], r[3], r[4], embedding))
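
As an aside, model.encode also accepts a list of sentences and batches them internally, which is usually much faster than encoding one transcript at a time. The same enrichment in one call would look like this:

texts = [r[3] for r in res]
embeddings = model.encode(texts)  # one batched call instead of a loop
with_embeddings = [r + (e,) for r, e in zip(res, embeddings)]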

At the end of this simple for loop, we have the embeddings: the vectors we need to perform semantic search.

Searching

After we have the database, we are ready to start searching. To do this, we are going to implement the simplest search method there is: brute force!

  1. For the provided query, we will calculate its embedding.
  2. For each entry in the database, we will calculate the similarity between the query embedding and the entry’s embedding.
  3. We will sort the entries from most similar to least similar.
  4. We will choose the top-k most similar entries.
  5. We will return the results and make the user happy.

Remember what we said about using algebraic methods with the vectors? That’s exactly what we will do here. We will use Cosine Similarity, which, in short, calculates the cosine of the angle between two vectors to determine whether they are pointing in a similar direction.

This is easy to visualize in two dimensions, but keep in mind that our vectors live in spaces with many more dimensions. The principle remains the same though.
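
For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their magnitudes. A minimal NumPy version, computing the same quantity as the util.cos_sim call we use below, would be:

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))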

The code to do that is pretty simple:

from sentence_transformers import util

def calc_similarity(db, query):
    query_embedding = model.encode(query)
    results = []
    for row in db:
        # each result is a tuple (id, cosine similarity value)
        results.append((row[0], util.cos_sim(query_embedding, row[5])))
    return results
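
Steps 3 and 4 from the list above, sorting and picking the top-k, only take a couple more lines. The top_k helper here is added for illustration and isn’t part of the original code:

def top_k(db, query, k=3):
    # sort by descending cosine similarity and keep the k best matches
    results = calc_similarity(db, query)
    results.sort(key=lambda r: float(r[1]), reverse=True)
    return results[:k]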

That’s it, now we can start searching. Let’s see an example.

q = 'Advanced General Intelligence'

This gives us the following top three results:

('160', tensor([[0.5964]]))
"text": " is not a super intelligence. And to do that really well, I think we will need to expand on the GPT paradigm in pretty important ways that we're still missing ideas for. But I don't know what those ideas are. We're trying to find them.",
('33', tensor([[0.5243]]))
"text": " remarkable, by the way, that there's like a law of science that lets you predict for these inputs, here's what's going to come out the other end, like here's the level of intelligence you can expect.",
('189', tensor([[0.4859]]))
"text": " when you build or somebody builds an artificial general intelligence, would that be fast or slow? Would we know what's happening or not? Would we go about our day on the weekend or not?",

Not perfect, but not bad either, especially considering that we haven’t done any fine-tuning on anything so far.

Using the above ids, we can also look the entries up in our database and retrieve their timestamps in the audio, so we can skip directly to that part of the interview.
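
As an illustration (this helper isn’t in the original post either), the lookup is just a scan over the same list:

def timestamps_for(db, seg_id):
    # return the (start, end) timestamps stored for a segment id
    for row in db:
        if row[0] == seg_id:
            return row[1], row[2]
    return None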

The problem, though, is that our search is very inefficient. It takes about 21 ms to come up with the top-k similar items. This includes calculating all the similarities and then sorting, with the similarity calculation being by far the most expensive operation.

And this is for a database with just 527 entries. Since brute force scans every entry, query time grows linearly with the size of the database. Trust me, this doesn’t scale very well!
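
For what it’s worth, a measurement like the one above can be reproduced with a quick snippet, using the illustrative top_k helper from earlier:

import time

start = time.perf_counter()
hits = top_k(with_embeddings, 'Advanced General Intelligence', k=3)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"search took {elapsed_ms:.1f} ms over {len(with_embeddings)} entries")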

Regardless of the scalability issue, we just created our first vector database. As you can see, it only took us a couple of lines of Python code.

But a database is not a DBMS; let’s see what the difference is.

Let’s put everything together

We just created a simple vector database for searching inside the content of an audio file containing a podcast interview. We saw that building a toy database is not hard, and hopefully it helped clarify the mechanics of a vector database.

We also encountered the first limitation of our implementation, and with that we are ready to start exploring what a real vector DBMS looks like and why it’s a much more complex system than the toy database we just created.

Let’s go through a real Vector DBMS now to see how a real system is built!

Dissecting a Vector DBMS

