Vector Search — How to Find the Best Results

Episode 6 · 20 minutes

A Quick Recap

In previous episodes, we learned why LLMs alone aren’t enough, the core idea of RAG, how Embedding converts text to vectors, how Vector Databases store these vectors, and how Chunking splits text into pieces. Now it’s time to tackle a very important question: when a user asks a question, how do we find the most relevant chunks?

This is where vector search enters the game. It’s the art of finding the closest vectors to the user’s query vector.

Distance Metrics — Three Main Tools

When you want to compare two vectors and determine how similar they are, you have three main metrics:

1. Cosine Similarity

Imagine two arrows starting from the same point. Cosine Similarity measures the angle between these two arrows. If the angle is zero (they point the same direction), similarity is 1. If they’re at 90 degrees, similarity is 0. If at 180 degrees (opposite), similarity is -1.

similarity = cos(theta) = (A . B) / (|A| x |B|)

Key point: Cosine Similarity doesn’t care about vector length, only direction. This means if a short text and a long text are about the same topic, they’ll still have high similarity. That’s why it’s the most popular metric in RAG.
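A quick sketch of this in Python with numpy, using made-up toy vectors:

import numpy as np

def cosine_similarity(a, b):
    # Angle-based similarity: only direction matters, not length
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.1, 0.3, 0.5])      # toy "short text" embedding
b = np.array([0.2, 0.6, 1.0])      # same direction, twice as long
print(cosine_similarity(a, b))     # ~1.0, the length difference is ignored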

2. Dot Product

Dot Product considers both direction and length of vectors. The formula is simple:

dot_product = A1xB1 + A2xB2 + ... + AnxBn

The main difference from Cosine is that if vectors are normalized (length 1), Dot Product equals Cosine Similarity exactly. Many embedding models, including OpenAI's, produce normalized vectors, so in practice there's often no difference.

But if vectors aren’t normalized, Dot Product gives higher scores to longer texts. Sometimes this is good (e.g., when you want to prefer more comprehensive articles), sometimes bad.
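A small numpy sketch (same toy vectors) showing both behaviors:

import numpy as np

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.6, 1.0])      # same direction, but twice as long

# Unnormalized: the longer vector inflates the score
print(np.dot(a, b))                # ~0.7

# Normalized to length 1: dot product equals cosine similarity
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
print(np.dot(a_n, b_n))            # ~1.0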

3. Euclidean Distance

This is the straight-line distance between two points, the same formula you learned in high school math:

distance = sqrt((A1-B1)^2 + (A2-B2)^2 + ... + (An-Bn)^2)

Unlike the previous two, here a smaller number means more similarity. Euclidean is sensitive to both direction and length, but in high-dimensional spaces (like 1536 dimensions) it can sometimes give unexpected results. That’s why it’s used less frequently.
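And the same toy comparison with Euclidean distance, where the length difference now counts against similarity:

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance: smaller means more similar
    return np.linalg.norm(a - b)

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.6, 1.0])      # same direction, but twice as long
print(euclidean_distance(a, b))    # ~0.59, no longer "perfectly similar"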

Which One Should I Choose?

Rule of Thumb: If your embedding model produces normalized vectors (like OpenAI's models), all three produce the same ranking of results. But if in doubt, choose Cosine Similarity. It almost always works.

The Exact Search Problem — And Why We Need ANN

Think for a moment. If you have 1 million text chunks, each with a 1536-dimensional vector, then for every question you need to compare the query vector against all 1 million vectors. That’s 1 million dot products over 1536 dimensions. Slow! Especially when you want to respond in a fraction of a second.

Exact Search (or Brute Force) gives the most accurate results but doesn’t scale. This is where ANN or Approximate Nearest Neighbor comes in.

The ANN idea is simple: instead of checking all vectors, build a smart data structure that only checks “probably relevant” vectors. You might lose 1-2% accuracy, but search speed increases 100x.
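To make the cost concrete, here is what exact (brute-force) search looks like in numpy over random stand-in vectors; with a million real embeddings, this is exactly the work that gets slow:

import numpy as np

num_chunks, dim = 100_000, 1536
chunk_vectors = np.random.rand(num_chunks, dim).astype("float32")
query = np.random.rand(dim).astype("float32")

# Exact search: one dot product per stored chunk
# (with normalized vectors this is the same as cosine similarity)
scores = chunk_vectors @ query
top_k = np.argsort(scores)[::-1][:10]   # indices of the 10 best chunks
print(top_k)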

HNSW — The Most Popular Algorithm

Hierarchical Navigable Small World or HNSW is currently the most popular ANN algorithm. Let me explain simply.

Imagine you have cities on a map. You build a multi-layered network:

  • Top layer: Only major cities — New York, London, Tokyo, Sydney
  • Middle layer: Medium cities added — Chicago, Manchester, Osaka
  • Bottom layer: All cities and towns

When searching for a specific town, first find the nearest major city in the top layer, then the nearest medium city in the middle layer, and finally the exact town in the bottom layer.

HNSW does exactly this with vectors. Its speed is excellent and its recall is very high (usually above 95%).

Important HNSW parameters:

  • M — Number of connections per node. More = more accurate but more memory
  • ef_construction — Index build accuracy. More = slower build but better index
  • ef_search — Search accuracy. More = slower search but more accurate
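As a rough sketch, here is how these parameters map onto hnswlib, one common open-source HNSW implementation (the values and the random stand-in data are only illustrative):

import numpy as np
import hnswlib

dim, num_chunks = 1536, 100_000
vectors = np.random.rand(num_chunks, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_chunks, M=16, ef_construction=200)
index.add_items(vectors)

index.set_ef(50)   # ef_search: higher = slower queries, better recall
query = np.random.rand(dim).astype("float32")
labels, distances = index.knn_query(query, k=10)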

IVF — The Clustering Algorithm

Inverted File Index or IVF takes a different approach. It first divides vectors into groups (clusters). During search, it first finds the nearest clusters and then only examines vectors within those clusters.

Like organizing a library into sections (history, science, literature). When looking for a physics book, you only check the science section.

IVF usually uses less memory than HNSW but is slightly less accurate.

Key parameter: nprobe — number of clusters checked during search. More = more accurate but slower.
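A minimal sketch with faiss, a library that ships an IVF index (again with random stand-in data; the nlist and nprobe values are only illustrative):

import numpy as np
import faiss

dim, num_chunks, nlist = 1536, 100_000, 1024   # nlist = number of clusters
vectors = np.random.rand(num_chunks, dim).astype("float32")

quantizer = faiss.IndexFlatL2(dim)             # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)                           # learn the cluster centroids
index.add(vectors)

index.nprobe = 10                              # clusters checked per query
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)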

Filtering with Metadata

Vector search is good but not always sufficient. Suppose you have a support system with documentation for different products. A user asks about “Product A” but vector search might return chunks from “Product B” documentation because they have similar words.

Solution: Metadata Filtering. When storing each chunk, also store some extra information (metadata):

{
  "text": "To reset the device, hold the power button for 10 seconds.",
  "metadata": {
    "product": "product-a",
    "category": "troubleshooting",
    "language": "en",
    "last_updated": "2025-01-15"
  }
}

Now when searching you can say: “Only return chunks where product equals product-a and category equals troubleshooting.”

Two filtering approaches:

  • Pre-filtering: Filter first, then run vector search only over the matching chunks. You always get the requested number of relevant results, but it can be slower because the ANN index can’t always be used efficiently on an arbitrary subset.
  • Post-filtering: Vector search first, then filter. Fast, but you may end up with fewer than top_k results if many of the top matches fail the filter.

Most modern Vector Databases like Pinecone, Weaviate, and Qdrant support both approaches.
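For intuition, here is a small library-free sketch of the two approaches over an in-memory list of chunks, using brute-force search just to show the order of operations:

import numpy as np

chunks = [
    {"vector": np.random.rand(8), "metadata": {"product": "product-a"}},
    {"vector": np.random.rand(8), "metadata": {"product": "product-b"}},
    # ... more chunks
]
query = np.random.rand(8)

def similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-filtering: keep only product-a chunks, then search among them
allowed = [c for c in chunks if c["metadata"]["product"] == "product-a"]
pre = sorted(allowed, key=lambda c: similarity(query, c["vector"]), reverse=True)[:5]

# Post-filtering: search everything first, then drop non-matching results
ranked = sorted(chunks, key=lambda c: similarity(query, c["vector"]), reverse=True)[:20]
post = [c for c in ranked if c["metadata"]["product"] == "product-a"][:5]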

Hybrid Search — The Best of Both Worlds

Vector search excels at understanding concepts but has a weakness: it sometimes misses exact terms. For example, if a user asks “What is error E-4021?”, vector search might catch the concept of “error” but ignore the exact code “E-4021”.

On the other hand, traditional keyword search (like BM25) matches exact words but doesn’t understand meaning.

Hybrid Search combines both:

# Pseudocode for Hybrid Search
vector_results = vector_search(query, top_k=20)
keyword_results = bm25_search(query, top_k=20)

# Combine results with Reciprocal Rank Fusion
final_results = rrf_merge(vector_results, keyword_results)
return final_results[:10]

Reciprocal Rank Fusion (RRF) is a simple but effective method for combining results. It scores each result based on its rank in each list and then sums the scores.

score(doc) = sum(1/(k + rank_i))   # k is typically 60

For example, if a document ranks 1st in vector search and 5th in BM25, its score is: 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
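Here is a small sketch of what an rrf_merge like the one in the pseudocode above could look like, assuming each input is a list of document ids ordered best-first:

def rrf_merge(*result_lists, k=60):
    # Sum 1/(k + rank) for every list a document appears in
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc3", "doc1", "doc7"]
keyword_results = ["doc1", "doc9", "doc3"]
print(rrf_merge(vector_results, keyword_results))
# ['doc1', 'doc3', 'doc9', 'doc7'] -- documents found by both searches rise to the top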

Hybrid search typically performs 10-20% better than pure vector search, especially on queries containing exact codes, names, or identifiers.

Practical Tips for Improving Search Quality

1. Improve the Query Vector

Before searching, process the user’s query a bit. For example, if the user wrote “why doesn’t it work?”, you can ask the LLM to rewrite it using the conversation context: “Why is the user login system encountering an error?”
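A minimal sketch of such a rewrite step using the OpenAI Python client (the model name and prompt wording here are placeholders; any capable chat model can do this):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(user_query, conversation_context):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a clear, "
                                          "self-contained search query, using the context."},
            {"role": "user", "content": f"Context: {conversation_context}\n"
                                        f"Question: {user_query}"},
        ],
    )
    return response.choices[0].message.content

print(rewrite_query("why doesn't it work?", "The user is setting up login for Product A."))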

2. Tune the Number of Results (top_k)

Too few = you might miss the answer. Too many = noise is added and context window fills up. Usually 5 to 10 is a good starting point.

3. Set a Similarity Threshold

Only return results above a certain similarity score. For example, if Cosine Similarity is below 0.7, ignore that result. This helps avoid irrelevant answers when the user’s question has nothing to do with your data.
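A tiny sketch of applying such a threshold to whatever results your search already returned (the 0.7 cutoff and the scores are just example values):

SIMILARITY_THRESHOLD = 0.7

# results: (chunk_text, cosine_similarity) pairs from your vector search
results = [("reset instructions", 0.83), ("billing FAQ", 0.64), ("setup guide", 0.71)]

relevant = [(text, score) for text, score in results if score >= SIMILARITY_THRESHOLD]
print(relevant)   # the 0.64 result is dropped as likely irrelevant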

4. Choose the Right Embedding Model

Not all embedding models are equal. Some perform better with specific languages, others with technical content. Always test and compare with your own data.

5. Optimize the Index

Tune HNSW or IVF parameters based on your data volume. For 10,000 chunks, default settings suffice. For 10 million chunks, you need serious optimization.

Summary

In this episode you learned:

  • The three main distance metrics (Cosine, Dot Product, Euclidean) and when to use each
  • How ANN algorithms like HNSW and IVF dramatically speed up search
  • How Metadata Filtering makes results more precise
  • How Hybrid Search gives you the best of both worlds
  • Practical tips for improving search quality

Now that you’ve found the best chunks, the next question is: how do you feed them to the LLM to produce the best answer? In the next episode, we’ll discuss Prompt Engineering for RAG — a skill that separates an average RAG from an excellent one.