Find Relevant Answers: Semantic Search with Text Embeddings

Idea and Step-by-Step Plan

Today, we are going to use text embeddings to transform a list of phrases into vectors. When a user asks a question, we will convert it into a vector as well and find the phrases from the list that are semantically closest. This approach is useful, for example, for immediately suggesting relevant FAQ sections to the user, reducing the need for full support requests.

So, here's a plan:

  1. Prepare the data: Create a numbered list of text phrases.

  2. Generate embeddings: Use a model to embed each phrase into a vector.

  3. Embed the question: When the user asks something, embed the question text.

  4. Find similar phrases: Calculate the similarity (e.g., cosine similarity) between the question vector and the list vectors. Show the top 1–3 most similar phrases as the answer.

Full Walkthrough

1. Prepare the data

We have compiled the following list of FAQ headings:

"How to grow tomatoes at home",
"Learning about birds",
"Best practices for machine learning models",
"How to train a dog",
"Tips for painting landscapes",
"Learning Python for data analysis",
"Everyday Life of a Cynologist"

2. Generate embeddings

Let's save our headings as a list and pass them to the model. We chose the text-embedding-3-large model — it has been trained on a large dataset and is powerful enough to build complex semantic connections.
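If you are using the official OpenAI Python SDK, this step might look roughly like the sketch below. The client setup and variable names are our own choices, and the SDK reads the API key from the OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

faq_headings = [
    "How to grow tomatoes at home",
    "Learning about birds",
    "Best practices for machine learning models",
    "How to train a dog",
    "Tips for painting landscapes",
    "Learning Python for data analysis",
    "Everyday Life of a Cynologist",
]

# One API call embeds the whole list; the response preserves the input order.
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=faq_headings,
)
heading_vectors = [item.embedding for item in response.data]
```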

Now each of our headings has a corresponding embedding vector.

3. Embed the question

Similarly, we process the user's query. We save the embedding vector generated by the model into a separate variable.
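Continuing the same sketch, with a hypothetical example question:

```python
question = "How do I take care of a puppy?"  # hypothetical user question

question_vector = client.embeddings.create(
    model="text-embedding-3-large",
    input=question,
).data[0].embedding
```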

4. Find similar phrases

We calculate the similarity between the question vector and the list vectors.

There are different metrics and functions you can use for this, such as cosine similarity, dot product, or Euclidean distance.

In this example, we use cosine similarity because it depends only on the angle between two vectors and is a popular choice for comparing text embeddings, especially when the magnitude of the vectors matters less than their direction.
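As a rough sketch, the comparison and ranking could look like this, reusing the variables from the snippets above and scikit-learn's cosine_similarity helper:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Compare the question vector against every heading vector at once.
similarities = cosine_similarity([question_vector], heading_vectors)[0]

# Indices of the top 3 most similar headings, best match first.
top_indices = np.argsort(similarities)[::-1][:3]
for i in top_indices:
    print(f"{similarities[i]:.3f}  {faq_headings[i]}")
```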

Please note that to use the cosine similarity function, you need to install the scikit-learn library separately. You can install it with the following command:
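```bash
pip install scikit-learn
```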

Full Code Example & Results

In this section, you will find the complete Python code for the described use case, along with an example of the program's output.

Python code
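Below is a self-contained sketch that puts the steps above together. It assumes the OpenAI Python SDK and scikit-learn are installed and uses a hypothetical example question; the exact similarity scores you see will depend on the model's responses.

```python
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # reads OPENAI_API_KEY from the environment

faq_headings = [
    "How to grow tomatoes at home",
    "Learning about birds",
    "Best practices for machine learning models",
    "How to train a dog",
    "Tips for painting landscapes",
    "Learning Python for data analysis",
    "Everyday Life of a Cynologist",
]

def embed(texts, model="text-embedding-3-large"):
    """Return a list of embedding vectors, one per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

heading_vectors = embed(faq_headings)

question = "How do I take care of a puppy?"  # hypothetical user question
question_vector = embed([question])[0]

# Cosine similarity between the question and every heading.
similarities = cosine_similarity([question_vector], heading_vectors)[0]

# Print the top 3 most similar headings, best match first.
for i in np.argsort(similarities)[::-1][:3]:
    print(f"{similarities[i]:.3f}  {faq_headings[i]}")
```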
Response when using a large embedding model

Here is the program output after we switched to the small version of the model, text-embedding-3-small:

Response when using a small embedding model

Maybe it just wasn’t trained quite as thoroughly and doesn’t recognize who cynologists are 🤷 Or maybe the difference simply comes down to the default embedding size: 1536 dimensions for text-embedding-3-small versus 3072 for text-embedding-3-large.

We didn't notice much difference in speed, but the larger version is somewhat more expensive.

If you're planning to perform semantic search over code snippets, a better choice might be the voyage-code-2 model, which is specifically trained to better distinguish between pieces of code.

Room for Improvement

Naturally, this is a simplified example. To build a more comprehensive implementation, you could:

  • Add a minimum similarity threshold to filter out irrelevant results (see the sketch after this list),

  • Cache embeddings for faster lookup without recalculating them each time,

  • Allow partial matches or fuzzy search for broader results,

  • Handle multiple user questions at once (batch processing) — and more.
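For example, here is a minimal sketch of the first idea, reusing the similarities and faq_headings variables from the code above. The 0.3 cutoff is an arbitrary value you would tune on your own data:

```python
MIN_SIMILARITY = 0.3  # arbitrary threshold; tune it for your own data

# Pair each heading with its score and keep only sufficiently similar ones.
ranked = sorted(zip(similarities, faq_headings), reverse=True)
relevant = [(score, heading) for score, heading in ranked if score >= MIN_SIMILARITY]

if relevant:
    for score, heading in relevant[:3]:
        print(f"{score:.3f}  {heading}")
else:
    print("No sufficiently similar FAQ entry found; forwarding to support.")
```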
