Intermediate · 4 min read

How Do Vector Embeddings Work, and How Do You Choose the Right Embedding Model?

Explain what vector embeddings are, how embedding models convert text to vectors, and how you'd benchmark and improve retrieval accuracy for a production RAG system.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

Embeddings are the foundation of every RAG system. Interviewers use this question to test whether you understand what a vector actually represents, why model choice matters, and whether you can diagnose and fix poor retrieval — not just wire up a library.

Key Concepts to Cover

  • What embeddings are — dense vector representations that encode semantic meaning
  • How embedding models work — encoder-only transformers (BERT-family, sentence transformers)
  • Dimensionality — tradeoffs between vector size, quality, and storage cost
  • Benchmarking — how to evaluate models on your specific data
  • Improving a low-accuracy embedding model — fine-tuning vs. switching models

How to Approach This

1. What is a Vector Embedding?

A vector embedding is a fixed-length array of floats (e.g., 384 or 1536 dimensions) where semantically similar texts have embeddings that are close together in vector space. The key insight is that geometric proximity approximates semantic proximity — "What is RAG?" and "How does retrieval-augmented generation work?" would map to nearby points.

Embeddings are generated by encoder models trained on large text corpora, usually with contrastive learning objectives (positive pairs pulled together, negatives pushed apart).
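The "geometric proximity ≈ semantic proximity" idea can be sketched with cosine similarity. The toy 4-dimensional vectors below are made-up stand-ins (real models output 384-3072 dimensions); only the relative comparison matters:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-d "embeddings" for three texts (illustrative values only).
q_rag       = np.array([0.9, 0.1, 0.0, 0.2])  # "What is RAG?"
q_retrieval = np.array([0.8, 0.2, 0.1, 0.3])  # "How does retrieval-augmented generation work?"
q_cooking   = np.array([0.0, 0.9, 0.8, 0.1])  # "Best pasta recipes"

print(cosine_similarity(q_rag, q_retrieval))  # high — semantically close
print(cosine_similarity(q_rag, q_cooking))    # low — unrelated
```

Cosine similarity is the standard metric here because most embedding models are trained (or normalized) so that direction, not magnitude, carries meaning.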

2. How Embedding Models Are Used in RAG

At indexing time:

  • Each document chunk is passed through the embedding model → vector stored in the vector DB

At query time:

  • The user query is embedded with the same model → nearest-neighbor search returns the most semantically similar chunks

Critical constraint: always use the same model for indexing and querying. Mismatched models produce meaningless similarity scores.
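A minimal sketch of the index-then-query flow. The `embed` function here is a hypothetical deterministic bag-of-words encoder standing in for a real neural model — in production you'd call the same sentence transformer (or API model) at both indexing and query time:

```python
import numpy as np

def embed(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: normalized
    bag-of-words counts over a fixed vocabulary. The key property
    it shares with a real model: the SAME encoder must be used for
    both the document chunks and the query."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Indexing time: embed each chunk once and store the vectors.
chunks = [
    "RAG combines retrieval with generation.",
    "Embeddings map text to dense vectors.",
    "BM25 is a sparse keyword-matching method.",
]
vocab = {tok: i for i, tok in enumerate(
    sorted({t for c in chunks for t in c.lower().split()}))}
index = np.stack([embed(c, vocab) for c in chunks])

# Query time: embed the query with the same encoder, then rank
# chunks by cosine similarity (dot product of unit vectors).
def search(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query, vocab)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

print(search("how do embeddings work", k=1))
```

A real vector DB replaces the brute-force `index @ query` with approximate nearest-neighbor search, but the contract is identical: one model, two call sites.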

3. Short vs. Long Content

  • Short texts (sentences, queries): most models handle well; sentence transformers are purpose-built for this
  • Long texts (pages, chapters): most models truncate at 512 tokens; need chunking before embedding, or use long-context embedding models (e.g., text-embedding-3-large with up to 8191 tokens)
  • Asymmetric retrieval: queries are short, documents are long — models like msmarco-distilbert are trained specifically for this mismatch
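A minimal sliding-window chunker, assuming whitespace tokens as a stand-in for the model's real tokenizer (a production version would count tokens with the tokenizer matching your embedding model, e.g. tiktoken for OpenAI models):

```python
def chunk_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows so no chunk exceeds the
    embedding model's token limit. Overlap reduces the chance that
    a relevant passage is severed at a chunk boundary."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return [text]
    step = max_tokens - overlap
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), step)]

doc = " ".join(f"word{i}" for i in range(1000))
parts = chunk_tokens(doc)
print(len(parts))  # a ~1000-token doc becomes a few overlapping 512-token chunks
```

Semantic chunking (splitting on headings or paragraph boundaries) usually beats fixed windows, but the fixed-window version is the baseline to know.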

4. Benchmarking Embedding Models on Your Data

Don't trust public leaderboards like MTEB (the Massive Text Embedding Benchmark) alone — they may not reflect your domain. Build a small evaluation set:

  1. Sample ~100-200 representative queries from real or expected user questions
  2. For each query, manually label which chunks are relevant (golden set)
  3. Run retrieval with each candidate model
  4. Measure precision@k, recall@k, MRR (mean reciprocal rank)
  5. Pick the model with the best recall at acceptable latency/cost
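The metrics in step 4 are a few lines each. A sketch over a hypothetical two-query golden set (chunk IDs `c1`, `c4`, etc. are made up for illustration):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the golden (relevant) chunks found in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 if no relevant chunk is retrieved)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# Hypothetical golden set: ranked results and labeled-relevant chunks per query.
retrieved = [["c3", "c1", "c7"], ["c9", "c2", "c4"]]
relevant  = [{"c1"}, {"c4", "c5"}]

print(recall_at_k(retrieved[0], relevant[0], k=3))  # 1.0
print(mrr(retrieved, relevant))  # (1/2 + 1/3) / 2 ≈ 0.417
```

Run the same golden set against each candidate model and compare these numbers side by side; a 100-200 query set is usually enough to separate models by several points of recall.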

5. Improving a Low-Accuracy Embedding Model

If retrieval accuracy is still low even after trying a stronger off-the-shelf model, work through these steps systematically:

  1. Analyze failure modes first — are misses due to vocabulary gaps, domain mismatch, or query-document length asymmetry?
  2. Query rewriting — rephrase short queries into longer, more descriptive forms before embedding
  3. HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer to the query, embed that instead of the raw query; puts query in "answer space"
  4. Fine-tune a sentence transformer — label positive/negative pairs from your domain and run contrastive fine-tuning (much cheaper than fine-tuning an LLM)
  5. Hybrid search — combine dense embeddings with sparse BM25; keyword matches catch what semantic search misses
  6. Re-ranking — add a cross-encoder re-ranker on the top-k results; cross-encoders see query+document jointly and are far more accurate than embedding similarity alone
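For hybrid search (step 5), a common way to merge dense and sparse results is reciprocal rank fusion (RRF), which combines ranked lists without having to calibrate their incomparable raw scores. A minimal sketch with made-up chunk IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank)
    per document; documents ranked highly by EITHER retriever rise,
    and documents found by BOTH rise most. k=60 is the conventional
    smoothing constant from the original RRF formulation."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_top  = ["c2", "c5", "c1"]   # ranked by embedding similarity
sparse_top = ["c7", "c2", "c9"]   # ranked by BM25 keyword match
print(rrf([dense_top, sparse_top]))  # "c2" wins: found by both retrievers
```

The fused list then feeds the cross-encoder re-ranker (step 6), so the expensive re-ranking pass only sees candidates that at least one retriever liked.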

Common Follow-ups

  1. "When would you fine-tune an embedding model vs. switch to a different one?" Switch first — it's faster and cheaper. Fine-tune when you have a specialized domain (legal, medical, code) where off-the-shelf models don't capture domain vocabulary, and you have sufficient labeled pairs (500+).

  2. "What's the difference between text-embedding-ada-002 and text-embedding-3-large?" 3-large has higher dimensionality (3072 vs 1536), supports dimension reduction (Matryoshka embeddings), and benchmarks better on MTEB. ada-002 is cheaper but largely superseded.

  3. "How do you handle multilingual content?" Use a multilingual embedding model (e.g., multilingual-e5-large, paraphrase-multilingual-mpnet). Don't translate first — it adds latency and loses nuance.
