Why This Is Asked
Embeddings are the foundation of every RAG system. Interviewers use this question to test whether you understand what a vector actually represents, why model choice matters, and whether you can diagnose and fix poor retrieval — not just wire up a library.
Key Concepts to Cover
- What embeddings are — dense vector representations that encode semantic meaning
- How embedding models work — encoder-only transformers (BERT-family, sentence transformers)
- Dimensionality — tradeoffs between vector size, quality, and storage cost
- Benchmarking — how to evaluate models on your specific data
- Improving a low-accuracy embedding model — fine-tuning vs. switching models
How to Approach This
1. What is a Vector Embedding?
A vector embedding is a fixed-length array of floats (e.g., 384 or 1536 dimensions) where semantically similar texts have embeddings that are close together in vector space. The key insight is that geometric proximity approximates semantic proximity — "What is RAG?" and "How does retrieval-augmented generation work?" would map to nearby points.
Embeddings are generated by encoder models trained on large text corpora, usually with contrastive learning objectives (positive pairs pulled together, negatives pushed apart).
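The geometric intuition can be sketched in a few lines. The vectors below are toy 4-dimensional examples (real models emit hundreds or thousands of dimensions), but the similarity computation is the same one vector databases run:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 means same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" standing in for real model output
q1 = [0.90, 0.10, 0.00, 0.20]  # "What is RAG?"
q2 = [0.85, 0.15, 0.05, 0.25]  # a paraphrase -> nearby point
q3 = [0.00, 0.10, 0.90, 0.10]  # unrelated text -> distant point

print(cosine_similarity(q1, q2))  # close to 1.0
print(cosine_similarity(q1, q3))  # much lower
```

Geometric proximity approximating semantic proximity is exactly this: the paraphrase scores near 1.0 against the original, the unrelated text scores far lower.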
2. How Embedding Models Are Used in RAG
At indexing time:
- Each document chunk is passed through the embedding model → vector stored in the vector DB
At query time:
- The user query is embedded with the same model → nearest-neighbor search returns the most semantically similar chunks
Critical constraint: always use the same model for indexing and querying. Mismatched models produce meaningless similarity scores.
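The two-phase flow can be sketched end to end. Here `embed` is a toy bag-of-words stand-in (a real system would call a sentence transformer or an embeddings API), but the indexing and query mechanics — and the same-model constraint — are identical:

```python
chunks = [
    "RAG retrieves relevant chunks before generation",
    "Transformers use attention over token sequences",
]
VOCAB = sorted({w for c in chunks for w in c.lower().split()})

def embed(text):
    # Toy stand-in for a real embedding model: one dimension per
    # vocabulary word. Real models output dense learned floats.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

# Indexing time: each chunk -> vector, stored together
index = [(c, embed(c)) for c in chunks]

def search(query, k=1):
    # Query time: embed with the SAME model, rank by dot product
    q = embed(query)
    scored = sorted(index, key=lambda cv: -sum(a * b for a, b in zip(q, cv[1])))
    return [c for c, _ in scored[:k]]

print(search("how does RAG retrieve relevant chunks"))
```

Swapping `embed` for a different model at query time would compare vectors from two unrelated spaces — which is why mismatched models produce meaningless scores.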
3. Short vs. Long Content
- Short texts (sentences, queries): most models handle well; sentence transformers are purpose-built for this
- Long texts (pages, chapters): most models truncate at 512 tokens; chunk before embedding, or use a long-context embedding model (e.g., text-embedding-3-large, with up to 8191 tokens)
- Asymmetric retrieval: queries are short, documents are long — models like msmarco-distilbert are trained specifically for this mismatch
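A minimal sliding-window chunker for long texts, word-based for simplicity (production pipelines usually split on tokens or sentence boundaries instead, and the size/overlap values below are illustrative):

```python
def chunk_words(text, size=200, overlap=40):
    # Sliding word window with overlap, so a sentence cut at one chunk's
    # boundary still appears intact at the start of the next chunk.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

doc = ("word " * 450).strip()   # a 450-word document
pieces = chunk_words(doc)
print(len(pieces))              # 3 overlapping chunks
```

Each chunk then gets embedded independently, keeping every piece under the model's token limit.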
4. Benchmarking Embedding Models on Your Data
Don't trust public benchmarks (MTEB) alone — they may not reflect your domain. Build a small evaluation set:
- Sample ~100-200 representative queries from real or expected user questions
- For each query, manually label which chunks are relevant (golden set)
- Run retrieval with each candidate model
- Measure precision@k, recall@k, MRR (mean reciprocal rank)
- Pick the model with the best recall at acceptable latency/cost
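The metrics above are only a few lines each. In this sketch, `retrieved` is a hypothetical ranked result list from one candidate model and `relevant` is the hand-labelled golden set for that query:

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved chunks that are actually relevant
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant chunks that appear in the top k
    return sum(1 for c in relevant if c in retrieved[:k]) / len(relevant)

def mrr(all_retrieved, all_relevant):
    # Mean reciprocal rank of the FIRST relevant hit, averaged over queries
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, c in enumerate(retrieved, start=1):
            if c in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# One query from the golden set: labelled relevant chunks are d1 and d4
retrieved = ["d3", "d1", "d2", "d4", "d5"]
relevant = {"d1", "d4"}
print(precision_at_k(retrieved, relevant, 3))  # 1/3
print(recall_at_k(retrieved, relevant, 5))     # 1.0
print(mrr([retrieved], [relevant]))            # first hit at rank 2 -> 0.5
```

Run the same golden set through each candidate model and compare the averages — that comparison is far more trustworthy for your domain than a public leaderboard position.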
5. Improving a Low-Accuracy OpenAI Embedding Model
If retrieval accuracy is still low even after trying a stronger model, work through these steps systematically:
- Analyze failure modes first — are misses due to vocabulary gaps, domain mismatch, or query-document length asymmetry?
- Query rewriting — rephrase short queries into longer, more descriptive forms before embedding
- HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer to the query, embed that instead of the raw query; puts query in "answer space"
- Fine-tune a sentence transformer — label positive/negative pairs from your domain and run contrastive fine-tuning (much cheaper than fine-tuning an LLM)
- Hybrid search — combine dense embeddings with sparse BM25; keyword matches catch what semantic search misses
- Re-ranking — add a cross-encoder re-ranker on the top-k results; cross-encoders see query+document jointly and are far more accurate than embedding similarity alone
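For the hybrid-search step, one common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). A sketch with hypothetical document IDs (k=60 is the constant conventionally used in the RRF literature):

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Sum 1 / (k + rank) for each document across every ranked list.
    # Rank-based fusion sidesteps the problem that cosine scores and
    # BM25 scores live on incomparable scales.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]   # embedding-similarity order
sparse = ["d1", "d4", "d2"]  # BM25 keyword order
print(reciprocal_rank_fusion([dense, sparse]))
```

Documents ranked well by both retrievers float to the top; a cross-encoder re-ranker can then be applied to the fused top-k for a further accuracy boost.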
Common Follow-ups
- "When would you fine-tune an embedding model vs. switch to a different one?" Switch first — it's faster and cheaper. Fine-tune when you have a specialized domain (legal, medical, code) where off-the-shelf models don't capture domain vocabulary, and you have sufficient labeled pairs (500+).
- "What's the difference between text-embedding-ada-002 and text-embedding-3-large?" text-embedding-3-large has higher dimensionality (3072 vs. 1536), supports dimension reduction (Matryoshka embeddings), and benchmarks better on MTEB. ada-002 is cheaper but largely superseded.
- "How do you handle multilingual content?" Use a multilingual embedding model (e.g., multilingual-e5-large, paraphrase-multilingual-mpnet). Don't translate first — it adds latency and loses nuance.