Why This Is Asked
Interviewers want to see your decision-making framework — not just that you know both techniques, but that you can reason about when each is appropriate based on data availability, cost, latency, and maintenance overhead.
Key Concepts to Cover
- Few-shot prompting — including examples in the prompt to guide behavior
- Fine-tuning — updating model weights on a task-specific dataset
- Data requirements — how much labeled data each approach needs
- Latency and cost — few-shot sends more tokens per request; a fine-tuned model needs shorter prompts and can often be smaller
- Starting point — in most cases, start with few-shot and fine-tune only when necessary
How to Approach This
1. Few-Shot Prompting
Include 3-20 labeled examples directly in the prompt before the test input.
Pros: Zero training required, easy to update, works immediately, interpretable.
Cons: Examples take up context window, higher per-request cost and latency, performance ceiling.
Best for: New tasks, tasks with few examples, frequently changing requirements, prototyping.
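To make the mechanics concrete, here is a minimal sketch of few-shot prompt construction for a sentiment task. The instruction, examples, and labels are illustrative, not from any real dataset.

```python
# Hypothetical labeled examples included in-context to guide the model.
EXAMPLES = [
    ("The battery dies within an hour.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
    ("It arrived on time, nothing special.", "neutral"),
]

def build_few_shot_prompt(test_input: str) -> str:
    """Prepend labeled examples to the test input, one block per example."""
    parts = ["Classify each review as positive, negative, or neutral.\n"]
    for text, label in EXAMPLES:
        parts.append(f"Review: {text}\nSentiment: {label}\n")
    # The prompt ends mid-pattern so the model completes the label.
    parts.append(f"Review: {test_input}\nSentiment:")
    return "\n".join(parts)

prompt = build_few_shot_prompt("The screen cracked on day two.")
print(prompt)
```

Note the trade-off stated above: every one of those example tokens is re-sent (and re-billed) on every single request.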
2. Fine-Tuning
Continue training a model on your labeled dataset, updating its weights.
Pros: Better performance on well-defined tasks, lower per-request cost, lower latency, better consistency.
Cons: Requires labeled data (100-10,000+ examples), training cost, risk of overfitting, re-training when task shifts.
Best for: High-volume tasks with stable definitions, tasks where examples alone don't achieve needed accuracy.
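The main engineering work in fine-tuning is usually data preparation. Below is a sketch that converts (text, label) pairs into the JSONL chat format used by hosted fine-tuning APIs such as OpenAI's; the field names follow that format, and the task and labels are hypothetical.

```python
import json

# Hypothetical labeled data for a customer-message classifier.
labeled_data = [
    ("Refund has not arrived after two weeks.", "complaint"),
    ("How do I change my shipping address?", "question"),
]

def to_jsonl(pairs, path="train.jsonl"):
    """Write one JSON record per line: system prompt, input, target label."""
    with open(path, "w") as f:
        for text, label in pairs:
            record = {
                "messages": [
                    {"role": "system", "content": "Classify the customer message."},
                    {"role": "user", "content": text},
                    {"role": "assistant", "content": label},
                ]
            }
            f.write(json.dumps(record) + "\n")

to_jsonl(labeled_data)
```

The resulting file is uploaded to the provider's training endpoint; the per-example format is the same idea for self-hosted supervised fine-tuning, just with different field names.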
3. Decision Framework
| Condition | Recommendation |
|-----------|----------------|
| < 100 labeled examples | Few-shot |
| Task changes frequently | Few-shot |
| Prototyping | Few-shot |
| > 1,000 examples, stable task | Consider fine-tuning |
| Cost/latency critical at scale | Fine-tuning |
| Few-shot accuracy is sufficient | Stay with few-shot |
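The decision table can be encoded as a simple rule chain. This is a sketch: the 100 and 1,000 thresholds come straight from the table and should be tuned to your setting, not treated as hard limits.

```python
def choose_approach(n_examples: int, task_stable: bool,
                    few_shot_sufficient: bool, scale_critical: bool) -> str:
    """Rule-of-thumb chooser between few-shot prompting and fine-tuning."""
    # If few-shot already meets the bar and scale isn't a concern, stop there.
    if few_shot_sufficient and not scale_critical:
        return "few-shot"
    # Too little data, or a moving target: fine-tuning isn't viable yet.
    if n_examples < 100 or not task_stable:
        return "few-shot"
    # Stable task with enough data, and accuracy or cost pressure remains.
    if scale_critical or n_examples >= 1000:
        return "fine-tuning"
    return "few-shot"
```

For example, a stable high-volume task with 5,000 labels where few-shot falls short routes to fine-tuning, while a prototype with 50 labels stays few-shot regardless of other pressures.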
4. Practical Recommendation
In most cases, start with few-shot. Fine-tune when:
- Few-shot accuracy doesn't meet requirements after optimization
- You have enough labeled data
- The task is stable enough
- Cost/latency savings justify the engineering investment
Common Follow-ups
- "What about RLHF?" RLHF and its modern variants (DPO, RLAIF, Constitutional AI) shape model behavior based on human or AI feedback. They were initially associated with general alignment (safety, helpfulness) but are now widely applied to task-specific adaptation as well. For a classification task, standard supervised fine-tuning is usually the right starting point. RLHF-style techniques become relevant when you want to optimize for nuanced human preferences that are hard to capture with simple labeled examples, such as response quality in open-ended tasks or avoiding specific failure modes.
- "How much data do you need to fine-tune?" Rough rules of thumb: simple binary classification, 200-500 examples; multi-class, 1,000-10,000; highly specialized domains, 10,000+. Always evaluate on a held-out validation set.
- "What is LoRA and when would you use it?" LoRA freezes most model weights and trains small low-rank adapter matrices, dramatically reducing training cost and memory. It is the standard approach for fine-tuning large models without a full-training compute budget.
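The parameter arithmetic behind LoRA is easy to reproduce and makes a good interview aside. For a weight matrix W of shape (d, d), full fine-tuning updates d*d parameters, while LoRA trains only two adapters A (r, d) and B (d, r) and keeps W frozen, so d*d becomes 2*d*r trainable parameters. The hidden size and rank below are typical but illustrative values.

```python
# Trainable-parameter comparison for one (d, d) weight matrix.
d, r = 4096, 8               # hidden size and LoRA rank (illustrative)
full_params = d * d          # full fine-tuning: every weight updates
lora_params = 2 * d * r      # LoRA: only adapters A (r, d) and B (d, r)
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {full_params / lora_params:.0f}x fewer")
```

At rank 8 this is a 256x reduction per matrix, which is why LoRA adapters train on a single GPU where full fine-tuning would not fit.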