Why This Is Asked
AI teams routinely underestimate production costs and get surprised after launch. Interviewers ask this to test whether you can think quantitatively about system design — and whether you understand the economic tradeoffs between managed APIs and self-hosted models.
Key Concepts to Cover
- SaaS API cost model — input/output token pricing, request volume projections
- Open source cost model — GPU types, utilization, inference throughput
- Cost drivers — prompt length, context window usage, batch vs. real-time
- Optimization levers — model selection, caching, batching, quantization
- Break-even analysis — when to switch from API to self-hosted
How to Approach This
1. SaaS API Cost Model
API costs are primarily driven by token count. Start with:
Monthly cost = requests_per_month × (avg_input_tokens × input_price_per_token + avg_output_tokens × output_price_per_token)
Example calculation:
- Application: customer support chatbot
- Avg input tokens: 500 (system prompt 200 + conversation history 200 + user query 100)
- Avg output tokens: 150
- Requests/day: 10,000 → 300,000/month
- Model: GPT-4o at $2.50/M input, $10/M output
Input cost: 500 × 300,000 × $2.50/1M = $375/month
Output cost: 150 × 300,000 × $10/1M = $450/month
Total: ~$825/month
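The arithmetic above can be wrapped in a small helper. This is a sketch; the rates are the illustrative GPT-4o numbers from the example, not authoritative pricing:

```python
def monthly_api_cost(input_tokens, output_tokens, requests_per_month,
                     input_price_per_m, output_price_per_m):
    """Estimate monthly SaaS API spend (USD) from average token counts
    and per-million-token prices."""
    input_cost = input_tokens * requests_per_month * input_price_per_m / 1_000_000
    output_cost = output_tokens * requests_per_month * output_price_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost

# Chatbot example: 500 input / 150 output tokens, 300k requests/month,
# $2.50/M input and $10/M output (illustrative rates).
inp, out, total = monthly_api_cost(500, 150, 300_000, 2.50, 10.00)
print(inp, out, total)  # 375.0 450.0 825.0
```

Running the numbers in code also makes sensitivity analysis trivial: change one parameter and re-run to see which variable moves the total most.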
Key insight: in this example output cost slightly exceeds input cost, but in practice input costs often dominate, because system prompts and retrieved RAG context can push input to 1,000-4,000 tokens per request while outputs stay short.
2. Open Source Hosting Cost Model
Self-hosted models eliminate per-token fees but add GPU infrastructure costs.
Key variables:
- GPU type and hourly rate (A100 80GB ~$3-4/hr on-demand, H100 ~$5-8/hr)
- Model size and GPU memory requirements
- Throughput: tokens/second per GPU
- Utilization: what fraction of time is the GPU busy?
Example: hosting Llama 3 70B
- Requires 2× A100 80GB (or 4× A100 40GB) for FP16
- With quantization (INT8/INT4): fits on 1-2× A100
- 2× A100 reserved at ~$2/hr each = $4/hr → $2,880/month for 720 hrs
- Typical throughput: ~500-1,000 output tokens/sec at batch size 8
Break-even: if API cost exceeds hosting cost at your volume, self-hosting wins on raw economics. But factor in engineering time, reliability overhead, and model-update management before deciding.
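The break-even point can be sketched directly from the two models above. This is a rough estimate under the example's assumptions (2× A100 reserved at $2/hr, the chatbot token profile from section 1); real throughput limits and ops cost are not modeled:

```python
def self_hosted_monthly_cost(gpus, hourly_rate_per_gpu, hours=720):
    """Fixed monthly GPU cost for an always-on reservation (720 hrs/month)."""
    return gpus * hourly_rate_per_gpu * hours

def break_even_requests(hosting_cost, input_tokens, output_tokens,
                        input_price_per_m, output_price_per_m):
    """Monthly request volume at which API spend equals fixed hosting cost."""
    cost_per_request = (input_tokens * input_price_per_m +
                        output_tokens * output_price_per_m) / 1_000_000
    return hosting_cost / cost_per_request

hosting = self_hosted_monthly_cost(gpus=2, hourly_rate_per_gpu=2.0)  # $2,880/month
# Using the chatbot profile (500 in / 150 out, GPT-4o example rates):
print(round(break_even_requests(hosting, 500, 150, 2.50, 10.00)))  # ~1.05M requests/month
```

At roughly a million requests a month the fixed GPU cost matches the API bill; below that, the API is cheaper even before counting ops effort.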
3. The Context Window Is Your Biggest Cost Lever
Long system prompts and large RAG contexts inflate costs significantly:
- A 4,000-token system prompt costs 8× as much per request as a 500-token one; at the example's $2.50/M input rate and 300,000 monthly requests, the extra 3,500 tokens add about $2,625/month
- RAG with 10 chunks × 500 tokens = 5,000 additional input tokens per query
- Conversation history grows with each turn — implement a rolling window or summarization
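A rolling window over conversation history can be sketched in a few lines. The tokenizer callable is an assumption here; in practice you would plug in a real tokenizer such as tiktoken rather than the crude word count used for illustration:

```python
def trim_history(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose total token count fits the budget.

    `count_tokens` is any callable returning a token count for one message;
    a word count is a crude stand-in for a real tokenizer.
    """
    kept, total = [], 0
    for msg in reversed(messages):       # walk newest-first
        tokens = count_tokens(msg)
        if total + tokens > max_tokens:
            break                        # budget exhausted: drop older turns
        kept.append(msg)
        total += tokens
    return list(reversed(kept))          # restore chronological order

history = ["hi", "hello, how can I help?", "my order 123 is late", "sorry to hear..."]
rough = lambda m: len(m.split())         # crude token estimate (assumption)
print(trim_history(history, max_tokens=10, count_tokens=rough))
# ['my order 123 is late', 'sorry to hear...']
```

Summarization is the complementary approach: when the window overflows, compress the dropped turns into a short summary message instead of discarding them outright.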
Optimizations:
- Minimize system prompt length (remove boilerplate, use concise instructions)
- Reduce k in RAG retrieval — retrieve 3 chunks, not 10
- Cache identical or near-identical prompts (semantic caching with a vector similarity threshold)
- Use a prompt cache if the API supports it (Anthropic, Google — prefix caching discounts repeated context)
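A semantic cache with a vector similarity threshold can be sketched as follows. This is a toy in-memory version: the `embed` callable and the 0.95 threshold are assumptions, and a production system would use an approximate-nearest-neighbor index rather than a linear scan:

```python
import math

class SemanticCache:
    """Toy semantic cache: return a stored response when a new query's
    embedding is within a cosine-similarity threshold of a cached one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed               # any embedding callable (assumption)
        self.threshold = threshold
        self.entries = []                # list of (embedding, response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        v = self.embed(query)
        for emb, response in self.entries:
            if self._cosine(v, emb) >= self.threshold:
                return response          # cache hit: no LLM call, no token cost
        return None                      # miss: caller pays for a generation

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

Every cache hit replaces a full LLM call with an embedding call, so on repetitive traffic (FAQ-style support queries) the savings compound quickly. Tune the threshold carefully: too low and users get stale or wrong answers, too high and the hit rate collapses.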
4. Model Selection as a Cost Lever
Not all requests need GPT-4o. Route by complexity:
- Simple queries, classification, extraction → GPT-4o mini, Claude Haiku (~10-20× cheaper)
- Complex reasoning, open-ended generation → full model
A router that sends 70% of traffic to a model that is ~15× cheaper cuts the blended cost to roughly a third of the all-large-model baseline, with minimal quality impact on simple tasks; pushing 90%+ of traffic to the smaller model gets savings closer to 5-10×.
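The blended-cost arithmetic is worth being able to do on a whiteboard; here it is as a one-liner, using the ~15× GPT-4o mini vs. GPT-4o price gap as an illustrative ratio:

```python
def blended_cost_ratio(small_fraction, price_ratio):
    """Blended cost relative to sending everything to the large model,
    when `small_fraction` of traffic goes to a model that is
    `price_ratio` times cheaper."""
    return (1 - small_fraction) + small_fraction / price_ratio

# 70% of traffic to a ~15x cheaper model:
r = blended_cost_ratio(0.70, 15)
print(round(1 / r, 1))  # overall savings factor: 2.9
```

Note the diminishing returns: the residual 30% sent to the large model dominates the blended cost, which is why aggressive routing (90%+) is needed before savings approach the raw price ratio.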
5. Quick Cost Comparison Framework
| Factor | SaaS API | Self-hosted |
|---|---|---|
| Fixed cost | None | GPU reservation |
| Variable cost | Per token | Electricity/cloud marginal |
| Best for | Variable traffic, early stage | Steady high volume |
| Risk | Vendor price changes | GPU availability, ops overhead |
| Break-even | — | ~$5-15K/month API spend |
Common Follow-ups
- "How would you handle cost spikes from unexpectedly long conversations?" Implement hard token limits per session, use summarization to compress history above a threshold, and set up cost alerting. Charge back to cost centers if multi-tenant.
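A hard per-session token cap with an alert threshold can be sketched like this; the specific limits are illustrative placeholders, not recommendations:

```python
class SessionBudget:
    """Hard per-session token cap with a soft alert threshold."""

    def __init__(self, max_tokens=50_000, alert_at=0.8):
        self.max_tokens = max_tokens     # hard cap (illustrative value)
        self.alert_at = alert_at         # warn at this fraction of the cap
        self.used = 0

    def record(self, tokens):
        """Account for tokens spent on one turn; raise when the cap is hit."""
        self.used += tokens
        if self.used >= self.max_tokens:
            raise RuntimeError("session token budget exhausted")
        if self.used >= self.alert_at * self.max_tokens:
            print(f"warning: session at {self.used / self.max_tokens:.0%} of token budget")
```

In production the `RuntimeError` would map to a graceful end-of-session message, and the warning would trigger history summarization rather than just a log line.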
- "When does self-hosting not make sense even at high volume?" When the model is updated frequently (you'd need to re-deploy constantly), when you need the absolute frontier model (only available via API), or when your traffic is highly bursty and GPU utilization would be <20%.
- "How would you estimate cost for a RAG system specifically?" Add embedding costs: embedding 1M document chunks at build time (~$0.02/1M tokens for text-embedding-3-small) is a one-time cost. Query-time embedding is cheap — 100 tokens × 1M requests = $2. The dominant cost in a RAG system is LLM generation, not retrieval.
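The RAG cost breakdown can be made concrete with one helper. The generation-side token counts below (2,000 input tokens for query plus retrieved chunks, 150 output) are assumed for illustration; the embedding and LLM rates are the example figures used earlier:

```python
def rag_cost_breakdown(num_chunks, tokens_per_chunk, embed_price_per_m,
                       monthly_queries, query_tokens,
                       gen_input_tokens, gen_output_tokens,
                       input_price_per_m, output_price_per_m):
    """Return (one-time build cost, monthly query-embedding cost,
    monthly generation cost) in USD."""
    build = num_chunks * tokens_per_chunk * embed_price_per_m / 1e6   # one-time
    query_embed = monthly_queries * query_tokens * embed_price_per_m / 1e6
    generation = monthly_queries * (gen_input_tokens * input_price_per_m +
                                    gen_output_tokens * output_price_per_m) / 1e6
    return build, query_embed, generation

# 1M chunks x 500 tokens, $0.02/M embedding; 1M queries/month,
# 100-token queries, 2,000 in / 150 out at GPT-4o example rates:
build, qe, gen = rag_cost_breakdown(1_000_000, 500, 0.02,
                                    1_000_000, 100,
                                    2_000, 150, 2.50, 10.00)
print(round(build, 2), round(qe, 2), round(gen, 2))
# ~$10 build, $2/month query embedding, $6,500/month generation
```

The three orders of magnitude between embedding spend and generation spend are the whole point: optimize the generation side first.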