Intermediate · 4 min read

How Do You Estimate the Cost of Running a Production LLM System?

Walk through how to estimate and model the cost of running an LLM system in production — covering API token costs, open source GPU infra, and key levers for optimization.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

AI teams routinely underestimate production costs and get surprised after launch. Interviewers ask this to test whether you can think quantitatively about system design — and whether you understand the economic tradeoffs between managed APIs and self-hosted models.

Key Concepts to Cover

  • SaaS API cost model — input/output token pricing, request volume projections
  • Open source cost model — GPU types, utilization, inference throughput
  • Cost drivers — prompt length, context window usage, batch vs. real-time
  • Optimization levers — model selection, caching, batching, quantization
  • Break-even analysis — when to switch from API to self-hosted

How to Approach This

1. SaaS API Cost Model

API costs are primarily driven by token count. Start with:

Monthly cost = (avg_input_tokens + avg_output_tokens) × requests_per_month × price_per_token

Example calculation:

  • Application: customer support chatbot
  • Avg input tokens: 500 (system prompt 200 + conversation history 200 + user query 100)
  • Avg output tokens: 150
  • Requests/day: 10,000 → 300,000/month
  • Model: GPT-4o at $2.50/M input, $10/M output
Input cost:  500 × 300,000 × $2.50/1M = $375/month
Output cost: 150 × 300,000 × $10/1M  = $450/month
Total: ~$825/month
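The arithmetic above can be sketched as a small helper. The prices and traffic numbers below are the assumptions from this example, not universal figures:

```python
def monthly_api_cost(avg_input_tokens, avg_output_tokens,
                     requests_per_month,
                     input_price_per_m, output_price_per_m):
    """Estimate monthly API spend from token counts and per-million-token prices."""
    input_cost = avg_input_tokens * requests_per_month * input_price_per_m / 1_000_000
    output_cost = avg_output_tokens * requests_per_month * output_price_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost

# The chatbot example: 500 in / 150 out tokens, 300K requests/month,
# $2.50/M input and $10/M output (GPT-4o list prices at time of writing)
inp, out, total = monthly_api_cost(500, 150, 300_000, 2.50, 10.00)
print(f"Input ${inp:.0f} + Output ${out:.0f} = ${total:.0f}/month")
# Input $375 + Output $450 = $825/month
```

Keeping the formula in a function makes it easy to stress-test assumptions — double the prompt length or 10× the traffic and re-run.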

Key insight: output tokens cost more per token (4× here), but input tokens usually dominate total spend in practice, because system prompts and retrieved RAG context can add 1,000-4,000 tokens per request.

2. Open Source Hosting Cost Model

Self-hosted models eliminate per-token fees but add GPU infrastructure costs.

Key variables:

  • GPU type and hourly rate (A100 80GB ~$3-4/hr on-demand, H100 ~$5-8/hr)
  • Model size and GPU memory requirements
  • Throughput: tokens/second per GPU
  • Utilization: what fraction of time is the GPU busy?

Example: hosting Llama 3 70B

  • Requires 2× A100 80GB (or 4× A100 40GB) for FP16
  • With quantization (INT8/INT4): fits on 1-2× A100
  • 2× A100 reserved at ~$2/hr each = $4/hr → $2,880/month for 720 hrs
  • Typical throughput: ~500-1,000 output tokens/sec at batch size 8

Break-even: if API cost exceeds hosting cost at your volume, self-hosting wins economically. But factor in engineering time, reliability/on-call overhead, and model update management.
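A rough break-even check under the assumptions above (2× A100 reserved at $2/hr, the chatbot's token profile). Real throughput and utilization shift these numbers substantially:

```python
GPU_MONTHLY = 2 * 2.00 * 720   # 2x A100 reserved at $2/hr, 720 hrs/month

def api_monthly(requests_per_month, in_tok=500, out_tok=150,
                in_price=2.50, out_price=10.00):
    """API spend for a month, using the chatbot example's token profile."""
    return requests_per_month * (in_tok * in_price + out_tok * out_price) / 1_000_000

# Volume at which API spend crosses the fixed GPU reservation cost
per_request = api_monthly(1)                 # $ per request
breakeven_requests = GPU_MONTHLY / per_request

print(f"GPU reservation: ${GPU_MONTHLY:.0f}/month")          # $2880/month
print(f"Break-even: ~{breakeven_requests:,.0f} requests/month")  # ~1,047,273
```

Below roughly a million requests per month at this token profile, the API is cheaper on raw compute alone — before counting any of the ops overhead of self-hosting.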

3. The Context Window Is Your Biggest Cost Lever

Long system prompts and large RAG contexts inflate costs significantly:

  • A 4,000-token system prompt spends 8× the prompt tokens of a 500-token one, on every single request
  • RAG with 10 chunks × 500 tokens = 5,000 additional input tokens per query
  • Conversation history grows with each turn — implement a rolling window or summarization

Optimizations:

  • Minimize system prompt length (remove boilerplate, use concise instructions)
  • Reduce k in RAG retrieval — retrieve 3 chunks, not 10
  • Cache identical or near-identical prompts (semantic caching with a vector similarity threshold)
  • Use a prompt cache if the API supports it (Anthropic, Google — prefix caching discounts repeated context)
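A minimal semantic-cache sketch of the third optimization. The bag-of-words `embed` is a stand-in for illustration (production systems use a real embedding model), and the 0.9 similarity threshold is an assumption to tune against your traffic:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": lowercase bag of words, punctuation stripped
    return Counter(text.lower().replace("?", "").replace("!", "").split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []            # list of (embedding, cached response)
        self.threshold = threshold

    def get(self, prompt):
        q = embed(prompt)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response      # cache hit: skip the LLM call entirely
        return None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
print(cache.get("How do I reset my password?"))  # hit despite surface differences
```

Every cache hit is a request whose input *and* output tokens cost nothing, which is why caching pays off fastest on high-volume, repetitive workloads like support bots.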

4. Model Selection as a Cost Lever

Not all requests need GPT-4o. Route by complexity:

  • Simple queries, classification, extraction → GPT-4o mini, Claude Haiku (~10-20× cheaper)
  • Complex reasoning, open-ended generation → full model

With a model ~15× cheaper, a router that sends 70% of traffic to it cuts blended cost to roughly a third of baseline (0.3 + 0.7/15 ≈ 0.35); route 90% and you approach a 6× reduction, with minimal quality impact on simple tasks.
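A toy sketch of the routing idea. The length/keyword heuristics, model names, and the ~15× price ratio are illustrative assumptions — real routers often use a small classifier or an LLM-based judge:

```python
CHEAP, FULL = "gpt-4o-mini", "gpt-4o"
REASONING_CUES = ("why", "explain", "compare", "design", "debug")

def route(query: str) -> str:
    """Send short queries without reasoning cues to the cheap model."""
    q = query.lower()
    if len(q.split()) <= 20 and not any(cue in q for cue in REASONING_CUES):
        return CHEAP
    return FULL

def blended_cost_ratio(cheap_share, cheap_price_ratio=1/15):
    """Blended cost relative to sending everything to the full model."""
    return (1 - cheap_share) + cheap_share * cheap_price_ratio

print(route("Cancel my subscription"))                         # gpt-4o-mini
print(route("Explain why my deployment keeps crash-looping"))  # gpt-4o
print(f"{blended_cost_ratio(0.7):.2f}")                        # 0.35
```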

5. Quick Cost Comparison Framework

| Factor | SaaS API | Self-hosted |
|---|---|---|
| Fixed cost | None | GPU reservation |
| Variable cost | Per token | Electricity/cloud marginal |
| Best for | Variable traffic, early stage | Steady high volume |
| Risk | Vendor price changes | GPU availability, ops overhead |
| Break-even | — | ~$5-15K/month API spend |

Common Follow-ups

  1. "How would you handle cost spikes from unexpectedly long conversations?" Implement hard token limits per session, use summarization to compress history above a threshold, and set up cost alerting. Charge-back to cost centers if multi-tenant.
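The rolling-window idea from this answer can be sketched as follows. Whitespace token counting is a crude stand-in for the model's real tokenizer, and the budget is an assumed value:

```python
MAX_HISTORY_TOKENS = 1_000   # assumed per-session budget

def n_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(turns, budget=MAX_HISTORY_TOKENS):
    """Keep only the most recent conversation turns that fit in the token budget."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest -> oldest
        cost = n_tokens(turn)
        if used + cost > budget:
            break                         # oldest turns fall off the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

In practice you would summarize the dropped turns into a compact preamble rather than discard them outright, trading a small summarization cost for bounded per-request input tokens.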

  2. "When does self-hosting not make sense even at high volume?" When the model is updated frequently (you'd need to re-deploy constantly), when you need the absolute frontier model (only available via API), or when your traffic is highly bursty and GPU utilization would be <20%.

  3. "How would you estimate cost for a RAG system specifically?" Add embedding costs: embedding 1M document chunks at build time (~$0.02/1M tokens for text-embedding-3-small) is a one-time cost. Query-time embedding is cheap — 100 tokens × 1M requests = $2. The dominant cost in a RAG system is LLM generation, not retrieval.
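Plugging in the numbers from this answer (the ~$0.02/M-token embedding price is illustrative):

```python
EMBED_PRICE_PER_M = 0.02   # $/M tokens, e.g. text-embedding-3-small

def one_time_index_cost(n_chunks, tokens_per_chunk=500):
    """Cost to embed the document corpus once at index-build time."""
    return n_chunks * tokens_per_chunk * EMBED_PRICE_PER_M / 1_000_000

def query_embedding_cost(n_queries, tokens_per_query=100):
    """Ongoing cost to embed incoming queries."""
    return n_queries * tokens_per_query * EMBED_PRICE_PER_M / 1_000_000

print(f"Index 1M chunks: ${one_time_index_cost(1_000_000):.2f} one-time")  # $10.00
print(f"1M query embeds: ${query_embedding_cost(1_000_000):.2f}")          # $2.00
```

Ten dollars to index a million chunks versus hundreds per month for generation — retrieval-side costs round to zero next to the LLM calls.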
