Intermediate · 4 min read

How Do You Estimate the Cost of Running a Production LLM System?

Walk through how to estimate and model the cost of running an LLM system in production — covering API token costs, open source GPU infra, and key levers for optimization.

Prep for the full interview loop

Know the concepts. Now prove it. Practice GenAI, Coding, System Design, and AI/ML Design interviews with an AI that tells you exactly where you fell short.

Start a mock interview

Why This Is Asked

AI teams routinely underestimate production costs and get surprised after launch. Interviewers ask this to test whether you can think quantitatively about system design — and whether you understand the economic tradeoffs between managed APIs and self-hosted models.

Key Concepts to Cover

  • SaaS API cost model — input/output token pricing, request volume projections
  • Open source cost model — GPU types, utilization, inference throughput
  • Cost drivers — prompt length, context window usage, batch vs. real-time
  • Optimization levers — model selection, caching, batching, quantization
  • Break-even analysis — when to switch from API to self-hosted

How to Approach This

1. SaaS API Cost Model

API costs are primarily driven by token count. Start with:

Monthly cost = (avg_input_tokens + avg_output_tokens) × requests_per_month × price_per_token

Example calculation:

  • Application: customer support chatbot
  • Avg input tokens: 500 (system prompt 200 + conversation history 200 + user query 100)
  • Avg output tokens: 150
  • Requests/day: 10,000 → 300,000/month
  • Model: GPT-4o at $2.50/M input, $10/M output
Input cost:  500 × 300,000 × $2.50/1M = $375/month
Output cost: 150 × 300,000 × $10/1M  = $450/month
Total: ~$825/month
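The arithmetic above can be sketched as a small helper. The prices and traffic numbers below are the assumptions from this example, not universal figures:

```python
def monthly_api_cost(avg_input_tokens, avg_output_tokens,
                     requests_per_month,
                     input_price_per_m, output_price_per_m):
    """Estimate monthly API spend from token counts and per-million-token prices."""
    input_cost = avg_input_tokens * requests_per_month * input_price_per_m / 1_000_000
    output_cost = avg_output_tokens * requests_per_month * output_price_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost

# The chatbot example: 500 in / 150 out tokens, 300K requests/month,
# $2.50/M input and $10/M output (GPT-4o list prices at time of writing)
inp, out, total = monthly_api_cost(500, 150, 300_000, 2.50, 10.00)
print(f"Input ${inp:.0f} + Output ${out:.0f} = ${total:.0f}/month")
# Input $375 + Output $450 = $825/month
```

Keeping the formula in a function makes it easy to stress-test assumptions — double the prompt length or 10× the traffic and re-run.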

Key insight: output tokens cost more per token (4× here), but input tokens usually dominate total spend in practice, because system prompts and retrieved RAG context can add 1,000-4,000 tokens per request.

2. Open Source Hosting Cost Model

Self-hosted models eliminate per-token fees but add GPU infrastructure costs.

Key variables:

  • GPU type and hourly rate (A100 80GB ~$3-4/hr on-demand, H100 ~$5-8/hr)
  • Model size and GPU memory requirements
  • Throughput: tokens/second per GPU
  • Utilization: what fraction of time is the GPU busy?

Example: hosting Llama 3 70B

  • Requires 2× A100 80GB (or 4× A100 40GB) for FP16
  • With quantization (INT8/INT4): fits on 1-2× A100
  • 2× A100 reserved at ~$2/hr each = $4/hr → $2,880/month for 720 hrs
  • Typical throughput: ~500-1,000 output tokens/sec at batch size 8

Break-even: if API cost exceeds hosting cost at your volume, self-hosting wins economically. But factor in engineering time, reliability/on-call overhead, and model update management.
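A rough break-even check under the assumptions above (2× A100 reserved at $2/hr, the chatbot's token profile). Real throughput and utilization shift these numbers substantially:

```python
GPU_MONTHLY = 2 * 2.00 * 720   # 2x A100 reserved at $2/hr, 720 hrs/month

def api_monthly(requests_per_month, in_tok=500, out_tok=150,
                in_price=2.50, out_price=10.00):
    """API spend for a month, using the chatbot example's token profile."""
    return requests_per_month * (in_tok * in_price + out_tok * out_price) / 1_000_000

# Volume at which API spend crosses the fixed GPU reservation cost
per_request = api_monthly(1)                 # $ per request
breakeven_requests = GPU_MONTHLY / per_request

print(f"GPU reservation: ${GPU_MONTHLY:.0f}/month")          # $2880/month
print(f"Break-even: ~{breakeven_requests:,.0f} requests/month")  # ~1,047,273
```

Below roughly a million requests per month at this token profile, the API is cheaper on raw compute alone — before counting any of the ops overhead of self-hosting.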

3. The Context Window Is Your Biggest Cost Lever

Long system prompts and large RAG contexts inflate costs significantly:

  • A 4,000-token system prompt spends 8× the prompt tokens of a 500-token one, on every single request
  • RAG with 10 chunks × 500 tokens = 5,000 additional input tokens per query
  • Conversation history grows with each turn — implement a rolling window or summarization

Optimizations:

  • Minimize system prompt length (remove boilerplate, use concise instructions)
  • Reduce k in RAG retrieval — retrieve 3 chunks, not 10
  • Cache identical or near-identical prompts (semantic caching with a vector similarity threshold)
  • Use a prompt cache if the API supports it (Anthropic, Google — prefix caching discounts repeated context)
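A minimal semantic-cache sketch of the third optimization. The bag-of-words `embed` is a stand-in for illustration (production systems use a real embedding model), and the 0.9 similarity threshold is an assumption to tune against your traffic:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in "embedding": lowercase bag of words, punctuation stripped
    return Counter(text.lower().replace("?", "").replace("!", "").split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.entries = []            # list of (embedding, cached response)
        self.threshold = threshold

    def get(self, prompt):
        q = embed(prompt)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response      # cache hit: skip the LLM call entirely
        return None

    def put(self, prompt, response):
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("how do I reset my password", "Use the 'Forgot password' link.")
print(cache.get("How do I reset my password?"))  # hit despite surface differences
```

Every cache hit is a request whose input *and* output tokens cost nothing, which is why caching pays off fastest on high-volume, repetitive workloads like support bots.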

4. Model Selection as a Cost Lever

Not all requests need GPT-4o. Route by complexity:

  • Simple queries, classification, extraction → GPT-4o mini, Claude Haiku (~10-20× cheaper)
  • Complex reasoning, open-ended generation → full model

With a model ~15× cheaper, a router that sends 70% of traffic to it cuts blended cost to roughly a third of baseline (0.3 + 0.7/15 ≈ 0.35); route 90% and you approach a 6× reduction, with minimal quality impact on simple tasks.
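A toy sketch of the routing idea. The length/keyword heuristics, model names, and the ~15× price ratio are illustrative assumptions — real routers often use a small classifier or an LLM-based judge:

```python
CHEAP, FULL = "gpt-4o-mini", "gpt-4o"
REASONING_CUES = ("why", "explain", "compare", "design", "debug")

def route(query: str) -> str:
    """Send short queries without reasoning cues to the cheap model."""
    q = query.lower()
    if len(q.split()) <= 20 and not any(cue in q for cue in REASONING_CUES):
        return CHEAP
    return FULL

def blended_cost_ratio(cheap_share, cheap_price_ratio=1/15):
    """Blended cost relative to sending everything to the full model."""
    return (1 - cheap_share) + cheap_share * cheap_price_ratio

print(route("Cancel my subscription"))                         # gpt-4o-mini
print(route("Explain why my deployment keeps crash-looping"))  # gpt-4o
print(f"{blended_cost_ratio(0.7):.2f}")                        # 0.35
```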

5. Quick Cost Comparison Framework

| Factor | SaaS API | Self-hosted |
|---|---|---|
| Fixed cost | None | GPU reservation |
| Variable cost | Per token | Electricity/cloud marginal |
| Best for | Variable traffic, early stage | Steady high volume |
| Risk | Vendor price changes | GPU availability, ops overhead |
| Break-even | — | ~$5-15K/month API spend |

Common Follow-ups

  1. "How would you handle cost spikes from unexpectedly long conversations?" Implement hard token limits per session, use summarization to compress history above a threshold, and set up cost alerting. Charge-back to cost centers if multi-tenant.
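The rolling-window idea from this answer can be sketched as follows. Whitespace token counting is a crude stand-in for the model's real tokenizer, and the budget is an assumed value:

```python
MAX_HISTORY_TOKENS = 1_000   # assumed per-session budget

def n_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(turns, budget=MAX_HISTORY_TOKENS):
    """Keep only the most recent conversation turns that fit in the token budget."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest -> oldest
        cost = n_tokens(turn)
        if used + cost > budget:
            break                         # oldest turns fall off the window
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

In practice you would summarize the dropped turns into a compact preamble rather than discard them outright, trading a small summarization cost for bounded per-request input tokens.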

  2. "When does self-hosting not make sense even at high volume?" When the model is updated frequently (you'd need to re-deploy constantly), when you need the absolute frontier model (only available via API), or when your traffic is highly bursty and GPU utilization would be <20%.

  3. "How would you estimate cost for a RAG system specifically?" Add embedding costs: embedding 1M document chunks at build time (~$0.02/1M tokens for text-embedding-3-small) is a one-time cost. Query-time embedding is cheap — 100 tokens × 1M requests = $2. The dominant cost in a RAG system is LLM generation, not retrieval.
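Plugging in the numbers from this answer (the ~$0.02/M-token embedding price is illustrative):

```python
EMBED_PRICE_PER_M = 0.02   # $/M tokens, e.g. text-embedding-3-small

def one_time_index_cost(n_chunks, tokens_per_chunk=500):
    """Cost to embed the document corpus once at index-build time."""
    return n_chunks * tokens_per_chunk * EMBED_PRICE_PER_M / 1_000_000

def query_embedding_cost(n_queries, tokens_per_query=100):
    """Ongoing cost to embed incoming queries."""
    return n_queries * tokens_per_query * EMBED_PRICE_PER_M / 1_000_000

print(f"Index 1M chunks: ${one_time_index_cost(1_000_000):.2f} one-time")  # $10.00
print(f"1M query embeds: ${query_embedding_cost(1_000_000):.2f}")          # $2.00
```

Ten dollars to index a million chunks versus hundreds per month for generation — retrieval-side costs round to zero next to the LLM calls.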
