💰 Cost-Optimized GPU Deployments for LLMs on AWS EKS
How we cut LLM inference costs by 70% without sacrificing performance
We had built a blazing-fast LLM inference API—FastAPI-powered, containerized, and deployed on AWS EKS.
It served quantized versions of Mistral and LLaMA models for internal workloads and client projects. Everything worked great... until the AWS bill arrived.
📉 The Problem: GPU Burnout in the Cloud
Our monthly bill spiked past $12,000, most of it from idle GPU instances. Even though requests were sporadic, the GPU nodes stayed warm 24/7.
Here's the raw truth of AWS GPU pricing (2025):
| Instance Type | GPU | On-Demand | Spot (avg) |
|---|---|---|---|
| g5.xlarge | 1x A10G | ~$1.20/hr | ~$0.40/hr |
| p3.2xlarge | 1x V100 | ~$3.06/hr | ~$0.90/hr |
At one point, we had 20 GPU nodes with idle pods just waiting to be hit. That's $7,000+ per month, wasted.
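To make the waste concrete, here's a quick back-of-the-envelope sketch. The hourly rates come from the pricing table above; the ~730-hour month is our own simplifying assumption:

```python
# Back-of-the-envelope GPU fleet cost. Rates are the approximate
# 2025 figures quoted above; 730 hours is an average month.

def monthly_cost(nodes: int, hourly_rate: float, hours_per_month: float = 730) -> float:
    """Cost of keeping `nodes` instances warm for a full month."""
    return nodes * hourly_rate * hours_per_month

on_demand = monthly_cost(20, 1.20)  # 20x g5.xlarge, On-Demand
spot = monthly_cost(20, 0.40)       # the same fleet on Spot

print(f"On-Demand: ${on_demand:,.0f}/mo")
print(f"Spot:      ${spot:,.0f}/mo")
print(f"Delta:     ${on_demand - spot:,.0f}/mo")
```

Even before any autoscaling, the Spot delta alone dwarfs what most teams spend on the rest of their cluster.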
So, we went back to the drawing board—and re-architected our setup with cost optimization as a core pillar.
🧭 Our Optimization Playbook: Lessons from the Field
Below are the five key strategies we applied while running LLMs in production—along with real YAML examples, theory, and tooling tips.
🏗️ Final Architecture Overview
We split the API and model workloads across CPU and GPU nodes:
```
+------------------------+
|          ALB           |
+-----------+------------+
            |
+-----------v------------+
|    Ingress (Nginx)     |
+-----------+------------+
            |
+-----------v------------+
|   FastAPI (CPU Node)   |
+-----------+------------+
            |
+-----------v------------+
| LLM Worker Pods (GPU)  |
|   - Mistral / LLaMA    |
|   - Quantized models   |
+------------------------+
```

- FastAPI runs on cheaper CPU nodes
- LLM workers are isolated to GPU nodes (Spot-preferred, On-Demand fallback)
- Autoscaling handled by Karpenter and HPA
- Visibility via Datadog and Kubecost
🎯 Strategy 1: Isolate GPU Workloads with Taints & Tolerations
We noticed non-LLM pods (FastAPI, DB sidecars) were scheduling on GPU nodes. This was a silent cost leak.
🎓 Taints prevent unwanted workloads from running on special nodes. Tolerations allow specific pods to bypass that restriction.
➤ What We Did
We tainted GPU nodes:
```shell
kubectl taint nodes <node-name> accelerator=nvidia:NoSchedule
```
And added tolerations to LLM workers:
```yaml
spec:
  template:
    spec:
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: llm-worker
          image: myregistry/mistral-api:latest
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
```

✅ This ensured GPU nodes were used only when needed, and only by GPU workloads.
🎯 Strategy 2: Spot First, On-Demand Fallback with Karpenter
We initially ran everything on On-Demand nodes. Costly mistake.
🎓 Spot Instances can be 60–80% cheaper, but are interruptible. That’s where fallback provisioning saves the day.
➤ What We Did
Configured Karpenter with both Spot and On-Demand provisioners:
Spot Provisioner:
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g5.xlarge"]
  taints:
    - key: "accelerator"
      value: "nvidia"
      effect: "NoSchedule"
```

On-Demand fallback:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-ondemand
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
```

✅ Now our cluster prefers Spot for GPU pods, but gracefully falls back to On-Demand during high traffic or Spot unavailability.
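One detail worth spelling out: when two Provisioners both match a pending pod, Karpenter needs a hint about which to try first. In the v1alpha5 API, that hint is `spec.weight` (higher wins). A sketch of biasing toward Spot, reusing the `gpu-spot` name from above:

```yaml
# Sketch: a higher weight makes Karpenter evaluate this Provisioner
# before lower-weight ones, so Spot capacity is tried first.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  weight: 100   # evaluated before the unweighted On-Demand fallback
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
```

(Newer Karpenter releases replace Provisioner with NodePool, which carries the same `weight` field.)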
🎯 Strategy 3: Quantize Models to Fit More in Less
Loading full-precision (FP16) LLMs was overkill for our use case. Most of our responses were under 200 tokens.
🎓 Quantization reduces model size and GPU memory needs by converting weights to lower precision (e.g., INT8).
➤ What We Did
We quantized LLaMA2 7B using AutoGPTQ:
```python
from auto_gptq import AutoGPTQForCausalLM

# Load the pre-quantized checkpoint straight from local disk
model = AutoGPTQForCausalLM.from_quantized("models/llama-7b-int8")
```

Before: ~14 GB → After: ~3.8 GB
✅ This let us run 2–3 models per GPU, improving utilization and enabling batching for concurrent inference.
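The memory numbers fall out of simple arithmetic: each of the ~7B parameters costs 2 bytes at FP16 and proportionally less at lower precision. (As a sanity check, a true 8-bit model would land closer to ~7 GB; our ~3.8 GB footprint suggests mostly 4-bit weights plus overhead, which is GPTQ's usual setting.) A rough sketch:

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Real checkpoints add overhead (embeddings in higher precision,
# quantization scales/zero-points), so treat these as lower bounds.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(7e9, 16)  # ~14.0 GB, matches "Before"
int8 = weight_memory_gb(7e9, 8)   # ~7.0 GB
int4 = weight_memory_gb(7e9, 4)   # ~3.5 GB, near the observed ~3.8 GB
```

On a 24 GB A10G, the FP16 model barely leaves room for activations and KV cache; the quantized one leaves room for two or three.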
🎯 Strategy 4: Autoscale Based on Real Usage
Initially, we ran a static number of pods—leading to overprovisioning during quiet hours.
🎓 Karpenter scales nodes, HPA scales pods. Together, they handle traffic bursts intelligently.
➤ What We Did
Enabled autoscaling using HPA:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

✅ Our pods now scale with traffic, and Karpenter provisions GPU nodes only when needed.
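Under the hood, the HPA applies a simple ratio rule: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch of that rule with the bounds from our manifest:

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Core HPA scaling rule: ceil(current * currentMetric / target),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# A burst pushes average CPU to 140% of requests with 2 replicas running:
desired_replicas(2, 140, 70)  # scales out to 4
# Traffic dies down to 10% across 4 replicas:
desired_replicas(4, 10, 70)   # scales back in to the minimum of 1
```

Worth knowing when tuning: because the rule is proportional, a target of 70% gives you ~30% headroom per pod to absorb a burst while new GPU nodes are still being provisioned.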
🎯 Strategy 5: Continuous Cost Monitoring with Cast AI + Kubecost
After optimization, we needed proof: Were we really saving money?
➤ What We Did
- Used Kubecost to visualize per-pod and per-node cost
- Integrated Cast AI for cost intelligence and autoscaling suggestions
🧠 Cast AI showed us idle GPU nodes and even recommended cheaper instance types!
✅ We trimmed ~40% extra costs just by switching to better instance types and enabling auto shutdown during off-hours.
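The off-hours shutdown figure is easy to sanity-check: if GPU capacity is only needed during working hours, the always-on fraction of a 168-hour week is surprisingly small. A sketch, where the 10-hour, 5-day window is an assumed schedule rather than our exact one:

```python
def always_on_fraction(hours_per_day: float = 10, days_per_week: int = 5) -> float:
    """Fraction of the 168-hour week a scheduled fleet stays up."""
    return hours_per_day * days_per_week / (24 * 7)

always_on_fraction()  # ~0.30, i.e. ~70% of idle node-hours avoidable
```

In practice you rarely capture the full theoretical saving, since some capacity stays warm for latency, but it shows why scheduling is one of the highest-leverage knobs.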
🧵 TL;DR – Our Journey to Affordable LLM Inference on EKS
- We started with expensive On-Demand GPU nodes and no scaling.
- Introduced taints, Spot provisioning, and quantization.
- Adopted Karpenter, HPA, Cast AI, and Kubecost.
- End result: ~70% savings, full performance, no vendor lock-in.
📬 Stay Connected with The Vowels of X
At The Vowels of X, we share insights across three key verticals:
🛠️ DevOps — Tools, automation, CI/CD, observability
🚗 Automobile — Industry trends, mobility tech, and innovation
🏥 Healthcare — Digital health, AI in medicine, and tech infrastructure
If any of this interests you, stay in the loop:
📬 Newsletter: thevowelsofx.beehiiv.com
💼 LinkedIn: linkedin.com/company/thevowelsofx
✍️ Medium: medium.com/@thevowelsofx
🔁 Feel free to follow, share, and drop a comment if you found this useful!