💰 Cost-Optimized GPU Deployments for LLMs on AWS EKS

How we cut LLM inference costs by 70% without sacrificing performance

We had built a blazing-fast LLM inference API—FastAPI-powered, containerized, and deployed on AWS EKS.
It served quantized versions of Mistral and LLaMA models for internal workloads and client projects. Everything worked great... until the AWS bill arrived.

📉 The Problem: GPU Burnout in the Cloud

We saw our monthly bill spike to $12,000+, most of it from idle GPU instances. Even though requests were sporadic, the GPU nodes stayed warm 24/7.
Here's the raw truth of AWS GPU pricing (2025):

| Instance Type | GPU     | On-Demand | Spot (avg) |
|---------------|---------|-----------|------------|
| g5.xlarge     | 1x A10G | ~$1.20/hr | ~$0.40/hr  |
| p3.2xlarge    | 1x V100 | ~$3.06/hr | ~$0.90/hr  |

At one point, we had 20 GPU nodes with idle pods just waiting to be hit. That’s $7000+ per month, wasted.

So, we went back to the drawing board—and re-architected our setup with cost optimization as a core pillar.

🧭 Our Optimization Playbook: Lessons from the Field

Below are the five key strategies we applied while running LLMs in production—along with real YAML examples, theory, and tooling tips.

🏗️ Final Architecture Overview

We split the API and model workloads across CPU and GPU nodes:

         +-----------------------+
         |          ALB          |
         +-----------+-----------+
                     |
         +-----------v-----------+
         |    Ingress (Nginx)    |
         +-----------+-----------+
                     |
      +--------------v---------------+
      |     FastAPI (CPU Node)       |
      +--------------+---------------+
                     |
       +-------------v-------------+
       |  LLM Worker Pods (GPU)    |
       |    - Mistral / LLaMA      |
       |    - Quantized models     |
       +---------------------------+

  • FastAPI runs on cheaper CPU nodes

  • LLM workers are isolated to GPU nodes (spot-preferred, on-demand fallback)

  • Autoscaling handled by Karpenter and HPA

  • Visibility via Datadog and Kubecost

🎯 Strategy 1: Isolate GPU Workloads with Taints & Tolerations

We noticed non-LLM pods (FastAPI, DB sidecars) were scheduling on GPU nodes. This was a silent cost leak.

🎓 Taints prevent unwanted workloads from running on special nodes. Tolerations allow specific pods to bypass that restriction.

➤ What We Did

We tainted GPU nodes:

kubectl taint nodes <node-name> accelerator=nvidia:NoSchedule

And added tolerations to LLM workers:

yaml:

spec:
  template:
    spec:
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
      containers:
        - name: llm-worker
          image: myregistry/mistral-api:latest
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1

✅ This ensured GPU nodes were used only when needed, only by GPU workloads.
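One subtlety worth calling out: a toleration only *permits* scheduling onto a tainted node; it doesn't force the pod there or keep it off CPU nodes. Pairing the toleration with a nodeSelector closes that gap. A sketch, where the `node-type: gpu` label is an assumed convention (apply it with `kubectl label nodes <node-name> node-type=gpu`, or via your node provisioner's labels):

```yaml
# Sketch: toleration + nodeSelector together. The toleration lets the
# pod onto tainted GPU nodes; the nodeSelector ensures it lands only
# there. "node-type: gpu" is an assumed label, not a Kubernetes built-in.
spec:
  template:
    spec:
      nodeSelector:
        node-type: gpu
      tolerations:
        - key: "accelerator"
          operator: "Equal"
          value: "nvidia"
          effect: "NoSchedule"
```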

🎯 Strategy 2: Spot First, On-Demand Fallback with Karpenter

We initially ran everything on On-Demand nodes. Costly mistake.

🎓 Spot Instances can be 60–80% cheaper, but are interruptible. That’s where fallback provisioning saves the day.

➤ What We Did

Configured Karpenter with both Spot and On-Demand provisioners:

Spot Provisioner:

yaml:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["g5.xlarge"]
  taints:
    - key: "accelerator"
      value: "nvidia"
      effect: "NoSchedule"

On-Demand fallback:

yaml:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-ondemand
spec:
  requirements:
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]

✅ Now our cluster prefers Spot for GPU pods, but gracefully falls back to On-Demand during high traffic or Spot unavailability.
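One way to make that Spot-first ordering explicit, rather than relying on default behavior, is Karpenter's provisioner `weight` field: higher-weight provisioners are considered first, so Spot wins whenever capacity is available. A sketch, assuming the v1alpha5 Provisioner API shown above (the numeric values are illustrative):

```yaml
# Sketch: weights make the Spot-first ordering explicit. Karpenter
# considers higher-weight provisioners first, so the on-demand
# provisioner only kicks in when Spot capacity is unavailable.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot
spec:
  weight: 100   # preferred
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-ondemand
spec:
  weight: 10    # fallback
```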

🎯 Strategy 3: Quantize Models to Fit More in Less

Loading full-precision (FP16) LLMs was overkill for our use case. Most of our responses were under 200 tokens.

🎓 Quantization reduces model size and GPU memory needs by converting weights to lower precision (e.g., INT8).

➤ What We Did

We quantized LLaMA2 7B using AutoGPTQ:

python:

from auto_gptq import AutoGPTQForCausalLM

# Load the quantized weights directly onto the GPU
model = AutoGPTQForCausalLM.from_quantized("models/llama-7b-int8", device="cuda:0")

Before: ~14 GB → After: ~3.8 GB
✅ This let us run 2–3 models per GPU, improving utilization and enabling batching for concurrent inference.
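The memory arithmetic behind that drop is easy to sanity-check. A back-of-envelope sketch for weights only (it ignores activations, the KV cache, and quantization overhead such as scales and zero-points; note the ~3.8 GB we observed is closer to 4-bit weights than true INT8, which is typical for GPTQ exports):

```python
# Rough weight-memory math for a 7B-parameter model at different
# precisions. Approximate: ignores activations, KV cache, and
# quantization overhead (scales, zero-points).
PARAMS = 7e9

def model_size_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB for a given precision."""
    return params * bits_per_weight / 8 / 1e9

fp16 = model_size_gb(PARAMS, 16)  # ~14.0 GB -- matches the "before"
int8 = model_size_gb(PARAMS, 8)   # ~7.0 GB
int4 = model_size_gb(PARAMS, 4)   # ~3.5 GB -- closest to the observed ~3.8 GB

print(f"FP16: {fp16:.1f} GB, INT8: {int8:.1f} GB, INT4: {int4:.1f} GB")
```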

🎯 Strategy 4: Autoscale Based on Real Usage

Initially, we ran a static number of pods—leading to overprovisioning during quiet hours.

🎓 Karpenter scales nodes, HPA scales pods. Together, they handle traffic bursts intelligently.

➤ What We Did

Enabled autoscaling using HPA:

yaml:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

✅ Our pods now scale with traffic, and Karpenter provisions GPU nodes only when needed.
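One tuning note: LLM worker pods take minutes to pull images and load model weights, so aggressive scale-down can thrash. The `autoscaling/v2` API's `behavior` block lets you damp that. A sketch that would extend the HPA spec above (the numbers are illustrative, not tuned recommendations):

```yaml
# Sketch: damped scale-down for slow-starting GPU pods.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min of low load first
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300            # then remove at most 1 pod per 5 min
```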

🎯 Strategy 5: Continuous Cost Monitoring with Cast AI + Kubecost

After optimization, we needed proof: Were we really saving money?

➤ What We Did

  • Used Kubecost to visualize per-pod and per-node cost

  • Integrated Cast AI for cost intelligence and autoscaling suggestions

🧠 Cast AI showed us idle GPU nodes and even recommended cheaper instance types!

✅ We trimmed a further ~40% in costs just by switching to better-suited instance types and enabling auto-shutdown during off-hours.

🧵 TL;DR – Our Journey to Affordable LLM Inference on EKS

  • We started with expensive on-demand GPU nodes and no scaling.

  • Introduced taints, Spot provisioning, and quantization.

  • Adopted Karpenter, HPA, Cast AI, and Kubecost.

End result: ~70% savings, full performance, no vendor lock-in.

📬 Stay Connected with The Vowels of X

At The Vowels of X, we share insights across three key verticals:

🛠️ DevOps — Tools, automation, CI/CD, observability
🚗 Automobile — Industry trends, mobility tech, and innovation
🏥 Healthcare — Digital health, AI in medicine, and tech infrastructure

If any of this interests you, stay in the loop:

🔁 Feel free to follow, share, and drop a comment if you found this useful!
