🚀 Build & Deploy a Lightweight LLM API with FastAPI, Docker & Kubernetes

In this post, you'll learn how to deploy a fast and efficient text generation API using FastAPI and Hugging Face Transformers. We'll containerize the app with Docker, deploy it to Kubernetes (EKS-ready), and explore practical AWS strategies for running large language model (LLM) workloads, whether you're building for prototyping, production inference, or experimentation.

🔧 What You'll Build

✅ A FastAPI server with a /generate endpoint
✅ Powered by Hugging Face's distilgpt2 for fast, CPU-friendly inference
✅ Runs inside a lightweight Docker container
✅ Fully compatible with Kubernetes (EKS-ready)
✅ Includes optional GPU support for inference at scale

🧠 App Code – main.py

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup so every request reuses it.
generator = pipeline("text-generation", model="distilgpt2")

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(request: PromptRequest):
    # max_length counts the prompt tokens as well as the generated ones.
    result = generator(request.prompt, max_length=50, do_sample=True)
    return {"result": result[0]["generated_text"]}
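One detail in main.py deserves a quick worked example: in Transformers, max_length caps the prompt and the generated tokens together, so longer prompts leave less room for output. The sketch below is plain arithmetic, not part of the app; the helper name new_token_budget and the prompt token counts are hypothetical.

```python
def new_token_budget(prompt_tokens: int, max_length: int = 50) -> int:
    """Tokens left for generation when max_length caps prompt + output together."""
    if prompt_tokens < 0 or max_length < 0:
        raise ValueError("token counts must be non-negative")
    # A prompt at or beyond the cap leaves no room to generate.
    return max(0, max_length - prompt_tokens)


if __name__ == "__main__":
    # Hypothetical prompt sizes: short, medium, and longer than the cap.
    for prompt_tokens in (4, 30, 60):
        print(prompt_tokens, "prompt tokens ->", new_token_budget(prompt_tokens), "left")
```

If you would rather budget only the generated text, the generation call also accepts max_new_tokens in place of max_length.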

📦 Dockerfile

FROM python:3.10-slim
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir numpy==1.26.4 torch==2.2.2 transformers==4.41.2 fastapi uvicorn accelerate
RUN python -c "import numpy; print('NumPy version:', numpy.__version__)" && \
    python -c "import torch; print('Torch version:', torch.__version__)"
WORKDIR /app
COPY main.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

🧪 Test Locally with Curl

docker build -t fastapi-llm .
docker run -p 8080:8080 fastapi-llm
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "DevOps plays an"}'

✅ Example Output:

{
  "result": "DevOps plays an important role in the delivery of the system..."
}
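If you'd rather call the endpoint from Python than from curl, here is a minimal standard-library sketch of the same request. The function names build_generate_request and generate are my own, and the default base URL assumes the container from the docker run command above is listening on localhost:8080.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(
    prompt: str, base_url: str = "http://localhost:8080", timeout: float = 30.0
) -> str:
    """Send the request and return the 'result' field of the JSON response."""
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["result"]
```

With the container running, print(generate("DevOps plays an")) should print a completion similar to the example output. Because do_sample=True, the text differs on every call.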

[Screenshot: Docker container logs]

[Screenshot: curl output]

☸️ Kubernetes + EKS Deployment

If you're deploying to AWS EKS or another Kubernetes cluster, here are the manifests:

k8s/deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-container
          image: <your-ecr-repo>/llm-api:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
      nodeSelector:
        eks.amazonaws.com/accelerator: "nvidia"

k8s/service.yaml

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  type: LoadBalancer
  selector:
    app: llm-api
  ports:
    - port: 80
      targetPort: 8080

NGINX Ingress Controller

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Install cert-manager

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.crds.yaml
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.12.0

Let's Encrypt with ClusterIssuer

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    email: "you@example.com"  # placeholder: replace with your own email address
    server: "https://acme-v02.api.letsencrypt.org/directory"
    privateKeySecretRef:
      name: letsencrypt-prod-private-key
    solvers:
    - http01:
        ingress:
          class: nginx

k8s/ingress.yaml

# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - llm.yourdomain.com
      secretName: llm-tls  # example name; cert-manager stores the issued certificate here
  rules:
    - host: llm.yourdomain.com  # Replace with your actual domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: llm-service
                port:
                  number: 80

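✅ Don't forget to install the NVIDIA device plugin for Kubernetes on your GPU nodes. Without it, the nodes never advertise the nvidia.com/gpu resource that the Deployment requests, and the pod stays Pending.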
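Scheduling the pod onto a GPU node does not by itself move the model onto the GPU: main.py builds the pipeline without a device argument. A minimal sketch of device selection follows, assuming you want the first GPU when one is visible and CPU otherwise. The helper names pick_device and detect_cuda are hypothetical.

```python
def pick_device(cuda_available: bool) -> int:
    """Return the `device` argument for transformers.pipeline: 0 = first GPU, -1 = CPU."""
    return 0 if cuda_available else -1


def detect_cuda() -> bool:
    """Report whether PyTorch can see a CUDA device; False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    device = pick_device(detect_cuda())
    print("pipeline device:", device)
    # In main.py the pipeline line would then become (not executed here):
    # generator = pipeline("text-generation", model="distilgpt2", device=device)
```

The same image then runs unchanged on CPU-only nodes and on GPU nodes, which keeps the local Docker workflow and the GPU deployment on one code path.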

🔧 Practical AWS Guide: Strategic LLM Deployment

When you're ready to move beyond local testing and try ideas at scale, here's a playbook:

🧠 1. Define the Use Case First

Ask:

  • Is it training, fine-tuning, or inference?

  • Real-time (chatbots) vs batch (e.g. document summarization)?

  • Open models (Hugging Face)? Proprietary (OpenAI)? Or hybrid?

🧱 2. AWS Options for LLM Workloads

| Option | Best For | Notes |
| --- | --- | --- |
| Amazon SageMaker | Managed training + inference | Includes Hugging Face DLCs and autoscaling |
| SageMaker Serverless | Bursty inference, zero infra mgmt | Pay-per-inference |
| Amazon Bedrock | No infra, plug-and-play LLMs | Use Claude, LLaMA, Mistral, etc. |
| EC2 + DLAMI | Full control (GPU-enabled) | Best for power users |
| EKS/ECS + Inferentia2 | Cost-efficient long-term inference | Use Trainium for training |

🛠️ 3. GPU Recommendations

| Use Case | Instance Type |
| --- | --- |
| Small LLM inference | g5.xlarge, g5.2xlarge |
| Fine-tuning or batching | p4d, p5 |
| Cost-efficient inference | inf2.xlarge |
| Distributed training | ml.p4d.24xlarge |

💰 4. Cost & Scale Strategy

  • Use Spot Instances for training (save up to 90%)

  • Autoscale inference endpoints with SageMaker

  • Share training datasets using EFS

  • Use lifecycle policies or scheduled shutdowns
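To see how much these levers matter, here is a back-of-the-envelope sketch. The $1.00/hour rate and the 70% Spot discount are made-up round numbers for illustration, not AWS prices; plug in the current rate for your instance type and region.

```python
def monthly_cost(
    hourly_rate: float, hours_per_day: float, days: int = 30, discount: float = 0.0
) -> float:
    """Estimated monthly cost in dollars after an optional fractional discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return round(hourly_rate * hours_per_day * days * (1.0 - discount), 2)


if __name__ == "__main__":
    rate = 1.00  # hypothetical on-demand rate in $/hour, NOT a real AWS price

    always_on = monthly_cost(rate, hours_per_day=24)
    # Scheduled shutdown: 8 hours a day on 22 working days.
    office_hours = monthly_cost(rate, hours_per_day=8, days=22)
    # Spot at an assumed 70% discount, running around the clock.
    spot = monthly_cost(rate, hours_per_day=24, discount=0.70)

    print(f"always on:    ${always_on:.2f}")
    print(f"office hours: ${office_hours:.2f}")
    print(f"spot (70%):   ${spot:.2f}")
```

With these assumed numbers, a scheduled shutdown beats the Spot discount, which suggests measuring real utilization before choosing a strategy.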

🔐 5. Security Best Practices

  • Use VPC-only endpoints (for Bedrock/SageMaker)

  • Apply IAM least privilege

  • Encrypt at rest with KMS and in transit with TLS

  • Use Amazon Macie to scan sensitive training data

📦 6. AI Integrations & Tools

  • LangChain or Haystack for RAG pipelines

  • Pinecone or OpenSearch for vector search

  • Step Functions for orchestration
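If vector search is new to you, the core retrieval step in a RAG pipeline is small enough to sketch in pure Python. This is a toy illustration of cosine-similarity ranking, not how Pinecone or OpenSearch work internally; the document names and 3-dimensional "embeddings" are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], docs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(docs.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy embeddings; a real pipeline would get these from an embedding model.
    docs = {
        "k8s-guide": [0.9, 0.1, 0.0],
        "billing-faq": [0.0, 0.2, 0.9],
        "docker-notes": [0.7, 0.3, 0.1],
    }
    print(top_k([1.0, 0.2, 0.0], docs))
```

The retrieved documents are then pasted into the prompt sent to the LLM. A managed vector store replaces the linear scan with an approximate index so the lookup stays fast at millions of vectors.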

🧪 7. Experimentation Strategy

  • Use SageMaker Studio Lab for quick tests

  • Try Hugging Face on SageMaker JumpStart

  • Monitor deployments via Model Monitor

📐 Sample Reference Architecture

Client App (Chat UI/API)
         ↓
     API Gateway
         ↓
     Lambda (Auth, Routing)
         ↓
  SageMaker Endpoint (LLM)
         ↓
Optional: Vector DB (RAG)
         ↓
     Amazon S3 (Docs, Embeds)
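To make the Lambda step concrete, here is a rough sketch of a handler that forwards a prompt to a SageMaker endpoint. Several things are assumptions: the endpoint name llm-endpoint is hypothetical, the {"inputs": ...} payload follows the convention used by Hugging Face inference containers, and the runtime client is passed in so the example runs without AWS credentials. In a real Lambda you would pass boto3.client("sagemaker-runtime") instead of the fake.

```python
import io
import json

ENDPOINT_NAME = "llm-endpoint"  # hypothetical endpoint name


def handle(event: dict, runtime) -> dict:
    """Validate an API Gateway style event and forward the prompt to the endpoint."""
    try:
        prompt = json.loads(event.get("body") or "{}")["prompt"]
    except (KeyError, TypeError, json.JSONDecodeError):
        error = {"error": "expected a JSON body with a 'prompt' field"}
        return {"statusCode": 400, "body": json.dumps(error)}

    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),  # assumed payload shape
    )
    payload = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps({"result": payload})}


class FakeRuntime:
    """Stand-in for the sagemaker-runtime client so the sketch runs offline."""

    def invoke_endpoint(self, EndpointName: str, ContentType: str, Body: str) -> dict:
        prompt = json.loads(Body)["inputs"]
        output = json.dumps([{"generated_text": prompt + " ..."}]).encode("utf-8")
        return {"Body": io.BytesIO(output)}


if __name__ == "__main__":
    event = {"body": json.dumps({"prompt": "DevOps plays an"})}
    print(handle(event, FakeRuntime()))
```

Passing the client in as a parameter keeps authentication and routing logic testable on a laptop, in the same spirit as the local Docker workflow earlier in the post.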

💡 Bonus: Fast POCs for Clients

  • Amazon Bedrock: instant LLMs, no infra

  • text-generation-webui on g5.xlarge for open-source

  • Hugging Face + SageMaker JumpStart: 3-click deploy

📬 Final Thoughts

When it comes to LLM deployment, the model is only half the battle. The real challenge often lies in everything around it: getting the environment right, ensuring portability, and designing for scale.

With this approach, you get the best of all worlds:

✅ Local development is a breeze with Docker: no heavy dependencies, no surprises.
✅ Kubernetes-ready deployments (EKS or any cloud) give you flexibility to scale when needed.
✅ AWS-native options ensure you can optimize for performance, cost, and compliance, whether you're prototyping or running production inference at scale.

This setup is designed to grow with you, from solo testing on your laptop to handling enterprise-level requests across multiple regions.

✨ Try it out, extend it, break it, rebuild it. And if you're working on something similar, or planning to, I'd love to hear about it and help shape it further.

Let's simplify LLM deployment together. 🚀
