Rafael COELHO

DECEMBER 3, 2025

Enterprise MLOps platform — production Kubernetes infrastructure with full CI/CD, GitOps, and workflow orchestration

Production-grade MLOps platform on Kubernetes — Terraform, k3d, GitLab, ArgoCD, Apache Airflow, MinIO, Rancher — with end-to-end IaC, GitOps continuous delivery, and DAG-orchestrated ML pipelines. Validated against a Real-Time ML project.

#kubernetes #mlops #terraform #argocd #airflow #gitops #cicd

Executive summary

This post describes the design and implementation of a production-grade MLOps platform on Kubernetes that integrates cloud-native tools for end-to-end CI/CD pipelines, GitOps deployments, and scalable workflow orchestration.

It’s a significant evolution from the earlier local k3d cluster setup post: the source isn’t public this time (it’s a personal platform powering real production work), so this writeup focuses on how it was built and what it delivers rather than walking through config files line by line.

Business context

Over months of working on enterprise ML projects — a recommendation system for job positions, a Speech-To-Speech system (STT/ASR + TTS), an Agentic AI platform, and a personal Real-Time ML project — the need for a robust, production-ready MLOps infrastructure became evident. I needed an environment that could:

  • Replicate production conditions locally
  • Maintain full CI/CD capabilities end-to-end
  • Provide GitOps-based continuous delivery
  • Enable automated workflow orchestration for ML pipelines
  • Ensure reproducibility across operating systems
  • Deliver enterprise-grade reliability and scalability

Platform architecture

Enterprise MLOps platform architecture — Terraform provisions the k3d cluster; GitLab + ArgoCD provide CI/CD + GitOps; Apache Airflow orchestrates ML pipelines; MinIO/PostgreSQL/Redis form the data layer; Rancher provides cluster observability.

Infrastructure as Code

The entire infrastructure is provisioned with Terraform, which acts as an abstraction layer so the same definitions reproduce the platform on Linux, Windows, and macOS. Because the infrastructure is declared in version-controlled configuration, the dev, staging, and prod-like environments are all built from the same source instead of drifting apart through manual changes.
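To make the idea of drift concrete, here is a minimal sketch in Python of comparing a declared configuration with what is actually running. It is not how Terraform computes a plan, and the keys `servers`, `agents`, and `registry` are illustrative names chosen for this example, not the platform's real configuration.

```python
def find_drift(desired, observed):
    """Return {key: (desired_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(desired.keys() | observed.keys()):
        want = desired.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared state (what is in version control) versus
# hypothetical observed state (what is running right now).
declared = {"servers": 1, "agents": 3, "registry": True}
running = {"servers": 1, "agents": 2, "registry": True}

print(find_drift(declared, running))  # {'agents': (3, 2)}
```

An empty result means the running environment matches its declaration; anything else is a change that was made outside version control, or not yet applied.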

Core components

The platform consists of multiple integrated services, each serving a specific role in the MLOps ecosystem:

Container orchestration

  • k3d — lightweight Kubernetes distribution, 1 server + 3 agent nodes, prod-like behavior locally
  • Docker — container runtime
  • Helm — Kubernetes package manager for declarative app deployment
  • k3d Registry — local container registry for custom images
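For orientation, the 1 server + 3 agent topology listed above corresponds roughly to a single `k3d cluster create` invocation. The sketch below only assembles that command line in Python. The platform itself is provisioned through Terraform, so this is an equivalent written for illustration, and the cluster name `mlops` and registry name `mlops-registry` are assumptions.

```python
import shlex


def k3d_create_command(name, servers=1, agents=3, registry=None):
    """Build the argument list for `k3d cluster create`."""
    args = [
        "k3d", "cluster", "create", name,
        "--servers", str(servers),
        "--agents", str(agents),
    ]
    if registry:
        # Ask k3d to create a local image registry alongside the cluster.
        args += ["--registry-create", registry]
    return args


# Hypothetical names, used only for this example.
cmd = k3d_create_command("mlops", registry="mlops-registry")
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to pass directly to `subprocess.run` without shell quoting issues.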

GitOps & CI/CD

  • GitLab — source control + CI/CD + container registry, all in one
  • ArgoCD — GitOps continuous delivery, auto-syncing from Git repositories
  • ArgoCD Image Updater — automated detection and deployment of new image versions

Workflow orchestration

  • Apache Airflow — production-grade DAG orchestration for ML pipelines and data workflows
  • Airflow DAG Processor: dedicated DAG-file parser, running separately from the scheduler
  • Git-Sync — DAGs synchronized automatically from GitLab repositories

Data & storage

  • MinIO — S3-compatible object storage for ML artifacts, models, datasets, pipeline outputs
  • PostgreSQL — relational metadata store for Airflow + GitLab
  • Redis — in-memory cache and message queue for workflow execution

Management & monitoring

  • Rancher — centralized cluster monitoring, resource management, operational insights
  • LocalStack — AWS service emulator for local development without external dependencies

Hardware footprint

Component       Specification
RAM             64 GB
CPU             8 cores (modern x86_64)
Storage         2 TB NVMe SSD
Orchestration   k3d (local Kubernetes)

Full platform footprint: ~6–10 GB RAM total in steady state. GitLab accounts for most of that and can be trimmed through its Helm values.yaml when the resource budget is tighter; most other components are lean.

CI/CD pipeline integration

End-to-end flow from a commit to a live deployment:

  1. Source control — code repositories in the local GitLab instance
  2. Continuous integration — GitLab CI pipelines automatically triggered on code commits
  3. Container build — Docker images built within GitLab CI and pushed to the k3d Registry
  4. Auto-deploy — ArgoCD Image Updater detects new image versions in the registry
  5. GitOps sync — ArgoCD applies Kubernetes manifests from Git repositories
  6. Health monitoring — Rancher provides real-time visibility into deployment health and resource utilization

ArgoCD showing a deployed project — GitOps loop closed end-to-end, application state synced from Git to the cluster automatically.
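The "detect a new image version" step in the flow above comes down to comparing version tags. The sketch below shows that comparison for plain `MAJOR.MINOR.PATCH` tags. It is an illustration of the idea, not ArgoCD Image Updater's actual code, it ignores pre-release and build suffixes, and the function names are made up for this example.

```python
import re

# Accepts tags such as "1.4.2" or "v1.4.2"; anything else is ignored.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_semver(tag):
    """Return (major, minor, patch) as ints, or None if the tag is not semver."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def newest_tag(tags):
    """Return the tag with the highest semantic version, or None if none parse."""
    valid = [(parse_semver(tag), tag) for tag in tags if parse_semver(tag)]
    return max(valid)[1] if valid else None


def needs_update(current, tags):
    """True when the registry holds a strictly newer version than `current`."""
    newest = newest_tag(tags)
    return newest is not None and parse_semver(newest) > parse_semver(current)


registry_tags = ["1.2.0", "1.10.0", "latest", "1.9.3"]
print(newest_tag(registry_tags))             # 1.10.0
print(needs_update("1.9.3", registry_tags))  # True
```

Comparing integer tuples rather than strings is what makes `1.10.0` sort above `1.9.3`; a plain string comparison would get that wrong.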

Workflow orchestration

Apache Airflow provides enterprise-grade workflow orchestration capabilities:

  • DAG management — workflow definitions automatically synchronized from GitLab repositories via the git-sync sidecar
  • Scalable execution — distributed task execution across Kubernetes pods
  • Data pipeline integration — direct integration with MinIO for artifact storage and PostgreSQL for metadata management
  • Monitoring & alerting — built-in monitoring of pipeline execution with failure handling

An Airflow DAG running on the platform — tasks scheduled, executed across Kubernetes pods, monitored through the Airflow UI.
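What "DAG-orchestrated" means in practice is that each task runs only after everything it depends on has finished. The sketch below shows that ordering with Python's standard `graphlib` module instead of the Airflow API, so it runs without an Airflow installation. The task names are hypothetical and loosely follow the ingestion, training, and deployment stages mentioned in this post.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Hypothetical pipeline, not the platform's real DAG.
pipeline = {
    "preprocess": {"ingest"},
    "train": {"preprocess"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag, tasks):
    """Run every task in dependency order and return the execution order."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        order.append(name)
    return order


# Stand-in task bodies: each one just records that it ran.
log = []
all_names = set(pipeline) | {dep for deps in pipeline.values() for dep in deps}
tasks = {name: (lambda n=name: log.append(n)) for name in all_names}

print(run_pipeline(pipeline, tasks))
```

A real scheduler adds what this sketch leaves out: retries, schedules, parallel execution of independent tasks, and running each task in its own Kubernetes pod.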

Validation — integration with a Real-Time ML project

The platform was validated by integrating it with a Real-Time Machine Learning project that generates simulated data continuously and trains classification, regression, and clustering models on it in real time.

End-to-end workflow validated:

  1. Repo pushed to the local GitLab instance
  2. GitLab CI pipeline built multiple Docker images for the project's microservices
  3. Images pushed to k3d Registry with semantic versioning
  4. ArgoCD Image Updater detected new versions and triggered automated deployment
  5. Helm charts defined all Kubernetes resources, applied and managed by ArgoCD
  6. Rancher provided live monitoring of resource health and pod status
  7. Airflow DAGs orchestrated ML training pipelines, preprocessing, and deployment workflows

Rancher showing all Helm packages installed into the k3d cluster — real-time visibility into deployment health and resource utilization across the full platform.

Key achievements

Technical excellence

  • Full infrastructure provisioning automated through Terraform
  • True GitOps workflow with declarative configuration and automated synchronization
  • Production-ready microservices deployment patterns
  • Enterprise-grade workflow orchestration with Apache Airflow

DevOps best practices

  • Infrastructure-as-Code for reproducibility
  • Container-based immutable infrastructure with automated rollbacks
  • Continuous deployment from commit to running workload
  • Comprehensive observability through Rancher

MLOps capabilities

  • End-to-end ML pipelines: data ingestion → training → serving
  • S3-compatible artifact management via MinIO
  • DAG-based scheduling for ML workflows
  • Infrastructure supporting real-time training and inference

Conclusion

This platform integrates Infrastructure-as-Code, GitOps, containerization, and workflow orchestration into a production-grade MLOps environment. It bridges development and production cleanly — the ML side stays focused on modeling while operations stay automated, reproducible, and reliable.

It’s the infrastructure backbone for COELHO RealTime and several other production ML projects I run privately.