JANUARY 2026
COELHO RealTime
Production-grade Real-Time MLOps platform on Kubernetes combining incremental ML and batch ML across fraud detection, ETA prediction, and customer segmentation.
Executive summary
COELHO RealTime is a production-grade Real-Time MLOps platform running on Kubernetes that combines incremental machine learning with batch learning to solve three concurrent ML use cases:
- Transaction Fraud Detection (TFD) — binary classification
- Estimated Time of Arrival (ETA) — regression
- E-Commerce Customer Interactions (ECCI) — clustering
The platform implements a dual ML paradigm where River ML handles real-time incremental training directly from Kafka streams, while CatBoost and scikit-learn handle batch training on data accumulated in a Delta Lake. All experiments are tracked with MLflow, models are cached in Redis for sub-millisecond inference, and the entire system is monitored through Prometheus, Grafana, Alertmanager, and Karma.
See it deployed
The platform needs at least 32 GB of RAM to run (see Hardware footprint), so it can't be spun up casually. These 53 slides are the verifiable record of the system running in production: live dashboards, alert routing, MLflow tracking, and CI/CD flows.
Platform architecture
Data flow
- Generation — Kafka producers emit realistic synthetic data for all three use cases via Faker
- Streaming — data lands in Kafka topics (KRaft mode, 3 partitions each, 1-week retention)
- Incremental processing — FastAPI consumers read Kafka streams and train River models in real time
- Batch processing — Spark Structured Streaming writes to Delta Lake on MinIO; DuckDB preprocesses for CatBoost / sklearn training
- Inference — trained models cached in Redis for sub-millisecond predictions
- Tracking — every experiment logged to MLflow with S3 artifacts on MinIO
- Monitoring — Prometheus scrapes metrics from all services; Grafana renders dashboards; Alertmanager fires alerts
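The incremental path in the data flow above follows a predict-then-learn loop: each event is scored before the model is updated on it, so the running metric always reflects unseen data. A minimal sketch, assuming a plain-Python stand-in for both the Kafka consumer and the River model. The `RunningMeanRegressor` class and the sample events are hypothetical; only the `predict_one` / `learn_one` method names mirror River's interface.

```python
class RunningMeanRegressor:
    """Hypothetical stand-in for an incremental model: predicts the running mean of y."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x: dict) -> float:
        # Features are ignored here; a real model would use them.
        return self.mean

    def learn_one(self, x: dict, y: float) -> None:
        # Incremental mean update: no past samples are stored.
        self.n += 1
        self.mean += (y - self.mean) / self.n


def prequential_mae(model, stream) -> float:
    """Score each event before learning from it and return the mean absolute error."""
    abs_error_sum = 0.0
    count = 0
    for x, y in stream:
        y_pred = model.predict_one(x)   # 1. predict on the unseen event
        abs_error_sum += abs(y - y_pred)
        count += 1
        model.learn_one(x, y)           # 2. then update the model with it
    return abs_error_sum / count if count else 0.0


# Hypothetical ETA-style events: (features, minutes to arrival).
events = [
    ({"distance_km": 2.0}, 10.0),
    ({"distance_km": 4.0}, 20.0),
    ({"distance_km": 3.0}, 15.0),
]
model = RunningMeanRegressor()
mae = prequential_mae(model, events)
print(f"prequential MAE: {mae:.3f}, events seen: {model.n}")
```

In the platform the stream would come from a Kafka topic and the model from River, but the loop ordering is the part this sketch is meant to show.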
ML use cases at a glance
| Use case | Task | Incremental model | Batch model | Metric |
|---|---|---|---|---|
| TFD (Transaction Fraud Detection) | Binary classification | Adaptive Random Forest (River) | CatBoostClassifier | F-Beta (β = 2.0) |
| ETA (Estimated Time of Arrival) | Regression | Adaptive Random Forest (River) | CatBoostRegressor | MAE |
| ECCI (E-Commerce Customer Interactions) | Clustering | DBSTREAM (River) | KMeans (scikit-learn) | Silhouette |
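The F-Beta score with β = 2.0 used for TFD weights recall more heavily than precision, which suits fraud detection, where a missed fraud usually costs more than a false alarm. A minimal sketch of the standard formula; the confusion-matrix counts are hypothetical and chosen only to show the asymmetry.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), with P = precision and R = recall."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: two classifiers with mirrored precision and recall.
high_recall = f_beta(tp=80, fp=80, fn=0)     # precision 0.5, recall 1.0
high_precision = f_beta(tp=80, fp=0, fn=80)  # precision 1.0, recall 0.5
print(f"high recall: {high_recall:.3f}, high precision: {high_precision:.3f}")
```

With β = 2 the high-recall classifier scores noticeably higher, even though both would tie on F1.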
Key components
Unified FastAPI backend
A single service consolidates all ML functionality into 39 endpoints across three versioned routers:
| Router | Purpose | Endpoints |
|---|---|---|
| /api/v1/incremental | River ML real-time training, predictions, metrics | 16 |
| /api/v1/batch | CatBoost / sklearn batch training + YellowBrick / Scikit-Plot visualizations | 20 |
| /api/v1/sql | DuckDB SQL queries against Delta Lake tables | 3 |
Includes MLflow model selection (best model by metric), Redis caching, Prometheus instrumentation, and visualization generation on demand.
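Model selection and caching can be pictured as two steps: pick the run with the best metric, then serve it through a cache-aside lookup so the expensive load happens only once. A minimal sketch using in-memory stand-ins: the `runs` list replaces an MLflow query, the `cache` dict replaces Redis, and `load_model` is a hypothetical loader. None of these names come from the project.

```python
from typing import Any, Callable


def select_best_run(runs: list[dict], metric: str, higher_is_better: bool = True) -> dict:
    """Return the run with the best value for `metric`, skipping runs that lack it."""
    scored = [run for run in runs if metric in run["metrics"]]
    if not scored:
        raise ValueError(f"no run reports metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda run: run["metrics"][metric])


def get_model(key: str, cache: dict, loader: Callable[[str], Any]) -> tuple[Any, bool]:
    """Cache-aside lookup: return (model, was_cache_hit)."""
    if key in cache:
        return cache[key], True
    model = loader(key)   # slow path: fetch from the artifact store
    cache[key] = model    # populate the cache for later requests
    return model, False


# Hypothetical ETA runs; MAE is an error metric, so lower is better.
runs = [
    {"run_id": "a", "metrics": {"mae": 4.2}},
    {"run_id": "b", "metrics": {"mae": 3.1}},
    {"run_id": "c", "metrics": {}},
]
best = select_best_run(runs, "mae", higher_is_better=False)

cache: dict = {}
loads: list[str] = []


def load_model(run_id: str) -> str:
    """Hypothetical loader that records each call so cache hits are observable."""
    loads.append(run_id)
    return f"model-from-{run_id}"


first, first_hit = get_model(best["run_id"], cache, load_model)
second, second_hit = get_model(best["run_id"], cache, load_model)
print(best["run_id"], first_hit, second_hit, len(loads))
```

The metric direction has to be passed explicitly because F-Beta and Silhouette are maximized while MAE is minimized.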
SvelteKit frontend
Interactive dashboard for training, predictions, and model diagnostics via YellowBrick. Project pages for TFD, ETA, and ECCI with nested tabs (Incremental ML / Batch ML / SQL).
Infrastructure & deployment
The entire platform is deployed on a k3d Kubernetes cluster provisioned with Terraform and packaged as a Helm umbrella chart with seven dependencies:
| Dependency | Version | Source |
|---|---|---|
| MLflow | 1.8.1 | community-charts |
| Redis | 24.0.8 | Bitnami |
| MinIO | 5.4.0 | MinIO Official |
| PostgreSQL | 18.1.14 | Bitnami |
| kube-prometheus-stack | 80.6.0 | prometheus-community |
| Kafka | 32.4.3 | Bitnami |
| Spark | 10.0.3 | Bitnami |
CI/CD & GitOps
Developer push → GitLab CI builds images → Push to registry
↓
ArgoCD auto-sync ← Git commit [skip ci] ← ArgoCD Image Updater detects new tags
↓
Deploy to cluster → Prometheus monitors → Grafana dashboards → Alertmanager
- GitLab CI — automated container image builds on commit
- ArgoCD — GitOps continuous delivery with automated Kubernetes sync
- ArgoCD Image Updater — automatic detection & deployment of new image versions
Observability
Prometheus — 50+ custom metrics
Instrumented across all services:
- FastAPI — training status, prediction count / latency / errors, cache hits/misses, model load duration, MLflow operation duration, SQL query duration, visualization generation time
- Kafka producers — messages sent, errors, send duration, connection status, retries, fraud ratio (TFD), active sessions (ECCI)
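Counters such as prediction count or cache hits/misses reach Prometheus as plain text in its exposition format. A minimal stdlib-only sketch of what one scraped counter looks like; the metric name, label, and values are hypothetical, and a real service would normally rely on a Prometheus client library rather than render the text by hand.

```python
def render_counter(name: str, help_text: str, label: str, samples: dict[str, float]) -> str:
    """Render one labelled counter in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_value, value in sorted(samples.items()):
        # Doubled braces emit literal { and } around the label pair.
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical cache counters for a model-serving endpoint.
cache_samples = {"hit": 90, "miss": 10}
text = render_counter(
    "cache_requests_total",
    "Model cache lookups by result.",
    "result",
    cache_samples,
)
hit_ratio = cache_samples["hit"] / sum(cache_samples.values())
print(text)
print(f"cache hit ratio: {hit_ratio:.2f}")
```

A ratio like this is usually computed on the Prometheus or Grafana side from the raw counters rather than exported directly.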
Grafana — 11 dashboards
All provisioned via ConfigMaps with sidecar auto-discovery:
- COELHORealTime Overview — service health, total CPU/RAM aggregate panels with sparklines
- ML Pipeline — training metrics, predictions, model performance
- FastAPI Detailed — latency, error rates, throughput per endpoint
- Kafka Producers — message rates, send latency, errors, connections
- Kafka — consumer lag, throughput, partitions
- PostgreSQL — connections, queries, replication
- Redis — memory, connections, ops/sec
- MinIO — S3 operations, storage, buckets
- Spark — performance metrics
- Spark Streaming — structured streaming metrics
- SvelteKit — frontend performance
Alerting
- 30+ Prometheus alerting rules across 10 rule groups (FastAPI, Kafka, Kafka Producers, MLflow, PostgreSQL, Redis, MinIO, SvelteKit, Spark, Application General)
- Alertmanager with routing and inhibition rules
- Karma UI for alert visualization
- Pre-configured receivers for Slack, Discord, Email, and PagerDuty
Hardware footprint
| Component | Specification |
|---|---|
| RAM | 64 GB |
| CPU | 8 cores (modern x86_64) |
| Storage | 2 TB NVMe SSD |
| Orchestration | k3d (local Kubernetes) |
Minimum 32 GB RAM required. The platform runs multiple memory-intensive services concurrently (FastAPI ~8 GB, Spark Worker ~5 GB, Kafka ~2 GB, Prometheus ~2 GB, etc.). Systems with less than 32 GB will experience OOMKills and pod restart loops. 64 GB recommended for comfortable headroom.
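The four services named above already account for roughly 17 GB, which is consistent with 32 GB being the floor once the remaining pods and the host OS are counted. A quick tally using only the approximate figures quoted above; the remainder is simply what is left for everything covered by "etc.", not a measured value.

```python
# Approximate per-service memory from the paragraph above, in GB.
named_services_gb = {
    "FastAPI": 8,
    "Spark Worker": 5,
    "Kafka": 2,
    "Prometheus": 2,
}

named_total_gb = sum(named_services_gb.values())

# Memory left for all other services and the OS at each documented RAM size.
remaining_gb = {total: total - named_total_gb for total in (32, 64)}

print(f"named services: {named_total_gb} GB")
for total, remaining in remaining_gb.items():
    print(f"{total} GB host leaves {remaining} GB for everything else")
```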
What this project proves
- End-to-end MLOps lifecycle ownership — data engineering → training → serving → observability, all owned by one engineer in one repo
- Real-time and batch ML co-existing in a single platform — not theoretical; tested under continuous synthetic load
- Production rigor: tracing, alert routing, GitOps, IaC, and dashboards, not a notebook demo