APRIL 2024

COELHO VISION

Computer Vision platform combining OpenCV, Ultralytics, and MediaPipe behind four production pipelines — object detection, image segmentation, pose estimation, and real-time camera integration — exposed via a Streamlit interface.

Outcome

Four production CV pipelines shipped end-to-end · live Streamlit demo · backbone choices (OpenCV / Ultralytics / MediaPipe) picked per pipeline by accuracy-vs-latency tradeoff.

OpenCV Ultralytics YOLO MediaPipe RoboFlow OpenVINO Python Streamlit

Source ↗ Live demo ↗ Presentation ↗

Executive summary

COELHO VISION is a Computer Vision platform combining three best-in-class CV backbones — OpenCV, Ultralytics YOLO, and MediaPipe — behind four production pipelines exposed via a Streamlit interface:

Object Detection — object / image classification + face detection
Image Segmentation — pixel-level segmentation for fine-grained analysis
Pose Estimation — gesture recognition and motion analysis
Live Camera Integration — applies all three pipelines in real time from the user’s webcam

The platform is publicly live at coelhovision.streamlit.app. No PDF deck is required to verify that it runs: just open the link with a webcam connected.

See it deployed

The Streamlit app handles the lightweight live demo. These 23 slides are the deeper record — backbone selection rationale per pipeline, accuracy/latency tradeoffs across OpenCV / Ultralytics / MediaPipe, and the design choices behind each of the four production pipelines.

Four pipelines, three backbones

Pipeline	Primary backbone	Why this backbone
Object Detection	Ultralytics YOLO	SOTA accuracy/speed tradeoff; easy to swap YOLO variants for the workload
Image Segmentation	OpenCV + Ultralytics	Classical OpenCV ops where they win; YOLO-seg where neural is required
Pose Estimation	MediaPipe	Purpose-built for real-time skeleton estimation; runs CPU-only at acceptable fps
Live Camera Integration	All three composed	Each backbone runs on the same webcam feed in parallel — direct apples-to-apples comparison for users
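
The live camera row can be pictured as a fan-out in which every backbone receives the same frame. The sketch below is a minimal illustration under stated assumptions, not the project's actual code: `run_backbones`, the dictionary keys, and the stand-in callables are all hypothetical, and a real version would wrap OpenCV, Ultralytics, and MediaPipe calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict

# A backbone here is any callable that takes a frame and returns a result.
Backbone = Callable[[Any], Any]


def run_backbones(frame: Any, backbones: Dict[str, Backbone]) -> Dict[str, Any]:
    """Run every backbone on the same frame concurrently; collect results by name."""
    if not backbones:
        return {}
    with ThreadPoolExecutor(max_workers=len(backbones)) as pool:
        futures = {name: pool.submit(fn, frame) for name, fn in backbones.items()}
        # result() blocks until that backbone finishes and re-raises its errors.
        return {name: future.result() for name, future in futures.items()}


# Stand-in callables (hypothetical); real ones would call the actual libraries.
demo_backbones: Dict[str, Backbone] = {
    "opencv": lambda frame: f"edges for {frame}",
    "ultralytics": lambda frame: f"boxes for {frame}",
    "mediapipe": lambda frame: f"landmarks for {frame}",
}

results = run_backbones("frame-0", demo_backbones)
print(results)
```

Threads only help here if the underlying libraries release the GIL during native inference, which is worth measuring on the target machine rather than assuming.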

The selection isn’t dogmatic — each pipeline picks the backbone that wins on its specific accuracy-vs-latency curve, rather than forcing one model family to do everything.
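
Picking a backbone on an accuracy-vs-latency curve can be made concrete as "most accurate candidate that fits the latency budget". The sketch below assumes that rule; the `Candidate` type, the `pick_backbone` helper, and every number in it are hypothetical placeholders, not measurements from this project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # e.g. mAP or keypoint accuracy, higher is better
    latency_ms: float  # per-frame latency on the target hardware


def pick_backbone(candidates: List[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Return the most accurate candidate within the latency budget, or None."""
    feasible = [c for c in candidates if c.latency_ms <= budget_ms]
    if not feasible:
        return None
    # Prefer higher accuracy; break ties by lower latency.
    return max(feasible, key=lambda c: (c.accuracy, -c.latency_ms))


# Placeholder measurements, NOT results from the project.
pose_candidates = [
    Candidate("backbone_a", accuracy=0.80, latency_ms=25.0),
    Candidate("backbone_b", accuracy=0.85, latency_ms=90.0),
]

choice = pick_backbone(pose_candidates, budget_ms=40.0)
print(choice.name if choice else "no backbone fits the budget")
```

With a 40 ms budget the slower but more accurate candidate is excluded. Loosening the budget flips the choice, which is why the same rule can select different backbones for different pipelines.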

Why mix three backbones instead of one

A single CV backbone (say, YOLO11 for everything) is simpler, but you lose:

Graceful pose estimation on CPU: MediaPipe is built for it; YOLO is overkill and slower
Classical OpenCV operations that don’t need a neural net at all (color filters, edge detection, geometric warps)
Optionality when a new SOTA detector ships — you can swap one backbone without rewriting the whole stack

This three-backbone choice is a deliberate engineering tradeoff: slightly more code to maintain, materially better behavior across the four pipelines.
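
The "swap one backbone without rewriting the whole stack" point usually comes down to having pipeline code depend on a small interface rather than on a specific model family. The following is a minimal sketch of that idea, assuming a hypothetical `Detector` protocol and stub detectors; it does not reflect how the project's source is actually organized.

```python
from typing import Any, Dict, List, Protocol


class Detector(Protocol):
    """Anything with a detect(frame) method returning a list of detections."""

    def detect(self, frame: Any) -> List[Dict[str, Any]]: ...


class StubDetectorV1:
    def detect(self, frame: Any) -> List[Dict[str, Any]]:
        return [{"label": "person", "source": "v1"}]


class StubDetectorV2:
    def detect(self, frame: Any) -> List[Dict[str, Any]]:
        return [{"label": "person", "source": "v2"}]


class DetectionPipeline:
    """Pipeline code depends on the Detector interface, not on one model family."""

    def __init__(self, detector: Detector) -> None:
        self.detector = detector

    def run(self, frame: Any) -> List[str]:
        return [d["label"] for d in self.detector.detect(frame)]


pipeline = DetectionPipeline(StubDetectorV1())
before = pipeline.run("frame-0")
pipeline.detector = StubDetectorV2()  # swap the backbone; pipeline code is unchanged
after = pipeline.run("frame-0")
print(before, after)
```

The extra code to maintain is the adapter per backbone; in exchange, the `run` method never has to change when a detector is replaced.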

Stack

OpenCV — classical CV operations + foundation for several pipeline steps
Ultralytics YOLO — detection and segmentation neural backbones
MediaPipe — real-time pose and gesture estimation
RoboFlow — dataset management and labeling for custom-fine-tuned models
OpenVINO — Intel inference optimization for CPU-only deployment scenarios
Streamlit — public-facing demo interface
Python — implementation

What this project proves

Backbone-selection discipline — choosing the right tool per pipeline rather than forcing one family to fit every problem
Real-time CV runs on commodity hardware — MediaPipe pose + Ultralytics detection both ship at usable fps without dedicated GPU acceleration
Public, runnable proof — the Streamlit app is live; anyone can verify the claims with a webcam in 30 seconds
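
A claim like "usable fps" is easy to check with a small timing loop around whatever per-frame function is under test. The helper below is a sketch, assuming a hypothetical `measure_fps` function and a sleep-based stand-in for inference; the number it prints says nothing about the project's real throughput.

```python
import time
from typing import Any, Callable, Iterable


def measure_fps(process: Callable[[Any], Any], frames: Iterable[Any]) -> float:
    """Return frames processed per second of wall-clock time."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        process(frame)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0 or elapsed <= 0.0:
        return 0.0
    return count / elapsed


def fake_inference(frame: Any) -> None:
    # Stand-in for a real model call; sleeps roughly 5 ms per frame.
    time.sleep(0.005)


fps = measure_fps(fake_inference, range(20))
print(f"{fps:.1f} fps")
```

In a real check, `fake_inference` would be replaced by the pipeline's per-frame call and `frames` by frames read from the webcam, measured on the same CPU-only hardware the claim is about.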