Rafael COELHO

APRIL 2024

COELHO VISION

Computer Vision platform combining OpenCV, Ultralytics, and MediaPipe behind four production pipelines — object detection, image segmentation, pose estimation, and real-time camera integration — exposed via a Streamlit interface.

Outcome
Four production CV pipelines shipped end-to-end · live Streamlit demo · backbone choices (OpenCV / Ultralytics / MediaPipe) picked per pipeline by accuracy-vs-latency tradeoff.
OpenCV · Ultralytics YOLO · MediaPipe · RoboFlow · OpenVINO · Python · Streamlit

Executive summary

COELHO VISION is a Computer Vision platform combining three best-in-class CV backbones — OpenCV, Ultralytics YOLO, and MediaPipe — behind four production pipelines exposed via a Streamlit interface:

  1. Object Detection — object / image classification + face detection
  2. Image Segmentation — pixel-level segmentation for fine-grained analysis
  3. Pose Estimation — gesture recognition and motion analysis
  4. Live Camera Integration — applies all three pipelines in real time from the user’s webcam

The platform is publicly live at coelhovision.streamlit.app. No PDF deck is required to verify that it runs: open the link and allow webcam access.

See it deployed

The Streamlit app handles the lightweight live demo. These 23 slides are the deeper record — backbone selection rationale per pipeline, accuracy/latency tradeoffs across OpenCV / Ultralytics / MediaPipe, and the design choices behind each of the four production pipelines.

Four pipelines, three backbones

| Pipeline | Primary backbone | Why this backbone |
| --- | --- | --- |
| Object Detection | Ultralytics YOLO | SOTA accuracy/speed tradeoff; easy to swap YOLO variants for the workload |
| Image Segmentation | OpenCV + Ultralytics | Classical OpenCV ops where they win; YOLO-seg where neural is required |
| Pose Estimation | MediaPipe | Purpose-built for real-time skeleton estimation; runs CPU-only at acceptable fps |
| Live Camera Integration | All three composed | Each backbone runs on the same webcam feed in parallel, giving users a direct apples-to-apples comparison |

The selection isn’t dogmatic — each pipeline picks the backbone that wins on its specific accuracy-vs-latency curve, rather than forcing one model family to do everything.
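One way to read "wins on its accuracy-vs-latency curve" is as a small selection rule: among the candidates that fit a latency budget, take the most accurate. The sketch below is only an illustration of that idea, not the project's actual selection code. The `Candidate` class, the `pick_backbone` function, the backbone names, and all numbers are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    backbone: str       # hypothetical label, not a real model name
    accuracy: float     # task metric in [0, 1], higher is better
    latency_ms: float   # mean per-frame latency on the target hardware


def pick_backbone(candidates, latency_budget_ms):
    """Return the most accurate candidate that fits the latency budget.

    If nothing fits the budget, fall back to the fastest candidate.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    within = [c for c in candidates if c.latency_ms <= latency_budget_ms]
    if within:
        # Ties on accuracy are broken in favour of lower latency.
        return max(within, key=lambda c: (c.accuracy, -c.latency_ms))
    return min(candidates, key=lambda c: c.latency_ms)


# Placeholder numbers for illustration only, not measurements from the project.
candidates = [
    Candidate("fast_backbone", accuracy=0.80, latency_ms=25.0),
    Candidate("accurate_backbone", accuracy=0.85, latency_ms=90.0),
]
choice = pick_backbone(candidates, latency_budget_ms=40.0)
print(choice.backbone)
```

With a tight real-time budget the faster candidate is chosen; relaxing the budget lets the more accurate one win, which is the per-pipeline behavior the table describes.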

Why mix three backbones instead of one

A single CV backbone (say, YOLOv11 for everything) is simpler, but it gives up:

  • CPU-friendly pose estimation: MediaPipe is built for it, while YOLO is overkill and slower
  • Classical OpenCV operations that don’t need a neural net at all (color filters, edge detection, geometric warps)
  • Optionality when a new SOTA detector ships: one backbone can be swapped without rewriting the whole stack

This three-backbone choice is a deliberate engineering tradeoff: slightly more code to maintain, materially better behavior across the four pipelines.
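The "swap one backbone without rewriting the stack" property usually comes from a thin common interface between pipelines and models. The sketch below shows one way that could look; it is an assumption about the design, not the project's code. `Backbone`, `StubBackbone`, and `Pipeline` are hypothetical names, and the stub stands in for a real wrapped model.

```python
from typing import Any, List, Protocol


class Backbone(Protocol):
    """Anything with a predict(frame) method can serve as a backbone."""

    def predict(self, frame: Any) -> List[dict]: ...


class StubBackbone:
    """Stand-in for a wrapped model; returns one fixed placeholder result."""

    def __init__(self, name: str) -> None:
        self.name = name

    def predict(self, frame: Any) -> List[dict]:
        return [{"source": self.name, "label": "placeholder"}]


class Pipeline:
    """Depends only on the Backbone interface, never on a concrete model."""

    def __init__(self, backbone: Backbone) -> None:
        self.backbone = backbone

    def run(self, frame: Any) -> List[dict]:
        return self.backbone.predict(frame)


pipeline = Pipeline(StubBackbone("detector_v1"))
before = pipeline.run(frame=None)

# Swapping in a newer detector touches one line; Pipeline.run is unchanged.
pipeline.backbone = StubBackbone("detector_v2")
after = pipeline.run(frame=None)
print(before[0]["source"], after[0]["source"])
```

The extra code to maintain is the adapter per backbone; the pipelines themselves stay the same when a model family changes.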

Stack

  • OpenCV — classical CV operations + foundation for several pipeline steps
  • Ultralytics YOLO — detection and segmentation neural backbones
  • MediaPipe — real-time pose and gesture estimation
  • RoboFlow — dataset management and labeling for custom fine-tuned models
  • OpenVINO — Intel inference optimization for CPU-only deployment scenarios
  • Streamlit — public-facing demo interface
  • Python — implementation

What this project proves

  • Backbone-selection discipline — choosing the right tool per pipeline rather than forcing one family to fit every problem
  • Real-time CV runs on commodity hardware — MediaPipe pose + Ultralytics detection both ship at usable fps without dedicated GPU acceleration
  • Public, runnable proof — the Streamlit app is live; anyone can verify the claims with a webcam in 30 seconds
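A claim like "usable fps" can be checked with a simple timing harness around whatever function processes a frame. The following is a minimal sketch under stated assumptions: `measure_fps` and `fake_process` are hypothetical names, and the 5 ms sleep is a stand-in for real inference, not a measurement from the project.

```python
import time


def measure_fps(process_frame, frames, warmup=2):
    """Return frames processed per second for process_frame over frames.

    The first `warmup` frames are run once beforehand and not timed, so that
    one-off costs such as lazy initialisation do not skew the result.
    """
    frames = list(frames)
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def fake_process(frame):
    # Stand-in workload: pretend each frame takes about 5 ms to process.
    time.sleep(0.005)


fps = measure_fps(fake_process, range(10))
print(f"{fps:.1f} fps")
```

In a real check, `process_frame` would wrap the pose or detection call and `frames` would be images captured from the webcam.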

Live demo → · Source on GitHub →