YOLOv12: Redefining Real-Time Object Detection πŸš€

Ultralytics YOLOv12

Introducing the Pioneering Features and Performance of YOLOv12 from the Latest Research

Written by Henry Navarro

Introduction 🎯

In a groundbreaking release, the developers of YOLOv12 have set new standards in computer vision with their latest model. Known for its unmatched speed, accuracy, and versatility, YOLOv12 is an evolution in the YOLO series that pushes the boundaries of what’s possible in artificial vision.

YOLOv12 was born from the collaboration of three AI researchers: Yunjie Tian, Qixiang Ye, and David Doermann.

Its innovative design, centered around attention mechanisms, ensures faster, more accurate, and versatile performance, solidifying its place as an essential tool for developers and researchers.

Yes, I am part of this storm that is coming 🤣

Technical Architecture Overview of YOLOv12 πŸ”

YOLOv12 introduces a holistic enhancement to the YOLO framework, focusing on integrating attention mechanisms without sacrificing real-time inference capabilities.

Architectural highlights:

  1. Attention-Centric Design: YOLOv12 features an area attention module that segments the feature map into regions, roughly halving the computational complexity of vanilla attention, and uses FlashAttention to mitigate memory-bandwidth limitations for real-time detection.
  2. Hierarchical Structure: The model incorporates a residual efficient layer aggregation network (R-ELAN) to optimize feature integration and reduce gradient blockages, with a streamlined final stage for a lighter, faster architecture.
  3. Architectural Enhancements: By replacing traditional positional encodings with a 7Γ—7 separable convolution, YOLOv12 preserves positional information effectively.
  4. Training and Optimization: Trained over 600 epochs using SGD and custom learning schedules with data augmentations like Mosaic and Mixup to boost generalization.
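To make the area-attention saving concrete, here is a back-of-the-envelope sketch in plain Python (illustrative arithmetic only, not the actual implementation): splitting the n tokens of a feature map into equal areas that attend only within themselves divides the quadratic term of the attention cost by the number of areas.

```python
def attention_flops(n_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and AV products
    of vanilla global self-attention: ~2 * n^2 * d."""
    return 2 * n_tokens * n_tokens * dim

def area_attention_flops(n_tokens: int, dim: int, n_areas: int = 4) -> int:
    """Same estimate when the n tokens are split into `n_areas`
    equal segments that attend only within themselves."""
    per_area = n_tokens // n_areas
    return n_areas * 2 * per_area * per_area * dim

n, d = 1024, 256  # e.g. a 32x32 feature map with 256 channels
full = attention_flops(n, d)
area = area_attention_flops(n, d, n_areas=4)
print(full // area)  # the quadratic term shrinks by the number of areas -> 4
```

With four areas the dominant n² term drops to a quarter, which is where the overall "complexity halved" figure comes from once the fixed per-token costs are counted.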

How YOLOv12 Performs on the COCO Dataset πŸ“Š

The Common Objects in Context (COCO) dataset remains the gold standard benchmark for object detection. YOLOv12 excels by achieving new state-of-the-art mAP scores β€” YOLOv12-N reaches 40.6% mAP, while YOLOv12-X achieves 55.2%.
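For readers new to the metric: COCO mAP averages the per-class AP over IoU thresholds 0.50–0.95, and each AP is computed by 101-point interpolation over recall. A minimal sketch of that interpolation (illustrative, not the official pycocotools code):

```python
def coco_ap(recalls, precisions):
    """101-point interpolated average precision, COCO style: at each
    recall threshold t in {0.00, 0.01, ..., 1.00}, take the best
    precision achieved at recall >= t, then average over thresholds."""
    thresholds = [i / 100 for i in range(101)]
    ap = 0.0
    for t in thresholds:
        best = max((p for r, p in zip(recalls, precisions) if r >= t),
                   default=0.0)
        ap += best / len(thresholds)
    return ap

# A detector reaching 50% recall at perfect precision scores ~0.505
print(round(coco_ap([0.25, 0.5], [1.0, 1.0]), 3))
```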

Performance of YOLOv12 compared with state of the art.

Key Features of YOLOv12 πŸ’‘

  • Attention-Centric Design: Captures detailed image features efficiently, ensuring precise detection in complex scenes.
  • Optimized for Speed and Efficiency: Enhances processing speed through refined architecture and training methods.
  • Improved Accuracy with Fewer Resources: Achieves higher mAP using fewer parameters.
  • Versatile Across Platforms: Adapts seamlessly from edge to GPU systems.
  • Comprehensive Task Support: Handles detection, segmentation, classification, and more.

Installation Guide for YOLOv12 πŸ› οΈ

Setting up YOLOv12 involves a few key steps to ensure CUDA compatibility.

git clone https://github.com/sunsmarterjie/yolov12.git
cd yolov12

1. Verify CUDA Version:

nvcc --version

Example output:

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.4, V12.4.131

Install a PyTorch build that matches your CUDA toolkit; the CUDA 12.1 wheels of torch==2.2.2 also run on a CUDA 12.4 system:

pip install torch==2.2.2 torchvision==0.17.2 --index-url https://download.pytorch.org/whl/cu121
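If you script your setup, the toolkit-to-wheel choice can be automated. The helper below is a hypothetical convenience (not part of any official tooling); the `cu121` and `cu118` tags are the wheel indices PyTorch actually publishes for the 2.2.x series:

```python
def pick_cuda_wheel_tag(cuda_version: str) -> str:
    """Map an installed CUDA toolkit version (as reported by
    `nvcc --version`) to the closest PyTorch 2.2.x wheel tag.
    Illustrative helper, not official tooling."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if major >= 12:
        return "cu121"  # CUDA 12.x systems can use the 12.1 wheels
    if (major, minor) >= (11, 8):
        return "cu118"
    raise ValueError(f"no torch==2.2.x wheel for CUDA {cuda_version}")

tag = pick_cuda_wheel_tag("12.4")
print(f"pip install torch==2.2.2 torchvision==0.17.2 "
      f"--index-url https://download.pytorch.org/whl/{tag}")
```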

2. Install additional dependencies:

# Install thop for FLOPs estimation
pip install thop

# Install optimized FlashAttention for CUDA
pip install flash-attn==2.7.3 --no-build-isolation

# Install remaining packages
pip install -r requirements.txt

Attention, attention (the word has never been more fitting 🤣). Important note ⚠️

YOLOv12 doesn't support CPU-only environments because it relies on FlashAttention, which requires CUDA. Ensure CUDA is properly configured on your system.

Using YOLOv12 with Gradio Interface

The repository includes a Gradio template for interactive demos:

python app.py

This launches an interface for model interaction.

Gradio app for YOLOv12. It has some bugs, but it is handy for image inference.

YOLOv11 vs YOLOv12 β€” the constant battle in artificial intelligence πŸ‘Š

It looks like we have a model that can detect ties better 😂😂

On the left, YOLOv11 detects two ties on Ancelotti; on the right, YOLOv12 correctly detects just one.

Predict on Videos, Images, or Cameras Using YOLOv12 πŸ“Ή

from ultralytics import YOLO

# Load and predict with a model
model = YOLO('yolov12x.pt')
model.predict(0)         # Webcam
model.predict("video.mp4")  # Video file
model.predict("image.jpg")  # Image file
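
Each call returns Results objects whose boxes come in corner (xyxy) format, while many downstream tools expect center-based xywh. The conversion is plain arithmetic; here is a standalone sketch, independent of the Ultralytics API:

```python
def xyxy_to_xywh(box):
    """Convert a (x1, y1, x2, y2) corner box to
    (x_center, y_center, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

print(xyxy_to_xywh((100, 50, 300, 250)))  # (200.0, 150.0, 200, 200)
```

(In practice, Ultralytics exposes both formats directly on the results via `boxes.xyxy` and `boxes.xywh`.)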

Leveraging YOLOv12 for Your Projects 🌐

YOLOv12 offers modes for:

  • Training Mode
  • Validation Mode
  • Prediction Mode
  • Export Mode
  • Tracking Mode

Conclusion ✨

YOLOv12 represents a breakthrough in object detection technology. Key achievements:

  • State-of-the-art across scales (40.6%–55.2% mAP)
  • Efficient area attention β€” faster, lighter
  • Introduces R-ELAN for superior feature integration
  • Better visualization vs YOLOv10 and YOLOv11
  • Maintains inference latencies between 1.64 ms and 11.79 ms
  • Fewer parameters, higher accuracy βœ…

Discover more at the YOLOv12 Repository.

Happy Coding! πŸ’»πŸš€



Originally published on Medium on February 19, 2025.
