
Computer Vision with YOLOv11 2026: Local Object Detection Pipeline

🟡 Intermediate

Build sovereign computer vision pipelines with YOLOv11: dataset preparation, training, inference, OpenCV integration, and local CV deployment with zero cloud inference dependency.

Author: Kofi Mensah, Inference Economics & Hardware Architect

Reading time: 22 min | Build time: 40 min (training: 2-4 hrs)


Key Takeaways

  • Pre-trained COCO weights for quick start: YOLO("yolo11n.pt") detects 80 common object classes out of the box — no training needed for people, cars, animals, etc. For advanced computer vision, see Embedding Models 2026.
  • YOLO format for custom training: One .txt label file per image, one line per bounding box: class_id cx cy w h (all normalised 0–1). Use annotation tools (Roboflow, LabelImg, CVAT) to prepare your dataset.
  • model.train() handles everything: Augmentation, validation splits, learning rate scheduling, and checkpoint saving are automatic.
  • Export for deployment: model.export(format="onnx") or format="tflite" converts trained weights for edge deployment. See GGUF Quantization Explained for optimization techniques.

Introduction

Direct Answer: How do I use YOLOv11 for object detection on my own hardware in 2026?

Install with pip install ultralytics, load pre-trained weights with YOLO("yolo11n.pt"), and run inference on images or video. For custom training, prepare a dataset in YOLO format (images + normalized bounding box labels), create data.yaml, and run model.train(data="data.yaml", epochs=50). All training and inference runs locally without cloud APIs.

Can YOLOv11 run 100% offline?

Yes. Ultralytics YOLOv11 supports local CPU/GPU/MPS inference with zero network calls. Unlike AWS Rekognition or Google Vision, local YOLOv11 processes frames entirely on your hardware. No image data, no metadata, and no telemetry leave your machine.
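As a quick sanity check, here is a minimal sketch (assuming PyTorch is installed alongside Ultralytics, which it is by default) that picks the best available local device; once the weights are cached from the first download, inference needs no network access. The image path is a placeholder:

# offline_check.py - pick the best available local device (sketch)
import torch
from ultralytics import YOLO

# Prefer CUDA (NVIDIA), then MPS (Apple Silicon), else fall back to CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model = YOLO("yolo11n.pt")   # cached locally after the first download
results = model("bus.jpg", device=device, verbose=False)   # "bus.jpg": placeholder local image
print(f"device={device}, objects detected={len(results[0].boxes)}")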

The Vucense 2026 Computer Vision Sovereignty Index

Inference Method | Data Retention | Latency | Auditability | Sovereignty Score
Cloud Vision API | 🔴 Logged & Retained | ⚠️ 200-800 ms | ❌ Black-box | 12/100
Hybrid (Local + Cloud Fallback) | 🟡 Partial | 🟢 50-150 ms | ⚠️ Partial | 48/100
Local YOLOv11 (CPU/GPU) | 🟢 100% On-Device | 🟢 15-45 ms | ✅ Open weights | 89/100
Air-Gapped Edge (Jetson/RPi) | 🟢 Physically Isolated | 🟢 30-60 ms | ✅ Verifiable pipeline | 94/100

Why Local Computer Vision Matters

Cloud vision APIs (AWS Rekognition, Google Vision, Azure Computer Vision) offer convenience but require sending images to third-party servers. For sovereign deployments, running YOLOv11 locally is critical:

Use Case | Cloud API Risk | Local YOLO Benefit
Home security camera | Video streams to vendor cloud | Footage never leaves your network
Industrial inspection | Proprietary models + data retention | Full control over model + data
Medical imaging | HIPAA compliance complexity | Air-gapped inference possible
Privacy-sensitive analysis | Metadata harvesting, facial recognition | Detection results stay local

Key principle: If the image contains people, locations, or sensitive assets, local inference is the only way to guarantee the data never leaves your control.


Part 1: Installation and Setup

YOLOv11 is available via Ultralytics. The nano model (yolo11n.pt) is best for CPU inference; larger variants effectively require a GPU for real-time use.

pip install ultralytics --break-system-packages
python3 -c "from ultralytics import YOLO; print('Ultralytics version:', YOLO.__module__.split('.')[0])"
# 01_quick_start.py — run pre-trained YOLO on an image
from ultralytics import YOLO
from pathlib import Path

# Load pre-trained YOLOv11 (downloads ~6MB on first run)
# Variants: yolo11n (nano), yolo11s (small), yolo11m (medium), yolo11l (large), yolo11x (xlarge)
model = YOLO("yolo11n.pt")   # Nano: fastest, smallest

# Inference on a single image
results = model("https://ultralytics.com/images/bus.jpg")   # or local path

# Parse results
for result in results:
    for box in result.boxes:
        cls = model.names[int(box.cls)]
        conf = float(box.conf)
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        print(f"  {cls}: {conf:.2f}  at [{x1},{y1},{x2},{y2}]")

# Save annotated image
results[0].save(filename="detected.jpg")
print("Saved: detected.jpg")

Expected output:

  person: 0.94  at [54, 192, 244, 756]
  person: 0.87  at [310, 155, 501, 750]
  bus:    0.82  at [5, 228, 640, 756]
  person: 0.72  at [478, 208, 564, 748]
Saved: detected.jpg

Part 2: Batch Inference

Calling the model one image at a time pays per-call overhead for every file. Passing a list of paths lets Ultralytics batch frames through the GPU, which is substantially faster when processing whole directories.

# 02_batch_inference.py — process multiple images efficiently
from ultralytics import YOLO
from pathlib import Path
import numpy as np
import time

model = YOLO("yolo11n.pt")

# Warmup on a dummy frame (first inference is slower due to CUDA initialisation)
model(np.zeros((640, 640, 3), dtype=np.uint8), verbose=False)

# Batch process all images in a directory
image_dir = Path("/tmp/images")
images = list(image_dir.glob("*.jpg")) + list(image_dir.glob("*.png"))

start = time.perf_counter()
results = model(
    [str(img) for img in images],
    batch=16,        # Process 16 images simultaneously
    device="cuda",
    conf=0.5,        # Minimum confidence threshold
    iou=0.45,        # IoU threshold for NMS
    verbose=False
)
elapsed = time.perf_counter() - start

print(f"Processed {len(images)} images in {elapsed:.2f}s ({len(images)/elapsed:.0f} img/s)")

# Aggregate results
detection_counts = {}
for result in results:
    for box in result.boxes:
        cls = model.names[int(box.cls)]
        detection_counts[cls] = detection_counts.get(cls, 0) + 1

print("\nDetection summary:")
for cls, count in sorted(detection_counts.items(), key=lambda x: -x[1])[:10]:
    print(f"  {cls}: {count}")

Expected output (RTX 4090):

Processed 500 images in 8.3s (60 img/s)

Detection summary:
  person: 847
  car: 312
  bicycle: 89
  truck: 67

Part 2.5: Annotation Tools for Custom Datasets

Before training on custom data, you need to annotate (label) images with bounding boxes. Here’s a comparison of the best tools:

Tool | Cost | Ease of Use | YOLO Format Export | Team Support | Speed per Image | Best For
Roboflow | Free / $50/mo | ⭐⭐⭐⭐⭐ | ✓ Direct | Teams | ~2-3 min | Small teams, quick iteration
LabelImg | Free | ⭐⭐⭐ | ✓ Manual convert | Solo | ~3-5 min | Single annotator, low budget
CVAT (Computer Vision Annotation Tool) | Free (self-host) | ⭐⭐⭐⭐ | ✓ Plugin | Teams | ~2 min | Enterprise, self-hosted
Makesense.ai | Free | ⭐⭐⭐⭐ | ✓ Yes | Solo/Small | ~2 min | Quick browser-based labeling
Supervisely | Freemium | ⭐⭐⭐⭐ | ✓ Yes | Teams | ~1-2 min | Professional CV teams
Label Studio | Free (self-host) | ⭐⭐⭐⭐ | ✓ Yes | Teams | ~2 min | Custom workflows, privacy

Recommended workflow for 100-500 images:

Option 1: Roboflow (Fastest for small projects)

# 1. Upload images to Roboflow Web UI
# 2. Draw bounding boxes in browser
# 3. Export in YOLO format
# 4. Roboflow automatically creates train/val split and applies augmentation

# Download and extract
unzip roboflow-dataset.zip
cd roboflow-dataset

# Ready to train immediately with the Ultralytics CLI
yolo detect train data=data.yaml model=yolo11n.pt epochs=50

Pros: Zero setup, browser-based, automatic data augmentation
Cons: Cloud-dependent, free tier limited to 3 models/month

Option 2: LabelImg (Free, self-contained)

# Install
pip install labelimg

# Run GUI (the console script name is case-sensitive on Linux)
labelImg custom_dataset/images/train

# Draw boxes and save — creates .xml files
# Convert XML to YOLO format
python3 - << 'EOF'
import xml.etree.ElementTree as ET
from pathlib import Path

# Map Pascal VOC class names to YOLO class ids (edit for your classes)
CLASS_MAP = {"cat": 0, "dog": 1, "bird": 2}

def xml_to_yolo(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Image dimensions are stored in the VOC <size> element
    size = root.find('size')
    img_width = float(size.find('width').text)
    img_height = float(size.find('height').text)

    yolo_lines = []
    for obj in root.findall('object'):
        class_id = CLASS_MAP[obj.find('name').text]

        bbox = obj.find('bndbox')
        xmin = float(bbox.find('xmin').text)
        ymin = float(bbox.find('ymin').text)
        xmax = float(bbox.find('xmax').text)
        ymax = float(bbox.find('ymax').text)

        # Convert corner coordinates to normalised centre/width/height
        cx = (xmin + xmax) / 2 / img_width
        cy = (ymin + ymax) / 2 / img_height
        w = (xmax - xmin) / img_width
        h = (ymax - ymin) / img_height

        yolo_lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")

    Path(xml_file).with_suffix('.txt').write_text('\n'.join(yolo_lines))

# Convert all XMLs (dimensions are read from each file, not hardcoded)
for xml in Path('custom_dataset/images/train').glob('*.xml'):
    xml_to_yolo(xml)
EOF

Pros: Completely offline, no account needed
Cons: Manual XML→YOLO conversion, no augmentation

Option 3: CVAT (Best for teams, self-hosted privacy)

# Self-hosted setup via the official Docker Compose files
git clone https://github.com/cvat-ai/cvat
cd cvat
docker compose up -d

# Create an admin account
docker exec -it cvat_server bash -ic 'python3 ~/manage.py createsuperuser'

# Access at http://localhost:8080
# Create project, upload images, annotate in browser
# Export as YOLO format directly

Pros: Enterprise-grade, full control, team collaboration
Cons: Higher setup complexity, requires Docker

Recommendation:

  • 100 images, quick prototype: Roboflow (fastest)
  • 500+ images, privacy critical: CVAT self-hosted (full control)
  • Solo, zero cost: LabelImg (works offline)

Part 3: Custom Dataset Training

Transfer learning from COCO-pretrained weights significantly accelerates training on custom data. The model already understands object shapes, textures, and spatial relationships. You only need to fine-tune for your specific classes and domain.

# Directory structure for custom training
mkdir -p custom_dataset/{images,labels}/{train,val}

# data.yaml — dataset configuration
cat > custom_dataset/data.yaml << 'EOF'
path: /home/user/custom_dataset   # Absolute path to dataset root
train: images/train
val:   images/val

nc: 3   # Number of classes
names: ["cat", "dog", "bird"]
EOF
# Create YOLO format labels from your annotations
# YOLO format: class_id cx cy w h (all normalised to [0, 1])

def convert_bbox_to_yolo(img_width: int, img_height: int,
                          x1: int, y1: int, x2: int, y2: int,
                          class_id: int) -> str:
    """Convert pixel bounding box to YOLO normalised format."""
    cx = ((x1 + x2) / 2) / img_width
    cy = ((y1 + y2) / 2) / img_height
    w = (x2 - x1) / img_width
    h = (y2 - y1) / img_height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: cat at pixels (100, 50, 300, 250) in 640x480 image
line = convert_bbox_to_yolo(640, 480, 100, 50, 300, 250, class_id=0)
print(line)   # → 0 0.312500 0.312500 0.312500 0.416667
# 03_training.py — Train YOLOv11 on Custom Dataset using Transfer Learning
# Transfer learning: fine-tune pre-trained weights instead of training from scratch
# Benefit: converges 10× faster, requires fewer images (100+ instead of 1000+)

from ultralytics import YOLO

# ══════════════════════════════════════════════════════════════════════════════════════════════
# Load Pre-trained Model (Transfer Learning)
# ══════════════════════════════════════════════════════════════════════════════════════════════

# YOLO("yolo11n.pt"): load YOLOv11 Nano with weights from Coco pre-training
# "yolo11n": nano size — lightweight (smallest accuracy, fastest inference)
# Alternatives: yolo11s (small), yolo11m (medium), yolo11l (large), yolo11x (extra-large)
# Trade-off: Nano=18M params, fastest; Large=93M params, most accurate
model = YOLO("yolo11n.pt")

# ══════════════════════════════════════════════════════════════════════════════════════════════
# Training Configuration — Hyperparameters Optimized for Custom Data
# ══════════════════════════════════════════════════════════════════════════════════════════════

results = model.train(
    # ── Dataset Configuration ──────────────────────────────────────────────────────────────
    data="custom_dataset/data.yaml",  # Path to data.yaml (train/val paths, class names)
    
    # ── Training Duration ──────────────────────────────────────────────────────────────────
    epochs=50,              # Number of times to iterate over entire dataset
    # Each epoch: load all images, compute loss, backprop, update weights
    # 50 epochs typical for 100-500 images; increase to 100+ for larger datasets
    
    # ── Image Preprocessing ────────────────────────────────────────────────────────────────
    imgsz=640,              # Input image size for training (640×640 pixels)
    # Must be a multiple of the model stride (32): 640 gives a 20×20 grid at the coarsest scale
    # Larger (832): more detail, slower; smaller (416): faster, less detail
    
    batch=16,               # Images per batch (batch size)
    # VRAM use scales with batch size, image size, and model variant
    # batch=16 at 640px fits comfortably on a 24 GB card for the nano model
    # Recommended: use the largest batch that fits in GPU memory; halve it if you hit CUDA OOM
    
    device="cuda",          # Device to train on: "cuda" (NVIDIA), "mps" (Apple Silicon), "cpu"
    # GPU training 50–100× faster than CPU; CUDA availability: python -c "import torch; print(torch.cuda.is_available())"
    
    # ── Learning Rate Schedule ────────────────────────────────────────────────────────────
    lr0=1e-3,               # Initial learning rate (0.001)
    # Learning rate controls step size for weight updates
    # Too high: training diverges (loss explodes)
    # Too low: training stalls (loss plateaus)
    # Typical range: 1e-4 to 1e-2
    
    lrf=1e-2,               # Final learning rate multiplier (0.01)
    # End learning rate = lr0 × lrf = 0.001 × 0.01 = 0.00001
    # The learning rate decays smoothly over training (linear by default; cosine with cos_lr=True)
    # Helps the model settle into a good minimum
    
    # ── Regularization (Prevent Overfitting) ────────────────────────────────────────────
    weight_decay=5e-4,      # L2 regularization penalty (0.0005)
    # Penalizes large weights; encourages simpler, more generalizable model
    # Typical values: 1e-5 to 1e-3
    
    augment=True,           # Enable data augmentation
    # Augmentation techniques applied each epoch:
    # - Mosaic: combine 4 images into one (increases effective dataset size)
    # - Mixup: blend two images together (smoother transitions, better robustness)
    # - Rotations, flips, color jitter (invariance to lighting, orientation changes)
    
    # ── Early Stopping (Prevent Overfitting) ────────────────────────────────────────────
    patience=20,            # Stop training if validation mAP doesn't improve for 20 epochs
    # Prevents overfitting: training loss decreases, but validation loss increases
    # Example: best mAP at epoch 30, no improvement by epoch 50 → stop at epoch 50
    
    # ── Checkpointing ──────────────────────────────────────────────────────────────────
    save_period=10,         # Save checkpoint every 10 epochs
    # Allows resuming if interrupted: yolo train resume=runs/custom_yolo11/weights/last.pt
    # Best model automatically saved as best.pt
    
    # ── Output Directory ───────────────────────────────────────────────────────────────
    project="runs",         # Root directory for output
    name="custom_yolo11"    # Subdirectory: runs/custom_yolo11/
    # Final structure: runs/custom_yolo11/weights/{best,last}.pt, runs/custom_yolo11/results.csv
)

# ══════════════════════════════════════════════════════════════════════════════════════════════
# Training Results Analysis
# ══════════════════════════════════════════════════════════════════════════════════════════════

# Extract performance metrics from training results
# mAP50(B): mean Average Precision at 0.5 IoU threshold
# IoU (Intersection over Union): overlap between predicted and ground-truth boxes
# mAP50 > 0.85 is good; > 0.90 is excellent for custom datasets
print(f"Best mAP50: {results.results_dict['metrics/mAP50(B)']:.4f}")

# Path to best model (lowest validation loss)
# Use this for inference, not the last checkpoint (which may be overfit)
print(f"Best model: runs/custom_yolo11/weights/best.pt")

Expected output (during training):

Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/50     21.4G     1.2341     2.3847     1.1234         47        640
 10/50     21.4G     0.8231     1.2347     0.9123         52        640
 50/50     21.4G     0.5123     0.7234     0.7891         61        640
                 Class     Images  Instances      Box(P          R      mAP50
                   all       200        847      0.891      0.843      0.872
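After training, a short sketch for reloading the best checkpoint and re-running validation; the paths assume the project/name settings from the training script above:

# 03b_validate.py - reload the best checkpoint and validate (sketch)
from ultralytics import YOLO

best = YOLO("runs/custom_yolo11/weights/best.pt")

# Re-run validation against the val split declared in data.yaml
metrics = best.val(data="custom_dataset/data.yaml")
print(f"mAP50:    {metrics.box.map50:.4f}")   # mean AP at IoU 0.5
print(f"mAP50-95: {metrics.box.map:.4f}")     # mean AP averaged over IoU 0.5-0.95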

Part 4: Real-Time Video Detection

# 04_webcam_detection.py — live object detection from webcam
import cv2
from ultralytics import YOLO
import time

# Load your trained model (or use pre-trained)
model = YOLO("yolo11n.pt")

cap = cv2.VideoCapture(0)   # 0 = default webcam; or video file path
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

fps_tracker = time.time()
frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break

    # Inference
    results = model(frame, conf=0.5, verbose=False)

    # Annotate frame with bounding boxes
    annotated = results[0].plot()

    # Add FPS counter
    frame_count += 1
    if frame_count % 30 == 0:
        fps = 30 / (time.time() - fps_tracker)
        fps_tracker = time.time()
        print(f"FPS: {fps:.1f}")
    cv2.putText(annotated, f"YOLOv11 | Local GPU", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)

    cv2.imshow("YOLOv11 Detection", annotated)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()

Expected output (terminal):

FPS: 62.3
FPS: 64.1
FPS: 61.8

62+ FPS on RTX 4090 with YOLOv11n — real-time detection with headroom.
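The same loop applies to recorded footage. Here is a hedged sketch that writes annotated frames back out with OpenCV's VideoWriter; the input and output filenames are placeholders:

# 04b_video_file.py - annotate a video file and save the result (sketch)
import cv2
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
cap = cv2.VideoCapture("input.mp4")            # placeholder input path

fps = cap.get(cv2.CAP_PROP_FPS) or 30
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("annotated.mp4",
                      cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = model(frame, conf=0.5, verbose=False)
    out.write(results[0].plot())               # plot() returns an annotated BGR frame

cap.release()
out.release()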


Part 5: Export for Edge Deployment

# 05_export.py — export to various formats
from ultralytics import YOLO

model = YOLO("runs/custom_yolo11/weights/best.pt")

# ONNX (runs on any hardware via ONNX Runtime)
model.export(format="onnx", imgsz=640, simplify=True)

# TensorRT (fastest on NVIDIA GPUs)
model.export(format="engine", imgsz=640, half=True)   # FP16

# TFLite (for Raspberry Pi / mobile)
model.export(format="tflite", imgsz=320)   # Smaller resolution for edge

# Run ONNX inference
onnx_model = YOLO("runs/custom_yolo11/weights/best.onnx")
results = onnx_model("test.jpg")
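To sanity-check an export, a rough timing sketch comparing the PyTorch and ONNX weights on the same image; it assumes the exports above have run, test.jpg is a placeholder, and absolute numbers depend entirely on your hardware:

# 05b_compare_latency.py - rough PyTorch vs ONNX latency check (sketch)
import time
from ultralytics import YOLO

for weights in ["runs/custom_yolo11/weights/best.pt",
                "runs/custom_yolo11/weights/best.onnx"]:
    model = YOLO(weights)
    model("test.jpg", verbose=False)           # warmup (first call is slower)
    start = time.perf_counter()
    for _ in range(20):
        model("test.jpg", verbose=False)
    ms = (time.perf_counter() - start) / 20 * 1000
    print(f"{weights}: {ms:.1f} ms/image")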

Sovereign Deployment: Local Inference Endpoint

For production computer vision workloads, expose YOLOv11 inference through a private API endpoint that runs 100% on your infrastructure:

# local_inference.py
from fastapi import FastAPI, UploadFile, File
from ultralytics import YOLO
import cv2
import numpy as np

app = FastAPI(title="Sovereign CV Endpoint", version="1.0.0")
model = YOLO("yolo11n.pt", task="detect")  # Loads once at startup, stays in memory

@app.post("/api/detect")
async def detect(file: UploadFile = File(...)):
    """Run YOLO inference on uploaded image. Results never leave this server."""
    # Read image from upload
    contents = await file.read()
    img_array = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
    
    # Inference on CPU or GPU (auto-detect)
    results = model(img, verbose=False, device="cpu")  # device="mps" or "cuda" if available
    
    # Extract detections
    detections = []
    for box in results[0].boxes:
        detections.append({
            "class": model.names[int(box.cls.item())],
            "confidence": float(box.conf.item()),
            "bbox": box.xyxy[0].tolist()
        })
    
    return {"detections": detections, "num_objects": len(detections)}

@app.get("/health")
async def health():
    """Health check endpoint."""
    return {"status": "ok", "model": "yolo11n", "inference_device": "cpu"}

# Run locally only (no network exposure by default)
if __name__ == "__main__":
    import uvicorn
    # Bind to localhost only — production access via Tailscale/VPN
    uvicorn.run(app, host="127.0.0.1", port=8000)

Usage (local machine only):

pip install fastapi uvicorn
python local_inference.py

# In another terminal
curl -X POST "http://127.0.0.1:8000/api/detect" -F "file=@test.jpg"
# Returns: {"detections": [...], "num_objects": 3}

Sovereign Design: The endpoint binds to 127.0.0.1 (not 0.0.0.0), so it’s inaccessible from the network. For remote access, use Tailscale VPN (see Docker Networking 2026) to securely route inference requests without exposing the API to the public internet.


PII Protection: Frame Sanitization Before Storage

If your CV pipeline logs, archives, or trains on frames, you must strip identifiable data before writing to disk. This is often a legal requirement under GDPR/CCPA and always a sovereignty best practice.

# sanitize_frames.py
import cv2
from pathlib import Path
from ultralytics import YOLO

def blur_people_and_plates(frame, results, blur_strength=51):
    """
    Blur detected people and likely license-plate regions before storage.
    Sovereignty rule: If the image contains people, locations, or vehicles,
    redact before archival.
    """
    for box in results[0].boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        
        # Class 0 = person (in COCO; adjust for your custom dataset)
        if int(box.cls.item()) == 0:
            roi = frame[y1:y2, x1:x2]
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(roi, (blur_strength, blur_strength), 30)
        
        # Class 2 = car: blur the lower 15% of the box, a crude heuristic for
        # where the plate usually sits; blur the whole vehicle if in doubt
        if int(box.cls.item()) == 2:
            roi = frame[y1:y2, x1:x2]
            plate_height = max(1, int((y2 - y1) * 0.15))
            frame[y2 - plate_height:y2, x1:x2] = cv2.GaussianBlur(
                roi[-plate_height:], (99, 99), 30
            )
    
    return frame

# Example: process video and save sanitized frames
Path("archive").mkdir(exist_ok=True)   # output dir must exist before cv2.imwrite
model = YOLO("yolo11n.pt")
cap = cv2.VideoCapture("security_camera.mp4")

frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    
    results = model(frame, verbose=False)
    sanitized = blur_people_and_plates(frame, results)
    
    # Save sanitized frame to archive (safe for logs, training, etc.)
    cv2.imwrite(f"archive/sanitized_frame_{frame_count}.jpg", sanitized)
    frame_count += 1
    
    if frame_count % 100 == 0:
        print(f"Processed {frame_count} frames")

cap.release()
print(f"Total frames archived (sanitized): {frame_count}")

Key principle: Blur faces, license plates, and location identifiers before any storage, logging, or training. This ensures your CV pipeline respects user privacy and complies with regulations. For security cameras on your property, this is still best practice — it protects visitors and guests.


YOLOv11 provides the full computer vision pipeline: pre-trained COCO detection in 3 lines, custom dataset training with automatic augmentation, real-time webcam inference at 60+ FPS, and export to ONNX/TensorRT for production. Everything runs locally — no cloud inference API, no per-image cost.

Sovereign Deployment Tip: Combine YOLOv11 with Ollama for multimodal reasoning: “Detect objects with YOLO, then ask a local LLM to explain what it sees — all on-device.” See CrewAI + Ollama 2026 for orchestrating local AI agents.
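A hedged sketch of that pattern using the ollama Python package; the model name (llama3.2) and image path are placeholders for whatever you have pulled locally:

# yolo_plus_llm.py - describe YOLO detections with a local LLM (sketch)
import ollama                  # pip install ollama; assumes the Ollama server is running locally
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
results = model("scene.jpg", verbose=False)      # "scene.jpg": placeholder image path

# Summarise detections as plain text for the LLM prompt
objects = [model.names[int(box.cls)] for box in results[0].boxes]
summary = ", ".join(objects) if objects else "nothing"

response = ollama.chat(
    model="llama3.2",          # placeholder: any model you have pulled locally
    messages=[{
        "role": "user",
        "content": f"A camera frame contains: {summary}. Describe the likely scene in one sentence.",
    }],
)
print(response["message"]["content"])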

See Embedding Models 2026 for training custom neural network architectures from scratch, and Edge Computing Guide 2026 for hardware selection.


People Also Ask

What is the difference between YOLOv11 nano, small, medium, large, and xlarge?

The variants trade accuracy for speed and memory: YOLOv11n (nano, 2.6M params) runs at 80+ FPS on an RTX 4090 and is best for real-time applications with modest accuracy needs. YOLOv11s (small) and YOLOv11m (medium) are the balanced choices for most production applications. YOLOv11l (large) and YOLOv11x (xlarge) maximize accuracy at the cost of speed. Start with YOLOv11n for real-time requirements, YOLOv11s for balanced performance, and YOLOv11m if accuracy is the priority.
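If you are unsure which variant your hardware sustains, a quick benchmark sketch; test.jpg is a placeholder, and each .pt file downloads on first use:

# benchmark_variants.py - rough FPS per YOLOv11 variant (sketch)
import time
from ultralytics import YOLO

for name in ["yolo11n.pt", "yolo11s.pt", "yolo11m.pt"]:
    model = YOLO(name)
    model("test.jpg", verbose=False)          # warmup
    start = time.perf_counter()
    for _ in range(30):
        model("test.jpg", verbose=False)
    fps = 30 / (time.perf_counter() - start)
    print(f"{name}: {fps:.1f} FPS")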

How many images do I need to train a custom YOLOv11 model?

For transfer learning (fine-tuning from COCO pre-trained weights): 100–500 images per class is typically sufficient for common-looking objects. For unusual or highly specific objects: 500–2000 images per class. YOLOv11’s data augmentation (mosaic, mixup, flips, colour jitter) effectively multiplies your dataset 10–20×, reducing the data requirement significantly. Start with 100 images per class, train, and evaluate — add more data only if accuracy is insufficient.


Troubleshooting & Common Issues

Issue: CUDA out of memory during training

Cause: Batch size too large for GPU VRAM.

# Fix: Reduce batch size
# In code: batch=8 instead of batch=16
model.train(data="data.yaml", batch=8, imgsz=640)

# Or reduce image size
model.train(data="data.yaml", batch=16, imgsz=416)

Issue: FileNotFoundError: data.yaml not found

Cause: data.yaml path incorrect or missing.

# Fix: Verify file exists and path is correct
ls -la custom_dataset/data.yaml
# Should show the file; if not, create it:
cat > custom_dataset/data.yaml << 'EOF'
path: /full/path/to/custom_dataset
train: images/train
val: images/val
nc: 3
names: ["cat", "dog", "bird"]
EOF

Issue: mAP stuck at 0.0 after 50 epochs

Cause: Dataset format wrong or labels invalid.

# Fix: Validate dataset format
# YOLO format: .txt files with: class_id cx cy w h (0-1 normalized)
# Labels live in labels/train/ with the same basename as the image in images/train/
ls images/train/*.jpg | wc -l  # Should match:
ls labels/train/*.txt | wc -l

# Check label format:
head -1 labels/train/*.txt  # Should show lines like: 0 0.5 0.5 0.3 0.4
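Beyond file counts, a small validation sketch that catches the most common label faults (wrong field count, unnormalised coordinates, class ids out of range); NUM_CLASSES is an assumption you should match to nc in your data.yaml:

# validate_labels.py - sanity-check YOLO label files (sketch)
from pathlib import Path

NUM_CLASSES = 3                                 # assumption: must match nc in data.yaml

for txt in Path("labels/train").glob("*.txt"):
    for i, line in enumerate(txt.read_text().splitlines(), 1):
        parts = line.split()
        if len(parts) != 5:
            print(f"{txt}:{i}: expected 5 fields, got {len(parts)}")
            continue
        cls, cx, cy, w, h = int(parts[0]), *map(float, parts[1:])
        if not 0 <= cls < NUM_CLASSES:
            print(f"{txt}:{i}: class id {cls} out of range")
        if not all(0.0 <= v <= 1.0 for v in (cx, cy, w, h)):
            print(f"{txt}:{i}: coordinates not normalised to [0, 1]")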

Issue: Low mAP (0.4–0.5) even with 1000 images

Cause: Images too blurry, inconsistent classes, or poor labeling.

# Fix: Inspect dataset quality
# 1. Check images are clear (not blurry, right resolution)
# 2. Verify labels are correct (spot-check 10 images)
# 3. Check for class imbalance (some classes rare)

# For class imbalance: Ultralytics has no class-weight argument for train();
# oversample the rare classes (duplicate their image/label pairs) or collect more examples.

Issue: Inference slow (1 FPS) on edge device

Cause: Model too large for the device, or no capable GPU available.

# Fix: Use smaller model variant
model = YOLO("yolo11n.pt")  # Nano: 80+ FPS on RTX 4090

# Or quantize model
from ultralytics import YOLO
model = YOLO("yolo11n.pt")
model.export(format="onnx", half=True)  # FP16 precision: 2× faster, minimal accuracy loss

Issue: Model overfitting: training loss decreases but validation mAP plateaus

Cause: Not enough training data or regularization.

# Fix: Increase regularization
model.train(
    data="data.yaml",
    epochs=50,
    weight_decay=1e-3,  # Increase from default 5e-4
    augment=True,
    mosaic=1.0,  # Enable mosaic augmentation
    mixup=0.1    # Enable mixup
)

Model Selection Decision Tree

What's your primary goal?
├─ Real-time inference (>30 FPS required)
│  └─ Use YOLOv11n (nano)
├─ Balanced speed + accuracy
│  └─ Use YOLOv11s or YOLOv11m (small/medium)
├─ Maximum accuracy, speed not critical
│  └─ Use YOLOv11l or YOLOv11x (large/xlarge)
└─ Edge device (mobile, Raspberry Pi)
   └─ Use YOLOv11n with quantization (INT8)
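For the edge branch of the tree, a hedged INT8 export sketch; Ultralytics' int8 option performs post-training quantisation and uses your dataset for calibration, so pass the data.yaml:

# int8_export.py - INT8 TFLite export for edge devices (sketch)
from ultralytics import YOLO

model = YOLO("runs/custom_yolo11/weights/best.pt")

# int8=True enables post-training quantisation; data supplies calibration images
model.export(format="tflite", imgsz=320, int8=True, data="custom_dataset/data.yaml")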

Quick Reference: Dataset Preparation Checklist

  • ✅ Images in images/train and images/val folders
  • ✅ Labels in labels/train and labels/val folders (matching filenames)
  • ✅ Each image has a corresponding .txt label file
  • ✅ Label format: class_id cx cy w h (all 0-1 normalized)
  • ✅ data.yaml with correct paths and class names
  • ✅ 70–80% images in train, 20–30% in val (see the split sketch below)
  • ✅ No images in both train and val (prevents data leakage)
  • ✅ All classes represented in train set (no missing classes)
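A split sketch matching the checklist; it assumes images and labels start in hypothetical staging/ directories and copies matched pairs into an 80/20 split:

# split_dataset.py - 80/20 train/val split of matched pairs (sketch)
import random
import shutil
from pathlib import Path

random.seed(0)                                   # reproducible split
images = sorted(Path("staging/images").glob("*.jpg"))  # hypothetical staging dir
random.shuffle(images)
cut = int(len(images) * 0.8)

for split, subset in [("train", images[:cut]), ("val", images[cut:])]:
    Path(f"images/{split}").mkdir(parents=True, exist_ok=True)
    Path(f"labels/{split}").mkdir(parents=True, exist_ok=True)
    for img in subset:
        lbl = Path("staging/labels") / img.with_suffix(".txt").name
        shutil.copy(img, f"images/{split}/{img.name}")
        shutil.copy(lbl, f"labels/{split}/{lbl.name}")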

Annotation Workflow Comparison

Workflow | Speed | Cost | Quality | Best For
Roboflow (cloud) | Fast (~2 min/img) | Free/Paid | Good (auto-augment) | Small teams, quick iteration
LabelImg (desktop) | Medium (3-5 min/img) | Free | Good | Solo developers, privacy
CVAT (self-hosted) | Medium (2-3 min/img) | Free | Excellent | Teams, enterprise
Freelancers (MTurk) | Slow | Medium-High ($0.10-0.50/img) | Varies | Large datasets (1000+ images)

Frequently Asked Questions (FAQ)

Q: What’s the difference between YOLOv11 and YOLOv10?

A: YOLOv11 (2026) is 2–5% more accurate and slightly faster than YOLOv10 (2024). Use YOLOv11 for new projects. Upgrade from v10 only if accuracy improvement matters.

Q: Can I train YOLOv11 on CPU?

A: Yes, but expect 10–50× slower training. On CPU: ~1 epoch/minute. On RTX 4090: ~1 epoch/second. Use CPU only for testing; switch to GPU for production training.

Q: How do I export YOLOv11 to run on phone/browser?

A: Three options:

# ONNX (runs on CPU, cross-platform)
model.export(format="onnx")

# TensorFlow Lite (mobile)
model.export(format="tflite")

# NCNN (fast inference, mobile/edge)
model.export(format="ncnn")

Q: Can I train on video directly instead of images?

A: YOLOv11 trains on images, not video. Extract frames from video first:

import cv2
from pathlib import Path

Path("frames").mkdir(exist_ok=True)   # output dir must exist before cv2.imwrite
cap = cv2.VideoCapture("video.mp4")
frame_id = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret: break
    cv2.imwrite(f"frames/frame_{frame_id:06d}.jpg", frame)
    frame_id += 1
cap.release()

Q: How do I handle class imbalance (some classes rare)?

A: Ultralytics has no class-weight training argument. The practical fix is to oversample the rare class (duplicate its image/label pairs so it appears more often per epoch) or to collect more examples; the oversampling sketch in the Troubleshooting section above applies here too.

Q: What’s the minimum number of classes I can train?

A: 1 class is possible (detect “cat” anywhere in image). YOLOv11 handles single-class well. Multi-class (5+) yields better results due to learned feature diversity.

Q: Can I detect small objects (<50 pixels)?

A: YOLOv11 struggles with very small objects. Options:

  1. Use larger input: imgsz=1280 instead of 640 (4× more computation)
  2. Tile images: Split large images into 640×640 tiles, detect, merge results (see the tiling sketch below)
  3. Use a different model: two-stage detectors such as Faster R-CNN often handle small objects better
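Option 2 in code: a simplified tiling sketch that detects per tile and offsets boxes back into full-image coordinates (a production version would also merge duplicates with cross-tile NMS; large_image.jpg is a placeholder):

# tile_detect.py - tiled detection for small objects (sketch)
import cv2
from ultralytics import YOLO

TILE = 640
model = YOLO("yolo11n.pt")
img = cv2.imread("large_image.jpg")              # placeholder path
h, w = img.shape[:2]

detections = []
for y in range(0, h, TILE):
    for x in range(0, w, TILE):
        tile = img[y:y + TILE, x:x + TILE]
        results = model(tile, verbose=False)
        for box in results[0].boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            # Offset tile-local coordinates back into the full image
            detections.append((model.names[int(box.cls)], float(box.conf),
                               x1 + x, y1 + y, x2 + x, y2 + y))

print(f"{len(detections)} detections across tiles")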

Q: How do I use YOLOv11 for video surveillance 24/7?

A: Run capture and inference in a background thread so the main program stays responsive:

import cv2, threading
from ultralytics import YOLO

model = YOLO("best.pt")
cap = cv2.VideoCapture("rtsp://camera_url")

def process_frames():
    while True:
        ret, frame = cap.read()
        if not ret: break
        results = model(frame, verbose=False)   # inference runs off the main thread
        # Handle results here (log, alert, write to disk); cv2.imshow is not thread-safe
        print(f"{len(results[0].boxes)} objects in frame")

thread = threading.Thread(target=process_frames, daemon=True)
thread.start()
thread.join()   # keep the process alive for 24/7 operation

Q: What’s the cost to train a custom YOLOv11 vs cloud services?

A: Local training: ~$1 electricity (on RTX 4090). Cloud (AWS/Google): $5–50/hour GPU rental. 50-epoch training on cloud: ~$10–50. Local is cheaper if you already own GPU.

Q: How do I evaluate model performance beyond mAP?

A: Inspect the confusion matrix computed during validation:

results = model.val()  # Validation
print(results.confusion_matrix.matrix)  # Raw matrix of predicted vs true classes
# Example: "dog" confused as "cat" → may need more dog examples



Tested on: Ubuntu 24.04 LTS (RTX 4090, CUDA 12.4). Ultralytics 8.3.42, PyTorch 2.5.1. Last verified: May 16, 2026.
