Skip to main content

Training Guide

Overview

SpikeYoloV8-Tracker provides a comprehensive training pipeline with configurable hyperparameters, advanced training techniques, and robust logging.

Quick Start

Basic Training

python scripts/training/comprehensive_training.py \
--slice_duration_us 100 \
--time_steps 8 \
--batch_size 8

High-Frequency Training

python scripts/training/comprehensive_training.py \
--slice_duration_us 100 \
--time_steps 8 \
--batch_size 8 \
--max_epochs 100 \
--learning_rate 1e-3 \
--device cuda

Training Scripts

Main Training Script

Location: scripts/training/comprehensive_training.py

This is the primary script for training the model. It supports:

  • Configurable training parameters
  • Learning rate scheduling
  • Warmup epochs
  • Multiple optimizers (SGD, AdamW)
  • Comprehensive logging

Location: scripts/training/hyperparameter_search.py

Automated hyperparameter optimization:

python scripts/training/hyperparameter_search.py

Training Parameters

Command-Line Arguments

  • --slice_duration_us: Time window duration in microseconds (default: 100)
  • --time_steps: Number of temporal steps (default: 8)
  • --batch_size: Batch size (default: 8)
  • --max_epochs: Maximum training epochs (default: 25)
  • --learning_rate: Initial learning rate (default: 0.001)
  • --device: Device to use (default: "cuda")

Configuration File Parameters

See Configuration Guide for all configurable parameters.

Training Optimizations

1. Warmup Learning Rate Schedule

Gradual learning rate increase prevents cold start issues:

def warmup_learning_rate(optimizer, epoch, warmup_epochs, base_lr):
"""Apply linear warmup learning rate."""
if epoch < warmup_epochs:
lr = base_lr * (epoch + 1) / warmup_epochs
for param_group in optimizer.param_groups:
param_group['lr'] = lr

Configuration:

training:
warmup_epochs: 3

Expected Impact: +15-25% mAP

2. Adjusted Loss Function Balance

Stronger emphasis on bounding box localization:

yolo_loss:
box_loss_weight: 5.0 # Increased from 2.0

Expected Impact: +10-20% mAP

3. Increased Training Epochs

More time for model to learn spatial relationships:

training:
epochs: 25 # Increased from 5

Expected Impact: +10-15% mAP

4. Prioritized Annotated Windows

Larger temporal buffer captures more annotations:

data_processing:
temporal_buffer: 0.2 # Increased from 0.1

Expected Impact: +20-30% mAP

5. Increased Sample Diversity

Better dataset coverage:

data_processing:
max_samples_per_file: 50 # Increased from 30

Expected Impact: +15-25% mAP

6. Focal Loss for Hard Examples

Focuses learning on hard-to-classify examples:

yolo_loss:
use_focal_loss: true
focal_alpha: 0.25
focal_gamma: 2.0

Expected Impact: +10-15% mAP

7. SGD Optimizer with Momentum

Better for localization tasks:

training:
optimizer: "sgd"
momentum: 0.9

Expected Impact: +5-10% mAP

8. Cyclic Learning Rate

Helps escape poor local minima:

training:
lr_scheduler: "cyclic"
max_learning_rate: 0.002

Expected Impact: +5-10% mAP

9. Label Smoothing

Prevents overconfidence:

yolo_loss:
label_smoothing: 0.1

Expected Impact: +5-10% mAP, better calibration

10. Temporal-Aware Processing

Preserves fine-grained temporal information:

  • Removed temporal aggregation that was averaging across time
  • Loss computed separately for each temporal step
  • Better temporal matching

Expected Impact: +15-25% mAP

Training Process

Dynamic Batching

The training process uses dynamic batching:

  1. Sample Calculation: Exact number of samples calculated per file
  2. Overlapping Windows: 25% overlap ensures complete coverage
  3. Temporal Matching: Annotations matched to specific event windows

Training Log Example

INFO: Temporal matching: 10000 events [1000000-1000999μs] -> 15 annotations
INFO: Temporal matching: 10000 events [2000000-2000999μs] -> 23 annotations
INFO: Epoch 1, Batch 100/1400: Total Loss = 0.723692, Box Loss = 0.000000,
Class Loss = 0.000000, Obj Loss = 0.361720, Track Loss = 0.000168

Monitoring Training

GPU Monitoring

# Check GPU status
nvidia-smi

# Real-time monitoring
watch -n 1 nvidia-smi

Log Files

Training logs are automatically saved to:

{model.logs_dir}/training_{timestamp}.log

Logs include:

  • Training progress
  • Loss values
  • Learning rate schedules
  • File processing information
  • Temporal matching statistics

Expected Performance

Combined Optimizations Impact

  • Total mAP Improvement: +75-100% (from 0% to meaningful values)
  • Training Time: ~1.3-1.6x longer due to more epochs and SGD
  • Rationale: Focal loss + SGD + cyclic LR + label smoothing = robust model

Troubleshooting

Common Issues

  1. Out of Memory: Reduce batch size or max_events_per_sample
  2. Slow Training: Check GPU utilization, reduce prefetch_factor
  3. Poor Convergence: Adjust learning rate, check data loading
  4. Zero mAP: Verify temporal matching, check annotation format

Debug Mode

Enable detailed logging:

# In config.yaml
logging:
level: "DEBUG"

Next Steps