Training Guide
Overview
SpikeYoloV8-Tracker provides a comprehensive training pipeline with configurable hyperparameters, advanced training techniques, and robust logging.
Quick Start
Basic Training
python scripts/training/comprehensive_training.py \
--slice_duration_us 100 \
--time_steps 8 \
--batch_size 8
High-Frequency Training
python scripts/training/comprehensive_training.py \
--slice_duration_us 100 \
--time_steps 8 \
--batch_size 8 \
--max_epochs 100 \
--learning_rate 1e-3 \
--device cuda
Training Scripts
Main Training Script
Location: scripts/training/comprehensive_training.py
This is the primary script for training the model. It supports:
- Configurable training parameters
- Learning rate scheduling
- Warmup epochs
- Multiple optimizers (SGD, AdamW)
- Comprehensive logging
Hyperparameter Search
Location: scripts/training/hyperparameter_search.py
Automated hyperparameter optimization:
python scripts/training/hyperparameter_search.py
Training Parameters
Command-Line Arguments
--slice_duration_us: Time window duration in microseconds (default: 100)--time_steps: Number of temporal steps (default: 8)--batch_size: Batch size (default: 8)--max_epochs: Maximum training epochs (default: 25)--learning_rate: Initial learning rate (default: 0.001)--device: Device to use (default: "cuda")
Configuration File Parameters
See Configuration Guide for all configurable parameters.
Training Optimizations
1. Warmup Learning Rate Schedule
Gradual learning rate increase prevents cold start issues:
def warmup_learning_rate(optimizer, epoch, warmup_epochs, base_lr):
"""Apply linear warmup learning rate."""
if epoch < warmup_epochs:
lr = base_lr * (epoch + 1) / warmup_epochs
for param_group in optimizer.param_groups:
param_group['lr'] = lr
Configuration:
training:
warmup_epochs: 3
Expected Impact: +15-25% mAP
2. Adjusted Loss Function Balance
Stronger emphasis on bounding box localization:
yolo_loss:
box_loss_weight: 5.0 # Increased from 2.0
Expected Impact: +10-20% mAP
3. Increased Training Epochs
More time for model to learn spatial relationships:
training:
epochs: 25 # Increased from 5
Expected Impact: +10-15% mAP
4. Prioritized Annotated Windows
Larger temporal buffer captures more annotations:
data_processing:
temporal_buffer: 0.2 # Increased from 0.1
Expected Impact: +20-30% mAP
5. Increased Sample Diversity
Better dataset coverage:
data_processing:
max_samples_per_file: 50 # Increased from 30
Expected Impact: +15-25% mAP
6. Focal Loss for Hard Examples
Focuses learning on hard-to-classify examples:
yolo_loss:
use_focal_loss: true
focal_alpha: 0.25
focal_gamma: 2.0
Expected Impact: +10-15% mAP
7. SGD Optimizer with Momentum
Better for localization tasks:
training:
optimizer: "sgd"
momentum: 0.9
Expected Impact: +5-10% mAP
8. Cyclic Learning Rate
Helps escape poor local minima:
training:
lr_scheduler: "cyclic"
max_learning_rate: 0.002
Expected Impact: +5-10% mAP
9. Label Smoothing
Prevents overconfidence:
yolo_loss:
label_smoothing: 0.1
Expected Impact: +5-10% mAP, better calibration
10. Temporal-Aware Processing
Preserves fine-grained temporal information:
- Removed temporal aggregation that was averaging across time
- Loss computed separately for each temporal step
- Better temporal matching
Expected Impact: +15-25% mAP
Training Process
Dynamic Batching
The training process uses dynamic batching:
- Sample Calculation: Exact number of samples calculated per file
- Overlapping Windows: 25% overlap ensures complete coverage
- Temporal Matching: Annotations matched to specific event windows
Training Log Example
INFO: Temporal matching: 10000 events [1000000-1000999μs] -> 15 annotations
INFO: Temporal matching: 10000 events [2000000-2000999μs] -> 23 annotations
INFO: Epoch 1, Batch 100/1400: Total Loss = 0.723692, Box Loss = 0.000000,
Class Loss = 0.000000, Obj Loss = 0.361720, Track Loss = 0.000168
Monitoring Training
GPU Monitoring
# Check GPU status
nvidia-smi
# Real-time monitoring
watch -n 1 nvidia-smi
Log Files
Training logs are automatically saved to:
{model.logs_dir}/training_{timestamp}.log
Logs include:
- Training progress
- Loss values
- Learning rate schedules
- File processing information
- Temporal matching statistics
Expected Performance
Combined Optimizations Impact
- Total mAP Improvement: +75-100% (from 0% to meaningful values)
- Training Time: ~1.3-1.6x longer due to more epochs and SGD
- Rationale: Focal loss + SGD + cyclic LR + label smoothing = robust model
Troubleshooting
Common Issues
- Out of Memory: Reduce batch size or
max_events_per_sample - Slow Training: Check GPU utilization, reduce
prefetch_factor - Poor Convergence: Adjust learning rate, check data loading
- Zero mAP: Verify temporal matching, check annotation format
Debug Mode
Enable detailed logging:
# In config.yaml
logging:
level: "DEBUG"