Comparative Analysis · 12 May 2026

Batch 04 vs Batch 05

A hyperparameter-controlled comparison of two YOLOv11n training runs for the campus infrastructure detector. Only the training dataset changed between runs — Batch 05 was retrained on an updated, rebalanced custom dataset.

Backbone: YOLOv11n · 2.6M params
Input: 640 × 640 RGB
Classes: 4 (projector · whiteboard · fire ext · door sign)
Hardware: CUDA · RTX 4060
Seed: 42 · deterministic

00 Abstract

This report contrasts two consecutive training runs of the campus infrastructure detector (projector, whiteboard, fire_extinguisher, door_sign). Both runs share the same backbone (YOLOv11n, 2.6M params), input size (640 × 640), optimiser (SGD), schedule (100 epochs, patience 15), and seed (42). The only material variable is the training corpus: Batch 05 was retrained on an updated custom dataset with more balanced per-class sourcing, a larger share of in-house door-sign captures, and richer multi-instance scenes.

Test mAP@0.5: 0.9876 (▲ +0.0544 vs 0.9332)
Test mAP@0.5:0.95: 0.8728 (▲ +0.0769 vs 0.7959)
Macro recall: 0.9804 (▲ +0.1191 vs 0.8613)
Macro precision: 1.0000 (▬ unchanged · 1.0000)
Best val mAP@0.5: 0.9874 (▲ +0.0385 vs 0.9489)
Best val mAP@0.5:0.95: 0.8134 (▲ +0.0883 vs 0.7251)

00b Baseline diagnosis: Project 1 → Project 2 motivation

The Project 1 model (Batch 04) shipped with strong precision but uneven recall:

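| Class | P (test) | R (test) | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| projector | 1.00 | 0.8036 | 0.8964 | 0.7290 |
| whiteboard | 1.00 | 0.7625 | 0.8761 | 0.7767 |
| fire_extinguisher | 1.00 | 0.9565 | 0.9780 | 0.9045 |
| door_sign | 1.00 | 0.9224 | 0.9822 | 0.7733 |
| macro | 1.0000 | 0.8613 | 0.9332 | 0.7959 |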
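
As a cross-check on the macro row, the per-class values above can be averaged directly. This is a minimal sketch that assumes the macro figures are unweighted means over the four classes; the dictionary layout and the `macro_average` helper are illustrative names, not part of the project notebooks.

```python
# Per-class test metrics for Batch 04, copied from the table above.
BATCH04 = {
    "projector":         {"recall": 0.8036, "map50": 0.8964, "map50_95": 0.7290},
    "whiteboard":        {"recall": 0.7625, "map50": 0.8761, "map50_95": 0.7767},
    "fire_extinguisher": {"recall": 0.9565, "map50": 0.9780, "map50_95": 0.9045},
    "door_sign":         {"recall": 0.9224, "map50": 0.9822, "map50_95": 0.7733},
}


def macro_average(per_class, metric):
    """Unweighted mean of one metric across all classes (hypothetical helper)."""
    values = [row[metric] for row in per_class.values()]
    return sum(values) / len(values)


macro = {m: macro_average(BATCH04, m) for m in ("recall", "map50", "map50_95")}
for name, value in macro.items():
    print(f"{name}: {value:.5f}")
```

Each average agrees with the macro row to within rounding at the fourth decimal place.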

Two failure patterns drove the Project 2 plan:

  1. projector and whiteboard recall sits roughly 12–19 pp below the other two classes (0.76–0.80 vs. 0.92–0.96). Both classes were sourced from a narrow style distribution in v1 (Roboflow / Kaggle), so the model had seen too few HUB-campus framings of them. Counter-strategy: Cat. B #5 (targeted dataset expansion on projector/whiteboard), Cat. B #8 (class balance), and Cat. C #10 (multi-scale capture distances).
  2. fire_extinguisher was over-represented in v1 (848 source pairs vs. 200–319 for the others), and the cap-to-200 step left only a narrow style slice. Counter-strategy: Cat. B #5/#7/#8, i.e. rebuild the corpus with balanced sourcing and re-checked labels.

Batch 05 therefore holds every model / training knob fixed and changes only the dataset, so any test-metric delta is causally attributable to the Project 2 dataset-side interventions listed in §1.2.

01 Experimental setup

Both runs were executed end-to-end through the same notebook pipeline (nb01_data_collection through nb05_model_evaluation) on the same hardware and with identical hyperparameters. Because every controllable variable is held fixed, every metric delta below is attributable to the dataset change.

Raw exports (Roboflow / Kaggle / custom HUB captures)
        │
    [NB01]  Aggregate ~200 (image, label) pairs/class
        │
    [NB02]  Remap class IDs · stratified 70/20/10 split
        │
    [NB03]  Health check: balance, box geometry, leakage
        │
    [NB04]  Train YOLOv11n · 100 epochs · SGD · patience 15
        │
    [NB05]  Evaluate on held-out test split (80 images)
        │
    [NB07]  Export to ONNX (opset 12, dynamic axes)
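
The split stage in the pipeline above can be sketched as follows. This is an assumption about how a stratified 70/20/10 split might be implemented, not the NB02 code itself; `stratified_split` and the synthetic file names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Split (item, class_name) pairs into train/val/test, class by class."""
    by_class = defaultdict(list)
    for item, cls in samples:
        by_class[cls].append(item)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for cls in sorted(by_class):
        items = by_class[cls]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_val = round(len(items) * ratios[1])
        splits["train"] += [(i, cls) for i in items[:n_train]]
        splits["val"] += [(i, cls) for i in items[n_train:n_train + n_val]]
        # Whatever remains after train and val goes to test.
        splits["test"] += [(i, cls) for i in items[n_train + n_val:]]
    return splits


# Synthetic stand-in for the ~200 (image, label) pairs per class.
classes = ["projector", "whiteboard", "fire_extinguisher", "door_sign"]
samples = [(f"{cls}_{i:03d}.jpg", cls) for cls in classes for i in range(200)]
splits = stratified_split(samples)
print({name: len(items) for name, items in splits.items()})
```

With exactly 200 pairs per class this yields 140 / 40 / 20 images per class, i.e. 560 / 160 / 80 in total, which is the shape of the Batch 04 split and the 80-image test set.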

1.1 Identical hyperparameters

| Hyper-parameter | Value |
|---|---|
| Backbone | yolo11n.pt (pretrained COCO) |
| Image size | 640 × 640 |
| Batch | 16 |
| Optimiser | SGD · lr0=0.01 · lrf=0.01 · momentum 0.937 · weight decay 5e-4 |
| Epochs | 100 (early stop, patience = 15) |
| Augmentation | mosaic 1.0 (closed last 10 ep) · HSV-S 0.7 · HSV-V 0.4 · fliplr 0.5 · randaugment · erasing 0.4 |
| Loss weights | box 7.5 · cls 0.5 · dfl 1.5 |
| Seed | 42 · deterministic |
| AMP | enabled |
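
For reference, the table above can be written as keyword arguments to the Ultralytics `train()` call. This is a hedged sketch rather than the NB04 code: the mapping of "closed last 10 ep" to `close_mosaic=10`, of "randaugment" to `auto_augment`, and the `data_yaml` path are assumptions, and the `train` wrapper is a hypothetical helper.

```python
# Hyperparameters from the table, expressed as Ultralytics train() arguments.
TRAIN_KWARGS = dict(
    imgsz=640,
    batch=16,
    epochs=100,
    patience=15,                 # early stopping
    optimizer="SGD",
    lr0=0.01,
    lrf=0.01,
    momentum=0.937,
    weight_decay=5e-4,
    mosaic=1.0,
    close_mosaic=10,             # assumed meaning of "closed last 10 ep"
    hsv_s=0.7,
    hsv_v=0.4,
    fliplr=0.5,
    auto_augment="randaugment",  # assumed mapping of "randaugment"
    erasing=0.4,
    box=7.5,
    cls=0.5,
    dfl=1.5,
    seed=42,
    deterministic=True,
    amp=True,
)


def train(data_yaml, weights="yolo11n.pt"):
    """Fine-tune from the COCO-pretrained checkpoint with the fixed settings."""
    # Imported lazily so the config can be inspected without ultralytics installed.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_KWARGS)
```

Under these assumptions, calling `train("data.yaml")` once per dataset version (the YAML path is a placeholder) would reproduce the controlled setup: the arguments stay fixed and only the dataset file changes.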

1.2 Improvement strategies applied (Project 2 brief)

This report maps to the Project 2 improvement-strategy catalogue as follows. The Project 1 model (Batch 04) is the official baseline; every strategy below is realised in Batch 05.

| Cat. | # | Strategy | How it is realised | Evidence |
|---|---|---|---|---|
| A | 2 | Fine-tune from a pretrained checkpoint (not random init) | Both runs start from yolo11n.pt (COCO-pretrained) | §1.1 |
| A | 4 | Early stopping + regularisation | patience=15, weight decay 5e-4; early stop fired at epoch 87 in Batch 05 | §1.1, §3.1 |
| B | 5 | Expand the dataset with images targeting underperforming classes | v2 dataset rebuilt with new HUB-campus captures aimed at the two worst Batch 04 classes (whiteboard, projector) | §2.1, §5.1 |
| B | 7 | Re-annotate / correct labels to address annotation noise | v2 rebuild pruned empty-label scenes and re-checked boxes; empty labels fall in every split, box density rises at constant image count | §2.3 |
| B | 8 | Improve class balance | v2 source pools sit between 238–249 across all four classes (vs. 200–848 in v1) | §2.1 |
| C | 10 | Multi-scale coverage for objects at different sizes | New HUB captures shot at varied subject distances, broadening the box-area distribution | §2.3 |
| C | 11 | Post-processing improvement (confidence-threshold calibration) | Confidence-threshold slider exposed in the live inference UI | inference UI |

A backbone upgrade (Category A #1) is the subject of the companion three-way report (Batch 06 — YOLOv11s). Together the two reports exercise strategies in all three categories.

1.3 Environment specification

| Component | Version |
|---|---|
| OS | Windows 11 Home (10.0.26200) |
| GPU | NVIDIA GeForce RTX 4060 |
| CUDA | 12.6 |
| cuDNN | 9.10.2 |
| Python | 3.13.12 |
| PyTorch | 2.11.0+cu126 |
| Ultralytics | 8.3.253 |
| Seed | 42 (deterministic) |

02 Dataset comparison

2.1 Source availability (before capping)

The new dataset narrows the gap between over- and under-sourced classes — trimming the previously dominant fire_extinguisher pool while broadening whiteboard and adding more in-house door_sign captures.

| Class | Batch 04 pairs | Batch 05 pairs | Δ |
|---|---|---|---|
| projector | 319 | 249 | −70 |
| whiteboard | 200 | 238 | +38 |
| fire_extinguisher | 848 | 248 | −600 |
| door_sign | 240 | 244 | +4 |
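
One way to put a number on "narrowing the gap" is the ratio between the largest and smallest source pool. The sketch below uses the counts from the table; the `imbalance_ratio` helper is an illustrative name and this ratio is not a metric reported by the project itself.

```python
# Source pairs per class before capping, copied from the table above.
SOURCE_PAIRS = {
    "Batch 04": {"projector": 319, "whiteboard": 200,
                 "fire_extinguisher": 848, "door_sign": 240},
    "Batch 05": {"projector": 249, "whiteboard": 238,
                 "fire_extinguisher": 248, "door_sign": 244},
}


def imbalance_ratio(pool_sizes):
    """Largest pool divided by smallest pool; 1.0 means perfectly balanced."""
    return max(pool_sizes.values()) / min(pool_sizes.values())


ratios = {batch: imbalance_ratio(pools) for batch, pools in SOURCE_PAIRS.items()}
deltas = {cls: SOURCE_PAIRS["Batch 05"][cls] - SOURCE_PAIRS["Batch 04"][cls]
          for cls in SOURCE_PAIRS["Batch 04"]}

for batch, ratio in ratios.items():
    print(f"{batch}: max/min = {ratio:.2f}")
print(deltas)
```

The ratio falls from about 4.2 in Batch 04 to about 1.05 in Batch 05, so the cap-to-200 step discards far less of any one class.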

2.2 Stratified split distribution

Batch 04 produced a perfectly balanced split; Batch 05 has a mild shortfall on projector after deduplication and filtering.

| Split | Batch 04 (p · w · f · d) | Batch 05 (p · w · f · d) |
|---|---|---|
| train | 140 · 140 · 140 · 140 | 123 · 140 · 140 · 140 |
| val | 40 · 40 · 40 · 40 | 36 · 40 · 40 · 40 |
| test | 20 · 20 · 20 · 20 | 18 · 20 · 20 · 20 |

Figure: Class distribution, Batch 04 (bar chart)
Figure: Class distribution, Batch 05 (bar chart)

2.3 Label density and box geometry

The updated dataset is meaningfully denser in bounding boxes — Batch 05 carries 1,151 total boxes vs Batch 04's 997, despite the identical image budget. Empty-label images dropped across every split.

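| Metric | Batch 04 | Batch 05 | Δ |
|---|---|---|---|
| train images | 560 | 560 | 0 |
| train boxes | 695 | 810 | +115 |
| train empty labels | 25 | 17 | −8 |
| val boxes | 200 | 229 | +29 |
| val empty labels | 7 | 4 | −3 |
| test boxes | 102 | 112 | +10 |
| test empty labels | 3 | 2 | −1 |
| tiny boxes (all splits) | 0 | 1 | +1 |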
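
The box totals quoted above follow directly from the per-split counts. A short sketch of that arithmetic, using only the table values (variable names are illustrative):

```python
# Annotated boxes per split, copied from the table above.
BOXES = {
    "Batch 04": {"train": 695, "val": 200, "test": 102},
    "Batch 05": {"train": 810, "val": 229, "test": 112},
}
TRAIN_IMAGES = 560  # identical for both batches

totals = {batch: sum(splits.values()) for batch, splits in BOXES.items()}
relative_increase = (totals["Batch 05"] - totals["Batch 04"]) / totals["Batch 04"]
train_density = {batch: splits["train"] / TRAIN_IMAGES
                 for batch, splits in BOXES.items()}

print(totals)
print(f"relative increase in boxes: {relative_increase:.1%}")
print({batch: round(d, 2) for batch, d in train_density.items()})
```

On the training split this works out to roughly 1.24 boxes per image in Batch 04 against 1.45 in Batch 05.
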

Figure: Box-area histogram, Batch 04
Figure: Box-area histogram, Batch 05
Figure: Image dimensions, Batch 04 (scatter plot)
Figure: Image dimensions, Batch 05 (scatter plot)
Figure: Label distribution · per-class XY, Batch 04
Figure: Label distribution · per-class XY, Batch 05

03 Training dynamics

3.1 Best-epoch summary

| Metric | Batch 04 | Batch 05 | Δ |
|---|---|---|---|
| Epochs trained | 100 | 87 · early stop | −13 |
| Best epoch | 60 | 70 | +10 |
| Best val mAP@0.5 | 0.9489 | 0.9874 | +0.0385 |
| Best val mAP@0.5:0.95 | 0.7251 | 0.8134 | +0.0883 |
| Final train box loss | 0.5109 | 0.5632 | +0.0523 |
| Final train cls loss | 0.3839 | 0.3976 | +0.0137 |
| Final val box loss | 0.8409 | 0.6743 | −0.1666 |
| Final val cls loss | 0.5516 | 0.4186 | −0.1330 |

Reading the gap. Train losses tick up slightly (the new dataset is harder to memorise, with more multi-instance scenes), while val losses fall sharply. That divergence is the desirable signature of better generalisation, not over-fitting.

3.2 Training curves

Figure: Training curves, Batch 04
Figure: Training curves, Batch 05

3.3 Sanity predictions (training-time)

Figure: Sanity predictions, Batch 04
Figure: Sanity predictions, Batch 05

04 Test evaluation: held-out · 80 images

4.1 Overall metrics

| Metric | Batch 04 | Batch 05 | Δ |
|---|---|---|---|
| mAP@0.5 | 0.9332 | 0.9876 | +0.0544 |
| mAP@0.5:0.95 | 0.7959 | 0.8728 | +0.0769 |
| Precision (macro) | 1.0000 | 1.0000 | 0.0000 |
| Recall (macro) | 0.8613 | 0.9804 | +0.1191 |

Headline: a +11.9 pp recall jump at unchanged precision (1.0). Batch 05 catches substantially more true positives without admitting a single new false positive.

4.2 Per-class metrics

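| Class | Metric | Batch 04 | Batch 05 | Δ |
|---|---|---|---|---|
| projector | precision | 1.000 | 1.000 | 0.000 |
| projector | recall | 0.8036 | 1.0000 | +0.1964 |
| projector | mAP@0.5 | 0.8964 | 0.9950 | +0.0986 |
| projector | mAP@0.5:0.95 | 0.7290 | 0.9377 | +0.2087 |
| whiteboard | precision | 1.000 | 1.000 | 0.000 |
| whiteboard | recall | 0.7625 | 0.9500 | +0.1875 |
| whiteboard | mAP@0.5 | 0.8761 | 0.9750 | +0.0989 |
| whiteboard | mAP@0.5:0.95 | 0.7767 | 0.9413 | +0.1646 |
| fire_extinguisher | precision | 1.000 | 1.000 | 0.000 |
| fire_extinguisher | recall | 0.9565 | 1.0000 | +0.0435 |
| fire_extinguisher | mAP@0.5 | 0.9780 | 0.9950 | +0.0170 |
| fire_extinguisher | mAP@0.5:0.95 | 0.9045 | 0.8762 | −0.0283 |
| door_sign | precision | 1.000 | 1.000 | 0.000 |
| door_sign | recall | 0.9224 | 0.9714 | +0.0490 |
| door_sign | mAP@0.5 | 0.9822 | 0.9855 | +0.0033 |
| door_sign | mAP@0.5:0.95 | 0.7733 | 0.7359 | −0.0374 |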

The two weakest classes in Batch 04, projector and whiteboard, were the largest beneficiaries. Both now reach at least 0.975 mAP@0.5 (0.9950 and 0.9750) and roughly 0.94 mAP@0.5:0.95 (0.9377 and 0.9413). The minor regressions on the stricter mAP@0.5:0.95 metric for fire_extinguisher and door_sign are localised tightness losses; both classes still record higher recall and higher mAP@0.5.
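
The per-class deltas on the strict metric can be recomputed from the table to confirm which classes regressed. A minimal sketch; the dictionary layout and variable names are illustrative, and the values are copied from §4.2.

```python
# (Batch 04, Batch 05) test mAP@0.5:0.95 per class, copied from the table.
MAP50_95 = {
    "projector":         (0.7290, 0.9377),
    "whiteboard":        (0.7767, 0.9413),
    "fire_extinguisher": (0.9045, 0.8762),
    "door_sign":         (0.7733, 0.7359),
}

deltas = {cls: round(b05 - b04, 4) for cls, (b04, b05) in MAP50_95.items()}
regressions = [cls for cls, delta in deltas.items() if delta < 0]
biggest_gain = max(deltas, key=deltas.get)

for cls, delta in deltas.items():
    print(f"{cls:18s} {delta:+.4f}")
print("regressed:", regressions)
print("largest gain:", biggest_gain)
```

Only fire_extinguisher and door_sign come out negative, and projector shows the largest gain.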

4.3 Precision–Recall curves

Figure: PR curve, Batch 04
Figure: PR curve, Batch 05

4.4 F1 curves

Figure: F1 curve, Batch 04
Figure: F1 curve, Batch 05

4.5 Confusion matrices · normalised

Figure: Confusion matrix · normalised, Batch 04
Figure: Confusion matrix · normalised, Batch 05

4.6 Confusion matrices · custom view

Figure: Confusion matrix · custom view, Batch 04
Figure: Confusion matrix · custom view, Batch 05

4.7 Qualitative predictions

Figure: Qualitative predictions, Batch 04
Figure: Qualitative predictions, Batch 05

4.8 Consolidated headline table (baseline vs. final)

A single side-by-side view of the official Project 1 baseline (Batch 04) against the Project 2 final model (Batch 05) on the held-out test split.

| Metric | Project 1 baseline (Batch 04) | Project 2 final (Batch 05) | Δ |
|---|---|---|---|
| Precision (macro) | 1.0000 | 1.0000 | 0.0000 |
| Recall (macro) | 0.8613 | 0.9804 | +0.1191 |
| mAP@0.5 | 0.9332 | 0.9876 | +0.0544 |
| mAP@0.5:0.95 | 0.7959 | 0.8728 | +0.0769 |
| projector mAP@0.5 | 0.8964 | 0.9950 | +0.0986 |
| whiteboard mAP@0.5 | 0.8761 | 0.9750 | +0.0989 |
| fire_extinguisher mAP@0.5 | 0.9780 | 0.9950 | +0.0170 |
| door_sign mAP@0.5 | 0.9822 | 0.9855 | +0.0033 |

04b Inference performance: runtime

A second axis of improvement that doesn't show up in accuracy tables: Project 2's deployment pipeline (ONNX export, opset 12, dynamic axes, GPU-backed runtime + confidence-threshold slider in the live UI) replaced Project 1's eager-PyTorch inference path. Same hardware (RTX 4060), same input size (640 × 640), YOLOv11n backbone.

| Metric | Project 1 inference path (eager PyTorch) | Project 2 inference path (ONNX + GPU) | Δ |
|---|---|---|---|
| Per-frame model latency (YOLOv11n) | 100–120 ms | 15–20 ms | ≈6–7× faster |
| Sustained end-to-end throughput | 4–5 FPS | 35–40 FPS | ≈8× higher |

The throughput uplift outpaces the raw model-latency drop because the Project 1 path also paid for per-frame CPU↔GPU copies and webcam-sync stalls; the Project 2 ONNX path keeps the model resident on the GPU and decouples capture from inference. Batch 05's YOLOv11n backbone (2.6 M params, 6.5 GFLOPs) is comfortably real-time on this pipeline with headroom to spare.
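
The size of the non-model overhead can be estimated from the table by subtracting model latency from the end-to-end frame time. This is a rough sketch only: it assumes the midpoint of each reported range is representative, and the helper names are illustrative.

```python
def midpoint(low, high):
    return (low + high) / 2


# Ranges copied from the table above.
PATHS = {
    "Project 1 (eager PyTorch)": {"latency_ms": (100, 120), "fps": (4, 5)},
    "Project 2 (ONNX + GPU)":    {"latency_ms": (15, 20),   "fps": (35, 40)},
}


def non_model_overhead_ms(latency_ms, fps):
    """Per-frame time not accounted for by the model, using range midpoints."""
    frame_time_ms = 1000.0 / midpoint(*fps)
    return frame_time_ms - midpoint(*latency_ms)


overhead = {name: non_model_overhead_ms(**path) for name, path in PATHS.items()}
for name, ms in overhead.items():
    print(f"{name}: ~{ms:.0f} ms per frame outside the model")
```

Under these assumptions, roughly 112 ms per frame sits outside the model on the Project 1 path against about 9 ms on the Project 2 path, which is consistent with throughput improving more than model latency alone would predict.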

05 Discussion

5.1 Why did Batch 05 win?

The improvement is not a hyperparameter win — every knob was identical. Three dataset-level changes explain the lift:

  1. Better source balance. Batch 04's fire_extinguisher pool (848 candidates) and projector pool (319) dominated the cap-to-200 step, so surviving training images were drawn from a narrow style distribution. Batch 05's source pools sit between 238–249 across all four classes — the cap is a much shallower filter and per-class style coverage is broader.
  2. More custom in-house data. Batch 05 adds more HUB-campus captures to the door_sign set with materially different framing variety, and the new captures broaden the projector and whiteboard distributions toward real classroom conditions — which the test split also samples.
  3. Multi-instance scenes. Total annotated boxes climbed from 997 → 1,151 (+15.4%) at constant image count. The model sees more boxes per gradient step and learns to handle co-occurring objects, lifting recall and the stricter IoU buckets that dominate mAP@0.5:0.95.

5.2 Why early-stop at 87 epochs?

The patience-15 trigger fired because val mAP plateaued shortly after the epoch-70 best. Combined with the lower final val losses, Batch 05 reached a flatter optimum earlier — more data of higher quality reduced the gradient steps needed to saturate.
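
The patience rule can be sketched as below. This is a simplified illustration on a toy score curve, not the Ultralytics implementation: the library is understood to monitor a combined fitness score rather than mAP@0.5 alone, so the reported stop epoch need not equal the best-mAP epoch plus 15. `early_stop_epoch` and the toy history are hypothetical.

```python
def early_stop_epoch(scores, patience=15):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if training would run to the end of `scores`.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            return epoch, best_epoch
    return None, best_epoch


# Toy validation curve: improves for 10 epochs, then plateaus below the best.
toy_history = [i / 10 for i in range(1, 11)] + [0.5] * 30
stop_epoch, best_epoch = early_stop_epoch(toy_history, patience=15)
print(stop_epoch, best_epoch)
```

On the toy curve the best score lands at epoch 10 and the rule stops training at epoch 25, fifteen epochs later.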

5.3 Train-loss vs val-loss split

Batch 04 gap (final val − train box loss): 0.84 − 0.51 = 0.33
Batch 05 gap (final val − train box loss): 0.67 − 0.56 = 0.11

A higher train loss with a lower val loss is the canonical "more representative training data" outcome. Batch 05's training set is harder (more boxes per image, more scene variety) but the model that learns to fit it transfers cleanly to val and test.
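
The same gap can be computed for both loss terms from the §3.1 table. A minimal sketch with illustrative variable names; the classification-loss gap is derived here from the reported values and is not quoted elsewhere in the report.

```python
# Final-epoch losses copied from the best-epoch summary table.
FINAL_LOSS = {
    "Batch 04": {"box": {"train": 0.5109, "val": 0.8409},
                 "cls": {"train": 0.3839, "val": 0.5516}},
    "Batch 05": {"box": {"train": 0.5632, "val": 0.6743},
                 "cls": {"train": 0.3976, "val": 0.4186}},
}

# Generalisation gap = validation loss minus training loss.
gaps = {
    batch: {term: round(loss["val"] - loss["train"], 4)
            for term, loss in terms.items()}
    for batch, terms in FINAL_LOSS.items()
}

for batch, terms in gaps.items():
    print(batch, terms)
```

Both gaps shrink in Batch 05: the box-loss gap falls from 0.33 to about 0.11, and the classification-loss gap from about 0.17 to about 0.02.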

5.4 Remaining failure modes

door_sign and fire_extinguisher both regressed on the strict mAP@0.5:0.95 metric (−3.7 pp and −2.8 pp respectively) even though their recall and mAP@0.5 improved. These two classes are the smallest in absolute pixel area in the test split, so even a few pixels of localisation drift is penalised heavily at the higher IoU thresholds. A follow-up batch should consider:

  • Higher-resolution training (imgsz=896 or 1024) for small-object refinement.
  • Stronger geometric augmentation (scale=0.75, degrees=10) to teach tighter localisation under perspective.
  • A YOLOv11s upgrade (9.4M params) — the variant_comparison.csv slot is reserved for this experiment.
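
In keeping with the one-variable-at-a-time design of this report, the three follow-ups could be laid out as separate override sets on a shared base configuration. The sketch below is an assumption about how such runs might be organised; the experiment names and the `build_runs` helper are hypothetical, and only settings stated in this report appear in the base.

```python
# Settings shared with Batch 05, as stated in the report.
BASE = {"model": "yolo11n.pt", "imgsz": 640, "epochs": 100,
        "patience": 15, "seed": 42}

# One override set per proposed follow-up (experiment names are illustrative).
FOLLOW_UPS = {
    "high_res": {"imgsz": 896},
    "geo_aug":  {"scale": 0.75, "degrees": 10},
    "yolo11s":  {"model": "yolo11s.pt"},
}


def build_runs(base, follow_ups):
    """Apply each override set to a copy of the base config, one change per run."""
    return {name: {**base, **overrides} for name, overrides in follow_ups.items()}


runs = build_runs(BASE, FOLLOW_UPS)
for name, config in runs.items():
    changed = {k: v for k, v in config.items() if BASE.get(k) != v}
    print(name, "changes", changed)
```

Keeping each run to a single change means any metric delta against Batch 05 can be read the same way the Batch 04 vs Batch 05 delta is read here.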

05b Error analysis, with example images

The remaining errors in Batch 05 are concentrated in two test-time patterns:

  1. Small-pixel-area door_sign localisation drift. Recall climbs to 0.9714 on the loose-IoU bucket but mAP@0.5:0.95 actually drops 3.7 pp vs. Batch 04. Inspection of the qualitative grid (figures/batch05/qualitative_predictions.png) shows correctly classified door signs whose predicted box is ~5–10 pixels off on the long edge. This is a localisation issue, not a recognition issue — visible in the confusion matrix (figures/batch05/confusion_matrix_normalized.png) where the off-diagonal mass is essentially zero.
  2. fire_extinguisher tightness loss. A −2.8 pp regression on mAP@0.5:0.95 despite a +1.7 pp gain on mAP@0.5 (0.9780 → 0.9950). Same signature as door_sign: correct detection, slightly loose box. The v2 dataset has a much smaller fire_extinguisher source pool than v1 (248 vs. 848 pairs), so the model may see a narrower range of poses than before.
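
To see why a few pixels of drift hurts small boxes more on mAP@0.5:0.95, the sketch below computes IoU for the same 8-pixel horizontal shift on a small and a large box. The box sizes (60 × 20 px and 300 × 200 px) and the shift are hypothetical values chosen for illustration, not measurements from the test split.

```python
def iou(box_a, box_b):
    """Intersection over union for (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx):
    x1, y1, x2, y2 = box
    return (x1 + dx, y1, x2 + dx, y2)


# The ten IoU thresholds averaged by mAP@0.5:0.95.
THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

small_box = (0, 0, 60, 20)     # hypothetical small door sign
large_box = (0, 0, 300, 200)   # hypothetical large object

results = {}
for name, box in (("small", small_box), ("large", large_box)):
    score = iou(box, shifted(box, dx=8))
    passed = sum(score >= t for t in THRESHOLDS)
    results[name] = (score, passed)
    print(f"{name}: IoU = {score:.3f}, counts as a match at {passed}/10 thresholds")
```

With these illustrative sizes the small box still matches at the loose thresholds but fails the stricter ones, while the large box passes nearly all of them, which is the pattern described above: recall and mAP@0.5 hold up while mAP@0.5:0.95 drops.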

Representative figures (already embedded in §4.5 and §4.7):

  • figures/batch04/qualitative_predictions.png vs. figures/batch05/qualitative_predictions.png — same scenes, tighter boxes in Batch 05 except on door signs.
  • figures/batch04/confusion_matrix_normalized.png vs. figures/batch05/confusion_matrix_normalized.png — Batch 05's diagonal is brighter for projector and whiteboard; off-diagonal mass is unchanged.

A follow-up batch should consider higher-resolution training (imgsz=896) for small-object refinement (covered in §5.4).

05c Ethics

All ethical constraints from Project 1 continue to apply and were re-checked for the v2 dataset:

  • No identifiable faces. All HUB-campus captures used for the v2 corpus were framed on infrastructure (projectors, whiteboards, fire extinguishers, door signs). Any incidental human presence was screened out during the v2 curation pass; no face is recognisable in any retained image.
  • No license plates or vehicles. Captures are interior classroom and corridor scenes; no parking areas were photographed.
  • No personal data. No names, ID cards, schedules, or contact information are visible in any image.
  • Provenance. Roboflow and Kaggle source images were used under their published open licences; HUB-campus captures were taken on-site by the author for the purpose of this project.

06 Conclusion

Re-training on the updated custom dataset — same model, same hyperparameters, same seed — moved every meaningful headline metric in the right direction:

  • +5.4 pp test mAP@0.5 → 0.9876
  • +7.7 pp test mAP@0.5:0.95 → 0.8728
  • +11.9 pp macro recall at unchanged precision (1.0) → 0.9804
  • Three of four classes now exceed mAP@0.5 = 0.985; the fourth sits at 0.975
The dataset refresh is the single highest-leverage intervention applied to this project to date. Batch 05 is recommended as the shippable model and the new baseline for any further architecture or augmentation experiments.

Artifacts. Trained weights and ONNX exports live under model_outputs/04_training_batch_1_48am_24_04_2026/ and model_outputs/05_training_batch_12_06am_12_05_2026/. Top-level Batch 05 ONNX: 05_model_weights/best.onnx. Source notebooks under notebooks/01..08.