Batch 04 vs Batch 05 vs Batch 06

Three single-variable training runs of the campus infrastructure detector, decomposing the contribution of each change. 04 is the YOLOv11n baseline. 05 swaps in the updated custom dataset (dataset uplift). 06 upgrades the backbone to YOLOv11s while holding that dataset fixed (model uplift).

Batch 04 · YOLOv11n · v1 dataset · 100 ep
Batch 05 · YOLOv11n · v2 dataset · 87 ep (ES)
Batch 06 · YOLOv11s · v2 dataset · 100 ep

Test mAP@0.5 (best): 0.9876 (Batch 05 · +5.4 pp vs Batch 04)
Test mAP@0.5:0.95 (best): 0.8808 (Batch 06 · +8.5 pp vs Batch 04)
Test recall (macro, best): 0.9804 (Batch 05 · +11.9 pp vs Batch 04)
Highest-leverage change: Dataset (about 91 % of the cumulative mAP@0.5:0.95 gain; all of the net mAP@0.5 and macro-recall gain)

Abstract

What changed, and what it bought us

This report contrasts three consecutive training runs of the campus infrastructure detector (projector, whiteboard, fire_extinguisher, door_sign) and isolates the contribution of each change. Batch 04 is the baseline: YOLOv11n on the original aggregated dataset. Batch 05 holds the model fixed and swaps in an improved custom dataset with rebalanced sourcing, denser annotation, and more in-house captures — the dataset uplift. Batch 06 holds that improved dataset fixed and swaps the backbone from YOLOv11n (2.6M params) to YOLOv11s (9.4M params) — the model uplift.

The headline finding is that the dataset change moved the needle far more than the model upgrade. Going from Batch 04 → 05, test mAP@0.5 jumps +5.4 pp and macro recall jumps +11.9 pp at unchanged precision (1.000). Going from Batch 05 → 06, test mAP@0.5 actually dips slightly (−0.8 pp) but the stricter mAP@0.5:0.95 climbs +0.8 pp and the whiteboard class tightens dramatically (mAP@0.5:0.95 +3.1 pp). Batch 06 also pays a small precision cost on door_sign (1.0000 → 0.9697) — the first non-perfect precision in the project's history.

00b · Baseline diagnosis

Project 1 → Project 2 motivation

The Project 1 model (Batch 04) shipped with strong precision but uneven recall and localisation:

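| Class | P (test) | R (test) | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- | --- | --- |
| projector | 1.00 | 0.8036 | 0.8964 | 0.7290 |
| whiteboard | 1.00 | 0.7625 | 0.8761 | 0.7767 |
| fire_extinguisher | 1.00 | 0.9565 | 0.9780 | 0.9045 |
| door_sign | 1.00 | 0.9224 | 0.9822 | 0.7733 |
| macro | 1.0000 | 0.8613 | 0.9332 | 0.7959 |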

Four failure patterns drove the Project 2 plan:

  1. projector and whiteboard recall roughly 12–19 pp below the other two classes (0.80 and 0.76 against 0.96 and 0.92). Both classes were sourced from a narrow style distribution in v1, so the model had never seen enough HUB-campus framings. Counter-strategy: Cat. B #5 (targeted dataset expansion), #8 (class balance), C #10 (multi-scale capture distances).
  2. fire_extinguisher over-represented in v1 (848 source pairs vs. 200–319 for others). Counter-strategy: Cat. B #5/#7/#8 — rebuild the corpus with balanced sourcing and re-checked labels.
  3. Even after the dataset uplift, whiteboard and door_sign retained sub-pixel localisation drift on the strict mAP@0.5:0.95 metric. Counter-strategy: Cat. A #1 — upgrade the backbone to YOLOv11s (9.4 M params) for finer-grained box regression.
  4. Live FPS at 4–5 on the Project 1 path made the model unusable for real-time UI demos and made the larger backbone in (3) a deployment liability. Counter-strategy: Cat. C #11 — redesign the inference pipeline around ONNX + GPU runtime with a confidence-threshold slider.

Batches 05 and 06 implement the diagnoses as two single-variable steps, so the test-metric deltas are causally attributable to (a) the dataset interventions and (b) the backbone upgrade respectively.

01 · Runs at a glance

Three single-variable steps

Each batch changes exactly one variable from the previous one. Every other hyper-parameter (lr, batch, schedule, augmentation, seed) is held fixed.

Baseline

Batch 04

YOLOv11n · v1 dataset

Backbone: yolo11n.pt
Params: 2.6 M
Dataset: v1 (original)
Train images: 560
Epochs: 100 / 100
Best epoch: 60

Establishes the project baseline. Perfectly balanced 800-image v1 dataset; no early stop.

Dataset uplift

Batch 05

YOLOv11n · v2 dataset

Backbone: yolo11n.pt
Params: 2.6 M
Dataset: v2 (updated)
Train images: 560
Epochs: 87 / 100 (ES)
Best epoch: 70

Same recipe, new data. Rebalanced sources, denser annotation (+15.4 % more boxes), more in-house captures.

Model uplift

Batch 06

YOLOv11s · v2 dataset

Backbone: yolo11s.pt
Params: 9.4 M
Dataset: v2 (same as 05)
Train images: 560
Epochs: 100 / 100
Best epoch: 83

Same dataset, bigger model. 3.6× parameters and 3.3× GFLOPs over Batches 04 and 05.

02 · Experimental setup

Shared pipeline & hyperparameters

All three runs flow through the same notebook pipeline (nb01_data_collection → nb05_model_evaluation) on the same hardware (CUDA · RTX 4060) and with identical hyperparameters, apart from the two variables under study.

Shared hyperparameters
| Hyper-parameter | Value |
| --- | --- |
| Image size | 640 × 640 |
| Batch | 16 |
| Optimiser | SGD · lr0=0.01 · lrf=0.01 · momentum 0.937 · wd 5e-4 |
| Epochs | 100 (early stop, patience = 15) |
| Augmentation | mosaic 1.0 (closed last 10 ep), HSV-S 0.7, HSV-V 0.4, fliplr 0.5, randaugment, erasing 0.4 |
| Loss weights | box 7.5 · cls 0.5 · dfl 1.5 |
| Seed | 42 (deterministic) |
| AMP | enabled (FP16) |

What changes between batches
| Variable | Batch 04 | Batch 05 | Batch 06 |
| --- | --- | --- | --- |
| Backbone | yolo11n.pt · 2.6 M · 6.5 GFLOPs | yolo11n.pt · 2.6 M · 6.5 GFLOPs | yolo11s.pt · 9.4 M · 21.6 GFLOPs |
| Dataset version | v1 (original aggregated) | v2 (updated custom) | v2 (same as Batch 05) |

Two single-variable steps. Batch 04 → 05 isolates dataset quality. Batch 05 → 06 isolates model capacity.
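
The shared recipe maps fairly directly onto Ultralytics training arguments. The following is a minimal sketch, assuming the training notebook calls `YOLO.train` roughly like this; the `train_batch` helper, the `BACKBONES` mapping and the run names are hypothetical, and the dataset YAML path is the one quoted in the appendix.

```python
# Shared hyperparameters from the report, expressed as Ultralytics train()
# keyword arguments. The mapping is an assumption about how the notebooks
# are wired, not a copy of them.
SHARED_TRAIN_ARGS = {
    "imgsz": 640,
    "batch": 16,
    "optimizer": "SGD",
    "lr0": 0.01,
    "lrf": 0.01,
    "momentum": 0.937,
    "weight_decay": 5e-4,
    "epochs": 100,
    "patience": 15,          # early stopping
    "mosaic": 1.0,
    "close_mosaic": 10,      # mosaic switched off for the last 10 epochs
    "hsv_s": 0.7,
    "hsv_v": 0.4,
    "fliplr": 0.5,
    "auto_augment": "randaugment",
    "erasing": 0.4,
    "box": 7.5,
    "cls": 0.5,
    "dfl": 1.5,
    "seed": 42,
    "deterministic": True,
    "amp": True,
}

# The backbone is the only argument that differs between batches; the dataset
# version is selected by rebuilding the data behind the YAML file.
BACKBONES = {"04": "yolo11n.pt", "05": "yolo11n.pt", "06": "yolo11s.pt"}


def train_batch(batch_id: str, data_yaml: str = "data/dataset/data.yaml"):
    """Hypothetical helper: train one batch with the shared recipe."""
    from ultralytics import YOLO  # lazy import keeps the config importable on its own

    model = YOLO(BACKBONES[batch_id])
    return model.train(data=data_yaml, name=f"batch{batch_id}", **SHARED_TRAIN_ARGS)
```

Keeping every shared value in one dictionary makes the single-variable claim easy to audit: a diff between two runs should show only the backbone string or the dataset contents.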

Improvement strategies applied · Project 2 brief

Batch 04 is the Project 1 baseline; Batches 05 and 06 stack improvements on top.

| Cat. | # | Strategy | How it is realised | First seen in |
| --- | --- | --- | --- | --- |
| A | 1 | Upgrade / switch backbone | YOLOv11n (2.6 M params) → YOLOv11s (9.4 M params, 21.6 GFLOPs) | Batch 06 |
| A | 2 | Fine-tune from a pretrained checkpoint (not random init) | All runs start from yolo11n.pt / yolo11s.pt COCO-pretrained weights | All |
| A | 4 | Early stopping + regularisation | patience=15, weight decay 5e-4; Batch 05 early-stopped at epoch 87 | All |
| B | 5 | Expand dataset with images targeting underperforming classes | v2 rebuilt with new HUB-campus captures aimed at the two worst Batch 04 classes (whiteboard, projector) | 05/06 |
| B | 7 | Re-annotate / correct labels | v2 rebuild pruned empty-label scenes and re-checked boxes; empty labels fall in every split | 05/06 |
| B | 8 | Improve class balance | v2 source pools sit between 238–249 across all four classes (vs. 200–848 in v1) | 05/06 |
| C | 10 | Multi-scale coverage for objects at different sizes | New HUB captures shot at varied subject distances, broadening the box-area distribution | 05/06 |
| C | 11 | Post-processing improvement (confidence-threshold calibration) | Confidence-threshold slider exposed in the live inference UI | downstream UI |

The trio exercises strategies across all three categories — A (1, 2, 4), B (5, 7, 8), C (10, 11) — well beyond the brief's "≥ 2 strategies, at least one from Category A" requirement.

Environment specification
| Component | Version |
| --- | --- |
| OS | Windows 11 Home (10.0.26200) |
| GPU | NVIDIA GeForce RTX 4060 |
| CUDA | 12.6 |
| cuDNN | 9.10.2 |
| Python | 3.13.12 |
| PyTorch | 2.11.0+cu126 |
| Ultralytics | 8.3.253 |
| Seed | 42 (deterministic) |

03 · Dataset comparison

v1 (Batch 04) vs v2 (Batches 05 + 06)

Batches 05 and 06 share the same dataset, so the dataset-level deltas are between Batch 04 (v1) and Batches 05/06 (v2).

Source availability (before capping)

| Class | v1 available pairs | v2 available pairs | Δ |
| --- | --- | --- | --- |
| projector | 319 | 249 | −70 |
| whiteboard | 200 | 238 | +38 |
| fire_extinguisher | 848 | 248 | −600 |
| door_sign | 240 | 244 | +4 |
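
One way to put a single number on the rebalancing is the ratio between the largest and smallest class pool. The sketch below uses only the counts from the source-availability table; the `imbalance_ratio` helper is illustrative and not part of the project notebooks.

```python
# Source pairs available per class before capping (from the report's table).
available = {
    "v1": {"projector": 319, "whiteboard": 200, "fire_extinguisher": 848, "door_sign": 240},
    "v2": {"projector": 249, "whiteboard": 238, "fire_extinguisher": 248, "door_sign": 244},
}


def imbalance_ratio(counts: dict) -> float:
    """Largest class pool divided by the smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())


for version, counts in available.items():
    print(f"{version}: largest/smallest pool = {imbalance_ratio(counts):.2f}")
```

On these counts the ratio falls from 4.24 in v1 to 1.05 in v2, which is the class-balance change that strategy B #8 refers to.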

Stratified split distribution

| Split | v1 (proj / wb / fe / ds) | v2 (proj / wb / fe / ds) |
| --- | --- | --- |
| train | 140 / 140 / 140 / 140 | 123 / 140 / 140 / 140 |
| val | 40 / 40 / 40 / 40 | 36 / 40 / 40 / 40 |
| test | 20 / 20 / 20 / 20 | 18 / 20 / 20 / 20 |

Class distribution

[Figure] Batch 04 · Class distribution · v1 (perfectly balanced 200/200/200/200)
[Figure] Batch 05 · Class distribution · v2 (179 projectors, 200 each elsewhere)
[Figure] Batch 06 · Class distribution · same v2 as Batch 05

Label density & geometry

The v2 dataset is meaningfully denser in bounding boxes at constant image count. Empty-label images drop across every split — the curator pruned scenes with no visible target and added multi-instance scenes.

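| Metric | v1 (Batch 04) | v2 (Batches 05 + 06) | Δ |
| --- | --- | --- | --- |
| train boxes | 695 | 810 | +115 |
| train empty labels | 25 | 17 | −8 |
| val boxes | 200 | 229 | +29 |
| val empty labels | 7 | 4 | −3 |
| test boxes | 102 | 112 | +10 |
| test empty labels | 3 | 2 | −1 |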
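
The density change can be checked with a few lines of arithmetic. This sketch uses only the box counts from the table; the helper name is illustrative. Summed over all three splits, the increase matches the +15.4 % quoted in the Batch 05 summary, which suggests that figure is the all-split total rather than the train-only change.

```python
# (v1, v2) bounding-box counts per split, taken from the label-density table.
boxes = {"train": (695, 810), "val": (200, 229), "test": (102, 112)}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for split, (v1, v2) in boxes.items():
    print(f"{split}: {pct_change(v1, v2):+.1f} % boxes")

total_v1 = sum(v1 for v1, _ in boxes.values())
total_v2 = sum(v2 for _, v2 in boxes.values())
print(f"all splits: {total_v1} -> {total_v2} ({pct_change(total_v1, total_v2):+.1f} %)")
```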

Box-area histogram

[Figure] Batch 04 · Box-area distribution · v1
[Figure] Batch 05 · Box-area distribution · v2
[Figure] Batch 06 · Box-area distribution · v2

Image-dimension scatter

[Figure] Batch 04 · Image dimensions · v1
[Figure] Batch 05 · Image dimensions · v2
[Figure] Batch 06 · Image dimensions · v2

Label distribution (per-class XY centres)

[Figure] Batch 04 · Class frequencies & XY heatmaps
[Figure] Batch 05 · Class frequencies & XY heatmaps
[Figure] Batch 06 · Class frequencies & XY heatmaps

04 · Training dynamics

How the three runs converged

Best-epoch summary

| Metric | Batch 04 | Batch 05 | Batch 06 | Δ (05−04) | Δ (06−05) |
| --- | --- | --- | --- | --- | --- |
| Epochs trained | 100 | 87 (ES) | 100 | −13 | +13 |
| Best epoch | 60 | 70 | 83 | +10 | +13 |
| Best val mAP@0.5 | 0.9489 | 0.9874 | 0.9820 | +0.0385 | −0.0054 |
| Best val mAP@0.5:0.95 | 0.7251 | 0.8134 | 0.8394 | +0.0883 | +0.0260 |
| Final train box-loss | 0.5109 | 0.5632 | 0.4186 | +0.0523 | −0.1446 |
| Final train cls-loss | 0.3839 | 0.3976 | 0.2298 | +0.0137 | −0.1678 |
| Final val box-loss | 0.8409 | 0.6743 | 0.6505 | −0.1666 | −0.0238 |
| Final val cls-loss | 0.5516 | 0.4186 | 0.3342 | −0.1330 | −0.0844 |

04 → 05 (dataset uplift): train losses tick up a hair (v2 is harder to memorise — more multi-instance scenes) while val losses fall sharply — the model is generalising better. 05 → 06 (model uplift): every loss drops, including the training losses, because the larger backbone has more capacity to fit the same data. Val mAP@0.5 dipping 0.5 pp despite the loss drop tells us the loose-IoU bucket was already saturated; the extra capacity went into refining localisation (visible in the stricter mAP@0.5:0.95 jump).
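
The distinction between the loose and strict metrics is easier to see with a concrete box pair. The sketch below assumes the usual COCO-style convention in which mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95; the boxes are made-up illustrative values, not project data.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds averaged by mAP@0.5:0.95 (0.50, 0.55, ..., 0.95).
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]


def thresholds_passed(pred, gt):
    """How many of the ten IoU thresholds a prediction clears."""
    score = iou(pred, gt)
    return sum(score >= t - 1e-9 for t in THRESHOLDS)


gt = (100, 100, 200, 200)          # hypothetical ground-truth box
loose = (110, 110, 210, 210)       # 10 px off in both axes, IoU about 0.68
tight = (102, 102, 202, 202)       # 2 px off in both axes, IoU about 0.92

print(thresholds_passed(loose, gt), thresholds_passed(tight, gt))
```

Both predictions count as hits at IoU 0.5, so mAP@0.5 cannot tell them apart; the tighter box clears 9 of the 10 thresholds against 4 for the looser one, which is the kind of difference only mAP@0.5:0.95 rewards.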

Training curves

[Figure] Batch 04 · Train+val loss · val mAP curves
[Figure] Batch 05 · Train+val loss · val mAP curves
[Figure] Batch 06 · Train+val loss · val mAP curves

Sanity predictions (training-time)

[Figure] Batch 04 · Sanity inference grid
[Figure] Batch 05 · Sanity inference grid
[Figure] Batch 06 · Sanity inference grid

05 · Test-set evaluation

Held-out 80-image test split

Overall metrics

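| Metric | Batch 04 | Batch 05 | Batch 06 | Δ (05−04) | Δ (06−05) |
| --- | --- | --- | --- | --- | --- |
| mAP@0.5 | 0.9332 | 0.9876 | 0.9792 | +0.0544 | −0.0084 |
| mAP@0.5:0.95 | 0.7959 | 0.8728 | 0.8808 | +0.0769 | +0.0080 |
| Precision (macro) | 1.0000 | 1.0000 | 0.9924 | 0.0000 | −0.0076 |
| Recall (macro) | 0.8613 | 0.9804 | 0.9647 | +0.1191 | −0.0157 |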

Per-class metrics

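| Class | Metric | 04 | 05 | 06 | Best |
| --- | --- | --- | --- | --- | --- |
| projector | precision | 1.0000 | 1.0000 | 1.0000 | tie |
| projector | recall | 0.8036 | 1.0000 | 0.9444 | 05 |
| projector | mAP@0.5 | 0.8964 | 0.9950 | 0.9720 | 05 |
| projector | mAP@0.5:0.95 | 0.7290 | 0.9377 | 0.9398 | 06 |
| whiteboard | precision | 1.0000 | 1.0000 | 1.0000 | tie |
| whiteboard | recall | 0.7625 | 0.9500 | 1.0000 | 06 |
| whiteboard | mAP@0.5 | 0.8761 | 0.9750 | 0.9950 | 06 |
| whiteboard | mAP@0.5:0.95 | 0.7767 | 0.9413 | 0.9724 | 06 |
| fire_extinguisher | precision | 1.0000 | 1.0000 | 1.0000 | tie |
| fire_extinguisher | recall | 0.9565 | 1.0000 | 1.0000 | 05 / 06 |
| fire_extinguisher | mAP@0.5 | 0.9780 | 0.9950 | 0.9950 | 05 / 06 |
| fire_extinguisher | mAP@0.5:0.95 | 0.9045 | 0.8762 | 0.8706 | 04 |
| door_sign | precision | 1.0000 | 1.0000 | 0.9697 | 04 / 05 |
| door_sign | recall | 0.9224 | 0.9714 | 0.9143 | 05 |
| door_sign | mAP@0.5 | 0.9822 | 0.9855 | 0.9548 | 05 |
| door_sign | mAP@0.5:0.95 | 0.7733 | 0.7359 | 0.7403 | 04 |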

Two clear class-level stories. whiteboard is the headline winner of Batch 06: recall is now perfect (1.0000), mAP@0.5 saturates at 0.995, and mAP@0.5:0.95 rises +3.1 pp, the largest per-class mAP@0.5:0.95 gain of the 05 → 06 step. Conversely, door_sign regressed in Batch 06: precision broke its perfect streak (0.9697) and recall fell 5.7 pp. Individual classes had slipped on a single metric before (fire_extinguisher and door_sign both lost mAP@0.5:0.95 in the 04 → 05 step), but this is the first time a class has lost both precision and recall when a single variable was changed.
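
The macro rows in the overall table are plain unweighted means of the per-class rows. A short sketch, using the Batch 06 per-class values from the table and illustrative key names, reproduces them:

```python
from statistics import mean

# Batch 06 per-class test metrics, copied from the per-class table.
per_class_06 = {
    "projector":         {"precision": 1.0000, "recall": 0.9444, "map50": 0.9720, "map50_95": 0.9398},
    "whiteboard":        {"precision": 1.0000, "recall": 1.0000, "map50": 0.9950, "map50_95": 0.9724},
    "fire_extinguisher": {"precision": 1.0000, "recall": 1.0000, "map50": 0.9950, "map50_95": 0.8706},
    "door_sign":         {"precision": 0.9697, "recall": 0.9143, "map50": 0.9548, "map50_95": 0.7403},
}

# Macro average: every class counts equally, regardless of how many boxes it has.
macro_06 = {
    metric: mean(row[metric] for row in per_class_06.values())
    for metric in ("precision", "recall", "map50", "map50_95")
}

for metric, value in macro_06.items():
    print(f"{metric}: {value:.4f}")
```

Because the average is unweighted, the single door_sign precision drop to 0.9697 is enough to pull macro precision down to 0.9924 even though the other three classes stay at 1.0000.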

Precision–Recall and F1 curves

[Figure] Batch 04 · Per-class PR curve
[Figure] Batch 05 · Per-class PR curve
[Figure] Batch 06 · Per-class PR curve
[Figure] Batch 04 · F1 vs confidence
[Figure] Batch 05 · F1 vs confidence
[Figure] Batch 06 · F1 vs confidence

Confusion matrices

Normalised (Ultralytics)

[Figure] Batch 04 · Column-normalised confusion matrix
[Figure] Batch 05 · Column-normalised confusion matrix
[Figure] Batch 06 · Column-normalised confusion matrix

Custom dual-panel view

[Figure] Batch 04 · Counts + normalised
[Figure] Batch 05 · Counts + normalised
[Figure] Batch 06 · Counts + normalised

Qualitative predictions

[Figure] Batch 04 · 8 random test images
[Figure] Batch 05 · 8 random test images
[Figure] Batch 06 · 8 random test images

Consolidated headline table · baseline vs. each experiment vs. final

A single side-by-side view as required by the Project 2 brief:

| Metric | Project 1 baseline (04) | Experiment A · dataset (05) | Final combined (06) | Δ (final − baseline) |
| --- | --- | --- | --- | --- |
| Precision (macro) | 1.0000* | 1.0000* | 0.9924 | −0.0076 |
| Recall (macro) | 0.8613 | 0.9804* | 0.9647 | +0.1034 |
| mAP@0.5 | 0.9332 | 0.9876* | 0.9792 | +0.0460 |
| mAP@0.5:0.95 | 0.7959 | 0.8728 | 0.8808* | +0.0849 |
| projector mAP@0.5:0.95 | 0.7290 | 0.9377 | 0.9398* | +0.2108 |
| whiteboard mAP@0.5:0.95 | 0.7767 | 0.9413 | 0.9724* | +0.1957 |
| fire_extinguisher mAP@0.5:0.95 | 0.9045* | 0.8762 | 0.8706 | −0.0339 |
| door_sign mAP@0.5:0.95 | 0.7733* | 0.7359 | 0.7403 | −0.0330 |

* = best across the three batches per row. The final combined model wins on the strict mAP@0.5:0.95 macro and on two of four classes' mAP@0.5:0.95 (projector and whiteboard); the Batch 04 baseline still holds the best fire_extinguisher and door_sign values on that metric.

05b · Inference performance

Runtime — Project 1 path vs. Project 2 path

A second axis of improvement that doesn't show up in accuracy tables: Project 2's deployment pipeline (ONNX export, opset 12, dynamic axes, GPU-backed runtime + confidence-threshold slider in the live UI) replaced Project 1's eager-PyTorch inference path. Same hardware (RTX 4060), same input size (640 × 640).
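
A minimal sketch of what such an export step could look like with the Ultralytics export API. The opset, dynamic-axes and simplification settings are the ones this report names; the use of ONNX Runtime with the CUDA execution provider as the "GPU-backed runtime", and both helper names, are assumptions rather than a description of the project's actual deployment code.

```python
# Export settings named in the report (opset 12, dynamic axes, simplified graph).
EXPORT_ARGS = {"format": "onnx", "opset": 12, "dynamic": True, "simplify": True, "imgsz": 640}


def export_to_onnx(weights_path: str) -> str:
    """Hypothetical helper: export a trained checkpoint and return the .onnx path."""
    from ultralytics import YOLO  # lazy import so this module loads without the dependency

    model = YOLO(weights_path)
    return model.export(**EXPORT_ARGS)


def open_gpu_session(onnx_path: str):
    """Hypothetical helper: open an ONNX Runtime session that prefers the GPU.

    Assumes onnxruntime-gpu is installed; the CPU provider is listed as a fallback.
    """
    import onnxruntime as ort

    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ort.InferenceSession(onnx_path, providers=providers)
```

Creating the session once and reusing it for every frame is what keeps the model resident on the GPU, as opposed to paying set-up or transfer costs per frame.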

Per-backbone model latency

| Backbone | Project 1 path (eager PyTorch) | Project 2 path (ONNX + GPU) | Δ |
| --- | --- | --- | --- |
| YOLOv11n (Batches 04 / 05) | 100–120 ms | 15–20 ms | ≈6–7× faster |
| YOLOv11s (Batch 06) | 170–200 ms | 20–30 ms | ≈7–9× faster |

End-to-end throughput

| Metric | Project 1 inference path | Project 2 inference path | Δ |
| --- | --- | --- | --- |
| Sustained end-to-end FPS | 4–5 | 35–40 | ≈8× higher |

The throughput uplift outpaces the raw model-latency drop because the Project 1 path also paid for per-frame CPU↔GPU copies and webcam-sync stalls; the Project 2 ONNX path keeps the model resident on the GPU and decouples capture from inference. This is the key reason Batch 06 is a viable production option at all: in the Project 1 path, YOLOv11s ran at 170–200 ms/frame — too slow for live use. Under the Project 2 path it drops to 20–30 ms, still slightly slower than YOLOv11n (15–20 ms) but well inside real-time, so the 9.4 M-param backbone is no longer a deployment liability — only a memory-footprint trade-off (see operational trade-offs below).
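
The gap between model latency and sustained throughput can be made concrete by converting latency into the frame rate it would allow if nothing else cost time. The sketch below uses the YOLOv11n figures reported in this document; the helper name is illustrative.

```python
def fps_ceiling(latency_ms: float) -> float:
    """Upper bound on frames per second if model latency were the only cost."""
    return 1000.0 / latency_ms


# (fastest, slowest) model latency in ms and sustained end-to-end FPS, from the report.
paths = {
    "YOLOv11n, Project 1 path": {"latency_ms": (100, 120), "sustained_fps": (4, 5)},
    "YOLOv11n, Project 2 path": {"latency_ms": (15, 20), "sustained_fps": (35, 40)},
}

for name, p in paths.items():
    fast, slow = p["latency_ms"]
    lo_fps, hi_fps = p["sustained_fps"]
    print(
        f"{name}: model-only ceiling {fps_ceiling(slow):.1f}-{fps_ceiling(fast):.1f} FPS, "
        f"sustained {lo_fps}-{hi_fps} FPS"
    )
```

On the Project 1 path the model alone would allow roughly 8–10 FPS but only 4–5 were sustained, so about half of each frame's time went to work outside the model. On the Project 2 path the ceiling is 50–67 FPS against 35–40 sustained, a noticeably smaller relative overhead.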

Post-processing calibration (Cat. C #11)

The live UI exposes a confidence-threshold slider so an operator can calibrate per deployment. For the project-default operating point of 0.25 the test-time metrics above hold; raising it above ~0.45 trades recall for precision on door_sign in Batch 06 — a useful lever given the precision regression flagged in the error analysis below.
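
The trade-off behind the slider can be illustrated with a handful of scored detections. The detection list below is hypothetical and chosen only to show the mechanism; it is not taken from the Batch 06 test outputs.

```python
def precision_recall_at(detections, n_ground_truth, threshold):
    """Precision and recall after discarding detections below a confidence threshold.

    detections: list of (confidence, is_true_positive) pairs.
    Precision is reported as 1.0 when nothing is kept (a convention for this sketch).
    """
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / n_ground_truth
    return precision, recall


# Hypothetical door_sign detections: (confidence, matched a ground-truth box?)
detections = [(0.92, True), (0.88, True), (0.71, True), (0.52, True), (0.41, False), (0.33, True)]
n_ground_truth = 6

for threshold in (0.25, 0.45):
    p, r = precision_recall_at(detections, n_ground_truth, threshold)
    print(f"conf >= {threshold:.2f}: precision {p:.3f}, recall {r:.3f}")
```

In this toy case, raising the threshold from 0.25 to 0.45 removes the one false positive (precision 0.833 → 1.000) at the cost of one low-confidence true positive (recall 0.833 → 0.667).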

06 · Decomposing the wins

Where did each gain come from?

Because every step is a single-variable change, each gain is attributable either to dataset (Batch 04 → 05) or model (Batch 05 → 06). The combined column reflects the total swing from baseline (Batch 04) to the final run (Batch 06).

Dataset uplift · 04 → 05

Same YOLOv11n. New v2 dataset.

Test mAP@0.5: +0.054
Test mAP@0.5:0.95: +0.077
Test macro recall: +0.119

Model uplift · 05 → 06

Same v2 dataset. YOLOv11n → YOLOv11s.

Test mAP@0.5: −0.008
Test mAP@0.5:0.95: +0.008
Test macro recall: −0.016

Combined · 04 → 06

Both changes stacked.

Test mAP@0.5: +0.046
Test mAP@0.5:0.95: +0.085
Test macro recall: +0.103

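The dataset uplift supplied about 91 % of the cumulative mAP@0.5:0.95 gain (+0.077 of +0.085) and more than the whole net gain on mAP@0.5 and macro recall, where the model uplift gave a little back (−0.008 and −0.016). The model uplift produces a small lift only on the stricter-IoU metric, which is exactly where a higher-capacity backbone is expected to help (fine-grained box regression).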
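
The attribution can be recomputed directly from the test-set table. The sketch below splits each metric's total change into its dataset and model steps; the `decompose` helper is illustrative.

```python
# (Batch 04, Batch 05, Batch 06) test metrics from the overall-metrics table.
test_metrics = {
    "mAP@0.5":      (0.9332, 0.9876, 0.9792),
    "mAP@0.5:0.95": (0.7959, 0.8728, 0.8808),
    "macro recall": (0.8613, 0.9804, 0.9647),
}


def decompose(b04: float, b05: float, b06: float):
    """Return (dataset gain, model gain, dataset share of the net 04 -> 06 gain)."""
    dataset_gain = b05 - b04
    model_gain = b06 - b05
    total_gain = b06 - b04
    return dataset_gain, model_gain, dataset_gain / total_gain


for name, values in test_metrics.items():
    dataset_gain, model_gain, share = decompose(*values)
    print(f"{name}: dataset {dataset_gain:+.4f}, model {model_gain:+.4f}, dataset share {share:.0%}")
```

The dataset share comes out at about 118 % for mAP@0.5, 91 % for mAP@0.5:0.95 and 115 % for macro recall. A share above 100 % means the later model step gave back part of what the dataset step had gained.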

07 · Discussion

Why didn't YOLOv11s sweep the board?

YOLOv11s has 3.6× the parameters of YOLOv11n (9.4 M vs 2.6 M) and 3.3× the GFLOPs (21.6 vs 6.5). Naively, a clean win across the board is expected. Two reasons it doesn't materialise:

Batch 05 already saturated loose-IoU

Test mAP@0.5 hit 0.9876 and three of four classes were already at or above 0.9855. There is essentially no headroom left on the 0.5 IoU bucket — extra capacity has nowhere to go except tighter localisation.

800 images underfeeds a 9.4 M-param backbone

With the same recipe, a larger network is mildly susceptible to over-fitting on a small, low-noise dataset. The door_sign precision drop (1.0000 → 0.9697) and recall drop are consistent — the model is producing a very small number of confident misclassifications it didn't before.

Generalisation-gap signature

Train-vs-val box-loss gap: 04 = 0.33 (overfit) → 05 = 0.11 (great) → 06 = 0.23 (re-widening). Batch 06 is starting to consume extra capacity on training-set fit more than on generalisation — a regularisation knob or more data is the natural next move.
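
The gap figures follow from the final losses in the training-dynamics table. A short sketch of the arithmetic, applied to both the box and classification losses:

```python
# Final (train, val) losses per batch, from the best-epoch summary table.
final_losses = {
    "box": {"04": (0.5109, 0.8409), "05": (0.5632, 0.6743), "06": (0.4186, 0.6505)},
    "cls": {"04": (0.3839, 0.5516), "05": (0.3976, 0.4186), "06": (0.2298, 0.3342)},
}

# Generalisation gap = validation loss minus training loss (larger = more overfit).
gaps = {
    loss_name: {batch: round(val - train, 2) for batch, (train, val) in per_batch.items()}
    for loss_name, per_batch in final_losses.items()
}

for loss_name, per_batch in gaps.items():
    print(loss_name, per_batch)
```

The classification loss shows the same narrow-then-widen pattern as the box loss (0.17 → 0.02 → 0.10), so the signature is not specific to box regression.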

Operational trade-offs

| Dimension | 04 (n + v1) | 05 (n + v2) | 06 (s + v2) |
| --- | --- | --- | --- |
| Parameter count | 2.6 M | 2.6 M | 9.4 M |
| GFLOPs (640×640) | 6.5 | 6.5 | 21.6 |
| ONNX size (FP32, simplified) | ~10.4 MB | ~10.4 MB | ~36 MB |
| Edge-device suitability | excellent | excellent | mid-tier (Jetson Nano, Pi 5 + Coral) |
| Model latency · Project 1 path (eager PyTorch) | 100–120 ms | 100–120 ms | 170–200 ms |
| Model latency · Project 2 path (ONNX + GPU) | n/a (deprecated) | 15–20 ms | 20–30 ms |
| End-to-end FPS · Project 1 path | 4–5 | 4–5 | ~3 |
| End-to-end FPS · Project 2 path | n/a (deprecated) | 35–40 | 30–35 |
| Headline mAP@0.5 (test) | 0.9332 | 0.9876 | 0.9792 |
| Strict mAP@0.5:0.95 (test) | 0.7959 | 0.8728 | 0.8808 |
| Best for | (deprecated baseline) | shippable lightweight default | final combined model; tight-localisation downstream tasks |

Remaining failure modes

door_sign is the project's hardest class on the strict IoU metric in Batches 05 and 06 (mAP@0.5:0.95 ≈ 0.74, against 0.77 in Batch 04) and is now the only class to have lost both precision and recall in Batch 06. Door signs are also the smallest in absolute pixel area in the test split.

07b · Error analysis

Where Batch 06 fails (and where it wins)

The Batch 06 final-model results cluster into two failure patterns and one gain, all visible in the qualitative outputs:

  1. door_sign confident-but-wrong predictions (precision 1.000 → 0.9697). The first non-perfect precision in the project's history. The qualitative grid (figures/batch06/qualitative_predictions.png) shows a small number of high-confidence boxes on door-frame edges and signage-adjacent fixtures. The confusion matrix (figures/batch06/confusion_matrix_normalized.png) shows a faint door_sign → background confusion absent in Batch 05.
  2. door_sign recall regression (0.9714 → 0.9143). Two test images that Batch 05 detected at a confidence of 0.5 or higher drop below threshold in Batch 06. Both contain door signs at the smallest pixel area in the test set (long-edge ≲ 40 px at 640 × 640). YOLOv11s's larger receptive field is paying a cost on the smallest objects, which is typical of an under-fed larger backbone.
  3. whiteboard sub-pixel localisation gains. The flip side: whiteboard mAP@0.5:0.95 jumps 3.1 pp and recall hits 1.000 in Batch 06. Whiteboards are among the largest objects in the test set, and the higher-capacity backbone fits their corners more tightly. fire_extinguisher, the other large-object class, does not share the gain: its mAP@0.5:0.95 slips 0.6 pp (0.8762 → 0.8706).
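
To find which labelled boxes fall under the roughly 40 px long-edge mark mentioned in item 2, the YOLO label files can be scanned directly. This is a sketch under stated assumptions: labels are in the standard YOLO text format (`class x_center y_center width height`, all normalised), images are treated as 640 × 640 without accounting for letterbox padding, and class id 3 standing for door_sign is a guess based on the class order given in the abstract. The sample lines are made up.

```python
def long_edge_px(norm_w: float, norm_h: float, img_w: int = 640, img_h: int = 640) -> float:
    """Longer side of a box in pixels, from YOLO-normalised width and height."""
    return max(norm_w * img_w, norm_h * img_h)


def small_boxes(label_lines, max_long_edge: float = 40.0):
    """Return (class_id, long_edge_px) for every box at or below the cut-off."""
    found = []
    for line in label_lines:
        parts = line.split()
        if len(parts) != 5:
            continue  # skip blank or malformed lines
        class_id = int(parts[0])
        edge = long_edge_px(float(parts[3]), float(parts[4]))
        if edge <= max_long_edge:
            found.append((class_id, edge))
    return found


# Hypothetical label lines: a small sign (class 3) and a large board (class 1).
sample = ["3 0.52 0.31 0.050 0.030", "1 0.50 0.50 0.600 0.400"]
print(small_boxes(sample))
```

Running the same scan over the test-split label files would show how many door_sign boxes sit in the size range where Batch 06 loses recall.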

Three independent levers worth trying in a future Batch 07, as named in the conclusion: a higher imgsz, class-aware augmentation, and a regulariser bump.

07c · Ethics

Project 1 constraints, re-affirmed

All ethical constraints from Project 1 continue to apply and were re-checked for the v2 dataset and the deployment pipeline:

08 · Conclusion

Recommendation

Across three single-variable steps:

04 → 05 · Dataset uplift

Same model, new dataset. Large, broad gains on the macro metrics: +5.4 pp mAP@0.5, +7.7 pp mAP@0.5:0.95, +11.9 pp recall at unchanged precision.

05 → 06 · Model uplift

Same dataset, larger model. Localisation-focused gain (+0.8 pp mAP@0.5:0.95, whiteboard mAP@0.5:0.95 +3.1 pp) traded against a small recall and door_sign precision regression.

Net guidance · final model designation

Batch 06 is the Project 2 final combined model: it stacks Category A (backbone upgrade) on top of Category B + C (dataset + post-processing) and wins on the strict mAP@0.5:0.95 macro. Batch 05 is retained as the lightweight alternative where binary size and perfect door_sign precision matter more than the strict-IoU gain. The door_sign precision regression is the natural target for a follow-up Batch 07 (higher imgsz, class-aware augmentation, regulariser bump).

Appendix

Artefacts & reproduction

A — Artefact locations
| Asset | Batch 04 | Batch 05 | Batch 06 |
| --- | --- | --- | --- |
| Trained weights (.pt) | model_outputs/04_…/weights/best.pt | 05_model_weights/best.pt (gitignored) | 06_model_weights/best.pt (gitignored) |
| ONNX export | model_outputs/04_…/weights/best.onnx | 05_model_weights/best.onnx | 06_model_weights/best.onnx |
| Training summary | 04_…/docs/nb04_model_training/training_summary.json | 05_…/docs/nb04_model_training/training_summary.json | 06_…/docs/nb04_model_training/training_summary.json |
| Test metrics | 04_…/docs/nb05_model_evaluation/overall_metrics.json | 05_…/docs/nb05_model_evaluation/overall_metrics.json | 06_…/docs/nb05_model_evaluation/overall_metrics.json |
| Per-class metrics | 04_…/docs/nb05_model_evaluation/per_class_metrics.csv | 05_…/docs/nb05_model_evaluation/per_class_metrics.csv | 06_…/docs/nb05_model_evaluation/per_class_metrics.csv |

B — Reproduction
# All three runs are reproducible from the notebooks in notebooks/ using
# the dataset YAML at data/dataset/data.yaml. Seed = 42, deterministic = true.
# Switch the `model=` line in nb04 between yolo11n.pt and yolo11s.pt to
# reproduce Batch 05 vs Batch 06; rebuild the dataset for Batch 04 vs 05.
jupyter execute notebooks/04_model_training.ipynb
jupyter execute notebooks/05_model_evaluation.ipynb

Markdown twin of this report: technical_report.md · Prior two-way comparison: Batch 04 vs Batch 05