ShelfApi

Automatic shelf bay detection for retail photos
Performance Report — 08-03-2026
- 95 blind-tested photos
- 1.42% mean boundary deviation
- 72% spot-on (< 1% deviation)
- ~10 ms inference time (GPU)

What does ShelfApi do?

ShelfApi analyses photos of retail shelves and automatically detects the boundaries of the central shelf bay. It then masks the surrounding products so that only the relevant bay is visible. This speeds up manual review and enables automated shelf analysis.

Pipeline

1. Photo Upload: shelf photo from the store
2. AI Boundary Detection: CNN detects left & right boundaries
3. Perspective Correction: straighten angled photos
4. Masking: blur or remove surroundings
Technical detail: The model is an EfficientNet-B0 (CNN) that predicts 4 boundary points as percentages of the image width: top-left, bottom-left, top-right, bottom-right. This supports angled shelf boundaries. Inference takes ~10ms on GPU.
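As a minimal sketch of how four width-percentage outputs can be turned into a bay mask (function and argument names are hypothetical, and the exact output convention of the production model is assumed, not confirmed):

```python
import numpy as np

def boundaries_to_mask(preds_pct, height, width):
    """Build a boolean mask for the central bay from 4 boundary points.

    preds_pct: (top_left, bottom_left, top_right, bottom_right) as
    fractions of the image width in [0, 1], i.e. the assumed output
    order of the boundary model.
    """
    tl, bl, tr, br = (p * (width - 1) for p in preds_pct)
    mask = np.zeros((height, width), dtype=bool)
    rows = np.arange(height)
    t = rows / (height - 1)                      # 0 at top row, 1 at bottom row
    left = ((1 - t) * tl + t * bl).astype(int)   # interpolate slanted left edge
    right = ((1 - t) * tr + t * br).astype(int)  # interpolate slanted right edge
    for y, l, r in zip(rows, left, right):
        mask[y, l:r + 1] = True                  # keep pixels inside the bay
    return mask
```

Because top and bottom x-positions are predicted separately per side, the mask edges may be slanted, which is what allows angled shelf boundaries before perspective correction.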

Dataset

Metric                               Count
-----------------------------------  -----
Total photos in dataset              2,814
Human-verified annotations             496
Skipped (unusable)                      26
Available for future verification    2,292

Photos are sourced from 6 different Roamler jobs spanning hair care, household products, and cleaning product shelves.

Model Performance

Result: On 95 unseen photos (never used during training), the mean boundary deviation is 1.42% of the image width. 72% of all boundary points required no correction at all.

Error distribution

Per boundary point (4 per photo) — how far does the CNN prediction deviate from the human annotation?

Deviation   Share of points
---------   ---------------
< 1%        72%
1-2%         7%
2-3%         7%
3-5%         4%
5-10%        7%
> 10%        3%

Per boundary point

Boundary      Mean deviation   Median
------------  --------------   ------
Left Top      1.91%            0.00%
Left Bottom   1.31%            0.00%
Right Top     1.37%            0.00%
Right Bottom  1.07%            0.00%
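The figures above can be reproduced with a few lines. This sketch assumes the reported metric is the absolute difference between predicted and human-annotated x-positions, expressed in percent of image width (the function name is illustrative, not from the codebase):

```python
def deviation_stats(pred_pct, truth_pct):
    """Mean absolute deviation (in % of image width) and the share of
    boundary points deviating by less than 1% ("spot-on")."""
    devs = [abs(p - t) for p, t in zip(pred_pct, truth_pct)]
    mean_dev = sum(devs) / len(devs)
    spot_on = sum(d < 1.0 for d in devs) / len(devs)
    return mean_dev, spot_on

# Four boundary points of one photo, in % of image width (made-up values):
mean_dev, spot_on = deviation_stats([12.0, 11.5, 88.0, 89.3],
                                    [12.0, 11.0, 88.0, 88.0])
```

Aggregating `devs` over all 4 points of all 95 test photos would yield the mean-deviation and spot-on numbers reported here.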

Training progression

Label quality has a dramatic impact. More data with noisy labels performs worse than less data with clean human-verified labels.

Round           Training photos   Labels           Mean deviation   Improvement
--------------  ----------------  ---------------  --------------   -----------
Round 1         2,814 (Gemini)    Automatic        5.07%            Baseline
Round 2           226             Human-verified   2.87%            -43%
Round 3           401             Human-verified   1.65%            -68%
Round 4 (eval)    496             Human-verified   1.42%            -72%
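The "Improvement" column is the relative reduction in mean deviation versus the Round 1 baseline; a quick arithmetic check (results agree with the table up to rounding of the reported deviations):

```python
BASELINE = 5.07  # Round 1 mean deviation, in % of image width

def improvement(dev, baseline=BASELINE):
    """Relative reduction in mean deviation vs. the baseline, in percent."""
    return (baseline - dev) / baseline * 100

for name, dev in [("Round 2", 2.87), ("Round 3", 1.65), ("Round 4", 1.42)]:
    print(f"{name}: {improvement(dev):.1f}% below baseline")
```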

Examples

Green lines = human annotation (ground truth). Red lines = CNN prediction. Darkened areas are masked out.

Perfect prediction (0% error)

- 32001446_SurfaceCleaning_FullBay_6 (mean deviation: 0.0%)
- 32003106_Dishwasher_FullBay_5 (mean deviation: 0.0%)

Good prediction (< 1% error)

- 30522176_Q56810027_3 (mean deviation: 0.3%)
- 30531317_Q56810027_3 (mean deviation: 0.3%)

Moderate prediction (2-5% error)

- 27725663_HAIR_BAY_3_HAIUK (mean deviation: 2.1%)
- 32001959_FabricCleaning_FullBay_2 (mean deviation: 2.2%)

Worst predictions

- 31082224_Q56810027 (mean deviation: 9.3%)
- 30387876_Q56810027 (mean deviation: 15.1%)

Conclusions & Next Steps