ShelfApi
Automatic shelf bay detection for retail photos
Performance Report — 08-03-2026
Inference time (GPU): ~10ms
What does ShelfApi do?
ShelfApi analyses photos of retail shelves and automatically detects the boundaries of the central shelf bay. It then masks the surrounding products so that only the relevant bay is visible. This speeds up manual review and enables automated shelf analysis.
Pipeline
1. Photo Upload: shelf photo from the store
2. AI Boundary Detection: CNN detects the left & right bay boundaries
3. Perspective Correction: straighten angled photos
4. Masking: blur or remove the surroundings
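The four stages above can be sketched as a simple function composition. All function names here (`detect_boundaries`, `correct_perspective`, `mask_surroundings`, `process`) are hypothetical placeholders for illustration, not the real ShelfApi interface:

```python
# Minimal sketch of the four-stage pipeline. Every name below is a
# placeholder invented for this sketch, not ShelfApi's actual API.

def detect_boundaries(image):
    """Stage 2: CNN predicts 4 boundary points as percentages of image width."""
    # Placeholder values; a real implementation runs EfficientNet-B0 here.
    return {"top_left": 12.0, "bottom_left": 10.0,
            "top_right": 88.0, "bottom_right": 90.0}

def correct_perspective(image, boundaries):
    """Stage 3: warp the photo so the bay edges become vertical (placeholder)."""
    return image

def mask_surroundings(image, boundaries):
    """Stage 4: blur or remove everything outside the central bay (placeholder)."""
    return image

def process(image):
    """Stage 1 is the upload itself; stages 2-4 run in sequence."""
    boundaries = detect_boundaries(image)
    corrected = correct_perspective(image, boundaries)
    return mask_surroundings(corrected, boundaries)
```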
Technical detail: The model is an EfficientNet-B0 (CNN) that predicts 4 boundary points as percentages of the image width: top-left, bottom-left, top-right, bottom-right. This supports angled shelf boundaries. Inference takes ~10ms on GPU.
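Since the four boundary points are expressed as percentages of the image width, mapping them to pixel coordinates for masking is a one-liner. A minimal sketch (the helper name is ours):

```python
def boundary_pixels(preds, image_width):
    """Convert boundary predictions (percent of image width) to pixel x-coordinates.

    `preds` holds four values in the order used by the model:
    top-left, bottom-left, top-right, bottom-right.
    """
    return [round(p / 100 * image_width) for p in preds]

# Hypothetical predictions for a 1080 px wide photo:
xs = boundary_pixels([12.5, 10.0, 87.5, 90.0], 1080)
# xs == [135, 108, 945, 972]
```

Because left and right edges are predicted independently at top and bottom, the resulting quadrilateral can be non-rectangular, which is what supports angled shelf boundaries.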
Dataset
| Metric | Count |
| --- | --- |
| Total photos in dataset | 2,814 |
| Human-verified annotations | 496 |
| Skipped (unusable) | 26 |
| Available for future verification | 2,292 |
Photos are sourced from 6 different Roamler jobs spanning hair care, household products, and cleaning product shelves.
Model Performance
Result: On 95 unseen photos (never used during training), the mean boundary deviation is 1.42% of the image width. 72% of all boundary points required no correction at all.
Error distribution
Per boundary point (4 per photo) — how far does the CNN prediction deviate from the human annotation?
| Boundary | Mean deviation | Median |
| --- | --- | --- |
| Left Top | 1.91% | 0.00% |
| Left Bottom | 1.31% | 0.00% |
| Right Top | 1.37% | 0.00% |
| Right Bottom | 1.07% | 0.00% |
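The per-point numbers above come from comparing each predicted boundary against its human annotation. A minimal sketch of that metric, using made-up values for one photo (the function name is ours):

```python
from statistics import mean, median

def deviations(predicted, annotated):
    """Absolute per-point deviation, in percent of image width."""
    return [abs(p - a) for p, a in zip(predicted, annotated)]

# Hypothetical predictions vs. annotations for one photo (% of image width):
d = deviations([12.0, 10.5, 88.0, 90.0], [12.0, 10.0, 89.0, 90.0])
print(mean(d), median(d))  # 0.375 0.25
```

Note that with many points landing exactly on the annotation, the median drops to 0.00% even when the mean stays above 1%, which is exactly the pattern in the table.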
Training progression
Label quality has a dramatic impact. More data with noisy labels performs worse than less data with clean human-verified labels.
| Round | Training photos | Labels | Mean deviation | Improvement vs. baseline |
| --- | --- | --- | --- | --- |
| Round 1 | 2,814 | Automatic (Gemini) | 5.07% | Baseline |
| Round 2 | 226 | Human-verified | 2.87% | -43% |
| Round 3 | 401 | Human-verified | 1.65% | -68% |
| Round 4 (eval) | 496 | Human-verified | 1.42% | -72% |
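The improvement column is the relative reduction in mean deviation versus the Round 1 baseline. Checking the final round:

```python
baseline = 5.07  # Round 1 mean deviation (%), automatic labels
final = 1.42     # Round 4 mean deviation (%), human-verified labels

improvement = (baseline - final) / baseline
print(f"{improvement:.0%}")  # 72%
```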
Examples
Green lines = human annotation (ground truth). Red lines = CNN prediction. Darkened areas are masked out.
Perfect prediction (0% error)
32001446_SurfaceCleaning_FullBay_6
Mean deviation: 0.0%
32003106_Dishwasher_FullBay_5
Mean deviation: 0.0%
Good prediction (< 1% error)
30522176_Q56810027_3
Mean deviation: 0.3%
30531317_Q56810027_3
Mean deviation: 0.3%
Moderate prediction (2-5% error)
27725663_HAIR_BAY_3_HAIUK
Mean deviation: 2.1%
32001959_FabricCleaning_FullBay_2
Mean deviation: 2.2%
Worst predictions
31082224_Q56810027
Mean deviation: 9.3%
30387876_Q56810027
Mean deviation: 15.1%
Conclusions & Next Steps
- The model works well: 91% of all boundary points are within 5% deviation, 72% are spot-on.
- Scalable: ~10ms per photo on GPU — thousands of photos per minute.
- More data helps: Going from 226 to 496 verified photos roughly halved the error (2.87% → 1.42%).
- Room to grow: 2,292 photos still available for further training.
- Edge cases: A small percentage of photos (~3%) show larger deviations, typically due to unusual camera angles or ambiguous shelf boundaries.