ShelfApi
Automatic shelf bay detection for retail photos
Performance Report — 10-03-2026

880 verified training photos · ~10ms inference time (GPU)
What does ShelfApi do?
ShelfApi analyses photos of retail shelves, endcap displays (kopstellingen), and checkout displays (kassameubels). It automatically detects the boundaries of the product area and masks the surroundings. This speeds up manual review and enables automated shelf analysis.
Pipeline
Step 1
Photo Upload
Shelf photo from the store
Step 2
AI Boundary Detection
CNN detects 8 boundary points
Step 3
Perspective Correction
Straighten angled photos
Step 4
Masking
Blur or remove surroundings
Technical detail: The model is an EfficientNet-B0 (CNN) that predicts 8 boundary points: 4 for left/right boundaries (x-axis, as % of width) and 4 for top/bottom boundaries (y-axis, as % of height). This supports angled boundaries in all directions. Inference takes ~10ms on GPU, ~50ms on CPU.
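Decoding the 8 normalized outputs back into pixel coordinates is straightforward. A minimal sketch, assuming the outputs follow the ordering of the boundary table below (left-top/left-bottom/right-top/right-bottom x, then top-left/top-right/bottom-left/bottom-right y); `decode_boundaries` is a hypothetical helper, not part of ShelfApi:

```python
import numpy as np

def decode_boundaries(outputs, width, height):
    """Convert the 8 model outputs (percentages) to pixel coordinates.

    Assumed layout:
      outputs[0:4] -> x positions as % of width
                      (left-top, left-bottom, right-top, right-bottom)
      outputs[4:8] -> y positions as % of height
                      (top-left, top-right, bottom-left, bottom-right)
    """
    xs = np.asarray(outputs[:4], dtype=float) / 100.0 * width
    ys = np.asarray(outputs[4:], dtype=float) / 100.0 * height
    # Corner points of the product-area quad, clockwise from top-left.
    return [
        (xs[0], ys[0]),  # top-left:     left-top x,     top-left y
        (xs[2], ys[1]),  # top-right:    right-top x,    top-right y
        (xs[3], ys[3]),  # bottom-right: right-bottom x, bottom-right y
        (xs[1], ys[2]),  # bottom-left:  left-bottom x,  bottom-left y
    ]

# Example: a bay occupying the central 80% of a 1000x1500 photo.
quad = decode_boundaries([10, 10, 90, 90, 10, 10, 90, 90], 1000, 1500)
```

Because each side has two independent points, the quad's edges need not be axis-aligned, which is what allows angled boundaries in all directions.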
Dataset
| Metric | Count |
| --- | --- |
| Total photos in dataset | 3,814 |
| Human-verified annotations | 880 |
| Skipped (unusable) | 29 |
| Available for future verification | 2,905 |
Photos are sourced from Roamler jobs across multiple categories: regular shelf bays, endcap displays (kopstellingen), and checkout displays (kassameubels).
Model Performance
Result: The latest model (v0.5, 8 outputs) achieves 1.33% mean deviation on the validation set. X-boundaries average ~2.0% error, Y-boundaries ~0.65% error. On 279 blind-tested photos, the mean deviation is 0.44%.
Error distribution
Per boundary point (8 per photo) — how far does the CNN prediction deviate from the human annotation?
Per boundary point
| Boundary | Mean deviation | Median |
| Left Top | 1.04% | 0.00% |
| Left Bottom | 0.79% | 0.00% |
| Right Top | 0.89% | 0.00% |
| Right Bottom | 0.80% | 0.00% |
| Top Left | 0.00% | 0.00% |
| Top Right | 0.00% | 0.00% |
| Bottom Left | 0.00% | 0.00% |
| Bottom Right | 0.00% | 0.00% |
Training progression
Label quality has a dramatic impact. More data with noisy labels performs worse than less data with clean human-verified labels.
| Round | Training photos | Labels | Mean deviation | Improvement |
| Round 1 | 2,814 (Gemini) | Automatic | 5.07% | Baseline |
| Round 2 | 226 | Human-verified | 2.87% | -43% |
| Round 3 | 401 | Human-verified | 1.65% | -68% |
| Round 4 (eval) | 496 | Human-verified | 1.42% | -72% |
| Round 5 | 596 + endcaps | Human-verified | 2.12% | New category |
| Round 6 | 780 + endcaps | Human-verified | 1.60% | -68% |
| Round 7 (8-out) | 880 + checkout | Human-verified | 1.33% | -74% |
Examples
Green lines = human annotation (ground truth). Red lines = CNN prediction. Darkened areas are masked out.
Perfect prediction (0% error)
31992991_FabricEnhancers_FullBay_2
Mean deviation: 0.0%
32000540_FabricCleaning_FullBay_2
Mean deviation: 0.0%
Good prediction (< 1% error)
30522176_Q56810027_3
Mean deviation: 0.2%
Human CNN prediction
30531317_Q56810027_3
Mean deviation: 0.2%
Human CNN prediction
Moderate prediction (2-5% error)
31069640_Q56810027_2
Mean deviation: 2.1%
Human CNN prediction
kop_174851_10105636_foto9_kopstelling9
Mean deviation: 2.1%
Human CNN prediction
Worst predictions
30385719_Q56810027_2
Mean deviation: 6.7%
Human CNN prediction
30387876_Q56810027
Mean deviation: 7.6%
Human CNN prediction
Conclusions & Next Steps
- The model works well: 1.33% mean deviation across 8 boundary points (x: ~2%, y: ~0.65%).
- Multi-format: Supports regular shelves, endcap displays, and checkout displays in a single model.
- Scalable: ~10ms per photo on GPU, ~50ms on CPU — thousands of photos per minute.
- More data helps: Going from 226 to 880 verified photos cut the error from 2.87% to 1.33%.
- Room to grow: 2,905 photos still available for further training.