ShelfApi
Automatic shelf bay detection for retail photos
Performance Report — 08-03-2026
Inference time (GPU): ~10ms
What does ShelfApi do?
ShelfApi analyses photos of retail shelves and automatically detects the boundaries of the central shelf bay. It then masks the surrounding products so that only the relevant bay is visible. This speeds up manual review and enables automated shelf analysis.
Pipeline
1. Photo Upload: shelf photo from the store
2. AI Boundary Detection: CNN detects the left & right bay boundaries
3. Perspective Correction: straighten angled photos
4. Masking: blur or remove the surroundings
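The four stages above can be sketched as a simple function composition. All function names here (`detect_boundaries`, `correct_perspective`, `mask_surroundings`, `process`) are hypothetical placeholders for illustration, not the real ShelfApi interface:

```python
# Minimal sketch of the four-stage pipeline. Every name below is a
# placeholder invented for this sketch, not ShelfApi's actual API.

def detect_boundaries(image):
    """Stage 2: CNN predicts 4 boundary points as percentages of image width."""
    # Placeholder values; a real implementation runs EfficientNet-B0 here.
    return {"top_left": 12.0, "bottom_left": 10.0,
            "top_right": 88.0, "bottom_right": 90.0}

def correct_perspective(image, boundaries):
    """Stage 3: warp the photo so the bay edges become vertical (placeholder)."""
    return image

def mask_surroundings(image, boundaries):
    """Stage 4: blur or remove everything outside the central bay (placeholder)."""
    return image

def process(image):
    """Stage 1 is the upload itself; stages 2-4 run in sequence."""
    boundaries = detect_boundaries(image)
    corrected = correct_perspective(image, boundaries)
    return mask_surroundings(corrected, boundaries)
```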
Technical detail: The model is an EfficientNet-B0 (CNN) that predicts 4 boundary points as percentages of the image width: top-left, bottom-left, top-right, bottom-right. This supports angled shelf boundaries. Inference takes ~10ms on GPU.
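Since the four boundary points are expressed as percentages of the image width, mapping them to pixel coordinates for masking is a one-liner. A minimal sketch (the helper name is ours):

```python
def boundary_pixels(preds, image_width):
    """Convert boundary predictions (percent of image width) to pixel x-coordinates.

    `preds` holds four values in the order used by the model:
    top-left, bottom-left, top-right, bottom-right.
    """
    return [round(p / 100 * image_width) for p in preds]

# Hypothetical predictions for a 1080 px wide photo:
xs = boundary_pixels([12.5, 10.0, 87.5, 90.0], 1080)
# xs == [135, 108, 945, 972]
```

Because left and right edges are predicted independently at top and bottom, the resulting quadrilateral can be non-rectangular, which is what supports angled shelf boundaries.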
Dataset
| Metric | Count |
| --- | --- |
| Total photos in dataset | 2,814 |
| Human-verified annotations | 496 |
| Skipped (unusable) | 26 |
| Available for future verification | 2,292 |
Photos are sourced from 6 different Roamler jobs spanning hair care, household products, and cleaning product shelves.
Model Performance
Result: On 95 unseen photos (never used during training), the mean boundary deviation is 1.42% of the image width. 72% of all boundary points required no correction at all.
Error distribution
Per boundary point (4 per photo) — how far does the CNN prediction deviate from the human annotation?
| Boundary | Mean deviation | Median |
| --- | --- | --- |
| Left Top | 1.91% | 0.00% |
| Left Bottom | 1.31% | 0.00% |
| Right Top | 1.37% | 0.00% |
| Right Bottom | 1.07% | 0.00% |
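The per-point numbers above come from comparing each predicted boundary against its human annotation. A minimal sketch of that metric, using made-up values for one photo (the function name is ours):

```python
from statistics import mean, median

def deviations(predicted, annotated):
    """Absolute per-point deviation, in percent of image width."""
    return [abs(p - a) for p, a in zip(predicted, annotated)]

# Hypothetical predictions vs. annotations for one photo (% of image width):
d = deviations([12.0, 10.5, 88.0, 90.0], [12.0, 10.0, 89.0, 90.0])
print(mean(d), median(d))  # 0.375 0.25
```

Note that with many points landing exactly on the annotation, the median drops to 0.00% even when the mean stays above 1%, which is exactly the pattern in the table.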
Training progression
Label quality has a dramatic impact. More data with noisy labels performs worse than less data with clean human-verified labels.
| Round | Training photos | Labels | Mean deviation | Improvement vs. baseline |
| --- | --- | --- | --- | --- |
| Round 1 | 2,814 | Automatic (Gemini) | 5.07% | Baseline |
| Round 2 | 226 | Human-verified | 2.87% | -43% |
| Round 3 | 401 | Human-verified | 1.65% | -68% |
| Round 4 (eval) | 496 | Human-verified | 1.42% | -72% |
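The improvement column is the relative reduction in mean deviation versus the Round 1 baseline. Checking the final round:

```python
baseline = 5.07  # Round 1 mean deviation (%), automatic labels
final = 1.42     # Round 4 mean deviation (%), human-verified labels

improvement = (baseline - final) / baseline
print(f"{improvement:.0%}")  # 72%
```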
Examples
Green lines = human annotation (ground truth). Red lines = CNN prediction. Darkened areas are masked out.
Perfect prediction (0% error)
32001446_SurfaceCleaning_FullBay_6
Mean deviation: 0.0%
32003106_Dishwasher_FullBay_5
Mean deviation: 0.0%
Good prediction (< 1% error)
30522176_Q56810027_3
Mean deviation: 0.3%
30531317_Q56810027_3
Mean deviation: 0.3%
Moderate prediction (2-5% error)
27725663_HAIR_BAY_3_HAIUK
Mean deviation: 2.1%
32001959_FabricCleaning_FullBay_2
Mean deviation: 2.2%
Worst predictions
31082224_Q56810027
Mean deviation: 9.3%
30387876_Q56810027
Mean deviation: 15.1%
Conclusions & Next Steps
- The model works well: 91% of all boundary points are within 5% deviation, 72% are spot-on.
- Scalable: ~10ms per photo on GPU — thousands of photos per minute.
- More data helps: Going from 226 to 496 verified photos roughly halved the error (2.87% → 1.42%).
- Room to grow: 2,292 photos still available for further training.
- Edge cases: A small percentage of photos (~3%) show larger deviations, typically due to unusual camera angles or ambiguous shelf boundaries.