SantotoBench
A benchmark that evaluates an AI agent's ability to manage a sandwich stand. The agent must manage stock, determine pricing strategy and assign tasks to the stand's workers.
Global model comparison
Explore cash generated versus inference cost or total token consumption for each run.
🏆 Leaderboard
| # | Model | Cash generated | Inference cost | Tokens consumed |
|---|---|---|---|---|
| 🥇 | GPT-5.2 | 3.497,75 € | 4,95 € | 2,683,208 |
| 🥈 | Gemini-3-pro | 3.419,40 € | 13,64 € | 6,313,596 |
| 🥉 | Human 👤 | 3.285,00 € | - | - |
| 4 | GPT-5 | 3.256,50 € | 5,09 € | 3,758,924 |
| 5 | Opus-4.5 | 3.162,75 € | 25,14 € | 4,854,680 |
| 6 | GPT-5.1 | 3.146,45 € | 8,47 € | 6,258,021 |
| 7 | Sonnet-4.5 | 2.523,75 € | 8,66 € | 2,766,284 |
| 8 | Haiku-4.5 | 2.221,05 € | 2,92 € | 2,713,124 |
| 9 | GPT-5-mini | 2.010,50 € | 1,00 € | 3,801,217 |
| 10 | Grok-4.1-fast | 1.658,00 € | 0,56 € | 2,780,833 |
| 11 | Gemini-2.5-flash | 1.396,90 € | 1,82 € | 5,511,593 |
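
One way to read the cash-versus-cost comparison is to divide each model's cash generated by its inference cost. The sketch below does this with the figures from the table, converted from the European number format to plain floats. The ratio itself and the names `RUNS`, `cash_per_euro`, and `RANKED` are illustrative assumptions, not a metric defined by SantotoBench, and the human baseline is left out because it has no inference cost.

```python
# Rows copied from the leaderboard: (model, cash generated in EUR, inference cost in EUR).
# The human baseline is omitted because the table lists no inference cost for it.
RUNS = [
    ("GPT-5.2", 3497.75, 4.95),
    ("Gemini-3-pro", 3419.40, 13.64),
    ("GPT-5", 3256.50, 5.09),
    ("Opus-4.5", 3162.75, 25.14),
    ("GPT-5.1", 3146.45, 8.47),
    ("Sonnet-4.5", 2523.75, 8.66),
    ("Haiku-4.5", 2221.05, 2.92),
    ("GPT-5-mini", 2010.50, 1.00),
    ("Grok-4.1-fast", 1658.00, 0.56),
    ("Gemini-2.5-flash", 1396.90, 1.82),
]


def cash_per_euro(runs):
    """Return (model, cash / cost) pairs sorted from highest to lowest ratio.

    This ratio is a hypothetical efficiency measure, not part of the benchmark.
    """
    ratios = [(model, cash / cost) for model, cash, cost in runs]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


RANKED = cash_per_euro(RUNS)

for model, ratio in RANKED:
    # Left-align the model name, right-align the ratio with two decimals.
    print(f"{model:<18}{ratio:>10.2f}")
```

Under this assumed metric, Grok-4.1-fast ranks first and Opus-4.5 last, which is nearly the reverse of their positions by cash generated. The ratio ignores absolute earnings, so it complements the leaderboard rather than replacing it.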