SantotoBench

A benchmark that evaluates an AI agent's ability to run a sandwich stand. The agent must manage stock, set a pricing strategy, and assign tasks to the stand's workers.

Global model comparison

Explore cash generated versus inference cost or total token consumption for each run.

🏆 Leaderboard

| #  | Model            | Cash generated | Inference cost | Tokens consumed |
|----|------------------|----------------|----------------|-----------------|
| 🥇 | GPT-5.2          | €3,497.75      | €4.95          | 2,683,208       |
| 🥈 | Gemini-3-pro     | €3,419.40      | €13.64         | 6,313,596       |
| 🥉 | Human 👤         | €3,285.00      | -              | -               |
| 4  | GPT-5            | €3,256.50      | €5.09          | 3,758,924       |
| 5  | Opus-4.5         | €3,162.75      | €25.14         | 4,854,680       |
| 6  | GPT-5.1          | €3,146.45      | €8.47          | 6,258,021       |
| 7  | Sonnet-4.5       | €2,523.75      | €8.66          | 2,766,284       |
| 8  | Haiku-4.5        | €2,221.05      | €2.92          | 2,713,124       |
| 9  | GPT-5-mini       | €2,010.50      | €1.00          | 3,801,217       |
| 10 | Grok-4.1-fast    | €1,658.00      | €0.56          | 2,780,833       |
| 11 | Gemini-2.5-flash | €1,396.90      | €1.82          | 5,511,593       |
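
The comparison of cash generated against inference cost can be sketched in a few lines of Python. This is a minimal illustration, not part of the benchmark: the figures are copied from the leaderboard, while the two metrics (`cash_per_euro` and `net_cash`) and the helper `rank_by` are hypothetical names introduced here. The benchmark itself ranks by cash generated only, and nothing in the source says that inference cost is deducted from the score.

```python
# Leaderboard rows copied from the table:
# (model, cash generated in EUR, inference cost in EUR, tokens consumed).
# The human baseline is omitted because it has no inference cost or token count.
RUNS = [
    ("GPT-5.2", 3497.75, 4.95, 2_683_208),
    ("Gemini-3-pro", 3419.40, 13.64, 6_313_596),
    ("GPT-5", 3256.50, 5.09, 3_758_924),
    ("Opus-4.5", 3162.75, 25.14, 4_854_680),
    ("GPT-5.1", 3146.45, 8.47, 6_258_021),
    ("Sonnet-4.5", 2523.75, 8.66, 2_766_284),
    ("Haiku-4.5", 2221.05, 2.92, 2_713_124),
    ("GPT-5-mini", 2010.50, 1.00, 3_801_217),
    ("Grok-4.1-fast", 1658.00, 0.56, 2_780_833),
    ("Gemini-2.5-flash", 1396.90, 1.82, 5_511_593),
]


def cash_per_euro(cash, cost):
    """Hypothetical efficiency metric: cash generated per euro of inference."""
    return cash / cost


def net_cash(cash, cost):
    """Hypothetical metric: cash generated minus inference cost."""
    return cash - cost


def rank_by(metric):
    """Return the runs sorted by the given metric, best first."""
    return sorted(RUNS, key=lambda run: metric(run[1], run[2]), reverse=True)


if __name__ == "__main__":
    for model, cash, cost, _tokens in rank_by(cash_per_euro):
        print(f"{model:<18} {cash_per_euro(cash, cost):10.1f} EUR per EUR spent")
```

Ranking by cash per euro of inference favors the cheapest models, whereas subtracting the inference cost barely changes the order, because every model's cost is small relative to the cash it generated.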