SantotoBench

A benchmark that evaluates an AI agent's ability to run a sandwich stand. The agent must manage stock, determine a pricing strategy, and assign tasks to the stand's workers.

Global model comparison

Explore cash generated versus inference cost or total token consumption for each run.

🏆 Leaderboard

#	Model	Cash generated	Inference cost	Tokens consumed
🥇	GPT-5.2	3.497,75 €	4,95 €	2,683,208
🥈	Gemini-3-pro	3.419,40 €	13,64 €	6,313,596
🥉	Human👤	3.285,00 €	-	-
4	GPT-5	3.256,50 €	5,09 €	3,758,924
5	Opus-4.5	3.162,75 €	25,14 €	4,854,680
6	GPT-5.1	3.146,45 €	8,47 €	6,258,021
7	Sonnet-4.5	2.523,75 €	8,66 €	2,766,284
8	Haiku-4.5	2.221,05 €	2,92 €	2,713,124
9	GPT-5-mini	2.010,50 €	1,00 €	3,801,217
10	Grok-4.1-fast	1.658,00 €	0,56 €	2,780,833
11	Gemini-2.5-flash	1.396,90 €	1,82 €	5,511,593
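
The cash-versus-cost comparison can be reproduced from the table itself. The sketch below is one possible way to do it, not part of the benchmark: it copies the leaderboard rows as strings, parses the euro amounts (written with `.` as the thousands separator and `,` as the decimal separator) and the token counts (written with `,` as the thousands separator), and ranks models by cash generated per euro of inference cost. The helper names `parse_eur`, `parse_count` and `cash_per_cost_euro` are hypothetical, and the ratio is a derived figure the leaderboard does not report. Rows without an inference cost, such as the human baseline, are skipped.

```python
# Rows copied from the leaderboard: (model, cash generated, inference cost, tokens consumed).
ROWS = [
    ("GPT-5.2", "3.497,75 €", "4,95 €", "2,683,208"),
    ("Gemini-3-pro", "3.419,40 €", "13,64 €", "6,313,596"),
    ("Human👤", "3.285,00 €", "-", "-"),
    ("GPT-5", "3.256,50 €", "5,09 €", "3,758,924"),
    ("Opus-4.5", "3.162,75 €", "25,14 €", "4,854,680"),
    ("GPT-5.1", "3.146,45 €", "8,47 €", "6,258,021"),
    ("Sonnet-4.5", "2.523,75 €", "8,66 €", "2,766,284"),
    ("Haiku-4.5", "2.221,05 €", "2,92 €", "2,713,124"),
    ("GPT-5-mini", "2.010,50 €", "1,00 €", "3,801,217"),
    ("Grok-4.1-fast", "1.658,00 €", "0,56 €", "2,780,833"),
    ("Gemini-2.5-flash", "1.396,90 €", "1,82 €", "5,511,593"),
]


def parse_eur(text):
    """Parse an amount such as '3.497,75 €'; return None for a missing value ('-')."""
    text = text.replace("€", "").strip()
    if text == "-":
        return None
    # Drop the '.' thousands separators, then turn the ',' decimal separator into '.'.
    return float(text.replace(".", "").replace(",", "."))


def parse_count(text):
    """Parse a token count such as '2,683,208'; return None for a missing value ('-')."""
    text = text.strip()
    if text == "-":
        return None
    return int(text.replace(",", ""))


def cash_per_cost_euro(rows):
    """Return (model, cash / inference cost) pairs, highest ratio first."""
    ratios = []
    for model, cash, cost, _tokens in rows:
        cash_value = parse_eur(cash)
        cost_value = parse_eur(cost)
        if cash_value is None or not cost_value:
            continue  # no cost recorded (or zero cost), so the ratio is undefined
        ratios.append((model, cash_value / cost_value))
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


ranked = cash_per_cost_euro(ROWS)
for model, ratio in ranked:
    print(f"{model}: {ratio:.1f} € generated per € of inference cost")
```

Ranking by this ratio orders the models very differently from ranking by cash generated alone, so the two views are best read together.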