What is SantotoBench?

SantotoBench is a benchmark that measures an AI agent's ability to manage a txistorra sandwich and cider stand.

Every year on December 21st, San Sebastián (Spain) celebrates the Santo Tomás fair, a popular festival whose stars are cider and txistorra. Many stands selling pintxos and txistorra sandwiches are set up in the city center.

The benchmark is inspired by this festival, and the AI agent's objective is to manage one of these stands.

What is the agent's objective?

The agent's objective is simple: maximize cash at the end of the day.

What actions can the agent perform?

The stand managed by the agent opens at 10am and closes at 8pm. The agent starts the day with €500 in cash and some stock. In each turn, the agent can perform the following actions:

  • Buy more stock: The initial stock is not enough to meet the stand's demand for the whole day. If the agent doesn't want to run out of stock at some point, it must order more ingredients.
  • Assign tasks to workers: There are 8 workers at the stand and 4 types of tasks: frying txistorra, preparing pintxos and sandwiches, serving customers, and opening cider bottles. Workers without an assigned task don't work, so the agent must decide which task each worker performs (an illustrative assignment is sketched after this list).
  • Modify prices: The stand sells 3 products (txistorra pintxos, txistorra sandwiches, and cider bottles). Each product has an initial price, but the agent can change prices at any time.
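
To make the worker-assignment action concrete, here is a minimal sketch of what an assignment could look like. The worker names and task identifiers are assumptions for illustration, not the benchmark's actual identifiers.

```python
# Illustrative worker-to-task assignment: 8 workers, 4 task types.
# Names and task strings are assumptions, not the benchmark's real identifiers.
TASKS = {"fry_txistorra", "prepare_food", "serve_customers", "open_cider"}

assignment = {
    "worker_1": "fry_txistorra",
    "worker_2": "fry_txistorra",
    "worker_3": "prepare_food",
    "worker_4": "prepare_food",
    "worker_5": "serve_customers",
    "worker_6": "serve_customers",
    "worker_7": "serve_customers",
    "worker_8": "open_cider",
}

# Every worker gets exactly one task, and every task must be a known type.
assert len(assignment) == 8 and set(assignment.values()) <= TASKS
```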

Each turn represents a 15-minute period. Once the agent has performed all the actions it considers necessary, product demand is simulated for the next 15 minutes, after which the agent can act again. The simulation consists of 40 turns in total: 10 hours divided into 15-minute periods.
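
To make the turn structure concrete, here is a minimal sketch of the day loop in Python. The helper functions (`run_day`, `agent_act`, `simulate_demand`) and the starting stock values are assumptions for illustration, not the benchmark's actual code.

```python
# Minimal sketch of the day's structure, under illustrative assumptions.
OPEN_HOUR, CLOSE_HOUR = 10, 20                  # the stand is open 10am-8pm
TURN_MINUTES = 15
TOTAL_TURNS = (CLOSE_HOUR - OPEN_HOUR) * 60 // TURN_MINUTES   # 40 turns

def run_day(agent_act, simulate_demand):
    cash = 500.0                                # starting cash in euros
    stock = {"txistorra": 100, "bread": 100, "cider_bottles": 50}  # illustrative
    for turn in range(TOTAL_TURNS):
        agent_act(turn, cash, stock)            # buy stock, assign workers, set prices
        cash = simulate_demand(turn, cash, stock)   # demand for the next 15 minutes
    return cash                                 # final cash determines the score
```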

What framework did I use to develop the agent?

I didn't use any framework. The code simply makes requests to the provider's API using tool calling. The system prompt explains the game rules to the model, and each turn a message is sent listing the orders delivered in the last 15 minutes.

Evaluations have been run with models from 4 different providers: OpenAI, Google (Gemini), Anthropic, and xAI. In all cases, the official SDKs are used.
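
The per-turn interaction is a standard tool-calling loop. As a rough illustration, a minimal sketch with the OpenAI SDK might look like the following; `execute_tool` and the model name are placeholders I'm assuming for the example, not the benchmark's actual code, and the other providers follow the same pattern with their own SDKs.

```python
# Minimal sketch of the per-turn tool-calling loop (OpenAI SDK shown).
# execute_tool is a hypothetical helper that runs a tool and returns a JSON string.
from openai import OpenAI

client = OpenAI()

def play_turn(messages: list, tools: list) -> None:
    """Let the model call tools until it ends the turn."""
    while True:
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return                      # model answered without calling a tool
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,      # JSON string with the tool's result
            })
        if any(c.function.name == "end_turn" for c in msg.tool_calls):
            return                      # the agent chose to advance to the next turn
```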

The tools available to the agent are as follows:

  • get_status: returns the available cash and current stock
  • get_prices: returns the current prices
  • set_prices: edits the prices
  • place_order: buys more ingredient stock
  • assign_workers: assigns tasks to the workers
  • end_turn: the agent calls this when it has no more actions to take and wants to advance to the next turn
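
For illustration, one of these tools could be declared like this in the OpenAI tool-calling format, expressed as a Python dict. The parameter names and descriptions here are assumptions, not the benchmark's actual schema.

```python
# Illustrative declaration of the set_prices tool (parameter names are assumed).
SET_PRICES_TOOL = {
    "type": "function",
    "function": {
        "name": "set_prices",
        "description": "Edit the selling prices of the three products.",
        "parameters": {
            "type": "object",
            "properties": {
                "pintxo": {"type": "number", "description": "Price of a txistorra pintxo in euros"},
                "sandwich": {"type": "number", "description": "Price of a txistorra sandwich in euros"},
                "cider": {"type": "number", "description": "Price of a cider bottle in euros"},
            },
            "required": ["pintxo", "sandwich", "cider"],
        },
    },
}
```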

How is the final score calculated?

The score is the cash generated throughout the day, which equals final cash - initial cash.

When evaluating the same model multiple times under the same conditions, I have observed high variance between results. To reduce the effect of this variance on the leaderboard score, I have evaluated each model 3 times and kept the median result.
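
In code, the leaderboard score for a model could be computed like this; the run values in the usage example are made up purely for illustration.

```python
# Score of a run = final cash - initial cash; leaderboard score = median of 3 runs.
from statistics import median

INITIAL_CASH = 500.0

def run_score(final_cash: float) -> float:
    return final_cash - INITIAL_CASH

def leaderboard_score(final_cash_per_run: list[float]) -> float:
    return median(run_score(c) for c in final_cash_per_run)

# Made-up example: three runs ending with 812, 945 and 878 euros -> score 378.0
print(leaderboard_score([812.0, 945.0, 878.0]))
```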

Ideally, the sample size would be larger, with each model evaluated at least 5 times, but I don't have enough budget for that.