[Image: Blocky AI inhabitants gathering, mining and herding in a voxel world while a glowing GDP graph climbs above them]
Bench #2

We Built an AI Battle Arena. Then We Killed It.

Our Minecraft death-match looked great — and measured the wrong thing. The fastest model won, not the smartest. So we tore it down and built a benchmark for the reasoning that actually separates models: building an economy from nothing.

Robin
VoxelMind
7 min read
Tags: bench, llm benchmark, minecraft, ai economy, gdp

An honest follow-up

A week ago we published a post about dropping 14 AI models into a Minecraft arena to fight 3v3. People liked it. It looked great — two teams charging across a coliseum, swords out, last team standing. As a piece of theatre it worked.

As a benchmark, it was quietly broken. This post is us admitting that, explaining exactly how it broke, and showing you what we replaced it with. The new thing is live, and we think it's the more honest test by a mile.

The arena measured the wrong thing

The point of a model-vs-model benchmark is to isolate one variable — the mind making the calls — and let everything else be equal. We did the equal part well: identical gear, symmetric spawns, a fresh arena every match, no wallhack. But the variable we ended up measuring wasn't intelligence. It was speed.

The arena runs in real time. Every bot acts about once a second and the fight never waits. That quietly turns latency into the dominant combat stat. In our own runs a heavy reasoning model took 45 seconds to pick a single move — it stood there like a training dummy, soaked damage, and the round timed out. It wasn't dumb. It was thinking while everyone else was swinging. Meanwhile the cheapest, fastest models ran circles around far more capable ones.

Sit with that for a second: in a combat arena, a worse model with a faster API beats a smarter model with a slower one. That's not an intelligence benchmark. It's a ping test wearing a sword.
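
To put rough numbers on that, here is a minimal sketch of how many moves a bot gets in one real-time round. It uses the figures mentioned in this post (about one action per second, a 45-second decision, a 90-second fight). The function name and the rule that a bot cannot start a new decision until the previous one returns are illustrative assumptions, not the arena's actual code.

```python
import math

def moves_in_round(round_seconds: float, latency_seconds: float,
                   tick_seconds: float = 1.0) -> int:
    """Count the decisions a bot can land in one round (hypothetical model).

    Assumes a bot acts at most once per tick and cannot begin a new
    decision until the previous one has come back from the API.
    """
    seconds_per_move = max(latency_seconds, tick_seconds)
    return math.floor(round_seconds / seconds_per_move)

fast = moves_in_round(90, 0.5)  # fast model: capped by the one-second tick
slow = moves_in_round(90, 45)   # heavy reasoning model: 45 s per decision
print(fast, slow)               # 90 2
```

Under these assumptions the slow model gets 2 moves to the fast model's 90, no matter how good those 2 moves are.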

Three structural reasons combat doesn't work

  • Real time punishes thinking. The whole value of a strong reasoning model is that it deliberates. A clock that rewards twitch actively penalizes the thing you're trying to measure.
  • One match is a coin flip. Spawn angles, who-hits-whom, a lucky crit — combat is drenched in variance. You need a pile of runs to see signal through the noise, and even then the signal is thin.
  • We had to delete the interesting part to make it fair. Early versions let bots loot chests first. Everyone grabbed everything and ended up identically max-geared — so we cut looting and auto-equipped everyone. But "decide what to gather and when to invest" is exactly the kind of planning where models differ. We'd sanded off the most revealing decision just to get a clean fight.
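
The "coin flip" point can be made concrete with a back-of-the-envelope calculation. This sketch treats each match as an independent win or loss with a fixed win probability and uses the normal approximation to the binomial. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def win_rate_standard_error(p: float, n_matches: int) -> float:
    """Standard error of an observed win rate after n independent matches."""
    return math.sqrt(p * (1 - p) / n_matches)

def matches_needed(p: float, margin: float, z: float = 1.96) -> int:
    """Matches needed so the ~95% interval half-width is at most `margin`."""
    return math.ceil((z ** 2) * p * (1 - p) / margin ** 2)

# After a single match between evenly matched teams, the estimate is
# uncertain by half the whole scale.
print(win_rate_standard_error(0.5, 1))   # 0.5
# Matches needed to pin a near-50% win rate down to +/- 5 points.
print(matches_needed(0.5, 0.05))
```

Pinning a near-even matchup down to within five percentage points takes hundreds of matches under this model, which is the "pile of runs" the bullet refers to.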

What was left after all that sanding was twitch combat. And twitch is not where large language models are interesting. So we stopped defending the arena and asked a better question.

What do models actually differ at?

Not reaction time. They differ at planning over a long horizon: what to do first, when to spend effort now on something that pays off later, how to divide work between several agents, when to climb the tech tree instead of grinding the easy resource. That's the reasoning that matters in the real world, and it's invisible in a 90-second sword fight.

So we built a test that makes that reasoning the entire game.

The new benchmark: build an economy, not a body count

Here's the new format. Pick a model. It is handed a small team of inhabitants in a fresh survival world with nothing: empty hands, bare terrain. Its job is to build the richest economy it can. The score is a single number we call GDP: the total value its inhabitants produce.

And value works like a real economy. Raw resources are cheap. Processed goods are worth more — a plank beats a log, an iron ingot beats raw ore. Capital goods that compound your output, like tools, are worth the most. The whole price ladder is derived from the crafting tree itself: deeper tech, more value. GDP is cumulative — it only ever climbs — so the game becomes "watch the number go up," which turns out to be a genuinely good hook.
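
One way to picture a price ladder derived from the crafting tree is the sketch below. Everything in it is an assumption for illustration: the simplified one-output recipes, the flat raw value, the 1.5x markup per processing step, and the `Economy` class are hypothetical, not the benchmark's real pricing.

```python
from functools import lru_cache

# Hypothetical, simplified recipes: item -> ingredients for one unit.
RECIPES = {
    "plank": ["log"],
    "stick": ["plank"],
    "iron_ingot": ["raw_iron"],
    "iron_pickaxe": ["iron_ingot", "iron_ingot", "iron_ingot", "stick", "stick"],
}
RAW_VALUE = 1.0  # assumed value of any raw resource
MARKUP = 1.5     # assumed premium added by each crafting step

@lru_cache(maxsize=None)
def value(item: str) -> float:
    """Value of one unit: raw items are cheap, deeper crafts are worth more."""
    ingredients = RECIPES.get(item)
    if ingredients is None:          # not craftable, so treat it as raw
        return RAW_VALUE
    return MARKUP * sum(value(i) for i in ingredients)

class Economy:
    """Cumulative GDP: production only ever adds to the total."""
    def __init__(self) -> None:
        self.gdp = 0.0

    def produce(self, item: str, count: int = 1) -> float:
        self.gdp += value(item) * count
        return self.gdp

for name in ["log", "plank", "iron_ingot", "iron_pickaxe"]:
    print(name, value(name))
```

With these made-up numbers a plank (1.5) beats a log (1.0), an ingot beats raw ore, and the pickaxe (13.5) sits at the top because it is the deepest item in the tree.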

To make a raw number legible, the economy passes through stages as it grows: Subsistence → Hamlet → Village → Town → City → Metropolis. You can feel a model's reasoning in how fast it climbs that ladder — and where it stalls.
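
Mapping a raw GDP figure onto that ladder is a simple threshold lookup. The stage names below come from this post, but the threshold values are invented for illustration and the function name is hypothetical.

```python
from bisect import bisect_right

STAGES = ["Subsistence", "Hamlet", "Village", "Town", "City", "Metropolis"]
# Hypothetical GDP values at which each stage after Subsistence begins.
THRESHOLDS = [100, 500, 2_000, 10_000, 50_000]

def stage_for(gdp: float) -> str:
    """Return the stage whose threshold range contains this GDP."""
    return STAGES[bisect_right(THRESHOLDS, gdp)]

print(stage_for(0))       # Subsistence
print(stage_for(100))     # Hamlet
print(stage_for(75_000))  # Metropolis
```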

Why GDP is the honest metric

  • It rewards exactly what differs. A high GDP means the model planned a value chain: gather, process, build the tool, use the tool to gather faster. That compounding is the reasoning combat hid.
  • Slow models finally get a fair shot. The economy race is far less twitchy than a death-match. A deliberate, heavy reasoning model can take its time, think two steps ahead, and that patience can win instead of lose. The exact models the arena unfairly punished are the ones this test is built for.
  • It's legible. "Model X built a City, Model Y stalled at Hamlet" tells you more than "81.4% vs 79.9%" ever could — and you watched it happen.
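
The compounding described in the first bullet can be illustrated with a toy model: spend some time building a tool, then gather faster for the rest of the run. The tool cost (10 ticks) and speedup (3x) are made-up parameters, and the function is a hypothetical sketch rather than anything from the benchmark.

```python
def units_gathered(ticks: int, invest: bool,
                   tool_cost_ticks: int = 10, speedup: float = 3.0) -> float:
    """Units gathered over `ticks`, with or without building a tool first."""
    if not invest:
        return float(ticks)  # one unit per tick by hand
    productive_ticks = max(ticks - tool_cost_ticks, 0)
    return productive_ticks * speedup

for horizon in (12, 15, 60):
    print(horizon, units_gathered(horizon, False), units_gathered(horizon, True))
```

On a short horizon (12 ticks) grinding by hand wins, at 15 ticks the two strategies tie, and at 60 ticks the tool builder is far ahead. Knowing which side of that break-even point you are on is a planning decision, not a reflex.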

The fairness rules we got right in the arena carried straight over. Every run uses the same world (identical trees, ore and animals, so no terrain luck). There is no wallhack: the model sees only what a human player would. And we earn the metric before we trust it, running a model against itself enough times to know whether a gap between two models is real or just variance.
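
A crude version of that "is the gap real?" check might look like the sketch below. It compares the difference in mean GDP against a Welch-style standard error. This is a stand-in technique chosen for illustration, not necessarily the procedure the benchmark uses, and the GDP lists are made-up inputs, not results.

```python
import statistics

def gap_is_real(runs_a: list[float], runs_b: list[float], z: float = 2.0) -> bool:
    """True if the mean gap exceeds z standard errors (needs 2+ runs each)."""
    mean_a, mean_b = statistics.mean(runs_a), statistics.mean(runs_b)
    standard_error = (
        statistics.variance(runs_a) / len(runs_a)
        + statistics.variance(runs_b) / len(runs_b)
    ) ** 0.5
    return abs(mean_a - mean_b) > z * standard_error

# Made-up GDP totals from repeated runs, for illustration only.
model_a = [1000, 1100, 900, 1050, 950]
model_a_again = [1020, 980, 1080, 940, 1010]   # same model, second batch
model_b = [2000, 2100, 1900, 2050, 1950]

print(gap_is_real(model_a, model_a_again))  # False: within run-to-run noise
print(gap_is_real(model_a, model_b))        # True: gap dwarfs the noise
```

Running a model against itself, as in the first comparison, gives the noise floor. Only gaps that clear it should count as a difference between models.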

The twist: you can coach it

The best part of rebuilding around planning is that planning is something a human can contribute. So there are two ways to play. In Lab, you pick a model and watch it play solo — pure model, no human in the loop, its GDP goes on the leaderboard. In the coached mode, you become the CEO: hand the model a strategy in plain language and try to beat its own baseline. It's competitive prompting — your wits, its execution — and the leaderboard marks which runs were human-coached.
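
A leaderboard entry that separates solo runs from coached ones could be as simple as the following. The field names, the `Run` type, and the choice to define a model's baseline as its best solo GDP are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Run:
    model: str
    gdp: float
    coached: bool = False
    strategy: Optional[str] = None  # plain-language strategy, coached runs only

def beats_baseline(coached_run: Run, runs: list[Run]) -> bool:
    """Did a coached run beat the same model's best solo run (assumed baseline)?"""
    solo = [r.gdp for r in runs
            if r.model == coached_run.model and not r.coached]
    return bool(solo) and coached_run.gdp > max(solo)

# Placeholder entries, not real leaderboard data.
board = [Run("model-x", 1200.0), Run("model-x", 1350.0)]
attempt = Run("model-x", 1500.0, coached=True,
              strategy="Make tools before mining iron.")
print(beats_baseline(attempt, board))  # True
```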

Watch a model build something

This is live now. Head to the Bench, start a simulation, and watch a fresh world fill in — inhabitants chopping, mining a stone cliff, smelting iron, crafting their first tools — with the GDP number ticking upward in real time. Then check the leaderboard to see which models, and which human coaches, built the biggest economy.

The arena was a great way to prove a benchmark can be watched and joined. It just measured the wrong thing. The economy measures what the arena was always meant to. Two minds still enter, but now they build.