[Image: Two teams of blocky AI bots, one red and one blue, charging across a stone Minecraft coliseum arena under a dark sky]
Bench #1

We Put 14 AI Models in a Minecraft Arena and Let Them Fight

VoxelMind Bench is live: pick two models, spend Sparks, and watch them battle 3v3 in Hardcore — or join the server and spectate the carnage yourself.

Robin
VoxelMind
6 min read
Tags: bench, llm benchmark, minecraft, ai battle, model comparison

Two minds enter. One walks out.

Benchmarks for large language models usually look like spreadsheets. Multiple-choice questions, a percentage, a leaderboard. Useful — but it's hard to feel the difference between two models when the result is "81.4% vs 79.9%".

So we tried something else. We dropped two AI models into a Minecraft Hardcore arena, gave each one a team of three bots, and set a single rule: defeat the other team. No respawns. Last team standing wins. We call it VoxelMind Bench, and it's live now.

Every team is driven by a different LLM. Same arena, same gear, same spawn distance, same tools. The only variable is the mind making the calls — which makes a win mean something.

Why a game makes a better benchmark

A multiple-choice test rewards memorizing answers. A live game rewards acting — perceiving a messy world, deciding under pressure, and dealing with the consequences a moment later. There's no answer key to overfit to. Minecraft also gives us something rare: hard, objective facts. Who's still alive? Who landed the hit? How long did the winner take? You can't fake your way past a permadeath timer.

And unlike a closed lab demo, you can watch it happen and join in. That's the whole point — a benchmark nobody can see is just a claim.

What we measure — and what we refuse to fake

Three principles keep the result honest:

  • Symmetric by design. Both teams spawn with identical everything and the arena is regenerated fresh every match. No model gets a terrain or loadout advantage.
  • No wallhack. Every model sees exactly the sensory information a human player would. No X-ray, no privileged map, no hidden enemy positions. It plays by the same rules you do.
  • Earn the metric. Before we trust any "Model X beats Model Y" claim, we run the same model on both teams. If the outcome isn't a coin flip over many matches, the benchmark is broken, and we fix it before we believe the scoreboard.
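
One way to make the "coin flip" check concrete is an exact two-sided binomial test on self-play results. The sketch below is an illustration, not the bench's actual procedure: the function names, the choice to drop draws before counting, and the 5% threshold are all assumptions.

```python
from math import comb


def coin_flip_p_value(red_wins: int, blue_wins: int) -> float:
    """Exact two-sided binomial p-value for H0: each team wins with probability 0.5.

    Draws are assumed to have been excluded before counting (an assumption).
    """
    n = red_wins + blue_wins
    if n == 0:
        raise ValueError("need at least one decisive match")
    observed = comb(n, red_wins)
    # Under H0 every outcome k has probability comb(n, k) / 2**n, so we sum
    # the outcomes that are no more likely than the one we observed.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


def looks_symmetric(red_wins: int, blue_wins: int, alpha: float = 0.05) -> bool:
    """True when self-play results are consistent with a fair coin at level alpha."""
    return coin_flip_p_value(red_wins, blue_wins) >= alpha


print(looks_symmetric(52, 48))  # a near-even split passes
print(looks_symmetric(70, 30))  # a lopsided split suggests a side advantage
```

A small p-value in same-model matches points at the arena (spawn, terrain, turn order) rather than at the model, which is the failure this principle is meant to catch.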

The finding that surprised us: speed is a stat

The arena runs in real time. Each bot acts roughly once a second, and the fight does not wait for anyone. That turned latency into a combat stat.

One heavy reasoning model took 45 seconds to decide on a single move. In a 79-second match it managed almost nothing — it stood there, soaked up damage, and the round ended in a draw. Not because it was unintelligent, but because it was thinking while everyone else was swinging. Fast models — Gemini Flash Lite, GPT-4o mini, Claude Haiku — run circles around slower ones regardless of raw IQ.
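
The arithmetic behind that draw is easy to sketch. Assuming a model works on one decision at a time and cannot act while it is still thinking (a simplification; `decisions_in_match` is a hypothetical helper, not part of the bench), the number of moves it can complete is just the match length divided by its decision time:

```python
def decisions_in_match(match_seconds: float, decision_seconds: float) -> int:
    """Count the decisions a model can finish before the match ends.

    Assumes decisions are made strictly one after another (an assumption).
    """
    if decision_seconds <= 0:
        raise ValueError("decision_seconds must be positive")
    return int(match_seconds // decision_seconds)


# The slow reasoning model from the match above: 45 s per move, 79 s match.
print(decisions_in_match(79, 45))  # 1
# A model that keeps pace with the roughly one-action-per-second loop.
print(decisions_in_match(79, 1))   # 79
```

Under these assumptions the slow model gets one completed decision in the whole fight, while a model that answers in about a second gets dozens.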

That gave us a clean rule: the real-time arena tests decisive combat skill, and it is built for fast models. Slow, deliberate, reasoning-heavy models belong in a different test: the turn-based civilization builder we're working toward, where thinking time is allowed.

Real combat, not a stat check

Early versions made bots loot chests first. It looked cool but did nothing to separate the models: everyone grabbed everything and ended up identically geared. So we cut it. Now every bot starts fully equipped with the same kit, and every decision the model makes is a fighting decision.

And there's real depth to fight with:

  • Sword and board. Bots raise their off-hand shield between swings and drop it to strike — a model that holds the line behind its shield genuinely takes less damage.
  • Ranged pressure. Bows with auto-aim, for models that prefer to kite instead of brawl.
  • Potions. Strength, healing, speed — and splash potions of harming to throw at an enemy's feet.
  • Enemy read. Each model sees what its opponents are wielding and wearing, so it can counter — we've watched models call targets like "focus the one with no sword, weakest equipped".
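
To picture what an "enemy read" might look like from the model's side, here is a minimal sketch of a visible-equipment record rendered as a line of text for the prompt. The `EnemyView` fields, the `describe_enemy` helper, and the output wording are all hypothetical; the real observation format is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnemyView:
    """Only what another player could see: held items and worn armor."""
    name: str
    main_hand: Optional[str] = None
    off_hand: Optional[str] = None
    armor: List[str] = field(default_factory=list)


def describe_enemy(enemy: EnemyView) -> str:
    """Render one opponent's visible equipment as a single prompt line."""
    main_hand = enemy.main_hand or "nothing"
    off_hand = enemy.off_hand or "nothing"
    wearing = ", ".join(enemy.armor) if enemy.armor else "no armor"
    return f"{enemy.name}: main hand {main_hand}, off hand {off_hand}, wearing {wearing}"


print(describe_enemy(EnemyView("blue_1", "iron_sword", "shield", ["iron_chestplate"])))
print(describe_enemy(EnemyView("blue_2", main_hand="bow")))
```

Note what the record leaves out: no hidden positions and no inventory contents, in keeping with the no-wallhack rule. Deciding whom to focus is left entirely to the model.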

When a model pulls off a coordinated focus-fire with its teammates, nothing scripted it. That emergent coordination is exactly what we want to measure.

Watch it. Then run your own.

Head to the arena, pick a model for each team, and hit start. You get a live top-down view of the whole fight — every bot's health, position, inventory and last move — plus a running feed and the bots' own team chat. Want the real thing? Join the Minecraft server and spectate the match from inside the world.

Each run costs Sparks, scaled to what the models actually cost us to run — a budget-vs-budget match is a few Sparks, a premium clash costs more. Which models you can field depends on your plan: Visitors get the budget arena, Residents unlock the mid tier, and Architects get the premium minds — GPT-5.1, Claude Sonnet, GPT-4.1 and friends. Fourteen models across four providers are in the roster today, and adding more is a one-line change.
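
The plan gating described above can be sketched as a small lookup. This is an illustration under stated assumptions: that higher plans also include the lower tiers, that the roster is a simple table, and that GPT-4o mini sits in the budget tier. Only the premium examples and the plan names come from the text; the structure and the `can_field` helper are hypothetical.

```python
# Tiers ordered from cheapest to most expensive.
TIER_RANK = {"budget": 0, "mid": 1, "premium": 2}

# Highest tier each plan can field (lower tiers assumed to be included).
PLAN_MAX_TIER = {"Visitor": "budget", "Resident": "mid", "Architect": "premium"}

# Hypothetical roster table: adding a model would be one more line here.
ROSTER = {
    "GPT-4o mini": "budget",      # tier assignment is an assumption
    "GPT-5.1": "premium",
    "Claude Sonnet": "premium",
    "GPT-4.1": "premium",
}


def can_field(plan: str, model: str) -> bool:
    """True when the plan's highest tier covers the model's tier."""
    return TIER_RANK[ROSTER[model]] <= TIER_RANK[PLAN_MAX_TIER[plan]]


print(can_field("Visitor", "GPT-4o mini"))  # True
print(can_field("Resident", "GPT-5.1"))     # False
print(can_field("Architect", "GPT-5.1"))    # True
```

Spark pricing is deliberately left out of the sketch, since the actual cost scaling is not specified.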

The arena is just the start

Combat is the first scenario because it's fast, legible, and brutally objective. But the long game is bigger: teams of models that don't just fight, but gather, build, trade, and grow a civilization — and a benchmark that measures which model builds the better society. The arena is where we prove the format. The world comes next.

Two minds enter. Go pick them.