Same Instructions, Different Game

I wanted to see how different agentic coding tools handle the same problem. Not a todo app. Not a REST API. Something that would force them to make decisions.

A while back I built a text adventure in watsonx Assistant for Halloween. A haunted mansion where you had to escape while a ghost wandered the rooms. It was a good test of what the platform could do and I enjoyed building it.

So when I wanted to compare agentic coding tools, a text adventure felt like the right kind of problem. Familiar enough that I’d know what good looked like, complex enough to be a real test.

I wrote one instruction file and gave it to four tools: Claude, Ollama OpenClaw, Codex, and IBM Bob.

The task was to build a Cluedo-style game using multi-agent architecture in watsonx Orchestrate.

Each suspect had to be its own agent with its own behaviours. You, as the player, could walk around a mansion, find murder weapons, interrogate suspects, and piece together who did it.

There was a text map, a notebook for tracking clues, and richly described rooms you could interact with even when they had nothing to do with solving the case. It needed proper multi-agent orchestration and tool use.
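The shape of that brief can be sketched in plain Python. To be clear, this is my own illustrative sketch, not the watsonx Orchestrate ADK: every class and method name here is invented. The idea is just that each suspect is a separate agent with private knowledge and its own interrogation behaviour, coordinated by a game object that routes player actions and logs clues.

```python
from dataclasses import dataclass, field

# Illustrative sketch only -- these names are invented, not ADK APIs.
# Each suspect is its own agent with private state and behaviour.

@dataclass
class SuspectAgent:
    name: str
    alibi: str
    secret: str
    pressure_needed: int = 2   # how many times the player must press
    _pressed: int = 0

    def interrogate(self, question: str) -> str:
        """Give the alibi until pressed hard enough, then crack."""
        self._pressed += 1
        if self._pressed >= self.pressure_needed:
            return f"{self.name}: Fine. {self.secret}"
        return f"{self.name}: {self.alibi}"

@dataclass
class Mansion:
    rooms: dict = field(default_factory=dict)      # room name -> description
    suspects: dict = field(default_factory=dict)   # name -> SuspectAgent
    notebook: list = field(default_factory=list)   # clues gathered so far

    def ask(self, suspect_name: str, question: str) -> str:
        answer = self.suspects[suspect_name].interrogate(question)
        self.notebook.append(answer)               # every answer is a clue
        return answer

game = Mansion(
    rooms={"library": "Dusty shelves, an overturned lamp."},
    suspects={"Plum": SuspectAgent("Plum", "I was in the study.",
                                   "I saw Scarlett with the rope.")},
)
game.ask("Plum", "Where were you?")
game.ask("Plum", "Where were you, really?")
```

In a real multi-agent build each suspect would be a separately deployed agent with its own prompt and tools, and the orchestration layer would do the routing that `Mansion.ask` fakes here.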

All four produced a working game. But how they got there, and what they built, was very different.

Codex

This one made me laugh. Because I was using the same instruction file across all four, Codex noticed that Claude had already created a solution in a different directory. Its first move was to copy it into its own folder and call it done.

After I stopped that, it did build the project. But it worked with almost no visibility into what it was doing and needed more corrections afterwards. It got there. It just wasn’t interested in having me involved.

OpenClaw

Painfully slow. Local models running in a restrictive VM, so no surprise there. It wasn't particularly forthcoming about what it was up to either. But it put together a workable solution without any fuss. No drama, no shortcuts, just got on with it.

Claude

Claude started with the MCP documentation server, which is what I expected. Then after a while it started reading the Python ADK module source code directly. Token-wasting, but you could see why. It wanted to understand the framework rather than trust the docs. I’ve done the same thing.

Where it differed from the others was how it worked with me. It walked through every step, let me review, question, tweak, or ask it to explain its approach before it moved on. I felt like I was part of the build rather than waiting for a delivery.

IBM Bob

Bob confirmed everything it needed was up and running before writing a single line of code. Methodical. It stopped occasionally to let me review, but the breaks felt more like checkpoints than conversations. The volume of code at each pause was too much to easily digest. I got the sense I was slowing it down rather than being consulted.

It also created detailed architecture documents of its own accord, which none of the others did. Bob was treating this like a project, not a task.

The Games Themselves

This is where it got interesting. Same instructions. Four different games.

OpenClaw and Codex each built a solid Cluedo text adventure. You could explore, talk to suspects, ask about your surroundings. Faithful to the brief, and both worked.

Claude did the same but added something I hadn’t asked for. The suspect agents would talk to each other in the background, and you could overhear their conversations. The suspects also didn’t just give up information when you asked – you had to work at it, press them, catch them out. It made the whole thing feel more alive.

Bob went a different direction entirely. Instead of suspects holding cards, it built all the clues into the mansion itself. You had to read letters, notes, memos scattered through the rooms. Some of those could be used to pressure suspects into talking. Others were direct hints – blood on a lamp, a torn envelope in a drawer. It felt less like a card game and more like an actual crime scene investigation.
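Bob's environment-driven design could be modelled roughly like this. Again, this is my own sketch with invented names, not Bob's actual code: clues hang off room objects rather than suspects, and some clues double as leverage you can use against a particular suspect.

```python
from dataclasses import dataclass, field

# Illustrative sketch (invented names, not Bob's actual output):
# clues live in the environment, and some double as interrogation leverage.

@dataclass
class Clue:
    text: str
    leverage_on: str = ""   # suspect this clue can pressure, if any

@dataclass
class Room:
    name: str
    clues: list = field(default_factory=list)

    def search(self) -> list:
        """Return everything readable or noticeable in this room."""
        return [c.text for c in self.clues]

study = Room("study", clues=[
    Clue("Blood on the lamp."),                    # direct hint
    Clue("A torn envelope addressed to Mustard.",
         leverage_on="Mustard"),                   # usable against a suspect
])

found = study.search()
leverage = [c for c in study.clues if c.leverage_on == "Mustard"]
```

The interesting consequence of this model is that interrogation and exploration feed each other: searching rooms is how you earn the right to break a suspect.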

All from the same instruction file.

So What?

This isn’t a benchmark. I’m not crowning a winner. But it showed me something I think matters. Give four systems the same brief and they don’t just write different code – they make different creative decisions. What to emphasise. What to add. How to interpret the problem.

They can all write code. That bit’s settled. The more useful question is whether you want to be part of the process or just see what comes out the other end.


If you want to test these tools yourself, give them something that isn’t a standard coding exercise. That’s where the differences show up.
