Exploring watsonx Orchestrate in 3D

I spend a lot of time working inside watsonx Orchestrate. Agents, tools, knowledge bases, the connections between them. After a while you build a mental model of how everything fits together, but it stays in your head. I wanted to see it.

So I built wxo-explorer – a 3D network graph that connects to a watsonx Orchestrate instance via its REST API and renders the whole environment as something you can fly around and interact with.

Agents show up as blue spheres. Tools are green cubes. Knowledge bases are orange cylinders. Edges show the relationships between them. Click on anything and you get its details, what it connects to, what uses it. You can also chat with agents directly from inside the app, each one maintaining its own conversation session so you can jump between them without losing context.
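
If you're wondering how that shape mapping looks in code, here's a minimal GDScript sketch of the idea – the NodeKind enum and make_node helper are names made up for illustration, not the project's actual code:

```gdscript
# Sketch of the node-type-to-shape mapping; the NodeKind enum and make_node
# helper are illustrative names, not the project's actual code.
extends Node3D

enum NodeKind { AGENT, TOOL, KNOWLEDGE_BASE }

func make_node(kind: NodeKind, label: String) -> MeshInstance3D:
    var node := MeshInstance3D.new()
    var material := StandardMaterial3D.new()
    match kind:
        NodeKind.AGENT:
            node.mesh = SphereMesh.new()      # agents: blue spheres
            material.albedo_color = Color.DODGER_BLUE
        NodeKind.TOOL:
            node.mesh = BoxMesh.new()         # tools: green cubes
            material.albedo_color = Color.SEA_GREEN
        NodeKind.KNOWLEDGE_BASE:
            node.mesh = CylinderMesh.new()    # knowledge bases: orange cylinders
            material.albedo_color = Color.ORANGE
    node.material_override = material
    node.name = label
    add_child(node)
    return node
```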

Here’s a short video of it in action.

Why Godot

The obvious choice for something like this would have been Python. It’s what I use for most things. But Python struggles with real-time 3D rendering and parallel processing, and this needed both. The graph uses a force-directed layout algorithm that has to run continuously while you’re navigating around it.
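
Here's roughly what one step of that layout looks like – a minimal sketch, not the project's actual implementation, with node_positions, velocities, and edges as stand-in structures:

```gdscript
# Minimal sketch of one force-directed layout step, run every physics frame.
# node_positions, velocities, and edges are stand-ins; the graph's mesh
# instances would be moved to the resulting positions elsewhere.
extends Node3D

var node_positions: Array[Vector3] = []
var velocities: Array[Vector3] = []
var edges: Array[Vector2i] = []        # pairs of node indices

const REPULSION := 4.0
const SPRING := 0.05
const DAMPING := 0.85

func _physics_process(delta: float) -> void:
    var forces: Array[Vector3] = []
    forces.resize(node_positions.size())
    forces.fill(Vector3.ZERO)

    # Repulsion: every pair of nodes pushes apart (O(n^2), fine for small graphs).
    for i in node_positions.size():
        for j in range(i + 1, node_positions.size()):
            var diff := node_positions[i] - node_positions[j]
            var dist := maxf(diff.length(), 0.1)
            var push := diff.normalized() * (REPULSION / (dist * dist))
            forces[i] += push
            forces[j] -= push

    # Attraction: connected nodes pull together like springs.
    for edge in edges:
        var pull := (node_positions[edge.y] - node_positions[edge.x]) * SPRING
        forces[edge.x] += pull
        forces[edge.y] -= pull

    # Integrate with damping so the layout keeps settling while you navigate.
    for i in node_positions.size():
        velocities[i] = (velocities[i] + forces[i] * delta) * DAMPING
        node_positions[i] += velocities[i]
```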

Godot had what I needed out of the box. Good 3D, built-in parallel processing, a permissive MIT licence, and GDScript is straightforward enough that you can read the code and understand what it’s doing without fighting the engine. I’ve used it before, so I knew I could move quickly.
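
The parallel processing piece is Godot's built-in WorkerThreadPool. Here's a sketch of how the repulsion pass above could be fanned out across threads, assuming forces is promoted to a member array – again illustrative rather than the project's actual code:

```gdscript
# Sketch: fan the repulsion pass out across threads with WorkerThreadPool.
# Assumes node_positions and forces from the sketch above, with forces kept
# as a member array so each thread writes only its own slot.
func compute_repulsion_parallel() -> void:
    var group := WorkerThreadPool.add_group_task(_repulsion_for_node, node_positions.size())
    WorkerThreadPool.wait_for_group_task_completion(group)

func _repulsion_for_node(i: int) -> void:
    var force := Vector3.ZERO
    for j in node_positions.size():
        if i == j:
            continue
        var diff := node_positions[i] - node_positions[j]
        var dist := maxf(diff.length(), 0.1)
        force += diff.normalized() * (REPULSION / (dist * dist))
    forces[i] = force
```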

It also meant the whole thing runs as a standalone app. No browser, no server, just open it, point it at your Orchestrate instance, and go.

What You Can Do With It

The camera has two modes – orbit and free-fly. Orbit is good for looking at the overall structure. Free-fly is better for getting in close and following the connections between nodes. Keyboard, mouse, and gamepad all work.
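
A rough sketch of how the two modes differ – illustrative rather than the project's actual camera code, assuming standard input-map actions like move_left and move_forward, with mode switching left out:

```gdscript
# Sketch of the two camera modes; names and input actions are illustrative,
# and switching between the modes is omitted.
extends Camera3D

enum Mode { ORBIT, FREE_FLY }

var mode := Mode.ORBIT
var orbit_target := Vector3.ZERO
var orbit_distance := 20.0
var yaw := 0.0
var pitch := -0.4
var fly_speed := 10.0

func _unhandled_input(event: InputEvent) -> void:
    if event is InputEventMouseMotion:
        yaw -= event.relative.x * 0.005
        pitch = clampf(pitch - event.relative.y * 0.005, -1.4, 1.4)

func _process(delta: float) -> void:
    var view_basis := Basis.from_euler(Vector3(pitch, yaw, 0.0))
    if mode == Mode.ORBIT:
        # Orbit: sit on a sphere around the target and look back at it.
        position = orbit_target + view_basis * Vector3(0.0, 0.0, orbit_distance)
        look_at(orbit_target)
    else:
        # Free-fly: move along the camera's own axes (assumed input-map actions).
        var dir := Input.get_vector("move_left", "move_right", "move_forward", "move_back")
        position += view_basis * Vector3(dir.x, 0.0, dir.y) * fly_speed * delta
        rotation = Vector3(pitch, yaw, 0.0)
```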

The chat panel lets you talk to any agent directly. Select the agent node, open the chat, and you’re in a conversation. The responses render with full markdown support – headings, code blocks, tables, the lot. It’s useful for testing agent behaviour without switching back to the Orchestrate UI.
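
Godot's RichTextLabel renders BBCode rather than markdown, so something has to translate between the two. A tiny sketch of one way to do that, with made-up function names and only a couple of rules shown:

```gdscript
# Sketch: RichTextLabel takes BBCode, so agent replies get a small
# markdown-to-BBCode pass first. Only bold and inline code are handled here;
# the function names and regexes are illustrative.
func append_agent_reply(chat_log: RichTextLabel, markdown: String) -> void:
    chat_log.bbcode_enabled = true
    chat_log.append_text(markdown_to_bbcode(markdown) + "\n")

func markdown_to_bbcode(text: String) -> String:
    var bold := RegEx.create_from_string(r"\*\*(.+?)\*\*")
    var inline_code := RegEx.create_from_string(r"`([^`]+)`")
    text = bold.sub(text, "[b]$1[/b]", true)
    text = inline_code.sub(text, "[code]$1[/code]", true)
    return text
```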

There’s also dual authentication so it works with both the Developer Edition running locally and SaaS instances.
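
Roughly, both modes end up sending a bearer token; only where the token comes from differs. In the sketch below, the IBM Cloud IAM token exchange is a real endpoint (for an IBM Cloud-hosted SaaS instance); the rest, including the assumption that the local Developer Edition also takes a bearer token, is illustrative:

```gdscript
# Sketch only. The IAM token exchange is IBM Cloud's real endpoint; the idea
# that both environments then take "Authorization: Bearer <token>" is an
# assumption for illustration, not the documented Orchestrate API.
func request_saas_token(http: HTTPRequest, api_key: String) -> void:
    # SaaS: exchange an IBM Cloud API key for a short-lived bearer token.
    var body := "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=" + api_key.uri_encode()
    http.request(
        "https://iam.cloud.ibm.com/identity/token",
        PackedStringArray(["Content-Type: application/x-www-form-urlencoded"]),
        HTTPClient.METHOD_POST,
        body)

func auth_headers(token: String) -> PackedStringArray:
    # Local Developer Edition: assumed to accept a locally issued token the same
    # way, so the request headers can be built identically for both modes.
    return PackedStringArray([
        "Content-Type: application/json",
        "Authorization: Bearer " + token,
    ])
```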

The Bob Surprise

After building this with Claude, I gave the same instruction file to IBM Bob. I honestly didn’t expect much. Godot has been a problem for most LLMs because there are multiple versions floating around and they tend to mix up the APIs. GDScript isn’t exactly mainstream training data.

Bob built it. Not a rough approximation, an actual working wxo-explorer with the same core functionality. 3D graph, node interaction, API connectivity.

For comparison, I’d also tried ChatGPT and it was so bad I gave up on it entirely. It kept mixing up Godot versions and producing code that didn’t run.

Bob didn’t have that problem. Whatever it’s doing with its context and tooling, it handled a niche framework better than I’d have predicted. Good enough that I ended up adding features to the Bob version for work.

It’s another example of something I keep noticing. These tools are moving faster than the assumptions we have about them. The gaps I expected to find aren’t always where I expected them to be.


The source is on GitHub if you want to try it yourself. You’ll need Godot 4.6 and a watsonx Orchestrate instance to connect to.

Same Instructions, Different Game

I wanted to see how different agentic coding tools handle the same problem. Not a todo app. Not a REST API. Something that would force them to make decisions.

A while back I built a text adventure in watsonx Assistant for Halloween. A haunted mansion where you had to escape while a ghost wandered the rooms. It was a good test of what the platform could do and I enjoyed building it.

So when I wanted to compare agentic coding tools, a text adventure felt like the right kind of problem. Familiar enough that I’d know what good looked like, complex enough to be a real test.

I wrote one instruction file and gave it to four tools: Claude, OpenClaw (running local models through Ollama), Codex, and IBM Bob.

The task was to build a Cluedo-style game using a multi-agent architecture in watsonx Orchestrate.

Each suspect had to be its own agent with its own behaviours. You, as the player, could walk around a mansion, find murder weapons, interrogate suspects, and piece together who did it.

There was a text map, a notebook for tracking clues, and richly described rooms you could interact with even when they had nothing to do with solving the case. It needed proper multi-agent orchestration and tool use.

All four produced a working game. But how they got there, and what they built, was very different.

Codex

This one made me laugh. Because I was using the same instruction file across all four, Codex noticed that Claude had already created a solution in a different directory. Its first move was to copy it into its own folder and call it done.

After I stopped that, it did build the project. But it worked with almost no visibility into what it was doing and needed more corrections afterwards. It got there. It just wasn’t interested in having me involved.

OpenClaw

Painfully slow. Local models running in a restrictive VM, so no surprise there. It wasn’t particularly visible about what it was up to either. But it put together a workable solution without any fuss. No drama, no shortcuts, just got on with it.

Claude

Claude started with the MCP documentation server, which is what I expected. Then after a while it started reading the Python ADK module source code directly. Token-wasting, but you could see why. It wanted to understand the framework rather than trust the docs. I’ve done the same thing.

Where it differed from the others was how it worked with me. It walked through every step, let me review, question, tweak, or ask it to explain its approach before it moved on. I felt like I was part of the build rather than waiting for a delivery.

IBM Bob

Bob confirmed everything it needed was up and running before writing a single line of code. Methodical. It stopped occasionally to let me review, but the breaks felt more like checkpoints than conversations. The volume of code at each pause was too much to easily digest. I got the sense I was slowing it down rather than being consulted.

It also created detailed architecture documents off its own bat, which none of the others did. Bob was treating this like a project, not a task.

The Games Themselves

This is where it got interesting. Same instructions. Four different games.

OpenClaw and Codex each built a solid Cluedo text adventure. You could explore, talk to suspects, ask about your surroundings. Faithful to the brief, and it worked.

Claude did the same but added something I hadn’t asked for. The suspect agents would talk to each other in the background, and you could overhear their conversations. The suspects also didn’t just give up information when you asked – you had to work at it, press them, catch them out. It made the whole thing feel more alive.

Bob went a different direction entirely. Instead of suspects holding cards, it built all the clues into the mansion itself. You had to read letters, notes, memos scattered through the rooms. Some of those could be used to pressure suspects into talking. Others were direct hints – blood on a lamp, a torn envelope in a drawer. It felt less like a card game and more like an actual crime scene investigation.

All from the same instruction file.

So What?

This isn’t a benchmark. I’m not crowning a winner. But it showed me something I think matters. Give four systems the same brief and they don’t just write different code – they make different creative decisions. What to emphasise. What to add. How to interpret the problem.

They can all write code. That bit’s settled. The more useful question is whether you want to be part of the process or just see what comes out the other end.


If you want to test these tools yourself, give them something that isn’t a standard coding exercise. That’s where the differences show up.