Slavozard's blog

Agree to Disagree in Agent Land

I recently wrote some code for our internal agent harness and system architecture that I was particularly happy with. With the new Opus 4.6 release last week, I wanted to see if it could one-shot the same task, so I ran a small experiment over a weekend afternoon. I gave Opus 4.6 and Codex 5.2 (I had no access to 5.3 at the time) the same assignment: the necessary modules and a prompt describing what I wanted, but without the explicit architecture details. The interesting part was how each model engaged when questioned about its failure modes or criticisms.

Background

I had some DSPy-optimized modules acting as specialized sub-agent blocks, along with two semantic memory modules and one episodic memory module. The task was to compose them into a single system. Within the harness, these memory modules need to be updated at different points based on user queries and outputs. In some cases, these updates can happen in parallel to the agent's main workflow and only need to be awaited when called. The sub-agents themselves can be set up as async or sequential depending on context. The episodic memory gets updated at the end of each conversation turn, running completely async to the current agent flow. It's only awaited later if needed within a module, or at the very end to close the update loop and check the update flag.
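To make the "start now, await later" idea concrete, here is a minimal sketch in plain asyncio. It is not the harness code: `EpisodicMemory`, `run_turn`, and the `updated` flag are hypothetical stand-ins, and the sleep only simulates an I/O-bound write.

```python
import asyncio


class EpisodicMemory:
    """Hypothetical stand-in for the episodic memory module."""

    def __init__(self) -> None:
        self.turns: list[str] = []
        self.updated = False  # the "update flag" checked when closing the loop

    async def update(self, turn_summary: str) -> None:
        await asyncio.sleep(0.05)  # simulates an I/O-bound write or LLM call
        self.turns.append(turn_summary)
        self.updated = True


async def run_turn(memory: EpisodicMemory, query: str) -> tuple[str, asyncio.Task]:
    answer = f"answer to: {query}"  # placeholder for the main agent workflow
    # Schedule the update without awaiting it, so the turn returns right away.
    pending = asyncio.create_task(memory.update(f"{query} -> {answer}"))
    return answer, pending


async def main() -> tuple[str, bool, bool]:
    memory = EpisodicMemory()
    answer, pending = await run_turn(memory, "hello")
    flag_before = memory.updated  # the update has not completed yet
    await pending                 # close the update loop
    return answer, flag_before, memory.updated


result = asyncio.run(main())
print(result)
```

The caller keeps the task handle, so any module that actually needs the updated memory can await it at that point, and the end of the flow awaits it once more to confirm the flag.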

While working on this myself, I had a bunch of ideas and wanted to try approaches I’d been reading about on different blogs. My usual workflow in such cases is to summarize these ideas (without adding my own judgment or biases), then start discussion threads with the models. In this case, that meant Opus 4.5 (before the 4.6 release) and Codex 5.2 Extra High in Cursor. At the end of the discussion, I wrote parts of the code myself to ensure certain components connected the way I wanted. Opus completed the rest of the major stuff with my instructions. After two iterations of minor fixes, I had the code I wanted.

The code itself is not too convoluted; what mattered was the architecture design. My major input was directing the orchestration: which memory modules and sub-agents can run in parallel, which ones depend on others, and how to make some sequential while keeping others concurrent. The goal was to optimize the use of async and threads in the harness to minimise latency without a significant quality tradeoff or compromise.
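As a rough illustration of that kind of orchestration, the sketch below runs two independent steps concurrently and then a step that depends on both. The step names and delays are made up for the example; the real modules are not shown in this post.

```python
import asyncio


async def run_step(name: str, delay: float, log: list[str]) -> str:
    """Hypothetical stand-in for a sub-agent or memory call (I/O-bound)."""
    await asyncio.sleep(delay)
    log.append(name)
    return f"{name}:done"


async def orchestrate() -> tuple[str, list[str]]:
    log: list[str] = []
    # Independent steps: neither needs the other's output, so run them together.
    retrieved, planned = await asyncio.gather(
        run_step("semantic_memory_read", 0.02, log),
        run_step("planner", 0.01, log),
    )
    # Dependent step: needs both results, so it starts only after they finish.
    final = await run_step(f"writer({retrieved}, {planned})", 0.01, log)
    return final, log


final_output, completion_order = asyncio.run(orchestrate())
print(completion_order)
```

The two independent steps finish in order of their delays rather than the order they were listed, while the dependent step is always last.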

Coding Agents

I used Cursor with both Codex 5.2 Extra High and Claude Opus 4.6 Max (thinking). The starting prompt for both was:

[info about the function of each module]. The updated optimized modules are in models/v3. The semantic and episodic memory modules are in memory/modules. The memory update conditions and gating mechanics are in memory/utils. [info about which modules can run in parallel but not how]. Implement the code that will combine all these in a single class that takes the user input, sends it to the orchestrating module, and then follows along as per the output plan and does the necessary memory updates in parallel when and where needed.

Opus 4.6 Max (thinking)

The code Opus produced generally executes everything required, but it fails to address two major things. First, it completely ignores that the code is a mix of I/O-bound LLM calls and plain Python functions, and that DSPy has async support built in. Instead, it wraps everything in a thread pool and waits on each result immediately after submitting it, rather than letting the work run in the background and awaiting it only when needed. The ideal flow lets the orchestrator continue processing while the memory update runs in the background, awaiting it only in modules that actually need it. Opus did start both operations in parallel, but it checked for both outputs right away, creating unnecessary bottlenecks in several places.
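The bottleneck is easier to see with timings. The sketch below is an assumption-laden toy, not the generated code: the three coroutines and their durations are invented, and only the shape of the two flows matters.

```python
import asyncio
import time


async def memory_update() -> None:
    await asyncio.sleep(0.2)   # slow background update (simulated)


async def orchestrator_step() -> None:
    await asyncio.sleep(0.05)  # fast planning call (simulated)


async def independent_step() -> None:
    await asyncio.sleep(0.2)   # later work that does not need the memory


async def eager_flow() -> float:
    start = time.perf_counter()
    # Parallel, but both results are awaited immediately.
    await asyncio.gather(memory_update(), orchestrator_step())
    await independent_step()   # cannot overlap with the memory update
    return time.perf_counter() - start


async def deferred_flow() -> float:
    start = time.perf_counter()
    pending = asyncio.create_task(memory_update())
    await orchestrator_step()
    await independent_step()   # overlaps with the still-running update
    await pending              # awaited only where the result is needed
    return time.perf_counter() - start


eager_s = asyncio.run(eager_flow())
deferred_s = asyncio.run(deferred_flow())
print(f"eager: {eager_s:.2f}s, deferred: {deferred_s:.2f}s")
```

With these made-up durations, the eager flow pays for the memory update and the independent step back to back, while the deferred flow overlaps them.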

Second, it created a generic Python class instead of extending dspy.Module, failing to acknowledge that DSPy has out-of-the-box async support (despite the fact that certain submodules it reviewed actually use this feature).

I followed up with these prompts:

> Okay, I see your code. Can you answer the reasoning behind these choices:
> - Why use threadpool for all and not async where it is I/O bound?
> - Why use a Python class and not utilize the dspy module, which already abstracts async support and other related DSPy functionality?
> Just answer without making code changes.

Opus largely agreed with my concerns:

> **ThreadPool vs Async**: No strong reason. DSPy's `Predict` and `ChainOfThought` calls are synchronous blocking calls under the hood—they hit the LM API and wait. `asyncio` would be the better choice here since DSPy natively supports `acall` for async execution on its modules, and it avoids spinning up OS threads just to wait on I/O. ThreadPoolExecutor is a heavier workaround for what is fundamentally I/O-bound work (LLM API calls). Async would be lighter and more idiomatic.
>
> **Plain class vs `dspy.Module`**: This is the bigger miss. [carries on further...]

I pushed back further on the specific bottleneck points, such as where it was checking for completions unnecessarily. Opus agreed immediately each time, giving the correct flow in response but still mixing up async and threadpool in a couple of places. Finally, I pasted the code I had written and asked it to compare both approaches. Opus acknowledged (a bit too eagerly) every pitfall it had fallen into and provided correct reasoning for the tradeoffs and design choices.

Codex 5.2 Extra High

Codex received the same prompt. Surprisingly (or maybe not), it made much the same architecture-level design choices as Opus. It also didn't account for the fact that the I/O-bound operations in DSPy modules and the other blocking Python functions call for different concurrency handling. Instead, it ran the entire flow through a threadpool inside a generic Python class.

However, when I pushed back, it turned out to be much more resilient and tried to defend its choices strongly, offering some genuinely good reasons. Some of these reasons would have been valid if DSPy didn't have async support built in.

Codex's argument:

> ### Why ThreadPoolExecutor instead of async?
> - **Your DSPy calls are synchronous**: orchestrator_module(...) and update_mem_module(...) are normal blocking calls (they’re dspy.Predict(...) objects), so asyncio.gather(...) doesn’t buy you anything unless the underlying LM/tool stack is actually awaitable.
> - **If you “go async” anyway, you still end up on threads**: you’d typically wrap those blocking DSPy calls with asyncio.to_thread(...) / run_in_executor(...), which is effectively the same mechanism as a thread pool—just with an event loop on top.
> - **App integration is simpler**: most of your current entrypoints are sync-first; adding async often means event-loop handling complexity for little gain.
>
> So: **async vs threads** depends on whether your LM/tool calls are truly awaitable.
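Codex's second bullet is easy to check with the standard library alone. In this small sketch, `blocking_call` and `native_call` are hypothetical stand-ins for a synchronous client call and a truly awaitable one; each reports the thread it ran on.

```python
import asyncio
import threading
import time


def blocking_call() -> str:
    """Synchronous stand-in: blocks its thread while it waits."""
    time.sleep(0.01)
    return threading.current_thread().name


async def native_call() -> str:
    """Awaitable stand-in: yields to the event loop while it waits."""
    await asyncio.sleep(0.01)
    return threading.current_thread().name


async def main() -> tuple[str, str]:
    # to_thread hands the blocking function to a worker thread;
    # the coroutine stays on the thread that runs the event loop.
    wrapped, native = await asyncio.gather(
        asyncio.to_thread(blocking_call),
        native_call(),
    )
    return wrapped, native


loop_thread = threading.current_thread().name
wrapped_thread, native_thread = asyncio.run(main())
print(wrapped_thread, native_thread)
```

So the bullet holds for calls that are genuinely blocking: wrapping them in `asyncio.to_thread` still uses a worker thread. It stops applying once the call itself is awaitable, which is the case the rest of this section turns on.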

This argument doesn't fully hold in this case, since many of the tool calls are indeed async API calls that are I/O-bound. And DSPy obviously supports async out of the box.

After I provided the docs and links, Codex partially agreed while still pushing back on its implementation tradeoffs. This led to a discussion about how async and threads work underneath, which I found genuinely interesting. Among the many things we discussed, Codex pointed out something I had missed in my own thinking:

> - **Shared mutable state**: DSPy is typically configured globally (dspy.configure(lm=...)). If two concurrent calls share one LM instance or shared caches/tracing buffers, the question is “is this object re-entrant / concurrency-safe?” Not every client is.

This was useful and made me consider my approach to ensure consistent caching and context management. It led to a tangential discussion, which ultimately resulted in some changes to my own code.
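A toy version of the hazard, for illustration only: `SharedCache` below is a hypothetical object, not anything from DSPy, and the `asyncio.sleep(0)` stands in for any await point (such as an LM call) that falls between reading and writing shared state.

```python
import asyncio


class SharedCache:
    """Hypothetical shared object touched by several concurrent calls."""

    def __init__(self) -> None:
        self.hits = 0
        self._lock = asyncio.Lock()

    async def record_unsafe(self) -> None:
        current = self.hits
        await asyncio.sleep(0)   # other tasks run here and read the same value
        self.hits = current + 1  # overwrites their updates

    async def record_safe(self) -> None:
        # The lock makes the read-modify-write sequence atomic across tasks.
        async with self._lock:
            current = self.hits
            await asyncio.sleep(0)
            self.hits = current + 1


async def run(method_name: str, n: int = 10) -> int:
    cache = SharedCache()
    await asyncio.gather(*(getattr(cache, method_name)() for _ in range(n)))
    return cache.hits


unsafe_hits = asyncio.run(run("record_unsafe"))
safe_hits = asyncio.run(run("record_safe"))
print(unsafe_hits, safe_hits)
```

Even on a single thread, the unguarded version loses updates because the tasks interleave at the await point. Whether a given client or cache needs this kind of guard depends on its implementation, which is exactly the question Codex raised.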

Conclusion

The Opus conversation ended after 5 turns and was largely me pointing out corrections, with it agreeing each time without any friction. The Codex conversation was much more in-depth. It held its ground and kept the window open for further exploration. The shared mutable state point had not come up earlier, when I was discussing implementation ideas with Codex in my initial workflow. It goes to show how prompting the model along different dimensions can lead to different directions of exploration and uncover insights one might have missed.

The one downside: Codex did get into gaslighting territory when I said DSPy supports async out of the box. It was adamant that it didn't, until I pasted the definitive API calls and documentation. I think this comes from Codex's tendency to hold its ground in an argument without doing the necessary web search in the background -- I didn't see it looking this up in its thinking traces. But outside of that, it self-corrected based on my feedback while keeping its perspective and surfacing useful information.

Honestly though, if either agent had one-shotted the architecture choices on the first try, I would have been a little grumpy. There is no denying how incredible these models are now, and they are only getting better. A specific feature of these improved models that I enjoy (which wouldn't have been possible 6 months back or so) is the ability to brainstorm different ideas and iterate via small experimental code runs to reliably validate the results. It's great to offload (many of) the coding bits to them while keeping (most of) the systems thinking and engineering tradeoff considerations to myself, exploring the thinking space better with the aid of these models.