Slavozard's blog

We can marry agent harness components

While building compound agentic systems, some of the design dimensions that we consider optimizing are memory, context management (what the systems sees vs does not) and the accumulating history over longer horizons. These choices have direct downstream qualitative and quantitative effects like improving the agent's action outputs and the token/cache economics respectively. In this blog, I share the argument that rather than treating these design axes orthogonally, we should optimize them jointly by making strong opinionated choices informed by the application space. This becomes particularly effective while building scoped enterprise agentic flows, where constraints are known in advance and can be leveraged. As such, I motivate these choices based on practical use-case bottlenecks we encountered and share some implementation details on how we can marry these design components to each other.

The implementation syntax is build on top of DSPy, however the arguments are framework agnostic and holds for any compound agentic systems. For more detailed view into the DSPy based enterprise agentic setup, I wrote about this a few months back here.

marrying state history and cache design

The working mental model of history in an llm system is an accumulation of user input and model output over each turn, appended in order. In practice, to maximize the cache hit ratio, we compose each new call with a stable prefix of system prompt, prior input-output pairs, followed by the new user query. This changes when we move on from a single llm call to a compound system chaining many calls together. In compound agentic systems, modules wrap around such one/multiple llm calls. Consider a task-oriented dialogue (TOD) system -- a common enterprise architecture comprising a manager/router and multiple domain-specific expert modules, with a final synthesizer presenting the consolidated output to the user. It follows the control flow of manager parsing user input and routing to one or more domain experts, who then make tool calls and provides an output and the synthesizer consolidates all these to respond. Maintaining per-module append-only history here creates inter-module information boundaries that can qualitatively degrade the whole system. The router watches previous user inputs and its own routing outputs but has no visibility into what expert modules actually did, what tools they called, or what they surfaced. As such, say at turn-N, the routing decision is blind to what happened inside the experts at turns zero to N-1 and if/how the previous queries were resolved. This context level knowledge gap can affect the expert routing. Likewise, domain experts see their own past inputs and outputs but have no view of what other modules contributed, which affects how they parse new queries and which tools they reach for. The synthesizer similarly lacks the connected context of the full journey.

An easy vanilla fix to this would be to dump the entire trajectory for each call/module into a single shared artifact passed to all modules. Although this can lead to better context awareness among modules, it will naturally blow up the context window for all those modules leading to context rot. After a few turns, the system degrades both quantitatively and qualitatively. Per module token consumption also increases unbounded.

We landed on a structured compromise between these two failure modes, specially for opinionated enterprise agents. The goal here is not to maximize context visibility or cache reuse independently, but to find a middle ground where modules remain aware of broader system activity without carrying the full execution trajectory. Rather than per-module history or a single shared trajectory list, we can maintain a single shared history artifact with constrained, standardized contents. For the N-th turn, the shared history is:

All modules now use this artifact as their history enabling a bird's-eye view of the whole system. The router sees which tools were invoked by which modules in prior turns, experts see the lineage of inter-module activity and how it connected to the final answer and the synthesizer has the full chronology of agent activity. The shared format is also stable across each modules and turns, which recovers the cache prefix consistency. The heavy artifacts like tool results and API responses, are offloaded to the memory and context management system. The only caveat will be that the cache hit ratio percentage would take a slight hit from had we maintained module specific stable prefix, as we are discarding some of the input tokens like tool results.

On the implementation side, DSPy's default history follows a pattern suitable for the per module design. We override this by subclassing the ChatAdapter from DSPy adapters and reimplementing format_conversation_history to produce the structured format above. The resulting history object is maintained as an append-only record across all turns and assigned as a shared input field on every module signature. The signature structure standardizes the order mentioned before. We jointly optimize the full module stack with GEPA, enforcing the shared history as a fixed structural input. Modules learn to use the tool call lineage and prior system outputs to ground their current-turn decisions. In practice, we observed that module outputs improved measurably when each module had an awareness of the broader system it operates within and their specific roles.

blog_diagram.jpg

A uniform, lightweight history prefix is shared across all active runtime modules, while heavy artifacts like tool payloads are completely decoupled and offloaded to episodic/semantic memory layers.

marrying context management to memory design

In the past, developers would think of memory as an add on artifact to the agent design space. It would be an orthogonal optimization surface, bolted onto any harness. I think the community is now moving on from that idea and converging on optimizing them jointly, more so for opinionated enterprise agents with well scoped action space.

For this discussion, I am following the memory taxonomy as semantic memory for persistent user-level knowledge and episodic memory for evolving runtime state. In compound systems, how they manage the active context at runtime can infer what memory architecture makes sense. Building on the previous section's context design, the shared history is deliberately kept lean by excluding the heavy session artifacts. These tool results, api payloads, search reasoning and others are explicitly offloaded into the memory system. The coupling is the key takeaway here. The memory architecture is not designed from scratch as an add-on, rather derived from how the context state is structured.

The episodic memory covers session-level artifacts like full tool results, retrieval traces, detailed reasoning. In practice, we write them to an append-only log stored per user, module and session. The shared history carries only the name and parameters of each tool call, preserving chronology without the payload. When a module needs more than the history provides, it queries this artifact store directly. The model learns when and why to do this through GEPA optimization, so the lookup behavior is ingrained in the prompt space. This keeps the runtime context lean with predictable growth and extra context available on demand.

Semantic memory covers persistent user-level knowledge like preferences, profile specifics, task-relevant patterns that accumulate across sessions. The choice of retrieval mechanism is again conditioned on how the context evolves in the modular system. For enterprise cases, where most context is structured and module-scoped, a typed artifact can be sufficient and cheaper. Based on the system needs, we can also replace the memory storage architecture to something more complicated. In our applications, dedicated background module refined this asynchronously each turn, extracting predefined fields from user queries and building an evolving typed artifact. Domain expert modules can selectively invoke this module when parsing new user inputs. Like before, the scope and need of these lookups are optimized via GEPA.

Takeaway: For compound multi-agent systems, treating memory, state history and context management as orthogonal design axes can be a recipe for broken context management and inter-module information boundary. Decisions made in one dimension can constrain others. Instead, we should aim to optimize architectures that explicitly couple these design choices and are opinionated, tightly bound informed by the agent application boundaries. This can lead to simpler architectures, better context utilization, more predictable scaling behavior, and improved downstream agent performance.


P.S. if anyone reading this is using Gemini models, be aware of the inconsistent cache hit rate of Gemini implicit caching and there needs to be workarounds to deal with that. For more details check these issue, issue, and a quick ai script to test it here

In DSPy, while working with custom adapters and streaming enabled, you will need to edit the 'streaming.py' file as it has a hard coded assertions that fails for any other adapter than the ones already part of the library. There is a an open issue for this here.