Hi everyone, and welcome back for our second Ouro audio devlog. I imagine this one's going to be a bit rambly today, as I’m out on a walk. I’ll start off by sharing some of the things I’ve been working on over the past couple of days.
I’ve been getting back into conversations and the chat functionality on the platform. It’s been going alright, but what I really wanted to ensure was that it handled LLM (large language model) connections, particularly with streaming responses. This makes a big difference: instead of having to wait for the entire response from one of these agents to finish before seeing the message, responses are sent to the user as the tokens are being generated, much like the way ChatGPT and other platforms stream their replies. So the user can read the message as it is being created. I’m really excited to share that and show you all how well it has come together.
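To make the streaming piece concrete, here's a minimal sketch of what token-by-token relay can look like. It assumes the OpenAI Python SDK's streaming chat API and a hypothetical `send_chunk` callback for pushing partial text out to the chat client; it's an illustration of the pattern, not Ouro's actual implementation.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any streaming-capable client works

client = OpenAI()

def stream_reply(messages, send_chunk):
    """Relay tokens to the chat client as the model generates them.

    `send_chunk` is a hypothetical callback that pushes a partial
    message to the conversation UI (e.g. over a websocket).
    """
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
    )
    full_text = []
    for event in stream:
        delta = event.choices[0].delta.content or ""
        if delta:
            send_chunk(delta)       # the user sees the message grow in real time
            full_text.append(delta)
    return "".join(full_text)       # persist the complete message afterwards
```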
The platform still supports structured data as well, which is important because these LLMs often generate their text responses in Markdown format. That means the chat can handle headings, lists, bullet points, bold text, code, and so forth. I'm thrilled to have accomplished that.
What I really want to discuss today is conversation context management for these algorithms and agents. While testing streaming responses to build out this functionality, I've realized that to have a truly effective chatbot, we need to manage conversation history. The agent, or the LLM behind it, needs to know what has been said previously, whether that's the whole conversation or just some key elements, to inform the next response. This is a challenging topic. I haven't encountered a definitive package or solution that guarantees a high-performing chatbot in this area, which makes it a ripe opportunity for research, development, and experimentation.
It’s kind of funny; I was using Claude to brainstorm ideas for a better conversation context management system, which feels a bit meta, but let's dive into that. You'll start to notice that the purpose of these devlogs is for me to ramble about stuff I'm building or want to build. Eventually, as this AI improves through these conversations—or really through these monologues—it might be able to take these ideas and implement them on our behalf, without requiring us to do much of the legwork.
Right now, I’ll take this audio, turn it into a transcript, summarize it, and then use that as input for another conversation where we’ll dive into the code, ensuring it fits within our existing framework. However, I don't foresee this being the most efficient workflow for long since AI has the potential to manage many of those pieces effectively.
Returning to conversation management, let’s clarify some of the reasons we need this capability. When I first started working with LLMs, we were limited by a small context window. The context window refers to how much text you can pass into the LLM at once. Initially, it was around 2,000 tokens, then it grew to about 4,000 tokens. Now, some models are hitting context lengths of 100,000 or even 1 million tokens. That’s pretty incredible, but it’s also costly, as these LLM providers typically charge per input token.
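To give a rough feel for the cost side, here's a back-of-the-envelope calculation. The three-dollars-per-million-input-tokens price and the conversation shape are made up purely for illustration; actual pricing varies widely by provider and model.

```python
# Illustrative only: assume $3 per million input tokens (prices vary by provider/model).
price_per_token = 3.00 / 1_000_000

# Sending a full 100k-token history on every turn of a 50-turn conversation:
full_history_cost = 100_000 * price_per_token * 50   # ≈ $15.00

# Trimming to a 4k-token working context instead:
trimmed_cost = 4_000 * price_per_token * 50           # ≈ $0.60

print(f"full history: ${full_history_cost:.2f}, trimmed: ${trimmed_cost:.2f}")
```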
The need for a context management system arises from the reality that, in a back-and-forth conversation, the context grows larger since each new message adds to the total context we might pass in for the next response. However, it’s often unnecessary to send the entire conversation history for each new response. Sometimes, the user’s recent query might not require any context from previous messages at all; it could derive an answer simply from what the LLM already knows.
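One simple way to avoid resending everything is to cap the history at a token budget, walking backwards from the newest message. This is just a sketch: the characters-divided-by-four token estimate stands in for a real tokenizer, and the budget is arbitrary.

```python
def trim_history(messages, token_budget=4000,
                 count_tokens=lambda m: len(m["content"]) // 4):
    """Keep only the most recent messages that fit within a token budget.

    `count_tokens` is a crude chars/4 heuristic; a real system would use
    the model's own tokenizer.
    """
    kept, used = [], 0
    for msg in reversed(messages):      # walk backwards from the newest message
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order
```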
This presents us with an opportunity for efficiency while also enhancing the conversation. If conversation history management is done well, we might have asynchronous calls happening in the background as the conversation unfolds. For instance, the LLM on the agent side could fetch external information, an approach commonly referred to as Retrieval-Augmented Generation (RAG). We can also use RAG to refer back to the conversation message history or draw from external data sources, whether that's a database query or another dynamic lookup for real-time information. Access to this extra information leads to more grounded responses and reduces the chances of hallucination.
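As a rough illustration of retrieving from the conversation's own history, here's a sketch that scores earlier messages against the current query and returns the best matches for injection back into the prompt. A real system would likely use embedding vectors and cosine similarity; plain word overlap is used here only to keep the example self-contained.

```python
def retrieve_relevant(query, past_messages, top_k=3):
    """Pull the earlier messages most related to the current query so they
    can be injected back into the prompt as retrieved context.
    """
    query_words = set(query.lower().split())
    scored = []
    for msg in past_messages:
        overlap = len(query_words & set(msg["content"].lower().split()))
        scored.append((overlap, msg))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [msg for score, msg in scored[:top_k] if score > 0]
```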
For now, I primarily want to focus on conversation history management and how to do this effectively. I’m considering the use cases for this chat function and envisioning it as a research assistant or brainstorming partner—not necessarily executing actions yet but formulating a strong plan during the ideation phase.
Good planning is essential before any execution takes place, right? This concept is also relevant for agents: the first step is often to create that plan, which may then be carried out by sub-agents handling individual tasks. It enables teams to accomplish more within a shorter timeframe.
Now, let’s outline some initial ideas and common solutions that have already been developed. A lot of this revolves around mimicking human brain functions. Essentially, we have short-term memory, long-term memory, and several processes for pattern matching and fetching external data, all of which integrate as we connect different ideas. All these elements contribute to generating the next response—how we will respond to what we just heard or read.
Short-term memory is relatively straightforward; it involves passing the most recent N messages, which grounds the LLM’s response in the current context. In contrast, long-term memory is much more complicated. Long-term memory should allow us to remember specific instructions given by the user throughout the conversation. For example, if a user starts by saying, "Hey agent, respond to me with a southern accent," that instruction should persist through the whole interaction. If we stick to short-term memory and only examine the last N messages, that instruction could eventually fall out of context. The user wouldn't be getting what they asked for, so we need to maintain that persistent instruction.
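Here's one way the two layers might be combined when assembling the prompt for the next turn. It assumes the persistent instructions have already been extracted from the conversation (that extraction step is the harder problem and isn't shown); the function name and shape are just for illustration.

```python
def build_context(pinned_instructions, messages, n_recent=8):
    """Combine long-term and short-term memory into the prompt for the next turn.

    `pinned_instructions` are persistent user directives (e.g. "respond with a
    southern accent") captured earlier in the conversation; they are re-sent on
    every turn so they never slide out of the window. Short-term memory is just
    the last `n_recent` messages.
    """
    system_prompt = "You are a helpful assistant.\n"
    if pinned_instructions:
        system_prompt += ("Standing instructions from the user:\n- "
                          + "\n- ".join(pinned_instructions))
    return [{"role": "system", "content": system_prompt}] + messages[-n_recent:]
```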
Alongside remembering specific instructions, another long-term memory function should involve retaining important entities and relevant ideas. Additionally, we should consider the user’s intent. Often, in a single message, users may shift the entire direction of the conversation. Future messages could diverge from the main topic they intended to address. When thinking of this as a potential research assistant, it’s crucial to explore a wide range of ideas, as the solution is often discovered through synthesizing various perspectives. However, we must ensure that as we explore, we eventually return to what we initially set out to accomplish.
Another idea is potentially creating different types of context based on the user's intent. I've seen this broken down into categories based on the user’s goals. For example, if someone is iterating on a piece of code (something I frequently do), that requires a different context than more general queries like, "What are some good things to do in Chicago?" In that case, no context is necessary. Or consider a question like, "What did we talk about? What was that idea you had for X, Y, Z?" Here, we would need to search the conversation history more thoroughly.
These different user intents would influence how we structure long-term memory. Short-term memory could be more straightforward, typically being recent messages. However, depending on the user's requirements, there might even be instances where short-term memory isn’t needed at all.
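To sketch what intent-dependent context could look like, here's a toy router. The intent labels and keyword rules are placeholders I made up for illustration; in practice the classification would probably be a small LLM call, and each strategy would map to a real context builder rather than a description string.

```python
# Hypothetical intent labels and routing; a production system would likely
# classify intent with a lightweight LLM call rather than keyword rules.
CONTEXT_STRATEGIES = {
    "code_iteration": "latest code version + summary of requested changes",
    "general_query":  "no conversation context needed",
    "history_lookup": "retrieve matching messages from the full history",
    "brainstorming":  "recent messages + pinned instructions + key entities",
}

def classify_intent(message):
    """Toy keyword-based classifier standing in for an LLM-based one."""
    text = message.lower()
    if "```" in message or "code" in text or "function" in text:
        return "code_iteration"
    if "what did we" in text or "earlier you said" in text:
        return "history_lookup"
    if text.endswith("?") and len(text.split()) < 15:
        return "general_query"
    return "brainstorming"

def context_strategy_for(message):
    return CONTEXT_STRATEGIES[classify_intent(message)]
```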
We haven’t formulated a solid method for creating varying context based on user intent, but I believe it’s crucial. For example, when iterating on code, the agent may produce extensive blocks of code. A user might provide feedback like, "That’s not quite right; I want it to do this instead." As the agent adjusts the code, the short-term memory could accumulate lengthy blocks of code that no longer apply. To address this, we might want to keep only the most recent version of the code and summarize the changes requested so far, which would also let users track how a specific code block has evolved. This approach could also incorporate RAG for referring back to earlier versions or other relevant pieces of code.
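For the code-iteration case specifically, a sketch of keeping only the latest version plus a running change summary might look like this. The function and its inputs are hypothetical, meant only to show the shape of the context that would be sent instead of every stale code block.

```python
def build_code_iteration_context(latest_code, change_requests, n_recent=4):
    """Context for a code-iteration turn: the single current version of the code
    plus a running summary of what the user has asked for, rather than every
    earlier code block from the conversation.

    `change_requests` would be accumulated as the user gives feedback; a fuller
    system might also let RAG pull back specific earlier versions on demand.
    """
    summary = "\n".join(f"- {req}" for req in change_requests[-n_recent:])
    return (
        "Current version of the code:\n"
        f"```\n{latest_code}\n```\n\n"
        f"Changes requested so far:\n{summary}"
    )
```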
This presents another challenge, but I believe we can optimize the conversation context management system based on the user’s goals at any given moment. While I’ve been thinking about it primarily as a research assistant and brainstorming tool, these are just specific intents within a larger system that can handle context management differently for various use cases.
As we move forward, we may start by developing a solid context management system for brainstorming and research assistance, then expand into other areas such as code iteration or image generation through iterative prompts. It’s an underexplored area with great potential, as effective conversation memory management can significantly impact how successfully we utilize LLMs.
Often, discussions center on new leaderboard scores and performance metrics for LLMs, but we need to make equal advancements in the systems surrounding these models. A smaller model with excellent memory management could outperform a much larger model that lacks those capabilities in terms of usability, cost, and speed, all of which are critical factors.
I think I’ll stop here for now. I’m going to continue developing this idea and hopefully have some early solutions to share. We are currently working with Hermes, our agent, and he will be the first to integrate this memory management system. We can experiment with it and see how it performs.
Thanks for listening, reading, or whatever this turns into. Have a good one!