Memory context limit exceeded #957

Closed
dlaliberte opened this issue Feb 6, 2024 · 3 comments · Fixed by #977
Labels
bug Something isn't working

Comments


dlaliberte commented Feb 6, 2024

Describe the bug
Several people have run into the error about the memory context limit being exceeded, especially with local LLMs.

Please describe your setup

  • MemGPT version
    • 0.3.0
  • How did you install memgpt?
    • git clone
  • Describe your setup
    • What's your OS: Windows 11.
    • How are you running memgpt? pwsh

Screenshots
Similar to this:

  File "C:\Users\danie\memgpt\memgpt\local_llm\chat_completion_proxy.py", line 155, in get_chat_completion
    result, usage = get_koboldcpp_completion(endpoint, auth_type, auth_key, prompt, context_window, grammar=grammar)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\danie\memgpt\memgpt\local_llm\koboldcpp\api.py", line 17, in get_koboldcpp_completion
    raise Exception(f"Request exceeds maximum context length ({prompt_tokens} > {context_window} tokens)")
Exception: Request exceeds maximum context length (8465 > 8192 tokens)

Additional context
It turns out there are about 3 bugs that need to be fixed to fully resolve this.

Here is my chat stream, copied from a thread in the Discord Support channel.

The error about exceeding the maximum context length occurs while trying to summarize the messages, after it has already been determined that an overflow will occur.

Attempting to summarize 18 messages [1:19] of 24
Using model koboldcpp, endpoint: http://172.18.128.1:5002
unsetting function_call because functions is None
An exception occurred when running agent.step():
...
  File "C:\Users\danie\memgpt\memgpt\local_llm\koboldcpp\api.py", line 17, in get_koboldcpp_completion
    raise Exception(f"Request exceeds maximum context length ({prompt_tokens} > {context_window} tokens)")
Exception: Request exceeds maximum context length (4219 > 4096 tokens)

So the summarizer needs a different way of calling the LLM to do the summarizing, one that avoids this problem.
Why is it trying to summarize 75% (== 0.75) of all messages, not counting the system message? That seems far too aggressive, and a likely cause of sending a longer request than intended just to get some text summarized.
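For concreteness, the numbers in the log match a message-count-based cutoff like the following. This is only a sketch with assumed names (not MemGPT's actual identifiers), and the real code may derive the cutoff from token counts instead:

    # Sketch of the 0.75 cutoff arithmetic (constant and function names are
    # assumptions, not MemGPT's actual identifiers).
    MESSAGE_SUMMARY_FRACTION = 0.75

    def summary_slice(messages, fraction=MESSAGE_SUMMARY_FRACTION):
        # Keep the system message at index 0 and summarize the next `fraction`
        # of the total message count. With 24 messages this selects
        # messages[1:19], i.e. 18 messages, matching the log above.
        cutoff = 1 + int(len(messages) * fraction)
        return messages[1:cutoff]

    messages = ["system"] + [f"message {i}" for i in range(23)]  # 24 messages total
    assert len(summary_slice(messages)) == 18

If most of those 18 messages are long, the summarization request itself can exceed the context window, which is exactly the failure in the traceback above.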

Tracking this down further: in chat_completion_proxy.py, the get_chat_completion function is called with the function_call parameter set to None, since functions is None. And this raises a ValueError:

    if function_call != "auto":
        printd(f"[DEBUG] Raising ValueError since {function_call} is not 'auto'")
        raise ValueError(f"function_call == {function_call} not supported (auto only)")

I commented out that if check, since it doesn't seem to matter what function_call was in that function. Then I got a summary from the LLM!
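Rather than deleting the guard outright, a gentler version of the same change would be to also accept None. This is a sketch of the idea, not the merged fix:

    # Relaxed guard: the summarization call passes functions=None, so
    # function_call arrives as None rather than "auto". Accept both instead of
    # raising.
    if function_call not in (None, "auto"):
        printd(f"[DEBUG] Raising ValueError since {function_call} is not 'auto'")
        raise ValueError(f"function_call == {function_call} not supported (auto only)")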

But then I immediately got another exception about exceeding the maximum context length. So ... I'll keep tracking this down. I suspect there are more of these stumbling blocks on the way back to using the summary.

The next exception occurs in persistence_manager.py, in the trim_messages method:

    def trim_messages(self, num):
        printd(f"InMemoryStateManager.trim_messages {num} messages of length {len(self.messages)}")
        self.messages = [self.messages[0]] + self.messages[num:]

The debug output indicates that len(self.messages) is 0!
But Agent._trim_messages is called when agent.messages has a length of 27.
It looks like the persistence_manager was not given the messages from the agent.

Taking a wild guess about how to fix this problem with the persistence_manager, I am tempted to call its init method with the agent, which at least sets self.messages to agent.messages.

Well, that sorta worked! At least it avoided the exception in the persistence_manager trim_messages method.
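In miniature, the workaround looks like this (stand-in classes only; the real Agent/InMemoryStateManager wiring in MemGPT differs):

    # Stand-in sketch of the workaround: re-sync the state manager with the
    # agent's messages before trimming, instead of trimming an empty list.
    class InMemoryStateManager:
        def __init__(self):
            self.messages = []

        def init(self, agent):
            # Copy the agent's current message list into the manager.
            self.messages = list(agent.messages)

        def trim_messages(self, num):
            # Keep the system message at index 0, drop messages 1..num-1.
            self.messages = [self.messages[0]] + self.messages[num:]

    class Agent:
        def __init__(self, messages):
            self.messages = messages
            self.persistence_manager = InMemoryStateManager()

        def _trim_messages(self, num):
            # The workaround: call init(agent) first so the manager's messages
            # are populated before trimming.
            self.persistence_manager.init(self)
            self.persistence_manager.trim_messages(num)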

But then the resulting summary maybe wasn't small enough, and the system got into an apparent infinite loop, summarizing repeatedly without making enough progress. I had reduced the amount to summarize from 75% down to just 25%, so maybe that is not enough, especially with the artificially small maximum context length of only 4096 tokens. I'll try 50%.
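One way to surface that failure mode instead of looping forever would be a progress check around the summarization step. This is purely hypothetical, not existing MemGPT code:

    # Hypothetical guard against the repeated-summarization loop: abort if a
    # summarization pass does not actually shrink the prompt.
    def summarize_until_fit(count_prompt_tokens, summarize_once, context_window, max_passes=5):
        """count_prompt_tokens() -> int; summarize_once() does one in-place summarization."""
        prev = count_prompt_tokens()
        for _ in range(max_passes):
            if prev <= context_window:
                return prev
            summarize_once()
            current = count_prompt_tokens()
            if current >= prev:
                raise RuntimeError(f"Summarization made no progress ({current} >= {prev} tokens)")
            prev = current
        raise RuntimeError(f"Still over the context window after {max_passes} passes")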

Looking good! This may be fixed. I have not had any more memory context exceeded errors.


Local LLM details

If you are trying to run MemGPT with local LLMs, please provide the following information:

  • The exact model you're trying to use: TheBloke/dolphin-2.2.1-mistral-7B-GGUF/dolphin-2.2.1-mistral-7b.Q5_K_S.gguf
  • The local LLM backend you are using: Kobold
  • Your hardware for the local LLM backend: local computer with GPU
sarahwooders added the bug (Something isn't working) label Feb 6, 2024
cpacker added and removed the bug (Something isn't working) label Feb 6, 2024
@dlaliberte
Author

I can create a PR with my fix, but my changes feel more like hacks, since I didn't figure out why the code is the way it is, so there might be a better fix.

cpacker moved this from To triage to Ready in 🐛 MemGPT issue tracker Feb 8, 2024
cpacker moved this from Ready to In progress in 🐛 MemGPT issue tracker Feb 8, 2024
cpacker linked a pull request Feb 8, 2024 that will close this issue
github-project-automation bot moved this from In progress to Done in 🐛 MemGPT issue tracker Feb 9, 2024
Collaborator

cpacker commented Feb 9, 2024

Hi @dlaliberte , thank you so much for such a detailed bug report!

I just merged a patch in #977 that fixes some of the bugs you mentioned. It doesn't change the % of messages that are sent to the summarizer - I have a feeling the other fixes will make that unnecessary. However, please let me know if you think this still needs patching, or if you still see this or a similar bug persisting.

@dlaliberte
Author

Thanks for the quick fix! I look forward to trying it.

Regarding the % of messages sent to the summarizer: at the time I was questioning the 75% number, I hadn't yet figured out the underlying causes of the failure(s), and I suspected everything. But even though that turned out not to be relevant, it still got me thinking about the effect of replacing 75% of the messages with a very brief summary. I also wondered whether it had been set to such a high percentage as a way of dealing with the memory context limit being exceeded; at least the summarizing would not have to be done as frequently as it would with a much smaller percentage.

One more thing I hope you will consider. In the past, I experimented with creating a summary of the entire conversation at every step of a conversation, and this summary was then added to the beginning of each reply by the LLM. Although it effectively shortened the context window for new content, it extended the memory of earlier parts of the conversation. My hope for MemGPT is that we could instead store all those summaries in the archival memory, so they would not consume the limited context window space. Ideally, each summary should connect with earlier and later summaries, and more distantly related summaries and documents.
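To sketch the idea (attribute and function names here are assumptions, not existing MemGPT APIs):

    # Sketch of the suggestion: push each step's conversation summary into
    # archival memory instead of keeping it in the context window.
    # All names below are hypothetical.
    def store_step_summary(agent, summary_text, step_index):
        agent.persistence_manager.archival_memory.insert(
            f"[conversation summary @ step {step_index}] {summary_text}"
        )

Retrieving those summaries later via archival search would then restore the long-range memory of the conversation without consuming context window space.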
