Continual learning, memory, and the context problem
An analysis and chain of thought
As development progresses in LLMs, one problem remains constant throughout. That problem, in short, can be defined as the context limit. This fundamental limitation creates a domino effect, leading to new problems as LLMs get adopted into the application layer. Sure, some SOTA models have context lengths of up to 1 million tokens, and though that sounds lucrative, one question continuously emerges: does the model generalize well and attend to all these tokens properly when that 1 million-token context is completely filled?
The needle-in-a-haystack benchmark attempts to address this, but in 2025 we have seen the rise of the term benchmaxxing, where providers train models to perform best on benchmarks. As soon as real-world usage arrives, however, we see these models fail. This is what happened with LLaMA-4, as Sebastian Raschka mentioned in his 2025 review.
So then, one must ask: what is the solution? The simplest answer that usually emerges is RAG. Over the past years, RAG has been adapted with various modifications to suit enterprise use cases…whether the data is long and static or continuously growing, and whether it is structured or unstructured in nature. But this comes at the cost of infrastructure management. It involves careful implementation of the retrieval layer, often with hybrid search on vector databases and a re-ranking process.
At their core, these systems involve search algorithms such as keyword search and BM25, mixed with vector search, which itself is a combination of ANN (Approximate Nearest Neighbour) methods and cosine similarity. This is further aided by use-case-specific filters that developers choose during data analysis.
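As a rough illustration (not any particular vector database's API), here is a minimal sketch of that hybrid scoring idea: a toy lexical score standing in for BM25, fused with cosine similarity over embeddings. A production system would use a real BM25 index, ANN search, and a re-ranker on top.

```python
import numpy as np

def keyword_score(query: str, doc: str) -> float:
    # toy lexical score: fraction of query terms present in the document
    # (a real system would use BM25 over an inverted index)
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_search(query, query_emb, docs, doc_embs, alpha=0.5, k=3):
    """Blend lexical and semantic scores; a real pipeline adds ANN search,
    metadata filters, and a re-ranking model on top of this fusion step."""
    scores = [
        alpha * keyword_score(query, doc) + (1 - alpha) * cosine(query_emb, emb)
        for doc, emb in zip(docs, doc_embs)
    ]
    ranked = np.argsort(scores)[::-1][:k]
    return [(docs[i], scores[i]) for i in ranked]
```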
But thinking from a more general perspective, the problem at heart is what is known as continual learning. At the end of the day, we want the model to work, answer, or do something within a particular context, and as time passes, we want it to become familiar with us…something like a personalized assistant that understands, remembers, and does what it is asked, at the very least.
So what does continual learning mean? As far as I understand, continual learning is about making models adapt and learn from new data incrementally without forgetting what they already know. From first principles, one might say this means updating the weights of the model over time by taking new data into account. It sounds simple until one thinks about it sequentially.
Let’s take a small example. Suppose we have a 1 billion-parameter model. After a month of usage, you have interacted with it many times, and all that data amounts to, say, 1 million tokens. Now suppose this data is properly arranged and we fine-tune the model. That would mean updating 1 billion parameters, which is not compute-efficient. So the obvious choice becomes parameter-efficient fine-tuning, which is where techniques like LoRA and QLoRA come in.
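As a hedged sketch of what setting up such a parameter-efficient fine-tune might look like (assuming the Hugging Face peft library; the model name and target module names are placeholders that depend on the architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

# "some-1b-base-model" is a placeholder; substitute any causal LM checkpoint
base = AutoModelForCausalLM.from_pretrained("some-1b-base-model")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                  # rank of the low-rank decomposition matrices
    lora_alpha=16,        # scaling factor applied to the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```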
Now, instead of fine-tuning all 1 billion parameters, you fine-tune a tiny percentage of them based on the LoRA configuration, which involves the rank used to form the decomposition matrices. LoRA produces adapter weights that must be merged with the base 1 billion-parameter model to access the learned knowledge. But suppose you want this to be personalized and the model resides in the cloud; then you have to merge these weights only at inference time, so as not to pollute the base weights with multiple users’ learned data.
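To make the merge concrete, here is a tiny numpy sketch of what the LoRA update actually is: a low-rank delta scaled by alpha/r, which you can either bake into the base weights permanently or keep separate and apply per user at request time.

```python
import numpy as np

d, r, alpha = 512, 8, 16
W = np.random.randn(d, d)            # frozen base weight matrix
A = np.random.randn(r, d) * 0.01     # LoRA "down" matrix (trained)
B = np.zeros((d, r))                 # LoRA "up" matrix (trained, initialized to zero)

delta = (alpha / r) * (B @ A)        # low-rank update learned during fine-tuning

# option 1: merge permanently (pollutes the base weights for every user)
W_merged = W + delta

# option 2: keep W clean and apply a user's delta only for their requests
def forward(x: np.ndarray, user_delta: np.ndarray) -> np.ndarray:
    return x @ (W + user_delta).T
```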
This means you need to maintain infrastructure to store and manage all these weight files, which becomes a hefty task. Now suppose you are not personalizing and instead have one general model where you mix all the data. Even then, every month (or whatever period you choose) you would be performing a new LoRA fine-tune and merge. Scaled to 600 billion or 1 trillion-parameter models, this becomes a gigantic task both in terms of compute and training complexity.
And even if you somehow handle that, there comes a fundamental bottleneck: does this training result in 100 percent retention? If not, over time the accumulation of losses could cause the model to forget older knowledge or increase the likelihood of hallucinations. These are the trade-offs one must understand before going down this path, even though there will always be engineers smart enough to mitigate parts of the problem.
Returning to retrieval-based memory systems, I found some solutions, namely Mem0 and MemVid. They try to solve the problem infrastructurally, i.e., by adding an external layer to aid the model or system.
Mem0 creates a lightweight memory layer that stores past interactions and retrieves them based on semantic similarity. The idea is straightforward: extract salient memories, embed them, and retrieve them when relevant. This works reasonably well when memory size is moderate and interaction patterns are repetitive. However, as memory grows, the system becomes increasingly sensitive to retrieval quality. If the wrong memory is surfaced, the model’s response quality degrades sharply. In that sense, Mem0 inherits all the fragility of embedding-based retrieval without fundamentally changing the paradigm.
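A minimal sketch of that extract-embed-retrieve loop (this is my simplification, not Mem0's actual API; `embed` here is a stand-in for a real embedding model):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; in practice, call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

class MemoryStore:
    def __init__(self):
        self.memories: list[tuple[str, np.ndarray]] = []

    def add(self, text: str):
        # a real system adds a "salience" step deciding what is worth keeping
        self.memories.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scored = [
            (np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)), t)
            for t, e in self.memories
        ]
        return [t for _, t in sorted(scored, reverse=True)[:k]]
```

Everything downstream hinges on that `retrieve` call surfacing the right memory, which is exactly the fragility described above.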
MemVid takes a different infrastructure approach by packaging all memory (embeddings, metadata, search structures) into a single serverless file, inspired by video encoding formats. Instead of running database servers, it treats memory like a video file - append-only frames with efficient seeking and compression. However, it also shifts complexity elsewhere…alignment between modalities, increased storage costs, and heavier pre-processing pipelines. While promising for certain applications, it does not generalize easily to all personalized agentic systems.
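To make the single-file, append-only idea concrete, here is a toy sketch (not MemVid's actual encoding, which uses video-style frames and compression): JSON-line "frames" appended to one file, with an in-memory byte-offset index for seeking.

```python
import json

class FrameFile:
    """Append-only memory file: each 'frame' is one JSON line; an offset index
    lets us seek straight to a frame without scanning the whole file."""

    def __init__(self, path: str):
        self.path = path
        self.offsets: list[int] = []

    def append(self, record: dict) -> int:
        line = (json.dumps(record) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()          # byte offset where this frame starts
            f.write(line)
        self.offsets.append(offset)
        return len(self.offsets) - 1   # frame id

    def read(self, frame_id: int) -> dict:
        with open(self.path, "rb") as f:
            f.seek(self.offsets[frame_id])
            return json.loads(f.readline().decode("utf-8"))
```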
All these solutions tell us that external memory is only as good as retrieval. None of these escape the fundamental constraint that the model itself does not internalize memory; it merely consumes retrieved artifacts.
Adding to this line of work, there is a concept by Alex Zhang called Recursive Language Models (RLM)... which is not about making the model internalize memory via weight updates, but about giving it a programmable playground where the massive context lives entirely outside the prompt. The model writes code to access it, performs recursive searches programmatically (using regex, keyword/string operations, slicing, filtering, or spawning sub-instances of itself on extracted chunks), then feeds the outputs back into its own next prompt and iterates back and forth until it is confident enough to output the final answer.
RLM keeps the raw, full context untouched in an external REPL environment (a persistent Python sandbox), where the model interacts with it like a programmer would: decompose, inspect, verify, aggregate, step by step. It’s a pure inference-time technique: no weight updates, yet the model performs the task dramatically better than when you naively stuff the entire context into its prompt window.
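A stripped-down sketch of that loop (my simplification, not the paper's implementation: `llm` is any hypothetical prompt-to-text function, and the "sandbox" here is a bare `exec` with captured stdout, which a real system would replace with an isolated REPL plus timeouts):

```python
import io
import re
import contextlib
from typing import Callable

def recursive_lm(question: str, long_context: str, llm: Callable[[str], str],
                 max_steps: int = 8) -> str:
    """Minimal RLM-style loop: the full context never enters the prompt;
    it lives in the REPL namespace and the model writes code against it."""
    namespace = {"context": long_context, "re": re}   # persistent across steps
    history = ""
    for _ in range(max_steps):
        reply = llm(
            f"Question: {question}\n"
            f"A variable `context` holds the full document ({len(long_context)} chars).\n"
            f"Previous steps:\n{history}\n"
            "Write Python that prints what you need from `context` "
            "(regex, slicing, filtering), or reply with FINAL: <answer>."
        )
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):         # real systems need timeouts/sandboxing
            exec(reply, namespace)
        history += f"\n>>> {reply}\n{buf.getvalue()[:2000]}"   # truncate tool output
    return "no answer within the step budget"
```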
The paper shows strong empirical wins across diverse long-context tasks that punish “context rot” (diluted attention, lost-in-the-middle effects). On OOLONG (trec_coarse split, 132k–263k token contexts with distributional queries requiring aggregation across thousands of entries), RLM using GPT-5-mini outperformed base GPT-5 by over 34 points (~114% relative increase), while staying nearly as cheap or cheaper per query (median costs lower due to selective processing avoiding full-context ingestion every time). On a synthetic OOLONG-Pairs variant (pairwise aggregation, quadratic scaling), F1 jumped from near-zero in baselines to ~58–60%. On LongBench-v2 CodeQA (multi-choice codebase reasoning over fixed file sets), RLMs with frontier models like GPT-5 or Qwen3-Coder showed sharp gains over direct prompting or common scaffolds. Ablations confirm recursion is crucial: dropping sub-LM calls degraded performance by ~10% on many tasks, as self-calls enable semantic labeling, verification, and dense access that single-pass code can’t match.
Recursion introduces latency from multiple inference steps (though parallelizable sub-calls help, and average costs often drop). It leans heavily on the root model being a strong code-writer (weaker models struggle more). The potential for bad code, infinite loops, or sandbox escapes needs careful timeouts and restrictions. Still, it feels more structured for long contexts. But in the end, the model does not get better in its vanilla form; it just becomes smarter about using long contexts to achieve the task at hand.
For all these great engineering ideas to maximize the usefulness of LLMs, the base model itself has not improved, so the question of continual learning remains unanswered by them.
In December 2025, Google released a paper called Nested Learning: The Illusion of Deep Learning Architectures, which attempts to solve the problem by taking inspiration from how memory develops in the human brain over time, i.e., the idea of short-term and long-term memory.
As the name suggests, it attempts to uncover an illusion about neural networks that we have held for years. The paper tells us that forward propagation and backward propagation through optimizers like Adam, AdaGrad, and Muon are not separate processes. Instead, they form a complicated nested structure of components, each updating against its own local surprise signal, just as we update short-term memory and it slowly shifts into long-term memory. In this framework, optimizers become associative memories, i.e., they are learning how the model learns and updating their own learning algorithm as time passes. Think of it like a Russian doll structure, but instead of static nested dolls, imagine each doll can learn how to reshape the doll inside it. That’s Nested Learning.
The paper proposes the following radical ideas:
• Multi-Frequency Memory Systems
Human brains operate at multiple frequencies simultaneously: gamma waves (30-150 Hz) process immediate sensory input, beta waves (13-30 Hz) handle active thinking, and theta/delta waves (0.5-8 Hz) consolidate memories into long-term storage. Current Transformers, however, only operate at two extreme frequencies: infinity (∞) for attention blocks that recompute everything fresh each token, and zero (0) for MLP weights frozen after pre-training. There’s no gradual consolidation.
The paper introduces Continuum Memory Systems (CMS) to fix this. Instead of static MLP blocks, imagine memory banks updating at different rates i.e. high-frequency neurons adapt quickly but retain information briefly (working memory), while low-frequency neurons update slowly but preserve knowledge long-term (like episodic memory). Critically, information flows between these frequencies, creating a consolidation loop where patterns learned in the fast system can gradually migrate to the slow system without erasing what was already there. This eliminates the binary choice between “recompute everything” (attention) and “never update” (MLPs).
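Here is a toy illustration of the multi-frequency idea as I understand it (my own simplification, not the paper's CMS): several memory banks, each updated at its own period, with slower banks consolidating the state of faster ones.

```python
import numpy as np

class ToyContinuumMemory:
    """Toy illustration only: each bank is a vector updated at its own period;
    fast banks track recent inputs, slow banks consolidate fast-bank state."""

    def __init__(self, dim: int = 64, periods: tuple = (1, 16, 256)):
        self.periods = periods
        self.banks = [np.zeros(dim) for _ in periods]
        self.t = 0

    def step(self, x: np.ndarray):
        self.t += 1
        # fastest bank: high-frequency, quickly overwritten working memory
        self.banks[0] = 0.5 * self.banks[0] + 0.5 * x
        # slower banks: update rarely, consolidating the bank one level faster
        for i in range(1, len(self.banks)):
            if self.t % self.periods[i] == 0:
                self.banks[i] = 0.95 * self.banks[i] + 0.05 * self.banks[i - 1]
```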
• Optimizers as Associative Memories
When you train a neural network with gradient descent, the paper argues you’re not just “moving weights downhill.” You’re training an associative memory that maps each data point to its “local surprise signal”, which is essentially how wrong the model’s prediction was for that input. The backpropagation process is compressing these surprise signals into the model’s parameters.
Even sophisticated optimizers like Adam are just two-level nested memories. The momentum term itself is an associative memory trying to compress the history of past gradients. But here’s the bottleneck: with standard momentum (β=0.9), only the last ~43 gradient steps contribute 99% of the current update. For continual learning spanning thousands or millions of interactions, this is woefully inadequate. The optimizer has no long-term memory of the loss landscape.
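That ~43-step figure checks out with a quick calculation: under an exponential moving average with decay β, the gradient from k steps ago carries weight (1-β)β^k, so the newest N steps carry 1-β^N of the total.

```python
import math

beta = 0.9
# the most recent N gradients carry total weight 1 - beta**N of the EMA,
# so solve 1 - beta**N = 0.99 for N:
N = math.log(1 - 0.99) / math.log(beta)
print(round(N, 1))   # ~43.7 -> roughly the last 44 steps account for 99%
```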
The paper proposes Delta Momentum and other variants that use gradient-dependent decay i.e. where the memory knows when to forget irrelevant past directions and when to hold onto important ones. Somewhat of a system that is learning what to remember.
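The paper's exact rule is not reproduced here, but as a toy illustration of what "gradient-dependent decay" could mean, imagine the decay being modulated by how much the incoming gradient agrees with the stored momentum:

```python
import numpy as np

def adaptive_momentum_step(m: np.ndarray, g: np.ndarray,
                           beta_max: float = 0.99, beta_min: float = 0.5) -> np.ndarray:
    """Toy gradient-dependent decay (not the paper's Delta Momentum rule):
    keep more history when the new gradient agrees with the momentum,
    forget faster when it points somewhere new."""
    agree = np.dot(m, g) / (np.linalg.norm(m) * np.linalg.norm(g) + 1e-8)  # in [-1, 1]
    beta = beta_min + (beta_max - beta_min) * (agree + 1) / 2
    return beta * m + (1 - beta) * g
```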
• Self-Modifying Networks
Here’s the most radical claim: a linear attention layer with a learnable initial memory state is functionally identical to an MLP layer, except it can also adapt in-context. Translation: MLPs are just frozen attention mechanisms. They’ve lost the ability to update during inference.
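Here is a tiny numpy sketch of that equivalence using the simplest (unnormalized) form of linear attention: the layer keeps a matrix-valued state that accumulates key/value outer products, and if that state is never updated in-context, the layer collapses to a fixed linear map, i.e. an MLP-like layer.

```python
import numpy as np

d = 8
S0 = np.random.randn(d, d)            # learnable initial memory state

def linear_attention_step(S, q, k, v):
    S = S + np.outer(v, k)            # write: state accumulates key/value pairs
    return S @ q, S                   # read: query the state

q, k, v = (np.random.randn(d) for _ in range(3))

out, S1 = linear_attention_step(S0, q, k, v)   # adapts in-context (state changes)
frozen_out = S0 @ q                            # never updated: just a fixed linear layer
```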
So the paper asks: why separate the architecture from its optimization process at all?
The model becomes self-referential i.e. it learns how to modify its own learning algorithm. There’s no clean separation between “training” and “inference” anymore. It’s all just different levels of the same nested optimization process running at different frequencies. The inner loops run fast (in-context adaptation), the outer loops run slow (weight updates), and the meta-loop learns how to coordinate them.
• Memory as a Distributed System, rather than Isolated Modules
Neuroscience research shows that memory in the human brain isn’t stored in one place. There’s no “short-term memory box” in the frontal cortex separate from a “long-term memory box” in the hippocampus. Instead, memory is a distributed process across multiple brain regions, with different areas updating at different rates and continuously communicating.
Traditional deep learning violates this. Attention handles short-term context, MLPs store long-term patterns, and never shall the two meet during inference. The paper argues for architectural uniformity, where the same basic building blocks (linear layers, associative memories) are repeated across all levels but operate at different frequencies. This mirrors the brain’s neuroplasticity, where neural elements aren’t rigidly dedicated to one function but can be flexibly redeployed.
To validate these ideas, the paper introduces HOPE, which combines linear attention for fast in-context adaptation, Continuum Memory Systems with multiple frequency bands, and self-modifying update rules that meta-learn consolidation strategies. HOPE demonstrates significant improvements over standard Transformers across multiple benchmarks.
In continual learning experiments with five sequential tasks, standard Transformers suffer 60-80% catastrophic forgetting while HOPE maintains performance with only 15% loss. On the BABILong benchmark with contexts up to 500K+ tokens, HOPE maintains consistent accuracy whereas standard Transformers degrade sharply beyond 100K tokens. The Needle-in-Haystack test shows HOPE achieves consistent retrieval regardless of position, eliminating the "lost in the middle" effect. For few-shot learning, HOPE continues improving with 128+ examples while standard Transformers saturate at 32 examples. Most impressively, HOPE learns new language patterns online during inference without weight updates, while standard Transformers require full fine-tuning. Despite these continual learning capabilities, HOPE remains competitive with standard Transformers on core language modeling tasks, proving it doesn't sacrifice fundamental performance.
The paper is radical and fascinating, and if it gets scaled, it might be the perfect answer to continual learning.
For a better understanding of the paper, watch the videos below, where the first author, Ali Behrouz, describes it.
That’s the end of this analysis and chain of thought. I might have gotten something wrong or unintuitive; if you find such points, please point them out in the comments and help me correct them.


