Magic (magic.dev/blog/100m-token-context-windows) recently announced a breakthrough that reduces the computational complexity of LLM response generation, making context windows as long as 100 million tokens possible - equivalent to 650 novels. This is a significant leap: inference gets faster and cheaper, in some cases by several orders of magnitude, especially for ultra-long context windows.
Combined with Groq-like hardware (think ASICs in bitcoin mining), the cost of inference could drop to a point where “human in a box”-level intelligence becomes cheap. The timing is perfect to build inference-hungry, multi-modal applications.
But RAG (Retrieval-Augmented Generation) isn’t going anywhere. (Here, “cost of LLM generation” refers to time, dollar cost, and compute/FLOPs - all largely proportional to one another, so the terms are used interchangeably below.)
Pre-Magic Era: The Case for RAG
Take a knowledge base of length m and instructions of length n, both measured in tokens (typically m ≫ n). If a model with standard quadratic attention could process arbitrarily long inputs, the time complexity of LLM generation would be O(m² + n²). With a RAG-like solution, complexity drops dramatically to O(log m + n²): the vector-DB lookup is O(log m), assuming retrieval yields a constant number of constant-length chunks. That is a radical drop, and hard to forgo if correctness benchmarks remain largely unaffected.
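To get a feel for the size of that gap, here is a toy cost model in Python. It is only a sketch: it drops every constant factor, treats the big-O expressions above as if they were exact costs, and uses base-2 logarithms arbitrarily. The function names and the value n = 1,000 are assumptions made for illustration; m = 100 million echoes the context length mentioned earlier.

```python
import math


def quadratic_full_context_cost(m: int, n: int) -> float:
    """Toy cost of feeding everything to the model: O(m^2 + n^2), constants dropped."""
    return m**2 + n**2


def quadratic_rag_cost(m: int, n: int) -> float:
    """Toy cost with RAG: O(log m) lookup plus O(n^2) generation, constants dropped."""
    return math.log2(m) + n**2


# Hypothetical sizes: m from the 100M-token figure, n chosen for illustration.
m, n = 100_000_000, 1_000
quadratic_ratio = quadratic_full_context_cost(m, n) / quadratic_rag_cost(m, n)
print(f"full context / RAG cost ratio: {quadratic_ratio:.3e}")
```

Under these assumptions the ratio comes out on the order of 10^10. Real systems have very different constants for retrieval and generation, so the number is a rough indication of scale rather than a prediction.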
Post-Magic Era: Why RAG Still Makes Sense
Imagine the cost of LLM generation dropping to O(m + n) thanks to Magic’s breakthrough. Even then, RAG-like solutions reduce complexity further, to O(log m + n). Since m is still much greater than n, this reduction remains highly attractive.
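The same back-of-the-envelope comparison can be run for the linear case. Again, this is a sketch under loud assumptions: constants are dropped, the logarithm base is arbitrary, the function names are made up for illustration, and n = 1,000 is a hypothetical instruction length.

```python
import math


def linear_full_context_cost(m: int, n: int) -> float:
    """Toy cost with linear-time generation over everything: O(m + n), constants dropped."""
    return m + n


def linear_rag_cost(m: int, n: int) -> float:
    """Toy cost with RAG on top of linear-time generation: O(log m + n), constants dropped."""
    return math.log2(m) + n


# Hypothetical sizes: m from the 100M-token figure, n chosen for illustration.
m, n = 100_000_000, 1_000
linear_ratio = linear_full_context_cost(m, n) / linear_rag_cost(m, n)
print(f"full context / RAG cost ratio: {linear_ratio:.3e}")
```

With these numbers the gap shrinks from the quadratic case but still lands around 10^5, which is the sense in which the reduction stays attractive whenever m is much larger than n.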
TL;DR
The next generation of human-computer interfaces - voice bots, video chats, humanoids capable of natural communication - will demand sub-second latency and ultra-low costs, and many scenarios will require searching web-scale databases. RAG-like solutions will remain essential to making such technology feasible and marketable in consumer applications.