Ring Attention Explained
Apr 2024
How do state-of-the-art LLMs like Gemini 1.5 and Claude 3 scale to long context windows of a million tokens or more? Well, Ring Attention offers a way to split the attention computation across GPUs arranged in a ring while overlapping the key-value communication with computation, effectively enabling zero-overhead scaling of context length with the number of devices.
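To make the ring concrete, here is a minimal single-process sketch in plain NumPy (not the paper's JAX implementation, and the function name `ring_attention_simulated` is made up for illustration). Each simulated "device" owns one query block and starts with one key/value block; the KV blocks rotate around the ring while each device accumulates its output with an online softmax, so after one full rotation every query has attended to every key.

```python
import numpy as np

def ring_attention_simulated(q_blocks, k_blocks, v_blocks):
    """Simulate Ring Attention on one machine (illustrative sketch)."""
    n_dev = len(q_blocks)
    d = q_blocks[0].shape[-1]

    # Per-device running state: unnormalised output, softmax denominator,
    # and running max for numerically stable online softmax accumulation.
    out = [np.zeros_like(q) for q in q_blocks]
    denom = [np.zeros(q.shape[0]) for q in q_blocks]
    running_max = [np.full(q.shape[0], -np.inf) for q in q_blocks]

    k_cur, v_cur = list(k_blocks), list(v_blocks)
    for _ in range(n_dev):
        for i in range(n_dev):
            # Blockwise attention between device i's queries and the
            # KV block it currently holds.
            scores = q_blocks[i] @ k_cur[i].T / np.sqrt(d)
            new_max = np.maximum(running_max[i], scores.max(axis=-1))
            correction = np.exp(running_max[i] - new_max)
            p = np.exp(scores - new_max[:, None])
            out[i] = out[i] * correction[:, None] + p @ v_cur[i]
            denom[i] = denom[i] * correction + p.sum(axis=-1)
            running_max[i] = new_max
        # "Send" each KV block to the next device in the ring. On real
        # hardware this transfer overlaps with the computation above,
        # which is what hides the communication cost.
        k_cur = k_cur[-1:] + k_cur[:-1]
        v_cur = v_cur[-1:] + v_cur[:-1]

    return [o / dnm[:, None] for o, dnm in zip(out, denom)]


# Quick check: concatenating the block outputs should match full attention.
rng = np.random.default_rng(0)
q = [rng.normal(size=(4, 8)) for _ in range(3)]
k = [rng.normal(size=(4, 8)) for _ in range(3)]
v = [rng.normal(size=(4, 8)) for _ in range(3)]
ring_out = np.concatenate(ring_attention_simulated(q, k, v))

scores = np.concatenate(q) @ np.concatenate(k).T / np.sqrt(8)
weights = np.exp(scores - scores.max(-1, keepdims=True))
full_out = (weights / weights.sum(-1, keepdims=True)) @ np.concatenate(v)
assert np.allclose(ring_out, full_out)
```

The sketch runs everything sequentially, so it only shows the math; the point of the real algorithm is that the KV hand-off and the block computation happen at the same time on separate devices.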