Differentiating Multi-Token Prediction from Prior LLM Training Methods
Table of Links
Abstract and 1. Introduction
2. Method
3. Experiments on real data
3.1. Benefits scale with model size and 3.2. Faster inference
3.3. Learning global patterns with multi-byte prediction...