Normalizing each example across its features
Why bigger models, more data, and more compute lead to predictable gains
Applying Transformers directly to image patches for visual recognition