Pre-training deep bidirectional representations for NLP
Autoregressive language models that learn to predict the next token
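A minimal sketch of that autoregressive objective, assuming a toy setup (the function names and the uniform "model" here are hypothetical, purely for illustration): the model assigns a probability to each next token given its prefix, and training minimizes the mean negative log-probability of the true next tokens.

```python
import math

def next_token_loss(token_ids, prob_fn):
    """Mean cross-entropy of predicting token t+1 from the prefix up to t.

    prob_fn(prefix, next_id) returns the model's probability of next_id
    given the prefix (a hypothetical interface for this sketch).
    """
    losses = []
    for t in range(len(token_ids) - 1):
        p = prob_fn(token_ids[:t + 1], token_ids[t + 1])
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Toy "model": uniform over a 4-token vocabulary (an assumption, not a
# trained model) -- every next token gets probability 1/4.
uniform = lambda prefix, next_id: 0.25

loss = next_token_loss([0, 1, 2, 3], uniform)
# Under a uniform model each step costs -log(1/4), so the mean is log(4).
```

This contrasts with the bidirectional pre-training above, where masked positions are predicted from context on both sides rather than left-to-right.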