How large language models learn from examples in the prompt without weight updates
In-Context Learning (ICL) is the remarkable ability of large language models to learn new tasks from examples provided in the prompt—without any gradient updates or fine-tuning.
The Phenomenon
Given a few input-output examples, LLMs can generalize to new inputs:
```
Input: "The movie was terrible" → Sentiment: negative
Input: "I loved every minute" → Sentiment: positive
Input: "It was a waste of time" → Sentiment:
```
The model completes with “negative”—having learned the task from context alone.
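A few-shot prompt like the one above is just a formatted string. As a minimal sketch (the `build_prompt` helper and its formatting are illustrative, not a standard API), it can be assembled like this:

```python
def build_prompt(demos, query):
    """Format (input, label) demonstration pairs plus a new input
    into a single few-shot prompt string."""
    lines = [f'Input: "{x}" → Sentiment: {y}' for x, y in demos]
    lines.append(f'Input: "{query}" → Sentiment:')
    return "\n".join(lines)

demos = [
    ("The movie was terrible", "negative"),
    ("I loved every minute", "positive"),
]
prompt = build_prompt(demos, "It was a waste of time")
print(prompt)
```

Feeding this string to a language model and sampling the next token is the entire "training" procedure.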
Why It’s Surprising
Traditional machine learning requires:
- Collecting labeled data
- Defining a loss function
- Optimizing weights via gradient descent
ICL skips all of this. The model’s weights remain frozen; “learning” happens through attention over the prompt.
Formal Framework
Let $(x_1, y_1), \ldots, (x_k, y_k)$ be demonstration examples and $x_{k+1}$ be a new input. The model computes:

$$p_\theta\big(y_{k+1} \mid x_1, y_1, \ldots, x_k, y_k, x_{k+1}\big)$$
The demonstrations create a task distribution that the model conditions on.
How Many Examples?
| Setting | Examples | Use Case |
|---|---|---|
| Zero-shot | 0 | Task described in natural language |
| One-shot | 1 | Single example + new input |
| Few-shot | 2-32 | Multiple examples |
More examples generally improve accuracy, but with diminishing returns.
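The three settings in the table differ only in how many demonstrations precede the query. A sketch of one template covering all of them (the `make_prompt` function and the review/sentiment wording are illustrative assumptions):

```python
def make_prompt(task_desc, demos, query):
    """Build a prompt from a task description plus k demonstrations.
    k = 0 is zero-shot, k = 1 one-shot, k >= 2 few-shot."""
    parts = [task_desc]
    parts += [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

pool = [("Great film!", "positive"), ("Dull and slow.", "negative")]
for k in range(len(pool) + 1):  # zero-shot, one-shot, two-shot
    print(f"--- {k}-shot ---")
    print(make_prompt("Classify each review's sentiment.", pool[:k], "A waste of time."))
```

The zero-shot case relies entirely on the natural-language task description; each added demonstration sharpens the task signal.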
What Makes ICL Work?
Several hypotheses:
- Task Recognition: Model recognizes task from examples, retrieves learned behavior
- Implicit Fine-Tuning: Attention acts as a form of gradient descent
- Bayesian Inference: Model infers task distribution from demonstrations
- Induction Heads: Specialized attention heads find repeated patterns in the context and copy what followed them before
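The induction-head hypothesis can be illustrated with a toy copy rule: to predict the next token, find the most recent earlier occurrence of the current token and emit whatever followed it. This sketch implements only that copying behavior, not actual attention:

```python
def induction_predict(tokens):
    """Toy induction-head rule: locate the most recent earlier
    occurrence of the last token and copy its successor."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: the rule has nothing to copy

seq = ["A", "B", "C", "D", "A", "B", "C"]
print(induction_predict(seq))  # copies the token that followed "C" earlier: "D"
```

In real transformers, heads with this match-and-copy behavior have been identified empirically and correlate with few-shot performance.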
Emergent Ability
ICL appears only at scale:
- Small models: Cannot do ICL effectively
- GPT-3 (175B): Strong ICL ability emerges
- Larger models: Increasingly robust ICL
This “phase transition” makes ICL an emergent capability.
Best Practices
| Practice | Effect |
|---|---|
| Diverse examples | Better generalization |
| Consistent format | Clearer task signal |
| Similar examples to test | Improved accuracy |
| Clear separators | Reduces confusion |
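The "similar examples to test" practice is often implemented as retrieval: rank a demonstration pool by similarity to the query and keep the top k. A rough sketch using word-overlap (Jaccard) similarity; production systems typically use embedding similarity instead, and `select_demos` is an illustrative name:

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def select_demos(pool, query, k=2):
    """Pick the k demonstrations most lexically similar to the query."""
    return sorted(pool, key=lambda d: jaccard(d[0], query), reverse=True)[:k]

pool = [
    ("The plot was dull", "negative"),
    ("A fantastic soundtrack", "positive"),
    ("The plot twists were fantastic", "positive"),
]
print(select_demos(pool, "The plot was fantastic", k=2))
```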
Limitations
- Context window: Limited by maximum sequence length
- Order sensitivity: Performance can vary with example ordering
- Recency bias: May weight recent examples more heavily
- Task complexity: Struggles with multi-step reasoning (see Chain-of-Thought)
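One practical consequence of the context-window limit: when the demonstrations outgrow the token budget, the oldest ones must be dropped. A rough sketch with a naive whitespace token count (real systems would use the model's own tokenizer; `fit_demos` is an illustrative helper):

```python
def fit_demos(demos, budget):
    """Keep the most recent demonstrations whose combined (naive)
    token count fits within `budget`, preserving original order."""
    kept, used = [], 0
    for x, y in reversed(demos):          # newest first
        n = len(f"{x} {y}".split())        # crude token estimate
        if used + n > budget:
            break
        kept.append((x, y))
        used += n
    return kept[::-1]                      # restore original order

demos = [
    ("The movie was terrible", "negative"),
    ("I loved every minute", "positive"),
    ("It was a waste of time", "negative"),
]
print(fit_demos(demos, budget=10))
```

Note that truncating from the front interacts with the order-sensitivity and recency-bias limitations above: dropping early examples changes which demonstrations the model weights most.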