In-Context Learning

How large language models learn from examples in the prompt without weight updates

In-Context Learning (ICL) is the remarkable ability of large language models to learn new tasks from examples provided in the prompt—without any gradient updates or fine-tuning.

The Phenomenon

Given a few input-output examples, LLMs can generalize to new inputs:

Input: "The movie was terrible" → Sentiment: negative
Input: "I loved every minute" → Sentiment: positive
Input: "It was a waste of time" → Sentiment:

The model completes with “negative”—having learned the task from context alone.
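The few-shot prompt above can be assembled programmatically. A minimal sketch in Python (the `build_prompt` helper and its exact format are illustrative conventions, not any specific library's API):

```python
# Sketch: assembling a few-shot sentiment prompt from demonstrations.
# The "Input: ... → Sentiment: ..." format mirrors the example above;
# any completion-style model would consume the resulting string as-is.

def build_prompt(demos, test_input):
    """Join (input, label) demonstrations and append the unlabeled test input."""
    lines = [f'Input: "{x}" → Sentiment: {y}' for x, y in demos]
    lines.append(f'Input: "{test_input}" → Sentiment:')
    return "\n".join(lines)

demos = [
    ("The movie was terrible", "negative"),
    ("I loved every minute", "positive"),
]
prompt = build_prompt(demos, "It was a waste of time")
```

The model's completion of `prompt` is then read off as the predicted label.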

Why It’s Surprising

Traditional machine learning requires:

  1. Collecting labeled data
  2. Defining a loss function
  3. Optimizing weights via gradient descent

ICL skips all of this. The model’s weights remain frozen; “learning” happens through attention over the prompt.

Formal Framework

Let $x_1, y_1, \ldots, x_k, y_k$ be demonstration examples and $x_{\text{test}}$ be a new input. The model computes:

$$P(y_{\text{test}} \mid x_1, y_1, \ldots, x_k, y_k, x_{\text{test}})$$

The demonstrations create a task distribution that the model conditions on.
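This conditioning step can be sketched concretely: score each candidate label given the demonstrations plus the test input, then normalize with a softmax. The `score` function below is a stand-in stub for a real LM's log-probabilities, not an actual model API:

```python
import math

# Sketch of P(y_test | x_1, y_1, ..., x_k, y_k, x_test) over a small
# label set. A real LM would supply the per-label log-probabilities;
# `stub_score` is a hypothetical stand-in.

def label_distribution(score, context, candidates):
    """Softmax per-label scores into a distribution over candidate labels."""
    logps = [score(context, y) for y in candidates]
    m = max(logps)                            # subtract max for numerical stability
    exps = [math.exp(lp - m) for lp in logps]
    z = sum(exps)
    return {y: e / z for y, e in zip(candidates, exps)}

# Hypothetical scores: the model favors "negative" for this test input.
stub_score = lambda context, y: {"negative": -0.2, "positive": -1.8}[y]
dist = label_distribution(stub_score, "demos + x_test", ["negative", "positive"])
```

The ICL prediction is simply the argmax of `dist`.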

How Many Examples?

| Setting   | Examples | Use Case                           |
|-----------|----------|------------------------------------|
| Zero-shot | 0        | Task described in natural language |
| One-shot  | 1        | Single example + new input         |
| Few-shot  | 2–32     | Multiple examples                  |
More examples generally improve accuracy, but with diminishing returns.
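One way to check this on a given task is a k-shot sweep: rebuild the prompt with the first k demonstrations and measure held-out accuracy at each k. The `predict` callable below is a hypothetical stand-in for a real model call:

```python
# Sketch of measuring accuracy as a function of shot count k.
# With an actual LM, accuracy typically rises with k and then plateaus.

def kshot_accuracy(predict, demos, test_set, k):
    """Accuracy on test_set when conditioning on the first k demonstrations."""
    context = demos[:k]
    hits = sum(1 for x, y in test_set if predict(context, x) == y)
    return hits / len(test_set)

def sweep(predict, demos, test_set):
    """Accuracy at every shot count from 0 (zero-shot) to len(demos)."""
    return {k: kshot_accuracy(predict, demos, test_set, k)
            for k in range(len(demos) + 1)}
```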

Interactive Visualization

In the original interactive demo, adding examples changes the model's prediction. With two demonstrations, the prompt is:

Input: "The movie was terrible" → negative
Input: "I loved every minute" → positive
Input: "The acting was brilliant" → ?

Model prediction: positive (81% confidence)
Probability distribution: positive 81%, negative 19%

Observation: more examples → higher confidence. The model "learns" the sentiment task from context alone.

What Makes ICL Work?

Several hypotheses:

  1. Task Recognition: the model recognizes the task from the examples and retrieves behavior learned during pretraining
  2. Implicit Fine-Tuning: attention over the demonstrations implicitly implements a form of gradient descent
  3. Bayesian Inference: the model infers a task distribution from the demonstrations
  4. Induction Heads: specific attention heads find and copy repeated patterns from earlier context
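The induction-head hypothesis (4) can be caricatured in a few lines: attend back to an earlier occurrence of the current token and copy its successor. A toy sketch of the `[A, B, ..., A] → B` completion rule, not a model of real attention weights:

```python
# Toy illustration of the induction-head copying rule: find the most
# recent earlier occurrence of the final token and predict the token
# that followed it.

def induction_predict(tokens):
    """Predict the next token by copying the successor of the last
    earlier occurrence of the final token; None if there is no match."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan right-to-left
        if tokens[i] == last:
            return tokens[i + 1]
    return None
```

Trained transformers exhibit heads that implement something like this pattern, which is one proposed mechanism behind copying demonstrations in context.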

Emergent Ability

ICL appears only at scale:

  • Small models: Cannot do ICL effectively
  • GPT-3 (175B): Strong ICL ability emerges
  • Larger models: Increasingly robust ICL

This “phase transition” makes ICL an emergent capability.

Best Practices

| Practice                           | Effect                |
|------------------------------------|-----------------------|
| Diverse examples                   | Better generalization |
| Consistent format                  | Clearer task signal   |
| Examples similar to the test input | Improved accuracy     |
| Clear separators                   | Reduced confusion     |
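Two of these practices, consistent formatting and clear separators, can be baked directly into the prompt serializer. A sketch; the `###` separator is one common convention, not a requirement:

```python
# Sketch: serialize demonstrations with a uniform "Input/Label" template
# and an explicit separator between examples, so the task boundaries are
# unambiguous to the model.

SEPARATOR = "\n###\n"

def format_demos(demos, test_input):
    """Render demonstrations plus the unlabeled test input in one format."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {test_input}\nLabel:")
    return SEPARATOR.join(blocks)
```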

Limitations

  • Context window: Limited by maximum sequence length
  • Order sensitivity: Performance can vary with example ordering
  • Recency bias: May weight recent examples more heavily
  • Task complexity: Struggles with multi-step reasoning (see Chain-of-Thought)
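Order sensitivity in particular is easy to probe: run the same query under every ordering of the demonstrations and see how many distinct answers come back. `predict` is again a hypothetical stand-in for a real model call:

```python
from itertools import permutations

# Sketch of an order-sensitivity probe: collect the distinct predictions
# produced across all demonstration orderings. More than one element in
# the result means the model is sensitive to example order.

def order_sensitivity(predict, demos, test_input):
    """Set of predictions across all demonstration orderings."""
    return {predict(list(p), test_input) for p in permutations(demos)}
```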