Order Matters: Sequence to Sequence for Sets

How input and output ordering affects seq2seq learning on set-structured data

Order Matters: Sequence to Sequence for Sets addresses a fundamental question: how should seq2seq models handle inputs or outputs that are inherently unordered sets?

The Problem

Standard seq2seq models assume ordered sequences. But many tasks involve sets:

  • Input sets: Point clouds, object collections
  • Output sets: Predicted tags, detected objects
  • Set to sequence: Sorting (unordered input set → ordered output sequence)

Naively treating sets as sequences introduces spurious ordering dependencies.

Input Order Sensitivity

Experiment: Sort numbers using seq2seq.

Input Order       Output
[3, 1, 4, 2]      [1, 2, 3, 4] ✓
[1, 2, 3, 4]      [1, 2, 3, 4] ✓
[4, 3, 2, 1]      [1, 2, 3, ?] ✗

Different input orderings can lead to different (wrong) outputs!
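
One way to probe this is to feed several orderings of the same set to a trained model and count how many distinct outputs come back. The sketch below is illustrative only and is not from the paper: `model` is any callable sequence-to-sequence sorter, and the built-in `sorted` stands in for a trained model here.

```python
import random

def count_distinct_outputs(model, items, n_perms=10, seed=0):
    """Feed several random orderings of the same set to `model` and count
    how many distinct outputs it produces (1 means order-insensitive)."""
    rng = random.Random(seed)
    outputs = set()
    for _ in range(n_perms):
        perm = rng.sample(items, len(items))   # one ordering of the set
        outputs.add(tuple(model(perm)))
    return len(outputs)

# Stand-in "model": an exact sorter is trivially order-insensitive.
# A trained seq2seq sorter would be plugged in here instead and may return
# different (possibly wrong) outputs for different input orderings.
print(count_distinct_outputs(sorted, [3, 1, 4, 2]))   # -> 1
```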

Interactive Demo

Explore how input order affects set processing:

[Interactive widget: an unordered set of numbers (6, 5, 1, 9, 4, 2, 1, 3) can be reshuffled to show that the same set presented in a different order should produce the same output.]

  • Read-Process-Write: Read embeds all inputs, Process attends over the set, Write generates the output sequence.
  • Permutation invariance: aggregating with attention means the order of the input does not affect the representation.
  • Applications: sorting (input set → sorted sequence), convex hull (points → ordered hull vertices), sentence ordering (shuffled sentences → coherent paragraph).
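
A minimal numerical check of the invariance claim from the demo, assuming a fixed query vector and dot-product attention (both illustrative choices, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))     # embeddings of a 5-element set
perm = rng.permutation(5)         # a reordering of the same set

def attention_pool(e, q):
    """Content-based attention with a fixed query: each weight depends only on
    the element itself, not on its position, so the pooled vector is order-invariant."""
    scores = e @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ e

q = rng.normal(size=8)
print(np.allclose(attention_pool(emb, q), attention_pool(emb[perm], q)))    # True

# By contrast, a position-weighted sum (order-dependent) is not invariant:
pos = np.arange(1, 6)[:, None]
print(np.allclose((pos * emb).sum(axis=0), (pos * emb[perm]).sum(axis=0)))  # False (almost surely)
```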

Solution: Read-Process-Write

The paper proposes a three-phase architecture:

1. Read

Embed all input elements:

\{e_1, e_2, ..., e_n\} = \text{Embed}(x_1, x_2, ..., x_n)
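
A sketch of the Read block, assuming scalar inputs and a one-layer tanh embedding as a stand-in for the paper's small neural network; the weights `W`, `b` are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(1, 16)), np.zeros(16)   # placeholder weights of a shared embedder

def read(xs):
    """Read block: apply the same small network to every element independently,
    so no ordering information is injected at this stage."""
    x = np.asarray(xs, dtype=float)[:, None]    # (n, 1) scalar inputs
    return np.tanh(x @ W + b)                   # (n, 16) memory vectors e_i

memories = read([6, 5, 1, 9, 4, 2, 1, 3])
print(memories.shape)   # (8, 16)
```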

2. Process

Use attention to create an order-invariant representation:

q_t = \text{LSTM}(q_{t-1})
\alpha_i = \text{softmax}(e_i \cdot q_t)
r_t = \sum_i \alpha_i e_i

Multiple processing steps refine the representation.
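
A rough numpy sketch of the Process block under two simplifications: a plain tanh recurrence stands in for the LSTM, and the state is refined by concatenating the previous read r back into the query. The weights and number of steps are arbitrary placeholders; the point is that the memories are only ever seen through an order-invariant weighted sum.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
e = rng.normal(size=(8, d))               # memories produced by the Read block
Wq = 0.1 * rng.normal(size=(2 * d, d))    # placeholder recurrent weights

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def process(memories, steps=5):
    """Process block: repeated attention over the set. A tanh recurrence stands
    in for the LSTM; the state only touches the memories through the
    order-invariant weighted sum r_t."""
    q, r = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        q = np.tanh(np.concatenate([q, r]) @ Wq)   # q_t from (q_{t-1}, r_{t-1})
        alpha = softmax(memories @ q)              # attention weights over elements
        r = alpha @ memories                       # r_t: weighted sum of memories
    return r

perm = rng.permutation(len(e))
print(np.allclose(process(e), process(e[perm])))   # True: any order, same representation
```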

3. Write

Generate the output sequence with a pointer network:

p(y_i | y_{<i}, M) = \text{Attention}(s_i, M)
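
A minimal sketch of one Write step with pointer-style attention, assuming dot-product scores between a decoder state s_i and the memories M (pointer networks use a learned additive attention; the state and memories here are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
M = rng.normal(size=(8, d))    # memories of the 8 input elements
s = rng.normal(size=d)         # placeholder decoder state s_i at one output step

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

# Pointer-style write step: the "output vocabulary" is the input set itself,
# so p(y_i | y_<i, M) is just an attention distribution over the memories.
p = softmax(M @ s)
print(p.round(3))       # probability of pointing at each input element
print(int(p.argmax()))  # index of the element emitted greedily
```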

Output Order Learning

For tasks where output order matters but isn’t predetermined:

\mathcal{L} = \min_{\pi \in S_n} \sum_i -\log p(y_{\pi(i)} | y_{\pi(<i)})

Search over permutations to find the easiest ordering to learn.
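
For small output sets the min over orderings can be brute-forced, as in the sketch below with a toy log-probability function; the paper instead searches or samples orderings during training, since enumerating n! permutations does not scale.

```python
import itertools
import math

def sequence_nll(logprob, ys):
    """Negative log-likelihood of one particular ordering of the target set."""
    return -sum(logprob(ys[:i], y) for i, y in enumerate(ys))

def min_over_orderings(logprob, target_set):
    """Brute-force the min over orderings (feasible only for tiny sets)."""
    return min(sequence_nll(logprob, perm)
               for perm in itertools.permutations(target_set))

# Toy stand-in for log p(y | y_<i): this "model" prefers ascending outputs,
# so the easiest ordering of {3, 1, 2} is (1, 2, 3).
def toy_logprob(prefix, y):
    return math.log(0.9) if (not prefix or y > prefix[-1]) else math.log(0.1)

print(round(min_over_orderings(toy_logprob, {3, 1, 2}), 3))   # ~0.316
```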

Key Results

Sorting

Model                 Accuracy
Seq2seq               72%
Seq2seq + Attention   87%
Read-Process-Write    94%

Convex Hull

Sorting input points by angle dramatically improved performance—demonstrating that “natural” orderings help learning.
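
As an illustration of such a "natural" ordering, here is one simple way to pre-sort 2-D points counter-clockwise by angle around their centroid before feeding them to the model; the exact ordering used in the paper's experiments may differ.

```python
import math

def sort_by_angle(points):
    """Order 2-D points counter-clockwise by angle around their centroid,
    one simple 'natural' input ordering for the convex-hull task."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

points = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (0.1, 0.2), (0.8, 0.9)]
print(sort_by_angle(points))
```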

Implications

  1. Order is a hyperparameter: Choice of ordering affects learning
  2. Attention enables invariance: Self-attention over sets removes order dependence
  3. Output order matters: Some orderings are easier to learn than others

Connection to Transformers

The Read-Process-Write architecture anticipated:

  • Set attention (now standard in Transformers)
  • Iterative refinement (multiple attention layers)
  • Permutation invariance (via attention aggregation)

Key Paper

Vinyals, O., Bengio, S., & Kudlur, M. (2016). Order Matters: Sequence to Sequence for Sets. ICLR 2016. arXiv:1511.06391.