How Postgres Stores Composite Indexes on Disk
Composite indexes can seem like an easy way to “put two columns together and get great performance benefits”, as if Postgres just figures it out. But the physical representation is not intuitive until you have looked inside a B-tree page. Once you understand how these indexes are stored, you can see why some queries get a real lift from a composite index while others barely benefit.
This post walks through exactly how a composite index like the one below will end up getting stored on disk.
CREATE TABLE test_tbl(
id bigserial primary key,
col1 int,
col2 text
);
CREATE INDEX t_idx1 ON test_tbl(col1, col2);
Composite Indexes Are One Key, Not Two
t_idx1 is a single B-tree index whose key is the pair (col1, col2).
Postgres stores that pair as one logical key and orders it lexicographically:
- Sort by col1
- For rows with the same col1, sort by col2
Everything about how queries use this index flows from this rule.
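The rule can be sketched in Python (a toy model, not Postgres code): tuples compare lexicographically in exactly the way composite keys do.

```python
# Toy model: composite B-tree keys compare lexicographically,
# just like Python tuples -- col1 first, then col2 on ties.
keys = [(2, 'abc'), (1, 'bbb'), (2, 'xyz'), (1, 'aaa')]

# Sorting by the pair reproduces the index order.
ordered = sorted(keys)
print(ordered)
# (1, 'aaa') comes before (1, 'bbb'): same col1, so col2 breaks the tie.
```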
On-Disk Storage: A Separate File, B-Tree Structured
t_idx1 is its own table-like structure on disk:
- It lives in pgdata/base/<db_oid>/<relfilenode>
- It uses the B-tree access method
- It is split into 8 kB pages
- Pages are arranged as:
  - Meta page (block 0)
  - Internal pages (routing nodes)
  - Leaf pages (actual key + TID entries)
This file has no connection to the heap except via the TIDs stored inside each tuple.
graph TD
subgraph file["Index File (relfilenode)"]
M["Block 0 · Meta Page\nroot=1, level=2, fast-root=1"]
I1["Block 1 · Internal Page\nsep: (col1=2)"]
L1["Block 2 · Leaf Page\n(1,'aaa', TID)\n(1,'bbb', TID)"]
L2["Block 3 · Leaf Page\n(2,'abc', TID)\n(2,'xyz', TID)"]
end
M --> I1
I1 -->|"left child\ncol1 < 2"| L1
I1 -->|"right child\ncol1 ≥ 2"| L2
L1 <-->|"prev / next ptrs"| L2
For large indexes Postgres splits the file into 1 GB segments (.1, .2, …) but
the page structure and B-tree layout remain identical across segments.
Each entry in a leaf page stores:
- col1 - a 4-byte integer
- col2 - a text datum
- TID - a 6-byte heap pointer (block number + offset)
These are wrapped in an IndexTupleData header, which holds the TID itself plus flags and the tuple size, followed by a null bitmap if needed.
Text values follow varlena rules:
- Short text: stored inline ([length][bytes])
- Large text: compressed inline where possible. Index tuples cannot hold out-of-line TOAST pointers, so a value that is still too large after compression (roughly a third of a page for B-trees) is rejected with an error rather than stored.
A leaf entry is physically:
[IndexTupleHeader: TID (6 bytes) + t_info (2 bytes of flags and size)]
[null bitmap, if present]
[col1 value]
[col2 varlena]
Note that the TID sits at the front of the tuple, inside the header, not after the key values.
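As an illustration, here is a simplified sketch of that layout using Python's struct module. It omits the null bitmap and varlena headers and leaves t_info zero, but the field order follows IndexTupleData: the 6-byte TID (three 16-bit fields) comes first, inside the header.

```python
import struct

# Simplified sketch of a leaf index tuple (no null bitmap, no
# varlena header, t_info left 0). Field order follows IndexTupleData:
# the 6-byte TID comes FIRST, inside the header.
def pack_index_tuple(block, offset, col1, col2):
    tid = struct.pack('<HHH', block >> 16, block & 0xFFFF, offset)  # ItemPointerData
    t_info = struct.pack('<H', 0)      # flags + tuple size (zeroed in this toy)
    key1 = struct.pack('<i', col1)     # 4-byte integer key
    key2 = col2.encode()               # text payload, headers omitted
    return tid + t_info + key1 + key2

raw = pack_index_tuple(0, 5, 1, 'ant')
blk_hi, blk_lo, off = struct.unpack_from('<HHH', raw, 0)
(col1,) = struct.unpack_from('<i', raw, 8)
print((blk_hi << 16 | blk_lo, off), col1, raw[12:].decode())
```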
How Composite Keys Are Ordered
The B-tree compares keys in order:
(col1=1, col2='aaa')
(col1=1, col2='bbb')
(col1=2, col2='abc')
If a query filters only on col2, the index cannot be traversed meaningfully
because the tree is structured around col1.
Column order in the index defines physical locality on disk.
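A toy demonstration of why the leading column matters, using bisect on a sorted Python list standing in for the leaf entries:

```python
import bisect

# Index entries in (col1, col2) order -- the on-disk leaf order.
entries = [(1, 'aaa'), (1, 'bbb'), (2, 'abc'), (2, 'xyz'), (3, 'abc')]

# Filtering on the leading column is a contiguous slice: binary
# search finds its boundaries without touching other entries.
lo = bisect.bisect_left(entries, (2, ''))
hi = bisect.bisect_left(entries, (3, ''))
print(entries[lo:hi])          # all col1 = 2 rows, found in O(log n)

# Filtering only on col2 = 'abc' gives no usable boundaries:
# matches are scattered, so every entry must be inspected.
print([e for e in entries if e[1] == 'abc'])
```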
A Concrete Example: 12 Rows Across Three Levels
Consider inserting these 12 rows (insertion order is irrelevant — the index always maintains sorted order):
INSERT INTO test_tbl (col1, col2) VALUES
(3, 'ivy'), (1, 'bee'), (4, 'koi'),
(2, 'fox'), (1, 'ant'), (3, 'gnu'),
(4, 'jay'), (2, 'dog'), (1, 'cat'),
(3, 'hen'), (4, 'lynx'),(2, 'elk');
The index sorts them by (col1, col2). TIDs reflect insertion order, not sort
order:
| col1 | col2 | TID |
|---|---|---|
| 1 | ‘ant’ | (0,5) |
| 1 | ‘bee’ | (0,2) |
| 1 | ‘cat’ | (0,9) |
| 2 | ‘dog’ | (0,8) |
| 2 | ‘elk’ | (0,12) |
| 2 | ‘fox’ | (0,4) |
| 3 | ‘gnu’ | (0,6) |
| 3 | ‘hen’ | (0,10) |
| 3 | ‘ivy’ | (0,1) |
| 4 | ‘jay’ | (0,7) |
| 4 | ‘koi’ | (0,3) |
| 4 | ‘lynx’ | (0,11) |
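The table above can be reproduced with a few lines of Python (a toy model: a sorted list standing in for the index):

```python
# The 12 rows in insertion order; the TID records where each row
# landed in heap page 0 (offset = insertion position).
rows = [(3, 'ivy'), (1, 'bee'), (4, 'koi'), (2, 'fox'), (1, 'ant'),
        (3, 'gnu'), (4, 'jay'), (2, 'dog'), (1, 'cat'), (3, 'hen'),
        (4, 'lynx'), (2, 'elk')]

# Attach TIDs before sorting: heap position is fixed at insert time.
entries = [(c1, c2, (0, i + 1)) for i, (c1, c2) in enumerate(rows)]

# The index keeps entries sorted by (col1, col2); TIDs stay put.
entries.sort(key=lambda e: (e[0], e[1]))
for c1, c2, tid in entries[:3]:
    print(c1, c2, tid)
```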
With 3 entries per leaf page the index grows to three levels: a root internal page, two intermediate internal pages below it, and four leaf pages.
graph TD
META["Meta Page · block 0\nroot → block 1"]
ROOT["Root Internal · block 1\nsep = (3)"]
LI["Left Internal · block 2\nsep = (2)"]
RI["Right Internal · block 3\nsep = (4)"]
L4["Leaf · block 4\n(1,'ant') TID(0,5)\n(1,'bee') TID(0,2)\n(1,'cat') TID(0,9)"]
L5["Leaf · block 5\n(2,'dog') TID(0,8)\n(2,'elk') TID(0,12)\n(2,'fox') TID(0,4)"]
L6["Leaf · block 6\n(3,'gnu') TID(0,6)\n(3,'hen') TID(0,10)\n(3,'ivy') TID(0,1)"]
L7["Leaf · block 7\n(4,'jay') TID(0,7)\n(4,'koi') TID(0,3)\n(4,'lynx') TID(0,11)"]
META --> ROOT
ROOT -->|"key < (3)"| LI
ROOT -->|"key ≥ (3)"| RI
LI -->|"key < (2)"| L4
LI -->|"key ≥ (2)"| L5
RI -->|"key < (4)"| L6
RI -->|"key ≥ (4)"| L7
L4 <-.->|"prev/next"| L5
L5 <-.->|"prev/next"| L6
L6 <-.->|"prev/next"| L7
Every split in this tree falls between rows with different col1 values, so
suffix truncation reduces all three separator keys to a single integer:
- Root sep = (3) — split between (2,'fox') and (3,'gnu') → col1 differs → stored as (3)
- Left internal sep = (2) — split between (1,'cat') and (2,'dog') → stored as (2)
- Right internal sep = (4) — split between (3,'ivy') and (4,'jay') → stored as (4)
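The truncation rule is easy to sketch: compare the two boundary keys attribute by attribute and keep only the prefix needed to tell them apart (a simplified model of what nbtree does):

```python
# Sketch of suffix truncation: a separator key only needs enough
# leading attributes to distinguish the last key of the left page
# from the first key of the right page.
def truncated_separator(left_high, right_low):
    for i, (l, r) in enumerate(zip(left_high, right_low)):
        if l != r:
            return right_low[:i + 1]   # first differing attribute suffices
    return right_low                   # identical prefix: keep the full key

print(truncated_separator((2, 'fox'), (3, 'gnu')))   # -> (3,)
print(truncated_separator((3, 'aaa'), (3, 'abc')))   # -> (3, 'abc')
```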
Query Traversal
Consider WHERE col1 = 3 AND col2 = 'hen'. The lookup reads the meta page to locate the root (block 1) and compares the search key against the separator (3): since (3, 'hen') ≥ (3), it descends right to internal block 3. There (3, 'hen') < (4), so it descends left to leaf block 6, which it scans to find (3, 'hen') with TID (0,10).
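The descent can be sketched as a toy Python model of the tree in the diagram above, with one separator per internal node and truncated separators represented as shorter tuples (which, conveniently, Python sorts before any longer tuple sharing the prefix, matching nbtree's "minus infinity" semantics for truncated attributes):

```python
# Toy model of the 3-level tree from the diagram: internal nodes
# hold one separator and two children; leaves hold (key, TID) pairs.
leaf4 = [((1, 'ant'), (0, 5)), ((1, 'bee'), (0, 2)), ((1, 'cat'), (0, 9))]
leaf5 = [((2, 'dog'), (0, 8)), ((2, 'elk'), (0, 12)), ((2, 'fox'), (0, 4))]
leaf6 = [((3, 'gnu'), (0, 6)), ((3, 'hen'), (0, 10)), ((3, 'ivy'), (0, 1))]
leaf7 = [((4, 'jay'), (0, 7)), ((4, 'koi'), (0, 3)), ((4, 'lynx'), (0, 11))]
left  = {'sep': (2,), 'lt': leaf4, 'ge': leaf5}
right = {'sep': (4,), 'lt': leaf6, 'ge': leaf7}
root  = {'sep': (3,), 'lt': left, 'ge': right}

def lookup(node, key):
    # Descend while the node is internal, comparing against the
    # (possibly truncated) separator; scan once a leaf is reached.
    while isinstance(node, dict):
        node = node['lt'] if key < node['sep'] else node['ge']
    return [tid for k, tid in node if k == key]

print(lookup(root, (3, 'hen')))   # -> [(0, 10)]
```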
Internal Pages Store Limited Information
Internal nodes route lookups downward but do not need the full key. They store separator keys. Since Postgres 12, the B-tree code applies suffix truncation when writing a separator: it strips trailing attributes that are not needed to distinguish the two subtrees being separated.
- If col1 alone differs at the split boundary → the separator is (col1).
- If the split falls between two rows that share the same col1 → the separator must include col2 to disambiguate, so it becomes (col1, col2).
In practice, most separators end up as just col1, keeping internal pages small
and the tree shallower. But the full key can appear whenever a page boundary
falls inside a run of identical col1 values.
Leaf pages store all actual index entries; structure:
[PageHeader]
[line pointers]
...
[free space]
...
[packed index tuples at the end]
Leaf pages also form a doubly-linked list, allowing efficient range scans.
Because items are sorted by (col1, col2), values with similar keys land on
adjacent leaf pages.
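A sketch of that range-scan behaviour: descend once to the starting leaf, then follow sibling pointers instead of re-descending (a toy model; real pages hold many more items):

```python
# Leaves as a singly-linked chain (the real list is doubly linked).
leaf4 = {'items': [(1, 'ant'), (1, 'bee'), (1, 'cat')], 'next': None}
leaf5 = {'items': [(2, 'dog'), (2, 'elk'), (2, 'fox')], 'next': None}
leaf6 = {'items': [(3, 'gnu'), (3, 'hen'), (3, 'ivy')], 'next': None}
leaf4['next'], leaf5['next'] = leaf5, leaf6

def range_scan(start_leaf, lo, hi):
    out, page = [], start_leaf
    while page is not None:
        for key in page['items']:
            if key >= hi:
                return out            # past the range: stop early
            if key >= lo:
                out.append(key)
        page = page['next']           # hop to the right sibling
    return out

print(range_scan(leaf4, (1, 'cat'), (3, 'hen')))
```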
Skip Scans: Using the Index Without the Leading Column
A query filtering only on col2 can’t traverse the B-tree meaningfully because
the tree is ordered by col1 first. Before Postgres 18, the planner could not
use t_idx1 to answer WHERE col2 = 'foo' other than by scanning the entire index.
Postgres 18 introduced native skip scans. The planner enumerates each
distinct value of col1, then performs a separate descent into the tree for each
one, effectively running:
col1 = <v1> AND col2 = 'foo'
col1 = <v2> AND col2 = 'foo'
...
This is only efficient when col1 has low cardinality — the number of
descents equals the number of distinct col1 values. An ENUM type is the
ideal fit:
CREATE TYPE status AS ENUM ('active', 'pending', 'closed');
ALTER TABLE test_tbl ADD COLUMN status status;
CREATE INDEX t_idx2 ON test_tbl(status, col2);
With three enum values the skip scan does three tree descents instead of a full
sequential scan. A high-cardinality integer col1 would make the same approach
impractical (thousands of descents).
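A sketch of the skip-scan strategy over a plain sorted list, with bisect standing in for a tree descent:

```python
import bisect

# Sketch of a skip scan over index entries ordered by (col1, col2):
# for each distinct leading value, do a separate binary-search
# "descent" for (value, target) instead of scanning everything.
entries = sorted(
    (status, name)
    for status in ('active', 'closed', 'pending')
    for name in ('aaa', 'foo', 'zzz'))
leading = [e[0] for e in entries]          # col1 column, same order

def skip_scan(col2_target):
    hits, i = [], 0
    while i < len(entries):
        c1 = entries[i][0]                 # next distinct col1 value
        j = bisect.bisect_left(entries, (c1, col2_target), i)
        if j < len(entries) and entries[j] == (c1, col2_target):
            hits.append(entries[j])
        i = bisect.bisect_right(leading, c1, i)   # jump past this group
    return hits

print(skip_scan('foo'))   # one "descent" per distinct status value
```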
On older Postgres versions a workaround is to enumerate the distinct leading values yourself, for example:
WHERE col1 IN (SELECT DISTINCT col1 FROM test_tbl) AND col2 = 'foo'
Practical Implications
- Column order is structural — put the most selective or most-filtered column first.
- Prefix lookups (col1 = ?) are ideal; col2-only queries require a skip scan (PG 18+) or a separate index.
- Low-cardinality leading columns (enums, status codes) make skip scans practical.
- Large text values are compressed inline rather than referenced via TOAST pointers; values that remain over the B-tree tuple size limit cannot be indexed directly.
- Locality on leaf pages drives range scan speed: similar keys land on adjacent pages.
Closing
Composite indexes in Postgres are literally concatenated keys stored in B-tree leaf pages. Once you understand that physical structure, most planner decisions become predictable rather than surprising.