How Postgres Stores Composite Indexes on Disk
Composite indexes can seem like an easy way to “put two columns together and get great performance benefits”, as if Postgres just figures it out. But the physical representation is not intuitive until you have looked inside a B-tree page. Once you understand how these indexes are stored, you can see why some queries get a real lift from a composite index while others barely benefit.
This post walks through exactly how a composite index like the one below will end up getting stored on disk.
CREATE TABLE test_tbl(
id bigserial primary key,
col1 int,
col2 text
);
CREATE INDEX t_idx1 ON test_tbl(col1, col2);
Composite Indexes Are One Key, Not Two
t_idx1 is a single B-tree index whose key is the pair (col1, col2).
Postgres stores that pair as one logical key and orders it lexicographically:
- Sort by col1
- For rows with the same col1, sort by col2
Everything about how queries use this index flows from this rule.
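The rule can be sketched in Python (a toy model, not Postgres code): tuples compare lexicographically in exactly the way composite keys do.

```python
# Toy model: composite B-tree keys compare lexicographically,
# just like Python tuples -- col1 first, then col2 on ties.
keys = [(2, 'abc'), (1, 'bbb'), (2, 'xyz'), (1, 'aaa')]

# Sorting by the pair reproduces the index order.
ordered = sorted(keys)
print(ordered)
# (1, 'aaa') comes before (1, 'bbb'): same col1, so col2 breaks the tie.
```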
On-Disk Storage: A Separate File, B-Tree Structured
t_idx1 is its own table-like structure on disk:
- It lives in pgdata/base/<db_oid>/<relfilenode>
- It uses the B-tree access method
- It is split into 8 kB pages
- Pages are arranged as:
  - Meta page (block 0)
  - Internal pages (routing nodes)
  - Leaf pages (actual key + TID entries)
This file has no connection to the heap except via the TIDs stored inside each tuple.
graph TD
subgraph file["Index File (relfilenode)"]
M["Block 0 · Meta Page\nroot=1, level=2, fast-root=1"]
I1["Block 1 · Internal Page\nsep: (col1=2)"]
L1["Block 2 · Leaf Page\n(1,'aaa', TID)\n(1,'bbb', TID)"]
L2["Block 3 · Leaf Page\n(2,'abc', TID)\n(2,'xyz', TID)"]
end
M --> I1
I1 -->|"left child\ncol1 < 2"| L1
I1 -->|"right child\ncol1 ≥ 2"| L2
L1 <-->|"prev / next ptrs"| L2
For large indexes Postgres splits the file into 1 GB segments (.1, .2, …) but
the page structure and B-tree layout remain identical across segments.
Each entry in a leaf page stores:
- col1 - a 4-byte integer
- col2 - a text datum
- TID - a 6-byte heap pointer (block number + offset)
These are wrapped in an IndexTupleData header, which holds the TID itself plus flags and the tuple size, followed by a null bitmap if needed.
Text values follow varlena rules:
- Short text: stored inline ([length][bytes])
- Large text: compressed inline where possible. Index tuples cannot hold out-of-line TOAST pointers, so a value that is still too large after compression (roughly a third of a page for B-trees) is rejected with an error rather than stored.
A leaf entry is physically:
[IndexTupleHeader: TID (6 bytes) + t_info (2 bytes of flags and size)]
[null bitmap, if present]
[col1 value]
[col2 varlena]
Note that the TID sits at the front of the tuple, inside the header, not after the key values.
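As an illustration, here is a simplified sketch of that layout using Python's struct module. It omits the null bitmap and varlena headers and leaves t_info zero, but the field order follows IndexTupleData: the 6-byte TID (three 16-bit fields) comes first, inside the header.

```python
import struct

# Simplified sketch of a leaf index tuple (no null bitmap, no
# varlena header, t_info left 0). Field order follows IndexTupleData:
# the 6-byte TID comes FIRST, inside the header.
def pack_index_tuple(block, offset, col1, col2):
    tid = struct.pack('<HHH', block >> 16, block & 0xFFFF, offset)  # ItemPointerData
    t_info = struct.pack('<H', 0)      # flags + tuple size (zeroed in this toy)
    key1 = struct.pack('<i', col1)     # 4-byte integer key
    key2 = col2.encode()               # text payload, headers omitted
    return tid + t_info + key1 + key2

raw = pack_index_tuple(0, 5, 1, 'ant')
blk_hi, blk_lo, off = struct.unpack_from('<HHH', raw, 0)
(col1,) = struct.unpack_from('<i', raw, 8)
print((blk_hi << 16 | blk_lo, off), col1, raw[12:].decode())
```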
How Composite Keys Are Ordered
The B-tree compares keys in order:
(col1=1, col2='aaa')
(col1=1, col2='bbb')
(col1=2, col2='abc')
If a query filters only on col2, the index cannot be traversed meaningfully
because the tree is structured around col1.
Column order in the index defines physical locality on disk.
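A toy demonstration of why the leading column matters, using bisect on a sorted Python list standing in for the leaf entries:

```python
import bisect

# Index entries in (col1, col2) order -- the on-disk leaf order.
entries = [(1, 'aaa'), (1, 'bbb'), (2, 'abc'), (2, 'xyz'), (3, 'abc')]

# Filtering on the leading column is a contiguous slice: binary
# search finds its boundaries without touching other entries.
lo = bisect.bisect_left(entries, (2, ''))
hi = bisect.bisect_left(entries, (3, ''))
print(entries[lo:hi])          # all col1 = 2 rows, found in O(log n)

# Filtering only on col2 = 'abc' gives no usable boundaries:
# matches are scattered, so every entry must be inspected.
print([e for e in entries if e[1] == 'abc'])
```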
A Concrete Example: 12 Rows Across Three Levels
Consider inserting these 12 rows (insertion order is irrelevant — the index always maintains sorted order):
INSERT INTO test_tbl (col1, col2) VALUES
(3, 'ivy'), (1, 'bee'), (4, 'koi'),
(2, 'fox'), (1, 'ant'), (3, 'gnu'),
(4, 'jay'), (2, 'dog'), (1, 'cat'),
(3, 'hen'), (4, 'lynx'),(2, 'elk');
The index sorts them by (col1, col2). TIDs reflect insertion order, not sort
order:
| col1 | col2 | TID |
|---|---|---|
| 1 | ‘ant’ | (0,5) |
| 1 | ‘bee’ | (0,2) |
| 1 | ‘cat’ | (0,9) |
| 2 | ‘dog’ | (0,8) |
| 2 | ‘elk’ | (0,12) |
| 2 | ‘fox’ | (0,4) |
| 3 | ‘gnu’ | (0,6) |
| 3 | ‘hen’ | (0,10) |
| 3 | ‘ivy’ | (0,1) |
| 4 | ‘jay’ | (0,7) |
| 4 | ‘koi’ | (0,3) |
| 4 | ‘lynx’ | (0,11) |
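The table above can be reproduced with a few lines of Python (a toy model: a sorted list standing in for the index):

```python
# The 12 rows in insertion order; the TID records where each row
# landed in heap page 0 (offset = insertion position).
rows = [(3, 'ivy'), (1, 'bee'), (4, 'koi'), (2, 'fox'), (1, 'ant'),
        (3, 'gnu'), (4, 'jay'), (2, 'dog'), (1, 'cat'), (3, 'hen'),
        (4, 'lynx'), (2, 'elk')]

# Attach TIDs before sorting: heap position is fixed at insert time.
entries = [(c1, c2, (0, i + 1)) for i, (c1, c2) in enumerate(rows)]

# The index keeps entries sorted by (col1, col2); TIDs stay put.
entries.sort(key=lambda e: (e[0], e[1]))
for c1, c2, tid in entries[:3]:
    print(c1, c2, tid)
```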
With 3 entries per leaf page the index grows to three levels: a root internal page, two intermediate internal pages below it, and four leaf pages.
graph TD
META["Meta Page · block 0\nroot → block 1"]
ROOT["Root Internal · block 1\nsep = (3)"]
LI["Left Internal · block 2\nsep = (2)"]
RI["Right Internal · block 3\nsep = (4)"]
L4["Leaf · block 4\n(1,'ant') TID(0,5)\n(1,'bee') TID(0,2)\n(1,'cat') TID(0,9)"]
L5["Leaf · block 5\n(2,'dog') TID(0,8)\n(2,'elk') TID(0,12)\n(2,'fox') TID(0,4)"]
L6["Leaf · block 6\n(3,'gnu') TID(0,6)\n(3,'hen') TID(0,10)\n(3,'ivy') TID(0,1)"]
L7["Leaf · block 7\n(4,'jay') TID(0,7)\n(4,'koi') TID(0,3)\n(4,'lynx') TID(0,11)"]
META --> ROOT
ROOT -->|"key < (3)"| LI
ROOT -->|"key ≥ (3)"| RI
LI -->|"key < (2)"| L4
LI -->|"key ≥ (2)"| L5
RI -->|"key < (4)"| L6
RI -->|"key ≥ (4)"| L7
L4 <-.->|"prev/next"| L5
L5 <-.->|"prev/next"| L6
L6 <-.->|"prev/next"| L7
Every split in this tree falls between rows with different col1 values, so
suffix truncation reduces all three separator keys to a single integer:
- Root sep = (3) — split between (2,'fox') and (3,'gnu') → col1 differs → stored as (3)
- Left internal sep = (2) — split between (1,'cat') and (2,'dog') → stored as (2)
- Right internal sep = (4) — split between (3,'ivy') and (4,'jay') → stored as (4)
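The truncation rule is easy to sketch: compare the two boundary keys attribute by attribute and keep only the prefix needed to tell them apart (a simplified model of what nbtree does):

```python
# Sketch of suffix truncation: a separator key only needs enough
# leading attributes to distinguish the last key of the left page
# from the first key of the right page.
def truncated_separator(left_high, right_low):
    for i, (l, r) in enumerate(zip(left_high, right_low)):
        if l != r:
            return right_low[:i + 1]   # first differing attribute suffices
    return right_low                   # identical prefix: keep the full key

print(truncated_separator((2, 'fox'), (3, 'gnu')))   # -> (3,)
print(truncated_separator((3, 'aaa'), (3, 'abc')))   # -> (3, 'abc')
```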
Query Traversal
Consider WHERE col1 = 3 AND col2 = 'hen'. The lookup reads the meta page to locate the root (block 1) and compares the search key against the separator (3): since (3, 'hen') ≥ (3), it descends right to internal block 3. There (3, 'hen') < (4), so it descends left to leaf block 6, which it scans to find (3, 'hen') with TID (0,10).
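The descent can be sketched as a toy Python model of the tree in the diagram above, with one separator per internal node and truncated separators represented as shorter tuples (which, conveniently, Python sorts before any longer tuple sharing the prefix, matching nbtree's "minus infinity" semantics for truncated attributes):

```python
# Toy model of the 3-level tree from the diagram: internal nodes
# hold one separator and two children; leaves hold (key, TID) pairs.
leaf4 = [((1, 'ant'), (0, 5)), ((1, 'bee'), (0, 2)), ((1, 'cat'), (0, 9))]
leaf5 = [((2, 'dog'), (0, 8)), ((2, 'elk'), (0, 12)), ((2, 'fox'), (0, 4))]
leaf6 = [((3, 'gnu'), (0, 6)), ((3, 'hen'), (0, 10)), ((3, 'ivy'), (0, 1))]
leaf7 = [((4, 'jay'), (0, 7)), ((4, 'koi'), (0, 3)), ((4, 'lynx'), (0, 11))]
left  = {'sep': (2,), 'lt': leaf4, 'ge': leaf5}
right = {'sep': (4,), 'lt': leaf6, 'ge': leaf7}
root  = {'sep': (3,), 'lt': left, 'ge': right}

def lookup(node, key):
    # Descend while the node is internal, comparing against the
    # (possibly truncated) separator; scan once a leaf is reached.
    while isinstance(node, dict):
        node = node['lt'] if key < node['sep'] else node['ge']
    return [tid for k, tid in node if k == key]

print(lookup(root, (3, 'hen')))   # -> [(0, 10)]
```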
Internal Pages Store Limited Information
Internal nodes route lookups downward but do not need the full key. They store separator keys. Since Postgres 12, the B-tree code applies suffix truncation when writing a separator: it strips trailing attributes that are not needed to distinguish the two subtrees being separated.
- If col1 alone differs at the split boundary → the separator is (col1).
- If the split falls between two rows that share the same col1 → the separator must include col2 to disambiguate, so it becomes (col1, col2).
In practice, most separators end up as just col1, keeping internal pages small
and the tree shallower. But the full key can appear whenever a page boundary
falls inside a run of identical col1 values.
Leaf pages store all actual index entries; structure:
[PageHeader]
[line pointers]
...
[free space]
...
[packed index tuples at the end]
Leaf pages also form a doubly-linked list, allowing efficient range scans.
Because items are sorted by (col1, col2), values with similar keys land on
adjacent leaf pages.
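A sketch of that range-scan behaviour: descend once to the starting leaf, then follow sibling pointers instead of re-descending (a toy model; real pages hold many more items):

```python
# Leaves as a singly-linked chain (the real list is doubly linked).
leaf4 = {'items': [(1, 'ant'), (1, 'bee'), (1, 'cat')], 'next': None}
leaf5 = {'items': [(2, 'dog'), (2, 'elk'), (2, 'fox')], 'next': None}
leaf6 = {'items': [(3, 'gnu'), (3, 'hen'), (3, 'ivy')], 'next': None}
leaf4['next'], leaf5['next'] = leaf5, leaf6

def range_scan(start_leaf, lo, hi):
    out, page = [], start_leaf
    while page is not None:
        for key in page['items']:
            if key >= hi:
                return out            # past the range: stop early
            if key >= lo:
                out.append(key)
        page = page['next']           # hop to the right sibling
    return out

print(range_scan(leaf4, (1, 'cat'), (3, 'hen')))
```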
Skip Scans: Using the Index Without the Leading Column
A query filtering only on col2 can’t traverse the B-tree meaningfully because
the tree is ordered by col1 first. Before Postgres 18, the planner could not
use t_idx1 to answer WHERE col2 = 'foo' other than by scanning the entire index.
Postgres 18 introduced native skip scans. The planner enumerates each
distinct value of col1, then performs a separate descent into the tree for each
one, effectively running:
col1 = <v1> AND col2 = 'foo'
col1 = <v2> AND col2 = 'foo'
...
This is only efficient when col1 has low cardinality — the number of
descents equals the number of distinct col1 values. An ENUM type is the
ideal fit:
CREATE TYPE status AS ENUM ('active', 'pending', 'closed');
ALTER TABLE test_tbl ADD COLUMN status status;
CREATE INDEX t_idx2 ON test_tbl(status, col2);
With three enum values the skip scan does three tree descents instead of a full
sequential scan. A high-cardinality integer col1 would make the same approach
impractical (thousands of descents).
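A sketch of the skip-scan strategy over a plain sorted list, with bisect standing in for a tree descent:

```python
import bisect

# Sketch of a skip scan over index entries ordered by (col1, col2):
# for each distinct leading value, do a separate binary-search
# "descent" for (value, target) instead of scanning everything.
entries = sorted(
    (status, name)
    for status in ('active', 'closed', 'pending')
    for name in ('aaa', 'foo', 'zzz'))
leading = [e[0] for e in entries]          # col1 column, same order

def skip_scan(col2_target):
    hits, i = [], 0
    while i < len(entries):
        c1 = entries[i][0]                 # next distinct col1 value
        j = bisect.bisect_left(entries, (c1, col2_target), i)
        if j < len(entries) and entries[j] == (c1, col2_target):
            hits.append(entries[j])
        i = bisect.bisect_right(leading, c1, i)   # jump past this group
    return hits

print(skip_scan('foo'))   # one "descent" per distinct status value
```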
On older Postgres versions a workaround is to enumerate the distinct leading values yourself, for example:
WHERE col1 IN (SELECT DISTINCT col1 FROM test_tbl) AND col2 = 'foo'
Practical Implications
- Column order is structural — put the most selective or most-filtered column first.
- Prefix lookups (col1 = ?) are ideal; col2-only queries require a skip scan (PG 18+) or a separate index.
- Low-cardinality leading columns (enums, status codes) make skip scans practical.
- Large text values are compressed inline rather than referenced via TOAST pointers; values that remain over the B-tree tuple size limit cannot be indexed directly.
- Locality on leaf pages drives range scan speed: similar keys land on adjacent pages.
Closing
Composite indexes in Postgres are literally concatenated keys stored in B-tree leaf pages. Once you understand that physical structure, most planner decisions become predictable rather than surprising.