Designing Composite Indexes in PostgreSQL: A Data Engineer’s Playbook

Deep dive on the leftmost rule, INCLUDE columns, and reading query plans like a pro.

Introduction — why this matters

You shipped a feature, traffic doubled, and suddenly a simple dashboard query is crawling. You tried CREATE INDEX, but PostgreSQL still chooses a Seq Scan… or worse, it uses your index but doesn’t actually get faster. The culprit is almost always the same: composite indexes designed without the leftmost rule in mind, missing INCLUDE columns, and a blind spot when reading plans.

This playbook shows you exactly how to design the right composite B-tree indexes, when to use INCLUDE, and how to read EXPLAIN so you can ship performant queries confidently.

Concepts & Architecture (clear and concise)

The leftmost (prefix) rule

Postgres uses a composite B-tree index from left to right. Practically:

The index can serve predicates and ordering that start at the leftmost key and continue in order.
Equality on earlier keys unlocks use of later keys.
If you skip the first key in your predicates/order, the index becomes far less useful or unusable.

Example: index (customer_id, created_at, order_id)

Good: WHERE customer_id = ? AND created_at >= ? ORDER BY created_at DESC
Good: WHERE customer_id = ? ORDER BY created_at
Weak/Bad: WHERE created_at >= ? (skips customer_id)
Good: WHERE customer_id = ? AND order_id = ? (equality on left keys lets planner jump deep)

Rule of thumb: Put the most selective, most consistently used equality predicates first, then range/sort keys.

`INCLUDE` columns (covering indexes)

INCLUDE columns do not participate in search or sort. They are payload columns stored in the index leaf so the executor can perform an Index-Only Scan (no table/heap fetch) when the visibility map cooperates.

Use INCLUDE(col_a, col_b) to cover SELECT lists or avoid heap lookups.
Don’t use INCLUDE expecting better filtering—it won’t filter.
Keep them lean; they still occupy I/O and memory.

Direction & NULLS in B-tree

B-tree scans can go forward or backward, so ASC/DESC in the index rarely matters for speed. What does matter is NULLS ordering (e.g., NULLS LAST), which can save a sort.

Bitmap vs Index vs Index-Only

Index Scan: walks the index and fetches matching table rows.
Bitmap Index Scan: collects TIDs from the index into a bitmap, then batch reads the table; great for many scattered rows.
Index-Only Scan: serves the query entirely from the index (no heap fetch) when the visibility map says rows are all-visible.

A concrete schema & query patterns

We’ll use a realistic table:

CREATE TABLE orders (
  order_id       BIGSERIAL PRIMARY KEY,
  customer_id    BIGINT NOT NULL,
  status         TEXT NOT NULL,        -- 'new','paid','shipped','cancelled'
  created_at     TIMESTAMPTZ NOT NULL,
  total_amount   NUMERIC(12,2) NOT NULL,
  city           TEXT,
  seller_id      BIGINT
);

Top queries (access patterns)

Get a customer’s recent orders (paginated):

SELECT order_id, created_at, status, total_amount
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50 OFFSET 0;

Recent “paid” orders for a seller, sorted by time:

SELECT order_id, created_at, total_amount
FROM orders
WHERE seller_id = $1 AND status = 'paid' AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;

Compute spend by customer in a date range:

SELECT sum(total_amount)
FROM orders
WHERE customer_id = $1
  AND created_at >= $2 AND created_at < $3;

Designing the right composite indexes

Pattern 1 — equality + time sort

Query 1 wants equality on customer_id and sort/range on created_at.

-- Key columns for search/sort; cover SELECT with INCLUDE
CREATE INDEX idx_orders_cust_created
  ON orders (customer_id, created_at DESC)
  INCLUDE (status, total_amount);

Why:

Leftmost: customer_id (equality) comes first.
Next: created_at supports both ORDER BY created_at DESC and time slicing.
INCLUDE provides status and total_amount for potential Index-Only scans.

Pattern 2 — multi-equality + time range + sort

Query 2 filters by seller_id, status (both equality), then created_at (range/sort).

CREATE INDEX idx_orders_seller_status_created
  ON orders (seller_id, status, created_at DESC)
  INCLUDE (total_amount);

Why:

Two equalities first enable deep seek and clustering of matches.
Time last for range/sort.
INCLUDE covers the select list.

Pattern 3 — range slice for analytics

Query 3 does equality on customer_id and a time window, no ORDER BY.

-- Query 1's index already serves Query 3 efficiently.
-- If this is mission-critical and huge, consider:
CREATE INDEX idx_orders_cust_created_range
  ON orders (customer_id, created_at);

Note: Avoid index bloat. If Query 1’s index is present, it likely covers Query 3 too.

Reading the plan (what to look for)

Run with EXPLAIN (ANALYZE, BUFFERS):

Healthy plan for Query 1

Index Only Scan using idx_orders_cust_created on orders
  Index Cond: (customer_id = $1)
  Filter: (true) -- none
  Heap Fetches: 0

What this tells you

“Index Only Scan”: great—no heap I/O.
“Index Cond” shows predicates satisfied by the key columns.
Heap Fetches: 0 means visibility map was favorable (all-visible pages). If high, you’re not “index-only” in practice.

Red flags

Filter: holds key predicates → your index order is wrong (leftmost rule violated) or you’re missing an expression/functional index.
Bitmap Heap Scan with millions of rows → consider a different leading column or a partial index for hot subsets.
Unexpected Sort node → your index didn’t match the sort keys in order; check leftmost rule and NULLS ordering.

Best practices

1) Start with access patterns, not tables

List your top N queries with predicates and ordering.
For each, design the minimal composite index that satisfies left→right: equality first, then range/sort.

2) Use `INCLUDE` to cover, not to filter

Add only columns needed to avoid heap fetches (SELECT list, GROUP BY, maybe a tiny JSONB field).
Keep them small; big text fields or wide JSON can negate the benefit.

3) Beware of skipping the leftmost key

If you often query by created_at alone, you need another index (e.g., (created_at)), not a hope that (customer_id, created_at) will be used.

4) Align ORDER BY with index order

Same column sequence.
B-tree supports both scan directions; prioritize matching the column sequence, and set NULLS LAST/FIRST if it removes a Sort.

5) Consider partial and expression indexes

Partial: CREATE INDEX idx_orders_paid_recent ON orders (seller_id, created_at DESC) WHERE status = 'paid' AND created_at >= now() - interval '30 days'; Smaller, hotter, faster—if the predicate matches your queries.
Expression: CREATE INDEX idx_orders_created_date ON orders ((date(created_at))); Use when queries filter on date(created_at) (otherwise you’d force a Filter).

6) Validate with `EXPLAIN` + `pg_stat_statements`

Confirm “Index Cond” reflects your leftmost keys.
Watch rows, loops, Heap Fetches, and BUFFERS.
Use pg_stat_statements to identify top time/rows queries and regressions after index changes.

7) Use extended statistics for multi-column correlation

When selectivity depends on column combinations, help the planner:

CREATE STATISTICS stat_seller_status (dependencies) ON seller_id, status FROM orders;
ANALYZE orders;

This reduces row misestimation and improves index choice.

8) Keep bloat in check

Periodically review redundant/unused indexes (pg_stat_user_indexes).
Measure write amplification: every extra index slows INSERT/UPDATE/DELETE.
Favor 1 good composite over 3 overlapping single-column indexes.

Common pitfalls (and fixes)

Pitfall: Index (created_at, customer_id) but queries are WHERE customer_id = ? ORDER BY created_at.
Fix: (customer_id, created_at)—put equality first.
Pitfall: Add INCLUDE(total_amount) expecting WHERE total_amount > 100 to use it.
Fix: Only key columns filter. Make total_amount a key if you filter on it.
Pitfall: Function on column blocks index.
Fix: Either rewrite predicate (created_at >= $1) or create expression index that matches the function used.
Pitfall: Huge Bitmap Heap Scan with rechecks.
Fix: Consider a different leading column, partial index, or add equality filters before range.
Pitfall: Relying on index-only scans but seeing Heap Fetches > 0.
Fix: Tables are being updated frequently; visibility map not fully set. Index-only scans still help, just less. Consider covering fewer columns to keep the index small.

Quick design table (cheat sheet)

Query pattern	Good composite index	Optional INCLUDE	Notes
`cust_id = ?` + `ORDER BY created_at`	`(customer_id, created_at)`	`(status, total_amount)`	Classic feed/timeline
`seller_id = ? AND status = 'paid' AND created_at >= … ORDER BY created_at`	`(seller_id, status, created_at)`	`(total_amount)`	Equality before range
Range by date only	`(created_at)`	—	If truly needed often
Aggregate by customer over a window	`(customer_id, created_at)`	—	Serves filters + grouping paths
Hot subset (recent, paid)	Partial index on same order	INCLUDE as needed	Smaller + faster for hot paths
Filter on function `date(created_at)=…`	`((date(created_at)))`	—	Expression index

Real example: end-to-end

-- 1) Indexes
CREATE INDEX idx_orders_cust_created
  ON orders (customer_id, created_at DESC)
  INCLUDE (status, total_amount);

CREATE INDEX idx_orders_seller_status_created
  ON orders (seller_id, status, created_at DESC)
  INCLUDE (total_amount);

-- 2) Extended stats to help planner
CREATE STATISTICS stat_seller_status (dependencies)
ON seller_id, status FROM orders;
ANALYZE orders;

-- 3) Query we care about
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, created_at, status, total_amount
FROM orders
WHERE customer_id = 4242
ORDER BY created_at DESC
LIMIT 50;

What you want to see:

Index Only Scan using idx_orders_cust_created
Index Cond: (customer_id = 4242)
No explicit Sort node (the index delivers order)
Low BUFFERS and negligible Heap Fetches

Conclusion + Takeaways

Composite indexes are powerful—but only if you respect the leftmost rule, use INCLUDE for coverage not filtering, and verify with EXPLAIN. Start from real access patterns, put equality first, then range/sort, and be ruthless about avoiding redundant or misordered indexes. That’s how you keep latency low without burning write throughput.

Takeaways

Design from queries, not columns.
Leftmost: equality → range/sort → INCLUDE to cover.
Read plans: focus on Index Cond, Sort, Heap Fetches.
Use partial and expression indexes when they match your predicates.
Maintain hygiene: measure, prune, and re-analyze.

Internal link ideas

“Partial Indexes: When a Smaller Index Beats a Bigger One”
“Expression Indexes and Functional Predicates in PostgreSQL”
“Understanding Bitmap vs Index Scans in the Planner”
“PostgreSQL Visibility Map: The Hidden Lever Behind Index-Only Scans”
“Choosing Between B-tree, GIN, GiST, and BRIN”

Image prompt

“A clean, modern PostgreSQL indexing diagram showing a composite B-tree (seller_id, status, created_at) with arrows demonstrating the leftmost rule, and a separate callout for INCLUDE(total_amount) enabling index-only scans — minimalistic, high contrast, 3D isometric style.”

Data/ML Engineer Blog