Designing Composite Indexes in PostgreSQL: A Data Engineer’s Playbook
Deep dive on the leftmost rule, INCLUDE columns, and reading query plans like a pro.
Introduction — why this matters
You shipped a feature, traffic doubled, and suddenly a simple dashboard query is crawling. You tried CREATE INDEX, but PostgreSQL still chooses a Seq Scan… or worse, it uses your index but doesn’t actually get faster. The culprit is almost always the same: composite indexes designed without the leftmost rule in mind, missing INCLUDE columns, and a blind spot when reading plans.
This playbook shows you exactly how to design the right composite B-tree indexes, when to use INCLUDE, and how to read EXPLAIN so you can ship performant queries confidently.
Concepts & Architecture (clear and concise)
The leftmost (prefix) rule
Postgres uses a composite B-tree index from left to right. Practically:
- The index can serve predicates and ordering that start at the leftmost key and continue in order.
- Equality on earlier keys unlocks use of later keys.
- If you skip the first key in your predicates/order, the index becomes far less useful or unusable.
Example: index (customer_id, created_at, order_id)
- Good:
WHERE customer_id = ? AND created_at >= ? ORDER BY created_at DESC - Good:
WHERE customer_id = ? ORDER BY created_at - Weak/Bad:
WHERE created_at >= ?(skipscustomer_id) - Good:
WHERE customer_id = ? AND order_id = ?(equality on left keys lets planner jump deep)
Rule of thumb: Put the most selective, most consistently used equality predicates first, then range/sort keys.
INCLUDE columns (covering indexes)
INCLUDE columns do not participate in search or sort. They are payload columns stored in the index leaf so the executor can perform an Index-Only Scan (no table/heap fetch) when the visibility map cooperates.
- Use
INCLUDE(col_a, col_b)to cover SELECT lists or avoid heap lookups. - Don’t use
INCLUDEexpecting better filtering—it won’t filter. - Keep them lean; they still occupy I/O and memory.
Direction & NULLS in B-tree
B-tree scans can go forward or backward, so ASC/DESC in the index rarely matters for speed. What does matter is NULLS ordering (e.g., NULLS LAST), which can save a sort.
Bitmap vs Index vs Index-Only
- Index Scan: walks the index and fetches matching table rows.
- Bitmap Index Scan: collects TIDs from the index into a bitmap, then batch reads the table; great for many scattered rows.
- Index-Only Scan: serves the query entirely from the index (no heap fetch) when the visibility map says rows are all-visible.
A concrete schema & query patterns
We’ll use a realistic table:
CREATE TABLE orders (
order_id BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
status TEXT NOT NULL, -- 'new','paid','shipped','cancelled'
created_at TIMESTAMPTZ NOT NULL,
total_amount NUMERIC(12,2) NOT NULL,
city TEXT,
seller_id BIGINT
);
Top queries (access patterns)
- Get a customer’s recent orders (paginated):
SELECT order_id, created_at, status, total_amount
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50 OFFSET 0;
- Recent “paid” orders for a seller, sorted by time:
SELECT order_id, created_at, total_amount
FROM orders
WHERE seller_id = $1 AND status = 'paid' AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;
- Compute spend by customer in a date range:
SELECT sum(total_amount)
FROM orders
WHERE customer_id = $1
AND created_at >= $2 AND created_at < $3;
Designing the right composite indexes
Pattern 1 — equality + time sort
Query 1 wants equality on customer_id and sort/range on created_at.
-- Key columns for search/sort; cover SELECT with INCLUDE
CREATE INDEX idx_orders_cust_created
ON orders (customer_id, created_at DESC)
INCLUDE (status, total_amount);
Why:
- Leftmost:
customer_id(equality) comes first. - Next:
created_atsupports bothORDER BY created_at DESCand time slicing. INCLUDEprovidesstatusandtotal_amountfor potential Index-Only scans.
Pattern 2 — multi-equality + time range + sort
Query 2 filters by seller_id, status (both equality), then created_at (range/sort).
CREATE INDEX idx_orders_seller_status_created
ON orders (seller_id, status, created_at DESC)
INCLUDE (total_amount);
Why:
- Two equalities first enable deep seek and clustering of matches.
- Time last for range/sort.
INCLUDEcovers the select list.
Pattern 3 — range slice for analytics
Query 3 does equality on customer_id and a time window, no ORDER BY.
-- Query 1's index already serves Query 3 efficiently.
-- If this is mission-critical and huge, consider:
CREATE INDEX idx_orders_cust_created_range
ON orders (customer_id, created_at);
Note: Avoid index bloat. If Query 1’s index is present, it likely covers Query 3 too.
Reading the plan (what to look for)
Run with EXPLAIN (ANALYZE, BUFFERS):
Healthy plan for Query 1
Index Only Scan using idx_orders_cust_created on orders
Index Cond: (customer_id = $1)
Filter: (true) -- none
Heap Fetches: 0
What this tells you
- “Index Only Scan”: great—no heap I/O.
- “Index Cond” shows predicates satisfied by the key columns.
Heap Fetches: 0means visibility map was favorable (all-visible pages). If high, you’re not “index-only” in practice.
Red flags
Filter:holds key predicates → your index order is wrong (leftmost rule violated) or you’re missing an expression/functional index.Bitmap Heap Scanwith millions of rows → consider a different leading column or a partial index for hot subsets.- Unexpected
Sortnode → your index didn’t match the sort keys in order; check leftmost rule and NULLS ordering.
Best practices
1) Start with access patterns, not tables
- List your top N queries with predicates and ordering.
- For each, design the minimal composite index that satisfies left→right: equality first, then range/sort.
2) Use INCLUDE to cover, not to filter
- Add only columns needed to avoid heap fetches (SELECT list, GROUP BY, maybe a tiny JSONB field).
- Keep them small; big text fields or wide JSON can negate the benefit.
3) Beware of skipping the leftmost key
- If you often query by
created_atalone, you need another index (e.g.,(created_at)), not a hope that(customer_id, created_at)will be used.
4) Align ORDER BY with index order
- Same column sequence.
- B-tree supports both scan directions; prioritize matching the column sequence, and set
NULLS LAST/FIRSTif it removes a Sort.
5) Consider partial and expression indexes
- Partial:
CREATE INDEX idx_orders_paid_recent ON orders (seller_id, created_at DESC) WHERE status = 'paid' AND created_at >= now() - interval '30 days';Smaller, hotter, faster—if the predicate matches your queries. - Expression:
CREATE INDEX idx_orders_created_date ON orders ((date(created_at)));Use when queries filter ondate(created_at)(otherwise you’d force a Filter).
6) Validate with EXPLAIN + pg_stat_statements
- Confirm “Index Cond” reflects your leftmost keys.
- Watch
rows,loops,Heap Fetches, andBUFFERS. - Use
pg_stat_statementsto identify top time/rows queries and regressions after index changes.
7) Use extended statistics for multi-column correlation
When selectivity depends on column combinations, help the planner:
CREATE STATISTICS stat_seller_status (dependencies) ON seller_id, status FROM orders;
ANALYZE orders;
This reduces row misestimation and improves index choice.
8) Keep bloat in check
- Periodically review redundant/unused indexes (
pg_stat_user_indexes). - Measure write amplification: every extra index slows INSERT/UPDATE/DELETE.
- Favor 1 good composite over 3 overlapping single-column indexes.
Common pitfalls (and fixes)
- Pitfall: Index
(created_at, customer_id)but queries areWHERE customer_id = ? ORDER BY created_at.
Fix:(customer_id, created_at)—put equality first. - Pitfall: Add
INCLUDE(total_amount)expectingWHERE total_amount > 100to use it.
Fix: Only key columns filter. Maketotal_amounta key if you filter on it. - Pitfall: Function on column blocks index.
Fix: Either rewrite predicate (created_at >= $1) or create expression index that matches the function used. - Pitfall: Huge
Bitmap Heap Scanwith rechecks.
Fix: Consider a different leading column, partial index, or add equality filters before range. - Pitfall: Relying on index-only scans but seeing
Heap Fetches > 0.
Fix: Tables are being updated frequently; visibility map not fully set. Index-only scans still help, just less. Consider covering fewer columns to keep the index small.
Quick design table (cheat sheet)
| Query pattern | Good composite index | Optional INCLUDE | Notes |
|---|---|---|---|
cust_id = ? + ORDER BY created_at | (customer_id, created_at) | (status, total_amount) | Classic feed/timeline |
seller_id = ? AND status = 'paid' AND created_at >= … ORDER BY created_at | (seller_id, status, created_at) | (total_amount) | Equality before range |
| Range by date only | (created_at) | — | If truly needed often |
| Aggregate by customer over a window | (customer_id, created_at) | — | Serves filters + grouping paths |
| Hot subset (recent, paid) | Partial index on same order | INCLUDE as needed | Smaller + faster for hot paths |
Filter on function date(created_at)=… | ((date(created_at))) | — | Expression index |
Real example: end-to-end
-- 1) Indexes
CREATE INDEX idx_orders_cust_created
ON orders (customer_id, created_at DESC)
INCLUDE (status, total_amount);
CREATE INDEX idx_orders_seller_status_created
ON orders (seller_id, status, created_at DESC)
INCLUDE (total_amount);
-- 2) Extended stats to help planner
CREATE STATISTICS stat_seller_status (dependencies)
ON seller_id, status FROM orders;
ANALYZE orders;
-- 3) Query we care about
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, created_at, status, total_amount
FROM orders
WHERE customer_id = 4242
ORDER BY created_at DESC
LIMIT 50;
What you want to see:
Index Only Scan using idx_orders_cust_createdIndex Cond: (customer_id = 4242)- No explicit
Sortnode (the index delivers order) - Low
BUFFERSand negligibleHeap Fetches
Conclusion + Takeaways
Composite indexes are powerful—but only if you respect the leftmost rule, use INCLUDE for coverage not filtering, and verify with EXPLAIN. Start from real access patterns, put equality first, then range/sort, and be ruthless about avoiding redundant or misordered indexes. That’s how you keep latency low without burning write throughput.
Takeaways
- Design from queries, not columns.
- Leftmost: equality → range/sort → INCLUDE to cover.
- Read plans: focus on Index Cond, Sort, Heap Fetches.
- Use partial and expression indexes when they match your predicates.
- Maintain hygiene: measure, prune, and re-analyze.
Internal link ideas
- “Partial Indexes: When a Smaller Index Beats a Bigger One”
- “Expression Indexes and Functional Predicates in PostgreSQL”
- “Understanding Bitmap vs Index Scans in the Planner”
- “PostgreSQL Visibility Map: The Hidden Lever Behind Index-Only Scans”
- “Choosing Between B-tree, GIN, GiST, and BRIN”
Image prompt
“A clean, modern PostgreSQL indexing diagram showing a composite B-tree (seller_id, status, created_at) with arrows demonstrating the leftmost rule, and a separate callout for INCLUDE(total_amount) enabling index-only scans — minimalistic, high contrast, 3D isometric style.”
Tags
#PostgreSQL #Indexes #CompositeIndex #DataEngineering #QueryOptimization #INCLUDEColumns #LeftmostRule #EXPLAIN #Performance #DatabaseDesign








Leave a Reply