JSONB Without Regret: When to Model vs Index Paths
Decision tree + costed examples for PostgreSQL engineers
Introduction — the “just JSON it” trap
You ship a feature fast by tossing semi-structured payloads into jsonb. Two quarters later, product wants sorting, filtering, and analytics on nested fields. Queries slow down, indexes balloon, and migrations feel radioactive. Sound familiar?
This guide gives you a ruthless decision tree and costed, reproducible examples so you can choose—confidently—between modeling columns vs indexing JSONB paths.
TL;DR Decision Tree
Q1. Is the field queried in every hot path (filters/joins/order-by) and stable?
→ Yes: Model as typed columns. Add B-Tree indexes. Optionally keep original JSONB for audit/raw.
→ No: go to Q2.
Q2. Is the access pattern selective (<10–20% rows) and on a few known paths?
→ Yes: Keep JSONB. Create GIN index on those paths (jsonb_path_ops or partial GIN).
→ No: go to Q3.
Q3. Do you need range sorts, prefix searches, or frequent ORDER BY on the path?
→ Yes: Model columns (B-Tree/Brin) or use functional indexes on extracted values.
→ No: go to Q4.
Q4. Will the schema evolve unpredictably across tenants/events?
→ Yes: JSONB-first. Add narrow partial GIN per tenant/type. Consider hybrid (promote only hot fields).
→ No: Model columns.
Q5. Is write amplification or index maintenance a bottleneck?
→ Yes: Prefer modeling (fewer/lighter indexes) and computed/generated columns.
→ No: JSONB with targeted GIN is fine.
Concepts You Must Weigh
- Query selectivity: GIN shines when queries match a small fraction of rows. Low selectivity (e.g., 60% of table) wastes index lookups.
- Cardinality & sorting: B-Tree excels at range scans and
ORDER BY. GIN does not. - Write cost: GIN update cost grows with document size and path churn.
- Index size & memory: GIN on wide JSONB can dwarf table size, pressuring cache and vacuum.
- Evolution pressure: If fields change weekly, strict columns become a migration tax.
- Observability: JSONB hides type drift; modeled columns force validation upfront.
Reference Architecture Patterns
1) Modeled Columns + Raw JSONB (Hybrid)
- Table has typed columns for hot fields.
payload jsonbretains full event for audit.- Generated columns keep model in sync.
Why: Fast reads, small indexes, flexible raw storage.
2) JSONB-First with Targeted GIN
- Everything lands in
payload jsonb. - Create partial GIN indexes on few high-signal paths.
- Use
jsonb_path_opsfor containment, orjsonb_opsfor broader operators (bigger index).
Why: Fast iterate, pay-as-you-go indexing.
3) Extract-on-Write Materialization
- Ingest into staging table with JSONB.
- A trigger or job extracts hot fields into a fact table.
- Analytics hit the fact table; product reads JSONB.
Why: Keeps OLTP and analytics healthy without over-indexing JSONB.
Costed, Reproducible Examples
All costs below are illustrative and meant to show orders of magnitude and trade-offs. Reproduce in your environment with
EXPLAIN (ANALYZE, BUFFERS).
Setup
CREATE TABLE orders (
id bigserial primary key,
created_at timestamptz not null default now(),
customer_id bigint,
-- hybrid: keep raw
payload jsonb not null
);
-- Sample JSONB:
-- {
-- "status": "shipped",
-- "total": 129.99,
-- "shipping": {"city":"Austin","state":"TX","zip":"78701"},
-- "items":[{"sku":"A1","qty":2},{"sku":"B2","qty":1}]
-- }
Workload
- 10M orders.
- Hot queries:
- Filter by
status, time window. - Top cities by shipped orders (group-by city).
- Look up orders containing SKU “B2”.
- Sort by
total(range scans).
- Filter by
Option A — JSONB only, targeted indexes
Indexes
-- 1) status filter: narrow partial GIN on status 'shipped'
CREATE INDEX idx_orders_payload_status_shipped
ON orders USING gin ((payload->>'status'))
WHERE (payload->>'status') = 'shipped';
-- 2) city filter: functional B-Tree on extracted city (tiny index)
CREATE INDEX idx_orders_city
ON orders ((payload->'shipping'->>'city'));
-- 3) items contains SKU: full GIN on items array (path ops)
CREATE INDEX idx_orders_items_gin
ON orders USING gin ((payload->'items') jsonb_path_ops);
Representative costs
| Query | Example SQL | Plan Shape | Latency (cold→warm) | Notes |
|---|---|---|---|---|
| Q1 status + time | WHERE payload->>'status'='shipped' AND created_at>=now()-'7d'::interval | Bitmap Index Scan on partial + heap recheck | 120–900ms → 5–20ms | Partial + time predicate is selective; good. |
| Q2 top cities | SELECT payload->'shipping'->>'city', count(*) ... GROUP BY 1 | Index Only Scan on idx_orders_city + HashAgg | 300–1200ms → 30–60ms | Functional B-Tree wins for grouping. |
| Q3 contains SKU | WHERE payload->'items' @> '[{"sku":"B2"}]' | Bitmap Index Scan on GIN (path_ops) | 400–1500ms → 20–50ms | GIN pays off when selective. |
| Q4 sort by total | ORDER BY (payload->>'total')::numeric DESC LIMIT 50 | Seq Scan + Sort | 2–8s → 400–800ms | Pain: GIN can’t help ORDER BY; consider modeling. |
Storage impact (rules of thumb)
idx_orders_items_gin≈ 0.3–0.8× table size depending on array width.- Functional B-Tree on city ≈ few hundred MB for 10M rows.
- Partial index tiny (only ‘shipped’ rows).
When this is great
- You mostly filter on few fields, sometimes query nested arrays, and rarely do large sorts on numeric paths.
Option B — Model hot fields + keep raw
ALTER TABLE orders
ADD COLUMN status text,
ADD COLUMN city text,
ADD COLUMN total numeric;
-- Keep in sync (generated or trigger; generated shown)
ALTER TABLE orders
ADD COLUMN payload_status text GENERATED ALWAYS AS (payload->>'status') STORED,
ADD COLUMN payload_city text GENERATED ALWAYS AS (payload->'shipping'->>'city') STORED,
ADD COLUMN payload_total numeric GENERATED ALWAYS AS ((payload->>'total')::numeric) STORED;
-- Make modeled aliases the ones you index/use:
CREATE INDEX ON orders (status);
CREATE INDEX ON orders (city);
CREATE INDEX ON orders (total DESC);
CREATE INDEX ON orders (created_at);
Representative costs
| Query | Example SQL | Plan Shape | Latency (cold→warm) | Notes |
|---|---|---|---|---|
| Q1 status + time | WHERE status='shipped' AND created_at>=... | Bitmap Index Scan on B-Tree(s) | 20–150ms → 2–8ms | Smaller/faster than GIN. |
| Q2 top cities | GROUP BY city | Index Only Scan + HashAgg | 60–300ms → 8–20ms | B-Tree compression helps cache. |
| Q3 contains SKU | payload->'items' @> ... | Seq Scan or add narrow GIN on items | 1–4s → 200–600ms | Either add one focused GIN, or accept slower path. |
| Q4 sort by total | ORDER BY total DESC LIMIT 50 | Index Only Scan on (total) | 5–40ms → 1–5ms | B-Tree dominates sorting/range queries. |
Storage impact (rules of thumb)
- Three B-Tree indexes: small compared to one wide GIN.
- JSONB column still present for audit/debug.
When this is great
- You routinely filter/sort on a few hot fields and need predictable latency.
How to Measure in Your DB (copy/paste)
-- 1) Baseline
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE (payload->>'status') = 'shipped'
AND created_at >= now() - interval '7 days';
-- 2) After index
CREATE INDEX ...; -- as chosen above
VACUUM ANALYZE orders;
EXPLAIN (ANALYZE, BUFFERS) ...same query...
-- 3) Size checks
SELECT relname, pg_size_pretty(pg_relation_size(relid))
FROM pg_catalog.pg_statio_user_indexes
WHERE relname LIKE 'idx_orders%';
-- 4) Hot vs cold cache (repeat after pg_prewarm or after restart)
Interpretation tips
- Favor plans that use Index Only Scans with few heap rechecks.
- Check
Buffers:to see if reads are cached. - Beware
Rows Removed by Filter→ low selectivity. - If sorts spill (
Disk: ...), model columns or add better indexes.
Best Practices (No Regrets)
- Index only what you read
Create partial GIN for the exact predicate (e.g., status=’shipped’). Don’t GIN the world. - Prefer functional B-Tree for equality/group-by
ON ((payload->'shipping'->>'city'))is tiny and fast. - Model for ORDER BY / ranges
Sorting JSONB-extracted numerics is expensive. Use columns + B-Tree. - Use generated columns for safety
They lock types at write time and reduce per-query casts. - Cap index bloat
REINDEX CONCURRENTLY,VACUUM (ANALYZE), and monitorpg_stat_all_indexes.idx_scan. - Protect writes
Each GIN write touches many pages. If ingest throughput matters, keep GIN count low and paths narrow. - Promote fields deliberately
Start JSONB-first; promote fields to columns once their access pattern stabilizes.
Common Pitfalls
- Over-broad GIN on whole payload → huge index, poor cache locality.
- Casting inline every query → defeats index usage (
::numericon the fly). - Low-selectivity filters on GIN → index is touched but doesn’t reduce work.
- FORCE of schema too early → frequent migrations for fields that churn.
- Ignoring
ORDER BYneeds → users complain about “slow lists”.
Comparison Table — Choose Quickly
| Need | JSONB + GIN | Functional B-Tree on JSONB | Modeled Column |
|---|---|---|---|
| Equality filter on few values | ✅ | ✅ | ✅ |
| Range filter / ORDER BY | ❌ | ⚠️ (if cast to numeric, still meh) | ✅ |
| Frequent schema evolution | ✅ | ✅ | ⚠️ (migrations) |
| Index size | ❌ (large) | ✅ (small) | ✅ (small) |
| Write throughput | ❌ | ✅ | ✅ |
| Analytics-friendly types | ⚠️ | ⚠️ | ✅ |
Real Example: Promote What Hurts
Symptoms
ORDER BY (payload->>'total')::numeric DESC LIMIT 50takes 600ms warm.WHERE payload->>'status'='shipped'is fine with a partial index.- Writes slowed after full-payload GIN was added.
Fix
- Drop the full GIN.
- Add
total numeric GENERATED ALWAYS AS ...and index it. - Keep narrow partial GIN for item containment query only.
Outcome (typical)
- The top-50 list drops to ~5ms warm.
- Ingest TPS rises 10–30% (less GIN write churn).
- Index size shrinks materially.
Conclusion & Takeaways
- JSONB is not a license to skip modeling.
- Use JSONB to absorb change; use columns to accelerate hot paths.
- GIN is for selective lookups on few paths, not sorting and not “index everything.”
- Generated columns give you the best of both worlds: flexible raw + typed speed.
- Measure with
EXPLAIN (ANALYZE, BUFFERS)before/after each index—let the plan, not vibes, decide.
Internal Link Ideas (for your site)
- “GIN vs B-Tree for JSONB: operator classes and when to use
jsonb_path_ops” - “PostgreSQL generated columns for schema hygiene”
- “From event streams to facts: materializing hot fields in OLTP”
- “Cost-based tuning: reading
BUFFERSoutput in EXPLAIN ANALYZE” - “Partial indexes playbook: keeping indexes tiny and useful”
Image Prompt (for DALL·E / Midjourney)
“A clean, modern data architecture decision tree showing a PostgreSQL table with a JSONB column branching into two paths: (1) modeled typed columns with small B-Tree indexes, (2) JSONB with narrow GIN indexes on specific paths; include icons for query types (filter, sort, group-by), annotate where each index shines; minimalistic, high-contrast, 3D isometric style.”
Tags
#PostgreSQL #JSONB #GIN #DatabaseDesign #DataEngineering #Indexing #Performance #SQL #Scalability #Architecture








Leave a Reply