MongoDB Data Modeling & Indexing Playbook for Mid-Level Data Engineers

Meta description: Practical MongoDB playbook for mid-level data engineers: model documents from query patterns, choose the right indexes, avoid hot shards, and keep latency predictable at scale.


Why this matters (and why your p95 spiked last Friday)

Your product shipped fast on MongoDB—until a feature launch doubled traffic and your p95 read latency jumped from 12 ms to 180 ms. Nothing “broke,” but dashboards lag, workers time out, and on-call gets noisy. This article is your pragmatic path back to calm: how to design MongoDB schemas around query patterns, pick the right indexes, and scale without creating a hot shard bonfire.


Core concepts, stripped of fluff

MongoDB’s mental model

  • Document store: JSON-like BSON docs inside collections. _id is the primary key.
  • Indexing: B-tree indexes (plus specialized types) power queries. No index → collection scan.
  • Replication first, sharding when needed: HA via replica sets; horizontal scale via shards.
  • Consistency controls: writeConcern, readConcern, and readPreference decide durability and staleness.

Work from query patterns backward

Design for how the app actually reads. Aim for:

  • Stable “query shapes” (same fields, same operators)
  • Predictable bounded result sizes
  • Selectivity that matches an index prefix
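As a sketch, a "stable query shape" means building filters from a fixed set of fields and operators so the planner caches one plan and the same index always applies. The helper below is hypothetical, using the article's `userId`/`ts` fields; only the values vary between calls:

```python
from datetime import datetime, timedelta, timezone

def recent_filter(user_id, hours: int = 24) -> dict:
    # Same two fields, same operators on every call — only the values vary,
    # so a {userId:1, ts:-1} index applies to every invocation.
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    return {"userId": user_id, "ts": {"$gte": since}}
```

Ad-hoc filters assembled from user input tend to violate this; funneling reads through a few shape-preserving helpers keeps plans predictable.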

Embed vs. reference — the 80/20 rules

| Relationship | Prefer Embed When… | Prefer Reference When… |
|---|---|---|
| 1:1 | Always read together, small doc | Independent lifecycle, large/rare field |
| 1:N | Child count bounded (<100) and co-read | Unbounded growth or independent writes |
| N:M | Rare; usually reference | Frequent cross-linking; use IDs and $lookup sparingly |
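As a minimal illustration of the 1:1 and 1:N rows above (all values fabricated):

```python
# 1:1, small, always co-read → embed:
user_embedded = {
    "_id": "u1",
    "email": "ana@example.com",
    "prefs": {"theme": "dark"},  # lives and dies with the user
}

# 1:N with unbounded growth → reference from the child side:
event_referenced = {
    "_id": "e1",
    "userId": "u1",   # reference back to the user, resolved via $lookup if needed
    "type": "clicked",
}
```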

Index types you’ll actually use

  • Single / Compound: Most queries. Remember the left-prefix rule on compound keys.
  • Multikey: Indexes array fields; still follows prefix logic, and a compound index may include at most one array-valued field.
  • Partial: Index subset (e.g., status: "ACTIVE").
  • TTL: Auto-expire time-series/ephemeral docs.
  • Hashed: For even shard distribution on equality lookups.
  • Text / Wildcard: Niche; verify with explain() before committing.
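To make the partial and TTL options concrete, here is a hedged sketch of the article's indexes expressed as (keys, options) pairs for PyMongo's `create_index`; the `index_plan` helper and the `status` field are illustrative:

```python
def index_plan() -> list:
    """(keys, options) pairs for the index types above — illustrative."""
    day = 60 * 60 * 24
    return [
        # Compound: equality field first, then the sort field (left-prefix rule).
        ([("userId", 1), ("ts", -1)], {}),
        # Partial: only ACTIVE docs enter the index, keeping it small.
        ([("status", 1)], {"partialFilterExpression": {"status": "ACTIVE"}}),
        # TTL: expire raw events after 30 days.
        ([("ts", 1)], {"expireAfterSeconds": 30 * day}),
    ]

# Applied with PyMongo against a live deployment:
#   for keys, opts in index_plan():
#       db.events.create_index(keys, **opts)
```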

Aggregation pipeline

Your Swiss army knife for projections, filters, groups, and joins ($lookup). Keep pipelines narrow: put $match first so indexes can apply, and $project early to shed unneeded fields.

Sharding in one table

  • Choose a key with high cardinality and good distribution.
  • Avoid monotonically increasing keys (e.g., timestamps) unless you pair them with zones or a compound key whose leading component is hashed or otherwise well distributed.
  • Model workloads so reads hit one shard when possible.
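A toy simulation makes the monotonic-key trap visible. This is not MongoDB's actual chunk-splitting logic; the shard count and routing functions are assumptions for illustration only:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4

def shard_for_hashed(user_id: str) -> int:
    # Hash routing: key value determines a pseudo-random shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shard_for_monotonic(ts, chunk_boundaries):
    # Range routing: each shard owns a contiguous slice of the key space.
    for i, upper in enumerate(chunk_boundaries):
        if ts < upper:
            return i
    return NUM_SHARDS - 1

# 10k inserts with ever-increasing timestamps all land on the last shard:
hot = Counter(shard_for_monotonic(t, [100, 200, 300]) for t in range(1000, 11000))

# Hashed user IDs spread roughly evenly across all four shards:
even = Counter(shard_for_hashed(f"user-{i}") for i in range(10000))
```

With range routing every new timestamp exceeds the highest chunk boundary, so one shard absorbs all writes; hashing breaks that correlation at the cost of scattering range scans.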

A realistic example: events analytics at scale

Use case: You track user events for a SaaS app and need fast reads for dashboards (DAU, recent actions per user) and cheap retention for raw events.

Collections & schemas

// users
{
  _id: ObjectId("..."),
  email: "ana@example.com",
  plan: "pro",
  profile: { name: "Ana", tz: "America/New_York" },
  // bounded, co-read preferences — embed
  prefs: { theme: "dark", notifications: { email: true, sms: false } },
  createdAt: ISODate("2025-10-01T12:00:00Z")
}

// events (time-series-ish)
{
  _id: ObjectId("..."),
  userId: ObjectId("..."),
  type: "clicked",
  path: "/pricing",
  ts: ISODate("2025-11-19T14:10:23Z"),
  attrs: { plan: "pro", ua: "..." } // sparse, append-only
}

Index strategy (mongosh)

// Users: frequent lookups by email and plan
db.users.createIndex({ email: 1 }, { unique: true })
db.users.createIndex({ plan: 1, createdAt: -1 }) // left-prefix: plan

// Events: hot path is "latest events for a user"
db.events.createIndex({ userId: 1, ts: -1 }) // satisfies equality+sort
// Time-bounded dashboard queries + 30-day retention in one index:
// a single-field index can be scanned in either direction, so adding a
// separate { ts: -1 } index alongside this one would be redundant overhead.
db.events.createIndex({ ts: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 })

Query examples (PyMongo)

from pymongo import MongoClient, ASCENDING, DESCENDING
from datetime import datetime, timedelta, timezone

client = MongoClient("mongodb://...")
db = client.analytics

# recent activity feed per user (index: {userId:1, ts:-1})
def recent_events(user_id, limit=50):
    return list(db.events.find(
        {"userId": user_id},
        projection={"_id": 0, "type": 1, "path": 1, "ts": 1}
    ).sort("ts", DESCENDING).limit(limit))

# DAU in last 24h (time-bounded $match can use the ts index)
def dau():
    since = datetime.now(timezone.utc) - timedelta(hours=24)
    pipeline = [
        {"$match": {"ts": {"$gte": since}}},
        {"$group": {"_id": "$userId"}},
        {"$count": "dau"}
    ]
    return list(db.events.aggregate(pipeline))

# Sessionized events (project early to keep memory low)
def actions_by_path_since(hours=1):
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    pipeline = [
        {"$match": {"ts": {"$gte": since}}},
        {"$project": {"p": "$path"}},
        {"$group": {"_id": "$p", "n": {"$sum": 1}}},
        {"$sort": {"n": -1}},
        {"$limit": 20}
    ]
    return list(db.events.aggregate(pipeline))

Sanity-check with explain()

db.events.find({ userId: U, ts: { $gte: ISODate("2025-11-19") } })
         .sort({ ts: -1 })
         .limit(50)
         .explain("executionStats")
// Expect IXSCAN on {userId:1, ts:-1}, small nReturned, low totalKeysExamined.
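If you automate this check (e.g., in CI against a seeded dataset), a small parser over the explain output can flag regressions. The `plan_is_indexed` helper below is a hypothetical sketch over the documented `winningPlan`/`executionStats` fields, tested here on fabricated explain documents:

```python
def plan_is_indexed(explain_doc: dict, max_ratio: float = 2.0) -> bool:
    """True if the winning plan bottoms out in an IXSCAN and the number of
    index keys examined stays proportional to documents returned."""
    stage = explain_doc["queryPlanner"]["winningPlan"]
    while "inputStage" in stage:  # drill down to the leaf access stage
        stage = stage["inputStage"]
    stats = explain_doc["executionStats"]
    keys = stats["totalKeysExamined"]
    returned = max(stats["nReturned"], 1)
    return stage["stage"] == "IXSCAN" and keys <= max_ratio * returned
```

In PyMongo, `cursor.explain()` (or `db.command("explain", ...)`) yields a document of this shape to feed the helper.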

When to shard this

Symptoms you’re outgrowing a single replica set:

  • Working set > RAM → page faults during peak
  • Single-node write throughput plateaus
  • Uneven load from large tenants

Shard key candidates:

  • { userId: "hashed" } for flat multi-tenant distribution with equality reads.
  • { userId: 1, ts: -1 } for time-bounded per-user reads; ensure userId holds high cardinality.
    Avoid { ts: 1 } alone—monotonic inserts → hot shard.
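As a sketch, the admin commands for the hashed-userId option might look like this via PyMongo's generic command interface; the database name and mongos address are assumptions:

```python
def shard_events_commands() -> list:
    """Admin commands (as dicts) to shard analytics.events — illustrative."""
    return [
        {"enableSharding": "analytics"},
        # Hashed userId: even write distribution; per-user equality reads
        # still target a single shard.
        {"shardCollection": "analytics.events", "key": {"userId": "hashed"}},
    ]

# Against a live mongos this would run as:
#   from pymongo import MongoClient
#   admin = MongoClient("mongodb://mongos-host:27017").admin
#   for cmd in shard_events_commands():
#       admin.command(cmd)
```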

Best practices & the traps I see most

Modeling & access patterns

  • Design for your top 3 queries. If a field isn’t in a top query, don’t index it yet.
  • Bound arrays. Unbounded arrays turn multikey indexes into performance landmines.
  • Pre-join when it pays. If $lookup happens on every request and the data changes rarely, duplicate small fields (e.g., plan) into events.

Index hygiene

  • Honor the left-prefix rule. {plan:1, createdAt:-1} serves queries on plan alone and on plan+createdAt, not createdAt alone.
  • One index per query shape. Extra overlapping indexes eat RAM and slow writes.
  • Measure, don’t guess. Use the $indexStats aggregation stage, collStats, and the Atlas Performance Advisor/Query Profiler. (validate() checks data integrity, not index usage.)

Query execution

  • Project early. Return only the fields you need.
  • Avoid regex prefix misses. An anchored /^abc/ can use tight index bounds; an unanchored /abc/ must examine every key in the index.
  • Use hint() judiciously. Hints can help in hot paths but become tech debt if schema evolves.

Durability & consistency

  • Pick explicit writeConcern. At least { w: "majority" } for critical writes.
  • Read from primaries for correctness-sensitive paths. Use secondaries for analytics where staleness is acceptable.
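A hedged sketch of making that choice explicit in code rather than relying on driver defaults; the routing function and path names are made up, and with PyMongo the returned dicts would feed a `WriteConcern`:

```python
def concern_for(path: str) -> dict:
    """Map a code path to explicit write-concern settings (illustrative)."""
    critical = {"billing", "signup"}  # hypothetical correctness-sensitive paths
    return {"w": "majority", "j": True} if path in critical else {"w": 1}

# With PyMongo these settings would be applied per collection handle:
#   from pymongo import MongoClient, WriteConcern
#   coll = db.get_collection("events",
#                            write_concern=WriteConcern(**concern_for("billing")))
```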

Sharding realities

  • Cardinality > 10^5 for busy clusters. Low-cardinality keys → jumbo chunks, hot shards.
  • Backfill carefully. Resharding and historical re-ingest can stampede one shard; throttle and batch.

Quick reference table: pattern → index → gotcha

| Pattern | Query Example | Index | Watch Out For |
|---|---|---|---|
| Point read by email | find({email}) | {email:1} unique | Case/locale collation |
| Per-user recent | find({userId}).sort(ts:-1) | {userId:1, ts:-1} | Missing sort field in index |
| Dashboard by time | find({ts:{$gte:t}}) | {ts:-1} | Monotonic hot inserts |
| Active docs | find({status:"ACTIVE"}) | Partial {status:1} | Queries must include filter |
| Expiring data | TTL on ts | {ts:1} + TTL | TTL isn’t real-time deletion |
| Text search | $text | text index | Tokenization surprises |

Common pitfalls checklist (print this)

  • Added sort field to the index (not just filter fields)
  • Index size fits in memory (or at least hot subset does)
  • Arrays bounded / use pagination, not unbounded growth
  • Query shapes stable and measured with Profiler
  • Shard key chosen for cardinality and distribution
  • Write/read concerns set intentionally, not defaults

Conclusion & takeaways

  • Work backward from query patterns, not tables.
  • Use compound indexes that match equality → sort → range.
  • Keep arrays bounded and duplicate small, slow-changing fields to avoid chatty joins.
  • If you must shard, choose a key for distribution and locality—and test with production-like data.

CTA: If you want a review of your current MongoDB schema and indexes, paste your top three queries (filters, projections, sorts), and I’ll sketch a tailored index and sharding plan.


Internal link ideas (official sources only)

  • MongoDB Manual — Indexes (single, compound, multikey, partial, TTL)
  • MongoDB Manual — Aggregation Pipeline & $lookup
  • MongoDB Manual — Explain Plans & Query Planner
  • MongoDB Manual — Data Modeling Patterns (embed vs reference)
  • MongoDB Manual — Sharding & Choosing a Shard Key
  • MongoDB Manual — Read/Write Concerns & Read Preferences
  • MongoDB Atlas — Performance Advisor & Query Profiler

(Link these terms to their respective pages on the official MongoDB docs site.)


Image prompt

“A clean, modern architecture diagram showing a MongoDB cluster with a primary and two secondaries, plus a sharded events collection using a compound shard key { userId, ts }. Include indexes { userId:1, ts:-1 }, { ts:-1 }, and a TTL index. Minimalistic, high-contrast, 3D isometric style.”


Tags

#MongoDB #NoSQL #DataEngineering #DataModeling #Indexing #Sharding #PerformanceTuning #Aggregation #ReplicaSets


Bonus: Pitch ideas for future MongoDB articles (SEO-ready)

  1. “MongoDB Shard Key Design: How to Avoid Hot Shards (With Real Traffic Distributions)”
    Keywords: shard key, hot shard, cardinality, resharding, zone sharding.
  2. “Aggregation Pipeline Patterns: 12 Recipes Every Mid-Level Engineer Should Know”
    Keywords: MongoDB aggregation, $lookup, $group, pipeline optimization.
  3. “From 200ms to 20ms: Practical MongoDB Index Tuning Using Explain Plans”
    Keywords: MongoDB index optimization, explain plan, query profiler.
  4. “Time-Series on MongoDB: TTL, Bucketing, and Cost-Efficient Retention”
    Keywords: MongoDB time-series, TTL index, retention policy, bucketing.
  5. “Modeling One-to-Many in MongoDB: Embed, Reference, or Hybrid?”
    Keywords: embed vs reference, data modeling, denormalization.
  6. “Multi-Tenant SaaS on MongoDB: Isolation, Throttling, and Per-Tenant Analytics”
    Keywords: multi-tenant MongoDB, tenant isolation, workload management.
  7. “MongoDB Consistency in Practice: Read/Write Concerns Without the Hand-Waving”
    Keywords: write concern majority, read preference, read concern levels.
  8. “Change Streams for Real-Time Features: Patterns, Limits, and Backpressure”
    Keywords: MongoDB change streams, real-time, event-driven.
  9. “Atlas Performance Advisor vs. Human Tuning: When to Trust the Robot”
    Keywords: MongoDB Atlas, performance advisor, auto-index suggestions.
  10. “Cost-Aware MongoDB: Storage, Index Bloat, and Read/Write Trade-offs”
    Keywords: MongoDB cost optimization, index size, storage planning.