Designing SLIs and SLOs for Data Pipelines: A Practical Guide for Data Engineers
If your data platform “looks fine” in Airflow but business users are still asking “Can I trust this dashboard?”, you don’t have reliability — you just have green boxes.
SLIs and SLOs are how you turn that vague feeling of “I think it’s OK” into measurable, enforceable reliability for data pipelines.
This article walks through how to design SLIs/SLOs specifically for data platforms: batch, streaming, warehouses, and ML pipelines.
1. Why SLIs/SLOs Matter for Data Pipelines
SLIs/SLOs started in classic SRE for APIs and infrastructure.
Data teams tried to copy them and ended up with useless metrics like:
“Pipeline success rate = 99.9%”
(while half the data is stale or wrong)
For data engineering, reliability = users getting correct-enough data on time, consistently, not “all DAG tasks succeeded.”
You need SLIs/SLOs to:
- Align with consumers (analysts, product, finance) on what “good enough” means
- Prioritize work using error budgets, not random Jira tickets
- Avoid blind spots where pipelines run but data is silently wrong or late
- Automate alerts so humans aren’t staring at dashboards all day
2. Concepts: SLI, SLO, Error Budget (Data Flavor)
Let’s adapt the classic definitions to data.
2.1 SLI (Service Level Indicator)
SLI = a measurable metric that represents data reliability from the consumer’s point of view.
Examples for data pipelines:
- Freshness: “Max delay between source event and availability in warehouse table”
- Completeness: “% of expected rows loaded by 8:00 AM”
- Correctness/Quality: “% of rows passing business validation rules”
- Latency (for streaming): “P95 end-to-end event processing time”
- Schema Stability: “Days without breaking schema changes”
2.2 SLO (Service Level Objective)
SLO = target value or range for an SLI over a time window.
Examples:
- Freshness: “95% of days, the `orders_daily` table is updated by 8:05 AM UTC.”
- Completeness: “99% of days, we load ≥ 99.5% of expected order rows.”
- Correctness: “Over 28 days, ≥ 99.9% of rows pass validation checks.”
2.3 Error Budget
Error Budget = how much you’re allowed to be “bad” without breaking your promise.
If your SLO is 95% and you only hit the target on 90% of days this month, you've burned through your error budget.
For data teams, error budgets drive decisions like:
- Freeze new transformations / refactors
- Allocate time to root cause recurring quality issues
- Push back on “just one more dashboard” until reliability stabilizes
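To make the budget arithmetic concrete, here is a minimal sketch that computes how much of a 30-day error budget is left for a 95% freshness SLO. It assumes the daily SLI table built in section 8 (`RELIABILITY.REVENUE_SLIS_DAILY`) and a 10-minute freshness target; adjust names and thresholds to your setup.
-- Sketch: error budget burn for a 95% / 30-day freshness SLO.
-- Assumes the RELIABILITY.REVENUE_SLIS_DAILY table described in section 8.
SELECT
  COUNT(*)                                                    AS days_in_window,
  ROUND(COUNT(*) * 0.05, 1)                                   AS allowed_bad_days,   -- 5% budget for a 95% SLO
  SUM(CASE WHEN freshness_minutes > 10 THEN 1 ELSE 0 END)     AS bad_days_so_far,
  ROUND(COUNT(*) * 0.05, 1)
    - SUM(CASE WHEN freshness_minutes > 10 THEN 1 ELSE 0 END) AS budget_remaining_days
FROM reliability.revenue_slis_daily
WHERE date >= DATEADD('day', -30, CURRENT_DATE);
If `budget_remaining_days` goes negative, that is the signal to shift effort from new features to reliability work.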
3. What Should You Measure? Core Data SLIs
Don’t start from tools or what’s easy to log.
Start from user expectations.
3.1 Map User Journeys to SLIs
Ask each major consumer:
- “What breaks your world?”
- “When do you stop trusting this dataset?”
Then map that to SLIs:
| User expectation | SLI type | Example SLI |
|---|---|---|
| “Finance needs daily revenue by 8:00 AM.” | Freshness | % days when revenue table ready by 8:00 AM |
| “BI team needs full order history per day.” | Completeness | % of expected orders loaded per day |
| “Product needs correct experiment assignments.” | Correctness | % rows passing A/B assignment consistency rules |
| “Ops needs near real-time alerts.” | Latency | P95 event processing time < 2 min |
| “Schema changes shouldn’t break dashboards.” | Schema stability | # of breaking changes per month |
These are consumer-centric, not “DAG runtime”.
4. Designing SLIs/SLOs for Batch Pipelines
Think of a typical daily batch pipeline:
- Extract from source (DB, APIs, logs)
- Load into staging
- Transform into curated tables (e.g., `fact_orders`, `dim_customers`)
- Expose to BI / ML
4.1 Core Batch SLIs
1) Freshness SLI
Question: “Is today’s data available on time?”
You can implement:
-- SLI: freshness, measured as minutes of delay
SELECT
  DATEDIFF(
    'minute',
    MAX(loaded_at),                                -- latest record timestamp in target table
    CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())   -- or the scheduled cutoff time
  ) AS minutes_delay
FROM fact_orders;                                  -- the curated table being monitored
Example SLO:
“On 95% of days, `minutes_delay` <= 10 by 08:05 AM UTC.”
2) Completeness SLI
Question: “Did we load all the rows we expected?”
You can compare against a control table or source-system counts:
SELECT
target.business_date,
target.row_count AS loaded_rows,
src.row_count AS expected_rows,
(target.row_count::float / src.row_count) AS completeness_ratio
FROM daily_target_counts target
JOIN daily_source_counts src
ON target.business_date = src.business_date;
Example SLO:
“For the last 30 days, 99% of days had `completeness_ratio` >= 0.995.”
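One practical question is where `daily_source_counts` comes from. A common pattern is to have the extraction job record the source-side count immediately after it runs; here is a sketch, assuming the job has SQL access to the source `orders` table (names are illustrative):
-- Sketch: record yesterday's expected row count right after extraction.
-- source_db.orders and the column names are illustrative.
INSERT INTO daily_source_counts (business_date, row_count)
SELECT
  CURRENT_DATE - 1 AS business_date,
  COUNT(*)         AS row_count
FROM source_db.orders
WHERE order_date = CURRENT_DATE - 1;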
3) Correctness / Quality SLI
Pick critical business rules, not every minor check:
- No negative revenue
- Currency codes valid
- Foreign keys resolvable
- No double-counted orders
Example:
SELECT
1 - (COUNT(*) FILTER (WHERE NOT quality_pass))::float
/ COUNT(*) AS percent_rows_passing
FROM fact_orders_quality_checks
WHERE business_date = CURRENT_DATE - 1;
SLO:
“Over any 30-day window, `percent_rows_passing` >= 0.999 on 95% of days.”
4.2 Example: SLI/SLO Table for a Daily Revenue Pipeline
| SLI name | Type | SLO | Alert condition |
|---|---|---|---|
| `revenue_freshness_sli` | Freshness | 95% of days: data ready by 08:05 UTC | Past 08:10 UTC today and load not finished |
| `revenue_completeness_sli` | Completeness | 99% of days: ≥ 99.5% of expected rows | < 99% of expected rows by 08:10 today |
| `revenue_quality_sli` | Correctness | 30-day rolling: 99.9% of rows pass rules | Drops below 99.5% on any single day |
| `revenue_pipeline_success_sli` | Technical | 99% of DAG runs succeed | Failures on 2+ consecutive days |
Notice: pipeline success rate is last, not first.
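To wire the alert conditions above into a monitoring tool, one simple pattern is a query that returns a row only when an SLI is in breach today, so any returned row triggers an alert. A sketch, again assuming the daily SLI table from section 8:
-- Sketch: one row per breached SLI for today; an alert fires if any rows are returned.
SELECT 'revenue_freshness_sli' AS sli_name, 'freshness_minutes > 10' AS breach_rule
FROM reliability.revenue_slis_daily
WHERE date = CURRENT_DATE AND freshness_minutes > 10
UNION ALL
SELECT 'revenue_completeness_sli', 'completeness_ratio < 0.99'
FROM reliability.revenue_slis_daily
WHERE date = CURRENT_DATE AND completeness_ratio < 0.99
UNION ALL
SELECT 'revenue_quality_sli', 'quality_ratio < 0.995'
FROM reliability.revenue_slis_daily
WHERE date = CURRENT_DATE AND quality_ratio < 0.995;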
5. SLIs/SLOs for Streaming & Real-Time Pipelines
For streaming (Kafka, Kinesis, Flink, Spark Streaming, etc.), the emphasis shifts to latency, lag, and drop rate.
5.1 Core Streaming SLIs
- End-to-end latency: event ingestion → availability in sink (e.g. Redis, ClickHouse)
- Consumer lag: how far behind the consumer is vs the latest offset
- Throughput / drop rate: events processed vs received
- Out-of-order tolerance: how many events arrive too late to be useful
Example SLI:
-- Assuming you store event time & ingest time in a warehouse
SELECT
APPROX_PERCENTILE(
DATEDIFF('second', event_time, ingested_at),
0.95
) AS p95_latency_seconds
FROM events_stream
WHERE ingested_at >= DATEADD('minute', -10, CURRENT_TIMESTAMP());
Example SLO:
“For the last 7 days, P95 end-to-end latency < 120 seconds during business hours.”
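The same warehouse-side approach covers the other streaming SLIs. For example, here is a sketch of an out-of-order-tolerance SLI over the same `events_stream` table, using an illustrative 5-minute lateness threshold:
-- Sketch: share of events arriving more than 5 minutes after their event time.
SELECT
  SUM(CASE WHEN DATEDIFF('second', event_time, ingested_at) > 300 THEN 1 ELSE 0 END)::float
    / COUNT(*) AS late_event_ratio
FROM events_stream
WHERE ingested_at >= DATEADD('hour', -1, CURRENT_TIMESTAMP());
Consumer lag, by contrast, is usually read straight from the broker or its exported metrics rather than computed in the warehouse.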
6. Best Practices: How to Design SLIs/SLOs That Don’t Suck
6.1 Start with 3–5 SLIs per Critical Domain
If you try to measure everything, you’ll measure nothing usefully.
Pick 1–2 key tables or domains (e.g., Revenue, Orders, Experiments) and define:
- Freshness
- Completeness
- One important quality metric
Then expand gradually.
6.2 Make Them Consumer-Visible
If users can’t see the status, they won’t trust it.
- Expose SLIs in dbt docs, data catalog, or BI dashboards
- Add “reliability badges” to datasets (e.g., Green: All SLOs met last 7 days)
- Attach SLO status to key dashboards (e.g., main revenue dashboard)
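One lightweight way to implement such a badge is a small view over the daily SLI table that a catalog or BI tool can read directly. A sketch, assuming the `RELIABILITY.REVENUE_SLIS_DAILY` table from section 8 and illustrative thresholds:
-- Sketch: reliability badge -- GREEN if every SLO was met in the last 7 days,
-- YELLOW if exactly one day breached, RED otherwise.
CREATE OR REPLACE VIEW reliability.revenue_badge AS
SELECT
  CASE
    WHEN SUM(CASE WHEN freshness_minutes > 10
                    OR completeness_ratio < 0.995
                    OR quality_ratio < 0.999 THEN 1 ELSE 0 END) = 0 THEN 'GREEN'
    WHEN SUM(CASE WHEN freshness_minutes > 10
                    OR completeness_ratio < 0.995
                    OR quality_ratio < 0.999 THEN 1 ELSE 0 END) = 1 THEN 'YELLOW'
    ELSE 'RED'
  END AS badge
FROM reliability.revenue_slis_daily
WHERE date >= DATEADD('day', -7, CURRENT_DATE);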
6.3 Define Clear Ownership
Each domain’s SLOs must have:
- An owner team (e.g., “Revenue Data Domain Team”)
- An on-call / primary contact
- A runbook: what to do when SLO is breached
No owner = no reliability.
6.4 Integrate SLOs into Incident & Planning Process
SLOs are not “nice metrics” — they’re decision tools.
Use them to drive:
- Alerts: based on SLI breaches, not single-task failures
- Postmortems: reference impact via SLO violations, not Jira feelings
- Roadmap: if a service consistently burns error budget, invest in reliability work
7. Common Pitfalls (And How to Avoid Them)
Pitfall 1: Vanity SLIs
“Number of Airflow tasks succeeded.”
This tells you nothing about user-visible reliability.
Fix: Only accept SLIs if you can complete the sentence:
“If this SLI is bad, the user feels it because ____.”
Pitfall 2: Overly Aggressive SLOs (e.g., 100% everything)
- 100% freshness
- 0 incidents
- 100% checks passing
You’ll always be in violation and people will ignore the metrics.
Fix: Choose realistic SLOs that match business impact:
- Maybe Revenue needs 99.9% completeness
- But Marketing attribution can live with 98%
Pitfall 3: SLOs You Can’t Measure Automatically
If your SLI needs a spreadsheet and 3 people every month, you’ll stop tracking it.
Fix: Instrument SLIs directly in:
- Your orchestration layer (Airflow/Dagster/Prefect)
- Your warehouse (Snowflake, BigQuery, Redshift) with scheduled queries
- Your monitoring system (Prometheus, Datadog, etc.)
Pitfall 4: Ignoring Downstream Breakage
SLOs should cover end-to-end journeys, not single jobs.
Example failure: Pipeline is fine, but a downstream schema change breaks the main dashboard.
Fix: Add SLIs like:
- “Number of failed dbt models feeding `revenue_mart`”
- “Number of query errors on key BI dashboards per day”
- “Days since last breaking schema change on core tables”
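As a concrete illustration of the last item, here is a sketch that assumes breaking schema changes are logged (by CI checks or data-contract tooling) into a hypothetical `schema_change_log` table:
-- Sketch: days since the last breaking schema change on core tables.
-- schema_change_log is a hypothetical table your CI / contract checks would populate.
SELECT
  DATEDIFF('day', MAX(changed_at), CURRENT_DATE) AS days_since_breaking_change
FROM schema_change_log
WHERE is_breaking = TRUE
  AND table_name IN ('fact_orders', 'dim_customers');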
8. A Lightweight Framework to Implement SLIs/SLOs
You don’t need a full SRE platform to start. Here’s a pragmatic recipe.
Step 1: Pick One Critical Domain
Example: Daily Revenue Reporting
Step 2: Define 3–4 SLIs
- `revenue_freshness_sli`
- `revenue_completeness_sli`
- `revenue_quality_sli`
- `revenue_pipeline_success_sli` (optional)
Step 3: Implement SLIs as Warehouse Tables / Views
Create a table like RELIABILITY.REVENUE_SLIS_DAILY:
| date | freshness_minutes | completeness_ratio | quality_ratio | pipeline_success |
|---|---|---|---|---|
| 2025-11-23 | 6 | 0.998 | 0.9992 | TRUE |
| 2025-11-24 | 12 | 0.993 | 0.9990 | FALSE |
Populate it via a scheduled job.
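That job can be a single daily INSERT that reuses the SLI queries from section 4. A minimal sketch, assuming you have wrapped those queries in one-row views named `freshness_sli`, `completeness_sli`, `quality_sli`, and `pipeline_status` (names are illustrative):
-- Sketch: append one row of SLIs per day; each joined view is assumed to return exactly one row.
INSERT INTO reliability.revenue_slis_daily
  (date, freshness_minutes, completeness_ratio, quality_ratio, pipeline_success)
SELECT
  CURRENT_DATE - 1        AS date,
  f.minutes_delay         AS freshness_minutes,
  c.completeness_ratio    AS completeness_ratio,
  q.percent_rows_passing  AS quality_ratio,
  p.all_tasks_succeeded   AS pipeline_success
FROM freshness_sli f
CROSS JOIN completeness_sli c
CROSS JOIN quality_sli q
CROSS JOIN pipeline_status p;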
Step 4: Add SLO Evaluation Logic
Either:
- A scheduled SQL job that computes rolling windows and flags “breach”, OR
- Alerts in your monitoring tool that watch these metrics directly
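The scheduled-SQL option can be as small as one query per SLO that computes attainment over the rolling window and flags a breach. A sketch for the freshness SLO, over the same daily SLI table:
-- Sketch: evaluate the 95% / 30-day freshness SLO and flag a breach.
SELECT
  COUNT(*)                                                  AS days_in_window,
  SUM(CASE WHEN freshness_minutes <= 10 THEN 1 ELSE 0 END)  AS good_days,
  SUM(CASE WHEN freshness_minutes <= 10 THEN 1 ELSE 0 END)::float / COUNT(*) AS attainment,
  CASE
    WHEN SUM(CASE WHEN freshness_minutes <= 10 THEN 1 ELSE 0 END)::float / COUNT(*) < 0.95
    THEN 'BREACH'
    ELSE 'OK'
  END AS slo_status
FROM reliability.revenue_slis_daily
WHERE date >= DATEADD('day', -30, CURRENT_DATE);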
Step 5: Show It to Users
- Simple Looker/Power BI/Tableau dashboard: “Revenue Data Reliability”
- Green / yellow / red indicators
- Link this to the main revenue dashboard so business users see status at a glance
9. Internal Link Ideas (for a Blog / Docs Site)
You can cross-link this article to:
- “Data Contracts for Analytics and ML”
- “Designing Data Quality Checks that Actually Catch Real Issues”
- “How to Build a Data Incident Management Process (for Analytics Teams)”
- “Dagster/Airflow Patterns for Observability and Monitoring”
- “Choosing the Right Freshness and Latency Targets for Your Data Products”
10. Conclusion & Takeaways
SLIs and SLOs for data pipelines are not a buzzword — they’re how you translate “vibes” into numbers and stop arguing about whether your data is “good enough.”
Key points:
- Start from user expectations, not what’s easy to log
- Focus on freshness, completeness, and correctness for your most critical domains
- Design SLIs that are automated, visible, and owned
- Use error budgets to drive priorities, not just intuition
- Avoid vanity metrics and unreachable “99.999% of everything” targets
If your dashboards and models are business-critical, they deserve the same level of reliability discipline as production APIs. SLIs/SLOs are the way to get there.
Call to action:
Pick one critical pipeline today (probably revenue or orders). Define just 3 SLIs, give them realistic SLOs, and wire up a basic dashboard. Once people start asking, “Why don’t we have this for other datasets?” — you’ll know you’re on the right track.