Designing SLIs and SLOs for Data Pipelines: A Practical Guide for Data Engineers

If your data platform “looks fine” in Airflow but business users are still asking “Can I trust this dashboard?”, you don’t have reliability — you just have green boxes.
SLIs and SLOs are how you turn that vague feeling of “I think it’s OK” into measurable, enforceable reliability for data pipelines.

This article walks through how to design SLIs/SLOs specifically for data platforms: batch, streaming, warehouses, and ML pipelines.


1. Why SLIs/SLOs Matter for Data Pipelines

SLIs/SLOs started in classic SRE for APIs and infrastructure.
Data teams tried to copy them and ended up with useless metrics like:

“Pipeline success rate = 99.9%”
(while half the data is stale or wrong)

For data engineering, reliability = users getting correct-enough data on time, consistently, not “all DAG tasks succeeded.”

You need SLIs/SLOs to:

  • Align with consumers (analysts, product, finance) on what “good enough” means
  • Prioritize work using error budgets, not random Jira tickets
  • Avoid blind spots where pipelines run but data is silently wrong or late
  • Automate alerts so humans aren’t staring at dashboards all day

2. Concepts: SLI, SLO, Error Budget (Data Flavor)

Let’s adapt the classic definitions to data.

2.1 SLI (Service Level Indicator)

SLI = a measurable metric that represents data reliability from the consumer’s point of view.

Examples for data pipelines:

  • Freshness: “Max delay between source event and availability in warehouse table”
  • Completeness: “% of expected rows loaded by 8:00 AM”
  • Correctness/Quality: “% of rows passing business validation rules”
  • Latency (for streaming): “P95 end-to-end event processing time”
  • Schema Stability: “Days without breaking schema changes”

2.2 SLO (Service Level Objective)

SLO = target value or range for an SLI over a time window.

Examples:

  • Freshness: “95% of days, the orders_daily table is updated by 8:05 AM UTC.”
  • Completeness: “99% of days, we load ≥ 99.5% of expected order rows.”
  • Correctness: “Over 28 days, ≥ 99.9% of rows pass validation checks.”

2.3 Error Budget

Error Budget = how much you’re allowed to be “bad” without breaking your promise.

If your SLO is 95% and you hit it only 90% this month, you’ve burned your error budget.
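
As a rough worked example: a 95% SLO over a 30-day window gives you about 1.5 "bad" days of budget, so a month with three late days has spent it twice over. Here is a minimal SQL sketch for tracking that burn, assuming a hypothetical daily table sli_daily with one row per day and a boolean slo_met flag:

-- Hypothetical table: sli_daily(business_date DATE, slo_met BOOLEAN), one row per day
SELECT
    COUNT(*) FILTER (WHERE NOT slo_met)                     AS bad_days,
    0.05 * COUNT(*)                                          AS allowed_bad_days,  -- a 95% SLO leaves a 5% budget
    COUNT(*) FILTER (WHERE NOT slo_met) - 0.05 * COUNT(*)    AS budget_overrun     -- positive means the budget is blown
FROM sli_daily
WHERE business_date >= CURRENT_DATE - 30;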

For data teams, error budgets drive decisions like:

  • Freeze new transformations / refactors
  • Allocate time to root cause recurring quality issues
  • Push back on “just one more dashboard” until reliability stabilizes

3. What Should You Measure? Core Data SLIs

Don’t start from tools or what’s easy to log.
Start from user expectations.

3.1 Map User Journeys to SLIs

Ask each major consumer:

  • “What breaks your world?”
  • “When do you stop trusting this dataset?”

Then map that to SLIs:

| User expectation | SLI type | Example SLI |
|---|---|---|
| “Finance needs daily revenue by 8:00 AM.” | Freshness | % of days the revenue table is ready by 8:00 AM |
| “BI team needs full order history per day.” | Completeness | % of expected orders loaded per day |
| “Product needs correct experiment assignments.” | Correctness | % of rows passing A/B assignment consistency rules |
| “Ops needs near real-time alerts.” | Latency | P95 event processing time < 2 min |
| “Schema changes shouldn’t break dashboards.” | Schema stability | # of breaking changes per month |

These are consumer-centric, not “DAG runtime”.


4. Designing SLIs/SLOs for Batch Pipelines

Think of a typical daily batch pipeline:

  • Extract from source (DB, APIs, logs)
  • Load into staging
  • Transform into curated tables (e.g., fact_orders, dim_customers)
  • Expose to BI / ML

4.1 Core Batch SLIs

1) Freshness SLI

Question: “Is today’s data available on time?”

You can implement:

-- SLI: minutes_delay (how stale is the target table right now?)
SELECT
    DATEDIFF(
        'minute',
        MAX(loaded_at),                                  -- latest record timestamp in the target table
        CONVERT_TIMEZONE('UTC', CURRENT_TIMESTAMP())     -- "now" in UTC, or your scheduled cutoff
    ) AS minutes_delay
FROM fact_orders;                                        -- the curated table you promise to consumers

Example SLO:

“On 95% of days, minutes_delay <= 10 by 08:05 AM UTC.”
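
That SLO is about the distribution of days, not a single run, so you also need a query over a logged history. A minimal sketch, assuming a hypothetical log table freshness_sli_daily(business_date, minutes_delay) populated daily by the query above:

-- Share of the last 30 days that met the 10-minute freshness target
SELECT
    AVG(CASE WHEN minutes_delay <= 10 THEN 1.0 ELSE 0.0 END) AS share_days_on_time
FROM freshness_sli_daily
WHERE business_date >= CURRENT_DATE - 30;
-- SLO is met if share_days_on_time >= 0.95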


2) Completeness SLI

Question: “Did we load all the rows we expected?”

You can compare against a control table or source-system counts:

SELECT
    target.business_date,
    target.row_count      AS loaded_rows,
    src.row_count         AS expected_rows,
    (target.row_count::float / NULLIF(src.row_count, 0)) AS completeness_ratio  -- NULLIF avoids divide-by-zero on empty days
FROM daily_target_counts target
JOIN daily_source_counts src
  ON target.business_date = src.business_date;

Example SLO:

“For the last 30 days, 99% of days had completeness_ratio >= 0.995.”


3) Correctness / Quality SLI

Pick critical business rules, not every minor check:

  • No negative revenue
  • Currency codes valid
  • Foreign keys resolvable
  • No double-counted orders
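
The SLI query below reads from a per-row check table (fact_orders_quality_checks with a quality_pass flag). Here is a minimal sketch of how that table could be derived from rules like these; column names such as revenue_amount and currency_code, and the dim_currencies reference table, are assumptions:

-- Sketch: derive a per-row quality_pass flag from a few critical business rules
CREATE OR REPLACE VIEW fact_orders_quality_checks AS
SELECT
    o.order_id,
    o.business_date,
    (
        o.revenue_amount >= 0                     -- no negative revenue
        AND cur.currency_code IS NOT NULL         -- currency code exists in the reference table
        AND cust.customer_id IS NOT NULL          -- foreign key to dim_customers resolves
    ) AS quality_pass
FROM fact_orders o
LEFT JOIN dim_currencies cur  ON o.currency_code = cur.currency_code
LEFT JOIN dim_customers  cust ON o.customer_id   = cust.customer_id;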

Example:

SELECT
    1 - (COUNT(*) FILTER (WHERE NOT quality_pass))::float 
        / COUNT(*) AS percent_rows_passing
FROM fact_orders_quality_checks
WHERE business_date = CURRENT_DATE - 1;

SLO:

“Over any 30-day window, percent_rows_passing >= 0.999 on 95% of days.”


4.2 Example: SLI/SLO Table for a Daily Revenue Pipeline

| SLI name | Type | SLO | Alert condition |
|---|---|---|---|
| revenue_freshness_sli | Freshness | 95% of days data ready by 08:05 UTC | Today > 08:10 UTC and not finished |
| revenue_completeness_sli | Completeness | 99% of days ≥ 99.5% of expected rows | Today < 99% rows by 08:10 |
| revenue_quality_sli | Correctness | 30-day rolling: 99.9% rows pass rules | Drop below 99.5% on any single day |
| revenue_pipeline_success_sli | Technical | 99% of DAG runs succeed | Failure for 2+ consecutive days |

Notice: pipeline success rate is last, not first.


5. SLIs/SLOs for Streaming & Real-Time Pipelines

For streaming (Kafka, Kinesis, Flink, Spark Streaming, etc.), the emphasis shifts to latency, lag, and drop rate.

5.1 Core Streaming SLIs

  • End-to-end latency: event ingestion → availability in sink (e.g. Redis, ClickHouse)
  • Consumer lag: how far behind the consumer is vs the latest offset
  • Throughput / drop rate: events processed vs received
  • Out-of-order tolerance: how many events arrive too late to be useful

Example SLI:

-- Assuming you store event time & ingest time in a warehouse
SELECT
    APPROX_PERCENTILE(
        DATEDIFF('second', event_time, ingested_at),
        0.95
    ) AS p95_latency_seconds
FROM events_stream
WHERE ingested_at >= DATEADD('minute', -10, CURRENT_TIMESTAMP());

Example SLO:

“For the last 7 days, P95 end-to-end latency < 120 seconds during business hours.”


6. Best Practices: How to Design SLIs/SLOs That Don’t Suck

6.1 Start with 3–5 SLIs per Critical Domain

If you try to measure everything, you’ll measure nothing usefully.

Pick 1–2 key tables or domains (e.g., Revenue, Orders, Experiments) and define:

  • Freshness
  • Completeness
  • One important quality metric

Then expand gradually.


6.2 Make Them Consumer-Visible

If users can’t see the status, they won’t trust it.

  • Expose SLIs in dbt docs, data catalog, or BI dashboards
  • Add “reliability badges” to datasets (e.g., Green: All SLOs met last 7 days)
  • Attach SLO status to key dashboards (e.g., main revenue dashboard)

6.3 Define Clear Ownership

Each domain’s SLOs must have:

  • An owner team (e.g., “Revenue Data Domain Team”)
  • An on-call / primary contact
  • A runbook: what to do when SLO is breached

No owner = no reliability.


6.4 Integrate SLOs into Incident & Planning Process

SLOs are not “nice metrics” — they’re decision tools.

Use them to drive:

  • Alerts: based on SLI breaches, not single-task failures
  • Postmortems: quantify impact via SLO violations, not gut feel
  • Roadmap: if a service consistently burns error budget, invest in reliability work

7. Common Pitfalls (And How to Avoid Them)

Pitfall 1: Vanity SLIs

“Number of Airflow tasks succeeded.”

This tells you nothing about user-visible reliability.

Fix: Only accept SLIs if you can complete the sentence:
“If this SLI is bad, the user feels it because ____.”


Pitfall 2: Overly Aggressive SLOs (e.g., 100% everything)

  • 100% freshness
  • 0 incidents
  • 100% checks passing

You’ll always be in violation and people will ignore the metrics.

Fix: Choose realistic SLOs that match business impact:

  • Maybe Revenue needs 99.9% completeness
  • But Marketing attribution can live with 98%

Pitfall 3: SLOs You Can’t Measure Automatically

If your SLI needs a spreadsheet and 3 people every month, you’ll stop tracking it.

Fix: Instrument SLIs directly in:

  • Your orchestration layer (Airflow/Dagster/Prefect)
  • Your warehouse (Snowflake, BigQuery, Redshift) with scheduled queries
  • Your monitoring system (Prometheus, Datadog, etc.)

Pitfall 4: Ignoring Downstream Breakage

SLOs should cover end-to-end journeys, not single jobs.

Example failure: Pipeline is fine, but a downstream schema change breaks the main dashboard.

Fix: Add SLIs like:

  • “Number of failed dbt models feeding revenue_mart”
  • “Number of query errors on key BI dashboards per day”
  • “Days since last breaking schema change on core tables”

8. A Lightweight Framework to Implement SLIs/SLOs

You don’t need a full SRE platform to start. Here’s a pragmatic recipe.

Step 1: Pick One Critical Domain

Example: Daily Revenue Reporting

Step 2: Define 3–4 SLIs

  • revenue_freshness_sli
  • revenue_completeness_sli
  • revenue_quality_sli
  • revenue_pipeline_success_sli (optional)

Step 3: Implement SLIs as Warehouse Tables / Views

Create a table like RELIABILITY.REVENUE_SLIS_DAILY:

| date | freshness_minutes | completeness_ratio | quality_ratio | pipeline_success |
|---|---|---|---|---|
| 2025-11-23 | 6 | 0.998 | 0.9992 | TRUE |
| 2025-11-24 | 12 | 0.993 | 0.9990 | FALSE |

Populate it via a scheduled job.
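
A minimal sketch of that scheduled job, assuming the SLI queries from section 4 are exposed as single-row views for yesterday's business date (the view names mirror the SLI names above; the pipeline status view is hypothetical):

-- Runs once per day after the pipeline's cutoff
INSERT INTO RELIABILITY.REVENUE_SLIS_DAILY
    (date, freshness_minutes, completeness_ratio, quality_ratio, pipeline_success)
SELECT
    CURRENT_DATE - 1           AS date,
    f.minutes_delay            AS freshness_minutes,
    c.completeness_ratio       AS completeness_ratio,
    q.percent_rows_passing     AS quality_ratio,
    p.last_run_succeeded       AS pipeline_success
FROM revenue_freshness_sli          f
CROSS JOIN revenue_completeness_sli c
CROSS JOIN revenue_quality_sli      q
CROSS JOIN revenue_pipeline_status  p;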

Step 4: Add SLO Evaluation Logic

Either:

  • A scheduled SQL job that computes rolling windows and flags “breach” (see the sketch after this list), OR
  • Alerts in your monitoring tool that watch these metrics directly
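
A minimal sketch of the first option, reusing RELIABILITY.REVENUE_SLIS_DAILY and the example SLO thresholds from section 4:

-- Flag SLO breaches over a rolling 30-day window
SELECT
    AVG(CASE WHEN freshness_minutes  <= 10    THEN 1.0 ELSE 0.0 END) < 0.95  AS freshness_slo_breached,
    AVG(CASE WHEN completeness_ratio >= 0.995 THEN 1.0 ELSE 0.0 END) < 0.99  AS completeness_slo_breached,
    AVG(quality_ratio) < 0.999                                               AS quality_slo_breached
FROM RELIABILITY.REVENUE_SLIS_DAILY
WHERE date >= CURRENT_DATE - 30;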

Step 5: Show It to Users

  • Simple Looker/Power BI/Tableau dashboard: “Revenue Data Reliability”
  • Green / yellow / red indicators
  • Link this to the main revenue dashboard so business users see status at a glance

9. Internal Link Ideas (for a Blog / Docs Site)

You can cross-link this article to:

  • “Data Contracts for Analytics and ML”
  • “Designing Data Quality Checks that Actually Catch Real Issues”
  • “How to Build a Data Incident Management Process (for Analytics Teams)”
  • “Dagster/Airflow Patterns for Observability and Monitoring”
  • “Choosing the Right Freshness and Latency Targets for Your Data Products”

10. Conclusion & Takeaways

SLIs and SLOs for data pipelines are not a buzzword — they’re how you translate “vibes” into numbers and stop arguing about whether your data is “good enough.”

Key points:

  • Start from user expectations, not what’s easy to log
  • Focus on freshness, completeness, and correctness for your most critical domains
  • Design SLIs that are automated, visible, and owned
  • Use error budgets to drive priorities, not just intuition
  • Avoid vanity metrics and unreachable “99.999% of everything” targets

If your dashboards and models are business-critical, they deserve the same level of reliability discipline as production APIs. SLIs/SLOs are the way to get there.

Call to action:
Pick one critical pipeline today (probably revenue or orders). Define just 3 SLIs, give them realistic SLOs, and wire up a basic dashboard. Once people start asking, “Why don’t we have this for other datasets?” — you’ll know you’re on the right track.
