Retry policies (exponential backoff with filters & deadline)

When to use

Calls to flaky I/O (HTTP, DB, object storage) that usually succeed on retry.
You need limits: max retries, deadline, exception filters, and jitter.
You want a drop-in decorator around functions/methods.

Avoid when the operation isn’t idempotent/safe to re-run, or freshness must be exact.
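
When an operation is not naturally idempotent, one common guard is an idempotency key: the caller sends the same key on every attempt, and the receiver replays the stored result instead of repeating the side effect. The sketch below is illustrative only; PaymentServer, charge_once, charge_with_retry, and the "order-123" key are hypothetical names, and the in-memory dict stands in for whatever durable store a real service would use.

```python
from __future__ import annotations


class PaymentServer:
    """Hypothetical in-memory stand-in for a service that deduplicates by key."""

    def __init__(self) -> None:
        self.ledger: list[int] = []          # one entry per real charge
        self._receipts: dict[str, int] = {}  # idempotency key -> receipt number

    def charge(self, key: str, amount: int) -> int:
        if key in self._receipts:            # repeat of a request already applied
            return self._receipts[key]
        self.ledger.append(amount)
        self._receipts[key] = len(self.ledger)
        return self._receipts[key]


server = PaymentServer()
responses_to_drop = {"n": 1}  # simulate: the first response is lost after the charge lands


def charge_once(key: str, amount: int) -> int:
    receipt = server.charge(key, amount)
    if responses_to_drop["n"] > 0:
        responses_to_drop["n"] -= 1
        raise TimeoutError("response lost")
    return receipt


def charge_with_retry(key: str, amount: int, retries: int = 2) -> int:
    for attempt in range(retries + 1):
        try:
            return charge_once(key, amount)  # same key on every attempt
        except TimeoutError:
            if attempt == retries:
                raise
    raise ValueError("retries must be >= 0")


receipt = charge_with_retry("order-123", 500)
```

The first attempt charges and then times out; the retry reuses the key, so the ledger still holds a single entry.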

Diagram (text)

call() ── fail ──> sleep(jitter(min(base * 2^n, max_delay))) ── retry ──> ... ──> success or give up (retries/deadline)
            ↑ only for selected exceptions (e.g., TimeoutError)
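
To make the schedule in the diagram concrete, here is a small sketch that lists the un-jittered sleep before each retry. The helper name backoff_schedule is an assumption for illustration; the values used are the decorator's defaults (base=0.2, max_delay=2.0).

```python
from typing import List


def backoff_schedule(retries: int, base: float, max_delay: float) -> List[float]:
    """Un-jittered sleep before each retry: min(base * 2**n, max_delay)."""
    return [min(base * 2 ** n, max_delay) for n in range(retries)]


schedule = backoff_schedule(6, base=0.2, max_delay=2.0)
print(schedule)  # [0.2, 0.4, 0.8, 1.6, 2.0, 2.0]
```

The delay doubles until it reaches max_delay and then stays flat, which bounds the worst-case wait per retry.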

Python example (≤40 lines, type-hinted)

A compact decorator: exponential backoff, optional jitter, exception filtering, deadline.

from __future__ import annotations
import time
from functools import wraps
from typing import Callable, TypeVar, Tuple

T = TypeVar("T")

def retry(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
          exceptions: Tuple[type[BaseException], ...] = (Exception,),
          jitter: Callable[[float], float] = lambda d: d,
          sleep: Callable[[float], None] = time.sleep,
          now: Callable[[], float] = time.monotonic,  # monotonic: immune to wall-clock jumps
          deadline: float | None = None) -> Callable[[Callable[..., T]], Callable[..., T]]:
    """Retry fn on `exceptions`, up to `retries` times and within `deadline` seconds."""
    def deco(fn: Callable[..., T]) -> Callable[..., T]:
        @wraps(fn)
        def wrapper(*a, **kw) -> T:
            delay, start = base, now()
            for attempt in range(retries + 1):
                try:
                    return fn(*a, **kw)
                except exceptions:
                    pause = jitter(min(delay, max_delay))  # the sleep we would actually take
                    # Give up when out of retries or when the next sleep would overrun the deadline.
                    if attempt == retries or (deadline is not None and now() - start + pause > deadline):
                        raise
                    sleep(pause)
                    delay = min(delay * 2, max_delay)
            raise ValueError("retries must be >= 0")  # only reached when the loop never runs
        return wrapper
    return deco

Usage & tiny pytest-style checks

def test_succeeds_with_backoff():
    calls, sleeps = {"n": 0}, []
    @retry(retries=5, base=0, sleep=lambda d: sleeps.append(d), exceptions=(TimeoutError,))
    def flaky():
        calls["n"] += 1
        if calls["n"] < 3: raise TimeoutError()
        return 42
    assert flaky() == 42 and len(sleeps) == 2

def test_filters_exceptions():
    import pytest
    @retry(retries=2, base=0, exceptions=(TimeoutError,))
    def bad():
        raise ValueError("no retry")  # not in exceptions
    with pytest.raises(ValueError): bad()

def test_deadline_stops_early():
    import pytest
    t = {"v": 0.0}
    def now(): return t["v"]
    def advance(d): t["v"] += d
    @retry(retries=10, base=1.0, max_delay=8.0, deadline=2.5, now=now, sleep=advance, exceptions=(TimeoutError,))
    def always_timeout(): raise TimeoutError()
    with pytest.raises(TimeoutError): always_timeout()

Trade-offs & pitfalls

Pros: Robust against transient failures; tunable; easy to apply at call sites.
Cons: Can hide real errors; increases latency; adds complexity.
Pitfalls / anti-patterns:
- Retrying non-idempotent ops (double charges, duplicate inserts).
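- Catching broad Exception (the decorator's default above): pass only specific transient types such as TimeoutError or ConnectionError.
- Infinite or too-long retries: always set retries and/or a deadline, and log each retry.
- Sleeping on the main thread in services: time.sleep blocks the caller, so prefer async/backoff primitives or background tasks.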
- No jitter → thundering herds. Use full jitter (random(0, delay)) in scaled systems.
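
For the blocking-sleep pitfall, an async variant of the same idea might look like the sketch below. It is a minimal illustration, not a port of the full decorator: async_retry is a hypothetical name, and jitter and deadline handling are left out to keep it short. The point is that awaiting asyncio.sleep yields to the event loop instead of blocking the thread.

```python
from __future__ import annotations
import asyncio
from functools import wraps
from typing import Awaitable, Callable, Tuple, TypeVar

T = TypeVar("T")


def async_retry(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
                exceptions: Tuple[type[BaseException], ...] = (TimeoutError,),
                sleep: Callable[[float], Awaitable[None]] = asyncio.sleep):
    def deco(fn: Callable[..., Awaitable[T]]) -> Callable[..., Awaitable[T]]:
        @wraps(fn)
        async def wrapper(*a, **kw) -> T:
            delay = base
            for attempt in range(retries + 1):
                try:
                    return await fn(*a, **kw)
                except exceptions:
                    if attempt == retries:
                        raise
                    # Other tasks keep running while this one waits.
                    await sleep(min(delay, max_delay))
                    delay = min(delay * 2, max_delay)
            raise ValueError("retries must be >= 0")
        return wrapper
    return deco


calls = {"n": 0}


@async_retry(retries=3, base=0)
async def flaky() -> int:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError()
    return 42


result = asyncio.run(flaky())
```

Here flaky fails twice and succeeds on the third call, so result is 42 after two awaited sleeps.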

Pythonic alternatives

Libraries: tenacity, backoff (rich policies: jitter, stop/wait conditions, exception filters, async support).
Decorator stacks: combine with your metrics decorator to time success/fail attempts.
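Context managers for scoped retries on blocks if setup/teardown matters. A plain with block cannot re-run its own body, so in practice this takes a for/with shape (for attempt in attempts(...): with attempt: ...), the pattern tenacity's Retrying uses.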
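
A scoped retry over a block can be sketched with the standard library alone. Because a with statement runs its body exactly once, the loop has to live outside it: a generator hands out one context manager per attempt, and each one suppresses retryable errors unless it is the last. The names Attempt and attempts are hypothetical, and jitter and deadline are omitted for brevity.

```python
from __future__ import annotations
import time
from typing import Callable, Iterator, Tuple


class Attempt:
    """Context manager for one try; swallows retryable errors unless it is the last attempt."""

    def __init__(self, number: int, last: bool,
                 exceptions: Tuple[type[BaseException], ...]) -> None:
        self.number, self.last, self.exceptions = number, last, exceptions
        self.failed = False

    def __enter__(self) -> "Attempt":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.failed = exc_type is not None and issubclass(exc_type, self.exceptions)
        # Returning True suppresses the exception so the for-loop can try again.
        return self.failed and not self.last


def attempts(*, retries: int = 3, base: float = 0.2, max_delay: float = 2.0,
             exceptions: Tuple[type[BaseException], ...] = (Exception,),
             sleep: Callable[[float], None] = time.sleep) -> Iterator[Attempt]:
    delay = base
    for n in range(retries + 1):
        attempt = Attempt(n, last=(n == retries), exceptions=exceptions)
        yield attempt
        if not attempt.failed:
            return  # body completed without a retryable error
        sleep(min(delay, max_delay))
        delay = min(delay * 2, max_delay)


calls, sleeps = {"n": 0}, []
for attempt in attempts(retries=5, base=0.1, exceptions=(TimeoutError,), sleep=sleeps.append):
    with attempt:
        calls["n"] += 1
        if calls["n"] < 3:
            raise TimeoutError()
```

Setup and teardown that must run on every attempt go inside the with body; anything that should happen once goes outside the for loop.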

Mini exercise

Add full jitter:

import random
full_jitter = lambda d: random.uniform(0, d)

Use jitter=full_jitter.
Add on_retry(attempt, exc, delay) callback to the decorator to emit logs/metrics.
Write tests that: (1) callback is called with expected attempt counts, (2) non-retryable exceptions bypass sleeps.

Checks (quick checklist)

Retries limited by count and optionally a deadline.
Exception filter lists only transient errors.
Backoff increases up to max_delay with jitter.
Idempotency considered; non-idempotent ops guarded.
Tests cover success-after-retry, filter, and deadline paths.
