Nagios for Data & DevOps Engineers: Turning Monitoring into a Predictable System, Not a Fire Drill
When your pipeline dies at 3 a.m., you don’t care how “modern” your stack is — you care who told you first and what exactly broke. Nagios has been that “who” for more than a decade: a brutally simple, scriptable monitoring engine that many enterprises still trust to watch everything from legacy Oracle boxes to shiny Kubernetes nodes.
This article gives you a data/DevOps engineer’s view of Nagios: what it is, how it works, how to wire checks for your jobs and databases, and what usually goes wrong.
What is Nagios, Really?
At its core, Nagios is an event-driven monitoring engine:
- Periodically runs plugins (scripts/binaries) called “checks”
- Compares their exit codes/output against thresholds
- Raises alerts (email/SMS/webhooks/etc.) when things go bad
- Keeps history and status views in a web UI (Core or XI)
Two main flavors you’ll see:
| Edition | What it is | Typical Use Case |
|---|---|---|
| Nagios Core | Open-source engine + basic web UI | DIY setups, heavy scripting, low cost |
| Nagios XI | Commercial product built on Core, with wizards, dashboards, reports | Enterprises needing support, multi-tenancy, SLAs |
If you can run a script and parse text, you can monitor it with Nagios.
Nagios Architecture in Plain English
Think of Nagios as a central control tower with agents and scripts scattered across your infrastructure.
Core Building Blocks
- Nagios Server
  - Runs the Nagios Core engine
  - Schedules checks and processes their results
  - Hosts the web UI and stores status/history
- Plugins
  - Small programs/scripts (C, Bash, Python, etc.)
  - Output one line of text plus optional performance data, and exit with a status code:
    - `0` – OK
    - `1` – WARNING
    - `2` – CRITICAL
    - `3` – UNKNOWN
  - Vast ecosystem of community plugins, plus your own custom checks
- Remote Agents / Add-ons
  - NRPE (Nagios Remote Plugin Executor): runs plugins on remote Linux/Unix hosts
  - NSClient++: Windows agent, often used together with NRPE
  - NRDP / NSCA: push-style mechanisms that send passive check results back to Nagios
- Objects / Config
  - `hosts` – machines/services you care about
  - `services` – what you check on them (CPU, DB, job status, API health)
  - `contacts` / notifications – who gets paged and when
  - `commands` – how checks are executed (e.g., `check_nrpe`, `check_http`)
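These object types translate into plain-text config blocks. A minimal sketch of a host and a contact (the hostname, address, email, and notification command names are illustrative, not from a real setup):

```
define host{
    host_name           etl-orchestrator01
    address             10.0.0.42
    check_command       check-host-alive
    max_check_attempts  3
}

define contact{
    contact_name                    data-oncall
    email                           oncall@example.com
    service_notification_commands   notify-service-by-email
    host_notification_commands      notify-host-by-email
}
```

Services and commands follow the same `define` syntax; concrete examples appear later in this article.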
High-Level Flow
- The scheduler decides: “Time to check service X on host Y.”
- It executes a command (e.g., `check_nrpe -H hostY -c check_load`).
- It receives status + text + perf data from the plugin.
- It updates state and runs event handlers if needed (e.g., restart a service).
- It triggers notifications as state changes (OK → CRITICAL, etc.).
Example: Monitoring a Data Pipeline Job with Nagios
Let’s say you have a daily ETL that loads data into Snowflake or a warehouse. You want Nagios to yell if:
- The job didn’t run
- Runtime is > 1 hour
- Row count delta looks suspicious
1. Write a Simple Check Plugin
Example in Python (simplified):
```python
#!/usr/bin/env python3
import sys
import json
from datetime import datetime, timedelta
from pathlib import Path

STATE_OK = 0
STATE_WARNING = 1
STATE_CRITICAL = 2
STATE_UNKNOWN = 3

def main():
    status_file = Path("/var/lib/etl/job_status.json")

    if not status_file.exists():
        print("CRITICAL - status file missing")
        sys.exit(STATE_CRITICAL)

    try:
        data = json.loads(status_file.read_text())
    except Exception as e:
        print(f"UNKNOWN - cannot parse status file: {e}")
        sys.exit(STATE_UNKNOWN)

    last_run = datetime.fromisoformat(data["last_run"])
    duration_sec = data["duration_sec"]
    row_count = data["rows_loaded"]

    now = datetime.utcnow()

    # Stale job: no run recorded within the last 25 hours
    if now - last_run > timedelta(hours=25):
        print(f"CRITICAL - job stale, last run {last_run.isoformat()}")
        sys.exit(STATE_CRITICAL)

    # Slow job: runtime over one hour
    if duration_sec > 3600:
        print(f"WARNING - runtime {duration_sec}s > 3600s | runtime={duration_sec}s")
        sys.exit(STATE_WARNING)

    print(f"OK - rows={row_count}, runtime={duration_sec}s | rows={row_count} runtime={duration_sec}s")
    sys.exit(STATE_OK)

if __name__ == "__main__":
    main()
```
Key points:
- Exit codes map to Nagios states.
- Single-line message with `|`-separated perf data so Nagios/XI can graph trends.
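The perf-data portion after the `|` follows the format `'label'=value[UOM];[warn];[crit];[min];[max]`, with tokens separated by spaces. A small helper sketch (the `perf_data` function name is mine, not a Nagios API):

```python
def perf_data(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one Nagios performance-data token: 'label'=value[UOM];warn;crit;min;max"""
    return f"'{label}'={value}{uom};{warn};{crit};{minimum};{maximum}"

# Plugin output: human-readable text, then '|', then space-separated perf tokens
msg = "OK - load average fine | " + " ".join([
    perf_data("load1", 0.42, warn=4, crit=8),
    perf_data("runtime", 512, uom="s", warn=3600, crit=7200),
])
print(msg)
```

Unused trailing fields can be left empty, but keep the semicolon positions so graphing tools parse thresholds correctly.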
2. Register the Command in Nagios
On the Nagios server:
```
define command{
    command_name    check_etl_daily
    command_line    /usr/local/nagios/libexec/check_etl_daily.py
}
```
3. Attach It to a Service
```
define service{
    host_name               etl-orchestrator01
    service_description     Daily ETL Job
    check_command           check_etl_daily
    check_interval          10
    retry_interval          5
    max_check_attempts      3
    notification_interval   60
}
```
Now Nagios will:
- Check every 10 minutes
- Retry a few times before alerting
- Send notifications every 60 minutes while the job is broken
NRPE: Monitoring Remote Hosts Without Opening Everything
If your ETL server or DB node is locked down, you don’t want Nagios SSH-ing all over the place. NRPE solves that:
- NRPE daemon runs on the remote host and executes local plugins.
- The `check_nrpe` plugin runs on the Nagios server and talks to the NRPE daemon over a dedicated port (usually 5666) using SSL.
Typical pattern:
- Install NRPE + plugins on the remote host.
- Define allowed Nagios server IPs in `nrpe.cfg`.
- Define commands in `nrpe.cfg`, e.g.:

```
command[check_etl_daily]=/usr/local/nagios/libexec/check_etl_daily.py
```

- On the Nagios server:

```
define command{
    command_name    check_nrpe_etl_daily
    command_line    /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c check_etl_daily
}
```
Now you can monitor internal resources (CPU, logs, job status) without exposing them externally.
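On the remote side, the key `nrpe.cfg` directives are the allow-list and whether remote command arguments are accepted. A hedged sketch (the IP addresses are placeholders):

```
# nrpe.cfg (illustrative values)
allowed_hosts=127.0.0.1,10.0.0.10
# 0 = refuse command arguments sent over the network (safer default)
dont_blame_nrpe=0
command[check_etl_daily]=/usr/local/nagios/libexec/check_etl_daily.py
```

Leaving `dont_blame_nrpe=0` means every check invocation is pinned to the exact command line you defined, which closes off a common injection vector.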
Best Practices for Using Nagios in Modern Data/DevOps Environments
1. Treat Checks as Contracts
Every check should answer a yes/no question:
- “Is this API usable for customers?”
- “Is today’s partition fully loaded?”
- “Is replication lag < 5 minutes?”
Avoid vague checks like “disk usage” with no thresholds; define SLO-aligned thresholds.
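As an example of an SLO-aligned contract, a replication-lag check might look like the sketch below. The thresholds and the way lag is obtained are assumptions; in practice `lag_sec` would come from a query against your database:

```python
WARN_LAG_SEC = 120   # SLO: lag should stay under 2 minutes
CRIT_LAG_SEC = 300   # hard limit: 5 minutes

def check_replication_lag(lag_sec):
    """Return (exit_code, message) per the Nagios plugin contract."""
    perf = f"lag={lag_sec}s;{WARN_LAG_SEC};{CRIT_LAG_SEC}"
    if lag_sec >= CRIT_LAG_SEC:
        return 2, f"CRITICAL - replication lag {lag_sec}s | {perf}"
    if lag_sec >= WARN_LAG_SEC:
        return 1, f"WARNING - replication lag {lag_sec}s | {perf}"
    return 0, f"OK - replication lag {lag_sec}s | {perf}"

code, message = check_replication_lag(lag_sec=42)  # replace 42 with a real measurement
print(message)
```

The check answers exactly one binary question ("is replication lag within SLO?"), which keeps alert routing and escalation unambiguous.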
2. Use Performance Data Aggressively
Perf data (`| key=value`) lets Nagios XI graph trends: CPU, rows, lag, durations. That’s your capacity planning and early warning system.
3. Separate Infra Checks vs. Data Checks
- Infra: CPU, disk, memory, process up/down, network.
- Data: row counts, freshness, error rates, backlog depth.
Keep them as separate services so you know whether the box is sick or the data is.
4. Use Event Handlers – Carefully
Nagios can auto-remediate:
- Restart a failed service
- Trigger a Dagster/Airflow re-run
- Clear a stuck queue
But be explicit; careless auto-restarts can hide systemic issues or cause flapping.
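Wiring an event handler is one extra directive on the service plus a command definition. A sketch (the handler script path is hypothetical; the `$SERVICESTATE$`-style macros are standard Nagios macros):

```
define service{
    host_name           etl-orchestrator01
    service_description Daily ETL Job
    check_command       check_etl_daily
    event_handler       restart_etl_worker
}

define command{
    command_name    restart_etl_worker
    command_line    /usr/local/nagios/libexec/eventhandlers/restart_etl_worker.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}
```

A common pattern is for the handler script to act only on a SOFT CRITICAL state at a specific attempt number, so it fires once during retries instead of on every check.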
5. Design for Scale (Nagios Core / XI)
For large estates:
- Use distributed / redundant Nagios setups (satellite pollers sending passive checks to a central server).
- Offload heavy checks or log analysis to specialized tools (Prometheus, ELK, cloud monitoring) and have Nagios only track the outcome.
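One simple way for a satellite poller or an external tool to report an outcome is a passive check result written to the Nagios external command file. A sketch (the command-file path, host, and service names are illustrative; the function is mine):

```python
import time

def passive_result(host, service, status, output,
                   cmd_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Submit a passive service check result via the Nagios external command file.

    status: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    """
    line = (f"[{int(time.time())}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{status};{output}\n")
    with open(cmd_file, "a") as f:
        f.write(line)
    return line

# passive_result("etl-orchestrator01", "Daily ETL Job", 0, "OK - nightly load done")
```

The receiving service should be configured with passive checks enabled, and usually with a freshness threshold so Nagios alerts when results stop arriving.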
6. Security Hygiene with NRPE & Agents
- Restrict access by IP and use SSL with strong ciphers.
- Only expose strictly necessary commands.
- Monitor the agents themselves (is NRPE alive?).
7. Don’t Over-Alert
Bad Nagios = constant false alarms → everyone ignores it.
- Use warning vs critical thresholds wisely.
- Use notification periods (e.g., non-critical alerts only in business hours).
- Use escalations so on-call doesn’t wake up for noisy but non-critical checks.
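Both knobs are plain object definitions. A sketch (the timeperiod name, schedule, and contact group are examples):

```
define timeperiod{
    timeperiod_name business-hours
    alias           Mon-Fri 09:00-18:00
    monday          09:00-18:00
    tuesday         09:00-18:00
    wednesday       09:00-18:00
    thursday        09:00-18:00
    friday          09:00-18:00
}

define serviceescalation{
    host_name               etl-orchestrator01
    service_description     Daily ETL Job
    first_notification      3
    last_notification       0
    notification_interval   30
    contact_groups          data-platform-leads
}
```

Here the escalation pulls in the leads' group starting at the third notification and keeps them in the loop until recovery (`last_notification 0`).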
Common Pitfalls (And How to Avoid Them)
- “We Just Monitored Everything” → Alert Fatigue
  - Start from business-critical flows (checkout, ETL DWH load, main APIs) and expand from there.
- Single Nagios Server Bottleneck
  - Too many checks scheduled too frequently.
  - Fix: tune intervals, use distributed monitoring, or add pollers.
- Unversioned, Manual Config
  - People editing `/usr/local/nagios/etc/objects/` by hand in prod.
  - Fix: store configs in Git, generate them from templates, and automate deploys.
- Custom Checks with No Exit Codes / Perf Data
  - “It prints something, but Nagios shows UNKNOWN.”
  - Fix: enforce a contract: exit code + machine-parsable perf line.
- Ignoring Trend Data
  - Only reacting to “red state” instead of watching trends (e.g., disks creeping from 70% to 95% over weeks).
  - Fix: use XI graphs and capacity planning reports.
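For the unversioned-config pitfall, even a tiny generator kept in Git beats hand-editing object files. A sketch (the service list, hosts, and command names are made up):

```python
SERVICE_TEMPLATE = """define service{{
    host_name           {host}
    service_description {description}
    check_command       {command}
    check_interval      {interval}
    max_check_attempts  3
}}
"""

SERVICES = [
    {"host": "etl-orchestrator01", "description": "Daily ETL Job",
     "command": "check_etl_daily", "interval": 10},
    {"host": "db-primary01", "description": "Replication Lag",
     "command": "check_nrpe!check_repl_lag", "interval": 5},
]

def render_services(services):
    """Render one define-service block per entry, ready to write to a .cfg file."""
    return "\n".join(SERVICE_TEMPLATE.format(**svc) for svc in services)

if __name__ == "__main__":
    print(render_services(SERVICES))  # redirect to a Git-tracked .cfg, then deploy
```

The service inventory becomes data under version control, and a CI step can run `nagios -v nagios.cfg` against the rendered output before anything reaches production.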
When is Nagios Still a Good Choice?
Despite all the hype around Prometheus, Datadog, etc., Nagios is still solid when:
- You’re in a hybrid / legacy-heavy environment (mainframes, old Unix, one-off appliances).
- You want a script-friendly, generic monitor that can run any executable.
- You need a perpetual-license, on-prem, controlled environment (strict compliance).
- You’re okay trading fancy dashboards for simple, stable, predictable monitoring.
Conclusion & Takeaways
Nagios is not the shiny new toy, but it does one job extremely well:
“Run a command, interpret its result, and alert the right people.”
For a data or DevOps engineer, that’s often exactly what you need.
Key points to remember:
- Think in checks as contracts: each check answers a clear, binary question.
- Use NRPE/agents for deep host checks without exposing internals.
- Push data-aware checks (freshness, row counts, lag) alongside infra checks.
- Keep configs versioned and templated, not handcrafted.
- Use performance data and reports for capacity planning, not just firefighting.
If you wire Nagios thoughtfully into your pipelines and services, it stops being a noisy pager and becomes a reliable early-warning system for your data platform.
Tags
Nagios, Monitoring, DevOps, DataEngineering, NRPE, Infrastructure, Observability, ETL, SRE, Alerting




