Rundeck – Data/ML Engineer Blog

Rundeck: Operations-Focused Job Scheduling and Runbook Automation

Introduction

Most workflow orchestration tools target developers. They assume you’ll write code, manage infrastructure, and handle complexity. Rundeck takes a different approach.

Rundeck was built for operations teams. It’s about giving people self-service access to run jobs without needing deep technical knowledge. Need to restart a service? Clear a cache? Deploy a hotfix? Rundeck lets you define these tasks once and let anyone execute them safely.

This isn’t about building complex data pipelines or ML workflows. Rundeck excels at runbook automation, scheduled maintenance, incident response, and operational tasks. It’s the bridge between “someone needs to SSH into a server” and “click a button in a web interface.”

If your team spends time on repetitive operational tasks, answering the same questions, or manually running scripts, Rundeck might be the solution. This guide explains what it does, when it makes sense, and how it compares to other tools.

What is Rundeck?

Rundeck is an open-source runbook automation and job scheduling platform. It turns command-line operations into self-service web-based tasks.

The core concept is simple. You define jobs that execute commands on remote nodes. These jobs can be simple shell commands, scripts, or complex workflows. Users run jobs through a web interface. Rundeck handles authentication, authorization, logging, and scheduling.

PagerDuty acquired Rundeck in 2020 and continues to develop both the open-source and commercial versions. The open-source edition is fully functional. The enterprise version adds features like clustering, enhanced security, and support.

Rundeck has been around since 2010. It’s mature software with a long track record in production environments.

Core Concepts

Understanding Rundeck means grasping a few key ideas.

Jobs are the fundamental unit. A job defines what to execute, where to execute it, and who can run it. Jobs can be single commands or multi-step workflows.

Nodes are the systems where jobs run. Servers, containers, cloud instances, network devices. Rundeck maintains an inventory of nodes with attributes like hostname, tags, and environment.

Projects organize related jobs and nodes. You might have projects for different applications, environments, or teams. Projects provide isolation and access control.

Access Control determines who can do what. Run specific jobs, view logs, modify job definitions, or manage the system. Rundeck has granular permissions down to individual jobs and nodes.

Schedules trigger jobs automatically. Cron-style scheduling for recurring tasks. Run backups nightly, health checks hourly, reports weekly.

Notifications alert people when jobs succeed or fail. Email, Slack, PagerDuty, webhooks. Know immediately when something goes wrong.

The User Interface

Rundeck’s UI is where most users interact with it. The interface is functional rather than flashy.

The job list shows available jobs. Filter by project, tags, or search terms. Click a job to see details, history, and execution options.

Job execution is straightforward. Click “Run Job Now” and optionally provide parameters. Watch the job run in real time or navigate away and check back later.
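The same run can be triggered programmatically through Rundeck’s HTTP API. The sketch below only builds the request and does not send it. It assumes the job-run endpoint (`/api/{version}/job/{id}/run`) and token header from the API documentation; the base URL, API version number, job ID, token, and option name are all placeholders to replace with values from your own server.

```python
import json
import urllib.request


def build_run_job_request(base_url, api_version, job_id, token, options=None):
    """Build a POST request for the job-run endpoint without sending it."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    # Job options are passed as a JSON object under the "options" key.
    body = json.dumps({"options": options or {}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,  # API token, not a login password
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Placeholder values, all hypothetical.
request = build_run_job_request(
    "http://localhost:4440",
    41,
    "00000000-0000-0000-0000-000000000000",
    "example-token",
    {"environment": "staging"},
)
# Sending it would be: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example safe to run anywhere and makes the URL and payload easy to inspect.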

Activity view shows running and recent jobs. See what’s executing now, what completed, what failed. Filter by date, user, or result.

Node inventory lists all systems Rundeck knows about. View node attributes, run ad-hoc commands, or see which jobs target specific nodes.

Job history provides a complete audit trail. Who ran what job, when, on which nodes, with what result. Drill into individual executions to see full output.

The UI isn’t modern or beautiful, but it’s practical. Operations teams can use it without training.

When Rundeck Makes Sense

Rundeck fits specific scenarios well.

You have operational tasks that non-experts need to run. Support teams need to clear caches. Developers need to deploy to staging. QA needs to refresh test environments. Rundeck gives them controlled access without requiring server credentials.

You want to reduce manual operations. Tasks people run by hand are error-prone. Rundeck standardizes these operations. The same steps execute the same way every time.

You need runbook automation. Incident response often involves predefined procedures. Rundeck turns these runbooks into executable jobs. During an outage, click buttons instead of remembering commands.

You require audit trails. Compliance and security teams want to know who did what. Rundeck logs everything. Track which user ran which job on which systems at what time.

You’re managing traditional infrastructure. Rundeck works great with VMs, physical servers, and legacy systems. SSH-based execution doesn’t require agents or modern container platforms.

You need role-based access control. Different teams need different permissions. Rundeck’s ACL system lets you control access at a fine-grained level.

Common Use Cases

Runbook Automation

Incident response procedures often exist as documentation. Step 1: Check logs. Step 2: Restart service. Step 3: Verify health. Step 4: Notify team.

Rundeck turns this documentation into executable workflows. Create a job that runs each step. Attach error handlers to steps that can fail so the workflow reacts to results. Log everything automatically.

When an incident happens, responders run the job instead of following text instructions. Faster response, fewer mistakes, complete audit trail.

Self-Service Operations

Support teams constantly ask engineering for help. “Can you restart the API server?” “Can you clear the Redis cache?” “Can you redeploy staging?”

These requests interrupt engineers and slow down support. With Rundeck, support runs these operations themselves. Engineers define the jobs once with proper safety checks. Support executes them as needed.

This pattern works across organizations. QA teams can refresh test data. Marketing can trigger report generation. DevOps can delegate routine tasks.

Scheduled Maintenance

Many tasks run on schedules. Database backups, log rotation, certificate renewal, cleanup jobs, health checks.

Rundeck handles this scheduling. Define jobs and set cron schedules. Jobs run automatically. Failures trigger notifications. Success is logged.

Unlike simple cron jobs, Rundeck provides visibility. See which scheduled jobs ran, which failed, how long they took. Get historical trends.
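As a rough illustration of that visibility, the sketch below summarizes a list of execution records. The record shape (`status`, `duration_s`) and the sample data are hypothetical; real data would come from Rundeck’s execution history.

```python
from collections import Counter
from statistics import mean


def summarize(executions):
    """Summarize execution records into counts, success rate, and mean duration."""
    total = len(executions)
    statuses = Counter(e["status"] for e in executions)
    return {
        "total": total,
        "failed": statuses.get("failed", 0),
        "success_rate": statuses.get("succeeded", 0) / total if total else 0.0,
        "avg_duration_s": mean(e["duration_s"] for e in executions) if total else 0.0,
    }


# Hypothetical history for one nightly backup job.
history = [
    {"status": "succeeded", "duration_s": 10},
    {"status": "succeeded", "duration_s": 20},
    {"status": "failed", "duration_s": 30},
    {"status": "succeeded", "duration_s": 40},
]
summary = summarize(history)
```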

Change Management

Production changes need coordination. Who made the change? What exactly changed? When did it happen? Was it authorized?

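Rundeck integrates into change management processes. Access controls restrict who can execute sensitive jobs, and approval steps can be added through integrations with a ticketing system. Changes are logged with full context. Ticket references tie each execution back to the request that authorized it.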

Remote Command Execution

Sometimes you just need to run commands across multiple servers. Update a config file, check disk space, restart a service.

Rundeck’s ad-hoc command feature handles this. Select target nodes by tags or filters. Enter the command. Execute across dozens or hundreds of systems. See results aggregated in one place.

Architecture and Components

Rundeck’s architecture is straightforward.

The Rundeck Server is the main component. It hosts the web UI, API, and job scheduler. The server stores job definitions, execution history, and configuration.

The Database stores persistent data. Job definitions, execution logs, user accounts, and node inventory. Rundeck supports H2 (embedded), MySQL, PostgreSQL, and Oracle.

Node Executors handle command execution on remote systems. The SSH executor is most common. It connects via SSH and runs commands. The WinRM executor handles Windows systems. Custom executors can integrate with other systems.

File Copiers transfer files to nodes before job execution. Scripts, config files, or data files. The SCP copier uses SSH. The WinRM copier handles Windows.

Resource Model Sources populate the node inventory. Pull node lists from cloud providers, CMDBs, or custom sources. Keep the inventory up to date automatically.

Plugins extend functionality. Execution plugins, notification plugins, logging plugins, workflow steps. The plugin ecosystem adds features without modifying core Rundeck.

Job Definition and Workflows

Jobs in Rundeck can be simple or complex.

A basic job runs one command on selected nodes:

- defaultTab: nodes
  description: 'Restart Apache service'
  executionEnabled: true
  loglevel: INFO
  name: Restart Apache
  nodeFilterEditable: false
  plugins:
    ExecutionLifecycle: null
  scheduleEnabled: true
  sequence:
    commands:
    - exec: sudo systemctl restart apache2
    keepgoing: false
    strategy: node-first
  uuid: a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d

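This job runs sudo systemctl restart apache2. The definition shown has no node filter, so as written it executes on the Rundeck server itself. Add a nodefilters section to target remote nodes.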

Multi-step workflows chain commands together:

sequence:
  commands:
  - exec: echo "Starting deployment"
  - exec: git pull origin main
  - exec: npm install
  - exec: npm run build
  - exec: sudo systemctl restart app
  - exec: curl -f http://localhost:3000/health
  keepgoing: false
  strategy: node-first

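Each step runs in order. Because keepgoing is false, execution stops at the first step that fails. The node-first strategy runs every step on one node before moving on to the next node.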
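The stop-on-failure behavior can be sketched in a few lines. This is an illustration of the keepgoing idea, not Rundeck’s implementation; the step names and the `run_sequence` helper are hypothetical.

```python
def run_sequence(steps, keepgoing=False):
    """Run (name, callable) steps in order; each callable returns an exit code."""
    results = []
    for name, step in steps:
        code = step()
        results.append((name, code))
        if code != 0 and not keepgoing:
            break  # stop at the first failing step
    succeeded = all(code == 0 for _, code in results)
    return succeeded, results


# Hypothetical steps: the build fails, so the restart never runs.
steps = [
    ("pull", lambda: 0),
    ("build", lambda: 1),
    ("restart", lambda: 0),
]
ok, executed = run_sequence(steps)
```

With keepgoing=True the same call would run all three steps and still report the sequence as failed.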

Jobs can call other jobs. Build reusable components:

sequence:
  commands:
  - jobref:
      name: Backup Database
      nodeStep: 'false'
  - jobref:
      name: Deploy Application
      nodeStep: 'false'
  - jobref:
      name: Smoke Test
      nodeStep: 'false'

This job orchestrates three other jobs in sequence.

Node Filtering and Targeting

Rundeck executes jobs on nodes matching filters. You can target nodes by various attributes.

Tags are the simplest approach. Tag nodes with attributes like environment, role, or application:

tags: production,webserver,us-east

Filter jobs to run only on matching nodes:

tags: production+webserver

This runs on nodes tagged with both “production” and “webserver.”
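To make the semantics concrete, here is a small matcher for tag filters. It is a sketch, not Rundeck’s code: it assumes + means “all of these tags” and a comma means “any of these alternatives”, and it treats a filter that mixes both as an OR of AND groups. The node names are hypothetical.

```python
def matches_tag_filter(node_tags, tag_filter):
    """Return True if the node's tags satisfy a filter like 'a+b' or 'a,b'.

    '+' joins tags that must all be present; ',' separates alternatives.
    """
    node_tags = set(node_tags)
    for alternative in tag_filter.split(","):
        required = {tag.strip() for tag in alternative.split("+") if tag.strip()}
        if required and required <= node_tags:
            return True
    return False


# Hypothetical inventory.
nodes = {
    "web-01": ["production", "webserver", "us-east"],
    "db-01": ["production", "database"],
    "web-stg-01": ["staging", "webserver"],
}
selected = [
    name for name, tags in nodes.items()
    if matches_tag_filter(tags, "production+webserver")
]
```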

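Attributes provide more granular filtering: node hostname, operating system, environment, or custom attributes. Values are matched as regular expressions rather than shell globs, so a wildcard is written as .* and literal dots are escaped:

hostname: web-.*\.example\.com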
os-family: unix
environment: production

Regular expressions enable complex patterns:

hostname: (web|api)-prod-[0-9]+\.example\.com
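It is worth testing a pattern against sample hostnames before using it in a job. Rundeck evaluates filters with Java regular expressions; the check below uses Python’s re module, which behaves the same for the simple constructs in this pattern. Whole-string matching is an assumption here, and the hostnames are made up.

```python
import re

PATTERN = re.compile(r"(web|api)-prod-[0-9]+\.example\.com")

# Hypothetical hostnames to check the pattern against.
hostnames = [
    "web-prod-01.example.com",
    "api-prod-7.example.com",
    "db-prod-01.example.com",
    "web-staging-01.example.com",
    "web-prod-01xexample.com",  # the escaped dot rejects this one
]
matching = [h for h in hostnames if PATTERN.fullmatch(h)]
```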

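Execution settings determine how the matched nodes are processed. The thread count controls how many nodes run at the same time: one at a time, several in parallel, or all at once. Rank ordering controls the sequence in which nodes are visited.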
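The effect of limiting concurrency can be sketched with a thread pool. This illustrates the idea only; `dispatch`, `fake_restart`, and the node names are hypothetical, and a pool gives a rolling window of busy workers rather than strict batches.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(nodes, run_on_node, threadcount=1):
    """Run run_on_node for each node, at most `threadcount` at a time."""
    with ThreadPoolExecutor(max_workers=threadcount) as pool:
        # map() yields results in the same order as `nodes`.
        return dict(zip(nodes, pool.map(run_on_node, nodes)))


def fake_restart(node):
    """Stand-in for a remote command; a real executor would connect over SSH."""
    return f"ok:{node}"


nodes = ["web-01", "web-02", "web-03"]
results = dispatch(nodes, fake_restart, threadcount=2)
```

A thread count of 1 processes nodes strictly one after another, while a thread count equal to the number of nodes runs them all at once.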

Access Control and Security

Rundeck takes security seriously. The ACL system controls who can do what.

Project-level ACLs grant permissions within a project. Run jobs, view history, modify jobs, manage nodes. Different users get different rights.

System-level ACLs control Rundeck administration. Create projects, manage users, configure plugins, access system logs.

Job-level ACLs restrict access to specific jobs. Some jobs are sensitive. Only certain users or groups should run them.

ACLs use a YAML format:

description: Allow developers to run jobs but not modify them
context:
  project: 'production'
for:
  job:
    - allow: [read, run]
  node:
    - allow: [read, run] # run is needed to execute jobs on these nodes
by:
  group: developers
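Policies like this are evaluated with two rules of thumb: nothing is allowed unless a policy grants it, and an explicit deny overrides any allow. The sketch below models that logic with plain dictionaries. It is a simplified illustration rather than Rundeck’s evaluator, and the rule structure and the contractors group are hypothetical.

```python
def is_authorized(rules, user_groups, resource_type, action):
    """Default-deny check: an explicit deny wins over any allow."""
    allowed = False
    for rule in rules:
        if rule["group"] not in user_groups:
            continue  # rule does not apply to this user
        grants = rule.get("for", {}).get(resource_type, {})
        if action in grants.get("deny", []):
            return False
        if action in grants.get("allow", []):
            allowed = True
    return allowed


# Hypothetical rules shaped like the policy above.
rules = [
    {"group": "developers", "for": {"job": {"allow": ["read", "run"]}}},
    {"group": "contractors", "for": {"job": {"deny": ["run"]}}},
]
can_run = is_authorized(rules, {"developers"}, "job", "run")
can_update = is_authorized(rules, {"developers"}, "job", "update")
```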

Authentication integrates with existing systems. LDAP, Active Directory, SSO providers. Don’t create duplicate user accounts.

Audit logging tracks all actions. User logins, job executions, configuration changes. Satisfy compliance requirements and investigate incidents.

Notifications and Integrations

Rundeck notifies stakeholders about job results.

Email notifications are straightforward. Send emails on success, failure, or both. Include job output in the message or link to the execution.

Slack integration posts messages to channels. Alert teams in their communication tool. Include job status, execution time, and output.

PagerDuty integration creates incidents for failed jobs. Critical jobs can trigger on-call alerts automatically.

Webhooks call external APIs. Trigger other systems based on job results. Update ticketing systems, trigger deployments, or call custom services.

Log streaming sends execution logs to external systems. Splunk, Elasticsearch, or cloud logging services. Centralize logs for analysis.

Notifications can include rich context. Job name, execution time, user who triggered it, node details, and full output.

Scaling and High Availability

Single Rundeck instances handle moderate load. Hundreds of jobs, thousands of executions daily, dozens of concurrent users.

Larger deployments need more capacity. Running multiple Rundeck servers against a shared database provides high availability and load distribution, and this is what cluster mode is for.

Cluster mode (enterprise feature) enables active-active deployments. Multiple Rundeck servers work together. Job scheduling distributes across cluster members. Failover is automatic.

Database performance matters at scale. Use proper database servers rather than embedded H2. MySQL or PostgreSQL with adequate resources. Index key tables for query performance.

Execution history accumulates over time. Old executions consume database space. Clean up old records periodically. Archive important executions before deletion.
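A retention rule is easy to express as a cutoff date. The sketch below picks the executions older than a given age from a list of records; the record fields (`id`, `ended`), the sample data, and the 60-day default are assumptions for illustration. Archiving and deletion would happen in a separate step.

```python
from datetime import datetime, timedelta, timezone


def select_expired(executions, now, max_age_days=60):
    """Return the IDs of executions that ended before the retention cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [e["id"] for e in executions if e["ended"] < cutoff]


# Hypothetical execution records.
executions = [
    {"id": 101, "ended": datetime(2024, 1, 15, tzinfo=timezone.utc)},
    {"id": 102, "ended": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"id": 103, "ended": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
expired = select_expired(executions, now)
```

Passing the current time in as an argument keeps the function deterministic and easy to test.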

Plugin performance varies. Some plugins are resource-intensive. Monitor plugin execution time and impact.

Comparison with Other Tools

Rundeck vs Ansible

Ansible is a configuration management and automation tool. It’s powerful and flexible.

Rundeck provides a UI for running operations. Ansible is primarily command-line driven.

Rundeck has built-in scheduling, access control, and audit logging. Ansible needs additional tools, such as AWX or Ansible Automation Platform, for these features.

Ansible is better for configuration management and complex automation. Rundeck is better for giving non-experts controlled access to operations.

Many teams use both. Ansible playbooks define the automation logic. Rundeck provides the interface for running them.

Rundeck vs Jenkins

Jenkins is a CI/CD platform. It automates build, test, and deployment workflows.

Jenkins targets development workflows. Rundeck targets operational workflows.

Jenkins has a massive plugin ecosystem focused on software delivery. Rundeck’s plugins focus on operations and remote execution.

Jenkins pipelines are complex and powerful. Rundeck jobs are simpler and more focused.

Use Jenkins for CI/CD. Use Rundeck for runbook automation and operational tasks. Some teams use both for different purposes.

Rundeck vs Airflow

Airflow orchestrates data workflows. It’s designed for ETL, data pipelines, and analytics.

Airflow is code-first. You write Python to define workflows. Rundeck is UI-first. You click to create jobs.

Airflow targets data engineers. Rundeck targets operations teams.

Airflow handles complex dependencies and large-scale data processing. Rundeck handles operational tasks and runbook automation.

Different tools for different problems. Don’t use Rundeck for data pipelines. Don’t use Airflow for server maintenance.

Rundeck vs Terraform

Terraform provisions infrastructure as code. It creates, modifies, and destroys cloud resources.

Terraform manages infrastructure state. Rundeck executes operations on existing infrastructure.

Terraform is declarative. Describe desired state, Terraform makes it happen. Rundeck is imperative. Define steps, Rundeck executes them.

Use Terraform to build infrastructure. Use Rundeck to operate it.

Challenges and Limitations

Rundeck isn’t perfect. Several issues come up in practice.

Not designed for data workflows. If you’re building ETL pipelines or ML workflows, better tools exist. Rundeck’s strength is operational tasks, not data processing.

UI feels dated. The interface works but isn’t modern. Users familiar with contemporary web apps might find it clunky.

Limited dependency management. Jobs can call other jobs, but complex DAGs are awkward. Tools like Airflow handle dependencies better.

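SSH-based execution can be slow. Establishing SSH connections has overhead. Running many small tasks across hundreds of nodes takes time. Agent-based tools that keep a persistent connection to each node can be faster. Ansible is also agentless and SSH-based, so it shares this overhead.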

Plugin quality varies. The plugin ecosystem is smaller than Jenkins or Airflow. Some plugins are well-maintained, others are abandoned.

Learning curve for ACLs. The access control system is powerful but complex. Getting permissions right takes time and understanding.

Scaling challenges. Very large deployments need enterprise features. The open-source version has limitations for massive scale.

Best Practices

Here’s what works in production environments.

Organize jobs into logical projects. Don’t dump everything into one project. Separate by application, environment, or team. This improves navigation and access control.

Use meaningful job names and descriptions. Future users won’t know what “Job 123” does. Clear names and descriptions make the system usable.

Tag nodes consistently. Establish tagging conventions and enforce them. Inconsistent tags make node filtering difficult.

Implement proper access control. Don’t give everyone admin rights. Use ACLs to grant minimum necessary permissions.

Set up notifications for important jobs. Critical jobs should alert on failure. Configure appropriate notification channels.

Document jobs thoroughly. Use the description field. Explain what the job does, when to run it, and what to do if it fails.

Test jobs in non-production first. Create staging projects that mirror production. Test changes before deploying.

Clean up old executions. Don’t let execution history grow forever. Archive or delete old records periodically.

Monitor Rundeck itself. Track job success rates, execution times, and system health. Set up alerts for Rundeck failures.

Version control job definitions. Export jobs to YAML and commit them to Git. Track changes over time.
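For the version control practice, job definitions can be pulled from the API on a schedule and committed by a script. The sketch below only constructs the export URL. It assumes the project jobs-export endpoint from the API documentation; the base URL, API version, and project name are placeholders.

```python
from urllib.parse import urlencode


def jobs_export_url(base_url, api_version, project, fmt="yaml"):
    """Build the URL that exports all job definitions in a project."""
    query = urlencode({"format": fmt})
    return f"{base_url.rstrip('/')}/api/{api_version}/project/{project}/jobs/export?{query}"


# Placeholder values, all hypothetical.
export_url = jobs_export_url("http://localhost:4440/", 41, "production")
```

Fetching this URL with an API token and writing the response to a file in a Git repository gives a reviewable history of job changes.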

Getting Started

Setting up Rundeck is straightforward.

Download and install:

# Using package manager (Debian/Ubuntu)
curl https://packagecloud.io/pagerduty/rundeck/gpgkey | sudo apt-key add -
echo "deb https://packagecloud.io/pagerduty/rundeck/ubuntu/ focal main" | sudo tee /etc/apt/sources.list.d/rundeck.list
sudo apt-get update
sudo apt-get install rundeck

# Or download the WAR file and run with Java
java -jar rundeck-4.x.x.war

Access the interface:

Navigate to http://localhost:4440. Default credentials are admin/admin. Change them immediately.

Create a project:

Click “New Project” and provide a name. This creates an isolated workspace for jobs and nodes.

Add nodes:

Define nodes manually or configure a resource model source. For manual entry in a YAML resource file, each node is keyed by its name:

web-01.example.com:
  nodename: web-01.example.com
  hostname: web-01.example.com
  username: rundeck
  tags: production,webserver
  osFamily: unix

Create your first job:

Click “Create Job” in your project. Give it a name and description. Add a simple command like echo "Hello from Rundeck". Save and run it.

From here, build more complex jobs, set up schedules, and configure access control.

Real-World Adoption

Many organizations run Rundeck in production.

Financial services use it for compliance-heavy operations. The audit trail satisfies regulatory requirements.

Healthcare organizations manage HIPAA-compliant infrastructure operations through Rundeck’s access controls.

E-commerce companies give support teams self-service access to operational tasks without exposing production systems.

Government agencies use Rundeck for secure, auditable operations on classified systems.

The tool has a solid reputation in traditional IT operations. Less common in modern cloud-native or data engineering contexts.

The Future Direction

Rundeck’s development continues under PagerDuty ownership.

Cloud integrations are improving. Better support for AWS, Azure, and GCP. Native node discovery from cloud platforms.

Container support is getting attention. Kubernetes integration for running jobs as pods rather than SSH commands.

UI modernization is ongoing. Making the interface more contemporary while maintaining functionality.

API improvements for better programmatic access. More operations available via API.

Enterprise features trickle down to open-source. Some features that were enterprise-only are becoming available to everyone.

The core value proposition remains. Self-service operations with proper controls.

Key Takeaways

Rundeck is a runbook automation and job scheduling platform for operations teams.

It’s not for data pipelines or ML workflows. It’s for giving people controlled access to operational tasks.

The strength is self-service operations. Define jobs once, let appropriate users run them. Reduce manual operations and improve consistency.

Access control and audit logging are built-in. Satisfy compliance requirements and track who did what.

The UI is functional but dated. Users can accomplish tasks without training, but the experience isn’t modern.

Rundeck works best with traditional infrastructure. SSH-based execution fits VMs and physical servers. Less natural for container-based or serverless architectures.

Consider Rundeck if you need runbook automation, self-service operations, or scheduled maintenance tasks. Skip it for data engineering or ML workflows.

The open-source version is fully functional. Enterprise adds clustering and advanced features for large deployments.

Start simple. Create a project, add some nodes, define basic jobs. Build complexity gradually.

Tags: Rundeck, runbook automation, job scheduling, operations automation, self-service operations, SSH automation, access control, audit logging, incident response, infrastructure operations, operational workflows, scheduled tasks, remote execution, IT automation, DevOps tools
