Azkaban: The Legacy Workflow Scheduler That Shaped Modern Data Engineering

Introduction

Before Airflow dominated data engineering, before Prefect and Dagster existed, there was Azkaban.

LinkedIn built it in 2011 to solve a real problem. They had hundreds of Hadoop jobs running daily. Dependencies were complex. Failures happened constantly. Managing it all through cron and shell scripts was chaos.

Azkaban brought sanity to batch job orchestration. It introduced concepts that became standard in workflow tools. Job dependencies as directed graphs. Visual workflow monitoring. Permission-based access control. Failure notifications.

The tool is still around. Companies with long-standing Hadoop infrastructure still use it. But development has slowed. Most teams are migrating to modern alternatives.

This guide covers what Azkaban is, why it mattered, and whether it still makes sense today. You’ll understand its place in data engineering history and when legacy systems might justify keeping it.

What is Azkaban?

Azkaban is a batch workflow scheduler created at LinkedIn. It manages job dependencies, schedules execution, and provides visibility into workflow status.

The architecture is straightforward. You define jobs in simple property files or YAML. Jobs can depend on other jobs. Azkaban builds a dependency graph and executes jobs in order. When one job finishes, dependent jobs start.

LinkedIn open-sourced Azkaban in 2012. It was one of the first enterprise-grade workflow schedulers available freely. Data teams adopted it quickly, especially those running Hadoop.

The project served LinkedIn for years. They eventually moved to other tools internally, but Azkaban remained in the ecosystem. Development continues, but at a slower pace than modern alternatives.

Core Architecture

Azkaban has three main components.

The Web Server provides the UI and handles user interactions. You upload workflow definitions through the web interface. You trigger runs, view logs, and manage permissions here. The web server is stateless, so you can run multiple instances behind a load balancer.

The Executor Server runs the actual jobs. When a workflow is triggered, the executor picks it up and starts executing. Each job runs as a separate process on the executor. Multiple executors can run simultaneously for horizontal scaling.

The Database stores everything: workflow definitions, execution history, user permissions, and schedules. MySQL is the typical choice, though other databases work.

This separation of concerns was smart design for 2011. The web layer scales independently from execution. Multiple executors handle increased workload. The database becomes the single source of truth.
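
Both servers read the same database connection settings from their azkaban.properties files. A minimal sketch of that configuration for MySQL, with placeholder host and credentials:

# azkaban.properties (web server and executor; values are placeholders)
database.type=mysql
mysql.host=localhost
mysql.port=3306
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100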

How Jobs Work

Jobs in Azkaban are simple. Each job has a type, which determines what it does.

Command jobs run shell commands or scripts. This is the most basic job type.

type=command
command=python process_data.py
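
A command job can also run several commands in sequence using numbered properties. A small sketch, with hypothetical script names:

# process.job (hypothetical — chained commands run in order)
type=command
command=python process_data.py
command.1=python validate_output.py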

Hadoop jobs submit MapReduce work. Azkaban handles the submission and monitors completion.

type=hadoopJava
job.class=com.example.MyMapReduceJob

Pig and Hive jobs run queries against Hadoop data.

type=pig
pig.script=transform.pig

Java jobs execute Java classes directly.

type=javaprocess
java.class=com.example.DataProcessor

You can define custom job types through plugins. This extensibility let teams adapt Azkaban to their specific needs.

Workflow Definition

Workflows are defined in ZIP files containing job files and dependencies.

A simple job file looks like this:

# extract.job
type=command
command=./extract_data.sh

Another job that depends on it:

# transform.job
type=command
dependencies=extract
command=./transform_data.sh

And a final job:

# load.job
type=command
dependencies=transform
command=./load_data.sh

You package these files together in a ZIP and upload through the web UI. Azkaban reads the dependencies and builds the execution graph.

Dependencies can be complex. A job can depend on multiple parents. Azkaban won’t start a job until all its dependencies succeed.
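
Multiple parents are declared as a comma-separated list. A sketch with hypothetical job names, where a join step waits for two independent extracts:

# join.job (hypothetical)
type=command
dependencies=extract_users,extract_events
command=./join_datasets.sh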

Scheduling

Azkaban handles scheduled execution. You set up a schedule through the web interface: daily at 2 AM, hourly, or every 15 minutes.

The scheduler is simple. It’s essentially an enhanced cron. When the scheduled time arrives, Azkaban triggers the workflow.

You can configure what happens when a job fails: cancel everything immediately, let jobs that are already running finish but start nothing new, or keep going and finish every job whose dependencies still succeeded.

Schedules can overlap. If a previous run is still executing when the next one is due, Azkaban can skip the new run, run it concurrently, or pipeline it behind the previous execution.

User Interface

The Azkaban UI was modern for its time. It shows workflows as visual graphs. You see which jobs are running, completed, or failed.

The execution view displays real-time progress. Click on a job to see logs. Check execution time and resource usage. Understand why something failed.

The workflow graph is interactive. Zoom in and out. Click nodes to see job details. Trace dependencies visually.

You can trigger workflows manually through the UI. Override parameters, select specific jobs to run, or trigger a full workflow.

The schedule management screen shows all scheduled workflows. You can pause, resume, or modify schedules without changing workflow code.

Permissions and Access Control

Azkaban has a robust permission system. This mattered at LinkedIn scale, where hundreds of engineers needed access but not everyone should be able to trigger production workflows.

Projects have owners. Owners control who can view, execute, or modify workflows. You can grant read-only access to some users, execution rights to others, and admin access to a few.

This prevents accidental production runs. Junior engineers can view workflows and logs but can’t trigger critical jobs. Data analysts can run specific analytics workflows but can’t modify ETL pipelines.

The permission model can integrate with LDAP and other authentication systems through pluggable user managers.

Failure Handling

Workflows fail. Azkaban provides several mechanisms to handle this.

Automatic retries are configurable per job. A job can retry a specified number of times before giving up.

retries=3
retry.backoff=60000

This retries up to three times with a one-minute delay between attempts.

Failure notifications alert on-call engineers. Email, webhooks, or custom plugins can notify external systems.
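
Notification recipients are set as flow or job properties, typically kept in a shared properties file inside the project ZIP. A minimal sketch with placeholder addresses:

# common.properties (hypothetical shared properties file)
failure.emails=oncall@example.com,data-eng@example.com
success.emails=data-eng@example.com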

Failure options determine what happens to dependent jobs. By default, if a job fails, its dependents don’t run. But you can configure workflows to continue despite failures.

SLA alerts warn when workflows take too long. Set an expected duration, and Azkaban alerts if exceeded.

Why Azkaban Mattered

In 2011, workflow orchestration was immature. Teams used cron scripts, manual coordination, and hope.

Azkaban introduced several innovations that became industry standard.

Dependency management as code. Instead of carefully timing cron jobs, you declared dependencies. The scheduler figured out execution order.

Visual workflow representation. Seeing jobs as graphs made complex workflows understandable. This became standard in every modern orchestrator.

Centralized execution and monitoring. All workflows in one place. One UI to check status. One system to manage.

Permission-based access. Not everyone should trigger production workflows. Azkaban made this enforceable.

Hadoop integration. MapReduce, Pig, and Hive support made it natural for big data teams. Azkaban understood Hadoop semantics.

These ideas seem obvious now. But in 2011, they were novel. Azkaban pioneered what we now expect from orchestration tools.

When Azkaban Made Sense

Azkaban fit specific needs perfectly.

Hadoop-centric data pipelines. If your world was MapReduce, Pig, and Hive, Azkaban was built for you. Native job types handled submission and monitoring.

Batch processing workflows. Daily ETL jobs, weekly reports, monthly aggregations. Azkaban handled scheduled batch work well.

Teams needing simple deployment. Upload a ZIP file, set a schedule, done. No complex infrastructure or configuration.

Organizations with access control requirements. The permission system was more mature than alternatives at the time.

Companies already using LinkedIn tools. Azkaban integrated well with other LinkedIn open-source projects.

Where Azkaban Falls Short Today

The world moved on. Modern orchestrators learned from Azkaban and improved.

Limited ecosystem compared to Airflow. Airflow has hundreds of operators for different systems. Azkaban has basic job types and requires custom plugins for newer technologies.

No dynamic workflow generation. Workflows are static. You can’t easily generate jobs based on runtime data. Modern tools make this natural.

Dated UI and UX. The interface feels old. It works, but it’s not pleasant compared to modern alternatives.

Slower development pace. The project still gets updates, but infrequently. New features are rare. Bug fixes take time.

Weaker community. Stack Overflow questions go unanswered. Documentation is sparse. Finding help is harder.

No cloud-native features. Azkaban predates cloud data platforms. Integration with S3, Snowflake, or BigQuery requires custom work.

Limited modern data tool support. No native dbt support. No Spark on Kubernetes. No integration with modern ML frameworks.
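
In practice, teams bridge these gaps with plain command jobs. A sketch, assuming dbt is installed on the executor host and the paths are placeholders:

# dbt_run.job (hypothetical — no native dbt job type, so wrap the CLI)
type=command
command=dbt run --profiles-dir /etc/dbt --project-dir /opt/analytics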

Deployment complexity at scale. Setting up executors, managing databases, and handling failover all require operational expertise.

Migration Patterns

Teams still using Azkaban often plan migrations. Several patterns emerge.

Gradual migration to Airflow. The most common path. Start with new workflows in Airflow. Slowly migrate critical Azkaban workflows. Run both systems in parallel during transition.

Lift and shift to managed services. Move to AWS Step Functions, Google Cloud Composer, or Azure Data Factory. Let cloud providers handle orchestration.

Rewrite for modern tools. Take the opportunity to redesign workflows. Move to Prefect or Dagster with better abstractions.

Stay on Azkaban for legacy systems. If workflows still work and the team knows Azkaban, migration cost might not justify the effort. Keep it running until you can’t.

The decision depends on pain points. If Azkaban still meets needs, migration might wait. If you’re hitting limitations, migrating sooner prevents technical debt accumulation.

Running Azkaban Today

If you’re maintaining Azkaban, several considerations matter.

Database maintenance is critical. The MySQL instance stores everything. Regular backups are essential. Schema migrations during upgrades need careful handling.

Executor scaling requires planning. As workloads grow, add executors. But they’re not automatically discovered. Configuration management is manual.
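
As a rough sketch, multi-executor mode is switched on in the web server’s azkaban.properties; the filter list below follows the standard multi-executor setup but should be verified against your version:

# azkaban.properties on the web server (sketch; verify against your version)
azkaban.use.multiple.executors=true
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus

Depending on the version, new executors still have to be registered in the database and activated before the web server will dispatch flows to them.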

Plugin development might be necessary. Integrating with modern tools often requires custom plugins. The plugin API works, but documentation is minimal.

Monitoring and alerting need external tools. Azkaban doesn’t expose detailed metrics. Integrate with monitoring systems to track workflow health.

Security updates lag. The project doesn’t always get timely security patches. Run Azkaban behind firewalls and limit exposure.

Upgrade cycles are slow. New versions release infrequently. Testing upgrades thoroughly before production is essential.

Comparison with Modern Alternatives

Azkaban vs Airflow

Airflow won the mindshare battle. It has a larger ecosystem, better documentation, and active development.

Airflow’s Python-based DAG definition is more flexible than Azkaban’s property files. Dynamic workflows are natural in Airflow, awkward in Azkaban.

Airflow integrates with modern data tools out of the box. Snowflake, dbt, Spark, cloud services all have operators.

But Airflow is more complex to operate. Azkaban’s simpler architecture can be easier for small teams.

For new projects, Airflow is the better choice. For existing Azkaban deployments, migration cost matters.

Azkaban vs Prefect

Prefect is modern Python workflow orchestration. The developer experience is far better than Azkaban.

Prefect Cloud removes operational overhead. Azkaban requires managing servers and databases.

Prefect’s hybrid execution model is innovative. Azkaban requires you to manage all execution infrastructure yourself.

Prefect has better error handling, clearer code, and modern features like version control integration.

Azkaban might be simpler to understand for basic workflows. But Prefect scales better to complex scenarios.

Azkaban vs Dagster

Dagster’s asset-based approach is fundamentally different from Azkaban’s job-based model.

Dagster emphasizes data quality, testing, and lineage. Azkaban focuses on job execution.

Dagster is built for modern data platforms. Azkaban was built for Hadoop.

For new data platforms, Dagster offers better abstractions. For legacy Hadoop workflows, Azkaban’s simplicity might be adequate.

The LinkedIn Legacy

LinkedIn moved beyond Azkaban internally. But the tool’s influence persists.

Many orchestrators borrowed ideas from Azkaban. Visual workflow graphs, dependency-based scheduling, and permission systems all appeared early in tools like Azkaban.

The project proved batch workflow orchestration at scale was possible. It showed what good looked like for 2011.

LinkedIn’s other data tools also shaped the industry. Kafka, Samza, and Voldemort all emerged from similar needs. LinkedIn had real-world big data problems and built practical solutions.

Azkaban was part of that wave. It solved a problem well enough to be useful beyond LinkedIn.

Should You Use Azkaban Today?

For new projects, no. Modern alternatives are better in almost every way.

Airflow has more features, better support, and stronger ecosystem integration. Prefect offers a cleaner development experience. Dagster provides better abstractions for data platforms.

Cloud-managed services like Step Functions or Cloud Composer eliminate operational overhead.

But if you already have Azkaban running production workflows, the calculation changes.

Keep Azkaban if:

  • It currently meets your needs
  • Migration cost exceeds pain from limitations
  • Team expertise is deep in Azkaban
  • Workflows are stable and rarely change
  • You’re primarily running legacy Hadoop jobs

Migrate away if:

  • You need modern tool integration
  • Workflows are becoming more complex
  • You’re moving to cloud platforms
  • Finding Azkaban expertise is difficult
  • Development velocity matters

Key Takeaways

Azkaban was an important tool in data engineering history. It brought workflow orchestration to batch data processing at a time when few options existed.

The project introduced concepts that became standard. Dependency graphs, visual monitoring, and permission controls all started with tools like Azkaban.

Today, it’s showing its age. Development is slow. The ecosystem is small. Modern features are missing.

For legacy systems, Azkaban might still make sense. It works. It’s stable. If workflows run fine, migration might wait.

For new projects, choose modern tools. Airflow, Prefect, Dagster, or cloud-managed services all offer better capabilities.

Azkaban deserves recognition for what it accomplished. It solved real problems and influenced the tools we use today. But its time as a first-choice orchestrator has passed.

If you maintain Azkaban systems, plan for eventual migration. If you’re learning orchestration, start with current tools. Azkaban is part of data engineering history, not its future.

Tags: Azkaban, workflow orchestration, LinkedIn engineering, batch processing, Hadoop workflows, ETL scheduler, legacy data tools, job scheduling, workflow scheduler, data engineering history, MapReduce orchestration, enterprise workflow management, batch job orchestration