Taverna – Data/ML Engineer Blog

Taverna: Workflow Orchestration for Scientific Research and Data-Intensive Science

Introduction

Scientific research generates complex workflows. A biologist might need to query multiple protein databases, run sequence alignment algorithms, apply statistical analysis, and visualize results. Each step depends on the previous one. Each uses different tools and data formats.

Taverna was built to solve this problem. It emerged from the bioinformatics community in the early 2000s when scientists needed a way to connect disparate tools and databases into reproducible workflows.

Unlike general-purpose orchestration tools, Taverna was designed specifically for scientific computing. It focuses on reproducibility, provenance tracking, and integrating heterogeneous data sources. The visual workflow designer lets researchers without programming backgrounds build complex analytical pipelines.

This guide covers what Taverna is, where it fits in the scientific computing landscape, and whether it still has a place in modern research workflows.

What is Taverna?

Taverna is a workflow management system for designing and executing scientific workflows. It provides a graphical interface where researchers drag and drop components to build data analysis pipelines.

The project started at the University of Manchester in 2001 as part of the myGrid project. It became one of the most widely used workflow systems in bioinformatics and life sciences research.

Core capabilities:

Visual workflow composition
Service orchestration across web services and local tools
Provenance capture for reproducibility
Support for complex data types and nested workflows
Integration with scientific databases and web services

Taverna workflows are built from processors. Each processor represents a computational step. Processors connect through data links that define how outputs from one step feed into inputs of another.
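
The processor-and-link idea is easy to sketch in Python. This is only an illustration of the model, not Taverna's API or workflow file format. The names `Processor`, `run_in_order`, and the two toy steps are hypothetical, and the sketch assumes the processors are already listed in a valid execution order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Processor:
    """One computational step; func maps input-port values to output-port values."""
    name: str
    func: Callable[..., Dict[str, object]]


# A data link connects (source processor, output port) to (target processor, input port).
Port = Tuple[str, str]
Link = Tuple[Port, Port]


def run_in_order(processors: List[Processor], links: List[Link],
                 workflow_inputs: Dict[Port, object]) -> Dict[Port, object]:
    """Run processors in list order, moving values along the data links."""
    port_values = dict(workflow_inputs)  # (processor, port) -> value
    for proc in processors:
        # Collect the values currently sitting on this processor's input ports.
        kwargs = {port: value for (name, port), value in port_values.items()
                  if name == proc.name}
        for out_port, value in proc.func(**kwargs).items():
            port_values[(proc.name, out_port)] = value
            # Copy the output to every input port it is linked to.
            for src, dst in links:
                if src == (proc.name, out_port):
                    port_values[dst] = value
    return port_values


def fetch(accession):
    # Stand-in for a database lookup; returns a fixed toy sequence.
    return {"sequence": "ATGGCC"}


def gc_content(sequence):
    gc = sum(base in "GC" for base in sequence)
    return {"fraction": gc / len(sequence)}


steps = [Processor("fetch", fetch), Processor("gc", gc_content)]
links = [(("fetch", "sequence"), ("gc", "sequence"))]
result = run_in_order(steps, links, {("fetch", "accession"): "X00001"})
```

After the run, `result` holds the value on every port, so the final output and all intermediate values stay inspectable.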

The system tracks everything. Which data went into which analysis. What parameters were used. When the workflow ran. This provenance information is crucial for reproducible science.

The Scientific Workflow Problem

Scientific research has unique workflow requirements that differ from business data pipelines.

Heterogeneous data sources. Research pulls data from dozens of specialized databases. GenBank for genetic sequences. UniProt for protein information. PubChem for chemical structures. Each has different APIs, data formats, and access methods.

Complex transformations. Scientific analysis chains together specialized tools. BLAST for sequence similarity. Clustal for alignment. Phylogenetic analysis tools. Statistical packages. Each tool expects specific input formats and produces specialized outputs.

Reproducibility requirements. Scientific results must be reproducible. That means tracking exactly what data was used, which versions of tools ran, and what parameters were set. Years later, another researcher should be able to reproduce your results.

Collaboration across disciplines. Modern research is interdisciplinary. Biologists work with statisticians. Chemists collaborate with machine learning experts. Workflows need to bridge different tools and methodologies.

Publication and sharing. Scientific workflows often get published alongside papers. Other researchers need to understand, modify, and reuse them.

Taverna was designed to address these specific needs.

How Taverna Works

Workflow Composition

Taverna Workbench is the main interface. It’s a desktop application where researchers build workflows visually.

The left panel shows available processors. These include:

Web services (SOAP and REST)
Local Java tools
R and Python scripts
Command-line tools
Specialized bioinformatics services

You drag processors onto the canvas and connect their inputs and outputs. Taverna validates connections to ensure data types match.

Workflows can nest. A complex workflow might contain sub-workflows as individual steps. This promotes reusability and modularity.
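
Nesting can be pictured as function composition. The sketch below is a simplification under that assumption: each step takes one value and returns one value, and `make_workflow` is a hypothetical helper, not something Taverna provides.

```python
from functools import reduce


def make_workflow(*steps):
    """Compose steps into one callable, so a whole workflow can act as a single step."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# A small sub-workflow that normalizes a raw sequence string.
clean = make_workflow(str.strip, str.upper)

# The parent workflow uses the sub-workflow as one of its steps.
analyse = make_workflow(clean, len)

length = analyse("  atgc \n")
```

The parent never needs to know how many steps `clean` contains, which is what makes the sub-workflow reusable elsewhere.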

Service Integration

Taverna excels at integrating external services. Much scientific data lives in web-accessible databases with programmatic APIs.

SOAP web services were common when Taverna launched. Many biological databases exposed SOAP interfaces. Taverna could import WSDL definitions and automatically create processors for these services.

REST APIs became more popular later. Taverna added support for REST services with configurable HTTP methods, headers, and parameters.
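
A REST processor boils down to a request template with a few configurable parts. Here is a minimal sketch using Python's standard library. The URL, parameter names, and the `build_rest_request` helper are made up for illustration, and the request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_rest_request(base_url, method="GET", params=None, headers=None):
    """Assemble an HTTP request from the configurable parts of a REST step."""
    url = f"{base_url}?{urlencode(params)}" if params else base_url
    return Request(url, method=method, headers=headers or {})


# Hypothetical endpoint and parameters, for illustration only.
req = build_rest_request(
    "https://example.org/api/sequences",
    params={"id": "P12345", "format": "fasta"},
    headers={"Accept": "text/plain"},
)
```

Passing `req` to `urllib.request.urlopen` would perform the call. Keeping construction separate makes the configuration easy to inspect and record.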

BioMart integration provided access to biological databases like Ensembl. Researchers could query genomic data directly from workflows.

Local tools could be wrapped as processors. If you have a command-line tool or script, Taverna can incorporate it into workflows.
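
Wrapping a command-line tool mostly means mapping its standard streams and exit code onto named outputs. A rough Python sketch of that idea follows. The `run_tool` helper is hypothetical, and the Python interpreter stands in for a real bioinformatics tool so the example runs anywhere.

```python
import subprocess
import sys


def run_tool(argv, stdin_text=""):
    """Run a command-line tool and expose its results as named output ports."""
    completed = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True, check=False
    )
    return {
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "exit_code": completed.returncode,
    }


# Stand-in "tool": reverses whatever text arrives on standard input.
reverse_cmd = [sys.executable, "-c",
               "import sys; print(sys.stdin.read().strip()[::-1])"]
outputs = run_tool(reverse_cmd, "ATGC")
```

Returning the exit code instead of raising lets the surrounding workflow decide whether a non-zero status should stop the run.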

Provenance Tracking

Every workflow execution captures provenance data. Taverna records:

Input data and where it came from
Intermediate results at each step
Output data
Tool versions and parameters
Execution timestamps
Success or failure status

This provenance data gets stored in a database. You can query it later to understand exactly what happened during a workflow run.

For scientific reproducibility, this is critical. The record shows exactly how a result was produced, so you or another researcher can rerun the same steps with the same inputs and parameters.
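
To make the idea concrete, here is a small Python sketch of provenance capture around a single step. It is not Taverna's provenance schema. The record fields, the `run_with_provenance` helper, and the in-memory list used in place of a database are all assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone


def _digest(value):
    """Fingerprint a value so the record can identify data without storing it."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def run_with_provenance(step_name, func, inputs, parameters, log):
    """Run one step and append a provenance record to log, even on failure."""
    record = {
        "step": step_name,
        "parameters": dict(parameters),
        "input_sha256": {name: _digest(value) for name, value in inputs.items()},
        "started": datetime.now(timezone.utc).isoformat(),
    }
    try:
        output = func(**inputs, **parameters)
        record["status"] = "success"
        record["output_sha256"] = _digest(output)
        return output
    except Exception as exc:
        record["status"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        record["finished"] = datetime.now(timezone.utc).isoformat()
        log.append(record)


def count_codons(sequence, codon_length):
    return len(sequence) // codon_length


provenance_log = []
count = run_with_provenance(
    "count_codons", count_codons,
    inputs={"sequence": "ATGGCCTAA"},
    parameters={"codon_length": 3},
    log=provenance_log,
)
```

Hashing the inputs and outputs keeps the log small while still letting you check later whether two runs saw the same data.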

Data Handling

Scientific data comes in many formats. Taverna handles various data types:

Plain text and strings
XML and JSON
Binary data (images, proprietary formats)
Lists and nested collections
Custom MIME types

The system includes data transformation processors. You can convert between formats, extract subsets, or reshape data structures.

Taverna also manages large datasets. Instead of passing full datasets between processors in memory, it can use references and streaming.
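
The difference between passing data and passing a reference is worth a quick sketch. In the Python below, a file path plays the role of the reference and a generator does the streaming. The helper names are hypothetical, and this shows the general technique rather than Taverna's internal data manager.

```python
import os
import tempfile
from typing import Iterable, Iterator


def write_reference(lines: Iterable[str]) -> str:
    """Store data on disk and return a path that stands in for the data."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


def stream_lengths(path: str) -> Iterator[int]:
    """Read one record at a time so the full dataset never sits in memory."""
    with open(path) as handle:
        for line in handle:
            yield len(line.rstrip("\n"))


# Only the short path string travels between the two steps.
ref = write_reference(["ATGC", "GGCCTA", "TT"])
lengths = list(stream_lengths(ref))
os.remove(ref)
```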

Taverna’s Architecture

Components

Taverna Workbench is the desktop application for workflow design and execution. It’s built in Java using OSGi for modularity.

Taverna Engine executes workflows. It handles service invocation, data flow, and error handling. The engine can run standalone without the workbench.

myExperiment was a social networking site for workflows. Researchers could publish, share, and discover workflows. It included versioning, comments, and ratings.

Taverna Server allowed workflow execution without the desktop client. Users could submit workflows through a web interface or API.

Execution Model

Taverna uses a dataflow execution model. Processors execute when their input data is available. This allows parallel execution of independent steps.

The engine builds a directed acyclic graph from the workflow. It identifies which processors can run concurrently and schedules them accordingly.
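
Finding the steps that can run concurrently is a standard graph problem. The sketch below uses `graphlib.TopologicalSorter` from Python's standard library (Python 3.9 or later) to group steps into waves. The step names and the `execution_waves` helper are invented for illustration, and this is not the scheduling code Taverna itself uses.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group steps into waves; steps within a wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Each key depends on the steps in its set.
deps = {
    "blast": {"fetch_sequence"},
    "fetch_annotations": {"fetch_sequence"},
    "align": {"blast"},
    "report": {"align", "fetch_annotations"},
}
waves = execution_waves(deps)
```

Here `blast` and `fetch_annotations` land in the same wave, so an engine could dispatch them in parallel once `fetch_sequence` finishes.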

Retry logic handles temporary failures. If a web service is temporarily unavailable, Taverna can retry the request.
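
A generic version of this retry pattern looks like the following in Python. The attempt count, delays, and exception types are arbitrary choices for the sketch, not Taverna's actual retry settings, and `flaky_service` is a fake service that fails twice before succeeding.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=0.01,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call func, retrying on transient errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_service():
    # Simulates a web service that is unavailable for the first two calls.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("service temporarily unavailable")
    return "ok"


result = call_with_retries(flaky_service)
```

Only errors listed in `retry_on` trigger a retry. Anything else, such as a malformed request, fails immediately because retrying would not help.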

Plugin Architecture

Taverna’s plugin system allows extending functionality. Plugins can add:

New processor types
Additional data formats
Custom user interface components
Integration with specific tools or databases

The bioinformatics community built many plugins for domain-specific needs.

Common Use Cases

Bioinformatics Pipelines

Taverna became most popular in bioinformatics. Typical workflows included:

Sequence analysis workflows. Start with a DNA or protein sequence. Use BLAST to find similar sequences in databases. Retrieve matching sequences. Perform multiple sequence alignment. Build phylogenetic trees. The entire pipeline runs automatically.

Functional annotation. Take a newly sequenced gene. Query multiple databases for functional information. Combine results from GO annotations, pathway databases, and literature. Aggregate findings into a comprehensive report.

Comparative genomics. Compare genomes across species. Identify orthologous genes. Analyze synteny. Perform statistical tests. Visualize results.

Systems Biology

Systems biology studies biological systems as integrated networks. Taverna supported this work.

Metabolic pathway analysis. Query pathway databases. Retrieve enzyme information. Integrate gene expression data. Model metabolic fluxes.

Protein interaction networks. Gather protein interaction data from multiple sources. Build network graphs. Identify hubs and modules. Perform enrichment analysis.

Cheminformatics

Chemistry research used Taverna for computational chemistry workflows.

Structure-activity relationships. Query chemical databases. Calculate molecular descriptors. Build predictive models. Identify promising compounds.

Virtual screening. Screen large compound libraries. Filter by drug-likeness. Predict binding affinities. Prioritize candidates for synthesis.

Medical Research

Clinical and translational research workflows appeared in medical informatics.

Clinical data integration. Combine patient data from electronic health records, lab systems, and imaging databases. Apply statistical analysis. Generate clinical reports.

Genomic medicine. Integrate genomic data with clinical phenotypes. Query variant databases. Predict disease risk. Generate personalized treatment recommendations.

Astronomy and Physics

Some astronomy projects adopted Taverna for data processing pipelines.

Astronomical data processing. Process telescope observations. Apply calibrations. Detect sources. Match against catalogs. Generate reports.

Particle physics analysis. Process detector data. Apply filters. Reconstruct events. Perform statistical analysis.

Strengths of Taverna

Domain-Specific Design

Taverna understands scientific computing needs. The focus on web service integration matched how scientific databases were structured. Built-in support for common bioinformatics services reduced setup time.

Provenance and Reproducibility

The automatic provenance capture was ahead of its time. Modern reproducibility concerns in science make this feature even more relevant.

Research workflows need to be reproducible years later. Taverna’s provenance system made this possible without extra work from researchers.

Accessibility for Non-Programmers

The visual interface lowered barriers. Biologists without programming skills could build sophisticated analysis pipelines. This democratized computational biology.

Drag-and-drop workflow construction was intuitive. Researchers focused on scientific logic rather than coding details.

Workflow Sharing and Reuse

myExperiment created a community around workflows. Researchers published workflows alongside papers. Others could download, understand, and adapt them.

This accelerated scientific progress. Instead of reimplementing analysis pipelines from paper descriptions, researchers could start with working workflows.

Flexibility and Extensibility

The plugin architecture allowed customization. Different scientific communities built domain-specific extensions.

Support for multiple languages (R, Python, Java) meant researchers could incorporate their preferred tools.

Limitations and Challenges

Maintenance and Development

Taverna development slowed significantly after 2014, the year of its last major release. By the end of the decade, active development had essentially stopped.

The scientific community moved toward other tools. Funding for Taverna dried up. The team at University of Manchester moved to other projects.

This created problems for users. Newer operating systems and Java versions caused compatibility issues. Security vulnerabilities in dependencies went unpatched.

Steep Learning Curve

Despite the visual interface, Taverna had complexity. Understanding how to properly connect processors took time. Debugging workflows wasn’t always straightforward.

The abstraction sometimes leaked. Users needed to understand web services, XML, and data typing even with the visual interface.

Performance Limitations

Taverna wasn’t built for big data. Modern scientific datasets often exceed what Taverna handles efficiently.

The architecture predated distributed computing frameworks. Running workflows on clusters or cloud infrastructure required workarounds.

Limited Modern Integrations

As APIs evolved toward REST and GraphQL, and away from SOAP, Taverna’s strength in SOAP integration became less relevant.

Modern scientific infrastructure uses cloud platforms, container orchestration, and serverless computing. Taverna didn’t adapt to these trends.

Java Desktop Application

The desktop application model aged poorly. Modern researchers expect web-based interfaces. The requirement to install and maintain desktop software felt outdated.

Cross-platform issues emerged. Different operating systems had different problems. Dependency management became a headache.

The Evolution: Apache Taverna

In 2014, Taverna entered the Apache Incubator. The goal was to revitalize development under the Apache Software Foundation.

Apache Taverna aimed to modernize the codebase and architecture. Plans included:

Updated technology stack
Better cloud integration
Improved performance
Modern web-based interface

The project moved slowly. In 2020, the Apache Taverna podling retired from the Incubator without ever graduating to a top-level project.

This marked the effective end of Taverna as an active project.

Alternatives and Successors

The scientific workflow space didn’t stand still. Several tools emerged as Taverna declined.

Nextflow

Nextflow became one of the most widely adopted tools for bioinformatics workflows. Pipelines are written in a Groovy-based domain-specific language built on the dataflow programming model.

Why it replaced Taverna:

Better performance for large datasets
Native support for containers (Docker, Singularity)
Cloud and cluster execution built-in
Active development and strong community
Modern architecture

Nextflow handles genomics pipelines that process terabytes of data. It runs efficiently on everything from laptops to cloud clusters.

Snakemake

Snakemake offers a Python-based approach to workflow management. It’s popular in bioinformatics and computational biology.

Advantages over Taverna:

Familiar Python syntax
Good cluster and cloud support
Reproducibility through Conda integration
Active development
Strong community in genomics

Snakemake workflows are code rather than visual. This appeals to computationally sophisticated researchers.

Galaxy

Galaxy provides a web-based platform for accessible bioinformatics. It targets researchers without programming skills.

Different approach from Taverna:

Fully web-based interface
Integrated analysis environment
Large tool repository
Shared data and workflows through web
Active development and community

Galaxy fills the niche Taverna occupied for biologists wanting graphical workflow tools.

Common Workflow Language (CWL)

CWL is a standard for describing workflows in a portable way. It’s a specification rather than a piece of software.

How it differs:

Vendor-neutral specification
Multiple execution engines support it
Focus on portability and reproducibility
Container-first approach

CWL workflows can run on different platforms. This addresses the concern of being locked in to a single workflow system.

Workflow Description Language (WDL)

WDL is another workflow specification language. It came from the Broad Institute.

Key features:

Human-readable syntax
Strong typing
Good for genomics pipelines
Cloud-native execution

WDL is common in genomics, particularly for human genome analysis pipelines.

Apache Airflow and Others

Some scientific teams use general-purpose orchestration tools like Airflow, Prefect, or Dagster.

These tools weren’t built for science specifically. But they offer modern architectures, cloud integration, and active development.

The trade-off is less domain-specific functionality but better performance and maintainability.

Is Taverna Still Relevant?

For new projects, probably not. The lack of active development makes it hard to recommend.

When Taverna still makes sense:

You have existing Taverna workflows that work
Your data sources still expose SOAP services, which Taverna integrates with well
Your requirements are simple and Taverna meets them
You’re in a constrained environment where installing newer tools is difficult

When to choose alternatives:

Starting new scientific workflows
Need modern cloud or container integration
Working with large datasets
Want active community support
Need ongoing security updates

The reality is that most of the scientific community has moved on. Publications citing Taverna have declined sharply since 2015.

Lessons from Taverna

Taverna’s rise and fall offers lessons for scientific software.

Sustainability matters. Academic software projects often struggle with long-term maintenance. Funding cycles don’t align with software lifecycles. When grants end, development stops.

Community is crucial. Taverna had an engaged community, but it wasn’t large enough to sustain development when institutional support ended.

Architecture ages. Desktop Java applications and SOAP web services were reasonable choices in 2001. By 2015, they felt antiquated. Software needs continuous modernization.

Reproducibility is essential. Taverna got this right. Modern scientific workflow tools all emphasize reproducibility. The provenance features Taverna pioneered are now standard expectations.

Domain specificity helps and hurts. Taverna’s bioinformatics focus helped adoption in that field. But it limited use in other domains. General-purpose tools with good extensibility may prove more sustainable.

The Legacy

Despite ending as an active project, Taverna influenced scientific computing.

Workflow sharing culture. myExperiment established that workflows should be published and shared. This culture continues in platforms like WorkflowHub.

Provenance standards. Taverna’s provenance model influenced later standards like W3C PROV. Modern workflow systems implement similar tracking.

Service-oriented science. The vision of composing workflows from web services shaped how scientific databases built APIs.

Reproducibility focus. Taverna helped establish reproducibility as a core requirement for scientific workflows. This mindset carries forward.

Many researchers who used Taverna now contribute to newer workflow tools. The lessons learned influenced the next generation of scientific workflow systems.

Key Takeaways

Taverna was pioneering when it launched. It solved real problems for scientific researchers. The visual workflow design, service integration, and provenance tracking were innovative.

The project is effectively dead now. Development ceased. The community moved to other tools. Starting new work with Taverna makes little sense.

Modern alternatives like Nextflow, Snakemake, and Galaxy offer better performance, active development, and cloud integration. They learned from Taverna’s strengths and addressed its weaknesses.

For historical or legacy workflows, Taverna might still run. But migration to modern tools is wise for anything important.

The scientific workflow field continues evolving. Containerization, cloud computing, and big data drive new requirements. The tools that succeed balance domain-specific features with general-purpose architecture.

Taverna’s legacy lives on in the workflow systems that followed. The problems it addressed remain relevant. The solutions just come from newer tools now.

Tags: Taverna, scientific workflows, bioinformatics, workflow orchestration, reproducible research, computational biology, myExperiment, scientific computing, research workflows, provenance tracking, life sciences, scientific data analysis, workflow management systems, Apache Taverna
