A quick start of Workflow Playbook¶
The Workflow Playbook is written in the YAML format. It is a human-readable format, so you may proceed to read this page to get a quick understand of Workflow Playbook.
The only things to learn are 1) the YAML format, 2) the Nunjucks template expression, and 3) the filter functions. You don’t need all the above knowledge to start. They are intuitive to learn. Proceed first!
See also
If you are not familiar with the YAML format, you may see the YAML syntax page and its see also section at the bottom of that page. For the Nunjucks template, please see the Nunjucks templating.
A quick demo¶
Let’s say we want to execute a task command like:
perl -Mbigum=bpi -wle print 'bpi(2000)'
This command print the pi
to the 2000 digit accuracy as described in the concept section.
We can use the following definition for such command.
The command will be executed in a perl container.
1 2 3 4 5 6 7 8 | - task-template:
name: calculate-pi
image: perl
exec: perl
args:
- "-Mbignum=bpi"
- "-wle"
- "print bpi(2000)"
|
The above example does not allow any user input. Let’s modify the template to accept user input for the digit accuracy.
1 2 3 4 5 6 7 8 | - task-tempalte:
name: calculate-pi
image: perl
exec: perl
args:
- "-Mbignum=bpi"
- "-wle"
- "print bpi({{ argv[0] if argv[0] > 0 else 2000 }})"
|
By replacing bpi(2000)
with bpi({{ argv[0] if argv[0] > 0 else 2000 }})
, we can specify the accuracy as the first argument.
Here, any ouptut results after resolving the expression in between the {{
and }}
will replace the template text {{ ... }}
. This is templating and we can do loops and conditions using templates.
In this example, the {{ argv[0] if argv[0] > 0 else 2000 }}
is an inline expression. You may expand it to a complete structure in a single line: {%- if argv[0] > 0 -%}{{ argv[0] }}{%- else -%}2000{%- endif -%}
, but it’s more verbose and may be not easy to read.
The {% ... %}
is called a block. It does not produce any text to the final arguments but it allows you to define variables, do a if-elif-else-endif
conditional expression, or do a for-loop
expression.
The {%- and -%} tells the template engine not inserting blanks before and after this {% ... %}
expression, respectively.
Here, to make this complete structure easier to read. You may need the multi-line mode for the args
arguments.
1 2 3 4 5 6 7 8 9 10 11 12 | - task-template:
name: calculate-pi
image: perl
exec: perl
args: |
- "-Mbignum=bpi"
- "-wle"
{% if argv[0] %}
- "print bpi({{ argv[0] }})"
{% else %}
- "print bpi(2000)"
{% endif %}
|
At line 5, we added a |
after the args:
. This specify the args
via a multi-line mode instead of a list.
Under the hood, the args
is firstly defined as a multiline string of YAML template. Then, we parse this string again and expect that the resolved result is a list.
This is useful especially when the argument length is not fixed. Developers may use the for-loop
to define argument lists of non-fixed lengths.
1 2 3 4 | args: |
{% for element in collection %}
- "{{element}}"
{% endfor %}
|
See also
There are many kinds of multi-line syntaxes, such as >
, >+
, |
, |+
, |-
and others. They all have different meanings.
For a more detailed multiline mode, please see Multiline syntax in the YAML format
For most of the time, you only use |
and >-
. You use |
after args:
to specify arguments like the above example, and you use >-
to specify a one liner argument, such as the command argument of bash -c
.
Define tasks by templating¶
The above example shows you to define a task by templating. This is greatly inspired from the Ansible Playbook and how the Ansible team uses a template engine, jinja2, to resolve the playbook. Ansible is an automation system in IT industry. It can deploy app and manage machine configurations automatically. The Ansible Playbook is also written in the YAML format, but it is generalized in the IT world.
Note
The inspiring part acturally comes from how Ansible Playbook resolves a configuration template using a web template engine, jinja2.
The original purpose of this template engine is to generate web pages by giving data and a web page template. It’s smart to apply the template engine here.
With a (web) template engine, we can specify loops and conditions in a straightforward basic text.
Since i built the workflow-player
in Pure JavaScript, i found an equivalent template engine: Nunjucks. If you would like to implement your playbook runner in python, you can directly apply jinja2 as part of your template engine.
These templating systems have a special feature, filter function, that makes them really suitable for task definitions.
The filter function allows us to transform from one value to another by using the pipe character, |
, followed by filter function name.
You can consider the filter function as a value-transform function.
For example, we can use a filter function like capitalize
to transform a string to be capitalized.
Several specialized filter functions that commonly used when defining tasks. For example,
- batch-items: "{{ var.mergedSeqFolder | listFromFileGlob(['*.fastq.gz', '*.fq.gz']) }}"
task-template:
name: 'some-batch-task'
image: 'some-image'
args:
- "{{ item | pathMapping }}"
First of all, the template engine resolves strings that wrapped between {{
and }}
.
You may consider the content of the batch-items
as giving a folder path (left-hand side of the |
) to the filter function, listFromFileGlob
.
Then, this filter function transforms the given folder path
The above example uses the listFromFileGlob
filter function, so that we can use globbing expressions to get a list of matched file paths that inside the mergedSeqFolder
folder.
The resulting list is used as batch-items that will be iterated to execute the batch tasks. In this one-line definition, you already declare a batch task for all the matched files. As for the parallelism, it depends on the concurrency parameter. When the concurrency equals to 1, these jobs are executed sequentially. But the concurrency setting is controlled by the end users instead of developers. So, the task parallelism is not handled by developers. It’s not part of the playbook.
All other value transforming is in this writing style, just the pipe |
followed by filter functions. That’s all.
Let’s use a more advanced example to define real workflow.
A full examplary workflow template¶
The following defines an examplary workflow.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | ---
- var: # This is a step 1 task
# This trimmedSeqFolder folder contains all the trimmed sequence files.
trimmedSeqFolder: "{{ argv[0] }}"
# A folder that contains one reference genome file: refgenome.fa (other files in this folder are bwa indexed files)
# a refgenome folder expects only one *.fa file.
refGenomeFolder: "{{ argv[1] }}"
# This is a batch task containing multiple jobs.
batch-items: "{{ var.trimmedSeqFolder | listFromFileGlob(['*_trimmed.fq']) | sort }}"
task-template:
name: run-bwa
condition: "{{ true if item}}"
image: quay.io/biocontainers/bwa:0.7.17--ha92aebf_3
exec: bwa
args:
- aln
- -t
- "{{ option.cpus }}"
- -f
- "{{ var.trimmedSeqFolder | suffix('/' + path.basename(item, '_trimmed.fq') + '.sai') | pathMapping }}"
- "{{ var.refGenomeFolder | listFromFileGlob(['*.fa']) | sort | first | pathMapping }}"
- "{{ item | pathMapping}}" # The trimmed fastq file.
option:
setUser: true
- task-template: 'another-template' # This is a task of step 2.
args:
...
|
The above example uses the bwa to map sequencing data to a reference genome.
Tip
No need to use another file to pipe these tasks. Constructing workflow is just like stacking building blocks.
You can concatenate the blocks (here, a block begins with - var:
) to form a pipeline. Just making sure that your var.name
is valid and compatible.
Line 1 declares a YAML document start. And you may noticed that the lines 2 and 21 start with a
-
. In the YAML format, they are list (array) elements. Workflow runner can parse this array structure to execute each of these tasks in order.lines 2-7 declared the variables for template to use. The values for these variables are from the global arguments. You can see this as setting alias. With the alias, you won’t change the content of
task-template
whenever the argument orders are changed. These variables will be existed until overwrite. Namely, the step 2 can directly use declaredvar.trimmedSeqFolder
without specifying again.Line 9 declares an item list and each item will be used to resolved the
task-template
to produce a job in this batch task. Therefore, if we got 10 items in thebatch-items
and all of theirtask-template.condition
is true, there will 10 jobs in this batch task. The workflow runner will wait all the 10 jobs to finish. Then, the workflow proceeds to the next step.Line 10-24 is the task template. It contains properties of
name
,image
,exec
,condition
,args
, andoption
.The
name
is used as a key and is defined by developers. This is mainly used to identify the task.The
image
is the full container path.The
exec
is the executable. For a common use case, you can use/bin/sh
or/bin/bash
here and use-c
to formulate your commands.The
condition
tells if this task will be executed or not. If undefined, thecondition
is true. If set to any string other than ‘true’ or ‘TRUE’, this task will be ignored.The
option
can set additional settings for workflow runner. Here, this option will be sent to ourtask-adapter
. The content of thisoption
depends on theadapters
. You can set some runtime resources as default values, such ascpus
ormem
. In this example, we can usesetUser: true
to notify the adapter that--user
must be set when running the docker image. This prevents outputing files with root privileges.The
args
can have two types: array or string. In this example, we can use the array type and you can see lines 12-18 begins with-
. Values leading with the-
will be resolved as argument values in order. This is used when there is a fixed number of known arguments. For a non-fixed number of arguments, use the string type. The string type is a more advanced usage and allows you to fully customized your arguments. Basically, the multi-line mode of the YAML will be used by usingargs: |
orargs: >
. Under the hood, this string is expected as in YAML format and will be resolved as an array. So you can use{% for ... %} ... {% endfor %}
to produce arguments of uncertain number.The
pathMapping
filter function transforms your host path to a path inside the container. The Workflow runner will do the volume mapping for you. This is probably the most frequently used filter function.
Tip
The key orders are not important. That is, the order of
name
,condition
,exec
, etc. can be exchange.The list elmenets inside
args
may be important. It is the order of the input argument for theexec
.We allow array of array here to define your tasks, so that your workflow can have multiple (batch) tasks executed concurrently in one step. If a step contains multiple tasks, you can start the line with
- - task-template:
to declare array of array using the YAML format. The workflow runner will wait for all the tasks in a step to finish.Because indentation is important in the YAML format, using text editors like Sublime Text, Visual Studio Code, or Notepad++ is recommended. They can help you to collapse items for a clear view.