Guidelines for planning tasks into packages on Big Data Processor

We suggest the following tips to bring your scripts into a portable form.

1. Do expose all output paths as input arguments.

All output files/folders should be specified through arguments when executing your script/program. Namely, your script/program should accept arguments for all output file/folder paths.
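For example, a minimal Python sketch (the argument names such as `--input-file` and `--output-file` are illustrative, not a BDP requirement):

```python
import argparse

# A minimal sketch: every output location is an explicit argument.
parser = argparse.ArgumentParser(description="Example task with all outputs exposed as arguments.")
parser.add_argument("--input-file", required=True, help="Path to the input file.")
parser.add_argument("--output-file", required=True, help="Path to the main output file.")
parser.add_argument("--log-file", required=True, help="Path to the task log file.")
args = parser.parse_args()

with open(args.input_file) as src, open(args.output_file, "w") as dst:
    dst.write(src.read().upper())  # placeholder for the real processing step
with open(args.log_file, "w") as log:
    log.write("task finished\n")
```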

Tip

A task may generate many output files. In this case, plan for the task to produce a single output folder; all output files, however complex the file structure, can then be organized inside this folder.
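A hedged sketch of this pattern, assuming an illustrative `--output-folder` argument and an example internal layout:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--output-folder", required=True,
                    help="Single folder that will contain all task outputs.")
args = parser.parse_args()

out = Path(args.output_folder)
# Any internal structure can live under the one exposed folder.
(out / "reports").mkdir(parents=True, exist_ok=True)
(out / "intermediate").mkdir(parents=True, exist_ok=True)
(out / "reports" / "summary.txt").write_text("done\n")
```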

Tip

We recommend generating the intermediate outputs of a workflow inside an output folder. When this output folder is set as an argument, users can directly examine these intermediate outputs on Big Data Processor.

Warning

Big Data Processor does not tightly control file outputs in order to stay lightweight. Developers should know where the output files are written (see the next point). If the outputs are created in the current working directory, we provide a web interface for you to set the cwd property for that task. The default cwd is $ProjectFolder, which specifies the project folder path.

2. For input/output files, accept absolute paths as arguments.

It is recommended for a task to accept absolute paths as input/output file arguments.
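A sketch of defensive path handling in Python (argument names are illustrative); resolving to absolute paths makes the task behave the same regardless of where it is launched from:

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--input-file", required=True)
parser.add_argument("--output-file", required=True)
args = parser.parse_args()

# Resolve defensively so the task behaves the same regardless of
# the current working directory it was launched from.
input_path = Path(args.input_file).resolve()
output_path = Path(args.output_file).resolve()
print(f"Reading {input_path}, writing {output_path}")
```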

Tip

If a task only accepts an output file/folder relative to the current working directory (e.g. an output prefix), consider setting the task's current working directory to $ProjectFolder on Big Data Processor and passing only the output file/folder name instead of the full path.

Tip

If a (batch) task needs many input files, plan the task to take a single input folder argument that contains all the input files.
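A minimal sketch of this pattern (the `.fastq` glob and argument names are just examples):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--input-folder", required=True,
                    help="Folder containing all input files for this task.")
parser.add_argument("--output-folder", required=True)
args = parser.parse_args()

out = Path(args.output_folder)
out.mkdir(parents=True, exist_ok=True)
# One folder argument replaces a long, brittle list of file arguments.
for fastq in sorted(Path(args.input_folder).glob("*.fastq")):
    (out / (fastq.stem + ".processed.txt")).write_text(f"processed {fastq.name}\n")
```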

3. Handle SIGTERM or SIGINT signals to stop processes gracefully

Stopping on these signals is usually the default behavior for a script/program. If your script/program intercepts these two signals, please make sure the task still stops itself gracefully, e.g. by removing all intermediate files or stopping all child processes.
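A minimal Python sketch of graceful shutdown (the cleanup logic shown is only an example):

```python
import shutil
import signal
import sys
import tempfile

workdir = tempfile.mkdtemp(prefix="task-")  # intermediate files live here

def shutdown(signum, frame):
    # Clean up intermediate files (and stop any child processes) before exiting.
    shutil.rmtree(workdir, ignore_errors=True)
    sys.exit(0)

# Intercept both signals so the task can stop itself gracefully.
signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)

# ... long-running processing happens here ...
```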

4. Construct a container image for your scripts to run in.

We strongly encourage you to plan all your tasks to run inside Docker or Singularity containers. In this way, tasks become cross-platform and users need not worry about the runtime environment. In BDP, container images should contain only the dependent libraries or packages, so that you can edit and then directly test your scripts on BDP. You also do not have to build a new image whenever you modify your scripts, because your scripts are mounted into the container at run time.
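As a rough illustration of the run-time mount idea (the image name and paths below are hypothetical; in practice BDP's adapters launch the container for you):

```python
import subprocess

# The image contains only the dependencies; the script itself is
# bind-mounted at run time, so editing it needs no image rebuild.
# Image name and paths below are purely illustrative.
subprocess.run([
    "docker", "run", "--rm",
    "-v", "/path/to/scripts:/scripts:ro",
    "-v", "/path/to/project:/project",
    "my-deps-image:latest",
    "python", "/scripts/my_task.py",
    "--input-file", "/project/input.txt",
    "--output-file", "/project/output.txt",
], check=True)
```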

Tip

Many Task Adapters of BDP can be developed to deploy containerized tasks in different environments, e.g. a cluster with SGE/PBS, cloud computing platforms (Google Cloud, Microsoft Azure, Amazon Web Services, etc.), CWL or WDL engines, or simply a single workstation.

5. Do NOT handle batch tasks by yourself

Some tasks process a bunch of files with the same arguments; this is called a batch task. Most of the time, the items of a batch task are independent and can be executed in parallel. If you hard-code the for-loop inside your script, these items can only be executed one by one. The Workflow Playbook helps you configure this kind of batch task: you can loop over several kinds of lists, e.g. file globs or lists from CSV or Excel files. Developers should focus on building each task unit and need NOT worry about parallel task execution, since there may be different kinds of task parallelism, or no parallel environment at all, at run time.
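For illustration, a task unit might look like the sketch below; note that it processes exactly ONE file and contains no batch loop (argument names are examples):

```python
import argparse
from pathlib import Path

# This script handles ONE file. The Workflow Playbook (or another adapter)
# invokes it once per file, possibly in parallel, so no for-loop lives here.
parser = argparse.ArgumentParser()
parser.add_argument("--input-file", required=True)
parser.add_argument("--output-file", required=True)
args = parser.parse_args()

text = Path(args.input_file).read_text()
Path(args.output_file).write_text(text.lower())  # placeholder processing
```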

Different kinds of adapters can handle task parallelism for you. For each task, developers configure ONE single task definition, and the task can then be deployed on different run-time environments, e.g. a local workstation, High-Performance Computing (HPC) clusters, or cloud platforms.

See also

Please see the introduction of the Workflow Playbook to learn how to define a batch task.

6. Be careful when writing files in input folders

Usually, we do NOT recommend writing files in input folders, since file-writing conflicts may occur. That is, multiple processes may write data into the same file. The OS will usually lock the file for the first process, and other processes will not have permission to write data. Hence, the task will fail.

Big Data Processor allows writing files into input folders because output details are not tightly controlled; we preserve maximum flexibility in how a task runs. However, be sure to create uniquely named files under the input folders.
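One way to generate unique file names, as a sketch (the name pattern is just an example):

```python
import argparse
import os
import uuid
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument("--input-folder", required=True)
args = parser.parse_args()

# A per-process unique suffix avoids write conflicts when several
# parallel tasks write inside the same input folder.
unique_name = f"sidecar.{os.getpid()}.{uuid.uuid4().hex[:8]}.tmp"
(Path(args.input_folder) / unique_name).write_text("side output\n")
```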

Tip

Please write output or temp files in the output folders. If the intermediate temp files are not important, consider writing them into the /tmp/ folder. Each task runs independently in a Docker container; namely, /tmp is not shared among different tasks. In this way, you can write temp files using the same names without conflicts.
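A sketch using Python's tempfile module (this assumes a Linux-style container where /tmp exists):

```python
import tempfile
from pathlib import Path

# Inside a container, /tmp is private to this task, so a fixed name
# cannot collide with other tasks' temp files.
with tempfile.TemporaryDirectory(dir="/tmp") as tmpdir:
    scratch = Path(tmpdir) / "intermediate.bin"
    scratch.write_bytes(b"\x00" * 16)
    # ... use the scratch file; the directory is removed automatically ...
```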