NGLESS
NGS With Less Work

Problem Setting

Input Data

@SRR172902.1_USI-EAS376:1:1:1:1204/1
CGGTCGTACTCGCGCTGGGCCNCGCCCAGCGCCAGCCGCGAGTNGATTTCTAACGCCTGC
+
a\^Sa_T``Z`Y`S^SX\YaZDY^`_ZKU^Z`YLW\\Y\]TYKDTETYZ^VRW\BBBBBB
@SRR172902.2_USI-EAS376:1:1:1:1811/1
AGATATAAAACCTGAAAACTTNAATTTAGATATTTATTATGAANACGAAGACGTTGCTGT
+
Xabbba\a\_bbaba^a`bb^DZ`bbbbbbabbbbabaabb[]D^bb^^babb`abbaa^

Data Structure

  1. Header line (may have structure).
  2. Sequence line, typically short (35-350 bp).
  3. Plus sign (separator).
  4. Quality of each base, encoded as ASCII characters
    (assuming Phred+64 encoding, a is Q33, meaning the machine is about 99.95% confident.)
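
A minimal Python sketch of reading this four-line layout, assuming well-formed records with exactly one line per field. The names Read, parse_fastq, and confidence are illustrative and not part of NGLESS. The quality offset of 64 is an assumption about the encoding (older Illumina pipelines); many files use an offset of 33 instead.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class Read(NamedTuple):
    header: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield one Read per four-line record; stops at end of input or a blank line."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield Read(header[1:], sequence, quality)


def confidence(char: str, offset: int = 64) -> float:
    """Convert one quality character to the probability that the base call is right.

    Assumes a Phred scale: Q = ord(char) - offset and P(error) = 10 ** (-Q / 10).
    """
    q = ord(char) - offset
    return 1.0 - 10 ** (-q / 10)


# The first record from the slide; raw strings keep the backslashes intact.
FIRST_RECORD = "\n".join([
    "@SRR172902.1_USI-EAS376:1:1:1:1204/1",
    "CGGTCGTACTCGCGCTGGGCCNCGCCCAGCGCCAGCCGCGAGTNGATTTCTAACGCCTGC",
    "+",
    r"a\^Sa_T``Z`Y`S^SX\YaZDY^`_ZKU^Z`YLW\\Y\]TYKDTETYZ^VRW\BBBBBB",
]) + "\n"

for read in parse_fastq(io.StringIO(FIRST_RECORD)):
    print(read.header, len(read.sequence), confidence(read.quality[0]))
```

Keeping the offset as a parameter matters because the same character maps to a very different quality under the two common encodings.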

Some Technical Complications

Mapping to a Reference is a Primitive

Typical Pipelines Share the Same Initial Steps

  1. Pre-process & filter (based on qualities).
  2. Map to a reference.
  3. Compute statistics of mapping.
  4. Problem-specific analysis.

For version 1.0, we wish to support the first 3 steps plus one or two applications in the 4th step.

Problems are Simple, but Scale is Large

  1. A single run generates millions of reads.
  2. A typical file is 1-10 GiB (compressed!).
  3. A genome can have billions of base pairs.
  4. In many problems, we have multiple genomes and many read files (hundreds, now growing to thousands).
  5. On the other hand, most of the computations are embarrassingly parallel.
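
To illustrate the last point, here is a small Python sketch in which each read file is handled independently, so the work can be spread across workers without any coordination. The function names are hypothetical and the per-file task (counting reads) is only a stand-in for real processing. A thread pool is used to keep the sketch simple; for CPU-bound work a process pool would be the more usual choice.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def count_reads(path: str) -> int:
    """Count records in one FASTQ file (plain or gzip), assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def count_reads_in_files(paths: Iterable[str], max_workers: int = 4) -> Dict[str, int]:
    """Process every file independently; no file's result depends on another file."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        counts = list(pool.map(count_reads, paths))
    return dict(zip(paths, counts))
```

With the file names from the example script, this would be called as count_reads_in_files(['ctrl1.fq', 'ctrl2.fq', 'stim1.fq', 'stim2.fq']).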

Example of NGLESS

ngless "0.0"
input = fastq(['ctrl1.fq','ctrl2.fq','stim1.fq','stim2.fq'])
input = unique(input, max_copies=2)
preprocess(input) using |read|:
    read = read[5:]
    read = substrim(read, min_quality=26)
    if len(read) < 31:
        discard
mapped = map(input, reference='hg19')
write(count(mapped, features=[{gene}]),
        ofile='gene_counts.csv')
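
For readers unfamiliar with the preprocess block, the following Python sketch spells out one plausible reading of what it does to a single read. This is an interpretation, not the NGLESS implementation: it assumes that substrim keeps the longest contiguous stretch in which every base has quality at least min_quality, and that qualities use an offset of 64. The function names mirror the script, but the signatures are made up for illustration.

```python
from typing import Optional, Tuple


def substrim(sequence: str, quality: str, min_quality: int,
             offset: int = 64) -> Tuple[str, str]:
    """Return the longest contiguous stretch whose bases all reach min_quality.

    Assumed semantics; ties are resolved in favour of the earliest stretch.
    """
    best_start, best_len = 0, 0
    run_start = 0
    for i, char in enumerate(quality):
        if ord(char) - offset < min_quality:
            run_start = i + 1          # a low-quality base ends the current run
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    end = best_start + best_len
    return sequence[best_start:end], quality[best_start:end]


def preprocess_read(sequence: str, quality: str, min_quality: int = 26,
                    min_length: int = 31,
                    offset: int = 64) -> Optional[Tuple[str, str]]:
    """Mirror the three steps of the preprocess block; None means 'discard'."""
    sequence, quality = sequence[5:], quality[5:]                          # read = read[5:]
    sequence, quality = substrim(sequence, quality, min_quality, offset)   # substrim(...)
    if len(sequence) < min_length:                                         # if len(read) < 31
        return None
    return sequence, quality
```

The sequence and its qualities are always sliced together, which is the kind of bookkeeping a domain-specific read type can do for the user.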
    

Basic Properties of the Language

Target Market

The Language Design

  1. Good practices should be automatic.
  2. Types should allow for bug discovery.
  3. Scripts should be reproducible: every script declares a version string.

Quality Control is Implicit

  1. Loading the files triggers quality control.
  2. After preprocessing, quality control is run again.
  3. Mapping also generates quality statistics.
  4. The user is not allowed to skip quality control.
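
The slides do not list which statistics quality control computes. As one hypothetical example of such a statistic, the Python sketch below computes the mean quality at each read position, again assuming an offset of 64. The function name is illustrative only.

```python
from typing import Iterable, List


def mean_quality_per_position(qualities: Iterable[str], offset: int = 64) -> List[float]:
    """Average the quality score at each position over all reads.

    Reads may differ in length; each position is averaged over the reads that reach it.
    """
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for i, char in enumerate(quality):
            if i == len(totals):       # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Running the same function on the reads before and after preprocessing would show how much the filtering step changed the quality profile.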

Types Are Domain Types

Types in NGLESS

Basic Domain Knowledge is Built In

Status of Project

  1. Draft language design is finished.
  2. Interpreter for the language is being implemented.