NGLESS
NGS With Less Work
Problem Setting
- NGS (Next Generation Sequencing) has many applications.
- It is impossible to define one definitive processing pipeline: the right choices are context-dependent.
- Computational analysis is the bottleneck for many labs.
Not computation time, but setting up pipelines & interpreting results.
Input Data
@SRR172902.1_USI-EAS376:1:1:1:1204/1
CGGTCGTACTCGCGCTGGGCCNCGCCCAGCGCCAGCCGCGAGTNGATTTCTAACGCCTGC
+
a\^Sa_T``Z`Y`S^SX\YaZDY^`_ZKU^Z`YLW\\Y\]TYKDTETYZ^VRW\BBBBBB
@SRR172902.2_USI-EAS376:1:1:1:1811/1
AGATATAAAACCTGAAAACTTNAATTTAGATATTTATTATGAANACGAAGACGTTGCTGT
+
Xabbba\a\_bbaba^a`bb^DZ`bbbbbbabbbbabaabb[]D^bb^^babb`abbaa^
Data Structure
- Header line (may have structure).
- Sequence line, typically short (35-350 bps).
- Plus sign (separator).
- Quality of each base, encoded as one ASCII character per base
(in the encoding used here, a stands for Phred quality 33, meaning that the machine is about 99.95% confident in that base.)
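The four-line record layout above can be read with a few lines of code. The following is a minimal Python sketch, not part of NGLESS: the names `FastqRecord`, `parse_fastq` and `error_probability` are hypothetical, the record in the usage lines is a toy one, and the default quality offset of 64 is an assumption about the encoding of the sample data.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str      # header text without the leading '@'
    sequence: str    # base calls
    qualities: str   # one ASCII character per base


def parse_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(qualities):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header.rstrip("\n")[1:], sequence, qualities)


def error_probability(symbol: str, offset: int = 64) -> float:
    """Convert one quality character to an error probability (Phred scale).

    The offset of 64 is an assumption; the other common sub-format uses 33.
    """
    quality = ord(symbol) - offset
    return 10 ** (-quality / 10)


# Toy record, made up for illustration.
toy = io.StringIO("@read1\nACGTN\n+\naa^DB\n")
records = list(parse_fastq(toy))
```

Each record keeps the sequence and its qualities together, so later steps (trimming, filtering) can slice both consistently.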
Some Technical Complications
- Header may have structure.
- Sequences may not all be of same size.
- There are two sub-formats (ways of encoding qualities).
- There is a more complex type of data called paired-end, which we will ignore in this presentation.
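Because the two quality sub-formats differ only in the ASCII offset (33 or 64), a tool can often guess which one a file uses from the lowest character it contains. The sketch below shows one such heuristic in Python; the function name `guess_quality_offset` is hypothetical, and the heuristic can be fooled by an offset-33 file that contains only very high qualities.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from observed quality strings.

    Raises ValueError if no quality characters are given.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Offset-64 files never contain characters below ';' (ASCII 59),
    # even in the older variant whose scores go down to -5.
    if lowest < 59:
        return 33
    # Heuristic only: everything seen is compatible with offset 64.
    return 64
```

In practice one would scan only the first few thousand reads rather than the whole file.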
Mapping to a Reference is a Primitive
- Mapping means: given a reference and a read, find where the read occurs in the reference.
- For example, take the read
GCACATCCAGGCTGGTCAGTGTGGCAACCAGATCGGTGCCAAGTTCT
- We can computationally find it in the human genome (chromosome 6, in the TUBB [tubulin, beta 1] gene).
- In many cases, it is not an exact match.
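To make the idea of inexact mapping concrete, here is a deliberately naive Python sketch that slides a read along a reference and reports every placement with at most a given number of mismatches. The function name `find_with_mismatches` is hypothetical. Real mappers build an index of the reference instead of scanning it, so this illustrates the problem, not how a production tool solves it.

```python
def find_with_mismatches(reference, read, max_mismatches=1):
    """Return (position, mismatches) for every placement of the read
    that differs from the reference in at most max_mismatches bases."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        mismatches = 0
        window = reference[start:start + len(read)]
        for ref_base, read_base in zip(window, read):
            if ref_base != read_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, try the next position
        else:
            # The inner loop finished without exceeding the limit.
            hits.append((start, mismatches))
    return hits


# Toy reference; the second read has an uncalled base (N) in the middle.
reference = "TTGCACATCCAGGCTGG"
exact = find_with_mismatches(reference, "CATCCAGG")
inexact = find_with_mismatches(reference, "CATCNAGG")
```

This costs time proportional to reference length times read length per read, which is why it does not scale to a genome with millions of reads.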
Typical Pipelines Share the Same Initial Steps
- Pre-process & filter (based on qualities).
- Map to a reference.
- Compute statistics of mapping.
- Problem-specific analysis.
For version 1.0, we wish to support the first 3 steps plus one
or two applications in the 4th step.
Problems are Simple, but Scale is Large
- A single run generates millions of reads.
- A typical file is 1-10 GiB (compressed!).
- A genome can have billions of base pairs.
- In many problems, we have multiple genomes, many read files (hundreds, now growing to thousands).
- On the other hand, most of the computations are embarrassingly parallel.
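"Embarrassingly parallel" here means that each read file (and often each read) can be processed without looking at the others. A minimal Python sketch, with hypothetical names (`count_long_reads`, `parallel_count`) and in-memory lists standing in for read files:

```python
from concurrent.futures import ThreadPoolExecutor


def count_long_reads(reads, min_length=31):
    """Count the reads that are at least min_length bases long."""
    return sum(1 for read in reads if len(read) >= min_length)


def parallel_count(samples, min_length=31, workers=4):
    """Process every sample independently and collect the results.

    samples maps a sample name to a list of read sequences.
    """
    names = list(samples)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda name: count_long_reads(samples[name], min_length), names
        )
        return dict(zip(names, counts))


# Toy data standing in for two read files.
result = parallel_count({"ctrl1": ["A" * 40, "A" * 10], "stim1": ["C" * 31]})
```

A thread pool keeps the sketch short; for CPU-bound work on real files, separate processes or separate machines would be the usual choice.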
Example of NGLESS
ngless "0.0"
input = fastq(['ctrl1.fq','ctrl2.fq','stim1.fq','stim2.fq'])
input = unique(input, max_copies=2)
preprocess(input) using |read|:
    read = read[5:]
    read = substrim(read, min_quality=26)
    if len(read) < 31:
        discard
mapped = map(input, reference='hg19')
write(count(mapped, features=[{gene}]),
        ofile='gene_counts.csv')
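For readers who do not know the language, the preprocess block of the script can be mirrored in plain Python. This is a sketch under one stated assumption: that `substrim` keeps the longest contiguous stretch of the read in which every base has at least the minimum quality. Qualities are represented as a list of integers, and returning `None` stands in for `discard`.

```python
def substrim(sequence, qualities, min_quality):
    """Keep the longest contiguous stretch whose bases all have
    quality >= min_quality (assumed semantics, see the text above)."""
    best_start, best_len = 0, 0
    run_start = 0
    for i, quality in enumerate(qualities):
        if quality < min_quality:
            run_start = i + 1  # a low-quality base ends the current run
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    end = best_start + best_len
    return sequence[best_start:end], qualities[best_start:end]


def preprocess(sequence, qualities, min_quality=26, min_length=31):
    """Mirror of the block: drop 5 bases, trim by quality, filter by length."""
    sequence, qualities = sequence[5:], qualities[5:]
    sequence, qualities = substrim(sequence, qualities, min_quality)
    if len(sequence) < min_length:
        return None  # corresponds to `discard`
    return sequence, qualities
```

The comparison shows what the block syntax buys: the script states the per-read rule once, and the bookkeeping (keeping sequence and qualities in step, iterating over files) is left to the tool.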
Basic Properties of the Language
- Pythonesque syntax with Ruby-like blocks.
- Statically and strictly typed.
- Types are implicit, but the limited language allows for type inference and checking.
Target Market
- Bioinformaticians working in a wetlab setting.
- Every serious biological lab in the world now needs to hire at least one.
- They know programming (at least basic programming), but are not method developers.
- The tool can still be useful for more advanced users.
The Language Design
- Good practices should be automatic.
- Types should allow for bug discovery.
- Scripts should be reproducible: use of version string.
Quality Control is Implicit
- Loading the files triggers quality control.
- After preprocessing, quality control is run again.
- Mapping also generates quality statistics.
- User is not allowed to skip quality control.
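One way to picture implicit quality control is a loader that cannot return reads without also computing their statistics. The Python sketch below is an illustration only: `load_reads` and `quality_summary` are hypothetical names, the statistics chosen are examples rather than the tool's actual report, and the quality offset of 64 is an assumption.

```python
def quality_summary(records, offset=64):
    """Summarise (sequence, quality_string) pairs."""
    n_reads = n_bases = n_gc = quality_total = 0
    for sequence, qualities in records:
        n_reads += 1
        n_bases += len(sequence)
        n_gc += sum(1 for base in sequence if base in "GC")
        quality_total += sum(ord(ch) - offset for ch in qualities)
    return {
        "reads": n_reads,
        "bases": n_bases,
        "gc_fraction": n_gc / n_bases if n_bases else None,
        "mean_quality": quality_total / n_bases if n_bases else None,
    }


def load_reads(records, offset=64):
    """Loading always yields the reads together with their statistics."""
    records = list(records)
    return records, quality_summary(records, offset)


# Toy records, made up for illustration.
reads, stats = load_reads([("GCAT", "hhhh"), ("AAAA", "dddd")])
```

Because the summary is produced by the loader itself, there is no code path that obtains reads while skipping the check.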
Types Are Domain Types
- Reads, references, mapped reads...
- Allows for fast error detection & good error messages.
- Bad error messages can be a huge barrier to adoption.
- Alternative solutions (workflow engines) work great when everything goes right, but fail badly when something goes wrong.
Types in NGLESS
- Read
- ReadSet
- MappedRead
- MappedReadSet
- CountMap
- int, float, bool, symbol, str
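The value of domain types is easiest to see when a step is skipped by mistake. The sketch below imitates two of the types in Python; the class and function names follow the list above but the code is illustrative, not the NGLESS implementation. Note that Python detects the error when the function is called, whereas the slides describe checking before the script runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadSet:
    reads: tuple


@dataclass(frozen=True)
class MappedReadSet:
    reads: tuple
    reference: str


def map_reads(reads, reference):
    """Map a ReadSet to a reference (the actual mapping is omitted)."""
    if not isinstance(reads, ReadSet):
        raise TypeError(f"map expects a ReadSet, got {type(reads).__name__}")
    return MappedReadSet(reads.reads, reference)


def count(mapped):
    """Count reads; only meaningful once the reads have been mapped."""
    if not isinstance(mapped, MappedReadSet):
        raise TypeError(
            f"count expects a MappedReadSet, got {type(mapped).__name__}; "
            "did you forget to map the reads first?"
        )
    return {"n_reads": len(mapped.reads)}
```

Passing an unmapped `ReadSet` to `count` fails immediately with a message phrased in domain terms, instead of failing hours later inside an external tool.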
Basic Domain Knowledge is Builtin
- References for human (and for model organisms) are built in.
mapped = map(input, reference='hg19')
annotated = count(mapped, features=[{gene}])
The user does not need to specify which annotation database to use
in the second line, as the reads have been mapped to hg19.
The language knows which reference to map against and which annotation to use.
Some types of common mistakes (mismatched reference files, etc.) are impossible or hard to express.
The tool can ship with or auto-download the necessary data dependencies.
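The mechanism behind this can be sketched as a mapped read set that remembers its reference, so that the annotation is looked up rather than supplied by the user. In the Python sketch below, the registry `BUILTIN_REFERENCES`, its file names, and the helper `annotation_for` are all hypothetical placeholders, not real NGLESS data or API.

```python
from typing import NamedTuple

# Hypothetical registry; the file names are placeholders.
BUILTIN_REFERENCES = {
    "hg19": {"index": "hg19.index", "annotation": "hg19.genes.gtf"},
}


class MappedReadSet(NamedTuple):
    read_names: tuple
    reference: str  # remembered so later steps can pick matching data


def map_reads(read_names, reference):
    """Record which reference the reads were mapped to (mapping omitted)."""
    if reference not in BUILTIN_REFERENCES:
        known = sorted(BUILTIN_REFERENCES)
        raise ValueError(f"unknown reference {reference!r}; known: {known}")
    return MappedReadSet(tuple(read_names), reference)


def annotation_for(mapped):
    """Pick the annotation that belongs to the reference used for mapping."""
    return BUILTIN_REFERENCES[mapped.reference]["annotation"]


mapped = map_reads(["r1", "r2"], "hg19")
```

Since the annotation is derived from the stored reference, there is no place in the script where a mismatched annotation file could be named.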
Status of Project
- Draft language design is finished.
- Interpreter for the language is being implemented.