I thought to start by explaining briefly how newbler works. I’ll do this by following the output newbler generates during the assembly process. This information is displayed during assembly, and can also be found in the 454NewblerProgress.txt file. It is a good thing anyways to have a look at this file, as it sometimes displays certain warnings (see below).
This example assembly is based on a read dataset consisting of both shotgun reads, and paired end reads.
The first thing you’ll see is a message stating that the assembly computation started, and which version of newbler you used.
Then, you’ll see messages for each input file saying Indexing XXXXXXX.sff…, and a counter. During indexing, newbler scans the input file, performs some checks and trims the reads (sometimes more than the base-calling software already did). One of the checks is for possible 3′ and 5′ primers: if a certain percentage of reads contains the same sequence on either the 3′ or 5′ end, this is mentioned. I’ve had some surprises here, such as finding out that reads I got from another group contained an adaptor sequence, which caused problems during the assembly. More on primer removal later…
If an input sff file contains paired end reads, this will be mentioned, as well as the number of reads that contained the paired end linker sequence, for example:
224024 reads, 58599257 bases, 112080 paired reads.
Next:
Setting up long overlap detection…
XXXXX reads to align
Building a tree for YYYYYY seeds…
Computing long overlap alignments…
The first phase of assembly is finding overlap between reads. Newbler splits this phase into one for long reads (this goes very fast) and shorter reads (can take quite some time). As aligning all reads against each other would take too long time, newbler (and many other programs) actually make seeds, 16-mers of each read, where each seed starts 12 bases upstream of the previous one. These seed length and step sizes can be changed if you want (I’ve never tried this, though). When two different reads have identical seeds the program tries to extend the overlap between the reads until the minimum overlap (default 40 bp) with the minimum alignment percentage default 90%) has been reached. These settings can also be changed and influence the alignment stringency, this I will come back to in a later post.
When reads overlap they can be used to generate consensus contigs. The illustration at the top of the post shows an ideal situation with reads that all are the same length and have no variation in sequence in the overlap. In real assemblies, reads will actually show differences, and the contig sequence is based on a consensus estimate.
After long overlap follows short overlap:
Setting up overlap detection…
XXXXXXX reads to align
Building a tree for YYYYYYYY seeds…
Computing alignments…
Then the curious message appears:
Checkpointing…
Basically, checkpointing means writing the intermediate results to disc, so that in the case of a crash, you could continue the assembly from the last ‘checkpoint’.
At this point, newbler, as many other assemblers, has created a contig graph. Aligned reads form the ‘nodes’, reads going from one contig to another form the ‘edges’. For example, a small part of the graph could like like this:
Why would reads go from contig 1 to contig 3, as well as from contig 2 to contig 3? If the sequence from contig 3 is repeated in the genome, the reads coming from these repeats are very similar and collapse into a single contig. At the beginning and end of this contig there will be reads extending into the respective neighboring, single copy genomic regions. Essentially, here is the problem with assembly: repeats cause a complicated contig graph structure.
The graph in the above picture can be simplified to this:
After aligning all the reads, the contig graph potentially has many nodes and edges. The size and complexity of the graph depend on the size of the genome and the repeat structure. The ‘real’ genome is a path through the graph visiting all nodes.
Newbler continues:
Detangling alignments…
-> Phase 1…
-> Phase 2…
-> Level 1, Phase 3, Round 1…
…
-> Level 4, Phase 9, Round 2…
Checkpointing…
At this stage, the graph is simplified, I’ll leave out the details.
Setting up short read alignment detection…
-> 39593 of 39593…
Mapping short reads to consensi…
Paired end read halves shorter than a certain size are not assembled during alignment, but mapped to the contigs afterwards. The longer halves are assembled with the shotgun reads. See below.
Adding mapped short reads to alignments…
Checkpointing…
Building contigs/scaffolds…
Using all uniquely mapped paired end halves, the contigs can be ordered and oriented. At least two paired reads must link two contigs for them to be scaffolded. In order to estimate gap sizes between the contigs, newbler uses the paired end reads where both halves map to the same contig for estimating the paired end library insert size (this is reported in the 454NewblerMetrics.txt file). This insert size is than used to estimate the gap size between contigs. So, a scaffold is a series of contig-gap-contig-gap-contig and so on.
Next, newbler reports on how many scaffolds and contigs it found
-> XX scaffolds, YYY large contigs
(‘Large’ refers to contigs of at least 500 bp) , and starts producing output:
Computing signals…
Reading flowgrams…
Checkpointing…
Generating output…
Reading flowgrams…
Newbler here reads in all information from the flowgrams (sff file, these include the light intensity during sequencing for each read) and calculates for each position in the contig the consensus signal. Also, a consensus quality score is calculated for each base. Several output files are written during this phase.
Finally, newbler tells you it finished successfully.
From: http://contig.wordpress.com/2010/02/09/how-newbler-works/