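Analyzing transcriptomes with RNA-Seq is now a very common approach, and I analyzed bacterial transcriptome data during my PhD. The basic workflow for differential expression analysis was: quality control -> map reads to the genome with bwa -> count the uniquely mapped reads on each gene -> run the differential expression analysis with DEGSeq (I will write that workflow up here when I have time). Because prokaryotes have no alternative splicing, a splice-unaware aligner is sufficient, so either bwa or bowtie works. Recently I have been processing human transcriptome data, and the method I use still follows an article published earlier on PLoB, "A step-by-step guide to transcriptome differential expression analysis with tophat and Cufflinks" (《利用tophat和Cufflinks做转录组差异表达分析的步骤详解》). (For more on transcriptome analysis tools and methods, such as saturation assessment and using Tophat, search PLoB directly.)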
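The "count the uniquely mapped reads on each gene" step of that workflow can be sketched roughly as below. This is a toy illustration under stated assumptions, not the script used for the analysis: the alignment tuples, the gene table, and the function name `count_unique_reads` are all hypothetical, and a real pipeline would parse the aligner's SAM/BAM output instead of an in-memory list.

```python
from collections import Counter


def count_unique_reads(alignments, genes):
    """Count reads that align exactly once and fall inside a gene.

    alignments: list of (read_id, chrom, pos) tuples (hypothetical format).
    genes: dict mapping gene name -> (chrom, start, end), 1-based inclusive.
    """
    # A read is treated as uniquely mapped if it has exactly one alignment.
    hits_per_read = Counter(read_id for read_id, _, _ in alignments)
    counts = {name: 0 for name in genes}
    for read_id, chrom, pos in alignments:
        if hits_per_read[read_id] != 1:
            continue  # skip multi-mapped reads
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return counts


# Toy data: r2 maps to two places, r4 falls between the two genes.
genes = {"geneA": ("chr", 100, 500), "geneB": ("chr", 600, 900)}
alignments = [
    ("r1", "chr", 150),
    ("r2", "chr", 650),
    ("r2", "chr", 200),
    ("r3", "chr", 700),
    ("r4", "chr", 550),
]
print(count_unique_reads(alignments, genes))
```

The resulting per-gene counts are the kind of table that a tool such as DEGSeq takes as input for the differential expression step.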
Within the past month, three papers describing new methods for RNA-Seq data analysis have appeared in Nature Publishing Group journals: one in Nature Methods and the other two in Nature Biotechnology.
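Interestingly, all three papers share one author: Steven Salzberg of the Center for Computational Biology at Johns Hopkins University. Salzberg is a distinguished scientist in bioinformatics and computational biology with extensive experience in genome assembly, and he took part in the Human Genome Project. Since next-generation sequencing emerged, he and his team have developed a series of applications, among which Bowtie and TopHat have been widely downloaded and cited.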
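The three papers introduce three new tools: HISAT, StringTie and Ballgown. Each replaces an earlier tool developed by Salzberg's group, and together they offer a completely new route from raw RNA-Seq reads to differential expression analysis.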
HISAT stands for Hierarchical Indexing for Spliced Alignment of Transcripts and was developed at Johns Hopkins University. It replaces Bowtie/TopHat and aligns RNA-Seq reads to a genome quickly. The work was published in Nature Methods on March 9.
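HISAT covers the whole genome with a large number of FM indexes: a whole-genome FM index anchors each alignment, and many small local FM indexes extend it. For the human genome it needs 48,000 local indexes, each representing a genomic region of ~64,000 bp. Combined with several alignment strategies, these small indexes make RNA-Seq read alignment efficient, especially for reads that span multiple exons. Despite the large number of indexes, HISAT requires only 4.3 GB of memory. The program supports genomes of any size, including those larger than 4 billion bases.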
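To make the idea of an FM index more concrete, here is a minimal sketch of the general technique (Burrows-Wheeler transform plus backward search) on a toy string. It is not HISAT's implementation: the function names `build_fm_index` and `find` are chosen only for illustration, and the naive suffix sort and full occurrence tables used here would not scale to a genome.

```python
def build_fm_index(text):
    """Build a toy FM index: suffix array, C table and occurrence table."""
    text = text + "$"  # sentinel, sorts before A/C/G/T
    # Naive suffix array: fine for a toy string, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[c] = number of characters in text strictly smaller than c.
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, c_table, occ


def find(pattern, index):
    """Return sorted start positions of pattern, using backward search."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", index))  # [1, 8]
print(find("TTT", index))   # []
```

Backward search touches the index once per pattern character, independent of the text length, which is what makes this family of indexes attractive for read alignment. As a rough check on the figures above, 48,000 local indexes of ~64,000 bp each cover about 3.07 billion bp, which is on the order of the human genome.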
HISAT is available at: http://ccb.jhu.edu/software/hisat/index.shtml.
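StringTie was developed by Johns Hopkins University together with the University of Texas Southwestern Medical Center. It assembles transcripts and estimates their expression levels, applying a network flow algorithm with optional de novo assembly to assemble complex data sets into transcripts. Compared with programs such as Cufflinks, StringTie produced more complete and more accurate gene reconstructions and better estimates of expression levels on both simulated and real data sets.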
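For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the runner-up, Cufflinks, assembled only 7,187, an increase of 53%. On a simulated data set, StringTie correctly assembled 7,559 transcripts, 20% more than the 6,310 assembled by Cufflinks. It also runs faster than other assembly software. StringTie is available at: http://ccb.jhu.edu/software/stringtie/.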
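The two percentages follow directly from the transcript counts quoted above. A short check, using a hypothetical helper named `percent_increase`:

```python
def percent_increase(new, old):
    """Relative increase of new over old, in percent."""
    return (new - old) / old * 100


# Transcript counts quoted from the StringTie abstract.
blood = percent_increase(10990, 7187)      # StringTie vs Cufflinks, human blood
simulated = percent_increase(7559, 6310)   # StringTie vs Cufflinks, simulated data

print(f"human blood: {blood:.1f}%")   # human blood: 52.9%
print(f"simulated:   {simulated:.1f}%")   # simulated:   19.8%
```

Rounded to whole numbers these give the 53% and 20% reported in the paper.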
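Ballgown, published in Nature Biotechnology in early March, is a tool for differential expression analysis. Using data from RNA-Seq experiments, it estimates differential expression of genes, transcripts or exons. Detailed documentation for Ballgown is available at: https://github.com/alyssafrazee/ballgown.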
Abstracts of the three papers:
HISAT: a fast spliced aligner with low memory requirements
HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. Tests on real and simulated data sets showed that HISAT is the fastest system currently available, with equal or better accuracy than any other method. Despite its large number of indexes, HISAT requires only 4.3 gigabytes of memory. HISAT supports genomes of any size, including those larger than 4 billion bases.
StringTie enables improved reconstruction of a transcriptome from RNA-seq reads
Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.
Ballgown bridges the gap between transcriptome assembly and expression analysis
To the Editor:
Analysis of raw reads from RNA sequencing (RNA-seq) makes it possible to reconstruct complete gene structures, including multiple splice variants, without relying on previously established annotations. Downstream statistical modeling of summarized gene or transcript expression data output from these pipelines is facilitated by the Bioconductor project,…
Source: http://www.ebiotrade.com/newsf/2015-3/2015316173503298.htm