Victor Solovyev

A Tool for Reconstructing Sequences and Transcriptome Analysis using Next-Generation Sequencing Data

Victor Solovyev1, Igor Seledtsov2, Peter Kosarev2, Vladimir Molodtsov2

1Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK; 2Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY 10549, USA


The recently introduced next-generation sequencing instruments, including the 454 (Roche), Genome Analyzer (Illumina), ABI-SOLiD (Life Technologies), and several others, all require extensive informatics support in the form of automated pipelines.

We have developed the OligoZip tool for processing short reads. It provides effective solutions to the following tasks: de novo reconstruction of a genomic sequence; reconstruction of sequences based on a reference genome from the same or a closely related species; and mutation profiling and SNP discovery in a given set of genes. OligoZip uses an L-plet hashing technique to achieve fast data processing, and it takes read quality information into account. We have tested ab initio assembly on artificial oligos of several phages and an Arabidopsis chromosome. Real reads from a few Methanopyrus bacterial strains have been assembled into several hundred contigs.
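The L-plet hashing idea can be illustrated with a minimal sketch. This is not the OligoZip implementation: the function names `build_lplet_index` and `candidate_positions`, the toy reference, and the choice L = 4 are all assumptions made for illustration, and read quality values are ignored here. The sketch indexes every substring of length L in a reference and lets the L-plets of a read vote for the reference position where the read would start.

```python
from collections import defaultdict

def build_lplet_index(reference, l):
    """Map every L-plet (substring of length l) to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - l + 1):
        index[reference[pos:pos + l]].append(pos)
    return index

def candidate_positions(read, index, l):
    """Rank candidate start positions of the read by the number of L-plet votes."""
    votes = defaultdict(int)
    for offset in range(len(read) - l + 1):
        # index.get avoids inserting empty entries for unseen L-plets
        for pos in index.get(read[offset:offset + l], ()):
            start = pos - offset  # implied start of the whole read
            if start >= 0:
                votes[start] += 1
    # Most votes first; ties broken by leftmost position
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical toy data
reference = "ACGTTGCATGCCGATAGGCTAACGT"
index = build_lplet_index(reference, 4)
hits = candidate_positions("GCCGATAGG", index, 4)
print(hits[0])  # best candidate: (start position, number of supporting L-plets)
```

Because each lookup is a constant-time hash access, the cost of placing a read grows with the read length rather than with the reference length, which is the usual motivation for this kind of seeding.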

To map RNA-Seq data to a reference genome, assemble the reads into transcripts, and quantify the abundance of these transcripts in particular datasets, we built the Transomics computational pipeline. The pipeline aligns RNA-Seq reads to a genome, identifies alternative transcripts, and measures the expression levels of predicted transcript isoforms. To implement these components, we have adapted our existing general-purpose software, some of which is widely used in the analysis of biological sequences, to this specific task. For the initial step of mapping sequence reads to a genome, we applied a special variant of SCAN2, a member of our family of fast pairwise alignment programs. It classifies the given reads into two groups: “exonic” reads, which show a high-quality, uninterrupted alignment to the genomic sequence, and “non-mapped” reads. We expect this step to map most of the reads to the genome, leaving a “non-mapped” group small enough to be subjected to a more thorough analysis. As a second step, we used a modified variant of our EST_MAP program to align the set of “non-mapped” reads. The program uses splice site matrices and produces very accurate alignments of reads that are interrupted by an intron sequence. We incorporated read information into our FGENESH ab initio gene prediction program and developed an iterative procedure for identifying alternatively spliced gene variants.
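The two-step mapping strategy described above can be sketched as follows. This is a deliberately simplified illustration and not the SCAN2 or EST_MAP algorithms: exact string matching stands in for alignment, the canonical GT...AG intron boundary rule stands in for splice site matrices, and the names `map_ungapped`, `map_spliced`, `classify_read` and the parameter `min_anchor` are hypothetical.

```python
def map_ungapped(read, genome):
    """Step 1: uninterrupted placement of the read, or None."""
    pos = genome.find(read)
    return None if pos == -1 else pos

def map_spliced(read, genome, min_anchor=4):
    """Step 2: split the read across one intron that starts with GT and ends with AG.

    Returns (read_start, intron_start, intron_end) with a half-open intron
    interval, or None if no such placement exists.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start = genome.find(left)
        while start != -1:
            donor = start + split  # first base after the left anchor
            acceptor = genome.find(right, donor + 4)  # GT...AG needs >= 4 bases
            while acceptor != -1:
                intron = genome[donor:acceptor]
                if intron.startswith("GT") and intron.endswith("AG"):
                    return (start, donor, acceptor)
                acceptor = genome.find(right, acceptor + 1)
            start = genome.find(left, start + 1)
    return None

def classify_read(read, genome):
    """Label a read as 'exonic', 'spliced', or 'unmapped'."""
    if map_ungapped(read, genome) is not None:
        return "exonic"
    if map_spliced(read, genome) is not None:
        return "spliced"
    return "unmapped"

# Hypothetical toy gene: flank + exon 1 + intron + exon 2 + flank
genome = "TTT" + "ATGGCCAAG" + "GTAAGTCCCTTTCAG" + "GATTCCGGA" + "CCC"
junction_read = "GCCAAG" + "GATTCC"  # spans the exon 1 / exon 2 junction
print(classify_read("ATGGCC", genome), classify_read(junction_read, genome))
```

Running the cheap ungapped step first means the more expensive split search is attempted only for the reads that step 1 could not place, which mirrors the motivation given in the text.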

Finally, we have developed a module that computes the relative abundance of the alternative transcripts generated by the above-described approach by solving a system of linear equations. The initial variant of the Transomics pipeline has been successfully applied to data from the RGASP project.
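One way such a linear system can arise is sketched below. The model is an assumption for illustration, not the formulation used in Transomics: the read coverage of each exon segment is taken to be the sum of the abundances of the isoforms that contain it, and the resulting (possibly overdetermined) system is solved in the least-squares sense through the normal equations. The names `solve_linear` and `isoform_abundances` and the toy numbers are hypothetical.

```python
from fractions import Fraction

def solve_linear(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with exact fractions.

    Raises StopIteration if the matrix is singular (no non-zero pivot found).
    """
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def isoform_abundances(membership, coverage):
    """Least-squares abundances from the normal equations (M^T M) x = M^T c."""
    k = len(membership[0])
    mtm = [[sum(row[i] * row[j] for row in membership) for j in range(k)]
           for i in range(k)]
    mtc = [sum(row[i] * c for row, c in zip(membership, coverage))
           for i in range(k)]
    return solve_linear(mtm, mtc)

# Rows are exon segments, columns are isoforms A and B (toy example)
membership = [
    [1, 1],  # exon 1 is present in both isoforms
    [1, 0],  # exon 2 is present only in isoform A
    [1, 1],  # exon 3 is present in both isoforms
]
coverage = [30, 20, 30]  # hypothetical average read depth per exon segment

abundances = isoform_abundances(membership, coverage)
total = sum(abundances)
relative = [x / total for x in abundances]
print(abundances, relative)
```

Exact fractions are used only to keep the toy example free of rounding noise; a production implementation would more likely use floating-point least squares with a non-negativity constraint.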

Using the OligoZip assembler and our metagenomics gene prediction pipeline FgenesB, we have developed a novel computational approach for differentiating between toxic and non-toxic bacterial serotypes using next-generation sequencing data. This approach, which analyses DNA sequences extracted from environmental samples, can be applied to the analysis of bacterial infections and of environmental and food contamination.