Murdoch University Research Repository

G4009.04: Data analysis support for existing projects in SP2 with emphasis on analysis of next generation sequencing data

Varshney, R. (ORCID: 0000-0002-4562-9131), Thakur, V., Balaji, J., Shah, T., Bhanuprakash, A., May, G., Farmer, A., Studholme, D., Jones, J., Mauleon, R.P. and Bruskiewich, R. (2011) G4009.04: Data analysis support for existing projects in SP2 with emphasis on analysis of next generation sequencing data. Project Updates. CGIAR Generation Challenge Programme.

Abstract

Summary
Next generation sequencing (NGS), with its high throughput and low cost, has made it possible to generate genomic resources even for orphan crops. The resulting data sets are very large and require an efficient computational set-up for rapid and accurate analysis. Since one objective of the project is to identify variants or polymorphisms between genotypes, several sequential analyses (such as mapping, assembly and SNP calling) are required to process the raw data. In practice, this can be achieved with a pipeline that integrates available open-source tools for NGS data analysis.

This project was started with three main objectives: 1) benchmarking of available open-source short-read assembly and downstream analysis programs/software, 2) data analysis support, and 3) data integration, availability and visualization. The proposed NGS data analysis pipeline integrates open-source tools such as Maq and NovoAlign, which provide the complete repertoire of analysis functions required, together with in-house Perl scripts for identifying SNPs between parental genotypes. These tools were benchmarked and the identified SNPs were experimentally validated. GBrowse was configured to visualize the short reads mapped onto the reference sequence, as well as the variants. Results are kept as session-based output files to avoid conflicts when different users run the pipeline simultaneously. Several server-side in-house Perl scripts were integrated into the pipeline to automate two steps required by GBrowse: generating the gff3 files that contain the mapping results, and updating the database configuration file. As an alternative, Tablet, a lightweight client-side tool, can also be used.
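The gff3 generation and session-based output described above can be illustrated with a minimal Python sketch. It is not the project's in-house Perl code: the function names, the `mapper` source label, the `match` feature type and the `session_` directory prefix are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote


def gff3_line(seqid, start, end, strand, read_id,
              source="mapper", feature_type="match"):
    """Format one mapped read as a 9-column GFF3 record.

    Coordinates are 1-based and inclusive, as GFF3 requires.
    The source label and feature type are illustrative defaults.
    """
    if start > end:
        raise ValueError("GFF3 requires start <= end")
    # Percent-encode the read ID so reserved characters cannot break column 9.
    attributes = f"ID={quote(read_id, safe='')}"
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", strand, ".", attributes]
    return "\t".join(columns)


def write_session_gff3(mappings, base_dir):
    """Write mappings into a fresh per-session directory.

    Each call gets its own directory, so simultaneous users
    never write to the same output file.
    """
    session_dir = Path(tempfile.mkdtemp(prefix="session_", dir=base_dir))
    out_path = session_dir / "mapped_reads.gff3"
    with open(out_path, "w") as handle:
        handle.write("##gff-version 3\n")
        for mapping in mappings:
            handle.write(gff3_line(**mapping) + "\n")
    return out_path
```

Each mapping is a dictionary with the keys `seqid`, `start`, `end`, `strand` and `read_id`; a real pipeline would fill these from the mapper's output rather than by hand.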

The proposed activities under objective 1 have been fully achieved. All activities and quantifiable outputs under objectives 2 and 3 have also been achieved, with two exceptions: the implementation of digital gene expression (objective 2) and the implementation of the NGS pipeline as a Taverna workflow (objective 3). The pipeline can be accessed at http://hpc.icrisat.cgiar.org/ngs/ .

In addition to the proposed objectives, SOAP (Li et al. 2009) has also been benchmarked, because it uses the Burrows-Wheeler transform (BWT), which allows faster alignment of reads to the reference with reduced memory usage. It will be integrated into the pipeline, along with other in-house Perl scripts, in phase II of the project. Regarding data availability, documentation of the procedural and methodological aspects of the pipeline is in progress. Several of the datasets are already available, either in the public domain or locally, and the appropriate ones will be posted on Cropforge (http://cropforge.org/).
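For readers unfamiliar with the Burrows-Wheeler transform, the following minimal Python sketch shows the transform itself and its inverse on a short string. It is a naive textbook version built from sorted rotations, not SOAP's implementation: BWT-based aligners rely on compressed index structures built on top of the transform, which this sketch does not attempt. The function names and the `$` sentinel are illustrative choices.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text.

    A sentinel that sorts before every other character is appended,
    all rotations are sorted, and the last column is returned.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in input")


print(bwt("banana"))              # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform groups identical characters that share a following context, which is what makes a compact, searchable index of the reference possible.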

Conclusions
Benchmarking of tools for variant detection showed that mapping tools perform better than de novo assembly tools. Among the mapping tools, we found Maq to be one of the most useful. Several tool-specific parameters affect the quality of the results, so SNP calling was carried out at varying levels of stringency for these parameters. To identify highly specific SNPs, the existing approach to depth-based consensus calling was modified so that only high-quality bases are counted. Several Perl scripts were written to extract the desired information from the output of the above-mentioned tools. Finally, a list of SNPs was generated for two chickpea genotypes using the three approaches/tools, and the results were compared. A sample of the predicted SNPs was experimentally validated to assess the prediction accuracy of these tools/approaches. The pipeline has been developed and can be accessed at http://hpc.icrisat.cgiar.org/ngs/ .
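The idea of depth-based consensus calling restricted to high-quality bases can be sketched in Python as follows. This is an illustration only, not the project's modified method: the input layout (a dictionary from position to a list of base and Phred-quality pairs), the function names and every threshold value are assumptions.

```python
from collections import Counter


def consensus_base(pileup, min_qual=20, min_depth=3, min_fraction=0.8):
    """Call a consensus base from (base, phred_quality) pairs.

    Only bases with quality >= min_qual count towards depth.
    Returns None when depth or agreement is too low.
    All thresholds are hypothetical defaults.
    """
    good = [base for base, qual in pileup if qual >= min_qual]
    if len(good) < min_depth:
        return None
    base, count = Counter(good).most_common(1)[0]
    return base if count / len(good) >= min_fraction else None


def call_snps(pileups_a, pileups_b, **thresholds):
    """Report positions where two genotypes have different consensus bases.

    A position is reported only when both genotypes yield a
    confident consensus call.
    """
    snps = []
    for pos in sorted(pileups_a.keys() & pileups_b.keys()):
        a = consensus_base(pileups_a[pos], **thresholds)
        b = consensus_base(pileups_b[pos], **thresholds)
        if a is not None and b is not None and a != b:
            snps.append((pos, a, b))
    return snps
```

Raising `min_qual`, `min_depth` or `min_fraction` corresponds to calling SNPs at a higher level of stringency, trading sensitivity for specificity.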

Item Type: Others
Publisher: CIMMYT
Publisher's Website: https://www.generationcp.org/
URI: http://researchrepository.murdoch.edu.au/id/eprint/63334