Cannell, R.*, Aron, S., Hazelhurst, S.
Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
The identification of genetic variants associated with a disease or phenotype forms the basis of most human genetic studies. Whole genome sequencing (WGS) serves as the optimal approach to identify variation across the entire human genome in a single experiment. While this has proven a costly approach in the past, a significant reduction in sequencing costs has now made it feasible to sequence large cohorts to address specific biological questions. The reduction in costs has transformed the scale at which genomic data is produced, shifting the focus to optimizing the computational methods required to process and analyse WGS data accurately and efficiently. A number of algorithms and pipelines have been developed to call genetic variants from WGS data; however, few have proved to scale well enough to process thousands or tens of thousands of samples accurately and efficiently on modest computing infrastructure. Illumina has developed the Dynamic Read Analysis for GENomics (DRAGEN) pipeline, which is accessible on a variety of computing infrastructure, including the cloud via the Illumina Connected Analytics (ICA) platform. DRAGEN has been shown to accurately and comprehensively call a range of variant types at scale. ICA provides an interface to run the DRAGEN pipeline to call variants from WGS data utilizing cloud resources; however, this can prove costly with large datasets because of the length of time that input and output datasets must be stored on cloud resources. To assess the viability of using DRAGEN to call variants from WGS data cost-efficiently, we developed a Nextflow workflow that optimizes the upload and download of files during the DRAGEN variant calling pipeline to reduce the cloud storage costs associated with the analysis. The Nextflow workflow consists of several individual processes, each aimed at optimizing one step of the pipeline. These processes execute Bash commands in addition to commands from the ICA command line interface (CLI).
The workflow begins by uploading an input file or file path to an existing project. The necessary data attributes of the file (file ID, name, or path) are extracted and passed as input to the next process. A polling mechanism in that process checks whether the file is available for analysis. Once the file is available, an analysis using the DRAGEN pipeline is initiated. A second polling mechanism is then triggered to monitor the progress of the pipeline run. When the pipeline run is complete, the output files are downloaded, and both the initially uploaded data and the output files are deleted from cloud storage. Nextflow's ability to run several processes concurrently allows us to upload multiple files at once and to trigger several pipeline runs so that the analysis takes place in parallel. Although ICA provides a user interface to upload data and run DRAGEN, this workflow allows for more flexibility and control over data transfer and cloud storage costs. Final benchmarking and costing of the completed workflow will allow for an assessment of the feasibility of utilizing the DRAGEN platform to call variants in large WGS studies.
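The upload, poll, analyse, poll, download, delete sequence described above can be illustrated with a minimal Python sketch. It is not the Nextflow workflow itself and does not call the ICA CLI. The `client` object, its method names (`upload`, `file_status`, `start_analysis`, `run_status`, `download_outputs`, `delete`) and the state names (`AVAILABLE`, `SUCCEEDED`, `FAILED`) are hypothetical placeholders for whatever the real commands return. The polling interval and timeout values are likewise assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def poll_until(get_status: Callable[[], str],
               done: Iterable[str],
               failed: Iterable[str] = (),
               interval_s: float = 30.0,
               timeout_s: float = 3600.0,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Call get_status() repeatedly until it returns a state in `done`.

    Raises RuntimeError on a state in `failed` and TimeoutError once
    `timeout_s` seconds have elapsed without reaching a `done` state.
    """
    done, failed = set(done), set(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"terminal failure state: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} s")
        sleep(interval_s)


def process_sample(path, client, interval_s=30.0):
    """One sample's lifecycle; `client` is a hypothetical wrapper object."""
    file_id = client.upload(path)
    # First poll: wait until the uploaded file can be used as input.
    poll_until(lambda: client.file_status(file_id),
               done={"AVAILABLE"}, interval_s=interval_s)
    run_id = client.start_analysis(file_id)
    # Second poll: wait for the pipeline run to finish.
    poll_until(lambda: client.run_status(run_id),
               done={"SUCCEEDED"}, failed={"FAILED"}, interval_s=interval_s)
    outputs = client.download_outputs(run_id)
    # Delete input and outputs promptly so they stop accruing storage costs.
    client.delete(file_id)
    client.delete(run_id)
    return outputs


def run_all(paths, client, max_workers=4, interval_s=30.0):
    """Process several samples concurrently, one worker thread per sample."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda p: process_sample(p, client, interval_s), paths))
```

In this sketch each sample's data is deleted as soon as its own outputs have been downloaded, rather than after the whole cohort finishes, which is the property that limits how long files stay in cloud storage. The thread pool only stands in for the concurrency that Nextflow provides across processes.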
Keywords: DRAGEN, WGS, platform