Wolberg, Y. A.*1,2, Hazelhurst, S.2,3, Lombard, Z.1,4
1 Division of Human Genetics, National Health Laboratory Service and School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
2 Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
3 School of Electrical and Information Engineering, Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, South Africa
4 Department of Internal Medicine, School of Clinical Medicine, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
Copy number variations (CNVs) are large structural variations that have been implicated in a number of conditions, including neurodevelopmental disorders. Currently, the first-tier diagnostic test for individuals with developmental or congenital disabilities is chromosomal microarray. However, chromosomal microarray has limited resolution compared to next-generation sequencing approaches, and whole exome sequencing is also being used to detect CNVs. Deciphering Developmental Disorders in Africa (DDD-Africa) is a study that aims to uncover the genetic basis of developmental disorders in African populations and, in so doing, improve diagnostic options in resource-constrained environments. To this end, it will use whole exome sequencing to detect single nucleotide variants (SNVs), insertions and deletions (InDels) and CNVs. However, detecting CNVs from whole exome sequencing data remains a challenge despite the steadily growing number of tools designed for this purpose. Furthermore, there is no standard method for merging similar CNVs. Finally, not all publications provide their pipelines, and the tools they describe may have outdated dependencies. Therefore, this project seeks to create a bioinformatics workflow that integrates CNV calling across different calling algorithms to optimise variant identification. The workflow will include a step that merges overlapping CNVs and a random forest model that performs in silico validation of the predicted CNVs. The calling and merging tools, along with the random forest model, will be incorporated into a bioinformatics pipeline built with Nextflow, which enables reproducibility and portability (through containers) as well as scalability and efficiency (through built-in task-level parallelisation). The completed pipeline will then be used to detect CNVs from the DDD-Africa exome sequencing data. A preliminary workflow and preliminary results from applying these tools to data from the DDD-UK study will be presented.
This project aims to provide a reproducible, portable, scalable and computationally efficient pipeline that accurately detects CNVs in exome sequencing data.
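As an illustration of the merging step mentioned above, the following is a minimal Python sketch of one possible way to combine overlapping CNV calls from several callers. It is not the project's method: the abstract notes that no standard merging approach exists, and the rule used here (same chromosome, same CNV type, any overlap of closed intervals, merged into their union) is an assumption chosen for simplicity. The `CNV` class, the `merge_overlapping` function and the caller names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CNV:
    """A single CNV call (hypothetical structure, for illustration only)."""
    chrom: str
    start: int  # inclusive start coordinate
    end: int    # inclusive end coordinate
    kind: str   # "DEL" or "DUP"
    callers: set = field(default_factory=set)  # tools that support the call


def merge_overlapping(calls):
    """Merge calls that share chromosome and type and whose intervals overlap.

    Assumption: any overlap is enough to merge, and the merged call spans
    the union of the intervals. A real workflow might instead require a
    minimum reciprocal overlap.
    """
    merged = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.kind, c.start, c.end))
    for call in ordered:
        last = merged[-1] if merged else None
        if (last is not None
                and last.chrom == call.chrom
                and last.kind == call.kind
                and call.start <= last.end):
            # Extend the previous merged call and record the extra support.
            last.end = max(last.end, call.end)
            last.callers |= call.callers
        else:
            # Copy the call so the input objects are never modified.
            merged.append(CNV(call.chrom, call.start, call.end,
                              call.kind, set(call.callers)))
    return merged


if __name__ == "__main__":
    example = [
        CNV("chr1", 100, 500, "DEL", {"callerA"}),
        CNV("chr1", 400, 900, "DEL", {"callerB"}),
        CNV("chr1", 450, 600, "DUP", {"callerA"}),
        CNV("chr2", 100, 200, "DEL", {"callerC"}),
    ]
    for cnv in merge_overlapping(example):
        print(cnv.chrom, cnv.start, cnv.end, cnv.kind, sorted(cnv.callers))
```

Recording the set of supporting callers for each merged call is one way to carry caller agreement forward as a feature for a downstream model such as the random forest described above, although whether the project does so is not stated.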