Aron, S. L.1*, Farat, M.2, Panji, S.2, Fields, C.3, Mulder, N.2, and members of the RefGraph Work Package
1 Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
2 Computational Biology Division, Department of Integrative Biomedical Sciences, IDM, CIDRI Africa Wellcome Trust Centre, University of Cape Town, Cape Town, South Africa
3 High Performance Computing in Biology, Roy J. Carver Biotechnology Center, University of Illinois Urbana-Champaign, Urbana and Champaign, Illinois, United States of America
The human reference genome (GRCh38) forms the basis for the identification of variation in genomic studies. Although the completion of the human reference genome was a landmark achievement, rapid advancements in long-read sequencing technologies, combined with the development of novel algorithmic approaches, present the opportunity to generate an improved, complete and more representative human reference genome. The effort to generate the first Telomere-to-Telomere human reference sequence has taken the first steps towards filling in the missing pieces of the genome that have been overlooked for several years. Building on this achievement, the Human Pangenome Reference Consortium has generated the first graph-based representation of the human genome, incorporating information from a diverse set of high-quality human genome assemblies. This pangenome graph structure allows for common variation within each population to be represented in the reference and has been shown to improve the accuracy of calling both simple and complex variants. Given these advancements, coupled with the extent of genetic diversity observed across Africa, we aimed to generate a pangenome graph based on 60x coverage Pacific Biosciences HiFi long-read sequence data from 27 African samples, representing East, West and Southern Africa. We have built one workflow to generate and assess the quality of the long-read-based de novo assemblies and another to generate pangenome graphs using two current methods. An additional workflow is under development to extract and assess African variation within specific regions of the graphs, as well as to call variants using the pangenome graphs as a reference. Initial analyses based on a preliminary dataset of 30x coverage long-read sequence data have generated high-quality assemblies with contig N50s of 31-49 Mb.
These assemblies have been used to generate draft pangenome graphs using the reference-free PanGenome Graph Builder (PGGB) and reference-derived Minigraph-Cactus algorithms. Various approaches are being explored for the graph-building methods, as they require a significant amount of computational resources to run in a reasonable amount of time. The workflows are being optimised and will be used to analyse the final 60x coverage dataset. While a global pangenome graph based on genomes from multiple populations will most likely be more appropriate as a reference resource, an African-based pangenome graph will serve as a valuable resource, providing insights into regions of complex variation in African populations. The current draft pangenome graphs are being assessed for accuracy and completeness and will form the basis for an upcoming data jamboree aimed at assisting African researchers with using the pangenome graphs to improve variant calling in African samples, as well as with interrogating the graphs to better understand the genetic variation present across the African genomes.
Keywords: African, pangenome, reference graph, genome assembly, long-read sequencing