Simelane, N. J.*1,2, Sathekge, K. H.*1, Lötter, A.1, Christie, N.1, Myburg, A. A.1
1 Department of Biochemistry, Genetics and Microbiology & Department of Computer Science, Forest Molecular Genetics (FMG) Programme, Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, Pretoria, South Africa
2 Department of Computer Science, University of Pretoria
Haplotype imputation, which predicts missing genetic variants from patterns in observed data, is a critical tool in population genomics. In sparsely sequenced datasets, imputation enables the recovery of co-inherited variants, or haplotypes. Traditional imputation tools such as IMPUTE and BEAGLE rely on Hidden Markov Models (HMMs) but struggle with rare variants, which are often associated with important traits. In this study, we explore bioinformatic tools and machine learning (ML) approaches to improve imputation accuracy for rare and uncommon variants in wild Eucalyptus grandis populations, which are known for their genetic diversity and economic importance in forestry. We will construct a haplotype map (HapMap) to serve as a reference panel, using GATK and DeepVariant for variant calling, followed by SHAPEIT and BEAGLE for haplotype phasing. The panel will consist of deeply sequenced E. grandis individuals and will serve as the foundation for imputation with both HMM- and ML-based methods, including neural networks and generative adversarial networks. The RefRGim method will aid in selecting study-specific reference panels and in optimising for genetic diversity through phylogenetic and coalescence-based strategies. This pipeline will be applied to a large E. grandis dataset to assess imputation accuracy, computational cost, and scalability. Our approach has the potential to transform breeding programmes by enabling large-scale genotype imputation, improving trait association studies, and enhancing the efficiency of genomic selection. Ultimately, the E. grandis HapMap will provide a valuable resource for forestry genomics and population genetics.
Keywords: haplotype reference panel, imputation, bioinformatics, machine learning
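The abstract proposes assessing imputation accuracy with a focus on rare and uncommon variants. As a minimal sketch of what such an assessment could look like, the Python snippet below compares masked true genotypes with imputed dosages and summarises two common metrics per minor allele frequency (MAF) bin: genotype concordance and the squared Pearson correlation (dosage r²). The function names, the 0/1/2 allele-count encoding, and the bin edges are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def minor_allele_frequency(genotypes):
    """MAF per variant from a (variants x samples) matrix of 0/1/2 allele counts."""
    alt_freq = genotypes.mean(axis=1) / 2.0
    return np.minimum(alt_freq, 1.0 - alt_freq)


def per_variant_r2(truth, imputed):
    """Squared Pearson correlation per variant; NaN if either row is constant."""
    t = truth - truth.mean(axis=1, keepdims=True)
    i = imputed - imputed.mean(axis=1, keepdims=True)
    num = (t * i).sum(axis=1) ** 2
    den = (t ** 2).sum(axis=1) * (i ** 2).sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(den > 0, num / den, np.nan)


def accuracy_by_maf(truth, imputed, bin_edges=(0.0, 0.01, 0.05, 0.5)):
    """Summarise accuracy per MAF bin (lo, hi]; the bin edges are an assumption."""
    truth = np.asarray(truth, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    maf = minor_allele_frequency(truth)
    r2 = per_variant_r2(truth, imputed)
    # Concordance: fraction of samples whose rounded dosage equals the truth.
    concordance = (np.rint(imputed) == truth).mean(axis=1)

    summary = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (maf > lo) & (maf <= hi)
        if not mask.any():
            continue  # no variants fall in this bin
        bin_r2 = r2[mask]
        valid = ~np.isnan(bin_r2)
        summary[(lo, hi)] = {
            "n_variants": int(mask.sum()),
            "mean_r2": float(bin_r2[valid].mean()) if valid.any() else float("nan"),
            "mean_concordance": float(concordance[mask].mean()),
        }
    return summary
```

Reporting r² alongside concordance matters for rare variants: imputing every sample as homozygous reference can yield high concordance at a rare site while carrying no information about the minor allele, which r² exposes.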