Cannell, R.*, Aron, S., Hazelhurst, S.
Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand,
Johannesburg, South Africa
Large collaborative genomic studies rely on the ability to efficiently store, organise, manage and share data. This is particularly challenging when data originate from a combination of sources and in a variety of formats. A data commons provides a framework for sharing data across a network of sites, based on a standardised metadata schema that eases data access and interoperability. One of the main goals of the eLwazi Open Data Science Platform (ODSP) is to develop a flexible, scalable infrastructure that integrates data, tools, and workflow components for the execution of analyses on diverse computing environments via simple graphical user interfaces. The Gen3 data commons is a collection of microservices that work together to provide users with a unified ecosystem to house, share and analyse data via a user-friendly interface. The microservice architecture of Gen3, which makes use of container technologies, provides the key components upon which to build customised data-specific or user-based solutions. While Gen3 has been developed to function using both cloud-based and on-premises resources, a significant amount of work is required to implement an instance of the platform on a local computing cluster. As a potential component of the eLwazi ODSP, we aimed to set up an on-premises implementation of Gen3 on the University of the Witwatersrand (Wits) computing cluster, with the goal of creating a data commons for genomics and other research sites across Africa. Although the cloud-based solution is widely used, the associated usage costs remain a major limitation in an African setting. We therefore focused on an on-premises deployment, which entailed adapting several Gen3 services to interact with our local computational resources.
Our implementation (https://github.com/SBIMB/gen3-dev) makes use of Kubernetes for container orchestration and is installed on a bare-metal machine running a Linux operating system (Ubuntu 22.04). Helm charts have been used to deploy the Gen3 microservices onto our Kubernetes cluster. Current functionality includes the ability for a user to create an account and upload data files to be shared or analysed. Data files are uploaded using the Gen3 command line interface tool, or the Gen3 Python SDK via an application programming interface. Data files can be linked to metadata variables that are organised into a data model of logically related categories (or nodes). The pilot Gen3 deployment will be set up using a genomics-related data dictionary to allow for querying and visualisation of metadata. For data analysis, Gen3 provides a workspace interface via JupyterHub, which supports interactive programming sessions in both Python and R. Workspace sessions and their associated data are saved to a local drive. Upon successful deployment of Gen3 at the Wits site, the system will be assessed for deployment at additional sites within the eLwazi network.
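To illustrate how data files can be linked to metadata organised into nodes, the following is a minimal, self-contained Python sketch. It does not use the Gen3 SDK or the actual Gen3 dictionary format. The node names (`project`, `subject`, `sample`, `data_file`), their properties, and the helper functions `validate_record` and `load_records` are all hypothetical, chosen only to show the idea of a data model in which each record must carry required properties and link to an existing parent record.

```python
# Hypothetical, simplified data dictionary (not the real Gen3 schema).
# Each node type lists its required properties and the parent node type
# it must link to (None for the root node).
DATA_DICTIONARY = {
    "project": {"required": {"code"}, "parent": None},
    "subject": {"required": {"submitter_id", "sex"}, "parent": "project"},
    "sample": {"required": {"submitter_id", "tissue"}, "parent": "subject"},
    "data_file": {
        "required": {"submitter_id", "file_name", "md5sum"},
        "parent": "sample",
    },
}


def validate_record(record, known_ids):
    """Return a list of problems for one metadata record (empty if valid)."""
    node_type = record.get("type")
    schema = DATA_DICTIONARY.get(node_type)
    if schema is None:
        return [f"unknown node type: {node_type!r}"]

    problems = []
    missing = schema["required"] - set(record)
    if missing:
        problems.append(f"missing properties: {sorted(missing)}")

    # A non-root record must point at a parent that was already accepted.
    parent = schema["parent"]
    if parent is not None:
        parent_id = record.get(parent)
        if (parent, parent_id) not in known_ids:
            problems.append(f"no {parent} with id {parent_id!r}")
    return problems


def load_records(records):
    """Validate records in order; return (accepted ids, errors by index)."""
    known_ids = set()
    errors = {}
    for index, record in enumerate(records):
        problems = validate_record(record, known_ids)
        if problems:
            errors[index] = problems
        else:
            # Projects are identified by "code", other nodes by "submitter_id".
            key = record.get("submitter_id", record.get("code"))
            known_ids.add((record["type"], key))
    return known_ids, errors


# Illustrative records only; identifiers and checksums are made up.
records = [
    {"type": "project", "code": "pilot"},
    {"type": "subject", "submitter_id": "S1", "sex": "female", "project": "pilot"},
    {"type": "sample", "submitter_id": "SM1", "tissue": "blood", "subject": "S1"},
    {"type": "data_file", "submitter_id": "F1", "file_name": "SM1.vcf.gz",
     "md5sum": "0" * 32, "sample": "SM1"},
    # This file refers to a sample that was never submitted.
    {"type": "data_file", "submitter_id": "F2", "file_name": "SM9.vcf.gz",
     "md5sum": "0" * 32, "sample": "SM9"},
]

known, errors = load_records(records)
print(errors)
```

In this sketch the first four records are accepted because each links to a parent that already exists, while the last data file is rejected because its sample is unknown. Processing records from the root node downwards is an assumption of the sketch, made so that parents are always validated before their children.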
Keywords: HPC, Gen3, Python, analysis