Bachelor's Theses, Department of Computer Engineering and Mathematics

A compressed file partitioner for scalable genomics analysis with serverless technology

  • Identification data

    Identifier:  TFG:5628
    Authors:  Maleno Gonzalez, Francisco Damián
    Abstract:
    Advances in next-generation sequencing technologies have revolutionized molecular biology by enabling millions of genomic sequences to be sequenced on a massive scale. The resulting volumes of genomic data require exhaustive bioinformatic processing to be interpreted correctly, a need that traditional computing struggles to meet. Researchers have therefore turned to serverless architectures, which make it possible to process otherwise unmanageable volumes of data from a personal computer. These architectures take responsibilities such as resource provisioning and management away from the programmer and are based on the principles of simplicity, scalability, and billing only for the resources used. Motivated by the better performance and lower cost, bioinformatics research groups have decided to migrate their experiments to this architecture using serverless data analysis frameworks such as Lithops. However, although these architectures impose fewer limitations on data storage, the frameworks have not been designed to work with all types of data. Genomic data is often stored in Gzip-compressed files of tens of terabytes, so a utility is needed that can decompress portions of these large files on the fly for analysis in serverless functions. With the data partitioner and retriever for Gzip-compressed files implemented in this study, bioinformaticians can run their experiments on the Lithops serverless data analysis framework in a simple way, with a programming experience driven by data rather than by resource management. To validate the efficiency of this system, Cloudbutton's genomic use case 'SNP Variant Caller' has been implemented with satisfactory results.
  • Others:

    Department: Enginyeria Informàtica i Matemàtiques
    TFG credits: 12
    Subject: Data recovery (Computer science)
    Work's public defense date: 2022-01-21
    Creation date in repository: 2023-02-07
    Academic year: 2021-2022
    Student: Maleno Gonzalez, Francisco Damián
    Access rights: info:eu-repo/semantics/openAccess
    Education area(s): Computer Engineering
    Entity: Universitat Rovira i Virgili (URV)
    Confidentiality: No
    Project director: García López, Pedro
    Language: en
  • Keywords:

    data analytics
    genomics
    compressed files
    partitioner
    computer engineering