
A compressed file partitioner for scalable genomics analysis with serverless technology

  • Identification data

    Identifier: TFG:5628
  • Authors:

    Maleno Gonzalez, Francisco Damián
  • Others:

    Creation date in repository: 2023-02-07
    Abstract: Advances in next-generation sequencing technologies have revolutionized the study of molecular biology by enabling the sequencing of millions of genomic sequences on a massive scale. The resulting volumes of genomic data require exhaustive bioinformatic processing to be interpreted correctly, a demand that traditional computing struggles to meet. Serverless architectures have therefore been adopted: they make it possible to process otherwise unfeasible volumes of data from a personal computer, relieve the programmer of responsibilities such as resource provisioning and management, and are based on the principles of simplicity, scalability, and billing only for the resources used. Motivated by the better performance and lower cost of this architecture, bioinformatics research groups have decided to migrate their experiments to it using serverless data analysis frameworks such as Lithops. However, although these architectures impose fewer limitations on data storage, the frameworks have not been designed to work with all types of data. Genomic data is often stored in Gzip-compressed files of tens of terabytes, so a utility is needed that can decompress portions of these large files 'on-the-fly' for analysis in serverless functions. Thanks to the data partitioner and retriever for Gzip-compressed files implemented in this study, bioinformaticians will be able to run their experiments with the Lithops serverless data analysis framework in a simple way, enjoying a programming experience driven by data rather than by resource management. To validate the efficiency of this system, Cloudbutton's genomic use case 'SNP Variant Caller' has been implemented with satisfactory results.
    Subject: Dades. Recuperació (Informàtica)
    Language: en
    Subject areas: Ingeniería informática / Computer engineering / Enginyeria informàtica
    Department: Enginyeria Informàtica i Matemàtiques
    Student: Maleno Gonzalez, Francisco Damián
    Academic year: 2021-2022
    Title in different languages: Un particionador de ficheros comprimidos para un análisis genómico escalable mediante la tecnología serverless / A compressed file partitioner for scalable genomics analysis with serverless technology / Un particionador de fitxers comprimits per a una anàlisi genòmica escalable mitjançant la tecnologia serverless
    Work's public defense date: 2022-01-21
    Access rights: info:eu-repo/semantics/openAccess
    Keywords: análisis de datos, genómica, serverless, archivos comprimidos, particionador / data analytics, serverless, genomics, compressed files, partitioner / anàlisis de dades, genòmica, serverless, arxius comprimits, particionador
    Confidentiality: No
    TFG credits: 12
    Title in original language: A compressed file partitioner for scalable genomics analysis with serverless technology
    Project director: García López, Pedro
    Education area(s): Enginyeria Informàtica
    Entity: Universitat Rovira i Virgili (URV)
  • Keywords:

    Ingeniería informática
    Computer engineering
    Enginyeria informàtica
    Dades. Recuperació (Informàtica)