Computational methods in evolution-aware pangenomics for graph and sequence analyses


Time & Location

20. 12. 2024, 19:00 – 23:00

Virtual event

About the event

Abstract


Pangenomes, represented either as a graph or as a collection of genomes, inherently capture more variability than a single reference genome. Moving from a reference genome as a string to a pangenome graph requires construction procedures that produce graphs suitable for sequence-to-graph tools. At the same time, the increasing number of available genomes demands novel methods that handle pangenomes efficiently and accurately.


We present pangeblocks, an approach to construct variation graphs starting from a multiple sequence alignment (MSA) that leverages the notion of maximal blocks. The MSA naturally highlights similarities and differences between a set of genomic sequences, and a block captures a subset of sequences that share the same substring over an interval of MSA columns. pangeblocks is an Integer Linear Programming (ILP) approach that finds a tiling of the MSA using blocks. The construction is guided by several objective functions that aim to enforce desired properties of the final graph. Some are natural criteria, such as the number of nodes and the length of node labels; others are intended to ensure good properties of the graph for downstream analyses, such as optimizing the number of seeds for sequence-to-graph tools. The nature of our approach restricts our analyses to relatively short genomic sequences, such as genes, plasmids, or viral genomes like SARS-CoV-2, which is around 30 kbp long.
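To make the idea of tiling an MSA with blocks concrete, here is a minimal Python sketch on a toy alignment. It is not the pangeblocks implementation. The ILP is replaced by an exhaustive search that only works at toy sizes, and the candidate blocks are a brute-force enumeration of every group of rows sharing a substring over every column interval, rather than the maximal blocks described above. The function names (`candidate_blocks`, `cells`, `cost`, `min_tiling`) and the objective (total label length, ties broken by number of blocks) are illustrative assumptions. Gap characters are not treated specially.

```python
from collections import defaultdict


def candidate_blocks(msa):
    """Enumerate blocks (rows, start, end, label): rows sharing msa[r][start:end+1].

    Brute-force enumeration for illustration; not restricted to maximal blocks.
    """
    n_cols = len(msa[0])
    blocks = []
    for start in range(n_cols):
        for end in range(start, n_cols):
            groups = defaultdict(list)
            for r, row in enumerate(msa):
                groups[row[start:end + 1]].append(r)
            for label, rows in groups.items():
                blocks.append((frozenset(rows), start, end, label))
    return blocks


def cells(block):
    """Set of (row, column) cells of the MSA covered by a block."""
    rows, start, end, _ = block
    return {(r, c) for r in rows for c in range(start, end + 1)}


def cost(tiling):
    """Assumed objective: total label length, then number of blocks."""
    return (sum(len(label) for _, _, _, label in tiling), len(tiling))


def min_tiling(msa):
    """Exhaustive search for a minimum-cost tiling (toy sizes only).

    A tiling covers every MSA cell exactly once, which is the role the
    covering constraints play in an ILP formulation.
    """
    all_cells = {(r, c) for r in range(len(msa)) for c in range(len(msa[0]))}
    blocks = [(b, cells(b)) for b in candidate_blocks(msa)]
    best = None

    def search(uncovered, chosen):
        nonlocal best
        # Adding blocks never lowers the cost, so this branch cannot win.
        if best is not None and cost(chosen) >= cost(best):
            return
        if not uncovered:
            best = list(chosen)
            return
        target = min(uncovered)  # next cell that still needs a block
        for block, covered in blocks:
            # The block must cover the target and must not overlap chosen blocks.
            if target in covered and covered <= uncovered:
                search(uncovered - covered, chosen + [block])

    search(all_cells, [])
    return best


msa = ["ACGT", "ACGA", "TCGA"]
for rows, start, end, label in min_tiling(msa):
    print(sorted(rows), start, end, label)
```

On this toy alignment, the shared middle columns end up in a single block labeled `CG` that spans all three rows. Minimizing only the number of blocks would instead return one block per row, which illustrates why the choice of objective shapes the resulting graph.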


Additionally, we exploit an encoding of DNA called the Chaos Game Representation (CGR), on top of which a k-mer-based representation of a sequence is built, known as the Frequency Matrix of the CGR (FCGR). The FCGR is a matrix storing the k-mer occurrences of a DNA sequence; when visualized as an image, it shows fractal patterns. We develop novel deep learning architectures that exploit the FCGR in combination with metric learning approaches, and propose an embedding-based index for the largest bacterial dataset, allowing data curation, fast queries at the assembly level, and accurate taxonomic classification at the species and genus levels. Most notably, the index for 1.8 million bacterial genomes uses roughly 1 GB of disk space, and queries require less than 5 GB of RAM, building bridges for analyses of high-volume genomic datasets with low computational resources. Classification based on queries to the index is very accurate at the genus level but remains challenging at the species level for sequences whose k-mer representations are very similar.
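As an illustration of the FCGR described above, the following Python sketch counts the k-mers of a sequence into a 2^k x 2^k matrix. It is a minimal sketch under stated assumptions, not the code used in the talk. The corner assignment for A, C, G, and T is one common convention (others exist and only permute the cells), k-mers containing characters other than A, C, G, T are skipped, and the names `CORNER`, `kmer_to_cell`, and `fcgr` are hypothetical.

```python
# Assumed corner layout: each nucleotide selects a quadrant via (row_bit, col_bit).
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in the 2^k x 2^k grid.

    As in the CGR iteration, the last nucleotide selects the coarsest
    quadrant, so bases are consumed from the end of the k-mer.
    """
    row = col = 0
    for base in reversed(kmer):
        r, c = CORNER[base]
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def fcgr(seq, k):
    """Return the k-mer count matrix of seq as a list of lists."""
    size = 1 << k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER for base in kmer):
            continue  # assumption: skip k-mers with ambiguous bases such as N
        row, col = kmer_to_cell(kmer)
        matrix[row][col] += 1
    return matrix


matrix = fcgr("ACGTACGTTTGACA", k=2)
for row in matrix:
    print(row)
```

Each cell holds the count of exactly one k-mer, so the matrix can be rendered as a grayscale image or passed to a model as a fixed-size input regardless of sequence length.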



Link to the recording:

https://drive.google.com/file/d/1ax7NSgADnC6RgRId4uiLUIYv9Byi6htP/view?usp=sharing


This project has received funding from the Horizon Europe program under grant agreement No. 101160008 
