Pangenomics Alignment

In the past decade, there has been an effort to sequence and compare a large number of individual genomes of a given species, resulting in many (reference) genomes of various species being made publicly available. For example, there is now public data for the 1000 Genomes Project, the 100,000 Genomes Project, the 1001 Arabidopsis Genomes project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Short-read aligners, such as BWA, Bowtie, and SOAP2, have been fundamental to the analysis of these datasets, and have enabled the discovery of genetic markers that have causal relationships with countless diseases and phenotypes. These methods take as input a set of sequence reads and a reference genome, build an index from the reference genome, and use this index to find alignments, with the limitation that only a few insertions and deletions are allowed.
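As a rough illustration of the index-then-align workflow described above, the following Python sketch builds a toy FM-index (a Burrows-Wheeler-based index of the kind that BWA and Bowtie are built on) from a single reference string and uses backward search to locate exact occurrences of a read. It is not the implementation used by any of those tools: the function names `build_fm_index` and `find_exact` are hypothetical, the suffix array is built naively, and only exact matches are handled, with no mismatches, insertions, or deletions.

```python
def build_fm_index(reference):
    """Build a toy FM-index: suffix array, C table, and Occ table."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array construction; fine for a toy example only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around to "$").
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c[ch] = number of characters in text strictly smaller than ch.
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c, occ


def find_exact(read, index):
    """Return sorted 0-based start positions of exact matches of read."""
    sa, c, occ = index
    lo, hi = 0, len(sa)  # current suffix array interval [lo, hi)
    # Backward search: extend the match one character at a time, right to left.
    for ch in reversed(read):
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(find_exact("ATTA", index))  # [1, 8]
```

The index is built once from the reference and then reused for every read, which is what makes this approach practical when there are many reads to align against one genome.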

One way to address such limitations is to move beyond indexing only a small number of reference genomes, and build a pangenomics index from the genomes of an entire population. Conceptually this is simple, but two main questions arise: (1) how to represent a population of reference genomes so that all relevant variations and combinations of variations are identified, and (2) how to build an index of this representation in a manner that allows for efficient sequence alignment. Both of these questions are non-trivial due to the sheer number and size of the reference genomes in current population studies. Our goal is to address these questions by developing a scalable index of a pangenomics graph that allows for efficient alignment of reads.

Investigative team

Dr. Christina Boucher
Dr. Travis Gagie
Dr. Ben Langmead

Project News


NEW publications

NEW talks

  • June 2021: Dr. Gagie presented “MONI, PHONI and SPUMONI” at the Chilean Center for Biotechnology and Bioengineering. Available online here.
  • July 2021: Dr. Gagie presented “Working in BWT-runs bounded space” at CPM 2021. Highlight talk. Available online here.
  • Ongoing: Dr. Langmead presents a series of online lectures on Wheeler graphs and indexing.

News

 

Research presentations:

  • November 2020: Dr. Boucher presented “Building an index for a large number of genomes” at Cold Spring Harbor’s Biological Data Science Meeting. Presented remotely. Abstract available here.
  • November 2020: Dr. Gagie presented “Multi-genome references” at CEGEB. Presented remotely. Available online here.
  • November 2020: Dr. Boucher gave a recruitment lecture at Claflin University.
  • November 2020: Dr. Langmead presented “Fighting reference bias with pangenomes” at the CeBiB Workshop, Chile. Presented remotely. Available online here.
  • January 2021: Dr. Langmead presented “Pan-genome approaches for defeating reference bias” at the UCSD Genetics, Bioinformatics & Systems Biology Symposium, San Diego, CA. Presented remotely.
  • February 2021: Dr. Langmead presented “Advances in pan-genomics for addressing reference bias” at the Stanford Biostatistics Workshop, Stanford, CA. Presented remotely.
  • March 2021: Dr. Gagie presented “Pangenomics with the r-index (and friends)” for the EU RISE project PANGAIA. Available online here.


 

The investigative team greatly appreciates the funding received for this project from the National Science Foundation IIBR (Grant No. 2029552).
