Showing episodes and shows of

Roman Cheplyaka

Shows

the bioinformatics chat
Prioritizing drug target genes with Marie Sadler
In this episode, Marie Sadler talks about her recent Cell Genomics paper, Multi-layered genetic approaches to identify approved drug targets. Previous studies have found that the drugs that target a gene linked to the disease are more likely to be approved. Yet there are many ways to define what it means for a gene to be linked to the disease. Perhaps the most straightforward approach is to rely on the genome-wide association studies (GWAS) data, but that data can also be integrated with quantitative trait loci (eQTL or pQTL) information to establish less obvious links between genetic...
2023-12-21, 52 min

the bioinformatics chat
Suffix arrays in optimal compressed space and δ-SA with Tomasz Kociumaka and Dominik Kempa
Today on the podcast we have Tomasz Kociumaka and Dominik Kempa, the authors of the preprint Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space. The suffix array is one of the foundational data structures in bioinformatics, serving as an index that allows fast substring searches in a large text. However, in its raw form, the suffix array occupies the space proportional to (and several times larger than) the original text. In their paper, Tomasz and Dominik construct a new index, δ-SA, which on the one hand can be used in t...
2023-09-29, 56 min

the bioinformatics chat
Phylogenetic inference from raw reads and Read2Tree with David Dylus
In this episode, David Dylus talks about Read2Tree, a tool that builds alignment matrices and phylogenetic trees from raw sequencing reads. By leveraging the database of orthologous genes called OMA, Read2Tree bypasses traditional, time-consuming steps such as genome assembly, annotation and all-versus-all sequence comparisons.
Links: Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree (David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck, Christophe Dessimoz); Background story; Read2Tree on GitHub; OMA browser; The Guardian’s podcast about Victoria Amelina and Volodymyr Vakulenko. If you enjoyed this episode, please consider su...
2023-08-28, 49 min

the bioinformatics chat
AlphaFold and variant effect prediction with Amelie Stein
This is the third and final episode in the AlphaFold series, originally recorded on February 23, 2022, with Amelie Stein, now an associate professor at the University of Copenhagen. In the episode, Amelie explains what 𝛥𝛥G is, how it informs us whether a particular protein mutation affects its stability, and how AlphaFold 2 helps in this analysis. A note from Amelie: Something that has happened in the meantime is the publication of methods that predict 𝛥𝛥G with ML methods, so much faster than Rosetta. One of them, RaSP, is from our group, while ddMut is from another subs...
2023-07-29, 35 min

the bioinformatics chat
AlphaFold and shape-mers with Janani Durairaj
This is the second episode in the AlphaFold series, originally recorded on February 14, 2022, with Janani Durairaj, a postdoctoral researcher at the University of Basel. Janani talks about how she used shape-mers and topic modelling to discover classes of proteins assembled by AlphaFold 2 that were absent from the Protein Data Bank (PDB). The bioinformatics discussion starts at 03:35. Links: A structural biology community assessment of AlphaFold2 applications (Mehmet Akdel, Douglas E. V. Pires, Eduard Porta Pardo, Jürgen Jänes, Arthur O. Zalevsky, Bálint Mészáros, Patrick Bryant, Lydia L. Good, Roman A.
La...
2023-07-10, 20 min

the bioinformatics chat
AlphaFold and protein interactions with Pedro Beltrao
In this episode, originally recorded on February 9, 2022, Roman talks to Pedro Beltrao about AlphaFold, the software developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. Pedro is an associate professor at ETH Zurich and the coordinator of the structural biology community assessment of AlphaFold2 applications project, which involved over 30 scientists from different institutions. Pedro talks about the origins of the project, its main findings, the importance of the confidence metric that AlphaFold assigns to its predictions, and Pedro’s own area of interest: predicting pockets in proteins and protein-protein intera...
2023-06-21, 52 min

the bioinformatics chat
Enformer: predicting gene expression from sequence with Žiga Avsec
In this episode, Jacob Schreiber interviews Žiga Avsec about a recently released model, Enformer. Their discussion begins with life differences between academia and industry, specifically about how research is conducted in the two settings. Then, they discuss the Enformer model, how it builds on previous work, and the potential that models like it have for genomics research in the future. Finally, they have a high-level discussion on the state of modern deep learning libraries and which ones they use in their day-to-day development. Links: Effective gene expression prediction from sequence by integrating long-range interactions (Žiga Avsec, Vi...
2021-11-09, 59 min

the bioinformatics chat
Bioinformatics Contest 2021 with Maksym Kovalchuk and James Matthew Holt
The Bioinformatics Contest is back this year, and we are back to discuss it! This year’s contest winners Maksym Kovalchuk (1st prize) and Matt Holt (2nd prize) talk about how they approach participating in the contest and what strategies have earned them the top scores.
Timestamps and links for the individual problems: 00:10:36 Genotype Imputation; 00:21:26 Causative Mutation; 00:30:27 Superspreaders; 00:37:22 Minor Haplotype; 00:46:37 Isoform Matching. Links: Matt’s solutions; Max’s solutions. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2021-09-27, 1h 00

the bioinformatics chat
Steady states of metabolic networks and Dingo with Apostolos Chalkis
In this episode, Apostolos Chalkis presents sampling steady states of metabolic networks as an alternative to the widely used flux balance analysis (FBA). We also discuss dingo, a Python package written by Apostolos that employs geometric random walks to sample steady states. You can see dingo in action here. Links: Dingo on GitHub; Searching for COVID-19 treatments using metabolic networks; Tweag open source fellowships. This episode was originally published on the Compositional podcast. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2021-07-28, 38 min

the bioinformatics chat
3D genome organization and GRiNCH with Da-Inn Erika Lee
In this episode, Jacob Schreiber interviews Da-Inn Erika Lee about data and computational methods for making sense of 3D genome structure. They begin their discussion by talking about 3D genome structure at a high level and the challenges in working with such data. Then, they discuss a method recently developed by Erika, named GRiNCH, that mines this data to identify spans of the genome that cluster together in 3D space and potentially help control gene regulation. Links: GRiNCH: simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix...
2021-06-23, 1h 09

the bioinformatics chat
Differential gene expression and DESeq2 with Michael Love
In this episode, Michael Love joins us to talk about the differential gene expression analysis from bulk RNA-Seq data.
We talk about the history of Mike’s own differential expression package, DESeq2, as well as other packages in this space, like edgeR and limma, and the theory they are based upon. Mike also shares his experience of being the author and maintainer of a popular bioinformatics package. Links: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 (Love, M.I., Huber, W. & Anders, S.); DESeq2 on Bioconductor; Chan Zuckerberg Initiative: Ensuring Re...
2021-05-12, 1h 31

the bioinformatics chat
Proteomics calibration with Lindsay Pino
In this episode, Lindsay Pino discusses the challenges of making quantitative measurements in the field of proteomics. Specifically, she discusses the difficulties of comparing measurements across different samples, potentially acquired in different labs, as well as a method she has developed recently for calibrating these measurements without the need for expensive reagents. The discussion then turns more broadly to questions in genomics that can potentially be addressed using proteomic measurements. Links: Talus Bioscience; Matrix-Matched Calibration Curves for Assessing Analytical Figures of Merit in Quantitative Proteomics (Lindsay K. Pino, Brian C. Searle, Han-Yin Yang, Andrew N. Hoofnagle...
2021-04-21, 48 min

the bioinformatics chat
B cell maturation and class switching with Hamish King
In this episode, we learn about B cell maturation and class switching from Hamish King. Hamish recently published a paper on this subject in Science Immunology, where he and his coauthors analyzed gene expression and antibody repertoire data from human tonsils.
In the episode Hamish talks about some of the interesting B cell states he uncovered and shares his thoughts on questions such as «When does a B cell decide to class-switch?» and «Why is the antibody isotype correlated with its affinity?» Links: Single-cell analysis of human B cell maturation predicts how antibody class switching shapes sele...
2021-03-31, 1h 29

the bioinformatics chat
Enhancers with Molly Gasperini
In this episode, Jacob Schreiber interviews Molly Gasperini about enhancer elements. They begin their discussion by talking about Octant Bio, and then dive into the surprisingly difficult task of defining enhancers and determining the mechanisms that enable them to regulate gene expression. Links: Octant Bio; Towards a comprehensive catalogue of validated and target-linked human enhancers (Molly Gasperini, Jacob M. Tome, and Jay Shendure). If you enjoyed this episode, please consider supporting the podcast on Patreon.
2021-03-10, 46 min

the bioinformatics chat
Polygenic risk scores in admixed populations with Bárbara Bitarello
Polygenic risk scores (PRS) rely on the genome-wide association studies (GWAS) to predict the phenotype based on the genotype. However, the prediction accuracy suffers when GWAS from one population are used to calculate PRS within a different population, which is a problem because the majority of the GWAS are done on cohorts of European ancestry. In this episode, Bárbara Bitarello helps us understand how PRS work and why they don’t transfer well across populations. Links: Polygenic Scores for Height in Admixed Populations (Bárbara D. Bitarello, Iain Mathieson); What is ancestry? (Iain Math...
2021-02-17, 1h 30

the bioinformatics chat
Phylogenetics and the likelihood gradient with Xiang Ji
In this episode, we chat about phylogenetics with Xiang Ji.
We start with a general introduction to the field and then go deeper into the likelihood-based methods (maximum likelihood and Bayesian inference). In particular, we talk about the different ways to calculate the likelihood gradient, including a linear-time exact gradient algorithm recently published by Xiang and his colleagues. Links: Gradients Do Grow on Trees: A Linear-Time O(N)-Dimensional Gradient for Statistical Phylogenetics (Xiang Ji, Zhenyu Zhang, Andrew Holbrook, Akihiko Nishimura, Guy Baele, Andrew Rambaut, Philippe Lemey, Marc A Suchard); BEAGLE: the package that implements the...
2021-01-13, 57 min

the bioinformatics chat
Seeding methods for read alignment with Markus Schmidt
In this episode, Markus Schmidt explains how seeding in read alignment works. We define and compare k-mers, minimizers, MEMs, SMEMs, and maximal spanning seeds. Markus also presents his recent work on computing variable-sized seeds (MEMs, SMEMs, and maximal spanning seeds) from fixed-sized seeds (k-mers and minimizers) and his Modular Aligner. Links: A performant bridge between fixed-size and variable-size seeding (Arne Kutzner, Pok-Son Kim, Markus Schmidt); MA the Modular Aligner; Calibrating Seed-Based Heuristics to Map Short Reads With Sesame (Guillaume J. Filion, Ruggero Cortini, Eduard Zorita), another interesting recent work on seeding methods (though we didn’t get...
2020-12-16, 1h 00

the bioinformatics chat
Real-time quantitative proteomics with Devin Schweppe
In this episode, Jacob Schreiber interviews Devin Schweppe about the analysis of mass spectrometry data in the field of proteomics. They begin by delving into the different types of mass spectrometry methods, including MS1, MS2, and MS3, and the reasons for using each.
They then discuss a recent paper from Devin, Full-Featured, Real-Time Database Searching Platform Enables Fast and Accurate Multiplexed Quantitative Proteomics, that involved building a real-time system for quantifying proteomic samples from MS3, and the types of analyses that this system allows one to do. Links: Full-Featured, Real-Time Database Searching Platform Enables Fast and...
2020-11-18, 1h 03

the bioinformatics chat
How 23andMe finds identical-by-descent segments with William Freyman
In this episode, Will Freyman talks about identity-by-descent (IBD): how it’s used at 23andMe, and how the templated positional Burrows-Wheeler transform can find IBD segments in the presence of genotyping and phasing errors. Links: Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform (William A. Freyman, Kimberly F. McManus, Suyash S. Shringarpure, Ethan M. Jewett, Katarzyna Bryc, the 23andMe Research Team, Adam Auton); 23andMe research. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2020-10-27, 42 min

the bioinformatics chat
Basset and Basenji with David Kelley
In this episode, Jacob Schreiber interviews David Kelley about machine learning models that can yield insight into the consequences of mutations on the genome. They begin their discussion by talking about Calico Labs, and then delve into a series of papers that David has written about using models, named Basset and Basenji, that connect genome sequence to functional activity and so can be used to quantify the effect of any mutation. Links: Calico Labs; Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks (David R. Kelley, Jasper Snoek, and John Rinn) ...
2020-10-07, 1h 13

the bioinformatics chat
ENCODE3 with Jill Moore
In this episode, Jacob Schreiber interviews Jill Moore about recent research from the ENCODE Project.
They begin their discussion with an overview and goals of the ENCODE Project, and then discuss a bundle of papers that were recently published in various Nature journals and the flagship paper, Expanded encyclopaedias of DNA elements in the human and mouse genomes. They conclude their discussion by talking about the challenges with managing a large project as a trainee in a consortium setting. Links: Expanded encyclopaedias of DNA elements in the human and mouse genomes (The ENCODE Project Consortium, Jill...
2020-09-10, 56 min

the bioinformatics chat
Most Permissive Boolean Networks with Loïc Paulevé
In systems biology, Boolean networks are a way to model interactions such as gene regulation or cell signaling. The standard interpretations of Boolean networks are the synchronous, asynchronous, and fully asynchronous semantics. In this episode, Loïc Paulevé explains how the same Boolean networks can be interpreted in a new, “most permissive” way. Loïc proved mathematically that his semantics can reproduce all behaviors achievable by a compatible quantitative model, whereas the traditional interpretations in general cannot. Furthermore, it turns out that deciding whether a certain state in a Boolean network is reachable can be done much more ef...
2020-08-19, 1h 04

the bioinformatics chat
Machine learning for drug development with Marinka Zitnik
In this episode, Jacob Schreiber interviews Marinka Zitnik about applications of machine learning to drug development. They begin their discussion with an overview of open research questions in the field, including limiting the search space of high-throughput testing methods, designing drugs entirely from scratch, predicting ways that existing drugs can be repurposed, and identifying likely side-effects of combining existing drugs in novel ways.
Focusing on the last of these areas, they then discuss one of Marinka’s recent papers, Modeling polypharmacy side effects with graph convolutional networks. Links: Modeling polypharmacy side effects with graph convolutional networks (Ma...
2020-07-29, 1h 25

the bioinformatics chat
Reproducible pipelines and NGLess with Luis Pedro Coelho
NGLess is a programming language specifically targeted at next generation sequencing (NGS) data processing. In this episode we chat with its main developer, Luis Pedro Coelho, about the benefits of domain-specific languages, pros and cons of Haskell in bioinformatics, reproducibility, and of course NGLess itself. Links: NGLess on GitHub; NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language (Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork). If you enjoyed this episode, please consider supporting the podcast on Patreon.
2020-06-24, 57 min

the bioinformatics chat
HiFi reads and HiCanu with Sergey Nurk and Sergey Koren
In this episode, I continue to talk (but mostly listen) to Sergey Koren and Sergey Nurk. If you missed the previous episode, you should probably start there. Otherwise, join us to learn about HiFi reads, the tradeoff between read length and quality, and what tricks HiCanu employs to resolve highly similar repeats. Links: HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads (Sergey Nurk, Brian P. Walenz, Arang Rhie, Mitchell R. Vollger, Glennis A. Logsdon, Robert Grothe, Karen H. Miga, Evan E. Eichler, Adam M. Phillippy, Sergey Koren); Canu on GitHub...
2020-05-27, 1h 09

the bioinformatics chat
Genome assembly and Canu with Sergey Koren and Sergey Nurk
In this episode, Sergey Nurk and Sergey Koren from the NIH share their thoughts on genome assembly.
The two Sergeys tell the stories behind their amazing careers as well as behind some of the best known genome assemblers: Celera assembler, Canu, and SPAdes. Links: Canu on GitHub; SPAdes on GitHub. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2020-05-20, 1h 16

the bioinformatics chat
DNA tagging and Porcupine with Kathryn Doroschak
Porcupine is a molecular tagging system: a way to tag physical objects with pieces of DNA called molecular bits, or molbits for short. These DNA tags then can be rapidly sequenced on an Oxford Nanopore MinION device without any need for library preparation. In this episode, Katie Doroschak explains how Porcupine works: how molbits are designed and prepared, and how they are directly recognized by the software without an intermediate basecalling step. Links: Porcupine: Rapid and robust tagging of physical objects using nanopore-orthogonal DNA strands (Kathryn Doroschak, Karen Zhang, Melissa Queen, Aishwarya Mandyam, Karin Stra...
2020-04-29, 45 min

the bioinformatics chat
Generalized PCA for single-cell data with William Townes
Will Townes proposes a new, simpler way to analyze scRNA-seq data with unique molecular identifiers (UMIs). Observing that such data is not zero-inflated, Will has designed a PCA-like procedure inspired by generalized linear models (GLMs) that, unlike the standard PCA, takes into account statistical properties of the data and avoids spurious correlations (such as one or more of the top principal components being correlated with the number of non-zero gene counts). Also check out Will’s paper for a feature selection algorithm based on deviance, which we didn’t get a chance to discuss on the podcast.
...
2020-03-27, 59 min

the bioinformatics chat
Spectrum-preserving string sets and simplitigs with Amatur Rahman and Karel Břinda
In this episode, we hear from Amatur Rahman and Karel Břinda, who independently of one another released preprints on the same concept, called simplitigs or spectrum-preserving string sets. Simplitigs offer a way to efficiently store and query large sets of k-mers or, equivalently, large de Bruijn graphs. Links: Simplitigs as an efficient and scalable representation of de Bruijn graphs (Karel Břinda, Michael Baym, Gregory Kucherov); Representation of k-mer sets using spectrum-preserving string sets (Amatur Rahman, Paul Medvedev); Open mic. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2020-02-28, 53 min

the bioinformatics chat
Epidemic models with Kris Parag
Kris Parag is here to teach us about the mathematical modeling of infectious disease epidemics. We discuss the SIR model, the renewal models, and how insights from information theory can help us predict where an epidemic is going. Links: Optimising Renewal Models for Real-Time Epidemic Prediction and Estimation (KV Parag, CA Donnelly); Adaptive Estimation for Epidemic Renewal and Phylogenetic Skyline Models (KV Parag, CA Donnelly); The listener survey. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2020-01-27, 1h 08

the bioinformatics chat
Plasmid classification and binning with Sergio Arredondo-Alonso and Anita Schürch
Does a given bacterial gene live on a plasmid or the chromosome? What other genes live on the same plasmid? In this episode, we hear from Sergio Arredondo-Alonso and Anita Schürch, whose projects mlplasmids and gplas answer these types of questions. Links: mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species (Sergio Arredondo-Alonso, Malbert R. C. Rogers, Johanna C. Braat, Tess D.
Verschuuren, Janetta Top, Jukka Corander, Rob J. L. Willems, Anita C. Schürch); gplas: a comprehensive tool for plasmid analysis using short-read graphs (Sergio Arredondo-Alonso, Martin Bootsma, Ya...
2019-12-30, 45 min

the bioinformatics chat
Amplicon sequence variants and bias with Benjamin Callahan
In this episode, Benjamin Callahan talks about some of the issues faced by microbiologists when conducting amplicon sequencing and metagenomic studies. The two main themes are: (1) Why one should probably avoid using OTUs (operational taxonomic units) and use exact sequence variants (also called amplicon sequence variants, or ASVs), and how DADA2 manages to deduce the exact sequences present in the sample. (2) Why abundances inferred from community sequencing data are biased, and how we can model and correct this bias. Links: Exact sequence variants should replace operational taxonomic units in marker-gene data analysis (Benjamin J Callahan, Paul...
2019-11-29, 1h 01

the bioinformatics chat
Issues in legacy genomes with Luke Anderson-Trocmé
In this episode, Luke Anderson-Trocmé talks about his findings from the 1000 Genomes Project. Namely, the early sequenced genomes sometimes contain specific mutational signatures that haven’t been replicated from other sources and can be found via their association with lower base quality scores. Listen to Luke telling the story of how he stumbled upon and investigated these fake variants and what their impact is. Links: Legacy Data Confounds Genomics Studies (bioRxiv, Molecular Biology and Evolution (paywall)) (Luke Anderson-Trocmé, Rick Farouni, Mathieu Bourgey, Yoichiro Kamatani, Koichiro Higasa, Jeong-Sun Seo, Changhoon Kim, Fumihiko Matsuda and Simon Gravel). If y...
2019-10-22, 1h 01

the bioinformatics chat
Causality and potential outcomes with Irineo Cabreros
In this episode, I talk with Irineo Cabreros about causality.
We discuss why causality matters, what does and does not imply causality, and two different mathematical formalizations of causality: potential outcomes and directed acyclic graphs (DAGs). Causal models are usually considered external to and separate from statistical models, whereas Irineo’s new paper shows how causality can be viewed as a relationship between particularly chosen random variables (potential outcomes). Links: Causal models on probability spaces (Irineo Cabreros, John D. Storey); The Book of Why: The New Science of Cause and Effect (Judea Pearl, Dana Mackenzie). If...
2019-09-27, 40 min

the bioinformatics chat
scVI with Romain Lopez and Gabriel Misrachi
In this episode, we hear from Romain Lopez and Gabriel Misrachi about scVI (Single-cell Variational Inference). scVI is a probabilistic model for single-cell gene expression data that combines a hierarchical Bayesian model with deep neural networks encoding the conditional distributions. scVI scales to over one million cells and can be used for scRNA-seq normalization and batch effect removal, dimensionality reduction, visualization, and differential expression. We also discuss the automatic hyperparameter selection via Bayesian optimization that was recently implemented in scVI. Links: Deep generative modeling for single-cell transcriptomics (Romain Lopez, Jeffrey Regier, Michael Cole, Michael I. Jordan, Nir Yosef); sc...
2019-08-30, 1h 20

the bioinformatics chat
The role of the DNA shape in transcription factor binding with Hassan Samee
Even though the double-stranded DNA has the famous regular helical shape, there are small variations in the geometry of the helix depending on what exact nucleotides it’s made of at that position. In this episode of the bioinformatics chat, Hassan Samee talks about the role the DNA shape plays in recognition of the DNA by DNA-binding proteins, such as transcription factors.
Hassan also explains how his algorithm, ShapeMF, can deduce the DNA shape motifs from the ChIP-seq data. Links: A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape...
2019-07-26, 1h 01

the bioinformatics chat
Power laws and T-cell receptors with Kristina Grigaityte
An αβ T-cell receptor is composed of two highly variable protein chains, the α chain and the β chain. However, based only on bulk DNA or RNA sequencing it is impossible to determine which of the α chain and β chain sequences were paired in the same receptor. In this episode, Kristina Grigaityte talks about her analysis of 200,000 paired αβ sequences, which have been obtained by targeted single-cell RNA sequencing. Kristina used the power law distribution to model the T-cell clone sizes, which led her to reject the commonly held assumptions about the independence of the α and β chains. We also talk about Bayesian inference...
2019-06-29, 1h 26

the bioinformatics chat
Genome assembly from long reads and Flye with Mikhail Kolmogorov
Modern genome assembly projects are often based on long reads in an attempt to bridge longer repeats. However, due to the higher error rate of the current long read sequencers, assemblers based on de Bruijn graphs do not work well in this setting, and the approaches that do work are slower. In this episode, Mikhail Kolmogorov from Pavel Pevzner’s lab joins us to talk about some of the ideas developed in the lab that made it possible to build a de Bruijn-like assembly graph from noisy reads. These ideas are now implemented in the Flye assembler, wh...
2019-05-31, 1h 12

the bioinformatics chat
Deep tensor factorization and a pitfall for machine learning methods with Jacob Schreiber
In this episode, we hear from Jacob Schreiber about his algorithm, Avocado.
Avocado uses deep tensor factorization to break a three-dimensional tensor of epigenomic data into three orthogonal dimensions corresponding to cell types, assay types, and genomic loci. Avocado can extract a low-dimensional, information-rich latent representation from the wealth of experimental data from projects like the Roadmap Epigenomics Consortium and ENCODE. This representation allows you to impute genome-wide epigenomics experiments that have not yet been performed. Jacob also talks about a pitfall he discovered when trying to predict gene expression from a mix of genomic...
2019-04-29, 1h 15

the bioinformatics chat
Bioinformatics Contest 2019 with Alexey Sergushichev and Gennady Korotkevich
The third Bioinformatics Contest took place in February 2019. Alexey Sergushichev, one of the organizers of the contest, and Gennady Korotkevich, the 1st prize winner, join me to discuss this year’s problems. Timestamps and links for the individual problems: Qualification round: 00:07:14 Bee Population; 00:14:12 Sequencing Errors; 00:30:20 Transposable Elements. Final round: 00:41:35 Cancer and Chromosome Rearrangements; 00:56:01 Epigenomic Marks; 01:10:02 Bacterial Communities; 01:27:06 Minimal Genome; 01:34:56 Endangered Species. Links: The contest problems on Stepik; The final scoreboard; Episode #18: Bioinformatics Contest 2018. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2019-03-24, 1h 46

the bioinformatics chat
Bayesian inference of chromatin structure from Hi-C data with Simeon Carstens
Hi-C is a sequencing-based assay that provides information about the 3-dimensional organization of the genome. In this episode, Simeon Carstens explains how he applied the Inferential Structure Determination (ISD) framework to build a 3D model of chromatin and fit that model to Hi-C data using Hamiltonian Monte Carlo and Gibbs sampling.
Links: Bayesian inference of chromatin structure ensembles from population Hi-C data (Simeon Carstens, Michael Nilges, Michael Habeck); Inferential Structure Determination of Chromosomes from Single-Cell Hi-C Data (Simeon Carstens, Michael Nilges, Michael Habeck). If you enjoyed this episode, please consider supporting the podcast on Patreon.
2019-02-27, 1h 05

the bioinformatics chat
Haplotype-aware genotyping from long reads with Trevor Pesout
Long read sequencing technologies, such as Oxford Nanopore and PacBio, produce reads from thousands to a million base pairs in length, at the cost of an increased error rate. Trevor Pesout describes how he and his colleagues leverage long reads for simultaneous variant calling/genotyping and phasing. This is possible thanks to a clever use of a hidden Markov model, and two different algorithms based on this model are now implemented in the MarginPhase and WhatsHap tools. Links: Preprint: Haplotype-aware genotyping from noisy long reads (Jana Ebler, Marina Haukness, Trevor Pesout, Tobias Marschall, Benedict Paten) ...
2019-01-27, 1h 12

the bioinformatics chat
Space-efficient variable-order Markov models with Fabio Cunial
This time you’ll hear from Fabio Cunial on the topic of Markov models and space-efficient data structures. First we recall what a Markov model is and why variable-order Markov models are an improvement over the standard, fixed-order models. Next we discuss the various data structures and indexes that allowed Fabio and his collaborators to represent these models in a very small space while still keeping the queries efficient. Burrows-Wheeler transform, suffix trees and arrays, tries and suffix link trees, and more!
Links: The preprint: A framework for space-efficient variable-order Markov models; The book: Genome-Scale Algorithm De...
2018-12-28, 1h 09

the bioinformatics chat
Classification of CRISPR-induced mutations and CRISPRpic with HoJoon Lee and Seung Woo Cho
In this episode, HoJoon Lee and Seung Woo Cho explain how to perform a CRISPR experiment and how to analyze its results. HoJoon and Seung Woo developed an algorithm that analyzes sequenced amplicons containing the CRISPR-induced double-strand break site and figures out what exactly happened there (e.g. a deletion, insertion, substitution, etc.). Links: CRISPRpic: Fast and precise analysis for CRISPR-induced mutations via prefixed index counting; CRISPRpic on GitHub. If you enjoyed this episode, please consider supporting the podcast on Patreon.
2018-11-29, 56 min

the bioinformatics chat
Feature selection, Relief and STIR with Trang Lê
Relief is a statistical method to perform feature selection. It could be used, for instance, to find genomic loci that correlate with a trait or genes whose expression correlates with a condition. Relief can also be made sensitive to interaction effects (known in genetics as epistasis). In this episode, Trang Lê joins me to talk about Relief and her version of Relief called STIR (STatistical Inference Relief). While traditional Relief algorithms could only rank features and needed a user-supplied threshold to decide which features to select, Trang’s reformulation of Relief allowed her to compute p-values and mak...
2018-10-27, 1h 08

the bioinformatics chat
Transposons and repeats with Kaushik Panda and Keith Slotkin
Kaushik Panda and Keith Slotkin come on the podcast to educate us about repetitive DNA and transposable elements. We talk LINEs, SINEs, LTRs, and even Sleeping Beauty transposons! Kaushik and Keith explain why repeats matter for your whole-genome analysis and answer listeners’ questions.
Links: Keith’s paper: The case for not masking away repetitive DNA Questions for this episode on Reddit If you enjoyed this episode, please consider supporting the podcast on Patreon.2018-09-241h 40the bioinformatics chatthe bioinformatics chatRead correction and Bcool with Antoine LimassetAntoine Limasset joins me to talk about NGS read correction. Antoine and his colleagues built the read correction tool Bcool based on the de Bruijn graph, and it corrects reads far better than any of the current methods like Bloocoo, Musket, and Lighter. We discuss why and when read correction is needed, how Bcool works, and why it is more accurate but slower than k-mer spectrum methods. Links: Preprint: Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs Bcool on GitHub If you enjoyed this episode, please consider supporting the...2018-08-3159 minthe bioinformatics chatthe bioinformatics chatRNA design, EteRNA and NEMO with Fernando PortelaIn this episode, I talk to Fernando Portela, a software engineer and amateur scientist who works on RNA design — the problem of composing an RNA sequence that has a specific secondary structure. We talk about how Fernando and others compete and collaborate in designing RNA molecules in the online game EteRNA and about Fernando’s new RNA design algorithm, NEMO, which outperforms all prior published methods by a wide margin. Links: The EteRNA game The preprint about NEMO NEMO project page Single-cell RNABIO & organoids meeting in Kiev If you enjoyed this episode, please cons...2018-07-271h 31the bioinformatics chatthe bioinformatics chatsmCounter2: somatic variant calling and UMIs with Chang XuIn this episode I’m joined by Chang Xu. Chang is a senior biostatistician at QIAGEN and an author of smCounter2, a low-frequency somatic variant caller.
To distinguish rare somatic mutations from sequencing errors, smCounter2 relies on unique molecular identifiers, or UMIs, which help identify multiple reads resulting from the same physical DNA fragment. Chang explains what UMIs are, why they are useful, and how smCounter2 and other tools in this space use UMIs to detect low-frequency variants. Links: smCounter2 preprint smCounter2 github repository smCounter publication Review of somatic SNV callers If you en...2018-06-291h 04the bioinformatics chatthe bioinformatics chatLinear mixed models, GWAS, and lme4qtl with Andrey ZiyatdinovLinear mixed models are used to analyze GWAS data and detect QTLs. Andrey Ziyatdinov recently released an R package, lme4qtl, that can be used to formulate and fit these models. In this episode, Andrey and I discuss linear mixed models, genome-wide association studies, and strengths and weaknesses of lme4qtl. Links: Paper: lme4qtl: linear mixed models with flexible covariance structure for genetic studies of related individuals lme4qtl on GitHub If you enjoyed this episode, please consider supporting the podcast on Patreon.2018-05-3150 minthe bioinformatics chatthe bioinformatics chatB cell receptor substitution profile prediction and SPURF with Kristian Davidsen and Amrit DharIn this episode Kristian Davidsen and Amrit Dhar present their project called SPURF. SPURF can predict the B cell receptor (BCR) substitution profile of a given clonal family based on a single representative sequence from that family. SPURF works by fitting a tensor regression model to publicly available Rep-seq data. 
Links: Preprint: Predicting B Cell Receptor Substitution Profiles Using Public Repertoire Data Blog post about SPURF by Erick Matsen SPURF on GitHub If you enjoyed this episode, please consider supporting the podcast on Patreon.2018-04-302h 01the bioinformatics chatthe bioinformatics chatGenome fingerprints with Gustavo GlusmanIn this episode, Gustavo Glusman explains his method of reducing a VCF file to a small “fingerprint”, which could then be used to detect duplicate genomes, infer relatedness, map the population structure, and more. Links: The genome fingerprints paper The genotype fingerprints preprint The data fingerprints preprint The blog post about time series visualization If you enjoyed this episode, please consider supporting the podcast on Patreon.2018-04-071h 28the bioinformatics chatthe bioinformatics chatBioinformatics Contest 2018 with Alexey Sergushichev and Ekaterina VyahhiThe final round of Bioinformatics Contest 2018 was held on February 24-25th, and the qualification round took place two weeks earlier. I invited the organizers of the contest, Alexey Sergushichev and Ekaterina Vyahhi, to discuss the problems and find out what it was like to organize the contest. Timestamps for the problems: Qualification round 0:41:38 Problem 1. Synthesis of ATP 0:48:46 Problem 2. Restriction Sites 1:06:42 Problem 3. Tandem Repeats Final round 1:14:00 Problem 1. Recombination of Plasmids 1:25:30 Problem 2. Species Recovering 1:28:39 Problem 3. Haplotype Phasing 1:36:34 Problem 4. Cluster the Reads 1:40:25 Problem 5. Cattle Breeding Links: The contest problems on Stepik The final scoreboard ...2018-03-031h 53the bioinformatics chatthe bioinformatics chatRarefaction, alpha diversity, and statistics with Amy WillisIn this episode, Amy Willis joins me to talk about good and bad ways to estimate taxonomic richness in microbial ecology studies.
Links: Rarefaction, alpha diversity, and statistics Estimating Diversity via Frequency Ratios Estimating the Number of Species in Microbial Diversity Studies Summer Institutes 2018 at the University of Washington STAMPS: Strategies and Techniques for Analyzing Microbial Population Structures bio2040, a new podcast by Flavio Rump If you enjoyed this episode, please consider supporting the podcast on Patreon.2018-01-221h 14the bioinformatics chatthe bioinformatics chatJavier Quilez on what makes large sequencing projects successfulJavier Quilez and I discuss what it’s like to be a bioinformatician, how to improve communication between the wet and dry labs and make the research more reproducible. Make sure to read Javier’s paper we are discussing; it’s a light and entertaining read. The last author on this paper is Guillaume Filion, whom you may remember from the episode on generating functions. Links: Parallel sequencing lives, or what makes large sequencing projects successful If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-12-241h 03the bioinformatics chatthe bioinformatics chatOptimal transport for single-cell expression data with Geoffrey SchiebingerGeoffrey Schiebinger explains how reconstructing developmental trajectories from single-cell RNA-seq data can be reduced to the mathematical problem called optimal transport. Links: Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. Talk by Geoffrey and Lénaïc Chizat If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-11-261h 08the bioinformatics chatthe bioinformatics chatGenerating functions for read mapping with Guillaume FilionGuillaume Filion recently published a preprint in which he applies generating functions, a concept from analytic combinatorics, to estimating the optimal seed length for read mapping. 
In this episode, Guillaume and I attempt to explain the core concepts from analytic combinatorics and why they are useful in modeling sequences. Links: Guillaume’s preprint: Analytic combinatorics for bioinformatics I: seeding methods Once upon a BLAST Guillaume’s blog, «The Grand Locus» Dan Gusfield’s home page featuring the fast Fourier transform lectures I mention in the podcast After we recorded the podcast, Guillaume wrote to me to...2017-11-131h 10the bioinformatics chatthe bioinformatics chatBracken with Jennifer LuJennifer Lu joins me to discuss species abundance estimation from metagenomic sequencing data. Links: The Bracken paper The Kraken paper The preprint that applies Kallisto to metagenomics If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-10-2146 minthe bioinformatics chatthe bioinformatics chatModelling the immune system and C-ImmSim with Filippo CastiglioneIn this episode, Filippo Castiglione and I discuss different ways to model the immune system. Links: Celada and Seiden’s 1992 paper, “A computer model of cellular interactions in the immune system” Filippo’s 2001 paper, “Design and implementation of an immune system simulator” PLOS One paper, “Computational Immunology Meets Bioinformatics: The Use of Prediction Tools for Molecular Binding in the Simulation of the Immune System” A paper about modelling sea bass vaccination using C-ImmSim C-ImmSim homepage The online simulator based on C-ImmSim and the paper describing it Special thanks to Martina Stoycheva for bringing this work to my attenti...2017-10-081h 06the bioinformatics chatthe bioinformatics chatCollective cell migration with Linus SchumacherIn this episode, Linus Schumacher joins me to discuss mathematical models of collective cell migration and multidisciplinary research.
Links: Semblance of Heterogeneity in Collective Cell Migration and associated code Multidisciplinary approaches to understanding collective cell migration in developmental biology Linus’s homepage If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-09-181h 00the bioinformatics chatthe bioinformatics chatSpatially variable genes and SpatialDE with Valentine SvenssonValentine Svensson explains how he analyzes spatially-annotated single cell gene expression data using Gaussian processes. Links: Valentine’s preprint, “SpatialDE - Identification Of Spatially Variable Genes” SpatialDE code on GitHub Valentine’s personal page The Integrative Biology & Medicine conference If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-09-0357 minthe bioinformatics chatthe bioinformatics chatMichael Tessler and Christopher Mason on 16S amplicon vs shotgun sequencingMichael Tessler and Christopher Mason join me to talk about their comparison of 16S amplicon sequencing and shotgun sequencing for quantifying microbial diversity. Links: The 2017 Nature paper that we discuss: Large-scale differences in microbial biodiversity discovery between 16S amplicon and shotgun sequencing Michael et al.’s 2016 paper that describes their original 16S study: A Global eDNA Comparison of Freshwater Bacterioplankton Assemblages Focusing on Large-River Floodplain Lakes of Brazil The sequencing data for these studies is available from NCBI: PRJNA310230 (16S), PRJNA389803 (shotgun) Michael’s website Christopher’s lab website The Integr...2017-08-1845 minthe bioinformatics chatthe bioinformatics chatPerfect k-mer hashing in SailfishThe original version of Sailfish, an RNA-Seq quantification tool, used minimal perfect hash functions to replace k-mers with unique integers. (The current version appears to be using a Cuckoo hashmap instead.) This is my attempt to explain how a minimal perfect hash function could be built.
The algorithm described here is not exactly the same as the one Sailfish used, but it follows the same idea. Sections: Sailfish and perfect hashing (1:15) Perfect hashing based on binary search or hash tables (4:34) Random hash functions (7:34) Perfect hash function based on an acyclic graph (12:16) Links: ...2017-08-0522 minthe bioinformatics chatthe bioinformatics chatMetagenomics and KrakenWhat is metagenomics and how is it different from phylotyping? What is Kraken and how can it be faster than BLAST? Let’s try to sort this out. Sections: Culturing and its limitation (00:18) Metagenomics vs phylotyping (1:43) BLAST (5:53) The idea behind Kraken (8:14) How Kraken organizes its database (18:08) Links: The paper about Kraken A freely accessible (though a bit dated) book on metagenomics Correction: in this episode, I incorrectly state that Kraken operates on phylogenetic trees, whereas in fact it operates on taxonomic trees. In practice this means that wh...2017-07-0928 minthe bioinformatics chatthe bioinformatics chatAllele-specific expressionI talk about allele-specific expression: why it arises and how it can be reliably detected. Sections: The biology of allele-specific expression (2:17) Detecting allele-specific expression with RNA-seq (7:46) Mapping and sequencing biases (16:39) The experiment in yeast (19:47) Statistical models (21:44) Links: A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data Supplemental information Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data The talk by John Marioni The blog post explaining the RSEM model If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-06-2533 minthe bioinformatics chatthe bioinformatics chatRelative data analysis and propr with Thom QuinnIn this episode, Thom Quinn and I explore different ways to transform and analyze relative data arising in genomics.
We also discuss propr, Thom’s R package to compute various proportionality measures. Links: Thom’s preprint about propr — take a look if you feel lost in all the quantities that we discuss :) David Lovell’s original paper introducing proportionality for relative data and a very detailed appendix Ionas Erb’s paper introducing the ρ metric Vignettes for propr If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-06-1055 minthe bioinformatics chatthe bioinformatics chatChIP-seq and GenoGAM with Georg Stricker and Julien GagneurIn this episode, I meet with Georg Stricker and Julien Gagneur from the Technical University of Munich to discuss ChIP-seq data analysis and their tool, GenoGAM. Links: Preprint about GenoGAM Georg on GitHub Julien’s lab on Twitter [BC]2 — The Basel Computational Biology Conference (Basel, September 2017), where you can meet Georg The European Human Genetics Conference (Copenhagen, May 2017), where you can meet Daniel Bader from Julien’s lab Register for the Summer School in Bioinformatics & NGS Data Analysis (#NGSchool2017) (Poland, September 2017) If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-05-2955 minthe bioinformatics chatthe bioinformatics chatmiRNA target site prediction and seedVicious with Antonio MarcoIn this episode Antonio Marco talks about miRNA target site prediction and his tool, seedVicious. Links: seedVicious preprint seedVicious manual If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-05-1256 minthe bioinformatics chatthe bioinformatics chatSingle-cell RNA sequencing with Aleksandra KolodziejczykIn this episode Aleksandra Kolodziejczyk talks about single-cell RNA sequencing.
Links: A review paper by Aleksandra Comparative analysis of single-cell RNA sequencing methods, including the cost table Power Analysis of Single Cell RNA‐Sequencing Experiments Monocle: a toolkit for analyzing single-cell gene expression experiments Questions from listeners on bioinformatics.chat and reddit If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-04-291h 08the bioinformatics chatthe bioinformatics chatTranscriptome assembly and Scallop with Mingfu ShaoIn this episode, Mingfu Shao talks about Scallop, an accurate reference-based transcript assembler. Links: The preprint about Scallop The preprint about flow decomposition The video of a talk by Ben Langmead about Rail-RNA The preprint about Rail-RNA If you enjoyed this episode, please consider supporting the podcast on Patreon.2017-04-1643 min