Hypothesis-free NGS data analysis
Computational pipelines for NGS data analysis rely on multiple hypotheses and simplifications that lead to a substantial loss of information. For instance, a major limiting factor is the mapping step, in which NGS reads are aligned to a reference genome or transcriptome. In RNA-seq analysis, relying on a reference transcriptome amounts to ignoring novel genes, alternative transcripts, and transcripts that originate from repeats or carry high levels of mutation or editing. Hundreds of dedicated software tools have been developed to bypass these limitations and retrieve specific event types, with highly divergent results.
Our lab has developed a method for RNA-seq data analysis, DE-kupl (1), in which NGS data are analyzed at the level of raw sequence using k-mers (i.e. subsequences of length k, typically with k=31), followed by differential expression analysis. Only k-mers that are differentially represented between two sets of libraries are extracted and analyzed.
Therefore, all biological variation present in the original NGS dataset is theoretically captured, with no prior hypothesis about its origin.
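The principle described above can be illustrated with a minimal sketch. This is not the DE-kupl implementation: the function names (`count_kmers`, `differential_kmers`), the pseudocount, and the fold-change threshold are assumptions made for illustration, and a naive log2 fold change on mean counts stands in for a proper statistical test.

```python
from collections import Counter
from math import log2


def count_kmers(reads, k):
    """Count every length-k substring across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(group_a, group_b, k, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {k-mer: log2 fold change} for k-mers that differ between groups.

    group_a, group_b: lists of libraries, each library being a list of reads.
    Hypothetical simplification: mean counts and a fixed threshold replace
    normalization and a real differential expression test.
    """
    counts_a = [count_kmers(lib, k) for lib in group_a]
    counts_b = [count_kmers(lib, k) for lib in group_b]
    # Every k-mer seen in at least one library of either group.
    all_kmers = set().union(*counts_a, *counts_b)
    selected = {}
    for kmer in all_kmers:
        mean_a = sum(c[kmer] for c in counts_a) / len(counts_a)
        mean_b = sum(c[kmer] for c in counts_b) / len(counts_b)
        lfc = log2((mean_b + pseudocount) / (mean_a + pseudocount))
        if abs(lfc) >= min_abs_log2fc:
            selected[kmer] = lfc
    return selected


if __name__ == "__main__":
    # Toy data: the second group carries a single substitution (A -> T).
    control = [["ACGTACGTAC"], ["ACGTACGTAC"]]
    treated = [["ACGTTCGTAC"], ["ACGTTCGTAC"]]
    # k=5 keeps the toy readable; real analyses typically use k=31.
    for kmer, lfc in sorted(differential_kmers(control, treated, k=5).items()):
        print(kmer, round(lfc, 2))
```

In this toy case, the k-mers overlapping the substituted position are reported without any reference sequence, while k-mers shared at similar levels by both groups are discarded.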
We will show how DE-kupl can be applied to various experimental settings and present our plans for future developments, including its application to the discovery of novel biomarkers from clinically annotated DNA-seq or RNA-seq data.
(1) Audoux J, Philippe N, Chikhi R, Salson M, Gallopin M, Gabriel M, Le Coz J, Commes T, Gautheret D. (2017) DE-kupl: Exhaustive capture of biological variation in RNA-seq data through k-mer decomposition. Genome Biol. 18: 243.