Thomas Moerman, Dries Decap, Toni Verbeiren, Jan Fostier, Joke Reumers, Jan Aerts (VDA-lab, ESAT/STADIUS, KU Leuven, Belgium – iMinds; INTEC, UGent, Belgium – iMinds; Janssen R&D, Beerse, Belgium; ExaScience Life Lab, Leuven, Belgium)

Current genomics data formats and processing pipelines are not designed to scale well to large datasets. They were conceived with single-node processing in mind and were not intended for use in an interactive environment. This software design legacy has resulted in a fragmented tool space for genomic analysis. The tendency is to invent file formats that represent intermediate computation results, which are then parsed again by other tools. The processing steps are stitched together with scripts, leading to the typical bioinformatics pipelines we observe today.

This work presents an instance of a software architecture for visual exploration of potentially large genomics data sets that allows ad-hoc queries and visualisation of the data. Under the hood, Spark, a scalable computation engine, performs data transformations and aggregations in parallel across different compute nodes and persists intermediate results in memory. Our work is centered around variant calling, the process of determining positional genomic variation with respect to a reference. We developed an application that supports interactive pairwise comparison of annotated VCF files, focused on SNPs.
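
To make the idea of a pairwise SNP comparison between two VCF files concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the application described above: it uses no Spark, runs on a single node, and ignores annotations. The names `Snp`, `parse_snps` and `compare_snps` are hypothetical, and the toy VCF records are made up for the example. A SNP is assumed here to be a record whose REF and ALT alleles are each a single base.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Snp(NamedTuple):
    """Hypothetical key identifying one SNP: position plus allele change."""
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_snps(vcf_lines: Iterable[str]) -> Set[Snp]:
    """Collect SNPs (single-base REF and ALT) from tab-separated VCF lines."""
    snps: Set[Snp] = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a complete variant record
        chrom, pos, ref, alts = fields[0], fields[1], fields[3], fields[4]
        ref = ref.upper()
        # A record may list several comma-separated ALT alleles.
        for alt in alts.upper().split(","):
            if len(ref) == 1 and len(alt) == 1 and ref in "ACGT" and alt in "ACGT":
                snps.add(Snp(chrom, int(pos), ref, alt))
    return snps


def compare_snps(a: Set[Snp], b: Set[Snp]) -> Dict[str, Set[Snp]]:
    """Split the SNPs of two samples into shared and sample-specific sets."""
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up toy data standing in for two VCF files.
VCF_A = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t50\tPASS\t.",
    "2\t300\t.\tAT\tA\t50\tPASS\t.",  # deletion, not a SNP
    "2\t400\t.\tG\tA\t50\tPASS\t.",
]
VCF_B = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tA,T\t50\tPASS\t.",  # two ALT alleles, both SNPs
    "3\t50\t.\tG\tC\t50\tPASS\t.",
]

result = compare_snps(parse_snps(VCF_A), parse_snps(VCF_B))
for name, snps in result.items():
    print(name, len(snps), sorted(snps))
```

The set operations in `compare_snps` correspond to the kind of aggregation that a distributed engine could evaluate in parallel over partitions of the variant records. How the actual application partitions and caches the data is not specified here.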