University of Pittsburgh

DataDigger: A Quick, Flexible Search Engine for Biomarker Data

Obtaining relevant raw data as the first step in multifactor dynamic computational analyses is becoming increasingly difficult as the pool of known protein biomarkers, single-nucleotide polymorphisms, and other molecular analytes grows. Data can also be spread over multiple files of various types, and even across multiple locations on a network, which further complicates the process. Data search and aggregation become tedious, inefficient, and vulnerable to error. Structured query language (SQL) could expedite the process, but using it would require importing and collating the data into a single database, as well as adequate knowledge of SQL. Thus, there is a need for a comprehensive, flexible, and easy-to-understand tool to wrangle an ever-growing and dispersed body of diverse data.

Description

Pitt researchers used open-source Python libraries to develop a data aggregation and search application called DataDigger. Users can supply multiple files and search conditions without first joining the data into a single file, and DataDigger returns the matching subset. DataDigger can recognize, read, and query output files generated by Illumina® Genome Studio, and data can be queried by the genotypes of given single-nucleotide polymorphisms (SNPs). Researchers who query genomics data or need to “search-by-SNP” (e.g., obtain all relevant clinical and biomarker data for the subset of patients with a specific SNP) can benefit from this simplified search and aggregation application, especially if the diverse data types are housed in the cloud.
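To make the “search-by-SNP” idea concrete, the following is a minimal sketch of a multi-file, multi-condition search that never merges the inputs into one table. It is not DataDigger's code: the function names, the `patient_id` key column, the SNP identifiers, and all data values are hypothetical, and only the Python standard library is used.

```python
import csv
import io
import operator

# Hypothetical stand-ins for three separate input files (all values invented).
GENOTYPES = "patient_id,rs0000001,rs0000002\nP1,AA,CT\nP2,AG,CC\nP3,AA,TT\n"
CLINICAL = "patient_id,age,outcome\nP1,54,survived\nP2,61,survived\nP3,47,deceased\n"
BIOMARKERS = "patient_id,IL6,IL10\nP1,12.5,3.1\nP2,40.2,8.7\nP3,95.0,1.4\n"

OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}


def read_table(text):
    """Parse CSV text and index its rows by patient_id."""
    return {row["patient_id"]: row for row in csv.DictReader(io.StringIO(text))}


def coerce(value):
    """Treat a cell as a number when possible, otherwise keep it as text."""
    try:
        return float(value)
    except ValueError:
        return value


def matching_ids(table, column, op, target):
    """Return the patient IDs in one table whose column satisfies the condition."""
    compare = OPS[op]
    return {pid for pid, row in table.items() if compare(coerce(row[column]), target)}


def search(tables, conditions):
    """Apply each (table, column, op, target) condition to its own table,
    intersect the resulting ID sets, then collect every column for the hits."""
    ids = set().union(*(table.keys() for table in tables.values()))
    for name, column, op, target in conditions:
        ids &= matching_ids(tables[name], column, op, target)
    results = []
    for pid in sorted(ids):
        record = {"patient_id": pid}
        for name, table in tables.items():
            for column, value in table.get(pid, {}).items():
                if column != "patient_id":
                    record[f"{name}.{column}"] = value
        results.append(record)
    return results


tables = {"geno": read_table(GENOTYPES),
          "clin": read_table(CLINICAL),
          "bio": read_table(BIOMARKERS)}

# Patients with genotype AA at the (hypothetical) SNP and IL6 above 20.
hits = search(tables, [("geno", "rs0000001", "==", "AA"),
                       ("bio", "IL6", ">", 20.0)])
```

Each condition is evaluated against the file that holds the relevant column, and only the resulting sets of patient IDs are combined, so the source files stay separate until the final subset is assembled.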

Applications

• Improving the data-gathering process
• Defining subsets of patients based on their genotype, biomarker, and clinical features
• Formatting data for downstream statistical or machine learning analyses
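As an illustration of the last application, the sketch below turns a small set of search results into a numeric table that a statistics or machine learning package could read. It is an assumption about what such formatting might look like, not a description of DataDigger's output: the record layout, column names, SNP identifier, and values are hypothetical, and genotypes are encoded with the common additive scheme (number of copies of a chosen allele).

```python
import csv
import io


def allele_count(genotype, allele):
    """Additive coding: copies of `allele` in a two-letter genotype such as 'AG'."""
    return genotype.count(allele)


def to_feature_rows(records, snp_alleles, numeric_columns):
    """Convert text records into rows of numeric features, one row per patient."""
    rows = []
    for record in records:
        row = {"patient_id": record["patient_id"]}
        for snp, allele in snp_alleles.items():
            row[f"{snp}_{allele}"] = allele_count(record[snp], allele)
        for column in numeric_columns:
            row[column] = float(record[column])
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize feature rows as CSV text; returns an empty string for no rows."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical search results (values invented for illustration).
records = [
    {"patient_id": "P1", "rs0000001": "AA", "IL6": "12.5"},
    {"patient_id": "P3", "rs0000001": "AG", "IL6": "95.0"},
]

rows = to_feature_rows(records, {"rs0000001": "A"}, ["IL6"])
matrix_csv = to_csv(rows)
```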

Advantages

• Aggregate data from multiple files locally or in the cloud
• Read output files from Illumina® Genome Studio
• Filter data using multiple conditions
• Query data by SNP genotype

Invention Readiness

In-house beta-testing

IP Status

https://patents.google.com/patent/US20220067105A1