For those of us working in pharma drug or biomarker discovery, artificial intelligence (AI) plays a vital role in how we collect biological and pharmacological data. It's not only used at each step of the drug design pipeline; it also helps deliver safer and more effective drug candidates in preclinical studies while dramatically reducing development costs (1,2).
Yet there's a significant and potentially dangerous drawback to using AI-derived data: the question of their accuracy.
The unfortunate side effect of AI
Imagine you're a bioinformatician supporting discovery research projects in pharma. You work with biologists on experiments to prioritize leads for further drug development. You do a full analysis of existing data to help define which drug targets have the highest likelihood for therapeutic success. You use an AI-derived knowledge base to pull available 'omics data from a range of dataset repositories, and match that data with your company's internal data.
You analyze the data with your biologist colleagues to generate hypotheses, and define experiments to validate those hypotheses. After six months of costly but failed experiments, you realize something was off in the initial analysis and that your hypotheses were entirely misguided. After backtracking, you discover the AI-derived data were inconsistent in the annotated disease state, resulting in a complete misinterpretation of the data.
Now you've spent half a year, thousands of dollars and countless hours of research chasing a dead lead. And your team has nothing to show for it.
AI-derived data: Does it make sense?
In the past few months, you've probably read countless news stories about ChatGPT. It's a powerful tool that uses AI to generate detailed answers to virtually any question you throw at it. Yet a recognized drawback is that these answers are often factually inaccurate. Try asking it to write your bio, or the bio of your best friend. It will generate plenty of false information that may nevertheless appear plausibly factual to people who don't know you or your friend.
ChatGPT is just one example of how AI can be an impressive tool, but one that should be handled with extreme caution. After all, how can you trust insights or hypotheses derived from information that may or may not be accurate? Or only partially accurate? Or, worse still, completely inaccurate?
The answer is to couple AI with human-certified, manual curation.
We all recognize the incredible power and potential of AI to collect and bring together seemingly relevant data. Yet 'omics and biological relationships data are complex and nuanced, and they require context that AI-derived data alone can't provide.
As Figure 1 demonstrates, without the human 'magic touch' of aligning data, correcting errors and removing irrelevant information, AI-derived data alone leave you with a jumble of information that may or may not be accurate, and that could send you down a rabbit hole in pursuit of your next biomarker or target discovery.
Figure 1. Decision tree for using AI-derived data.
We're confident that by using our manually curated, human-certified 'omics data, you'll quickly gain reliable insights to generate and confirm your hypotheses. We offer you direct access to the most extensive collections of integrated and standardized 'omics and biological relationships data, manually curated by a team of MS- and PhD-level experts. In short, we find errors and correct them to ensure the data you work with are reliable and accurate.
This means that when you use our manually curated 'omics and biological relationships data, you'll avoid the stressful and frustrating consequences of being led astray by inaccurate data riddled with inconsistencies and errors.
Don't let bad data compromise your projects. And don't waste time fixing and cleaning the data yourself. Get direct access to 'golden' data that deliver true and immediate insights. Ready-to-use, manually curated data that are cleaned of errors and inconsistencies.
“Truth, like gold, is to be obtained not by its growth, but by washing away from it all that is not gold.”
Leo Tolstoy
We wash away the 'dirt' so you can mine and collect clean and golden data.
References:
Are you a researcher or data scientist working in drug discovery? If so, you depend on data to help you achieve unique insights by revealing patterns across experiments. Yet, not all data are created equal. The quality of data you use to inform your research is essential. For example, if you acquire data using natural language processing (NLP) or text mining, you may have a broad pool of data, but at the high cost of a relatively large number of errors (1).
As a drug development researcher, you're also familiar with freely available datasets from public 'omics data repositories. You rely on them to help you gain insights for your preclinical programs. These open-source datasets, aggregated in portals such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO), contain data from thousands of samples used to validate or redirect the discovery of gene signatures, biomarkers and therapies. In theory, access to so much experimental data should be an asset. But because the data are unintegrated and inconsistent, they are not directly usable. In practice, you end up spending hours sifting through these portals for the information needed to clean up the data before you can use them, which is costly, time-consuming and utterly inefficient.
Data you can use right away
Imagine how transformative it would be if you had direct access to 'usable data' that you could immediately understand and work with, without searching for additional information or having to clean and structure them. Data that are comprehensive yet accurate, reliable and analysis-ready. Data you can begin converting into knowledge right away to drive your biomedical discoveries.
Creating usable data
Data curation has become an essential requirement in producing usable data. Data scientists spend an estimated 80% of their time collecting, cleaning and processing data, leaving less than 20% of their time for analyzing the data to generate insights (2,3). But data curation is not just time-consuming. It’s costly and challenging to scale as well, particularly if legacy datasets must be revised to match updated curation standards.
What if there were a team of experts to take on the manual curation of the data you need so researchers like you could focus on making discoveries?
Our experts have been curating biomedical and clinical data for over 25 years. We’ve made massive investments in a biomedical and clinical knowledge base that contains millions of manually reviewed findings from the literature, plus information from commonly used third-party databases and ‘omics dataset repositories. Our human-certified data enables you to generate insights rather than collect and clean data. With our knowledge and databases, scientists like you can generate high-quality, novel hypotheses quickly and efficiently while using innovative and advanced approaches, including artificial intelligence.
Figure 1. Our workflow for processing 'omics data.
4 advantages of manually curated data
Our 200 dedicated curation experts follow seven best practices for manual curation. Why do we apply so much manual effort to data curation? Based on those principles and practices, here are the top four reasons manually curated data are fundamental to your research success:
1. Metadata fields are unified, not redundant
Author-submitted metadata vary widely. Manual curation of field names enforces alignment to a set of well-defined standards. Our curators identify hundreds of columns containing frequently used information across studies and combine these data into unified columns to enhance cross-study analyses. This unification is evident in our TCGA metadata dictionary, for example, where we unified into a single field the five different fields used to indicate TCGA samples with a cancer diagnosis in a first-degree family member.
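To make the idea concrete, here is a minimal sketch of this kind of field unification, assuming hypothetical column names and a small pandas DataFrame standing in for study metadata (it is an illustration, not our production curation pipeline):

```python
import pandas as pd

# Hypothetical example: several author-submitted columns that all record a
# first-degree relative's cancer history, but under different field names.
samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "family_history_cancer":        ["YES", None, None, None],
    "first_degree_relative_cancer": [None, "no",  None, None],
    "relative_family_cancer_hx":    [None, None,  "Yes", None],
})

# Candidate source columns, in order of preference.
source_columns = [
    "family_history_cancer",
    "first_degree_relative_cancer",
    "relative_family_cancer_hx",
]

def unify_family_history(row):
    """Collapse the redundant fields into one standardized yes/no/unknown value."""
    for col in source_columns:
        value = row[col]
        if pd.notna(value):
            return str(value).strip().lower()  # normalize case and whitespace
    return "unknown"

# A single unified column that supports cross-study analyses.
samples["family_history_first_degree"] = samples.apply(unify_family_history, axis=1)
print(samples[["sample_id", "family_history_first_degree"]])
```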
2. Data labels are clear and consistent
Unfortunately, it’s common that published datasets provide vague abbreviations as labels for patient groups, tissue type, drugs or other main elements. If you want to develop successful hypotheses from these data, it’s critical you understand the intended meaning and relationship among labels. Our curators take the time to investigate each study and precisely and accurately apply labels so that you can group and compare the data in the study with other relevant studies.
3. Additional contextual information and analysis
Properly labeled data enables scientifically meaningful comparisons between sample groups to reveal biomarkers. Our scientists are committed to expert manual curation and scientific review, which includes generating statistical models to reveal differential expression patterns. In addition to calculating differential expression between sample groups defined by the authors, our scientists perform custom statistical comparisons to support additional insights from the data.
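As a simplified illustration of comparing curator-defined sample groups, the sketch below runs a per-gene two-sample t-test on a small, randomly generated expression matrix. Real differential expression pipelines use more sophisticated statistical models, and the gene names and values here are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical log2-scale expression matrix: rows are genes, columns are samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
tumor  = rng.normal(loc=[8.0, 5.0, 6.0], scale=0.5, size=(10, 3)).T  # 3 genes x 10 tumor samples
normal = rng.normal(loc=[6.0, 5.0, 6.0], scale=0.5, size=(10, 3)).T  # 3 genes x 10 normal samples

# Per-gene two-sample t-test between the two curator-defined sample groups.
for i, gene in enumerate(genes):
    t_stat, p_value = stats.ttest_ind(tumor[i], normal[i])
    log2_fc = np.mean(tumor[i]) - np.mean(normal[i])  # difference of log2 means
    print(f"{gene}: log2FC={log2_fc:+.2f}, p={p_value:.3g}")
```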
4. Author errors are detected
No matter how consistent data labels are, NLP processes cannot identify misassigned sample groups, and such errors are devastating to data analysis. Unfortunately, it's not unheard of for data to be rendered uninterpretable by conflicts between the sample labels presented in a publication and those in its corresponding entry in a public 'omics data repository. As shown in Figure 2, for a given Patient ID, both 'Age' and 'Genetic Subtype' are mismatched between the study's GEO entry and the publication table. Which sample labels are correct? Our curators identify these issues and work with authors to correct errors before including the data in our databases.
Figure 2. In this submission to NCBI GEO, the ages of the various patients conflict between the GEO submission and the associated publication. What’s more, the genetic subtype labels are mixed up. Without resolving these errors, the data cannot be used. This attention to detail is required, and can only be achieved with manual curation.
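The following sketch illustrates the kind of cross-check involved, using two hypothetical metadata tables (one standing in for a GEO entry, one for the publication table) and flagging any patient whose 'Age' or 'Genetic Subtype' disagrees between them. The patient IDs and subtype labels are invented for illustration:

```python
import pandas as pd

# Hypothetical metadata as submitted to GEO...
geo = pd.DataFrame({
    "patient_id": ["P01", "P02", "P03"],
    "age": [54, 61, 47],
    "genetic_subtype": ["MYCN-amplified", "11q-deleted", "MYCN-amplified"],
})

# ...and the corresponding table transcribed from the publication.
paper = pd.DataFrame({
    "patient_id": ["P01", "P02", "P03"],
    "age": [54, 47, 61],
    "genetic_subtype": ["MYCN-amplified", "MYCN-amplified", "11q-deleted"],
})

# Align both tables on patient ID and flag any field that disagrees.
merged = geo.merge(paper, on="patient_id", suffixes=("_geo", "_paper"))
for field in ["age", "genetic_subtype"]:
    mismatched = merged[merged[f"{field}_geo"] != merged[f"{field}_paper"]]
    for _, row in mismatched.iterrows():
        geo_value, paper_value = row[f"{field}_geo"], row[f"{field}_paper"]
        print(f"{row['patient_id']}: {field} is '{geo_value}' in GEO "
              f"but '{paper_value}' in the publication")
```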
At the core of our curation process, curators apply scientific expertise, controlled vocabularies and standardized formatting to all applicable metadata. The result is that you can quickly and easily find all applicable samples across data sources using simplified search criteria.
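As a toy illustration of how controlled vocabularies simplify search, the sketch below maps free-text tissue labels to a single canonical term before filtering samples. The synonym table and sample records are hypothetical, not our actual vocabulary:

```python
# Hypothetical synonym map: free-text tissue labels from different submissions
# all resolve to one controlled-vocabulary term.
TISSUE_VOCAB = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "liver biopsy": "liver",
    "pbmc": "peripheral blood mononuclear cell",
    "peripheral blood mononuclear cells": "peripheral blood mononuclear cell",
}

def standardize_tissue(raw_label: str) -> str:
    """Map a raw author-submitted label to a controlled term (or flag it for review)."""
    return TISSUE_VOCAB.get(raw_label.strip().lower(), "UNRESOLVED: " + raw_label)

samples = [
    {"sample_id": "S1", "tissue": "Hepatic tissue"},
    {"sample_id": "S2", "tissue": "PBMC"},
    {"sample_id": "S3", "tissue": "Liver biopsy "},
]

# With standardized terms, one simple filter finds every liver sample.
liver_samples = [
    s["sample_id"] for s in samples if standardize_tissue(s["tissue"]) == "liver"
]
print(liver_samples)  # ['S1', 'S3']
```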
Dig deeper into the value of QIAGEN Digital Insights’ manual curation process
Ready to incorporate into your research the reliable biomedical, clinical and ‘omics data we’ve developed using manual curation best practices? Explore our QIAGEN knowledge and databases, and request a consultation to find out how our manually curated data will save you time and enable you to develop quicker, more reliable hypotheses. Learn more about the costs of free data in our industry report and download our unique and comprehensive metadata dictionary of clinical covariates to experience first-hand just how valuable manual curation really is.
References:
Have you ever done a Google search to find a restaurant or look up what your favorite actor is up to? Most of us have, and so we've already benefited from knowledge graphs, possibly without even knowing it. When you do a search on a platform like Google, the information box displayed in the results is made possible by a knowledge graph (1).
Because of their power and versatility, knowledge graphs are rapidly being adopted by the pharmaceutical industry to accelerate data science driven drug discovery. They facilitate integration across multiple data types and sources, such as molecular, clinical trial and drug label data. This enables powerful algorithms to work on various types of data at once, for applications ranging from prioritizing novel disease targets to predicting previously unknown drug-disease associations.
What is a knowledge graph?
A knowledge graph combines entities of various types in one network. These entities are connected by multiple types of relationships. Both entities and relationships can also carry additional attributes. Entities and attributes may also be part of an ontology (2, 3).
Figure 1. A simple example of a knowledge graph.
In the biomedical domain, entities represented in a knowledge graph can be, for example, molecules, biological functions and diseases or phenotypes. Relationships include molecular interactions, gene-functional associations, and drug-target interactions among others. Both entities and relationships are supported by underlying scientific evidence. Simple graphs are undirected, while more powerful graphs include causal relationships to allow causal inference.
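For readers who like to see structure as code, here is a minimal sketch of a biomedical knowledge graph built with the networkx library, with typed nodes for entities and attributed, directed edges for relationships. The entities, relationships and attribute values are illustrative rather than drawn from any particular database:

```python
import networkx as nx

# A minimal biomedical knowledge graph: typed entities as nodes,
# typed, directed relationships (with attributes) as edges.
kg = nx.MultiDiGraph()

# Entities carry a type and can map to external identifiers.
kg.add_node("EGFR", entity_type="gene", external_id="HGNC:3236")
kg.add_node("cetuximab", entity_type="drug")
kg.add_node("cell proliferation", entity_type="biological_function")
kg.add_node("non-small cell lung carcinoma", entity_type="disease")

# Relationships carry a type plus attributes such as effect and evidence source.
kg.add_edge("cetuximab", "EGFR", rel_type="targets", effect="inhibition",
            source="drug label")
kg.add_edge("EGFR", "cell proliferation", rel_type="regulates", effect="activation",
            source="literature finding")
kg.add_edge("EGFR", "non-small cell lung carcinoma", rel_type="associated_with",
            source="literature finding")

# The graph can now be traversed, e.g., to list everything downstream of a drug.
for _, target, data in kg.out_edges("cetuximab", data=True):
    print(f"cetuximab --{data['rel_type']}/{data.get('effect')}--> {target}")
```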
Knowledge graph analytics
In drug discovery, knowledge graphs are used for target prioritization and drug repurposing. These tasks frequently involve link prediction approaches that allow the prediction and scoring of relationships between entities that were not explicitly present in the graph before. Artificial intelligence (AI)-inspired methods that have been used for this purpose include tensor factorization (4) and various deep-learning algorithms (see (5) for an example).
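To show the flavor of link prediction without implying any particular production method, here is a toy TransE-style scoring sketch in which entity and relation embeddings (randomly generated here, but learned from the graph in practice) are used to rank candidate gene-disease links that are not yet in the graph. All names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy embeddings: in practice these would be learned from the knowledge graph,
# e.g., by tensor factorization or a deep-learning model.
entities  = ["geneA", "geneB", "geneC", "diseaseX"]
relations = ["associated_with"]
dim = 16
entity_emb   = {e: rng.normal(size=dim) for e in entities}
relation_emb = {r: rng.normal(size=dim) for r in relations}

def transe_score(head, relation, tail):
    """TransE-style plausibility score: higher (less negative) means more plausible."""
    return -np.linalg.norm(entity_emb[head] + relation_emb[relation] - entity_emb[tail])

# Rank candidate genes for a link to diseaseX that is not yet in the graph.
candidates = ["geneA", "geneB", "geneC"]
ranked = sorted(candidates,
                key=lambda g: transe_score(g, "associated_with", "diseaseX"),
                reverse=True)
print(ranked)  # genes ordered from most to least plausible association
```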
The QIAGEN biomedical knowledge graph
QIAGEN Biomedical Knowledge Base is ideally suited to build a large-scale biomedical knowledge graph. It is founded on a vast collection of diverse relationships between biomedical entities of various types. The relationships were manually curated from peer-reviewed biomedical literature and integrated from third-party databases with the highest accuracy.
In a knowledge graph constructed from QIAGEN Biomedical Knowledge Base, the main entities connected by relationships are molecules, drugs, targets, diseases, variants, biological functions, pathways, locations and more. The relationships have multiple attributes, including relationship type, direction, effect, context and source. Causality of the relationships is represented through direction. Causal relationships frequently carry information about the direction of effect (activation and inhibition) that can be leveraged in powerful analytics. Relationships are annotated with the full experimental context (e.g., tissues or organism). Entities also have attributes; for example, they are mapped to public identifiers and synonyms to support data integration.
Figure 2. Example of a sub-graph constructed from the QIAGEN biomedical knowledge graph. In this knowledge graph representation, gene and gene product entities are aggregated at the ortholog cluster level. Relationships between the same entities and with the same type, direction and effect are aggregated as well. Cetuximab is a metastatic colorectal cancer drug. EGFR is a target of cetuximab. Molecular interactions in the graph enable you to reconstruct a pathway between EGF, EGFR and the pathological process metastasis. EGFR is also a known member of the canonical pathway Colorectal Cancer Metastasis Signaling. In addition to metastatic colorectal cancer, genetic alterations of EGFR are involved in other diseases, for example non-small cell lung carcinoma. Activation of cell proliferation and inhibition of apoptosis by EGFR are known oncology mechanisms.
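Continuing the toy networkx sketch from earlier (rebuilt here so it runs on its own), the snippet below shows how relationship attributes such as effect can be used to pull out only causal, activating edges, the kind of filtering that supports directed, mechanism-level analyses. Again, the graph contents are illustrative:

```python
import networkx as nx

# Toy graph with attributed, directed relationships (contents are illustrative).
kg = nx.MultiDiGraph()
kg.add_edge("cetuximab", "EGFR", rel_type="targets", effect="inhibition")
kg.add_edge("EGFR", "cell proliferation", rel_type="regulates", effect="activation")
kg.add_edge("EGFR", "apoptosis", rel_type="regulates", effect="inhibition")

# Keep only causal edges whose effect is activation, e.g., to seed a
# directed mechanism network for downstream analysis.
activating = [
    (head, tail, data)
    for head, tail, data in kg.edges(data=True)
    if data.get("effect") == "activation"
]

for head, tail, data in activating:
    print(f"{head} --[{data['rel_type']}, activation]--> {tail}")
# Prints: EGFR --[regulates, activation]--> cell proliferation
```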
QIAGEN knowledge graph research for drug discovery
We actively use our QIAGEN biomedical knowledge graph in drug discovery projects in collaboration with industry partners, and develop new knowledge graph analysis approaches.
For example, we developed a machine learning approach for link prediction (6) that uses our knowledge graph to identify and prioritize genes and biological functions for a given disease. Using our biomedical knowledge graph and this machine-learning approach (7), we prioritized genes linked to known clinical manifestations of COVID-19 and built networks connecting those genes to SARS-CoV-2 viral proteins via protein-protein interactions. Based on these networks, we identified about 450 drugs potentially interfering with viral-host interactions, 54 of which were involved in clinical trials against COVID-19. We further used this approach and our QIAGEN biomedical knowledge graph to develop over 1500 machine-learning-generated disease networks, such as this one on pulmonary arterial hypertension.
Learn more about how QIAGEN Biomedical Knowledge Base enables biomedical knowledge graph construction and analysis to fuel your data- and analytics-driven drug discovery. Request a trial to discover how this powerful tool will transform your drug discovery research.
References
With the scientific research community publishing over two million peer-reviewed articles every year since 2012 (1) and next-generation sequencing fueling a data explosion, the need for comprehensive yet accurate, reliable and analysis-ready information on the path to biomedical discoveries is now more pressing than ever.
Manual curation has become an essential requirement in producing such data. Data scientists spend an estimated 80% of their time collecting, cleaning and processing data, leaving less than 20% of their time for analyzing the data to generate insights (2,3). But manual curation is not just time-consuming. It is costly and challenging to scale as well.
We at QIAGEN take on the task of manual curation so researchers like you can focus on making discoveries. Our human-certified data enables you to concentrate on generating insights rather than collecting data. QIAGEN has been curating biomedical and clinical data for over 25 years. We've made massive investments in a biomedical and clinical knowledge base that contains millions of manually reviewed findings from the literature, plus information from commonly used third-party databases and 'omics dataset repositories. With our knowledge and databases, scientists can generate high-quality, novel hypotheses quickly and efficiently, while using innovative and advanced approaches, including artificial intelligence.
Here are seven best practices for manual curation that QIAGEN's 200 dedicated curation experts follow, which we presented at the November 2021 Pistoia Alliance event.
These principles ensure that our knowledge base and integrated 'omics database deliver timely, highly accurate, reliable and analysis-ready data. In our experience, 40% of public ‘omics datasets include typos or other potentially critical errors in an essential element (cell lines, treatments, etc.); 5% require us to contact the authors to resolve inconsistent terms, mislabeled treatments or infections, inaccurate sample groups or errors mapping subjects to samples. Thanks to our stringent manual curation processes, we can correct such errors.
Our extensive investment in high-quality manual curation means that scientists like you don't need to spend 80% of their time aggregating and cleaning data. We've scaled our rigorous manual curation procedures to collect and structure accurate and reliable information from many different sources, from journal articles to drug labels to 'omics datasets. In short, we accelerate your journey to comprehensive yet accurate, reliable and analysis-ready data.
Ready to get your hands on reliable biomedical, clinical and 'omics data that we've manually curated using these best practices? Learn about QIAGEN knowledge and databases, and request a consultation to find out how our accurate and reliable data will save you time and get you quick answers to your questions.
References: