Discover how CLC continues to improve and develop new capabilities.

We’re excited to reveal many new improvements and enhancements to the latest release of QIAGEN CLC Genomics Workbench and its related plugins that significantly extend its value. Key improvements and new features in the new version (v24) include:

Watch “What’s new in CLC v 24” here.

Figure 1. RNA-Seq volcano plot shows the relationship between fold changes and p-values. The reworked volcano plot allows for 1) different color gradients for positive and negative fold change values, 2) annotations, 3) legends and 4) customizable transparency of data points. Genes of interest can also be highlighted by setting thresholds.
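To make the caption concrete, here is a minimal Python sketch of how volcano plot coordinates and threshold-based highlighting can be derived from differential expression results. It is an illustration only, not the Workbench's implementation; the function name, the input layout and the threshold values are assumptions, and p-values are assumed to be greater than zero.

```python
import math

def volcano_points(results, fc_threshold=1.0, p_threshold=0.05):
    """Turn (gene, log2 fold change, p-value) rows into volcano plot points.

    Thresholds are illustrative defaults, not values taken from the software.
    """
    points = []
    for gene, log2_fc, p_value in results:
        # y-axis: -log10(p), so smaller p-values plot higher.
        neg_log10_p = -math.log10(p_value)
        highlighted = p_value <= p_threshold and abs(log2_fc) >= fc_threshold
        if not highlighted:
            label = "not highlighted"
        elif log2_fc > 0:
            label = "up"      # positive fold change, one color gradient
        else:
            label = "down"    # negative fold change, the other gradient
        points.append((gene, log2_fc, neg_log10_p, label))
    return points

example = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.3, 0.0005),
    ("geneD", 3.0, 0.2),
]
for point in volcano_points(example):
    print(point)
```

A gene is highlighted only when it passes both thresholds, which is why a gene with a tiny p-value but a small fold change stays in the background.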

Figure 2. Visualize and interact with spatial transcriptomics data.

Learn more about the applications supported by our portfolio of QIAGEN CLC Genomics software and request a consultation with one of our experts to help you find the right QIAGEN CLC toolset for your research goals.

Join us for our webinar, where we'll focus on miRNA data analysis using QIAGEN CLC Genomics Workbench and Biomedical Genomics Analysis plugin. Together, we'll explore how you can: 
• Import reads and metadata
• Download miRBase database
• Quantify miRNA expression
• Perform differential expression analysis
• Visualize your results
• Create and use a custom database
 
Bring any questions you may have, and we will answer them during the webinar. 

Do you ever struggle to formulate hypotheses based on your experimental expression data? You may be comparing results from healthy versus tumor cell lines or treated versus untreated samples. What do the differences between expression patterns in the data mean? Many of us struggle to make biological sense of our RNA-seq or microarray data. The massive amount of expression data generated from experiments leaves us with thousands of data points but often no understanding of their biological meaning.

Advanced pathway analysis is an excellent way to gain a deeper understanding of expression data and experimental results. Here we offer three easy ways to go from expression data to pathway analysis so you can give your experimental data biological context to start gathering meaningful insights.

QIAGEN Ingenuity Pathway Analysis (IPA) is a popular tool for analyzing, comparing and contextualizing differential gene expression results from experiments in human, mouse or rat, among other organisms. QIAGEN CLC Genomics Workbench has convenient tools for processing raw data from RNA-seq or microarray experiments and performing differential gene expression analysis. In addition, with an IPA license and the Ingenuity Pathway Analysis plugin installed, you can upload results to IPA directly. By combining QIAGEN CLC Genomics Workbench with QIAGEN IPA, we offer a versatile platform for linking various instrument readout formats to biological insights.

These are the three most common use cases we see among our customers:

  1. Processing of raw FASTQ files
  2. Processing of expression matrix files from core facilities
  3. Processing of microarray data

Use case 1: FASTQ data to IPA

A typical experiment involves sending RNA (mRNA, miRNA, lncRNA, etc.) from treatment and control samples to a sequencing facility. After sequencing, you perform bioinformatics analysis using QIAGEN CLC Genomics Workbench on the FASTQ file(s) returned by the facility. First, you can do QC and trimming using the Prepare Raw Data workflow. Samples that meet QC criteria are then associated with metadata describing the experimental setup. You can identify differentially expressed genes (DEGs) using the RNA-Seq and Differential Gene Expression Analysis workflow. Differential gene expression analysis is based on fitting a generalized linear model with a negative binomial distribution, similar to the approaches taken by the popular tools edgeR (Robinson et al., 2010) and DESeq2 (Love et al., 2014). Paired designs are supported, and it is possible to control for batch effects. You can then upload DEGs directly to IPA for additional analysis, comparison and contextualization. Analyze Expression Data and Upload Comparisons to IPA provides a convenient Sample to Insight workflow.
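For readers unfamiliar with the negative binomial distribution mentioned above, the following Python sketch shows the mean and dispersion parameterization commonly used for RNA-seq counts, where the variance is mean + dispersion × mean². It only illustrates the distribution itself; it is not the model-fitting code of the Workbench, edgeR or DESeq2, and the function names are made up for this example.

```python
import math

def nb_log_pmf(count, mean, dispersion):
    """Log-probability of `count` under a negative binomial distribution
    parameterized by its mean and dispersion."""
    size = 1.0 / dispersion          # the "size" (r) parameter
    prob = size / (size + mean)      # success probability p
    return (
        math.lgamma(count + size) - math.lgamma(size) - math.lgamma(count + 1)
        + size * math.log(prob)
        + count * math.log(1.0 - prob)
    )

def nb_variance(mean, dispersion):
    """Variance implied by the mean/dispersion parameterization."""
    return mean + dispersion * mean ** 2

# With dispersion > 0 the variance exceeds the mean (overdispersion),
# which is why count data is not modeled with a plain Poisson distribution.
print(nb_variance(10.0, 0.2))
print(math.exp(nb_log_pmf(8, 10.0, 0.2)))
```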

See also our manual on RNA-seq and small RNA analysis.

Use case 2: RNA-seq expression data to IPA

In this use case, the sequencing facility processes the raw FASTQ files and returns an expression matrix file, which takes up much less space than FASTQ files do. You can import expression matrices using the Import Expression Matrix tool in QIAGEN CLC Genomics Workbench. Then you can apply QC criteria, associate metadata and compare the experimental groups as described for use case 1. This use case is also supported by a Sample to Insight workflow: Analyze Count Matrix and Upload Comparisons to IPA.

Use case 3: Microarray expression data to IPA

In this third scenario, the samples have been processed on microarrays, not using NGS. Various generic and vendor-specific formats are supported in QIAGEN CLC Genomics Workbench. Steps include setting up a microarray experiment to group the samples, followed by transforming and normalizing the expression data and running a statistical test to identify differential expression. Several tests are available, including proportion-based tests, t-tests and ANOVA. Filtered DEGs can be uploaded to IPA for pathway analysis as described for use case 1.
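As a rough illustration of the transform-then-test steps described above, this Python sketch log2-transforms intensities for one probe and computes Welch's t statistic between two groups. It is a simplified stand-in, assuming strictly positive intensities and at least two samples per group; it does not reproduce the normalization or test options in the Workbench, and the intensity values are invented.

```python
import math
from statistics import mean, variance

def log2_transform(values):
    """Log2-transform raw intensities (assumes all values are > 0)."""
    return [math.log2(v) for v in values]

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances allowed)."""
    standard_error = math.sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Invented intensities for a single probe in treated and control samples.
treated = log2_transform([820.0, 910.0, 760.0])
control = log2_transform([400.0, 430.0, 390.0])
print(welch_t(treated, control))
```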

Ready to give it a try?

QIAGEN Digital Insights bioinformatics tools for transcriptomics support microarray and RNA-seq data analysis with a variety of specialized tools. They enable you to quickly go from raw instrument output to biological insights, as well as compare your results to over 100,000 curated public datasets. Learn more and request a consultation about our portfolio of tools for biomarker and target discovery that support expression data analysis. Ready to try these applications? Request a trial of QIAGEN CLC Genomics Workbench and QIAGEN IPA to see how these tools can work together to streamline your insights from expression data.

References:

Love et al. (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550.

Robinson et al. (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140.

Resources:

Plugin:

Ingenuity Pathway Analysis

 

Blogs:

Best practices for RNA-seq data analysis

Transcript discovery

Compare biological expression results with QIAGEN IPA Analysis Match

 

Webinars:

Training on RNA-seq and pathway analysis

 

Example analysis performed using QIAGEN CLC Genomics Workbench:

Shaath et al. (2021) Integrated whole transcriptome and small RNA analysis revealed multiple regulatory networks in colorectal cancer. Sci Rep 11, 14456.

 

Tutorials:

Expression analysis using RNA-Seq data

Advanced RNA-seq analysis with upload to IPA

Microarray-based expression analysis

An introduction to workflows

Building workflows in QIAGEN CLC Genomics Workbench video tutorials

How to include external applications to QIAGEN CLC Genomics Server video tutorials

Introducing a new way to improve bioinformatics efficiency in your high-throughput NGS setting

It’s exciting that advancements in high-throughput sequencing techniques and analysis enable us to generate whole genome (WGS) and whole exome (WES) data in bulk for many species, including humans. With new machines and chemistries, the cost of sequencing has decreased significantly. However, the total cost of ownership associated with bioinformatic analysis of the resulting files remains a bottleneck (1).

Whether you run a genome center, testing facility, core lab or provide sequencing services, you’ve got to deliver variant call files (VCFs) at an unbeatable price and with consistent quality and turnaround time, even at peak demand. Your customers, as well as your business, depend on it. Many high-speed NGS analysis solutions require purchasing expensive, highly specialized hardware, massive computers or large cloud computing contracts. The requirement for fast and consistent turnaround times, also at peak demand, can quickly translate into a need for more personnel, more processing power—and more investment. How do you keep costs down yet deliver quality results with quick turnaround in a world of shrinking budgets?

More investment in your bioinformatics infrastructure? Nope—now you don’t have to

What if there were a scalable, point-and-click solution that could handle all your WGS, WES and large panel data analysis needs without having to purchase vast amounts of specialized infrastructure? A software that could run with a GUI and be used by anyone with minimal training? What if you didn’t have to compromise between speed and quality?

Introducing a better, faster, cheaper and more flexible tool for WGS and WES analyses

Our all-in-one NGS bioinformatics software QIAGEN CLC Genomics Workbench Premium now offers you a faster, more accurate, more flexible and more affordable way to process WGS and WES files in bulk. This is made possible via QIAGEN CLC LightSpeed Module, which enables an ultra-fast and accurate FASTQ to VCF pipeline for hereditary germline mutation analysis.

What’s more, you’ll enjoy full flexibility. Our QIAGEN CLC Genomics Workbench Premium can process data from any sample, any panel and any species and run your analyses on a laptop, desktop, server or the cloud without depending on any new or specialized hardware.

Reduce cost with speed: Accelerate WGS secondary analysis down to just 25 mins

For certain licenses, you only pay an annual fee for software access, allowing you to run (and re-run) as many samples as you need. And because of our ultra-fast FASTQ to VCF pipeline, you can get more analyses done in less time. This translates into lower analysis costs, both for on-premises and cloud deployment.

In our recent benchmark study, we showed that with our ultra-fast QIAGEN CLC LightSpeed technology, our FASTQ to VCF hereditary workflow analyzes a 34x human WGS sample in just 25 minutes, whereas a QIAseq Exome v3 50x sample takes just 90 seconds. When run in Amazon Web Services (AWS), the incurred computing costs were about $1 per WGS and a few cents per WES. There is no other technology that can process WGS or WES this fast or this cost-efficiently.

With demonstrated high accuracy and reproducibility, and built on a scalable bioinformatics analysis platform, QIAGEN CLC LightSpeed technology will revolutionize your ability to perform high-volume whole genome sequencing.

A software that’s ‘cheaper than free’

QIAGEN CLC Genomics Workbench Premium is NGS analysis software your core lab can’t do without. It enables you to deliver ultra-fast sequencing analysis results while controlling your costs. It does this by saving your lab time, processing capacity and energy, so you can provide affordable services. You’ll also enjoy a variety of specialized tools for all your sequencing needs. The CLC platform software QIAGEN CLC Genomics Server and QIAGEN CLC Genomics Cloud Module help you build the scalable bioinformatics analysis architecture you need to offer a high-throughput genomics analysis service at affordable prices. In addition, our QIAGEN CLC Genomics platform is fully supported with tutorials and documentation, an excellent team of customer support professionals and dedicated trainers to ensure you have the support you need to perform your analyses. These advantages result in reduced total cost of ownership and are far cheaper than maintaining your current setup. Therefore, transitioning to QIAGEN CLC Genomics Workbench Premium is a switch you’ll quickly discover is ‘cheaper than free’.

Get in touch

Learn more about the newest features of QIAGEN CLC in our latest release, check out our upcoming webinar and request a consultation from one of our experts. Ready to try it out for yourself? Request a trial of QIAGEN CLC Genomics Workbench Premium to see how this software will make it faster, easier and cheaper for you to analyze your NGS data.

Share your CLC LightSpeed results and win

Got killer runtime results using QIAGEN CLC LightSpeed? Share them with us on social media using #CLCLightSpeed. When you do, you'll enter for a chance to win one of three one-year licenses to QIAGEN CLC Genomics Workbench Premium. You may alternatively enter for a chance to win by submitting the online entry form available here. Terms and conditions apply.

References

  1. The Cost of Sequencing a Human Genome. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost (accessed November 28, 2022)

Discover how our new LightSpeed feature changes the world of whole genome sequencing analysis

We’re excited to reveal many new improvements and enhancements to the latest release of QIAGEN CLC Genomics Workbench and its related plugins that significantly extend its value. Key improvements and new features in the new version 23 (v23) include:

Figure 1. An array of visualizations now available in QIAGEN CLC Genomics Workbench version 23.0.

Learn more about the applications supported by our portfolio of QIAGEN CLC Genomics software, and request a consultation with one of our experts to help you find the right QIAGEN CLC toolset for your research goals.

Discover an easy way to get reproducible analyses, traceability of results and efficient bulk analysis with QIAGEN CLC workflows.

Stringing together bioinformatics tools into pipelines enables reproducible execution of complex workflows that produce, among other things, QC reports, data visualizations, statistical analyses, and annotated and filtered output from raw NGS data. Combined with parallel execution, QIAGEN CLC workflows offer efficient throughput of reproducible analyses with traceable results.

In QIAGEN CLC Workbenches, workflows are easily created and configured using a graphical editor (1, 2). Tools can be added through drag-and-drop from the Workbench Toolbox or by selecting from a list. The output of one element is defined as the input to another simply by drawing a line between them. Fine-grained control over the execution pattern within a workflow can be added with control flow elements, supporting cases such as RNA-Seq and differential expression analysis in a single workflow or providing sets of different inputs per workflow run.

With a QIAGEN CLC Genomics Server, third-party applications (e.g., your own tools or open-source tools) can be configured as external applications, thereby expanding the analysis potential beyond the software provided by QIAGEN (3). External applications can be added to workflows using the graphical workflow editor.

Getting started with workflows is simple. Examples are provided in the Template Workflows folder in the Workbench Toolbox. These workflows can be run directly or edited to add or remove tools, change parameters, reconfigure output naming patterns and much more. The Template Workflows folder initially contains two subfolders: Basic Workflow Designs (containing RNA-seq and DNA-seq workflows) and Prepare Raw Data. When QIAGEN CLC Workbench plugins containing workflows are installed, additional subfolders are created containing those template workflows.

Outputs generated using QIAGEN CLC workflows include information on provenance relevant for auditing or publication. This history information includes the version of the software used, the tool and parameter settings used, the name of the user who ran the workflow, the date and time the element was created and the data that the output was derived from. When analyses are run on a QIAGEN CLC Server, a record of the analysis is also written to the audit log.

For bulk processing, a workflow can be submitted in batch mode, where the workflow is run multiple times, once for each input, or set of inputs, specified. When a workflow is launched in batch mode on a QIAGEN CLC Workbench, the individual jobs in that batch are carried out serially – one workflow run after another. For small analyses, this is fine. However, for routine analyses and for large analyses, we recommend the parallel execution potential and intelligent queuing facilities afforded by QIAGEN CLC Genomics Server using a Job Node or Grid Node setup.

When a workflow is submitted in batch mode to a QIAGEN CLC Genomics Server with nodes, each workflow run can be executed in parallel.  The server administrator can choose the level of parallelization desired. Options include executing each workflow run on a single node, splitting execution of individual workflows across nodes or specifying parallelization at the level of sub-workflows (blocks), which are created behind the scenes during execution.
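The difference between serial and parallel batch execution can be sketched in a few lines of Python. This is a conceptual illustration only: `run_workflow` is a hypothetical stand-in for one workflow run, and the thread pool merely mimics several nodes working at once. It does not use or represent any QIAGEN CLC API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workflow(sample_name):
    """Hypothetical stand-in for one workflow run on one input."""
    return f"{sample_name}: done"

def run_batch_serial(samples):
    """One workflow run after another, as in batch mode on a Workbench."""
    return [run_workflow(sample) for sample in samples]

def run_batch_parallel(samples, workers=4):
    """Several workflow runs at once, loosely mimicking a server with nodes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, whatever the finishing order.
        return list(pool.map(run_workflow, samples))

samples = ["sample_01", "sample_02", "sample_03"]
print(run_batch_serial(samples))
print(run_batch_parallel(samples))
```

Both functions return the same results; only the wall-clock time differs once each run takes a meaningful amount of time and enough workers are available.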

 

 

Figure 1. Serial (top) versus parallel (bottom) execution of workflows. On CLC Servers with nodes, queuing and parallel execution capacity support optimal use of computational resources. On a QIAGEN CLC Genomics Server without nodes, workflows are queued and processed serially, regardless of how they were submitted. On QIAGEN CLC Workbenches, batch jobs are run serially. Parallel execution of workflows on a QIAGEN CLC Workbench, triggered by multiple individual job launches in relatively quick succession, is not recommended, because each workflow run assumes it has access to the entire system. There is therefore a risk that jobs will crash due to issues such as memory limitations.

Learn more about the features and request your free trial of QIAGEN CLC Genomics Server and QIAGEN CLC Genomics Workbench, and explore the benefits for yourself.

Have questions? Request a consultation today.

References:

  1. An introduction to workflows using QIAGEN CLC Workbenches
  2. Theiagen Consulting LLC video tutorials on how to build workflows in CLC Genomics Workbench for SARS-CoV-2 analysis
  3. Theiagen Consulting LLC video tutorials on how to include external applications to the CLC Genomics Server: a) RAxML b) MAFFT c) iVAR

 

Did you know SARS-CoV-2 is shed in the feces of individuals with symptomatic or asymptomatic infection? Viral particles shed into wastewater via the sewer system are no longer infectious but can still be measured. Therefore, recent public health monitoring efforts target sewers to identify known genotypes of SARS-CoV-2. Genotyping by sequencing SARS-CoV-2 from wastewater correlates with sequencing results in patients in the wastewater catchment area, providing an efficient monitoring tool for viral epidemiology. Wastewater is readily available at sewage plants, and collection of wastewater samples avoids biases associated with sampling from hospitals or testing facilities (1).

PCR approaches are highly effective on well-targeted variants, and multiplexing strategies capable of simultaneously targeting several mutations can unravel the mutation patterns of circulating variants. However, NGS approaches can find new variants, increase the sensitivity of variant detection and provide an unbiased representation of the variants circulating in populations. NGS is also used for whole genome SNP analysis in local epidemiological analyses, such as hospital infection control and local outbreak tracing.

Whether using Oxford Nanopore, Illumina, PacBio or IonTorrent technology, and whether using ARTIC or vendor-designed panels, QIAGEN CLC Genomics Workbench has standard SARS-CoV-2 analysis workflows that can easily be modified towards any platform, protocol and application by exchanging workflow elements, primer design files or parameter settings.

The general approach of the workflows is mapping the reads to a reference, calling variants, generating a consensus sequence and generating outputs that enable efficient review of results, including cross-sample comparison. See (2) for examples of building workflows.
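The consensus step of this approach can be illustrated with a short Python sketch that applies called single-nucleotide variants to a reference and masks poorly covered positions with N. This is a simplified assumption-laden example: the function name, the data layout, the coverage threshold and the N-masking rule are all illustrative and are not taken from the CLC workflows.

```python
def build_consensus(reference, variants, coverage, min_coverage=10):
    """Apply single-nucleotide variants to a reference sequence.

    reference: reference sequence as a string
    variants:  dict mapping 0-based position -> alternate base
    coverage:  list of read depths, one per reference position
    Positions below `min_coverage` (an illustrative threshold) become 'N'.
    """
    consensus = []
    for position, ref_base in enumerate(reference):
        if coverage[position] < min_coverage:
            consensus.append("N")
        else:
            consensus.append(variants.get(position, ref_base))
    return "".join(consensus)

reference = "ACGTACGT"
variants = {2: "T"}                           # one called SNV: G -> T at position 2
coverage = [20, 20, 20, 20, 5, 20, 20, 20]    # position 4 is poorly covered
print(build_consensus(reference, variants, coverage))
```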

When working with several samples, multi-FASTA export of consensus sequences, as well as PDF export of the quality report, is easily accomplished.

Typically, the generated consensus sequences are manually submitted to Nextclade and Pangolin to annotate the samples with the latest phylogenetic lineage information.

For high-throughput use, any manual steps introduce errors and inefficiencies. QIAGEN CLC Genomics Server software can automate lineage annotation by making use of its “external applications” functionality, where regularly updated Docker images of Nextclade or Pangolin can be included in CLC workflows (Figure 1). For other examples of external applications, see (3).

Figure 1. a) Using QIAGEN CLC Genomics Server, Nextclade and Pangolin docker images are added to CLC as an “external application” so that the functionalities can be integrated into CLC workflows to assign lineage information to the sample. b) Example output of the Nextclade functionality and c) example output of the Pangolin functionality of the CLC workflow shown.

The server software is also well-suited for handling many workflow executions in parallel, as it has a “scheduler” functionality that manages the execution queue. This queuing ability ensures that parallel workflow execution is coordinated, and individual steps do not interfere with each other by competing for computational resources. External applications can also be executed in the cloud by using QIAGEN CLC Genomics Cloud Engine, reducing local hardware needs to a minimum. QIAGEN CoV-2 Insights service is an instance of this architecture, available if you wish to use this pipeline without setting up the software on your own.

These bioinformatic workflows work well in cases where it can be assumed that there is only one dominant strain in circulation. However, in situations where a novel strain is emerging and there are several possibilities to monitor, it is a better strategy to test for evidence of marker mutations in the reads. A tool that can be used for this purpose, by monitoring predefined reference positions in read mappings, is the “Identify Known Mutations from Sample Mappings” algorithm. It reports whether each variant could be detected, whether the coverage was sufficient at the given position, and the frequency and other statistics of the variant(s) in the sample. As input, the tool takes the read mapping and a variant track that holds the specific variants that you wish to test for. By applying the tool iteratively, in series, with a variant track for each SARS-CoV-2 strain you wish to monitor, you can test for evidence of many strains in a single workflow (Figure 2). The workflow can then be applied to batches of samples simultaneously, providing a fully scalable solution that only needs updating when new strains are expected to enter the population.
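The idea of testing predefined positions for marker mutations can be sketched as follows. This Python example is a conceptual illustration, not the algorithm of the “Identify Known Mutations from Sample Mappings” tool; the pileup layout, the function name and both thresholds are assumptions made for the example.

```python
def check_known_mutations(pileup, known_variants, min_coverage=10, min_frequency=0.05):
    """Report evidence for each known variant at its reference position.

    pileup:         dict mapping position -> dict of base -> read count
    known_variants: list of (position, alternate_base) tuples
    Thresholds are illustrative, not defaults of any QIAGEN tool.
    """
    report = []
    for position, alt in known_variants:
        counts = pileup.get(position, {})
        depth = sum(counts.values())
        frequency = counts.get(alt, 0) / depth if depth else 0.0
        report.append({
            "position": position,
            "alt": alt,
            "coverage": depth,
            "sufficient_coverage": depth >= min_coverage,
            "frequency": frequency,
            "detected": depth >= min_coverage and frequency >= min_frequency,
        })
    return report

# Invented read counts at two positions; the third variant has no reads at all.
pileup = {100: {"A": 90, "G": 10}, 200: {"C": 3}}
marker_variants = [(100, "G"), (200, "T"), (300, "A")]
for row in check_known_mutations(pileup, marker_variants):
    print(row)
```

Running the same function once per strain-specific variant list mirrors the "in series" use described above, and separating "not detected" from "insufficient coverage" avoids reading a coverage gap as absence of a strain.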

Figure 2. A QIAGEN CLC Genomics Workbench workflow interrogating an input sample read mapping to a SARS-CoV-2 reference at genomic positions defining known variants of the virus. The workflow can be executed in batch mode to monitor many samples simultaneously.

References:

  1. Wurtz, N., et al. (2021). Monitoring the Circulation of SARS-CoV-2 Variants by Genomic Analysis of Wastewater in Marseille, South-East France. Pathogens 10, 1042. https://doi.org/10.3390/pathogens10081042
  2. Theiagen Consulting LLC video tutorials on how to build workflows in CLC Genomics Workbench for SARS-CoV-2 analysis
  3. Theiagen Consulting LLC video tutorials on how to include external applications to the CLC Genomics Server: a) RAxML b) MAFFT c) iVAR

Learn more about the capabilities of QIAGEN CLC Genomics Workbench Premium and download your free trial today.

We are excited to share with you news of the new QIAGEN CLC Genomics version 22, which offers you many improvements and new features, including:

See our latest improvements page for information about all the updates in this CLC Genomics Workbench release.

See the latest improvements page for information about all the updates in this CLC Genomics Server release.

Plugins and Modules

Many plugins and modules have new features and improvements. Highlights include:

Discover all QIAGEN CLC plugins here.

New Application Notes

Assembly and annotation of plastid genomes using QIAGEN CLC Genomics Workbench

RNA-seq analysis using long and short reads from pathogen-infected plant tissues

Improving structural annotation in complex genomes with QIAGEN CLC Genomics Workbench

A strategy for evaluating genomic assemblies in QIAGEN CLC Genomics Workbench

Discover additional application notes here.

New Tutorials

Immune Repertoire Analysis using QIAseq Immune Repertoire panels

Analysis of Viral Hybrid Panel Data and Identification of Viral Integration Sites

Creating and using annotated sequences as microbial reference data

Explore additional tutorials here.

Learn more about the applications supported by our portfolio of QIAGEN CLC Genomics solutions, and request a consultation with one of our experts to help you find the right QIAGEN CLC toolset for your research goals.

 

Ab initio gene finding is a central step in genome analysis, and it must account for the biology of the investigated genome(s) to perform adequately. The signals are manifold and include coding potential, hexamer distributions, and RNA polymerase-binding and spliceosome-binding sequences, all of which depend on GC content.

The GeneMark family of algorithms has been used continuously for genome annotation, starting with the first complete genome (Haemophilus influenzae) sequenced in 1995. Currently, an algorithm of the GeneMark family is used by NCBI as part of its prokaryotic genome annotation pipeline. Two algorithms, MetaGeneMark and GeneMark-ES, are available as plugins in QIAGEN CLC Genomics Workbench and QIAGEN CLC Genomics Server. MetaGeneMark has proven to deliver accurate gene predictions in metagenomes. GeneMark-ES is an automatic ab initio gene prediction tool for compact eukaryotic genomes. Gene finding in whole genome-sequenced microbial genomes can also be performed using the “Find Prokaryotic Genes” tool of QIAGEN CLC Microbial Genomics Module.

MetaGeneMark

The MetaGeneMark plugin represents a new release of the gene finding algorithm for metagenomic sequences. For each metagenomic contig, MetaGeneMark uses values of the GC content of each ORF in the contig to select sets of gene model parameters (1,2). For a given GC content value, the algorithm uses parameters that vary for archaeal and bacterial domains. This approach ensures that there are no parameters that a user has to select or adjust. The algorithm is fast; it can process 1 GB of metagenomic contigs on a single CPU in less than half an hour.
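To illustrate the GC-driven parameter selection described above, here is a minimal Python sketch that computes GC content and maps it to an integer bin that could index a table of precomputed model parameters. The one-percent bin width and the function names are assumptions for illustration; the actual MetaGeneMark parameter sets and their lookup are not shown in this article.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    return gc / acgt if acgt else 0.0

def gc_percent_bin(sequence):
    """Round GC content to a whole percent, usable as a lookup key
    into a (hypothetical) table of precomputed model parameters."""
    return int(round(100 * gc_content(sequence)))

contig = "GGCCAATT"
print(gc_content(contig), gc_percent_bin(contig))
```

Because the key is derived from the sequence itself, no user-selected parameter is needed, which is the point made in the paragraph above.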

GeneMark-ES

The GeneMark-ES plugin delivers ab initio predictions of protein-coding genes in eukaryotic genomes (3,4). The GeneMark.hmm algorithm employs a hidden semi-Markov model. The model parameters are determined iteratively using Viterbi training. The most probable parse of a genomic sequence into exons, introns and intergenic regions is thus determined simultaneously with unsupervised training of the model parameters from the genomic sequence, rendering GeneMark-ES a fully automatic tool. GeneMark-ES was shown to produce high gene prediction accuracy for genomes shorter than 400 Mb. Longer genomes present a challenge because their intergenic regions are longer on average. The unsupervised training procedure is computationally expensive and may take several hours.
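The "most probable parse" is what Viterbi decoding computes. The Python sketch below shows plain Viterbi decoding on a toy two-state hidden Markov model. It is deliberately simpler than GeneMark-ES: it is an ordinary HMM rather than a hidden semi-Markov model, it covers only the decoding step and not the iterative Viterbi training, and the states and probabilities are invented for illustration. All probabilities are assumed to be non-zero.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for a plain HMM, computed in log space."""
    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model: a GC-rich and an AT-rich region type (invented numbers).
states = ["gc_rich", "at_rich"]
start_p = {"gc_rich": 0.5, "at_rich": 0.5}
trans_p = {"gc_rich": {"gc_rich": 0.9, "at_rich": 0.1},
           "at_rich": {"gc_rich": 0.1, "at_rich": 0.9}}
emit_p = {"gc_rich": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
          "at_rich": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45}}
print(viterbi("GGCCAATT", states, start_p, trans_p, emit_p))
```

In Viterbi training, a decoding pass like this alternates with re-estimating the model parameters from the decoded path until the parse stabilizes.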

Find prokaryotic genes

Ab initio gene finding for microbial genomes can be performed using the “Find Prokaryotic Genes” tool of QIAGEN CLC Microbial Genomics Module. The tool creates a gene prediction model from the input sequence, which estimates GC content, conserved sequences corresponding to ribosomal binding sites, start and stop codon usages, and a statistical model (namely, an Interpolated Markov Model) for estimating the probability of a sequence to be part of a gene compared to the background. The model is then used to predict coding sequences from the input sequence. This tool is inspired by Glimmer3 (5).
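As background for the start and stop codon signals mentioned above, this Python sketch lists candidate open reading frames on the forward strand. It is a much simpler technique than the tool described here: it uses no interpolated Markov model, no ribosomal binding site signal and no reverse strand, and the minimum length is an arbitrary illustrative value. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Only the first ATG in each frame before a stop opens an ORF.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

A statistical model such as the one described above is what separates real genes from the many spurious ORFs a scan like this reports in a full genome.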

Resources

The MetaGeneMark manual and GeneMark-ES manual provide detailed instructions on plugin usage. The use of the algorithms has been documented in more than 2000 research publications. The QIAGEN CLC Microbial Genomics Module manual has extensive documentation on the “Find Prokaryotic Genes” tool, its settings and downstream analysis capabilities.

In case RNA-seq data exist, the QIAGEN CLC Genomics Workbench toolbox enables easy verification of ab initio gene predictions, as described in the application note 'Improving structural annotation in complex genomes with QIAGEN CLC Genomics Workbench'.

See also our blog on transcript discovery using QIAGEN CLC Genomics Workbench.

References

  1. Besemer J. and Borodovsky M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Research 27 (19): 3911.
  2. Zhu W., Lomsadze A. and Borodovsky M. (2010) Ab initio gene identification in metagenomic sequences. Nucleic Acids Research 38 (12): e132. doi: 10.1093/nar/gkq275
  3. Lomsadze A., et al. (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research 33: 6494.
  4. Ter-Hovhannisyan V., et al. (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Research 18:1979.
  5. Delcher AL, Bratke KA, Powers EC, Salzberg SL. (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23 (6): 673. doi: 10.1093/bioinformatics/btm009

Explore QIAGEN CLC Genomics Workbench Premium

Food and feed safety is a main concern for food authorities, centers of disease control, departments of agriculture and public health laboratories, surveilling and acting on epidemiological outbreaks of foodborne pathogens such as Salmonella, Listeria, Vibrio, E. coli, Shigella, Campylobacter and Cronobacter reported by hospitals and doctors. Typically, the detective work involved in tracing an outbreak includes examining pre-infection food intake of patients, drinking water, supermarket receipts and social media postings of dining-out events. Laboratory steps involve culturing bacteria from suspected sources using specialized bacterial growth media to isolate the causal agent, followed by strain typing. Once the source of contamination has been identified, it is eliminated from the food chain, often by product recalls from the company producing the food. At this stage, the damage has been done, and the impacts are wide. Consumers are affected by potential fatalities. Brand reputation and immediate financial returns are diminished. A lot of food is wasted. For this reason, there is a strong industry trend to have in-house facilities and expertise to establish bacterial baselines and monitor any deviations thereof, enabling early detection and avoiding costly outbreaks. Other use cases include correct labeling and fraud detection of food items.

In addition to culturing and PCR-based identification methods, NGS-based approaches to sample characterization are gaining traction, namely whole genome sequencing of isolates and taxonomic profiling of bacterial communities. Food quality laboratories are now routinely equipped with desktop sequencing machines from Illumina or Ion Torrent and portable devices from Oxford Nanopore to provide the sequences.

Taxonomic profiling of the bacterial community can involve sampling and sequencing DNA using whole metagenome shotgun approaches, either with or without an intercalated PCR amplification step of the 16S rRNA genes. The bioinformatics pipelines for the NGS data analysis vary according to the approach taken (Jagadeesan et al., 2019).

Whole genome sequencing

Whole genome sequencing (WGS) of isolates consists of quality control, read trimming and assembly, bacterial characterization, strain typing, antimicrobial resistance characterization, variant calling, phylogenetic analysis and visualization tasks.

A recent German consortium effort for genome-based surveillance of Salmonella enterica isolates using Illumina sequencing technology lists 15 open-source bioinformatics software tools needed for the WGS analysis (Uelze et al., 2021), whereas a recent paper by the Institute of Food Safety and Analytical Sciences, Nestlé Research, lists 13 tools for their corresponding pipeline (Barretto et al., 2021). Web-based bioinformatics services tailored to the NGS analysis of foodborne pathogens include the US-based GenomeTrakr (Timme et al., 2019) and the Danish Evergreen (Szarvas et al., 2020) pipelines. For food scientists and microbiologists who are not bioinformaticians, it is time-consuming and impractical to learn these programs, let alone install, connect, version-control and maintain them. A single, integrated platform that is easy to install, use and maintain is much preferred. QIAGEN CLC Genomics Workbench Premium has tailored tools for all steps along the pipeline and is designed for bench scientists to use without bioinformatics expertise. Workflows that tie these steps together are also available, so that execution takes only a few mouse clicks in the graphical user interface. Scaling to enterprise levels is relatively straightforward with the QIAGEN CLC Genomics Server or QIAGEN CLC Genomics Cloud Engine software.

 

Figure 1. A schematic workflow for the analysis of NGS reads generated by whole genome sequencing of isolates.

The tutorial “Typing and Epidemiological Clustering of Common Pathogens” includes an example workflow for analyzing NGS data from isolated and cultivated bacterial samples using QIAGEN CLC Genomics Workbench. Using Illumina data from 47 cultured Salmonella enterica isolates, the workflow identifies the best matching reference and its taxonomy, performs NGS-based multilocus sequence typing (MLST), finds antimicrobial resistance genes, identifies potential contaminants in a sample and performs outbreak analysis based on SNP trees. The databases needed for workflow execution are also provided and include Salmonella and Staphylococcus genome references, MLST schemes and antimicrobial resistance gene databases. The workflow can easily be adapted to other bacterial species or modified to perform other tasks or search additional databases. The tutorial also demonstrates how to work with many samples to create both k-mer trees and SNP trees and display these in the context of metadata. Metadata can be added and displayed on trees as described in the “Phylogenetic Trees and Metadata” tutorial.
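
To illustrate the idea behind SNP-based outbreak analysis, the following minimal sketch counts pairwise SNP differences between aligned sequences; such distances are the usual input for building a SNP tree. It is not the algorithm used by QIAGEN CLC Genomics Workbench. The function names, isolate names and toy sequences are hypothetical, and the sketch assumes the sequences are already aligned to the same length.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ, ignoring gaps and Ns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ignore = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_snp_distances(alignment: dict) -> dict:
    """Return SNP distances for every pair of isolates, keyed by sorted name pair."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment (real core-genome alignments span millions of bases).
alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACTTACGNAT",
}
distances = pairwise_snp_distances(alignment)
for pair, dist in distances.items():
    print(pair, dist)
```

Isolates separated by only a few SNPs are candidates for belonging to the same outbreak cluster; the cutoff to use depends on the organism and the study and is not specified here.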

Taxonomic profiling of bacterial communities

For monitoring microbial communities along the food processing chain, the bacterial isolation and genome typing workflow is often impractical, as the largely manual laboratory process of sample culturing does not scale well and is strongly biased toward identifying the “usual suspects” among species that can be cultured. In contrast, culture-independent approaches permit high-throughput automation while also providing unbiased information on the microbial composition of the samples. Two approaches are widely used: amplicon-based profiling and whole shotgun metagenomics.

Amplicon-based profiling

Amplicon-based profiling is based on sequencing highly conserved regions of bacterial genomes at the 16S rRNA locus (ITS for fungi), clustering the resulting NGS reads into pseudo-species called Operational Taxonomic Units (OTUs), and computing the abundance of each OTU. In reference-based OTU clustering, a database provides the taxonomy assignment for the OTUs. OTUs can also be constructed without a valid match in the reference database, providing evidence for as yet unknown bacterial (or fungal) species. The PCR amplification ensures a highly sensitive assay. With relatively few sequences, a representative and reproducible taxonomic profile of the samples can be obtained, making this approach highly cost-effective and scalable. The tutorial “OTU Clustering Using Workflows” provides a workflow for analyzing NGS data from soil samples using QIAGEN CLC Genomics Workbench and visualizing the results using zoomable sunburst and bar chart plots.
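
The clustering and abundance steps can be sketched as a simple greedy procedure: each read joins the first existing OTU whose representative sequence is similar enough, or otherwise founds a new OTU. This is a minimal illustration under stated assumptions, not the OTU clustering tool of the software. The function names are hypothetical, reads are assumed to be of equal length so that identity can be computed position by position, and the default 97% identity threshold is a commonly used value that is assumed here rather than taken from the text.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_otus(reads: list, threshold: float = 0.97) -> dict:
    """Greedily cluster reads into OTUs and return relative abundances.

    Returns a dict mapping OTU name -> (representative sequence, abundance).
    """
    centroids = []      # one representative sequence per OTU
    counts = Counter()  # number of reads assigned to each OTU index
    for read in reads:
        for index, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                counts[index] += 1
                break
        else:
            # No existing OTU is similar enough: this read founds a new one.
            centroids.append(read)
            counts[len(centroids) - 1] += 1
    total = sum(counts.values())
    return {
        f"OTU_{i + 1}": (centroids[i], counts[i] / total)
        for i in range(len(centroids))
    }


# Hypothetical toy reads; a lower threshold is used because they are only 10 bases long.
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGTAC"]
otus = cluster_otus(reads, threshold=0.8)
for name, (representative, abundance) in otus.items():
    print(name, representative, abundance)
```

In a reference-based setting, each representative sequence would additionally be compared against a taxonomy database to name the OTU; representatives without a match correspond to the potentially unknown species mentioned above.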

Whole shotgun metagenomic taxonomic profiling

A more direct approach that does not rely on PCR (and hence avoids many of the potential PCR-associated biases) is based on whole shotgun sequencing of metagenomic DNA followed by taxonomic profiling. This is done by mapping the reads to a representative microbiome reference database and reporting the taxonomic levels of the references to which reads map, with the percentage of reads mapped to a given reference serving as a proxy for the abundance of that species in the microbiome. Evidence for unknown microbial species is contained in the reads that do not match the reference database(s). Metagenomic assembly and binning of such reads allows for the construction of metagenome-assembled genomes (MAGs) that can be incorporated into one’s reference database and serve as quality markers.
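
The abundance estimate described above can be sketched in a few lines: count the reads assigned to each reference taxon, express each count as a percentage of all mapped reads, and set the unmapped reads aside for assembly and binning. This is a minimal sketch, not the implementation of the “Taxonomic Profiling” tool. The function name, read IDs and input format (a mapping from read ID to assigned taxon, with None for unmapped reads) are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def taxonomic_profile(read_assignments: dict) -> tuple:
    """Summarize read-to-taxon assignments.

    read_assignments maps read ID -> taxon name, or None if the read is unmapped.
    Returns (profile, unmapped) where profile maps taxon -> percentage of
    mapped reads and unmapped is the list of read IDs without a match.
    """
    mapped = Counter(t for t in read_assignments.values() if t is not None)
    unmapped = [r for r, t in read_assignments.items() if t is None]
    total = sum(mapped.values())
    profile = {taxon: 100.0 * n / total for taxon, n in mapped.items()} if total else {}
    return profile, unmapped


# Hypothetical toy input; a real run involves millions of reads.
assignments: dict = {
    "read1": "Lactococcus lactis",
    "read2": "Lactococcus lactis",
    "read3": "Lactococcus lactis",
    "read4": "Pseudomonas fluorescens",
    "read5": None,  # no match in the reference database
}
profile, unmapped = taxonomic_profile(assignments)
print(profile)
print(unmapped)
```

Read counts are only a proxy for abundance, as the text notes; this sketch does not correct for differences in genome size or for reads that map equally well to several references.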

The tutorial “Taxonomic Profiling of Whole Shotgun Metagenomic Data” demonstrates taxonomic analysis to monitor the effect of antibiotic treatment on two subjects’ gut microbiota in a time series experiment. For metagenomic assembly and binning of contigs, the “QC, Assemble and Bin Pangenomes” workflow template is provided in the software. Constructing and maintaining the databases is explained in the “Creating and using annotated sequences as microbial reference data” tutorial.

A common source of error in whole shotgun metagenomic approaches is reads originating from the “food matrix”. This usually means that the bulk of the reads are derived from the host genome or, in the case of fermented products, from the starter culture. Hence, a filtering step to remove these reads should be included in the analysis. The “Taxonomic Profiling” tool includes this optional filter. Beck et al. (2021) list 31 commonly used food and feed “matrix filtering genomes” that should be used as “decoy” reference(s) in this step. Including these matrix references will also reduce false-positive findings and speed up the read mapping step, as exact matches of reads to a reference are found much faster than approximate matches.
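
The decoy idea can be sketched as follows: reads are mapped against a reference collection that contains both microbial and matrix genomes, and any read whose best hit is a matrix genome is discarded before profiling. This is a minimal sketch and not the filter built into the “Taxonomic Profiling” tool. The function name, the input format (read ID mapped to its best-hit reference, or None if unmapped) and the reference names are hypothetical.

```python
from typing import Optional


def filter_matrix_reads(best_hits: dict, matrix_refs: set) -> tuple:
    """Drop reads whose best hit is a food-matrix ("decoy") reference.

    best_hits maps read ID -> name of the best-matching reference, or None.
    Returns (kept, removed) where kept has the same shape as best_hits and
    removed is the number of reads assigned to a matrix reference.
    """
    kept = {}
    removed = 0
    for read_id, reference in best_hits.items():
        if reference in matrix_refs:
            removed += 1
        else:
            kept[read_id] = reference
    return kept, removed


# Hypothetical example for a fermented dairy product: the host genome and
# the starter culture are treated as matrix references.
matrix_refs = {"Bos_taurus", "Streptococcus_thermophilus"}
best_hits: dict = {
    "read1": "Bos_taurus",
    "read2": "Listeria_monocytogenes",
    "read3": "Bos_taurus",
    "read4": None,
    "read5": "Streptococcus_thermophilus",
}
kept, removed = filter_matrix_reads(best_hits, matrix_refs)
print(kept)
print(removed)
```

Only the reads in `kept` would then contribute to the taxonomic profile, so matrix reads can no longer be misassigned to a related microbial reference.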

The secret to success

The choice of reference data is key to the success of taxonomic profiling approaches using NGS. If a given species in the microbiome is not represented in the reference data, this will lead to false-negative findings. If a species is missing from the reference data, yet its reads map to the genome of an unrelated species with similar genomic regions, this will lead to false-positive findings. Both can happen if the reference databases used to perform taxonomic profiling are not representative of the habitat studied. In food safety NGS, false negatives result in overlooked problems, and false positives trigger unnecessary alerts. For this reason, generic reference databases that try to capture the entire (rarefied) tree of life, regardless of habitat, may be a poor choice. Instead, habitat-specific reference databases are becoming the new standard. Recent examples of such databases are RVDB for virus references, and ProGenomes2 and MGnify for a wide range of microbial communities.

Food and feed monitoring laboratories have to set up the laboratory procedures involved in sampling along the production chain, nucleic acid extraction, library preparation and sequencing. The bioinformatics analysis must then be set up correspondingly. QIAGEN CLC Genomics Workbench not only supports this approach but also provides a single point of entry to bioinformatics: a universal and flexible toolset that can be executed as workflows for large-scale analyses, without the need to learn, install and maintain various open-source tools and databases.

Use case and example workflow

Scientists at a global dairy company are using QIAGEN CLC Genomics Workbench with Oxford Nanopore sequencing technology to perform whole metagenome shotgun analysis. Their aim is to establish a “normal” microbiome community baseline for dairy products and to monitor deviations from it that are associated with food spoilage. Advantages are the availability of a library preparation kit, sequencing technology and plug-and-play software and, most importantly, speed of analysis: turnaround times improve from weeks with culture-based approaches for slow-growing, cold-adapted bacterial species to only a few days. The workflow used is depicted in Figure 2.

 

Figure 2. The bioinformatics workflow for the analysis of whole metagenome data. The reference data consists of known “food matrix” genomes in addition to food spoilers. The “Not annotated” reads can be de novo assembled and used as queries to find new spoilers or matrix-associated reference genomes, which can then be included in the reference collection for future use. Strain typing, AMR, virulence and plasmid characterization can easily be plugged in as part of the “Iterate per taxonomy” workflow.
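
Monitoring deviations from a “normal” baseline, as in the use case above, can be sketched by comparing the taxonomic profile of each new sample with the baseline profile and flagging samples that differ too much. The use case does not state how deviations are measured. The sketch below assumes Bray-Curtis dissimilarity, a widely used measure for comparing community profiles, and a hypothetical alert threshold of 0.3; the function names, taxa and abundances are likewise illustrative.

```python
def bray_curtis(profile_a: dict, profile_b: dict) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


def deviates_from_baseline(sample: dict, baseline: dict, threshold: float = 0.3) -> bool:
    """Flag a sample whose dissimilarity to the baseline exceeds the threshold.

    The default threshold is an arbitrary placeholder and would have to be
    calibrated on real baseline data.
    """
    return bray_curtis(sample, baseline) > threshold


# Hypothetical relative abundances for a dairy product.
baseline = {"Lactococcus lactis": 0.8, "Streptococcus thermophilus": 0.2}
normal_sample = {"Lactococcus lactis": 0.75, "Streptococcus thermophilus": 0.25}
suspect_sample = {
    "Lactococcus lactis": 0.4,
    "Streptococcus thermophilus": 0.1,
    "Pseudomonas fluorescens": 0.5,
}

print(deviates_from_baseline(normal_sample, baseline))
print(deviates_from_baseline(suspect_sample, baseline))
```

In practice, a baseline would be built from many samples rather than a single profile, so that the natural variation between batches can be taken into account when setting the threshold.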

Learn more about QIAGEN CLC Genomics Workbench Premium.

Don’t miss these related QIAGEN CLC blogs:

OTU clustering using QIAGEN CLC Microbial Genomics Module

Taxonomic profiling using Progenomes2 in QIAGEN CLC Genomics Workbench

References:

Barretto et al. (2021) Genome sequencing applied to pathogen source tracking in food industry: Key considerations for robust bioinformatics data analysis and reliable results interpretation. Genes 12, 275.

Beck et al. (2021) Monitoring the microbiome for food safety and quality using deep shotgun sequencing. NPJ Sci Food 5, 3.

Jagadeesan et al. (2019) The use of next generation sequencing for improving food safety: Translation into practice. Food Microbiology 79, 96.

Szarvas et al. (2020) Large-scale automated phylogenomic analysis of bacterial isolates and the Evergreen Online platform. Communications Biology 3, 137.

Timme et al. (2019) Utilizing the public GenomeTrakr database for foodborne pathogen traceback. Methods Mol Biol 1918, 201.

Uelze et al. (2021) Toward an Integrated Genome-Based Surveillance of Salmonella enterica in Germany. Frontiers in Microbiology 12, 200.
