Explore QIAGEN CLC Genomics Workbench Premium
Food and feed safety is a main concern for food authorities, centers of disease control, departments of agriculture and public health laboratories, surveilling and acting on epidemiological outbreaks of foodborne pathogens such as Salmonella, Listeria, Vibrio, E. coli, Shigella, Campylobacter and Cronobacter reported by hospitals and doctors. Typically, the detective work involved in tracing an outbreak includes examining pre-infection food intake of patients, drinking water, supermarket receipts and social media postings of dining-out events. Laboratory steps involve culturing bacteria from suspected sources using specialized bacterial growth media to isolate the causal agent, followed by strain typing. Once the source of contamination has been identified, it is eliminated from the food chain, often by product recalls from the company producing the food. At this stage, the damage has been done, and the impacts are wide. Consumers are affected by potential fatalities. Brand reputation and immediate financial returns are diminished. A lot of food is wasted. For this reason, there is a strong industry trend to have in-house facilities and expertise to establish bacterial baselines and monitor any deviations thereof, enabling early detection and avoiding costly outbreaks. Other use cases include correct labeling and fraud detection of food items.
In addition to culturing and PCR-based identification methods, NGS-based approaches to sample characterization are gaining traction, namely whole genome sequencing of isolates and taxonomic profiling of bacterial communities. Food quality laboratories are now routinely equipped with desktop sequencing machines from Illumina or Ion Torrent and portable devices from Oxford Nanopore to provide the sequences.
Taxonomic profiling of the bacterial community can involve sampling and sequencing DNA using whole metagenome shotgun approaches, either with or without an intercalated PCR amplification step of the 16S rRNA genes. The bioinformatics pipelines for the NGS data analysis vary according to the approach taken (Jagadeesan et al., 2019).
Whole genome sequencing (WGS) of isolates consists of quality control, read trimming and assembly, bacterial characterization, strain typing, antimicrobial resistance characterization, variant calling, phylogenetic analysis and visualization tasks.
A recent German consortium effort for genome-based surveillance of Salmonella enterica isolates using Illumina sequencing technology lists 15 open-source bioinformatics software tools needed for the WGS analysis, whereas a recent paper by the Institute of Food Safety and Analytical Sciences, Nestlé Research, lists 13 tools for their corresponding pipeline. Web-based bioinformatics services tailored to the NGS analysis of foodborne pathogens include the US-based GenomeTrakr and the Danish Evergreen pipelines. For food scientists and microbiologists who are not bioinformaticians, it is time-consuming and impractical to learn these programs, let alone install, connect, version-control and maintain them. A single, integrated platform that is easy to use, install and maintain is much preferred. QIAGEN CLC Genomics Workbench Premium has tailored tools for all steps along the pipeline and is designed for bench scientists to use without bioinformatics expertise. Workflows that tie these steps together are also available, so that execution takes only a few mouse clicks in the graphical user interface. Scaling to enterprise levels is relatively straightforward with the QIAGEN CLC Genomics Server or QIAGEN CLC Genomics Cloud Engine software.
Figure 1. A schematic workflow for the analysis of NGS reads generated by whole genome sequencing of isolates.
The tutorial “Typing and Epidemiological Clustering of Common Pathogens” includes an example workflow for analyzing NGS data from isolated and cultivated bacterial samples using QIAGEN CLC Genomics Workbench. Using Illumina data from 47 cultured Salmonella enterica isolates, the workflow identifies the best matching reference and its taxonomy, performs NGS-based multilocus sequence typing (MLST), finds antimicrobial resistance genes, identifies potential contaminants in a sample and performs outbreak analysis based on SNP trees. The databases needed for workflow execution are also provided and include Salmonella and Staphylococcus genome references, MLST schemes and antimicrobial resistance gene databases. The workflow can easily be adapted to other bacterial species or modified to perform other tasks or search additional databases. The tutorial also demonstrates how to work with many samples to create both k-mer trees and SNP trees and display these in the context of metadata. Metadata can be added and displayed on trees as described in the “Phylogenetic Trees and Metadata” tutorial.
For monitoring microbial communities along the food processing chain, the bacterial isolation and genome typing workflow is often impractical: the largely manual laboratory process of sample culturing does not scale well and is strongly biased toward identifying the “usual suspects” among species that can be cultured. In contrast, culture-independent approaches permit high-throughput automation while also providing unbiased information on the microbial composition of the samples. Two approaches are widely used: amplicon-based profiling and whole shotgun metagenomics.
Amplicon-based profiling is based on sequencing highly conserved regions of bacterial genomes at the 16S rRNA locus (ITS for fungi), clustering the resulting NGS reads into pseudo-species called Operational Taxonomic Units (OTUs), and computing the abundance of each OTU. In reference-based OTU clustering, a database provides the taxonomy assignment for the OTUs, while OTUs can also be constructed from reads without a valid match in the reference database, providing evidence for as yet unknown bacterial (or fungal) species. The PCR amplification ensures a highly sensitive assay. With relatively few sequences, a representative and reproducible taxonomic profile of the samples can be obtained, making this approach highly cost-effective and scalable. The tutorial “OTU Clustering Using Workflows” provides a workflow for analyzing NGS data from soil samples using QIAGEN CLC Genomics Workbench and visualizing the results using zoomable sunburst and bar chart plots.
A more direct approach that does not rely on PCR (and hence avoids many of the potential PCR-associated biases) is whole shotgun sequencing of metagenomic DNA followed by taxonomic profiling. This is done by mapping the reads to a representative microbiome reference database and reporting the taxonomic levels of the references to which reads map, with the percentage of reads mapped to a given reference serving as a proxy for the abundance of that species in the microbiome. Evidence for unknown microbial species is contained in the reads that do not match the reference database(s). Metagenomic assembly and binning of such reads allows the construction of metagenome-assembled genomes (MAGs), which can be incorporated into one’s reference database and serve as quality markers.
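As a rough illustration of how mapped-read fractions can serve as an abundance proxy, the sketch below tallies per-read taxon assignments. The function name `relative_abundance`, the toy assignments and the use of `None` for unmapped reads are assumptions made for this example; this is not the implementation of the Taxonomic Profiling tool.

```python
from collections import Counter

def relative_abundance(assignments):
    """Return ({taxon: percent of all reads}, percent of reads left unmapped).

    `assignments` holds one entry per read: a taxon name, or None when the
    read did not map to any reference (hypothetical input format).
    """
    total = len(assignments)
    if total == 0:
        return {}, 0.0
    counts = Counter(a for a in assignments if a is not None)
    profile = {taxon: 100.0 * n / total for taxon, n in counts.items()}
    unmapped = 100.0 * (total - sum(counts.values())) / total
    return profile, unmapped

# Toy data: ten reads, one of which matches nothing in the reference set.
reads = ["Listeria monocytogenes"] * 6 + ["Salmonella enterica"] * 3 + [None]
profile, unmapped = relative_abundance(reads)
for taxon, pct in sorted(profile.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}: {pct:.1f}%")
print(f"unmapped: {unmapped:.1f}%")
```

The unmapped fraction is reported separately because, as described above, those reads are the candidates for assembly and binning into MAGs.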
The tutorial “Taxonomic Profiling of Whole Shotgun Metagenomic Data” demonstrates the taxonomic analysis to monitor the effect of antibiotic treatment of two subjects’ gut microbiota in a time series experiment. For metagenomic assembly and binning of contigs, the “QC, Assemble and Bin Pangenomes” workflow template is provided in the software. Constructing and maintaining the databases is explained in the “Creating and using annotated sequences as microbial reference data” tutorial.
A common source of error in whole shotgun metagenomic approaches is reads originating from the “food matrix”. This usually means that the bulk of the reads are derived from the host genome or, in the case of fermented products, from the starter culture. Hence, a filtering step to remove these should be included in the analysis. The “Taxonomic Profiling” tool includes this optional filter. Beck et al. (2021) list 31 commonly used food and feed “matrix filtering genomes” that should be used as “decoy” reference(s) in this step. Including these matrix references will also reduce false-positive findings and speed up the read mapping step, as exact matches of reads to a reference are found much faster than approximate matches.
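The idea behind decoy filtering can be sketched as follows, assuming each read has already been assigned a best-matching reference. The function `filter_matrix_reads`, the read IDs and the choice of matrix genomes are hypothetical and only illustrate the principle; the filter built into the tool may work differently.

```python
def filter_matrix_reads(best_hits, matrix_refs):
    """Split reads into (kept, removed) by whether the best hit is a matrix genome.

    best_hits:   {read_id: best-matching reference name, or None if unmapped}
    matrix_refs: set of reference names treated as food-matrix "decoys"
    """
    kept, removed = {}, {}
    for read_id, ref in best_hits.items():
        target = removed if ref in matrix_refs else kept
        target[read_id] = ref
    return kept, removed

# Hypothetical dairy sample: host and starter-culture genomes act as decoys.
matrix_refs = {"Bos taurus", "Lactococcus lactis"}
best_hits = {
    "read1": "Bos taurus",
    "read2": "Pseudomonas fluorescens",
    "read3": "Lactococcus lactis",
    "read4": None,  # unmapped reads are kept for later assembly
}
kept, removed = filter_matrix_reads(best_hits, matrix_refs)
print("kept:", sorted(kept))
print("removed:", sorted(removed))
```

Because a matrix read is captured by its decoy genome, it can no longer be forced onto a loosely similar microbial reference, which is how the filter reduces false positives.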
The choice of reference data is key to the success of taxonomic profiling approaches using NGS. If a given species in the microbiome is not represented in the reference data, this will lead to false-negative findings. If a species is not present yet reads originating from this species map to the genome of an unrelated species with similar genomic regions, this will lead to false-positive findings. This can happen if the reference databases used to perform taxonomic profiling are not representative of the habitat studied. In the food safety NGS area, false-negatives will result in overlooked problems, and false-positives will trigger unnecessary alerts. For this reason, generic reference databases that try to capture the entire (rarefied) tree of life, regardless of habitat, may be a poor choice. Instead, habitat-specific reference databases are becoming the new standard. RVDB for virus references, ProGenomes2 and MGnify for a wide range of microbial communities are recent examples of such databases. Food and feed monitoring laboratories have to set up the laboratory procedures involved in sampling along the production chain, nucleic acid extraction, library preparation and sequencing. The bioinformatics analysis then must be set up correspondingly. QIAGEN CLC Genomics Workbench not only supports this approach but also provides a single point-of-entry to bioinformatics by having a universal and flexible toolset that can be executed as workflows to do large-scale analyses without having to learn how to use and install and maintain various open-source tools and databases.
Scientists at a global dairy company are using QIAGEN CLC Genomics Workbench with Oxford Nanopore sequencing technology to perform whole metagenome shotgun analysis to establish a “normal” microbiome community baseline for dairy products and to monitor deviations thereof which are associated with food spoilage. Advantages are the availability of a library preparation kit, sequencing technology, plug-and-play software, and most importantly, speed of analysis, which improves turnaround times from weeks with culture-based approaches of slow-growing, cold-adapted bacterial species to only a few days. The workflow used is depicted in Figure 2.
Figure 2. The bioinformatics workflow for the analysis of whole metagenome data. The reference data consists of known “food matrix” genomes in addition to food spoilers. The “Not annotated” reads can be de novo assembled and used as queries to find new spoilers or matrix associated reference genomes and included in the reference collection for future use. Strain typing, AMR, virulence and plasmid characterization can be easily plugged in as part of the “Iterate per taxonomy” workflow.
Learn more about QIAGEN CLC Genomics Workbench Premium.
Are you struggling to find a bioinformatics analysis tool that meets your specific research needs? One that is easy-to-use, yet powerful, scalable and flexible? We are excited to announce the launch of QIAGEN CLC Genomics 21.0, packed with new features to help you take your data analysis to the next level. QIAGEN CLC Genomics has solutions for all your sequencing, NGS and 'omics data analysis needs. Get the features that meet your research goals with our new licensing models developed for this v21 release. Our favorite new features and functions now available in v21 include:
Illumina BaseSpace integration: Data stored in Illumina BaseSpace can now be seamlessly imported into the Workbench. To get started, just install the Cloud Plugin. Illumina BaseSpace will then be available to select as an import location.
Sanger workflows: Draw end-to-end workflows for the analysis of Sanger reads, starting with on-the-fly import of trace files. If you run the trimming and assembly of forward-reverse Sanger reads in batch mode, the outputs will be named after the batch unit, or you can use advanced custom output naming patterns in workflows to include even more information in the file names. Extract consensus sequences and create alignments within the same workflow. You can now also visualize Sanger assemblies in the wrapped view.
New in the v21 release, QIAGEN CLC Genomics now has three key offerings, with packages ranging from basic (QIAGEN CLC Main Workbench) through advanced (QIAGEN CLC Genomics Workbench) to premium (QIAGEN CLC Genomics Workbench Premium), to meet your specific sequence and ‘omics data analysis needs.
QIAGEN CLC Main Workbench: For basic sequencing analysis
QIAGEN CLC Genomics Workbench: For advanced sequencing analysis
Includes all the features of the QIAGEN CLC Main Workbench, plus:
QIAGEN CLC Genomics Workbench Premium: Our full-featured solution
Includes all the features of the QIAGEN CLC Genomics Workbench, plus:
QIAGEN CLC Genomics Server: All CLC functionality is also available as enterprise software, which operates on any hardware server. The Genomics Analysis Portal allows sample- and workflow-centric views of analyses run on the server.
QIAGEN CLC Genomics Cloud Engine: Run CLC workflows in the cloud on data stored in your BaseSpace or AWS S3 account. Launch workflows from the CLC Genomics Workbench or Server in the cloud using the Cloud Plugin.
Learn more about the applications supported by our portfolio of QIAGEN CLC Genomics solutions, and request a consultation with one of our experts to help you find the right QIAGEN CLC toolset for your research goals.
This blog tutorial highlights several recent improvements in the latest update to QIAGEN CLC Microbial Genomics Module 20.1. The update includes improved usability in the Download Microbial Reference Database tool and improved support for long reads in Taxonomic Profiling. Some of the improvements include:
With the 20.1 update, it is now easy to customize the Microbial Reference Database to fit your needs. Here we demonstrate two use cases:
The updated downloader makes it simple to visualize phylogenetic relationships. To create a dendrogram of the four coronavirus genera, we first create a microbial database containing only coronavirus:
Approximately 200 references remained and were downloaded with a minimum contig length of 1000. The five samples with an unknown genus were included in the downloaded database.
The phylogenies of the downloaded database of assemblies can be easily visualized using Create K-mer Tree. In Create K-mer Tree, select the downloaded database of coronavirus genomes. The dendrogram shown was created with default settings, except "Only index k-mers with prefix" was left blank due to the short length of coronavirus genomes.
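To give a feel for how k-mer content can separate genomes, here is a minimal sketch that computes a Jaccard distance between the k-mer sets of two sequences. This is a generic textbook measure, not necessarily the distance used by Create K-mer Tree. The `prefix` argument only loosely mirrors the "Only index k-mers with prefix" option, and all names and sequences are illustrative assumptions.

```python
def kmers(seq, k, prefix=""):
    """Return the set of k-mers in `seq`, optionally only those starting with `prefix`."""
    return {
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if seq[i:i + k].startswith(prefix)
    }

def jaccard_distance(a, b, k=4, prefix=""):
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer content."""
    ka, kb = kmers(a, k, prefix), kmers(b, k, prefix)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)

# Toy sequences; real genomes would be read from the downloaded database.
print(jaccard_distance("ACGTACGTAC", "ACGTACGTAC"))  # identical content
print(jaccard_distance("AAAAAAAA", "CCCCCCCC"))      # nothing shared
print(sorted(kmers("ACGTAC", 3, prefix="A")))        # prefix subsampling
```

Restricting to a prefix keeps only a fraction of the k-mers. That saves memory on large genomes, but on short viral genomes it can leave too few k-mers to compare, which is consistent with leaving the option blank here.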
Figure 1 shows a circular dendrogram with added genus metadata. For ease of viewing, 50% of both the alphacoronavirus and betacoronavirus genomes have been excluded from the tree.
In the tree, the five references without a genus are selected and their branches are shown in dark blue. From the tree, we can see that three of these references cluster with the betacoronavirus, one clusters with the alphacoronavirus and one clusters between alphacoronavirus and gammacoronavirus.
This highlights a quick and easy way to download a database of viral genomes, and how to use the database to create a phylogeny. The phylogeny can then be used to resolve samples of unknown genus.
Create K-mer tree also works with reads. In the next section, we demonstrate how to create a taxonomic profile with metagenome samples.
With the recent updates to the Download Microbial Reference Database and Taxonomic Profiling functions in QIAGEN CLC Microbial Genomics Module, it is now fast and easy to detect coronavirus presence in metagenome samples containing only a few virus reads. Taxonomic profiling now also supports long reads such as those generated by Oxford Nanopore and PacBio sequencing technologies.
For first-time setup, we create a viral database:
All complete virus genomes to date, approximately 18,500, remained and were downloaded with a minimum contig length of 1000.
The downloaded database was used to create a taxonomic profiling index using default settings.
The analysis can be carried out in a simple workflow using the curated Microbial Reference Database and human genome to create a Taxonomic Profiling index for host genome filtering (Figure 2).
Results are presented from three different studies with a low fraction of viral reads (Table 1).
Virus abundance values have been aggregated to species level, and the table has been filtered to abundance >10. The % viral reads is the percentage of reads in the sample matching the virus database.
Sample | % viral reads | Species | Taxonomy | Abundance
SRR10948550 | 1.0556 | Severe acute respiratory syndrome-related coronavirus | Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus | 985
 | | Ambystoma tigrinum virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Ambystoma tigrinum virus | 39
 | | Common midwife toad virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus | 26
SRR11092061 | 0.0045 | Severe acute respiratory syndrome-related coronavirus | Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus | 1304
 | | Spodoptera frugiperda rhabdovirus | Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Rhabdoviridae; Spodoptera frugiperda rhabdovirus | 822
 | | Saccharomyces 20S RNA narnavirus | Orthornavirae; Lenarviricota; Amabiliviricetes; Wolframvirales; Narnaviridae; Narnavirus; Saccharomyces 20S RNA narnavirus | 336
 | | Stenotrophomonas virus SMA7 | Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Subteminivirus; Stenotrophomonas virus SMA7 | 126
 | | Influenza A virus | Orthornavirae; Negarnaviricota; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus | 112
 | | Nipah henipavirus | Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Paramyxoviridae; Henipavirus; Nipah henipavirus | 48
 | | Common midwife toad virus | Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus | 12
 | | Inoviridae sp | Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Inoviridae sp | 12
ERR4385803 | 0.6578 | Gokushovirus WZ-2015a | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Gokushovirus WZ-2015a | 19753
 | | Human gut gokushovirus | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Human gut gokushovirus | 3883
 | | Microviridae sp | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Microviridae sp | 1726
 | | Microviridae | Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae | 47
The negative control sample ERR4385803 correctly reports no coronavirus. The abundance of virus was correctly reported in both positive samples (Table 1).
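The post-processing applied to Table 1 (aggregate abundances to species level, then keep rows with abundance >10) can be sketched as below. The input layout, the function name `aggregate_to_species` and the toy strain-level counts are assumptions for illustration; in the Workbench this is done through the abundance table views rather than a script.

```python
from collections import defaultdict

def aggregate_to_species(rows, min_abundance=10):
    """Sum abundances per species and keep species with abundance > min_abundance.

    rows: iterable of (species, abundance) pairs, possibly several per species
          (for example one pair per strain-level reference).
    """
    totals = defaultdict(int)
    for species, abundance in rows:
        totals[species] += abundance
    return {s: n for s, n in totals.items() if n > min_abundance}

# Hypothetical strain-level counts, not taken from the samples above.
rows = [
    ("Virus A", 7),   # two strains of the same species ...
    ("Virus A", 6),   # ... together pass the threshold
    ("Virus B", 10),  # exactly 10 is filtered out (strictly greater than 10)
    ("Virus C", 250),
]
print(aggregate_to_species(rows))
```

Aggregating before filtering matters: a species whose reads are spread thinly over several strain references would otherwise be dropped.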
We've got a useful tip that will help you get even more value out of QIAGEN CLC Microbial Genomics Module when performing OTU clustering. Get the latest version of the SILVA OTU database within the QIAGEN CLC Microbial Genomics Module with minimal effort outside of QIAGEN CLC Genomics Workbench, even before the latest version is released through the Microbial Genomics Module. The SILVA databases are updated more regularly than the corresponding QIIME versions, which the downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be used with the Create Annotated Sequence List tool, with just a bit of reformatting required.
SILVA releases are available on the FTP server https://ftp.arb-silva.de/ where each release is stored in a separate folder. Here we focus on the latest release_138, more specifically the non-redundant database at 99% sequence similarity. If you are interested in another version, please consult the corresponding README file and change the surl and corresponding turl in the top of the script accordingly. To download the correct files and format it properly right away for import into the QIAGEN CLC Genomics Workbench, the following script may be used:
import gzip, urllib.request, zipfile, io, os

# Source files: SILVA sequences, SILVA-to-taxid map, and the NCBI taxonomy dump
surl="https://ftp.arb-silva.de/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
turl="https://ftp.arb-silva.de/release_138/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz"
nurl="https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"

print("Downloading "+nurl[nurl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
allowedRanks = {"superkingdom":"k__", "phylum":"p__","class":"c__","order":"o__","family":"f__","genus":"g__","species":"s__"}

def sp(line):
    # Split one line of an NCBI .dmp file into its fields
    return line.replace(b"\n",b"\t").split(b"\t|\t")

# nodes maps taxid -> [parent taxid, rank, scientific name]
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(nurl).read())) as zip_ref:
    with zip_ref.open([name for name in zip_ref.namelist() if os.path.basename(name) == "nodes.dmp"][0]) as zf:
        nodes = {sp(line)[0]:[sp(line)[1], sp(line)[2].decode("UTF-8"), ""] for line in zf}
    with zip_ref.open([name for name in zip_ref.namelist() if name.endswith("names.dmp")][0]) as zf:
        for line in zf:
            s = sp(line)
            if s[3]==b"scientific name":
                nodes[s[0]][2] = s[1].decode("UTF-8")

def getLineage(byteTaxID):
    # Walk from the taxid up to the root, filling in the allowed ranks
    lin = {r:v for r,v in allowedRanks.items()}
    pid = byteTaxID
    if pid in nodes:
        mid = nodes[pid]
        while pid != b"1" and pid != mid[0]:
            if mid[1] in allowedRanks:
                lin[mid[1]] += mid[2]
            pid = mid[0]
            mid = nodes[pid]
    return "; ".join(v for k,v in lin.items())

print("done")

oname1 = surl[surl.rfind("/")+1:].replace("fasta.gz","fa.gz")
oname2 = oname1.replace("fa.gz","txt")

# Write the taxonomy table: sequence name <tab> 7-rank lineage
print("Downloading "+turl[turl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(turl)) as gzTax, open(oname2,'w') as tO:
    next(gzTax)  # skip the header line
    tO.write("Name"+"\t"+"Taxonomy"+"\n")
    for line in gzTax:
        fields = line.strip().split(b"\t")
        tO.write(fields[0].decode("UTF-8")+"."+fields[1].decode("UTF-8")+"."+fields[2].decode("UTF-8")+"\t"+getLineage(fields[5])+"\n")
print("done")
print("Taxonomy output: "+oname2)

# Write the sequences: shorten each header to its ID and convert RNA (U) to DNA (T)
print("Downloading "+surl[surl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(surl)) as gzSilva, gzip.open(oname1,'wb') as fO:
    for line in gzSilva:
        if line.startswith(b">"):
            fO.write(line[:line.rfind(b" ", 0, line.find(b";"))]+b"\n")
        else:
            fO.write(line.replace(b"U",b"T"))
print("done")
print("Fasta output: "+oname1)
To run this script, you need a standard installation of python3. All you need to do is copy and paste the content above, modify the URL (if necessary), save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:
$ python get_silva.py
Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and performs actions on and with them:
For each of the taxids for the rRNAs, a 7-step lineage is constructed on the levels of the allowed ranks. The script writes two files to the folder where it is executed:
These two files can now be used in the Create Annotated Sequence List.
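To make the 7-step lineage concrete, the sketch below repeats the parent-walking logic of the script on a tiny in-memory taxonomy. The taxids and the deliberately incomplete tree are invented for illustration (real taxids come from the NCBI taxonomy dump), and `lineage` is a simplified stand-in for the script's `getLineage`.

```python
RANK_PREFIXES = {
    "superkingdom": "k__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}

def lineage(taxid, nodes):
    """Walk from `taxid` up to the root and fill in the seven allowed ranks.

    nodes: {taxid: (parent taxid, rank, scientific name)}
    Ranks missing from the path keep their bare prefix (for example "c__").
    """
    lin = dict(RANK_PREFIXES)
    while taxid in nodes:
        parent, rank, name = nodes[taxid]
        if rank in lin:
            lin[rank] += name
        if parent == taxid:  # the root node is its own parent
            break
        taxid = parent
    return "; ".join(lin.values())

# Invented taxids; class, order and family are left out on purpose.
nodes = {
    "1": ("1", "no rank", "root"),
    "10": ("1", "superkingdom", "Bacteria"),
    "20": ("10", "phylum", "Firmicutes"),
    "30": ("20", "genus", "Listeria"),
    "40": ("30", "species", "Listeria monocytogenes"),
}
print(lineage("40", nodes))
# k__Bacteria; p__Firmicutes; c__; o__; f__; g__Listeria; s__Listeria monocytogenes
```

Each line of the taxonomy output file pairs a sequence name with one such lineage string.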
Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?
For questions about this or other tips, tricks or functionalities related to QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at bioinformaticssales@qiagen.com.
The proGenomes2 project is a set of over 85,000 consistently annotated bacterial and archaeal genomes from over 12,000 species which provides a set of reference genomes across taxonomies and specific habitats, such as disease and food-related pathogens, and microbes from aquatic and soil environments. These databases offer excellent starting points for taxonomic profiling as they are unbiased and aim to span the diversity of the specific habitats. Unfortunately, the databases are not in a format that may be used directly within QIAGEN CLC Genomics Workbench, but with scripting, you can produce similar databases from within QIAGEN CLC using the proGenomes2 fasta files as a starting point. The headers of the proGenomes2 databases are constructed in the following way:
<taxid>.<biosample>.<nucleotide_id>
We use the biosample ID to find a set of assemblies in NCBI which we can download with the ‘Download Microbial Reference Database’ tool, including all information required for taxonomic profiling. First we need to find the desired database from http://progenomes.embl.de/data/, e.g. the sediment_mud specific database (but any other progenomes2 database hosted at this URL will work, replacing the definition of "URL" in the script below). With the following simple script we can stream the headers of that (gzipped) fasta file into the unique biosample IDs and use NCBI’s Eutils API to translate them into a set of unique assembly IDs and finally collect them into a file:
import sys, time, gzip, urllib.request
import xml.etree.ElementTree as ET

url="http://progenomes.embl.de/data/habitats/representatives.sediment_mud.contigs.fasta.gz"

# Stream the fasta file and collect the unique biosample IDs from the headers
print("Downloading "+url[url.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(url)) as f:
    l = list({ line.decode("UTF-8").split(".")[1] for iline, line in enumerate(f) if line.startswith(b">")})
print("Done")

def request(query):
    # Fetch and parse an XML response, retrying a few times on failure
    i = 0
    while True:
        try:
            return ET.fromstring(urllib.request.urlopen(query).read().decode("utf-8"))
        except Exception as e:
            if i > 5:
                print("Could not reach: "+query+"\nCheck connection: "+str(e))
                sys.exit(1)
            time.sleep(1)
            i+=1

# Translate biosample IDs into assembly accessions, 50 IDs per query
assemblies = set()
interval=50
for ibiosample in range(0,len(l),interval):
    biosample = "+OR+".join(bs for bs in l[ibiosample:min(ibiosample+interval,len(l))])
    base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    rparse = request(base + "esearch.fcgi?db=assembly&term="+biosample+"[biosample]&usehistory=y")
    query2 = base+"esummary.fcgi?db=assembly&query_key="+rparse.find("QueryKey").text+"&WebEnv="+rparse.find("WebEnv").text
    for res in request(query2).findall(".//AssemblyAccession"):
        assemblies.add(res.text[:res.text.find(".")])  # drop the version suffix
    print("Getting Assembly IDs from NCBI: {:.2f}%".format(min(ibiosample+interval, len(l))*100/len(l)),end="\r" if ibiosample+interval<len(l) else "\n")

ofname = url[url.rfind("/")+1:].replace(".fasta.gz",".txt")
print("Writing Assembly IDs to output file "+ofname)
with open(ofname , 'w') as f:
    for assembly in sorted(assemblies):
        f.write(assembly+"\n")
To run this script, you need a standard installation of python3. All you need to do is copy and paste the content above, modify the URL (if necessary), save it to a file and execute it on your system. For example, you may save the file as "get_assembly_ids.py", then open a terminal and navigate to the folder where the script is located. Finally, execute it with:
$ python get_assembly_ids.py
Running this script takes about 2 minutes (for the sediment_mud database), depending on your internet connection. The output will be a file called "representatives.sediment_mud.contigs.txt" placed in the same folder as the script containing the assembly IDs from which the respective progenomes2 database has been created (if you changed the URL, the name of the output file would be changed accordingly).
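The two core steps of the script, pulling the biosample ID out of each header and grouping the IDs into batches of 50 for the NCBI queries, can be tried offline with the sketch below. The helper names and the generated headers are hypothetical, and no request is sent to NCBI.

```python
def biosample_from_header(header):
    """Return the biosample field of a '<taxid>.<biosample>.<nucleotide_id>' header."""
    # maxsplit=2 keeps any dots inside the nucleotide ID intact.
    taxid, biosample, nucleotide_id = header.lstrip(">").strip().split(".", 2)
    return biosample

def batched_terms(biosamples, interval=50):
    """Yield '+OR+'-joined search terms covering `interval` IDs each."""
    for start in range(0, len(biosamples), interval):
        yield "+OR+".join(biosamples[start:start + interval])

# Invented headers; real files carry genuine taxids, biosample and contig IDs.
headers = [f">999.SAMN{i:08d}.contig_{i}.1" for i in range(120)]
ids = sorted({biosample_from_header(h) for h in headers})
terms = list(batched_terms(ids))
print(len(ids), "IDs in", len(terms), "batches")
print(terms[-1][:40] + " ...")
```

Batching keeps each query URL to a manageable length and reduces the number of round trips compared with one request per biosample.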
This file can now be used from within QIAGEN CLC Genomics Workbench (with the Microbial Genomics Module installed). Select Toolbox → Microbial Genomics Module → Databases → Taxonomic analysis → Download Microbial Reference Database and select "Create custom database" and "Include all" sequences.
After clicking next, it is possible to supply a file with Assembly Accession IDs. Select the file "representatives.sediment_mud.contigs.txt" we have just created and click "Finish".
This will create a "Database builder" where the assemblies from "representatives.sediment_mud.contigs.txt" have been selected and staged for download. The Database builder gives an overview of the selected references and provides an estimate of the download size.
By clicking "Download Selection" the download process is started and a sequence list is saved to the selected location. From this sequence list, a Taxonomic Profiling index can be constructed by running the Create Taxonomic Profiling Index tool.
Learn more about how QIAGEN CLC Genomics and QIAGEN CLC Genomics Workbench with the Microbial Genomics Module are powerful and scalable solutions to support all your genomics analysis needs.
The Centers for Disease Control and Prevention (CDC) released their report, Antibiotic Resistance Threats in the United States, 2019, on November 13, 2019. It shows that antibiotic-resistant bacteria and fungi cause more than 2.8 million infections and 35,000 deaths in the United States each year. These numbers are striking: on average, someone in the US gets an antibiotic-resistant infection every 11 seconds, and someone dies from one every 15 minutes. Check out the coverage on Twitter by following #CDCARThreats.
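The "every 11 seconds" and "every 15 minutes" figures follow directly from the annual totals, as this short calculation shows. It assumes a 365-day year and treats the reported lower bounds as exact counts.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600

infections_per_year = 2_800_000
deaths_per_year = 35_000

seconds_per_infection = SECONDS_PER_YEAR / infections_per_year
minutes_per_death = MINUTES_PER_YEAR / deaths_per_year

print(f"One infection every {seconds_per_infection:.1f} seconds")  # 11.3
print(f"One death every {minutes_per_death:.1f} minutes")          # 15.0
```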
Nevertheless, data from the new report show progress in fighting these infections. Since 2013, prevention efforts have reduced deaths from antibiotic-resistant infections by 18% overall and by nearly 30% in hospitals. Rapid detection and prevention strategies in communities have helped protect people from two community-associated germs: vaccines have helped reduce infections from Streptococcus pneumoniae in many at-risk groups, and the cases of drug-resistant tuberculosis (TB) in the United States remain stable due to effective TB control strategies.
However, CDC is concerned about antibiotic-resistant infections that are on the rise including:
These new data show that continued vigilance is needed to maintain the progress seen thus far. Further preventing infections and stopping the spread of germs will save more lives.
QIAGEN offers tools and solutions to support public health epidemiology, clinical microbiology research and basic microbial genomics research. QIAGEN CLC Microbial Genomics Module offers unique and valuable features and functionalities to help advance research of microbial infections and their prevention. These capabilities include:
Learn more about the QIAGEN CLC Microbial Genomics Module and check out the details of how this tool can support you in the fight against emerging antimicrobial-resistant (AMR) pathogens.
QIAGEN is committed to supporting advanced research into the underlying drivers of antimicrobial resistance. Earlier in 2019, as a statement of our commitment, we were the first bioinformatics company to join the joint United Nations - CDC Global AMR Challenge. Read more about our commitment and the new QMI-AR database here.
References:
CDC (2019). Antibiotic Resistance Threats in the United States, 2019. Atlanta, GA: U.S. Department of Health and Human Services, CDC.
A host of new features help you scale your research, and allow you to ramp up your productivity by taking your multi-sample analyses to the next level:
Figure 1. The ‘Iterate’ and ‘Collect and Distribute’ control elements allow batching over sections of the workflow. In this example, fastq files from a two-level factorial RNA-seq experiment performed in triplicate can be analyzed in a single workflow. The reads are trimmed, quality controlled (QC’ed) and the RNA-seq analysis reads are mapped, sample by sample. Then the RNA-seq expression levels are compared among groups, and comparisons are collected to create heat maps, Venn diagrams and PCA plots. Finally, trimming, QC and RNA-seq analysis read mapping reports are combined across samples. The workflow was used to analyze data from De Maio et al. (2016), comparing the transcriptional profile (RNA-seq) of Dengue virus 2 and mock infected human cells at 24 and 36 hours post-infection. The samples (accessions) are described in a CLC metadata table according to infection status and time point prior to workflow execution.
Figure 2. With the ‘Combined Reports’ tool you can gain a quick overview of the main results in your analysis. In this case, the GC-content has been summarized from the QC reports of 12 RNA-seq samples from De Maio et al. (2016).
Figure 3. Minimum Spanning Tree produced by QIAGEN CLC Microbial Genomics Module.
QIAGEN CLC Genomics Workbench now supports even more QIAseq UMI-based library preparation kits and panels, via a series of new ready-to-use workflows accessible through the Biomedical Genomics Analysis plugin, including:
View all supported QIAseq panels here.
Don't miss our on-demand webinar where we review these latest features of the QIAGEN CLC Genomics Workbench 20, and discuss:
References:
De Maio F.A. et al. (2016). The Dengue virus NS5 protein intrudes in the cellular spliceosome and modulates splicing. PLoS Pathog. 12(8):e1005841.
In our recent white paper we describe how to investigate the functional potential of a microbial community in a polar desert in Antarctica using metagenomic shotgun sequencing data. In the original paper (1), the authors supplemented their microbiome data with qPCR analyses to investigate the expression of the most interesting genes discovered in the functional profiles to support their hypothesis that the microbial community survive by scavenging atmospheric trace gases. However, what if they had instead included RNA-seq transcriptomic data to evaluate gene activity in their samples? In this post, we show you how to add transcriptomics data to a microbiome survey using the tools of CLC Genomics Workbench.
The example below presents a de novo assembly-based approach to metatranscriptomic analysis using CLC Genomics Workbench and the Microbial Genomics Module. There are, in fact, multiple approaches to metatranscriptomic data analysis, depending on the specific questions you may have. For a deeper review of best practices in metatranscriptomic analysis, we recommend Bashiardes et al. (2), or published examples where CLC Genomics Workbench was used for metatranscriptomics research; some interesting recent examples include studies of the honey bee (3) and termite (4) microbiomes and their associated metatranscriptomes.
The example metatranscriptomic pipeline presented below consists of two parts (shown in Figure 1). Part 1 includes: assembling the metagenome; grouping contigs into bins to reconstruct the microbial genomes; and finding and annotating genes. Part 1 is also described in further detail in our recent white paper on Antarctic microbiome profiling. When comparing metatranscriptomes from multiple samples, a common approach is to create a “co-assembly” across your samples that serves as a single reference list of contigs and genes for the downstream RNA-seq analysis. A good example of this approach can be found in Marynowska et al. (4).
Part 2 of the analysis pipeline involves adding the transcriptomic data to supplement the metagenomic survey with information on gene activity. Part 2 is the focus of this post and will be described below.
CLC Genomics Workbench includes a suite of tools designed for analyzing gene expression data. For this blog post, we will use only a few of them. The RNA-Seq Analysis tool starts by mapping reads to the genome and the coding sequences. The tool requires a file with the reference genome and a file with annotations for protein coding sequences (CDS) or genes. If these are not already available from Part 1 of the pipeline (Figure 1), they can be generated using Track Tools -> Track Conversion -> Convert to Tracks. This takes an annotated genome or list of contigs as input and generates individual track files. Additional details on this conversion step can be found in our manual. In this case, we need to generate one track for the genome and one for the annotated coding regions. From the read mappings, reads are categorized and assigned, and expression values are calculated. The output from the RNA-Seq Analysis tool is a table listing, for each gene, the number of reads mapped, the number of reads per kilobase of gene, and the expression value. The results can be visualized in a track list along with the genes and the read mappings (Figure 2). The track list is interactively linked to the results table, so marking a CDS of interest in the table view shifts the focus of the track list to that particular region.
From the track view, read mappings can be manually inspected by zooming in on individual genes (Figure 3). In the case of the desert soil microbiome in Antarctica, genes supporting the use of atmospheric trace gases as carbon and energy sources could be looked up in the table and their expression values inspected.
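To make the columns of such an expression table more concrete, here is a minimal Python sketch of how reads per kilobase and two widely used normalized expression values (RPKM and TPM) can be derived from read counts and gene lengths. The post does not say which expression value the tool reports by default, so treat the formulas as standard textbook definitions rather than a description of the tool's internals. The gene identifiers, counts and lengths are made up for illustration.

```python
# Illustrative only: gene identifiers, counts and lengths are hypothetical.

def reads_per_kilobase(read_count, gene_length_bp):
    """Reads mapped to a gene, divided by the gene length in kilobases."""
    return read_count / (gene_length_bp / 1000.0)


def rpkm(read_counts, gene_lengths_bp):
    """Reads per kilobase per million mapped reads, one value per gene."""
    total_reads = sum(read_counts)
    if total_reads == 0:
        return [0.0 for _ in read_counts]
    per_million = total_reads / 1_000_000.0
    return [reads_per_kilobase(count, length) / per_million
            for count, length in zip(read_counts, gene_lengths_bp)]


def tpm(read_counts, gene_lengths_bp):
    """Transcripts per million: RPK values rescaled to sum to one million."""
    rpk = [reads_per_kilobase(count, length)
           for count, length in zip(read_counts, gene_lengths_bp)]
    total_rpk = sum(rpk)
    if total_rpk == 0:
        return [0.0 for _ in rpk]
    return [value / total_rpk * 1_000_000.0 for value in rpk]


genes = ["cds_0001", "cds_0002", "cds_0003"]  # hypothetical CDS identifiers
counts = [100, 200, 700]                      # reads mapped per CDS
lengths = [1000, 2000, 3500]                  # CDS lengths in bp

for gene, count, length, value in zip(genes, counts, lengths, tpm(counts, lengths)):
    print(gene, count, reads_per_kilobase(count, length), round(value, 1))
```

Because TPM values always sum to one million within a sample, they are often easier to compare across samples than raw reads per kilobase.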
If your microbiome investigation involves comparing microbial communities at different times or under different conditions, transcriptomes can be compared across multiple states. This analysis can be performed with the tool Differential Expression for RNA-Seq. The tool performs a statistical test of the differential expression of two or more samples. The output is a table displaying, for each gene, the fold change and the p-value for the statistical comparison. From this list, genes whose expression levels change significantly under different biological conditions can be identified.
CLC Genomics Workbench contains several additional tools for analyzing RNA-Seq data, supporting more sophisticated comparisons and visualizations than those shown here. If you are interested in learning more or trying out the functionalities, you can always download a free trial.
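As a rough illustration of how such a results table can be read, the sketch below computes a log2 fold change from replicate expression values and then filters genes by fold change and p-value. It is not the statistical model used by the Differential Expression for RNA-Seq tool. The p-values are placeholders standing in for values you would take from the tool's output, and the gene identifiers, expression values, pseudocount and thresholds are all assumptions chosen for the example.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of mean expression (treated over control).

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no expression in one condition.
    """
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def select_genes(rows, max_p_value=0.05, min_abs_log2fc=1.0):
    """Keep genes passing both the p-value and the fold-change threshold."""
    return [row["gene"] for row in rows
            if row["p_value"] <= max_p_value
            and abs(row["log2fc"]) >= min_abs_log2fc]


# Hypothetical normalized expression values for three replicates per condition.
expression = {
    "cds_0001": {"control": [7.0, 6.0, 8.0], "treated": [30.0, 31.0, 32.0]},
    "cds_0002": {"control": [50.0, 52.0, 48.0], "treated": [49.0, 51.0, 50.0]},
}
# Placeholder p-values; in practice these come from the statistical test.
p_values = {"cds_0001": 0.001, "cds_0002": 0.80}

rows = [
    {"gene": gene,
     "log2fc": log2_fold_change(values["control"], values["treated"]),
     "p_value": p_values[gene]}
    for gene, values in expression.items()
]

print(select_genes(rows))
```

Combining a fold-change threshold with a p-value threshold is a common way to focus on genes whose change is both statistically supported and large enough to be biologically interesting.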
When investigating the composition of microbial communities, researchers often need to calculate and visualize the diversity within and between samples, often referred to respectively as the alpha and beta diversity of samples. Based on feedback from our users, we have added several new data visualization options for microbial diversity in the latest release of CLC Microbial Genomics Module (version 4.5), which are described in more detail below.
With QIAGEN’s CLC Microbial Genomics Module, we provide a number of different metrics for estimating the alpha diversity, including Total Number of OTUs, Chao 1, Simpson’s index, Shannon entropy, and the phylogenetic diversity. The choice of index for an analysis depends on the underlying experiments and the dataset itself, but the resulting alpha diversity estimate for one or more samples is typically visualized with a line graph similar to a receiver operating characteristic (ROC) curve. Based on feedback from our users, we have included in the latest release of CLC Microbial Genomics Module (version 4.5) the ability to also represent the alpha diversity of a sample using box plots. This new functionality has been integrated into the existing tool for calculating alpha diversity, and the box plots are generated automatically when running the tool Alpha Diversity.
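For readers who want to see what these indices measure, here is a minimal Python sketch of four of them, computed from a vector of OTU counts for a single sample. These are common textbook definitions and may differ in detail from the module's implementation. In particular, the sketch assumes the natural logarithm for Shannon entropy, the Gini-Simpson form (1 minus the sum of squared proportions) for Simpson's index, and the bias-corrected form of Chao 1. Phylogenetic diversity is omitted because it requires a phylogenetic tree. The counts are made up.

```python
import math


def richness(counts):
    """Total number of OTUs observed (count greater than zero)."""
    return sum(1 for c in counts if c > 0)


def shannon_entropy(counts):
    """Shannon entropy, here using the natural logarithm (an assumption)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Gini-Simpson form: 1 minus the sum of squared proportions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao 1 richness estimate from singletons and doubletons."""
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return richness(counts) + singletons * (singletons - 1) / (2.0 * (doubletons + 1))


otu_counts = [5, 1, 1, 2, 0]  # hypothetical OTU counts for one sample

print("richness:", richness(otu_counts))
print("shannon: ", round(shannon_entropy(otu_counts), 3))
print("simpson: ", round(simpson_index(otu_counts), 3))
print("chao1:   ", chao1(otu_counts))
```

Richness and Chao 1 respond only to how many OTUs are present (or estimated to be present), whereas Shannon entropy and Simpson's index also reflect how evenly reads are distributed among them.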
In the examples below, we used the same data from our recent white paper on the microbial diversity in a polar desert in Antarctica. Alpha diversity, estimated as the total number of OTUs at the taxonomic level of Order, is displayed in a line graph on the left and as a box plot on the right. In the left figure all samples are shown and colored by location, but any desired metadata parameter could have been chosen. In the box plot on the right, samples have been grouped by location. Individual data points and outliers can be displayed, as well as indicators for mean and median. Groups can be compared with a Kruskal-Wallis test and the p-values for any pairwise comparison displayed above the plot (as shown). In the example of the Antarctica microbiomes, the microbial diversity was significantly higher in the Dry Valleys soil as compared to the saline water in Ace Lake (p = 0.03), and the microbial diversity was significantly lower in the Dry Valleys soil as compared to the marine sediment at Adelie Basin (p = 0.03).
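The group comparison can be illustrated with a small, dependency-free sketch of the Kruskal-Wallis H statistic. It is a simplified version under stated assumptions: ties receive average ranks but no tie correction is applied, and the p-value helper covers only the two-group case, using the chi-square approximation with one degree of freedom. That approximation is rough for very small groups, and the module may compute its p-values differently. The diversity values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for two or more groups."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * weighted_rank_sums - 3.0 * (n + 1)


def two_group_p_value(h):
    """Chi-square (1 degree of freedom) upper tail; valid for two groups only."""
    return math.erfc(math.sqrt(h / 2.0))


# Hypothetical alpha diversity values (number of OTUs) for two locations.
location_a = [41, 45, 52]
location_b = [12, 18, 20]

h = kruskal_wallis_h([location_a, location_b])
print(round(h, 3), round(two_group_p_value(h), 4))
```

The test is rank based, so it makes no assumption that diversity values are normally distributed, which suits the small sample groups typical of microbiome surveys.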
CLC Microbial Genomics Module also provides several different metrics for estimating the beta diversity in a set of samples, including Bray-Curtis, Jaccard, Euclidean, and UniFrac. The latest release now enables users to display beta diversity in either a 2D or 3D PCoA plot. The beta diversity among samples from different locations in Antarctica is shown below. On the left, the beta diversity is visualized in the existing 3D PCoA plot, and on the right, in the new 2D PCoA plot. The new 2D PCoA plot is generated automatically when running the tool Beta Diversity. The data can be sorted and displayed according to any user-defined metadata. In the example below, data points are colored by location. As evident from both graphical representations, the microbial communities in Antarctica are clearly separated by geographic location.
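As a minimal sketch of what sits underneath a PCoA plot, the Python code below computes three of these between-sample distances from OTU count vectors and assembles them into a pairwise distance table. The definitions are the common textbook ones and are not taken from the module. Jaccard is computed here on presence/absence, which is an assumption. UniFrac is omitted because it needs a phylogenetic tree, and the PCoA ordination step itself is not shown. Sample names and counts are hypothetical.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical)."""
    denominator = sum(a) + sum(b)
    if denominator == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def jaccard_distance(a, b):
    """Jaccard distance on presence/absence of each OTU."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def euclidean(a, b):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(samples, metric):
    """Pairwise distances as a dict keyed by (sample, sample)."""
    names = list(samples)
    return {(p, q): metric(samples[p], samples[q]) for p in names for q in names}


# Hypothetical OTU count vectors; each position is the same OTU in every sample.
samples = {
    "sample_A": [10, 0, 5, 3],
    "sample_B": [8, 1, 6, 0],
    "sample_C": [0, 12, 0, 9],
}

matrix = distance_matrix(samples, bray_curtis)
for (p, q), distance in matrix.items():
    if p < q:
        print(p, q, round(distance, 3))
```

A PCoA plot is an ordination of exactly this kind of distance table, so samples that end up close together in the plot are those with small pairwise distances.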
There are several new features in the latest release of CLC Microbial Genomics Module. If you haven’t already done so, upgrade your installation today to take advantage of these new visualizations simply by opening. If you are new to CLC Genomics Workbench or the CLC Microbial Genomics Module, you can download the software with a free 14-day trial license here.
Enjoy!
Microbiome research presents us with an opportunity to study all microorganisms on Earth. Nonetheless, many are difficult to isolate in the lab and remain uncultured using traditional microbiology methods, despite more than 100 years of research into developing new cultivation methods. Unraveling the currently undiscovered biodiversity of microbiomes remains a major challenge in microbiology, and it is estimated that more than 99% of all microbes remain uncharacterized by traditional culture methods (1). Just 20 years ago, in 1998, Handelsman first proposed to analyze a soil microbial community without prior cultivation (2). The use of culture-independent metagenomics approaches grew rapidly once the advantages became clear, from just one publication listed in PubMed in 1998 to more than 11,000 publications now.
Metagenomic sequencing is a powerful approach to investigate the microbial diversity of complex samples, with taxonomic classification of organisms sometimes reaching strain-level precision. Shotgun metagenomics can not only reveal specific organisms in a sample, but is also a powerful approach to characterize the functional genomic profile encoded within microbiomes, and potentially to discover genes with new functions. Although the specific sample preparation, library preparation, and sequencing platform used are all important factors that influence the quality of your results, ultimately the downstream bioinformatics pipelines and reference databases used become the analysis bottleneck. With this last point in mind, we have released a new white paper describing how to carry out functional genomics characterization of unbiased shotgun metagenomics data using CLC Genomics Workbench and the add-on CLC Microbial Genomics Module.
To demonstrate the broad capabilities of our software, we re-analyzed previously published data from Mukan Ji and co-workers (3). Ji et al. investigated the surprisingly diverse microbial soil community of a polar desert in Antarctica and sought to understand how these microbes survive in such a harsh and nutrient-deficient habitat.
For an in-depth discussion of the study and their exciting findings, we recommend listening to the podcast with microbiology experts Vincent Racaniello, Michael Schmidt, Elio Schaechter, and Michelle Swanson on This Week in Microbiology, TWiM. The paper was discussed in Episode 169 – Breatharian Bacteria.
Read our white paper on functional metagenomics with CLC Genomics Workbench and the Microbial Genomics Module and learn how to reveal the functional potential of microbiomes sequenced using shotgun metagenomics methods.