CLC Microbial Genomics Module - QIAGEN Digital Insights

Food and feed safety is a main concern for food authorities, centers for disease control, departments of agriculture and public health laboratories, which monitor and act on outbreaks of foodborne pathogens such as Salmonella, Listeria, Vibrio, E. coli, Shigella, Campylobacter and Cronobacter reported by hospitals and doctors. Typically, the detective work involved in tracing an outbreak includes examining patients' pre-infection food intake, drinking water, supermarket receipts and social media postings of dining-out events. Laboratory steps involve culturing bacteria from suspected sources using specialized bacterial growth media to isolate the causal agent, followed by strain typing. Once the source of contamination has been identified, it is eliminated from the food chain, often through product recalls by the company producing the food. At this stage, the damage has been done, and the impacts are wide: consumers may fall ill or even die, brand reputation and immediate financial returns are diminished, and a lot of food is wasted. For this reason, there is a strong industry trend toward in-house facilities and expertise to establish bacterial baselines and monitor deviations from them, enabling early detection and avoiding costly outbreaks. Other use cases include verifying correct labeling and detecting fraud in food items.

In addition to culturing and PCR-based identification methods, NGS-based approaches to sample characterization are gaining traction, namely whole genome sequencing of isolates and taxonomic profiling of bacterial communities. Food quality laboratories are now routinely equipped with desktop sequencing machines from Illumina or Ion Torrent and portable devices from Oxford Nanopore to provide the sequences.

Taxonomic profiling of the bacterial community can involve sampling and sequencing DNA using whole metagenome shotgun approaches, either with or without an intercalated PCR amplification step of the 16S rRNA genes. The bioinformatics pipelines for the NGS data analysis vary according to the approach taken (Jagadeesan et al., 2019).

Whole genome sequencing

Whole genome sequencing (WGS) of isolates consists of quality control, read trimming and assembly, bacterial characterization, strain typing, antimicrobial resistance characterization, variant calling, phylogenetic analysis and visualization tasks.

A recent German consortium effort for genome-based surveillance of Salmonella enterica isolates using Illumina sequencing technology lists 15 open-source bioinformatics software tools needed for the WGS analysis (Uelze et al., 2021), whereas a recent paper by the Institute of Food Safety and Analytical Sciences, Nestlé Research, lists 13 tools for their corresponding pipeline (Barretto et al., 2021). Web-based bioinformatics services tailored to the NGS analysis of foodborne pathogens include the US-based GenomeTrakr (Timme et al., 2019) and the Danish Evergreen (Szarvas et al., 2020) pipelines. For food scientists and microbiologists who are not bioinformaticians, it is time-consuming and impractical to learn these programs, let alone install, tie together, version-control and maintain them. Instead, a single, integrated platform that is easy to use, install and maintain is much preferred. QIAGEN CLC Genomics Workbench Premium has tailored tools for all steps along the pipeline and is designed for bench scientists to use without bioinformatics expertise. Workflows that tie these steps together are also available, so that execution takes only a few mouse clicks in the graphical user interface. Scaling to enterprise levels is relatively straightforward with the QIAGEN CLC Genomics Server or QIAGEN CLC Genomics Cloud Engine software.

Figure 1. A schematic workflow for the analysis of NGS reads generated by whole genome sequencing of isolates.

The tutorial “Typing and Epidemiological Clustering of Common Pathogens” includes an example workflow for analyzing NGS data from isolated and cultivated bacterial samples using QIAGEN CLC Genomics Workbench. Using Illumina data from 47 cultured Salmonella enterica isolates, the workflow identifies the best matching reference and its taxonomy, performs NGS-based multilocus sequence typing (MLST), finds antimicrobial resistance genes, identifies potential contaminants in a sample and performs outbreak analysis based on SNP-trees. The databases needed for workflow execution are also provided and include Salmonella and Staphylococcus genome references, MLST schemes and antimicrobial resistance gene databases. The workflow can easily be adapted to other bacterial species or modified to perform other tasks or search additional databases. The tutorial also demonstrates how to work with many samples to create both k-mer trees and SNP-trees and display these in the context of metadata. Metadata can be added and displayed on trees as described in the “Phylogenetic Trees and Metadata” tutorial.
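The MLST step mentioned above boils down to looking up a sample's allele profile in a typing scheme. The following is a minimal sketch of that lookup only; the function name `assign_sequence_type`, the locus names and the profiles are hypothetical toy values, not real scheme data and not the Workbench implementation.

```python
def assign_sequence_type(alleles, scheme):
    """Return the sequence type whose allele profile matches exactly, else None."""
    # Build the sample's profile in the locus order defined by the scheme.
    profile = tuple(alleles.get(locus) for locus in scheme["loci"])
    return scheme["profiles"].get(profile)

# Toy scheme with three hypothetical loci and two hypothetical sequence types.
scheme = {
    "loci": ("geneA", "geneB", "geneC"),
    "profiles": {(1, 1, 2): "ST-1", (1, 3, 2): "ST-2"},
}

st_known = assign_sequence_type({"geneA": 1, "geneB": 3, "geneC": 2}, scheme)
st_novel = assign_sequence_type({"geneA": 4, "geneB": 3, "geneC": 2}, scheme)
print(st_known, st_novel)
```

A profile that is absent from the scheme returns None, which in practice would point to a novel allele combination.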

Taxonomic profiling of bacterial communities

For monitoring microbial communities along the food processing chain, the bacterial isolation and genome typing workflow is often impractical: the largely manual laboratory process of sample culturing does not scale well and is heavily biased toward identifying the “usual suspects” among species that can be cultured. In contrast, culture-independent approaches permit high-throughput automation while also providing unbiased information on the microbial composition of the samples. Two approaches are widely used: amplicon-based profiling and whole shotgun metagenomics.

Amplicon-based profiling

Amplicon-based profiling is based on sequencing highly conserved regions of bacterial genomes at the 16S rRNA locus (ITS for fungi), clustering the resulting NGS reads into pseudo-species called Operational Taxonomic Units (OTUs), and computing the abundance of each OTU. In reference-based OTU clustering, a database provides the taxonomy assignment for the OTUs, while OTUs can also be constructed from reads without a valid match in the reference database, providing evidence for as yet unknown bacterial (or fungal) species. The PCR amplification ensures a highly sensitive assay. With relatively few sequences, a representative and reproducible taxonomic profile of the samples can be obtained, making this approach highly cost-effective and scalable. The tutorial “OTU Clustering Using Workflows” provides a workflow for analyzing NGS data from soil samples using QIAGEN CLC Genomics Workbench and visualizing the results using zoomable sunburst and bar chart plots.
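The abundance computation described above can be illustrated with a minimal sketch. It assumes that reads have already been assigned to OTUs; the function name `otu_abundance` and the OTU labels are hypothetical and not part of QIAGEN CLC Genomics Workbench.

```python
from collections import Counter

def otu_abundance(read_to_otu):
    """Return {otu: (read_count, relative_abundance)} from a read -> OTU mapping."""
    counts = Counter(read_to_otu.values())
    total = sum(counts.values())
    return {otu: (n, n / total) for otu, n in counts.items()}

# Toy assignment of five reads to two hypothetical OTUs.
assignments = {"r1": "OTU_1", "r2": "OTU_1", "r3": "OTU_2", "r4": "OTU_1", "r5": "OTU_2"}
table = otu_abundance(assignments)
for otu, (count, fraction) in sorted(table.items()):
    print(otu, count, fraction)
```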

Whole shotgun metagenomic taxonomic profiling

A more direct approach that does not rely on PCR (and hence avoids many of the potential PCR-associated biases) is whole shotgun sequencing of metagenomic DNA followed by taxonomic profiling. This is done by mapping the reads to a representative microbiome reference database and reporting the taxonomic levels of the references to which reads map, with the percentage of reads mapped to a given reference serving as a proxy for the abundance of that species in the microbiome. Evidence for unknown microbial species is contained in the reads that do not match the reference database(s); metagenomic assembly and binning of such reads allow the construction of metagenome-assembled genomes (MAGs) that can be incorporated into one’s reference database and serve as quality markers.
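The idea of using the percentage of mapped reads as an abundance proxy can be sketched as follows. This is an illustration under simplifying assumptions (one best hit per read, a hand-made lineage table); the function `profile`, the reference names and the read hits are hypothetical and do not reproduce the “Taxonomic Profiling” tool.

```python
from collections import Counter

def profile(read_hits, ref_taxonomy, level):
    """Percentage of all reads assigned to each taxon at the given lineage depth.

    read_hits: one reference name per read, or None for an unmapped read.
    ref_taxonomy: reference name -> lineage list, e.g. [family, genus, species].
    level: index into the lineage list to aggregate on.
    """
    counts = Counter()
    for ref in read_hits:
        if ref is None:
            counts["unmapped"] += 1
        else:
            counts[ref_taxonomy[ref][level]] += 1
    total = len(read_hits)
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}

# Toy reference set (three references) and eight toy reads.
taxonomy = {
    "refA": ["Lactobacillaceae", "Lactobacillus", "Lactobacillus delbrueckii"],
    "refB": ["Streptococcaceae", "Streptococcus", "Streptococcus thermophilus"],
    "refC": ["Streptococcaceae", "Lactococcus", "Lactococcus lactis"],
}
hits = ["refA", "refB", "refB", "refC", None, "refB", "refA", "refC"]
by_family = profile(hits, taxonomy, level=0)
print(by_family)
```

The "unmapped" bucket corresponds to the reads that would be candidates for assembly and binning into MAGs.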

The tutorial “Taxonomic Profiling of Whole Shotgun Metagenomic Data” demonstrates the taxonomic analysis to monitor the effect of antibiotic treatment of two subjects’ gut microbiota in a time series experiment. For metagenomic assembly and binning of contigs, the “QC, Assemble and Bin Pangenomes” workflow template is provided in the software. Constructing and maintaining the databases is explained in the “Creating and using annotated sequences as microbial reference data” tutorial.

A common source of error in whole shotgun metagenomic approaches is reads originating from the “food matrix”. This usually means that the bulk of the reads are derived from the host genome or, in the case of fermented products, from the starter culture. Hence, a filtering step to remove these should be included in the analysis. The “Taxonomic Profiling” tool includes this optional filter. Beck et al. (2021) list 31 commonly used food and feed “matrix filtering genomes” that should be used as “decoy” reference(s) in this step. Including these matrix references will also reduce false-positive findings and speed up the read mapping step, as exact matches of reads to a reference are found much faster than approximate matches.
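Conceptually, matrix filtering separates reads whose best hit is a decoy reference from the rest. The sketch below assumes the best hit per read is already known; `filter_matrix_reads` and the toy hits are hypothetical, and the actual tool performs this during read mapping.

```python
def filter_matrix_reads(read_hits, matrix_refs):
    """Split (read_id, best_hit) pairs into kept reads and matrix-derived reads."""
    kept, removed = [], []
    for read_id, best_hit in read_hits:
        (removed if best_hit in matrix_refs else kept).append(read_id)
    return kept, removed

# Toy decoy set for a fermented dairy product: host plus starter culture.
matrix = {"Bos taurus", "Streptococcus thermophilus"}
hits = [
    ("r1", "Bos taurus"),
    ("r2", "Pseudomonas fluorescens"),
    ("r3", None),  # read without any hit stays in the analysis
    ("r4", "Streptococcus thermophilus"),
    ("r5", "Bos taurus"),
]
kept, removed = filter_matrix_reads(hits, matrix)
print(kept, removed)
```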

The secret to success

The choice of reference data is key to the success of taxonomic profiling approaches using NGS. If a given species in the microbiome is not represented in the reference data, this will lead to false-negative findings. If a species is not present yet reads originating from this species map to the genome of an unrelated species with similar genomic regions, this will lead to false-positive findings. This can happen if the reference databases used to perform taxonomic profiling are not representative of the habitat studied. In the food safety NGS area, false-negatives will result in overlooked problems, and false-positives will trigger unnecessary alerts. For this reason, generic reference databases that try to capture the entire (rarefied) tree of life, regardless of habitat, may be a poor choice. Instead, habitat-specific reference databases are becoming the new standard. RVDB for virus references, ProGenomes2 and MGnify for a wide range of microbial communities are recent examples of such databases. Food and feed monitoring laboratories have to set up the laboratory procedures involved in sampling along the production chain, nucleic acid extraction, library preparation and sequencing. The bioinformatics analysis then must be set up correspondingly. QIAGEN CLC Genomics Workbench not only supports this approach but also provides a single point-of-entry to bioinformatics by having a universal and flexible toolset that can be executed as workflows to do large-scale analyses without having to learn how to use and install and maintain various open-source tools and databases.

Use case and example workflow

Scientists at a global dairy company are using QIAGEN CLC Genomics Workbench with Oxford Nanopore sequencing technology to perform whole metagenome shotgun analysis to establish a “normal” microbiome community baseline for dairy products and to monitor deviations thereof which are associated with food spoilage. Advantages are the availability of a library preparation kit, sequencing technology, plug-and-play software, and most importantly, speed of analysis, which improves turnaround times from weeks with culture-based approaches of slow-growing, cold-adapted bacterial species to only a few days. The workflow used is depicted in Figure 2.

Figure 2. The bioinformatics workflow for the analysis of whole metagenome data. The reference data consists of known “food matrix” genomes in addition to food spoilers. The “Not annotated” reads can be de novo assembled and used as queries to find new spoilers or matrix associated reference genomes and included in the reference collection for future use. Strain typing, AMR, virulence and plasmid characterization can be easily plugged in as part of the “Iterate per taxonomy” workflow.
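Monitoring deviations from a baseline requires some measure of how far a sample's profile is from the “normal” one. The source does not state which measure is used in this use case; as an assumption, the sketch below uses the Bray-Curtis dissimilarity, a common choice for comparing abundance profiles. The function name and the toy profiles are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles (0 = identical)."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    den = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return num / den if den else 0.0

# Toy profiles in percent: the sample contains a taxon absent from the baseline.
baseline = {"Lactococcus": 70, "Streptococcus": 30}
sample = {"Lactococcus": 50, "Streptococcus": 30, "Pseudomonas": 20}

deviation = bray_curtis(baseline, sample)
print(deviation)
```

A threshold on this value (to be chosen empirically) could then trigger a closer look at the sample.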

References:

Barretto et al. (2021) Genome sequencing applied to pathogen source tracking in food industry: Key considerations for robust bioinformatics data analysis and reliable results interpretation. Genes 12, 275.

Beck et al. (2021) Monitoring the microbiome for food safety and quality using deep shotgun sequencing. NPJ Sci Food 5, 3.

Jagadeesan et al. (2019) The use of next generation sequencing for improving food safety: Translation into practice. Food Microbiology 79, 96.

Szarvas et al. (2020) Large-scale automated phylogenomic analysis of bacterial isolates and the Evergreen Online platform. Communications Biology 3, 137.

Timme et al. (2019) Utilizing the public GenomeTrakr database for foodborne pathogen traceback. Methods Mol Biol 1918, 201.

Uelze et al. (2021) Toward an Integrated Genome-Based Surveillance of Salmonella enterica in Germany. Frontiers in Microbiology 12, 200.

Check out the new features of QIAGEN CLC Genomics

Are you struggling to find a bioinformatics analysis tool that meets your specific research needs? One that is easy-to-use, yet powerful, scalable and flexible? We are excited to announce the launch of QIAGEN CLC Genomics 21.0, packed with new features to help you take your data analysis to the next level. QIAGEN CLC Genomics has solutions for all your sequencing, NGS and 'omics data analysis needs. Get the features that meet your research goals with our new licensing models developed for this v21 release. Our favorite new features and functions now available in v21 include:

Import reads from Illumina BaseSpace or Amazon S3, using the Cloud Plugin
Build end-to-end Sanger sequencing workflows, from trace data to consensus alignments: Sanger assemblies can now also be visualized with read wrapping
Name workflow outputs automatically based on metadata or batch identifiers

Illumina BaseSpace integration

Data stored in Illumina BaseSpace can now be seamlessly imported into the Workbench. To get started, just install the Cloud Plugin. Illumina BaseSpace will then be available to select as an import location.

Sanger workflows

Draw end-to-end workflows for the analysis of Sanger reads, starting with on-the-fly import of trace files. If you run the trimming and assembly of forward-reverse Sanger reads in batch mode, the outputs will be named after the batch unit – or you can use advanced custom output naming patterns in workflows to include even more information in the file names. Extract consensus sequences and create alignments within the same workflow. You can now also visualize Sanger assemblies in the wrapped view.

New in the v21 release, QIAGEN CLC Genomics now has three key offerings, with packages ranging from basic (QIAGEN CLC Main Workbench) through advanced (QIAGEN CLC Genomics Workbench) to premium (QIAGEN CLC Genomics Workbench Premium), to meet your specific sequence and ‘omics data analysis needs.

QIAGEN CLC Main Workbench: For basic sequencing analysis

Primer design
Multiple sequence alignment tools
Phylogenetic analysis tools
Sanger sequencing analysis: Workflow enabled with v21
Molecular cloning
Gene expression analysis
3D molecular modeling
Supports most sequence formats, including Vector NTI
Workflow editor
Whole genome alignment: The v21 release includes several improvements to visualizations and functionality to help you more easily gain insights into your microbial genome research. Read more here.

QIAGEN CLC Genomics Workbench: For advanced sequencing analysis

Includes all the features of the QIAGEN CLC Main Workbench, plus:

Supports de novo assembly of NGS reads
Supports all organisms
Resequencing analysis and variant calling
Long read analysis (PacBio, Oxford Nanopore): For the v21 release, the Long Read Support plugin now offers full functionality and a range of tools for working with long, error-prone reads, such as the long reads typically produced by PacBio or Oxford Nanopore sequencing technologies.
RNA-seq (including miRNA and lncRNA), ChIPseq, DNA methylation
Biomedical genomics analyses
Haplotype calling: (Expected release: June, 2021) Allows direct import, export and validation of variants and supports phasing information and delivers variant locus, allele variants, haplotype alleles and haplotypes.
QIAseq panel analysis workflows
Download data stored in your BaseSpace or AWS S3 account using the Cloud Plugin.

QIAGEN CLC Genomics Workbench Premium: Our full-featured solution

Includes all the features of the QIAGEN CLC Genomics Workbench, plus:

QIAGEN CLC Microbial Genomics Module, for:
- Microbial typing
- Antimicrobial resistance
- Metagenomics characterization
- Outbreak and strain typing analysis
QIAGEN CLC Genome Finishing Module for assembling and finishing of genomes
The new QIAGEN CLC Single Cell Analysis Module released in this v21 launch enables analysis from raw FASTQ files or imported count matrices to clusters of cells with annotated cell types and differentially expressed genes. Visualize data from over a million cells at once.

QIAGEN CLC Genomics Server: All CLC functionality is also available as enterprise software, which operates on any hardware server. The Genomics Analysis Portal allows sample- and workflow-centric views of analyses run on the server.

QIAGEN CLC Genomics Cloud Engine: Run CLC workflows in the cloud on data stored in your BaseSpace or AWS S3 account. Launch workflows from the CLC Genomics Workbench or Server in the cloud using the Cloud Plugin.

Learn more about the applications supported by our portfolio of QIAGEN CLC Genomics solutions, and request a consultation with one of our experts to help you find the right QIAGEN CLC toolset for your research goals.

Using viral reference databases for phylogeny construction and taxonomic profiling of samples with low viral load

This blog tutorial highlights several recent improvements in the latest update to QIAGEN CLC Microbial Genomics Module 20.1. The update includes improved usability in the Download Microbial Reference Database tool and improved support for long reads in Taxonomic Profiling. Some of the improvements include:

Faster load times for the selection table, which now loads in just seconds
Full access to the latest assemblies from NCBI with a taxonomy-aware download selection
No deduplication: The tool no longer removes duplicate sequences, as this functionality has been moved to Create Taxonomic Profiling Index

With the 20.1 update, it is now easy to customize the Microbial Reference Database to fit your needs. Here we demonstrate two use cases:

Visualizing phylogenetic relationships of all coronavirus genomes
Creating a taxonomic profiling index of all viral genomes and carrying out taxonomic profiling of viral metagenome samples containing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in a few simple steps

Visualizing phylogenetic relationships made easy

The updated downloader makes it simple to visualize phylogenetic relationships. To create a dendrogram of the four coronavirus genera, we first create a microbial database containing only coronavirus:

Run the Download Microbial Reference Database tool to load the Database builder
Filter the table to show only entries where the Taxonomy column contains 'coronaviridae'
Aggregate rows on Genus; we observe five samples for which no genus is given
Use Quick Selection: "Complete genomes in RefSeq" to quickly select all complete coronavirus genomes

Approximately 200 references remained and were downloaded with a minimum contig length of 1000. The five samples with an unknown genus were included in the downloaded database.

The phylogenies of the downloaded database of assemblies can be easily visualized using Create K-mer Tree. In Create K-mer Tree, select the downloaded database of coronavirus genomes. The dendrogram shown was created with default settings, except "Only index k-mers with prefix" was left blank due to the short length of coronavirus genomes.
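To give an intuition for alignment-free, k-mer based comparison of genomes, the sketch below computes a Jaccard distance on shared k-mers. This is one common distance of this kind and is shown as an assumption for illustration; it is not necessarily what Create K-mer Tree computes, and the function names and toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(seq_a, seq_b, k=3):
    """1 - (shared k-mers / all k-mers) for the k-mer sets of two sequences."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Toy sequences; real genomes would use a much larger k.
d_same = jaccard_distance("ACGTACGT", "ACGTACGT")
d_diff = jaccard_distance("ACGTACGT", "ACGTTTTT")
print(d_same, d_diff)
```

A matrix of such pairwise distances is what a tree-building method (for example neighbor joining) would take as input.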

Figure 1 shows a circular dendrogram with added genus metadata. For ease of viewing, 50% of both the alphacoronavirus and betacoronavirus genomes have been excluded from the tree.

In the tree, the five references without a genus are selected and their branches are shown in dark blue. From the tree, we can see that three of these references cluster with the betacoronavirus, one clusters with the alphacoronavirus and one clusters between alphacoronavirus and gammacoronavirus.

This highlights a quick and easy way to download a database of viral genomes, and how to use the database to create a phylogeny. The phylogeny can then be used to resolve samples of unknown genus.

Figure 1. Dendrogram of the four coronavirus genera.

Create K-mer Tree also works with reads. In the next section, we demonstrate how to create a taxonomic profile with metagenome samples.

Create a taxonomic profiling index and detect abundance of coronavirus in metagenome samples with low coronavirus copy number

With the recent updates to the Download Microbial Reference Database and Taxonomic Profiling functions in QIAGEN CLC Microbial Genomics Module, it is now fast and easy to detect coronavirus presence in metagenome samples containing only a few virus reads. Taxonomic profiling now also supports long reads such as those generated by Oxford Nanopore and PacBio sequencing technologies.

For the first-time setup, we create a viral database:

Run the Download Microbial Reference Database tool to load the Database builder
Filter the table to show only entries where the Taxonomy column contains 'virae' - we skip the remaining virus kingdom in the interest of speed
Use Quick Selection: "Complete genomes in RefSeq" to quickly select all complete viral genomes

All complete virus genomes to date, approximately 18,500, remained and were downloaded with a minimum contig length of 1000.

The downloaded database was used to create a taxonomic profiling index using default settings.

The analysis can be carried out in a simple workflow using the curated Microbial Reference Database and human genome to create a Taxonomic Profiling index for host genome filtering (Figure 2).

Results are presented from three different studies with a low fraction of viral reads (Table 1).

SRR10948550: Long read sequencing using Oxford Nanopore (1)
SRR11092061: Paired end sequencing using Illumina HiSeq 3000 (2)
ERR4385803: Paired end sequencing using Illumina HiSeq 2500 (gut virome sample - negative for SARS-CoV-2)

Virus abundance values have been aggregated to species level, and the table has been filtered to abundance >10. The % viral reads is the percentage of reads in the sample matching the virus database.

Table 1. Abundances for the different samples (results have been aggregated to species level)

Sample	% viral reads	Species	Taxonomy	Abundance
SRR10948550	1.0556	Severe acute respiratory syndrome-related coronavirus	Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus	985
		Ambystoma tigrinum virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Ambystoma tigrinum virus	39
		Common midwife toad virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus	26
SRR11092061	0.0045	Severe acute respiratory syndrome-related coronavirus	Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus	1304
		Spodoptera frugiperda rhabdovirus	Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Rhabdoviridae; Spodoptera frugiperda rhabdovirus	822
		Saccharomyces 20S RNA narnavirus	Orthornavirae; Lenarviricota; Amabiliviricetes; Wolframvirales; Narnaviridae; Narnavirus; Saccharomyces 20S RNA narnavirus	336
		Stenotrophomonas virus SMA7	Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Subteminivirus; Stenotrophomonas virus SMA7	126
		Influenza A virus	Orthornavirae; Negarnaviricota; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus	112
		Nipah henipavirus	Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Paramyxoviridae; Henipavirus; Nipah henipavirus	48
		Common midwife toad virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus	12
		Inoviridae sp	Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Inoviridae sp	12
ERR4385803	0.6578	Gokushovirus WZ-2015a	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Gokushovirus WZ-2015a	19753
		Human gut gokushovirus	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Human gut gokushovirus	3883
		Microviridae sp	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Microviridae sp	1726
		Microviridae	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae	47

The negative control sample ERR4385803 correctly reports no coronavirus. The abundance of virus was correctly reported in both positive samples (Table 1).
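The aggregation and filtering applied to Table 1 can be sketched as follows. The sketch assumes each row carries a '; '-separated lineage whose last element is the species; the function name `aggregate_to_species` and all lineages and counts in the example are made-up toy values, not data from the samples above.

```python
from collections import defaultdict

def aggregate_to_species(rows, min_abundance=10):
    """Sum abundances per species and keep species with abundance > min_abundance.

    rows: iterable of (lineage, abundance) pairs, where lineage is a
    '; '-separated string whose last element is treated as the species name.
    """
    totals = defaultdict(int)
    for lineage, abundance in rows:
        species = lineage.split("; ")[-1]
        totals[species] += abundance
    return {species: n for species, n in totals.items() if n > min_abundance}

# Toy rows: two references of "Species A" are summed, "Species C" is filtered out.
rows = [
    ("FamilyX; GenusX; Species A", 7),
    ("FamilyX; GenusX; Species A", 5),
    ("FamilyY; GenusY; Species B", 26),
    ("FamilyY; GenusY; Species C", 4),
]
summary = aggregate_to_species(rows)
print(summary)
```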

References:

1. Zhou, P. et al. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579 (7798): 270-273.
2. Chan, J.F.W. et al. (2020) A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. The Lancet 395 (10223): 514-523.

We've got a useful tip that will help you get even more value out of QIAGEN CLC Microbial Genomics Module when performing OTU clustering. Get the latest version of the SILVA OTU database within the QIAGEN CLC Microbial Genomics Module with minimal effort outside of QIAGEN CLC Genomics Workbench, even before the latest version is released through the Microbial Genomics Module. The SILVA databases are updated more regularly than the corresponding QIIME versions, which the downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be used with the Create Annotated Sequence List tool, with just a bit of reformatting required.

SILVA releases are available on the FTP server https://ftp.arb-silva.de/ where each release is stored in a separate folder. Here we focus on the latest release, release_138, and more specifically on the non-redundant database at 99% sequence similarity. If you are interested in another version, please consult the corresponding README file and change surl and the corresponding turl at the top of the script accordingly. To download the correct files and format them for import into the QIAGEN CLC Genomics Workbench, the following script may be used:

import gzip, urllib.request, zipfile, io, shutil, os
surl="https://ftp.arb-silva.de/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva.fasta.gz"
turl="https://ftp.arb-silva.de/release_138/Exports/taxonomy/taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz"
nurl="https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip"
print("Downloading "+nurl[nurl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
allowedRanks = {"superkingdom":"k__", "phylum":"p__","class":"c__","order":"o__","family":"f__","genus":"g__","species":"s__"}
def sp(line):
    # Split one line of an NCBI taxonomy .dmp file into its tab-pipe-tab separated fields
    return line.replace(b"\n",b"\t").split(b"\t|\t")
with zipfile.ZipFile(io.BytesIO(urllib.request.urlopen(nurl).read())) as zip_ref:
    with zip_ref.open([name for name in zip_ref.namelist() if os.path.basename(name) == "nodes.dmp"][0]) as zf:
        nodes = {sp(line)[0]:[sp(line)[1], sp(line)[2].decode("UTF-8"), ""] for line in zf}
    with zip_ref.open([name for name in zip_ref.namelist() if name.endswith("names.dmp")][0]) as zf:
        for line in zf:
            s = sp(line)
            if s[3]==b"scientific name":
                nodes[s[0]][2] = s[1].decode("UTF-8")
def getLineage(byteTaxID):
    lin = {r:v for r,v in allowedRanks.items()}
    pid = byteTaxID
    if pid in nodes:
        mid = nodes[pid]
        while pid != b"1" and pid != mid[0]:
            if mid[1] in allowedRanks:
                lin[mid[1]] += mid[2]
            pid = mid[0]
            mid = nodes[pid]
    return "; ".join(v for k,v in lin.items())
print("done")
oname1 = surl[surl.rfind("/")+1:].replace("fasta.gz","fa.gz")
oname2 = oname1.replace("fa.gz","txt")
print("Downloading "+turl[turl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
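with gzip.GzipFile(fileobj=urllib.request.urlopen(turl)) as gzTax, open(oname2,'w') as tO:
    next(gzTax)  # skip the header line of the SILVA taxmap file
    tO.write("Name"+"\t"+"Taxonomy"+"\n")
    for line in gzTax:
        # named "cols" so that the sp() helper defined above is not shadowed
        cols = line.strip().split(b"\t")
        # sequence name = accession.start.stop; the sixth column holds the NCBI taxid
        tO.write(cols[0].decode("UTF-8")+"."+cols[1].decode("UTF-8")+"."+cols[2].decode("UTF-8")+"\t"+getLineage(cols[5])+"\n")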
print("done")
print("Taxonomy output: "+oname2)
print("Downloading "+surl[surl.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(surl)) as gzSilva, gzip.open(oname1,'wb') as fO:
    for line in gzSilva:
        if line.startswith(b">"):
            fO.write(line[:line.rfind(b" ", 0, line.find(b";"))]+b"\n")
        else:
            fO.write(line.replace(b"U",b"T"))
print("done")
print("Fasta output:    "+oname1)

To run this script, you need a standard installation of python3. All you need to do is copy and paste the content above, modify the URL (if necessary), save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:

$ python3 get_silva.py

Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and performs actions on and with them:

The most recent NCBI Taxonomy: taxdmp.zip. The script loads the taxids, parent ids, ranks and names of the taxonomy into memory.
Taxonomy mappings from SILVA: taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz. The script uses this file to get the mapping from the SILVA names to taxids in the NCBI taxonomy. Note that the SILVA database is updated biannually whereas the corresponding NCBI taxonomy is updated daily, so there is not always a one-to-one correspondence between the final taxonomies and the original SILVA taxonomies.
The SILVA rRNA database: SILVA_138_SSURef_NR99_tax_silva.fasta.gz. The script strips the provided taxonomies from this file, keeps the names and translates U to T.

For each of the taxids for the rRNAs, a 7-step lineage is constructed on the levels of the allowed ranks. The script writes two output files to the folder where it is executed:

SILVA_138_SSURef_NR99_tax_silva.fa.gz: Fasta file with the rRNA sequences and the sequence names in the header
SILVA_138_SSURef_NR99_tax_silva.txt: A tab-separated file connecting the name of an rRNA sequence to its taxonomy in QIIME format
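To make the 7-step lineage concrete, the sketch below walks a tiny hand-made taxonomy from a taxid up to the root and fills the seven rank slots in the same 'k__; p__; ...' layout. The node IDs, the intermediate "species group" node and the function name `lineage` are hypothetical; the real script reads these relations from the NCBI taxdmp files.

```python
RANK_PREFIX = {"superkingdom": "k__", "phylum": "p__", "class": "c__",
               "order": "o__", "family": "f__", "genus": "g__", "species": "s__"}

def lineage(taxid, nodes):
    """Walk parent links up to the root and fill in the seven allowed ranks.

    nodes: taxid -> (parent_taxid, rank, scientific_name)
    """
    filled = dict(RANK_PREFIX)      # start with empty 'k__', 'p__', ... slots
    while taxid in nodes:
        parent, rank, name = nodes[taxid]
        if rank in filled:          # ranks outside the seven are skipped
            filled[rank] += name
        if parent == taxid:         # the root points to itself
            break
        taxid = parent
    return "; ".join(filled.values())

# Toy taxonomy with made-up IDs.
nodes = {
    1: (1, "no rank", "root"),
    10: (1, "superkingdom", "Bacteria"),
    20: (10, "phylum", "Firmicutes"),
    30: (20, "class", "Bacilli"),
    40: (30, "order", "Lactobacillales"),
    50: (40, "family", "Streptococcaceae"),
    60: (50, "genus", "Streptococcus"),
    65: (60, "species group", "hypothetical group"),
    70: (65, "species", "Streptococcus thermophilus"),
}
full = lineage(70, nodes)
genus_only = lineage(60, nodes)
print(full)
print(genus_only)
```

A taxid that resolves only to genus level leaves the species slot as a bare "s__", which is how incomplete lineages appear in the output file.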

These two files can now be used in the Create Annotated Sequence List tool.

Import the SILVA_138_SSURef_NR99_tax_silva.fa.gz file using a standard import, or drag and drop the file into the CLC Genomics Workbench
Run the Create Annotated Sequence List on the resulting CLC file in the Workbench and click “Next”
Select SILVA_138_SSURef_NR99_tax_silva.txt as taxonomy file
Set the similarity percentage to 99% (if you have selected the NR99 version of SILVA, otherwise this should be adjusted)
Click “Next” and in the “Select input file and map columns to attributes” under Parsing select Separator as “Tab”
Click "Next" and "Finish"

Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?

For questions about this or other tips, tricks or functionalities related to QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at bioinformaticssales@qiagen.com.

Disclaimer: QIAGEN does not support the SILVA databases constructed this way, and the information provided in this article is given without any warranty, expressed or implied. Users are solely responsible for the application of any code or information provided. The SILVA databases version 138 are free for academic and commercial use under the Creative Commons Attribution 4.0 (CC-BY 4.0) license.

The proGenomes2 project is a set of over 85,000 consistently annotated bacterial and archaeal genomes from over 12,000 species which provides a set of reference genomes across taxonomies and specific habitats, such as disease and food-related pathogens, and microbes from aquatic and soil environments. These databases offer excellent starting points for taxonomic profiling as they are unbiased and aim to span the diversity of the specific habitats. Unfortunately, the databases are not in a format that can be used directly within QIAGEN CLC Genomics Workbench, but with some scripting, you can produce similar databases from within QIAGEN CLC using the proGenomes2 FASTA files as a starting point. The sequence headers in the proGenomes2 FASTA files are dot-separated, and their second field is the biosample ID.

We use the biosample ID to find a set of assemblies in NCBI which we can download with the ‘Download Microbial Reference Database’ tool, including all information required for taxonomic profiling. First we need to find the desired database at http://progenomes.embl.de/data/, e.g. the sediment_mud specific database (any other proGenomes2 database hosted at this URL will also work if you replace the definition of "url" in the script below). The following simple script streams that (gzipped) FASTA file, reduces its headers to the unique biosample IDs, uses NCBI’s E-utilities API to translate them into a set of unique assembly IDs, and finally collects them into a file:

import sys, time, gzip, urllib.request
import xml.etree.ElementTree as ET

# proGenomes2 database to process; replace this URL to use another database
url = "http://progenomes.embl.de/data/habitats/representatives.sediment_mud.contigs.fasta.gz"

# Stream the gzipped FASTA file and keep the unique biosample IDs,
# i.e. the second dot-separated field of every header line
print("Downloading " + url[url.rfind("/")+1:] + " may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(url)) as f:
    biosamples = list({line.decode("UTF-8").split(".")[1] for line in f if line.startswith(b">")})
print("Done")

def request(query):
    """Fetch a URL and parse the XML reply, retrying on errors before giving up."""
    failures = 0
    while True:
        try:
            return ET.fromstring(urllib.request.urlopen(query).read().decode("utf-8"))
        except Exception as e:
            if failures > 5:
                print("Could not reach: " + query + "\nCheck connection: " + str(e))
                sys.exit(1)
            time.sleep(1)
            failures += 1

# Translate the biosample IDs into assembly IDs, 50 biosamples per query
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
assemblies = set()
interval = 50
for start in range(0, len(biosamples), interval):
    term = "+OR+".join(biosamples[start:start+interval])
    # esearch keeps the hits on the NCBI history server ...
    rparse = request(base + "esearch.fcgi?db=assembly&term=" + term + "[biosample]&usehistory=y")
    # ... and esummary fetches the summaries of those hits
    query2 = base + "esummary.fcgi?db=assembly&query_key=" + rparse.find("QueryKey").text + "&WebEnv=" + rparse.find("WebEnv").text
    for res in request(query2).findall(".//AssemblyAccession"):
        assemblies.add(res.text[:res.text.find(".")])  # drop the version suffix
    done = min(start + interval, len(biosamples))
    print("Getting Assembly IDs from NCBI: {:.2f}%".format(done*100/len(biosamples)), end="\r" if done < len(biosamples) else "\n")

# Write the sorted assembly IDs to a text file, one per line
ofname = url[url.rfind("/")+1:].replace(".fasta.gz", ".txt")
print("Writing Assembly IDs to output file " + ofname)
with open(ofname, "w") as f:
    for assembly in sorted(assemblies):
        f.write(assembly + "\n")

To run this script, you need a standard installation of Python 3. All you need to do is copy and paste the content above, modify the URL (if necessary), save it to a file and execute it on your system. For example, you may save the file as "get_assembly_ids.py", then open a terminal and navigate to the folder where the script is located. Finally, execute it with (the interpreter may be called python3 on your system):

$ python get_assembly_ids.py

Running this script takes about 2 minutes (for the sediment_mud database), depending on your internet connection. The output is a file called "representatives.sediment_mud.contigs.txt", placed in the same folder as the script, containing the assembly IDs from which the respective proGenomes2 database has been created (if you changed the URL, the name of the output file changes accordingly).
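Before handing the file to the Workbench, it can be worth checking that every line looks like an assembly accession without its version suffix, since the script strips that suffix. The sketch below assumes the usual NCBI form of GCA_ or GCF_ followed by nine digits. The helper name `invalid_accessions` and the sample values are illustrative and not part of the original script.

```python
import re

# Assumed pattern: NCBI assembly accession without version, e.g. GCF_000005845
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}$")

def invalid_accessions(lines):
    """Return the non-empty lines that do not look like assembly accessions."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not ACCESSION_RE.match(line)]

# Illustrative input; in practice pass the opened output file instead,
# e.g. open("representatives.sediment_mud.contigs.txt")
sample = ["GCF_000005845\n", "GCA_000001405\n", "SAMN02604091\n", "\n"]
print(invalid_accessions(sample))
```

An empty list means every line matched the assumed pattern.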

This file can now be used from within QIAGEN CLC Genomics Workbench (with the Microbial Genomics Module installed). Select Toolbox → Microbial Genomics Module → Databases → Taxonomic analysis → Download Microbial Reference Database and select "Create custom database" and "Include all" sequences.

After clicking "Next", it is possible to supply a file with Assembly Accession IDs. Select the file "representatives.sediment_mud.contigs.txt" we have just created and click "Finish".

This will create a "Database builder" where the assemblies from "representatives.sediment_mud.contigs.txt" have been selected and staged for download. The Database builder gives an overview of the selected references and provides an estimate of the download size.

By clicking "Download Selection" the download process is started and a sequence list is saved to the selected location. From this sequence list, a Taxonomic Profiling index can be constructed by running the Create Taxonomic Profiling Index tool.

Learn more about how QIAGEN CLC Genomics and QIAGEN CLC Genomics Workbench with the Microbial Genomics Module are powerful and scalable solutions to support all your genomics analysis needs.

Disclaimer: QIAGEN does not support the proGenomes2 databases, and the information provided in this article is given without any warranty, expressed or implied. Users are solely responsible for the application of any code or information provided. The proGenomes2 databases are free for academic use. For any intended commercial use, please refer to http://progenomes.embl.de/other.cgi.

The Centers for Disease Control and Prevention (CDC) released its report Antibiotic Resistance Threats in the United States, 2019 on November 13, 2019, showing that antibiotic-resistant bacteria and fungi cause more than 2.8 million infections and 35,000 deaths in the United States each year. These striking figures mean that, on average, someone in the US gets an antibiotic-resistant infection every 11 seconds, and someone dies from one every 15 minutes. Check out the coverage on Twitter by following #CDCARThreats.
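The per-second and per-minute rates follow directly from the yearly totals. As a quick check, assuming a 365-day year:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
MINUTES_PER_YEAR = 365 * 24 * 60       # 525,600

infections_per_year = 2_800_000  # "more than 2.8 million infections"
deaths_per_year = 35_000

seconds_per_infection = SECONDS_PER_YEAR / infections_per_year
minutes_per_death = MINUTES_PER_YEAR / deaths_per_year

print(f"One infection every {seconds_per_infection:.1f} seconds")  # 11.3
print(f"One death every {minutes_per_death:.1f} minutes")          # 15.0
```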

Nevertheless, data from the new report show progress in fighting these infections. Since 2013, prevention efforts have reduced deaths from antibiotic-resistant infections by 18% overall and by nearly 30% in hospitals. Rapid detection and prevention strategies in communities have helped protect people from two community-associated germs: vaccines have helped reduce infections from Streptococcus pneumoniae in many at-risk groups, and the cases of drug-resistant tuberculosis (TB) in the United States remain stable due to effective TB control strategies.

However, CDC is concerned about antibiotic-resistant infections that are on the rise including:

More than half a million resistant gonorrhea infections occur each year, which is twice as many as reported in 2013. Gonorrhea-causing bacteria have developed resistance to all but one class of antibiotics, and half of all infections are resistant to at least one antibiotic.
Extended-spectrum beta-lactamase (ESBL)-producing Enterobacteriaceae are one of the leading causes of death from resistant germs. They make urinary tract infections harder to treat, especially in women, and could undo progress made in hospitals if allowed to spread there.
Erythromycin-resistant group A Streptococcus infections have quadrupled since the 2013 report. If resistance continues to grow, infections and deaths could rise.

These new data show that continued vigilance is needed to maintain the progress seen thus far. Further preventing infections and stopping the spread of germs will save more lives.

QIAGEN offers tools and solutions to support public health epidemiology, clinical microbiology research and basic microbial genomics research. QIAGEN CLC Microbial Genomics Module offers unique and valuable features and functionalities to help advance research of microbial infections and their prevention. These capabilities include:

QIAGEN’s Microbial Insights AR database (QMI-AR), integrating multiple AMR databases into a single curated resource of over 5000 genes
Exclusive research-use access to ARESdb from ARES-Genetics, a database of over 2000 AMR markers obtained from phenotypic testing of over 11,000 clinical isolates of resistant pathogens
Microbiome taxonomic profiling
Advanced tools for typing of microbial genomes
Antimicrobial resistance characterization
De novo assembly of isolates and metagenomes
Functional metagenomics
Quick and easy reference database customization

Learn more about the QIAGEN CLC Microbial Genomics Module and check out the details of how this tool can support you in the fight against emerging antimicrobial resistant (AMR) pathogens.

QIAGEN is committed to supporting advanced research into the underlying drivers of antimicrobial resistance. Earlier in 2019, as a statement of our commitment, we were the first bioinformatics company to join the joint United Nations - CDC Global AMR Challenge. Read more about our commitment and the new QMI-AR database here.

References:

CDC (2019). Antibiotic Resistance Threats in the United States, 2019. Atlanta, GA: U.S. Department of Health and Human Services, CDC.

Check out some of the many new features delivered in the QIAGEN CLC solutions

QIAGEN CLC Genomics Workbench 20.0:

A host of new features help you scale your research, and allow you to ramp up your productivity by taking your multi-sample analyses to the next level:

Start workflows directly from raw sequence data: No need to import FASTQ files first.
Advanced batching made simple: Your study may produce many samples, but don’t worry: we’ve got you covered on the analysis. Workflows make it easy to streamline the analysis of large numbers of samples, especially when many tools are involved. When running workflows in 'Batch' mode, it is now possible to match inputs up with each other (e.g., a number of case samples and matched controls) and analyze the entire dataset in the same run. Furthermore, workflows can now be built with ‘Iterate’ and ‘Collect and Distribute’ control elements (Figure 1, light green), which allow customized batching and aggregation across batch units within a single workflow, while reducing the need for manual interaction and optimizing resource consumption when executed on QIAGEN CLC Genomics Server (see below).

**Figure 1.** The ‘Iterate’ and ‘Collect and Distribute’ control elements allow batching over sections of the workflow. In this example, fastq files from a two-level factorial RNA-seq experiment performed in triplicate can be analyzed in a single workflow. The reads are trimmed, quality controlled (QC’ed) and the RNA-seq analysis reads are mapped, sample by sample. Then the RNA-seq expression levels are compared among groups, and comparisons are collected to create heat maps, Venn diagrams and PCA plots. Finally, trimming, QC and RNA-seq analysis read mapping reports are combined across samples. The workflow was used to analyze data from De Maio et al. (2016), comparing the transcriptional profile (RNA-seq) of Dengue virus 2 and mock infected human cells at 24 and 36 hours post-infection. The samples (accessions) are described in a CLC metadata table according to infection status and time point prior to workflow execution.

Work smarter with metadata: Use metadata as a versatile and convenient way to help you organize your samples and results within the workbench. Metadata can help you find objects, define the grouping of inputs in batching, direct samples to different paths in a workflow (e.g., in a tumor-normal or trio study) or be used in statistical analyses and visualization for RNA-seq. All you need to get started is an Excel spreadsheet. Quickly retrieve the results you want, even for large batches of samples. Metadata tables now organize the workflow results so you can quickly find the answers you need.
Automatically export reports from workflows in pdf or JSON format: Using JSON formatted results enables advanced users to programmatically parse the reports and create custom reports, and fully integrate their CLC workflows into existing systems. Export reports and combined reports directly from workflows in pdf or JSON formats (Figure 1, dark blue elements) from QIAGEN CLC Genomics Workbench 20.0 or QIAGEN CLC Genomics Server 20.0, with the option to include a history log for file provenance.
Combining reports: When executing multi-step workflows and batching over multiple samples, each step and each sample will create multiple reports, a situation that quickly generates information overload. Gain a quick overview of crucial QC parameters and main results by combining reports and results across tools and samples, using the new ‘Combine Reports’ tool, which is fully compatible with the advanced batching functionalities (Figure 1, light blue workflow elements, and example output report in Figure 2). Reports from over 20 NGS-related tools, including biomedical and microbial tools, are supported, as well as statistics over variant tracks.

**Figure 2.** With the ‘Combine Reports’ tool you can gain a quick overview of the main results in your analysis. In this case, the GC-content has been summarized from the QC reports of 12 RNA-seq samples from De Maio et al. (2016).

The QIAGEN CLC Genomics Workbench 20.0 and QIAGEN CLC Genomics Server 20.0 also feature updates to many tools, as well as significant performance improvements over previous versions. You can find the full list of latest improvements here.
Additional content can be added to the comprehensive toolbox in QIAGEN CLC Genomics Workbench 20.0 and QIAGEN CLC Genomics Server 20.0 by installing feature-rich modules and plugins. Both the free Biomedical Genomics Analysis plugin and the QIAGEN CLC Microbial Genomics Module have been updated substantially for this release (see below).
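To illustrate the kind of scripted post-processing that the JSON report export makes possible, here is a minimal sketch that flattens any nested JSON document into path/value pairs. The structure of the reports is not described here, so the example content and its field names are invented for illustration.

```python
import json

def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}/{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node

# Hypothetical report content, invented for illustration only
example = json.loads('{"sample": "S1", "qc": {"gc_percent": 48.2, "reads": [1000, 980]}}')
rows = dict(flatten(example))
for path, value in rows.items():
    print(f"{path}\t{value}")
```

Because the function makes no assumption about field names, it can be pointed at an exported report first to discover which paths are worth extracting into a custom summary.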

QIAGEN CLC Main Workbench:

The new ‘Biomolecule Generator’ tool makes it possible to generate or extract biomolecules based on symmetry information in PDB files.
A homology model of a sequence can be created in just two steps, using the new ‘Find and Model Structure’ tool. The tool identifies suitable protein templates from the Protein Data Bank (PDB) and automatically builds a structure model for a given input sequence. From the resulting table, a structure model of the sequence can be created with one click.
Molecule structures in a Molecule Project can be exported to a PDB file.
These new tools apply to both QIAGEN CLC Main Workbench and QIAGEN CLC Genomics Workbench. You can find the full list of latest improvements here.

QIAGEN CLC Genomics Server:

A new workflow-queuing option has been introduced, so that workflows utilizing advanced batching functionalities can be executed efficiently in a multi-node environment.
All new features now available with the other QIAGEN CLC products mentioned in this blog are also applicable to QIAGEN CLC Genomics Server.

QIAGEN CLC Microbial Genomics Module:

With whole genome sequencing revolutionizing clinical microbiology, MLST of microbial genomes is rapidly becoming the standard. QIAGEN CLC Microbial Genomics Module now includes cg/wgMLST in addition to the interactive minimum spanning tree visualization of MLSTs for outbreak analysis (Figure 3). The tools also provide direct access to pubMLST.org and other online public databases with internationally recognized schemas. Collectively, these tools provide researchers with total flexibility in one tool set for the analysis of isolates, regardless of whether the genome is viral, bacterial or fungal. You can find the full list of latest improvements here.
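For readers unfamiliar with minimum spanning trees over MLST data, the idea can be sketched in a few lines. This is not the module's implementation. It assumes that the distance between two isolates is the number of loci with different allele numbers, ignores missing alleles, and uses invented isolate names and profiles.

```python
def allele_distance(a, b):
    """Number of loci at which two allele profiles differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of (isolate, isolate, distance) edges."""
    names = list(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge from the tree to an isolate that is not yet connected
        dist, u, v = min(
            (allele_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree for v in names if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges

# Invented allele profiles (one allele number per locus)
profiles = {
    "isolate_A": [1, 4, 2, 7, 3],
    "isolate_B": [1, 4, 2, 7, 5],
    "isolate_C": [1, 4, 9, 7, 5],
    "isolate_D": [6, 8, 9, 2, 5],
}
tree = minimum_spanning_tree(profiles)
for u, v, dist in tree:
    print(f"{u} - {v}: {dist} allele difference(s)")
```

Isolates joined by short edges differ at few loci, which is why such trees are a convenient way to spot candidate outbreak clusters.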

Biomedical Genomics Analysis Plugin:

QIAGEN CLC Genomics Workbench now supports even more QIAseq UMI-based library preparation kits and panels, via a series of new ready-to-use workflows accessible through the Biomedical Genomics Analysis plugin, including:

The QIAseq Multimodal Panels are supported in a single-workflow solution.
The QIAseq Fusion XP Panels are supported, including variant calling, fusion detection and expression quantification.
Easy analysis of the QIAseq MSI Booster Panel in hg19 and hg38 – a new MSI workflow is provided, making it possible to create a shared baseline for multiple samples.
The QIAseq Methylation Panel and QIAseq Methyl Library Kit are now supported, including differential methylation-level calling.
Additional improvements include: Improved reporting and de-multiplexing for the QIAseq 3' UPX solutions, better detection of gene fusions with new visualizations, and integration of fusion and CNV calling with QCI Interpret through export of CNV and fusion call results in VCF format. You can find the full list of latest improvements here.

View all supported QIAseq panels here.

Don't miss our on-demand webinar where we review these latest features of the QIAGEN CLC Genomics Workbench 20, and discuss:

One-click solutions and expert tools for NGS data analysis
Working with reads from various platforms (Illumina, IonTorrent, Oxford Nanopore, Pacific Biosciences, BGI/MGI)
Tailored solutions for RNA-seq, DNA-seq and methylation
Efficient algorithms for read trimming, mapping, de novo assembly and variant calling
Effective management of reference data
Scalable processing of many samples, with advanced workflow and reporting capabilities
Easy installation on Windows, Mac and Linux

References:

De Maio F.A. et al. (2016). The Dengue virus NS5 protein intrudes in the cellular spliceosome and modulates splicing. PLoS Pathog. 12(8):e1005841.

Analysis of microbiome transcriptomes

In our recent white paper we describe how to investigate the functional potential of a microbial community in a polar desert in Antarctica using metagenomic shotgun sequencing data. In the original paper (1), the authors supplemented their microbiome data with qPCR analyses to investigate the expression of the most interesting genes discovered in the functional profiles, in support of their hypothesis that the microbial community survives by scavenging atmospheric trace gases. However, what if they had instead included RNA-seq transcriptomic data to evaluate gene activity in their samples? In this post, we show you how to add transcriptomics data to a microbiome survey using the tools of CLC Genomics Workbench.

**Figure 1. Tools from CLC Genomics Workbench and CLC Microbial Genomics Module used in the analysis pipeline.**

The example below presents a de novo assembly-based approach to metatranscriptomic analysis using CLC Genomics Workbench and the Microbial Genomics Module. There are, in fact, multiple approaches to performing metatranscriptomics data analysis, depending on the specific questions you may have. For a deeper review of best practices in metatranscriptomics analysis, we recommend Bashiardes et al. (2), or published examples where CLC Genomics Workbench was used for metatranscriptomics research; some recent interesting examples include studies of the honey bee (3) and termite (4) microbiomes and their associated metatranscriptomes.

The example metatranscriptomic pipeline presented below consists of two parts (shown in Figure 1). Part 1 includes assembling the metagenome, grouping contigs into bins to reconstruct the microbial genomes, and finding and annotating genes. It is described in further detail in our recent white paper on Antarctic microbiome profiling. When comparing metatranscriptomes from multiple samples, a common approach is to create a “co-assembly” across your samples that serves as a single reference list of contigs and genes for the downstream RNA-seq analysis. A good example of this approach can be found in Marynowska et al. (4).

Part 2 of the analysis pipeline involves adding the transcriptomic data to supplement the metagenomic survey with information on gene activity. Part 2 is the focus of this post and will be described below.

Combining RNA-Seq Data with Existing Metagenomics Data

CLC Genomics Workbench includes a suite of tools designed for analyzing gene expression data. For this blog post, we will use only a few of them. The RNA-Seq Analysis tool starts by mapping reads to the genome and the coding sequences. The tool requires a file with the reference genome and a file with annotations for protein coding sequences (CDS) or genes. If these are not already available from Part 1 of the pipeline (Figure 1), they can be generated using Track Tools -> Track Conversion -> Convert to Tracks. This tool takes an annotated genome or list of contigs as input and generates individual track files. Additional details on this conversion step can be found in our manual. In this case we need to generate one track for the genome and one for the annotated coding regions. From the read mappings, reads are categorized and assigned, and expression values are calculated. The output from the RNA-Seq Analysis tool is a table listing, for each gene, the number of reads mapped, the number of reads per kilobase of gene, and the expression value. The results can be visualized in a track list along with the genes and the read mappings (Figure 2). The track list is interactively linked to the results table, so marking a CDS of interest in the table view shifts the focus of the track list to that particular region.
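To make the columns of the results table more concrete, the sketch below computes reads per kilobase and one common length-normalized expression value, transcripts per million (TPM), from read counts and gene lengths. Which expression value the tool reports depends on its settings and is not specified here, so TPM is an assumption, and the gene names and numbers are invented.

```python
def reads_per_kilobase(read_count, gene_length_bp):
    """Read count normalized by gene length in kilobases."""
    return read_count / (gene_length_bp / 1000)

def tpm(counts, lengths_bp):
    """Transcripts per million: reads per kilobase, scaled to sum to 1e6."""
    rpk = {gene: reads_per_kilobase(counts[gene], lengths_bp[gene]) for gene in counts}
    scale = sum(rpk.values()) / 1_000_000
    return {gene: (value / scale if scale else 0.0) for gene, value in rpk.items()}

# Invented example: mapped read counts and CDS lengths in base pairs
counts = {"cds_1": 500, "cds_2": 1500, "cds_3": 0}
lengths_bp = {"cds_1": 1000, "cds_2": 3000, "cds_3": 2000}

expression = tpm(counts, lengths_bp)
for gene in counts:
    rpk = reads_per_kilobase(counts[gene], lengths_bp[gene])
    print(f"{gene}\treads={counts[gene]}\tRPK={rpk:.1f}\tTPM={expression[gene]:.1f}")
```

In this example cds_1 and cds_2 end up with the same expression value even though cds_2 has three times as many reads, because it is also three times as long.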

From the track view read mappings can be manually inspected by zooming in on individual genes (Figure 3). In the case of the desert soil microbiome in Antarctica, genes supporting the use of atmospheric trace gases as carbon and energy sources could be searched out from the table, and the expression values inspected.

**Figure 3. Track list displaying read mappings.**

If your microbiome investigation involves comparing microbial communities at different times or under different conditions, transcriptomes can be compared across multiple states. This analysis can be performed with the tool Differential Expression for RNA-Seq. The tool performs a statistical test of the differential expression of two or more samples. The output is a table displaying, for each gene, the fold change and the p-value for the statistical comparison. From this list, genes whose expression levels change significantly under different biological conditions can be found.
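Once such a table has been exported, picking out the changing genes is a simple filter. The sketch below assumes each row holds a gene name, a fold change expressed as a positive ratio between two conditions, and a p-value. The thresholds, the row layout and the values are illustrative and do not reflect the tool's actual export format.

```python
import math

def significant_genes(rows, max_p=0.05, min_abs_log2_fc=1.0):
    """Return names of genes passing both the p-value and fold change cutoffs."""
    selected = []
    for gene, fold_change, p_value in rows:
        log2_fc = math.log2(fold_change)  # assumes fold_change > 0
        if p_value <= max_p and abs(log2_fc) >= min_abs_log2_fc:
            selected.append(gene)
    return selected

# Invented rows: (gene, fold change as a ratio, p-value)
rows = [
    ("gene_a", 4.0, 0.001),   # up-regulated and significant
    ("gene_b", 1.2, 0.0005),  # significant, but small change
    ("gene_c", 0.2, 0.03),    # down-regulated and significant
    ("gene_d", 8.0, 0.2),     # large change, not significant
]
print(significant_genes(rows))
```

Working on the log2 scale treats up- and down-regulation symmetrically, so a ratio of 0.25 counts as the same size of change as a ratio of 4.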

CLC Genomics Workbench contains several additional tools for analyzing RNA-Seq data, allowing more sophisticated comparisons and visualizations than those shown here. If you are interested in learning more or trying out the functionalities, you can always download a free trial.

References

1. Ji M., et al. (2017) Atmospheric trace gases support primary production in Antarctic desert surface soil. Nature 552(7685):400–3.
2. Bashiardes S., et al. (2016) Use of Metatranscriptomics in Microbiome Research. Bioinform Biol Insights 10:19–25.
3. Schoonvaere K., et al. (2018) Study of the Metatranscriptome of Eight Social and Solitary Wild Bee Species Reveals Novel Viruses and Bee Parasites. Front Microbiol. 9:177.
4. Marynowska M., et al. (2017) Optimization of a metatranscriptomic approach to study the lignocellulolytic potential of the higher termite gut microbiome. BMC Genomics 18(1):681. doi: 10.1186/s12864-017-4076-9.

New visualizations for diversity

When investigating the composition of microbial communities, researchers often need to calculate and visualize the diversity within and between samples, often referred to respectively as the alpha and beta diversity of samples. Based on feedback from our users, we have added several new data visualization options for microbial diversity in the latest release of CLC Microbial Genomics Module (version 4.5), which are described in more detail below.

Alpha diversity visualizations

With QIAGEN’s CLC Microbial Genomics Module, we provide a number of different metrics for estimating the alpha diversity, including Total Number of OTUs, Chao 1, Simpson’s index, Shannon entropy, and the phylogenetic diversity. The choice of index for an analysis often depends on the underlying experiments and the dataset itself, but the resulting alpha diversity estimate for one or more samples is often visualized with a line graph similar to a receiver operating characteristic curve. Based on feedback from our users, we have included in the latest release of CLC Microbial Genomics Module (version 4.5) the ability to also represent the alpha diversity of a sample using box plots. This new functionality has been integrated into the existing tool for calculating alpha diversity, and the box plots are generated automatically when running the tool Alpha Diversity.
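For orientation, several of these indices can be computed from a vector of OTU counts in a few lines. This is a sketch of textbook definitions, not the module's code: the log base for Shannon entropy (base 2), the Simpson variant (1 minus the sum of squared proportions) and the bias-corrected form of Chao 1 are assumptions, and the counts are invented.

```python
import math

def observed_otus(counts):
    """Total number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon_entropy(counts):
    """Shannon entropy in bits (log base 2 is an assumption)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson's index as 1 - sum(p_i^2) (one of several common variants)."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def chao1(counts):
    """Bias-corrected Chao 1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed_otus(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Invented OTU counts for one sample
sample = [5, 1, 1, 2, 0]
print("OTUs:", observed_otus(sample))
print("Shannon:", round(shannon_entropy(sample), 3))
print("Simpson:", round(simpson_index(sample), 3))
print("Chao 1:", chao1(sample))
```

Chao 1 exceeds the observed number of OTUs whenever the sample contains more than one singleton, reflecting taxa that were probably present but not sampled.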

In the examples below, we used the same data from our recent white paper on the microbial diversity in a polar desert in Antarctica. Alpha diversity, estimated as the total number of OTUs at the taxonomic level of Order, is displayed in a line graph on the left and as a box plot on the right. In the left figure all samples are shown and colored by location, but any desired metadata parameter could have been chosen. In the box plot on the right, samples have been grouped by location. Individual data points and outliers can be displayed, as well as indicators for mean and median. Groups can be compared with a Kruskal-Wallis test and the p-values for any pairwise comparison displayed above the plot (as shown). In the example of the Antarctica microbiomes, the microbial diversity was significantly higher in the Dry Valleys soil as compared to the saline water in Ace Lake (p = 0.03), and the microbial diversity was significantly lower in the Dry Valleys soil as compared to the marine sediment at Adelie Basin (p = 0.03).

Beta diversity visualizations

CLC Microbial Genomics Module also provides several different metrics for estimating the beta diversity in a set of samples, including Bray-Curtis, Jaccard, Euclidean, and UniFrac. The latest release now enables users to display beta diversity in either a 2D or 3D PCoA plot. Below is shown the beta diversity among samples from different locations in Antarctica. On the left, the beta diversity is visualized in the existing 3D PCoA plot, and on the right, the diversity is visualized in the new 2D PCoA plot. The new 2D PCoA plot will be generated automatically when running the tool Beta Diversity. The data can be sorted and displayed with any user defined metadata. In the example below, data points are colored by location. As evident from both graphical representations, the microbial communities in Antarctica are clearly separated by geographic location.
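As a reminder of what some of these metrics measure, the sketch below computes Bray-Curtis, Jaccard and Euclidean distances between two abundance vectors. These are the standard textbook definitions rather than the module's implementation (UniFrac is omitted because it needs a phylogenetic tree), and the sample counts are invented.

```python
import math

def bray_curtis(a, b):
    """Sum of absolute abundance differences divided by total abundance."""
    total = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / total if total else 0.0

def jaccard_distance(a, b):
    """1 - shared taxa / all taxa present, using presence/absence only."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0

def euclidean(a, b):
    """Straight-line distance between the two abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented abundances of four taxa in two samples
sample_1 = [10, 0, 5, 5]
sample_2 = [5, 5, 0, 10]
print("Bray-Curtis:", bray_curtis(sample_1, sample_2))
print("Jaccard:", jaccard_distance(sample_1, sample_2))
print("Euclidean:", euclidean(sample_1, sample_2))
```

A PCoA plot is then built from the matrix of such pairwise distances across all samples.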

There are several new features in the latest release of CLC Microbial Genomics Module. If you haven’t already done so, upgrade your installation today to take advantage of these new visualizations simply by opening. If you are new to CLC Genomics Workbench or the CLC Microbial Genomics Module, you can download the software with a free 14-day trial license here.

Enjoy!

Functional metagenomics analysis of environmental microbiomes: A new white paper for the Microbial Genomics Module of CLC Genomics Workbench

Microbiome research presents us with an opportunity to study all microorganisms on Earth. Nonetheless, many are difficult to isolate in the lab and remain uncultured using traditional microbiology methods, despite more than 100 years of research into developing new cultivation methods. Unraveling the currently undiscovered biodiversity of microbiomes remains a major challenge in microbiology, and it is estimated that more than 99% of all microbes remain uncharacterized by traditional culture methods (1). Just 20 years ago, in 1998, Handelsman first proposed to analyze a soil microbial community without prior cultivation (2). The use of culture-independent metagenomics approaches grew rapidly once the advantages became clear, from just one publication listed in PubMed in 1998 to more than 11,000 publications now.

Metagenomic sequencing is a powerful approach to investigate the microbial diversity of complex samples, with taxonomic classification of organisms sometimes reaching strain-level precision. Shotgun metagenomics can not only reveal specific organisms in a sample, but can also characterize the functional genomic profile encoded within microbiomes, and potentially discover genes with new functions. Although the specific sample preparation, library preparation, and sequencing platform used are all important factors that influence the quality of your results, ultimately the downstream bioinformatics pipelines and reference databases used become the analysis bottleneck. With this last point in mind, we have released a new white paper describing how to carry out functional genomics characterization of unbiased shotgun metagenomics data using CLC Genomics Workbench and the add-on CLC Microbial Genomics Module.

To demonstrate the broad capabilities of our software, we re-analyzed previously published data from Mukan Ji and co-workers (3). Ji et al. investigated the surprisingly diverse microbial soil community of a polar desert in Antarctica and sought to understand how these microbes survive in such a harsh and nutrient-deficient habitat.

For an in-depth discussion of the study and their exciting findings, we recommend listening to the podcast with microbiology experts Vincent Racaniello, Michael Schmidt, Elio Schaechter, and Michelle Swanson on This Week in Microbiology, TWiM. The paper was discussed in Episode 169 – Breatharian Bacteria.

Read our white paper on functional metagenomics with CLC Genomics Workbench and the Microbial Genomics Module and learn how to reveal the functional potential of microbiomes sequenced using shotgun metagenomics methods.

References

1. Lloyd K.G., et al. (2018). Phylogenetically Novel Uncultured Microbial Cells Dominate Earth Microbiomes. mSystems 3(5):e00055-18.
2. Handelsman J. et al. (1998). Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 5(10):R245–9.
3. Ji M., et al. (2017) Atmospheric trace gases support primary production in Antarctic desert surface soil. Nature 552(7685):400–3.
