Taxonomic profiling - QIAGEN Digital Insights

When it comes to taxonomic profiling of the microbiome, amplicon sequence variants (ASVs) is a method that has unique benefits compared to the traditional method of sequence similarity-based operational taxonomic unit (OTU) clustering. This training’ll cover how to effectively do taxonomic profiling using QIAGEN CLC Genomics Workbench’s ASV tools and a customizable workflow.

Specifically, you’ll learn how to:
1. Collect and bin sequence variants for each sample
2. Merge AVS tables to compare profiles
3. Assign the taxonomic profile/organism
4. Export high-resolution graphical and comprehensive tabular results

If you’re new to QIAGEN CLC Genomics Workbench and are interested in a trial, learn more here:
https://qiagen.showpad.com/share/I6MqdhIybpllyvuYW3AKy

Using viral reference databases for phylogeny construction and taxonomic profiling of samples with low viral load

This blog tutorial highlights several recent improvements in the latest update to QIAGEN CLC Microbial Genomics Module 20.1. The update includes improved usability in the Download Microbial Reference Database tool and improved support for long reads in Taxonomic Profiling. Some of the improvements include:

Faster load times for the selection table, which now loads in just seconds
Full access to the latest assemblies from NCBI with a taxonomy-aware download selection
No deduplication: The tool no longer removes duplicate sequences, as this functionality has been moved to Create Taxonomic Profiling Index

With the 20.1 update, it is now easy to customize the Microbial Reference Database to fit your needs. Here we demonstrate two use cases:

Visualizing phylogenetic relationships of all coronavirus genomes
Creating a taxonomic profiling index of all viral genomes and carrying out taxonomic profiling of viral metagenome samples containing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in a few simple steps

Visualizing phylogenetic relationships made easy

The updated downloader makes it simple to visualize phylogenetic relationships. To create a dendrogram of the four coronavirus genera, we first create a microbial database containing only coronavirus:

Run the Download Microbial Reference Database tool to load the Database builder
Filter the table to show only entries where the Taxonomy column contains 'coronoviridae'
Aggregate rows on Genus; we observe five samples which do not include the genus
Use Quick Selection: "Complete genomes in RefSeq" to quickly select all complete, coronavirus genomes

Approximately 200 references remained and were downloaded with a minimum contig length of 1000. The five samples with an unknown genus were included in the downloaded database.

The phylogenies of the downloaded database of assemblies can be easily visualized using Create K-mer Tree. In Create K-mer Tree, select the downloaded database of coronavirus genomes. The dendrogram shown was created with default settings, except "Only index k-mers with prefix" was left blank due to the short length of coronavirus genomes.

Figure 1 shows a circular dendrogram with added genus metadata. For ease of viewing, 50% of both the alphacoronavirus and betacoronavirus genomes have been excluded from the tree.

In the tree, the five references without a genus are selected and their branches are shown in dark blue. From the tree, we can see that three of these references cluster with the betacoronavirus, one clusters with the alphacoronavirus and one clusters between alphacoronavirus and gammacoronavirus.

This highlights a quick and easy way to download a database of viral genomes, and how to use the database to create a phylogeny. The phylogeny can then be used to resolve samples of unknown genus.

Figure 1. Dendrogram of the four coronavirus genera.

Create K-mer tree also works with reads. In the next section, we demonstrate how to create a taxonomic profile with metagenome samples.

Create a taxonomic profiling index and detect abundance of coronavirus in metagenome samples with low coronavirus copy number

With the recent updates to the Download Microbial Reference Database and Taxonomic Profiling functions in QIAGEN CLC Microbial Genomics Module, it is now fast and easy to detect coronavirus presence in metagenome samples containing only a few virus reads. Taxonomic profiling now also supports long reads such as those generated by Oxford Nanopore and PacBio sequencing technologies.

For the first time setup, we create a viral database:

Run the Download Microbial Reference Database tool to load the Database builder
Filter the table to show only entries where the Taxonomy column contains 'virae' - we skip the remaining virus kingdom in the interest of speed
Use ’Quick Selection: Complete genomes in RefSeq’ to quickly select all complete, viral genomes

All complete virus genomes to date, approximately 18,500, remained and were downloaded with a minimum contig length of 1000.

The downloaded database was used to create a taxonomic profiling index using default settings.

The analysis can be carried out in a simple workflow using the curated Microbial Reference Database and human genome to create a Taxonomic Profiling index for host genome filtering (Figure 2).

Results are presented from 3 different studies with low fraction of viral reads (Table 1).

SRR10948550: Long read sequencing using Oxford Nanopore (1)
SRR11092061: Paired end sequencing using Illumina HiSeq 3000 (2)
ERR4385803: Paired end sequencing using Illumina HiSeq 2500 (gut virome sample - negative for SARS-CoV-2)

Abundance virus values have been aggregated to species level and table filtered to abundance >10. The % viral reads is the percentage of reads in the sample matching the virus database.

Table 1. Abundances for the different samples (results have been aggregated to species level)

Sample	% viral reads	Species	Taxonomy	Abundance
SRR10948550	1.0556	Severe acute respiratory syndrome-related coronavirus	Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus	985
		Ambystoma tigrinum virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Ambystoma tigrinum virus	39
		Common midwife toad virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus	26
SRR11092061	0.0045	Severe acute respiratory syndrome-related coronavirus	Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Coronaviridae; Betacoronavirus; Severe acute respiratory syndrome-related coronavirus	1304
		Spodoptera frugiperda rhabdovirus	Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Rhabdoviridae; Spodoptera frugiperda rhabdovirus	822
		Saccharomyces 20S RNA narnavirus	Orthornavirae; Lenarviricota; Amabiliviricetes; Wolframvirales; Narnaviridae; Narnavirus; Saccharomyces 20S RNA narnavirus	336
		Stenotrophomonas virus SMA7	Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Subteminivirus; Stenotrophomonas virus SMA7	126
		Influenza A virus	Orthornavirae; Negarnaviricota; Insthoviricetes; Articulavirales; Orthomyxoviridae; Alphainfluenzavirus; Influenza A virus	112
		Nipah henipavirus	Orthornavirae; Negarnaviricota; Monjiviricetes; Mononegavirales; Paramyxoviridae; Henipavirus; Nipah henipavirus	48
		Common midwife toad virus	Bamfordvirae; Nucleocytoviricota; Megaviricetes; Pimascovirales; Iridoviridae; Ranavirus; Common midwife toad virus	12
		Inoviridae sp	Loebvirae; Hofneiviricota; Faserviricetes; Tubulavirales; Inoviridae; Inoviridae sp	12
ERR4385803	0.6578	Gokushovirus WZ-2015a	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Gokushovirus WZ-2015a	19753
		Human gut gokushovirus	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Human gut gokushovirus	3883
		Microviridae sp	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae; Microviridae sp	1726
		Microviridae	Sangervirae; Phixviricota; Malgrandaviricetes; Petitvirales; Microviridae	47

The negative control sample ERR4385803 correctly reports no coronavirus. The abundance of virus was correctly reported in both positive samples (Table 1).

References:

Zhou, P. et al. (2020) A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 7798: 270-273.
Chan, J.F.W. et al. (2020) A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster. The Lancet 10223: 514-523.

The proGenomes2 project is a set of over 85,000 consistently annotated bacterial and archaeal genomes from over 12,000 species which provides a set of reference genomes across taxonomies and specific habitats, such as disease and food-related pathogens, and microbes from aquatic and soil environments. These databases offer excellent starting points for taxonomic profiling as they are unbiased and aim to span the diversity of the specific habitats. Unfortunately, the databases are not in a format that may be used directly within QIAGEN CLC Genomics Workbench, but with scripting, you can produce similar databases from within QIAGEN CLC using the proGenomes2 fasta files as a starting point. The headers of the proGenomes2 databases are constructed in the following way:

We use the biosample ID to find a set of assemblies in NCBI which we can download with the ‘Download Microbial Reference Database’ tool, including all information required for taxonomic profiling. First we need to find the desired database from http://progenomes.embl.de/data/, e.g. the sediment_mud specific database (but any other progenomes2 database hosted at this URL will work, replacing the definition of "URL" in the script below). With the following simple script we can stream the headers of that (gzipped) fasta file into the unique biosample IDs and use NCBI’s Eutils API to translate them into a set of unique assembly IDs and finally collect them into a file:

import sys, time, gzip, urllib.request
import xml.etree.ElementTree as ET
url="http://progenomes.embl.de/data/habitats/representatives.sediment_mud.contigs.fasta.gz"
print("Downloading "+url[url.rfind("/")+1:]+" may take some time ... ", end="", flush=True)
with gzip.GzipFile(fileobj=urllib.request.urlopen(url)) as f:
    l = list({ line.decode("UTF-8").split(".")[1] for iline, line in enumerate(f) if line.startswith(b">")})
print("Done")
def request(query):
    i = 0
    while True:
        try:
            return ET.fromstring(urllib.request.urlopen(query).read().decode("utf-8"))
        except Exception as e:
            if i > 5:
                print("Could not reach: "+query+"\nCheck connection: "+str(e))
                exit(1)
            time.sleep(1)
            i+=1
assemblies = set()
interval=50
for ibiosample in range(0,len(l),interval):
    biosample = "+OR+".join(bs for bs in l[ibiosample:min(ibiosample+interval,len(l))])
    base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    rparse = request(base + "esearch.fcgi?db=assembly&term="+biosample+"[biosample]&usehistory=y")
    query2 = base+"esummary.fcgi?db=assembly&query_key="+rparse.find("QueryKey").text+"&WebEnv="+rparse.find("WebEnv").text
    for res in request(query2).findall(".//AssemblyAccession"):
        assemblies.add(res.text[:res.text.find(".")])
    print("Getting Assembly IDs from NCBI: {:.2f}%".format(min(ibiosample+interval, len(l))*100/len(l)),end="\r" if ibiosample+interval<len(l) else "\n")
ofname = url[url.rfind("/")+1:].replace(".fasta.gz",".txt")
print("Writing Assembly IDs to output file "+ofname)
with open(ofname , 'w') as f:
    for assembly in sorted(assemblies):
        f.write(assembly+"\n")

To run this script, you need a standard installation of python3. All you need to do is copy and paste the content above, modify the URL (if necessary), save it to a file and execute it on your system. For example, you may save the file as "get_assembly_ids.py", then open a terminal and navigate to the folder where the script is located. Finally, execute it with:

$python get_assembly_ids.py

Running this script takes about 2 minutes (for the sediment_mud database), depending on your internet connection. The output will be a file called "representatives.sediment_mud.contigs.txt" placed in the same folder as the script containing the assembly IDs from which the respective progenomes2 database has been created (if you changed the URL, the name of the output file would be changed accordingly).

This file can now be used from within QIAGEN CLC Genomics Workbench (with the Microbial Genomics Module installed). Select Toolbox → Microbial Genomics Module → Databases → Taxonomic analysis → Download Microbial Reference Database and select "Create custom database" and "Include all" sequences.

After clicking next, it is possible to supply a file with Assembly Accession IDs. Select the file "representatives.sediment_mud.contigs.txt" we have just created and click "Finish".

This will create a "Database builder" where the assemblies from "representatives.sediment_mud.contigs.txt" have been selected and staged for download. The Database builder gives an overview of the selected references and provides an estimate of the download size.

By clicking "Download Selection" the download process is started and a sequence list is saved to the selected location. From this sequence list, a Taxonomic Profiling index can be constructed by running the Create Taxonomic Profiling Index tool.

Learn more about how QIAGEN CLC Genomics and QIAGEN CLC Genomics Workbench with the Microbial Genomics Module are powerful and scalable solutions to support all your genomics analysis needs.

Disclaimer: QIAGEN does not support the proGenomes2 databases, and the information provided in this article is given without any warranty, expressed or implied. Users are solely responsible for the application of any code or information provided. The proGenomes2 databases are free for academic use. For any intended commercial use, please refer to http://progenomes.embl.de/other.cgi.

The leading platform for microbiome analysis just got more powerful

With CLC Microbial Genomics Module 2.0, you can now identify microbes with strain level accuracy in metagenomics data. Check out the release highlights here. The module is part of our QIAGEN CLC Genomics ProSuite which fully equips you with all the tools you need for whole genome-driven pathogen typing, epidemiological analysis of outbreaks and microbiome profiling.

Integrated analytics deliver research continuity All analytics for microbial genomics and metagenomics come fully integrated into one scalable and enterprise-ready solution. Data and sample-metadata management is included.
Focus on what matters We shift the focus from data to discovery by providing preconfigured workflows that streamline processing of raw NGS data. Researchers can focus on the interpretation of results, for example in the context of other samples or metadata.
High performance algorithms Designed to save time and compute resources, and to be accessible to bioinformatics experts and non-bioinformaticians alike.

You may also find our webinar on Monday, April 17 of interest: Taxonomic Profiling using Shotgun Metagenome Data
You can register here!

Taxonomic profiling of whole metagenome sequencing datasets

In a recent study, Willmann and colleagues investigated the impact of ciprofloxacin treatment on the intestinal microbiome. Antimicrobial treatments are known to upset the structure of the intestinal microbiome. Taxonomic profiling can reveal the severity and long term effects of antibiotic treatments on the gut community.

Shotgun metagenomics sequencing is a powerful tool for investigating bio-diversity of microbial communities. Common challenges include unwanted signals from host DNA reducing the sensitivity at which the microorganisms can be detected. In addition, large data volumes and the complex structure of metagenomics data further complicate computational analysis.

Using the original data of Willmann et al. we demonstrate the new taxonomic profiler designed for metagenomics data analysis introduced in CLC Microbial Genomics Module 2.0.

Study out-line

Two healthy male volunteers (S1 and S2) were treated with oral ciprofloxacin (Cp) for six days (500 mg twice daily). Stool samples were collected before treatment (baseline), on day 1, 3 and 6 of treatment, and 2 and 28 days after treatment.

Metagenomic shotgun sequencing was performed on the Illumina HiSeq 2000 platform using a paired-end sequencing approach with a targeted read length of 100 bp and an insert size of 180 bp.

Using QIAGEN CLC Genomics ProSuite, we track the evolution of microbiome diversity during and after treatment by analysing the presence and relative abundance of microbial taxa.

We studied how the community recovers after treatment by comparing microbial diversity and taxonomic composition over time. Furthermore, we show how differential abundance analysis can be used to identify the taxa that are most impacted by Cp treatment.

This application note focuses on taxonomic profiling of whole metagenome shotgun sequencing data and the comparative analysis of microbiome profiles using CLC Microbial Genomics Module. This module is an extension of CLC Genomics Workbench and part of QIAGEN CLC Genomics ProSuite. In this analysis, we used CLC Genomics Workbench (version 10) and CLC Microbial Genomics Module (version 2).

Results

Raw fastq-files and metadata were downloaded directly through CLC Genomics Workbench using the Search for Reads in SRA tool (ENA project number PRJEB10391).

To build a suitable database of reference genomes for taxonomic profiling, we used the Create Microbial Reference Database function. This new tool enables users to easily build and download a reference database containing the latest version of all complete microbial genomes, or meaningful subsets thereof, from NCBI. This reference data manager allows the user to filter reference genomes for assembly quality and taxonomic content. The database used here contained 45,616 microbial genome sequences.

Taxonomic profiling

To get started faster CLC Microbial Genomcis Module 2.0 includes 3 preconfigured and ready-to-use Workflows. The Data QC and Taxonomic Profiling workflow performs sequence trimming and taxonomic profiling using the user-build reference database created earlier (Figure 1). To reduce the impact of contamination the workflow includes an option to filter out reads originating from host DNA. The workflow outputs sequencing quality control (QC) reports, and reports from trimming the NGS reads. Microbial community composition is reported in the form of abundance tables and interactive graphics (sunburst plots, stacked bar and stacked area charts) for each sample.

To compare study cohorts or groups of samples abundance profiles can be merged per experiment into a single profile. We merged abundance profiles for each subject and one for all samples combined.

Figure 1. The Data QC and Taxonomic Profiling workflow.

Figure 2. Sunburst chart showing the relative abundance of bacterial within the phylum Firmicutes in S1. As shown in the legend 15% of all taxa identified in this sample group are Firmicutes.

Alpha diversity and rarefaction analysis

Sequencing at sufficient depth is crucial for recovering the complete microbial diversity inherent to a sample. Rarefaction analysis should be performed to assess if sequencing coverage was sufficient. Rarefaction curves plot the number of taxa as a function of the sub-sampled reads. The rarefaction curves in Figure 3, produced with the Alpha Diversity tool, all reach a plateau, indicating that sequencing was performed to a sufficient level.

Several studies have shown the diversity of the gastrointestinal flora remains diminished 8 weeks after treatment [Fouhy et al.] and does not recover completely even 3 months after treatment ends [Panda et al.] - so a complete restoration of the initial diversity is not to be expected within 4 weeks after treatment.

To analyse the impact of Cp treatment on microbiome diversity we have estimated alpha diversity (or diversity withing a sample). The plot in Figure 3 shows that at baseline level individual S1 had a more diverse intestinal microbiome compared to S2. Cp treatment reduced the taxonomic diversity of the gut microbiome severly for subject S1 by day 6. Subject S2 was less affected. Both S1 and S2 recovers almost completely 28 days after treatment. This is in agreement with observations in the original paper: “Distal gut species diversity was profoundly disturbed over the course of Cp exposure in subject 1[...]. The lowest diversity occurred at the last treatment day and in the following 2 days but recovered almost fully 28 days after treatment. Subject 2 generally was much less affected [...]”.

Figure 3. Alpha diversity analysis shows that gut microbiome of S1 was profoundly disturbed by Cp treatment. For alpha diversity estimation we use a non-absolute diversity metric. Accurate counts of different taxonomic units can be obtained from the abundance tables.

Beta diversity

Beta diversity estimation compares microbial community composition between samples and clusters samples based compositional similarity. To assess the change in microbiome composition across different time-points, we performed beta diversity analysis on the abundance profiles. The Beta Diversity tool offers three diversity measures (Bray-Curtis, Jacquard and euclidian) to calculate distance matrices. The results are shown in Figure 4 and 5, combining the beta-diversity as a bar chart and a PCoA plot for the Bray-Curtis distance. While the bar chart provides details of the relative abundance of taxa, here grouped by phylum, the PCoA plots show the development of microbiome composition over time.

Figure 4. Stacked bar chart showing the gut microbiome diversity of S1 and S2 on phylum level. Samples are grouped by subject, and individual bars are labelled with subject and treatment day.

Subjects S1 and S2 respond to Cp treatment with a similar reduction in the abundance of Proteobacteria. However, for the four other dominating phyla Verrucomicrobia, Firmicutes, Bacteroidetes, and Actinobacteria, subjects S1 and S2 respond differently.

The phylum Verrucomicrobia is only present in S1 at time-point 1 and 2 and recovers at time-point 6. Two weeks after treatment this phylum is present at much higher relative abundance (purple portion of the bar chart) than before the Cp exposure. This agrees with the observation that "Among the five most common phyla, both subjects responded with a similar decrease in abundance of Proteobacteria, but distinct from levels for Verrucomicrobia, Firmicutes, Bacteroidetes, and Actinobacteria" (see Figure. 6 in Willmann et al.)

PCoA analysis of S1 displays the baseline point and point 1 very close together, indicating that on the first day of treatment diversity is mildly affected. However, from day 3 the profile deviate visibly along the first principal coordinate, with day 6 and 8 being most different to the baseline and closer to each other. On day 34, the profile is much closer to baseline again, indicating the partial recovery of the microbial composition after treatment. S2 follow a very similar pattern, but recover to a state much closer to baseline than S1.

Figure 5. PCoA plots of the beta diversity between S1 samples (top) and S2 samples (bottom). Baseline is shown in green, during treatment (day 1, 3 and 6) in blue and post treatment (2 and 28 days after treatment start) in red, and is labeled with the treatment day.

Differential abundance analysis

To determine which taxa were most impacted by exposure to Cp in subjects S1 and S2 we performed differential abundance analysis using the Differential Abundance Analysis tool. CLC Microbial Genomics Module 2.0 feaures state-of-the-art analytics for determining differential abundance between microbial communities. Results can be filtered by fold change or p-values.

Fold-changes at family-level and FDR-corrected p-values are shown in Figure 6. Eubacteriaceae were significantly more abundant in S2, while Oscillospiraceae and Akkermansiaceae were more abundant in S1.

Figure 6. The output of differential abundance analysis on family level, comparing S1 and S2 while correcting for treatment phase.

Concluding remarks

This study has shown that ciprofloxacin treatment significantly alters the intestinal microbial community structure. While the intestinal microbiota recovers extensively post treatment, the results also highlight that microbiome diversity varies profoundly among healthy individuals.

Here we have demonstrated how use of the new Taxonomic Profiling tool enables the analysis of large shotgun metagenomics sequencing dataset and delivers results that are concordant to findings generated by independent means. Ease of use is ensured through preconfigured Workflows. Built-in statistical tools and data visualisations enable comparisons across multiple samples or sample groups and provide clear and interactive graphical results.

References

[Willmann et al.] "Antibiotic selection pressure determination through sequence-based metagenomics." Antimicrobial agents and chemotherapy 59.12 (2015): 7335-7345.

[Panda et al.] "Short-term effect of antibiotics on human gut microbiota." PloS one 9.4 (2014): e95476.

[Fouhy et al.] "High-throughput sequencing reveals the incomplete, short-term recovery of infant gut microbiota following parenteral antibiotic treatment with ampicillin and gentamicin." Antimicrobial agents and chemotherapy 56.11 (2012): 5811-5820.

[Zurfluh et al.] "Quinolone resistance mechanisms among extended-spectrum beta-lactamase (ESBL) producing Escherichia coli isolated from rivers and lakes in Switzerland." PloS one 9.4 (2014): e95864.

For further reading check out our Microbial Genomics Solution.