The CowPi workflow is hosted on Galaxy at Aberystwyth University and is free
for the wider scientific community to use. A full description of how to set up your Galaxy
account and the workflow is provided here: http://www.cowpi.org/p/setting-up-cowpi-on-galaxy-account.html.
The sections below describe each of the steps that the Galaxy workflow at
Aberystwyth University carries out automatically.
The CowPi workflow consists of 7 steps (Figure 1).
Step 1: Usearch
Overview:
We use usearch (https://www.drive5.com/usearch/) to
match user-provided OTUs to 16S sequences from the Global Rumen Census (GRC) (http://www.rmgnetwork.org/global-rumen-census.html)
and from fully sequenced genomes of rumen microbes (497 in version 1 - see a
full list here). This differs from the PICRUSt approach, which matches OTUs to
16S sequences collated from online 16S sequence databases such as RDP (http://rdp.cme.msu.edu/)
or Silva (https://www.arb-silva.de/). We chose the GRC data because it
represents a very good overview of the Bacteria and Archaea that are found in
the rumen globally, and so will be a better target database than other online
datasets, which may be heavily biased towards non-rumen species.
Usage:
A user needs to provide a FASTA-formatted file
of OTUs from the dataset to be studied. An example from the data
presented in the CowPi paper is provided both in the data repository and
in Galaxy (see step 4 here: http://www.cowpi.org/p/setting-up-cowpi-on-galaxy-account.html).
See Figure 2 for an example of this input file.
When the example data has been correctly imported into your Galaxy account, there should be a file called “timecourseOTU.fa” present. This is the FASTA OTU file. It should appear in a drop-down list for selection (as shown in Figure 3). So that we can report usage statistics to our funders, we also ask users to indicate the country and institute that the analysis is being run from.
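Before uploading, it can help to check that the OTU file really is valid FASTA. The following is a minimal Python sketch of such a check, not part of CowPi itself; the function name read_fasta and the two example records are hypothetical and only illustrate the expected layout (a ">" header line followed by one or more sequence lines).

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token as the identifier.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-record example; a real OTU file will be much larger.
example = StringIO(">OTU_1\nACGTACGT\nACGT\n>OTU_2 size=10\nTTGACA\n")
otus = dict(read_fasta(example))
print(len(otus), "OTUs read")
```

To check a real file, open it with open("timecourseOTU.fa") and pass the handle to read_fasta in place of the StringIO example.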
Step 2: Extract Names
Overview:
This is a custom step that takes the output of usearch
and extracts the names of the 16S sequences that are found to be the best
match.
Usage:
The user does not have to specify anything for
this step in CowPi; the workflow will
pass on the results to the next step.
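The idea of extracting best-match names can be sketched in Python as follows. This is an illustration only, not the CowPi script: it assumes the usearch results are in the 12-column tab-delimited format written by usearch's -blast6out option (the format CowPi actually uses is not stated here), and it assumes "best match" means the highest bit score. The function name best_hits and the example rows are hypothetical.

```python
import csv
from io import StringIO


def best_hits(handle):
    """Return {query OTU: target name}, keeping the highest bit score per query."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        # Assumed blast6 columns: 0 = query, 1 = target, 11 = bit score.
        query, target, bits = row[0], row[1], float(row[11])
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return {query: target for query, (target, _) in best.items()}


# Made-up hits for illustration; the target names are not real GRC identifiers.
example_hits = StringIO(
    "OTU_1\tGRC_0001\t99.0\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
    "OTU_1\tGRC_0002\t97.0\t250\t7\t0\t1\t250\t1\t250\t1e-110\t420\n"
    "OTU_2\tGenome_17\t100.0\t250\t0\t0\t1\t250\t1\t250\t1e-130\t462\n"
)
otu_to_reference = best_hits(example_hits)
print(otu_to_reference)
```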
Step 3: Rstep
Overview:
This is a custom step that uses the R
statistical package to sum up the OTU counts for any OTUs that cluster to the
same rRNA sequence from the GRC or Genomes. It uses a user-provided OTU count
table.
Usage:
The user needs to provide an OTU count table
giving the number of observations of each OTU in each sample. This
should be in a tab-delimited format, with a header in the first row providing
the sample IDs and a final column with the total number of each OTU.
An example is provided in Figure 4, and an example file ‘OTU_Table.txt’ from the rRNA colonisation study
outlined in the CowPi paper is provided both in the data repository and as part
of the data import in Galaxy.
Select this file from the drop-down menu in
Step 3 as shown in Figure 5.
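The summing performed in this step can be illustrated with a short sketch. CowPi does this in R; the version below is written in Python purely for illustration and is not the CowPi script. It assumes that the first column of the table holds the OTU identifier, and that OTUs without a matched reference are dropped. The names read_otu_table and sum_by_reference, and all example values, are hypothetical.

```python
import csv
from io import StringIO


def read_otu_table(handle):
    """Read a tab-delimited OTU table: header row, OTU ID first, total last."""
    rows = csv.reader(handle, delimiter="\t")
    header = next(rows)
    samples = header[1:-1]  # drop the OTU ID column and the final total column
    counts = {row[0]: [int(v) for v in row[1:-1]] for row in rows if row}
    return samples, counts


def sum_by_reference(otu_counts, otu_to_reference):
    """Add together per-sample counts of OTUs that share a reference sequence."""
    summed = {}
    for otu, counts in otu_counts.items():
        reference = otu_to_reference.get(otu)
        if reference is None:
            continue  # assumption: OTUs without a match are dropped
        totals = summed.setdefault(reference, [0] * len(counts))
        for i, value in enumerate(counts):
            totals[i] += value
    return summed


# Made-up table and mapping for illustration only.
table = StringIO("OTU\tS1\tS2\tTotal\nOTU_1\t5\t0\t5\nOTU_2\t3\t4\t7\nOTU_3\t1\t1\t2\n")
mapping = {"OTU_1": "GRC_0001", "OTU_2": "GRC_0001", "OTU_3": "Genome_17"}
samples, otu_counts = read_otu_table(table)
summed_counts = sum_by_reference(otu_counts, mapping)
print(samples, summed_counts)
```

Here OTU_1 and OTU_2 both map to the same reference, so their counts are added sample by sample.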
Step 4: Convert BIOM
Overview:
This is one of the core PICRUSt steps. It
takes the mapped OTUs and the summed OTU table and uses a phylogenetic tree calculated
from all 696,451 16S sequences from the GRC and sequenced rumen genomes. For a
full explanation of this step, please refer to the original PICRUSt paper.
Usage:
The user does not have to do anything at this
step; it will automatically take the output from Step 3 and pass on the results
to the next step. For version 1 we recommend that users do not change the
pre-selected options in this step (as shown in Figure 6).
Step 5: Normalise
Overview:
This step takes as input a pre-calculated file
of 16S rRNA copy numbers per genome and uses it to normalise the OTU abundances
from the previous steps. This file is provided both in the data repository and
in the data import in Galaxy, and is called ‘16S_precalculated.tab’.
This step will pass on the normalised
abundances to the next step for function prediction.
Usage:
The user needs to select the file ‘16S_precalculated.tab’
from the drop-down menu as shown in Figure 7.
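Copy-number normalisation can be sketched as a simple division: each reference's abundance is divided by its 16S rRNA copy number, so that organisms with many 16S copies are not over-counted. The Python below is a minimal sketch under that assumption rather than the PICRUSt implementation; the function name and the example values are hypothetical.

```python
def normalise_by_copy_number(abundances, copy_numbers):
    """Divide each reference's per-sample abundances by its 16S copy number."""
    normalised = {}
    for reference, counts in abundances.items():
        copies = copy_numbers[reference]
        if copies <= 0:
            raise ValueError(f"non-positive copy number for {reference}")
        normalised[reference] = [count / copies for count in counts]
    return normalised


# Made-up abundances and copy numbers for illustration only.
abundances = {"GRC_0001": [8, 4], "Genome_17": [1, 1]}
copy_numbers = {"GRC_0001": 4, "Genome_17": 2}
normalised = normalise_by_copy_number(abundances, copy_numbers)
print(normalised)
```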
Step 6: Predict Metagenome
Overview:
This step takes the predicted best-hit genomes
and normalised copy numbers and uses a pre-calculated file containing KEGG
ortholog (KO) abundances to produce an estimate of KO abundances in each
sample. The precalculated file ‘ko_precalc1.tab’, which contains predicted KO
abundances from all 497 rumen microbial genomes, is provided in both the data
repository and the Galaxy import.
Usage:
The user needs to select the precalculated file
‘ko_precalc1.tab’ from the drop-down menu as shown in Figure 8. The output is an HDF5-based
"version 2" BIOM-formatted file of KO abundances and hierarchy that
can be analysed using various packages. This is passed on to the final step of
CowPi to extract pathway-level information.
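The prediction in this step can be thought of as a weighted sum: for each sample, the abundance of a KO is the sum over references of the normalised abundance multiplied by that reference's predicted KO count. The following Python sketch illustrates this calculation under that assumption; it is not the PICRUSt code, and the function name, reference names and values are hypothetical.

```python
def predict_ko_abundances(normalised, ko_counts):
    """Sum (normalised abundance x KO count) over references, per sample."""
    predicted = {}
    for reference, abundances in normalised.items():
        # References with no entry in the KO table contribute nothing.
        for ko, copies in ko_counts.get(reference, {}).items():
            totals = predicted.setdefault(ko, [0.0] * len(abundances))
            for i, abundance in enumerate(abundances):
                totals[i] += abundance * copies
    return predicted


# Made-up inputs for illustration only.
normalised_abundances = {"GRC_0001": [2.0, 1.0], "Genome_17": [0.5, 0.5]}
ko_counts = {
    "GRC_0001": {"K00001": 1, "K00002": 2},
    "Genome_17": {"K00001": 2},
}
ko_abundances = predict_ko_abundances(normalised_abundances, ko_counts)
print(ko_abundances)
```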
Step 7: Categorize
Overview:
This step takes the KO abundances
per sample from the previous step (in HDF5-based "version 2" BIOM
format) and produces a tab-delimited table of predicted KEGG pathways and
their abundances in each sample.
Usage:
The user does not have to do anything for this
step. It will automatically produce the pathway-level results.
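Collapsing KO abundances to pathway level can be sketched as follows. The Python below is an illustration only, not the PICRUSt implementation: it assumes that a KO belonging to several pathways contributes its full abundance to each of them, and that KOs without a pathway annotation are skipped. The function name, the pathway labels and the values are hypothetical.

```python
def collapse_to_pathways(ko_abundances, ko_to_pathways):
    """Add each KO's per-sample abundance to every pathway it belongs to."""
    pathways = {}
    for ko, abundances in ko_abundances.items():
        # Assumption: KOs with no pathway annotation are skipped.
        for pathway in ko_to_pathways.get(ko, []):
            totals = pathways.setdefault(pathway, [0.0] * len(abundances))
            for i, abundance in enumerate(abundances):
                totals[i] += abundance
    return pathways


# Made-up KO abundances and placeholder pathway labels for illustration only.
example_kos = {"K00001": [3.0, 2.0], "K00002": [4.0, 2.0]}
example_map = {"K00001": ["Pathway A", "Pathway B"], "K00002": ["Pathway A"]}
pathway_table = collapse_to_pathways(example_kos, example_map)

# Print as a tab-delimited table, one pathway per row.
for pathway, values in sorted(pathway_table.items()):
    print(pathway, *values, sep="\t")
```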