An overview and tutorial of the CowPi workflow on Galaxy.


The CowPi workflow is hosted on Galaxy at Aberystwyth University and is free for the wider scientific community to use. A full description of how to set up your Galaxy account and the workflow is provided here: http://www.cowpi.org/p/setting-up-cowpi-on-galaxy-account.html.
The purpose of the following information is to describe each of the steps that are carried out automatically in the Galaxy workflow at Aberystwyth University.

The CowPi workflow consists of 7 steps that are carried out automatically (Figure 1).



Step 1: Usearch


Overview:
We use usearch (https://www.drive5.com/usearch/) to match user-provided OTUs to 16S sequences from the Global Rumen Census (GRC) (http://www.rmgnetwork.org/global-rumen-census.html) and from fully sequenced genomes of rumen microbes (497 in version 1 - see a full list here). This differs from the PICRUSt approach, which matches OTUs to 16S sequences collated from online 16S sequence databases such as RDP (http://rdp.cme.msu.edu/) or Silva (https://www.arb-silva.de/). We chose the GRC data because it gives a very good global overview of the Bacteria and Archaea found in the rumen, so it should be a better target database than other online datasets, which may be heavily biased towards non-rumen species.
Usage:
A user needs to provide a FASTA-formatted file of OTUs from the dataset to be studied. An example from the data presented in the CowPi paper is provided both in the data repository and in Galaxy (see step 4 here: http://www.cowpi.org/p/setting-up-cowpi-on-galaxy-account.html).

See Figure 2 for an example of this input file.
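As a rough illustration of what a FASTA-formatted OTU file looks like, the Python sketch below reads one and checks that every record has a header line starting with '>' and a unique ID. It is not part of CowPi; the function name `read_fasta` and the OTU names are hypothetical, chosen only for this example.

```python
import io


def read_fasta(handle):
    """Return a dict mapping each sequence ID to its sequence."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("header line with no sequence ID")
            # The ID is the first whitespace-separated token after '>'.
            current_id = parts[0]
            if current_id in records:
                raise ValueError(f"duplicate sequence ID: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[current_id] += line
    return records


# Hypothetical two-OTU input; a real OTU file will hold many more records.
example = io.StringIO(">OTU_1\nACGTACGT\nACGT\n>OTU_2\nTTGACA\n")
otus = read_fasta(example)
print(len(otus), "OTUs read")
```

Running a check like this before uploading can catch a truncated or mis-exported file early.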



When the example data has been correctly imported into your Galaxy account, there should be a file called “timecourseOTU.fa” present. This is the FASTA OTU file. It should appear in a drop-down list for selection (as shown in Figure 3). So that we can report usage statistics to our funders, we also ask users to indicate the country and institute from which the analysis is being run.


Step 2: Extract Names


Overview:
This is a custom step that takes the output of usearch and extracts the names of the 16S sequences that are found to be the best match.
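The idea behind this step can be sketched in Python as follows. This is only an illustration, not the CowPi script itself: it assumes the usearch hits are in a BLAST-like tab-separated layout (query, target, percent identity in the first three columns), which the text above does not confirm, and the function name `best_hits` and all sequence names are hypothetical.

```python
import csv
import io


def best_hits(handle):
    """Map each query OTU to the target with the highest percent identity."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, target, identity = row[0], row[1], float(row[2])
        # Keep the hit only if it beats the best one seen so far for this OTU.
        if query not in best or identity > best[query][1]:
            best[query] = (target, identity)
    return {query: target for query, (target, _) in best.items()}


# Hypothetical hits: OTU_1 matches two reference sequences, OTU_2 matches one.
example = io.StringIO(
    "OTU_1\tGRC_seq_A\t97.2\n"
    "OTU_1\tGRC_seq_B\t99.1\n"
    "OTU_2\tGenome_seq_C\t98.0\n"
)
hits = best_hits(example)
print(hits)
```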

Usage:
The user does not have to specify anything for this step in CowPi; the workflow passes the results on to the next step.

Step 3: Rstep


Overview:
This is a custom step that uses the R statistical package to sum the OTU counts for any OTUs that cluster to the same rRNA sequence from the GRC or the genomes. It uses a user-provided OTU count table.
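The summing logic can be illustrated with a short Python sketch. CowPi performs this step in R, so the code below is only a stand-in for the idea; the function name, the OTU and reference names, and the choice to drop OTUs that have no match are all assumptions made for this example.

```python
def sum_by_reference(otu_counts, otu_to_reference):
    """Sum per-sample counts of all OTUs that map to the same reference."""
    summed = {}
    for otu, counts in otu_counts.items():
        reference = otu_to_reference.get(otu)
        if reference is None:
            continue  # assumption: OTUs without a match are left out
        if reference not in summed:
            summed[reference] = list(counts)
        else:
            # Add the counts sample by sample.
            summed[reference] = [a + b for a, b in zip(summed[reference], counts)]
    return summed


# Hypothetical data: counts for two samples per OTU.
otu_counts = {"OTU_1": [10, 4], "OTU_2": [0, 7], "OTU_3": [3, 3]}
otu_to_reference = {"OTU_1": "GRC_seq_A", "OTU_2": "GRC_seq_A", "OTU_3": "GRC_seq_B"}
summed_counts = sum_by_reference(otu_counts, otu_to_reference)
print(summed_counts)
```

Here OTU_1 and OTU_2 share a reference sequence, so their counts are combined into a single row.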

Usage:
The user needs to provide an OTU count table giving the number of observations of each OTU in each sample. This should be in tab-delimited format, with a header in the first row providing the sample IDs and a final column with the total count for each OTU.
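A quick way to check a table against this layout before uploading is sketched below in Python. It assumes that the first column holds the OTU ID, that counts are whole numbers, and that the final column equals the sum of the sample columns; the column headers and the function name `read_otu_table` are hypothetical.

```python
import csv
import io


def read_otu_table(handle):
    """Read a tab-delimited OTU table and verify the final total column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:-1]  # between the OTU ID column and the total column
    counts_by_otu = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(f"row for {row[0]} has {len(row)} fields, expected {len(header)}")
        counts = [int(value) for value in row[1:-1]]
        total = int(row[-1])
        if sum(counts) != total:
            raise ValueError(f"total for {row[0]} is {total}, but the counts sum to {sum(counts)}")
        counts_by_otu[row[0]] = counts
    return sample_ids, counts_by_otu


# Hypothetical two-sample table.
example = io.StringIO(
    "OTU_ID\tSample_1\tSample_2\tTotal\n"
    "OTU_1\t10\t4\t14\n"
    "OTU_2\t0\t7\t7\n"
)
sample_ids, table = read_otu_table(example)
print(sample_ids, table)
```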


An example is provided in Figure 4, and an example file ‘OTU_Table.txt’ from the rRNA colonisation study outlined in the CowPi paper is provided both in the data repository and as part of the data import in Galaxy.

Select this file from the drop-down menu in Step 3 as shown in Figure 5.


Step 4: Convert BIOM


Overview:
This is one of the core PICRUSt steps. It takes the mapped OTUs and the summed OTU table and uses a phylogenetic tree calculated from all 696,451 16S sequences from the GRC and sequenced rumen genomes. For a full explanation of this step, please refer to the original PICRUSt paper.

Usage:
The user does not have to do anything at this step; it automatically takes the output from Step 3 and passes the results on to the next step. For version 1 we recommend that users do not change the pre-selected options in this step (as shown in Figure 6).

Step 5: Normalise


Overview:
This step takes as input a pre-calculated file of 16S rRNA copy numbers per genome and uses it to normalise the OTU abundances from the previous steps. This file is provided both in the data repository and in the data import in Galaxy, and is called ‘16S_precalculated.tab’.
This step will pass on the normalised abundances to the next step for function prediction.
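The arithmetic behind copy-number normalisation is simple division: each abundance is divided by the number of 16S rRNA copies in the matched genome, so that organisms carrying several copies are not over-counted. The Python sketch below shows this under stated assumptions; it is not the PICRUSt code, and the function name, reference names and numbers are hypothetical.

```python
def normalise_by_copy_number(counts_by_reference, copy_number):
    """Divide each per-sample count by the reference's 16S copy number."""
    normalised = {}
    for reference, counts in counts_by_reference.items():
        copies = copy_number[reference]
        if copies <= 0:
            raise ValueError(f"copy number for {reference} must be positive")
        normalised[reference] = [count / copies for count in counts]
    return normalised


# Hypothetical data: ref_A carries two 16S copies, ref_B carries one.
counts_by_reference = {"ref_A": [10, 4], "ref_B": [1, 3]}
copy_number = {"ref_A": 2, "ref_B": 1}
normalised = normalise_by_copy_number(counts_by_reference, copy_number)
print(normalised)
```

In this example, ten reads assigned to ref_A count as five organisms because that genome has two copies of the gene.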

Usage:
The user needs to select the file ‘16S_precalculated.tab’ from the drop-down menu as shown in Figure 7.

Step 6: Predict Metagenome


Overview:
This step takes the predicted best-hit genomes and the copy-number-normalised OTU abundances and uses a pre-calculated file containing KEGG ortholog (KO) abundances to produce an estimate of KO abundances in each sample. The precalculated file ‘ko_precalc1.tab’, which contains predicted KO abundances from all 497 rumen microbial genomes, is provided in both the data repository and the Galaxy import.
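Conceptually, the estimate for one KO in one sample is the sum, over all matched genomes, of the normalised abundance of that genome multiplied by the number of copies of the KO it carries. The Python sketch below works through that calculation; it is an illustration under these assumptions rather than the PICRUSt implementation, and the function name, genome names and values are hypothetical.

```python
def predict_ko_abundances(normalised, ko_per_genome, n_samples):
    """Estimate per-sample KO abundances from normalised genome abundances."""
    predicted = {}
    for genome, abundances in normalised.items():
        for ko, copies in ko_per_genome.get(genome, {}).items():
            row = predicted.setdefault(ko, [0.0] * n_samples)
            for i, abundance in enumerate(abundances):
                # Contribution of this genome to this KO in sample i.
                row[i] += abundance * copies
    return predicted


# Hypothetical data for two genomes and two samples.
normalised = {"ref_A": [5.0, 2.0], "ref_B": [1.0, 3.0]}
ko_per_genome = {
    "ref_A": {"K00001": 2, "K00002": 1},
    "ref_B": {"K00001": 1},
}
ko_abundances = predict_ko_abundances(normalised, ko_per_genome, n_samples=2)
print(ko_abundances)
```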

Usage:
The user needs to select the precalculated file ‘ko_precalc1.tab’ from the drop-down menu as shown in Figure 8. The output is an HDF5-based "version 2" BIOM-formatted file of KO abundances and hierarchy that can be analysed using various packages. This is passed on to the final step of CowPi to extract pathway-level information.


Step 7: Categorize


Overview:
This step takes the KO abundances per sample from the previous step (in HDF5-based "version 2" BIOM format) and produces a tab-delimited table of predicted KEGG pathways and their abundances in each sample.
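Collapsing KOs into pathways amounts to summing, for each pathway, the abundances of the KOs assigned to it. The Python sketch below illustrates this. It assumes that a KO belonging to several pathways contributes its full abundance to each of them, which the text above does not specify, and the function name, the pathway names and the KO-to-pathway mapping are hypothetical.

```python
def collapse_to_pathways(ko_abundances, ko_to_pathways):
    """Sum per-sample KO abundances into per-sample pathway abundances."""
    pathway_totals = {}
    for ko, abundances in ko_abundances.items():
        for pathway in ko_to_pathways.get(ko, []):
            totals = pathway_totals.setdefault(pathway, [0.0] * len(abundances))
            for i, value in enumerate(abundances):
                totals[i] += value
    return pathway_totals


# Hypothetical data: K00001 belongs to two pathways, K00002 to one.
ko_abundances = {"K00001": [11.0, 7.0], "K00002": [5.0, 2.0]}
ko_to_pathways = {"K00001": ["Pathway_X", "Pathway_Y"], "K00002": ["Pathway_X"]}
pathways = collapse_to_pathways(ko_abundances, ko_to_pathways)

# Write the result as tab-delimited rows, one pathway per line.
for name, values in sorted(pathways.items()):
    print(name, *values, sep="\t")
```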

Usage:
The user does not have to do anything for this step. It will automatically produce the pathway-level results.  


Click on "Run Workflow" on the top-right to start the analysis.