Releasing GFF Files
From NMPDR Wiki
NMPDR Standard Operating Procedure SOP006
Introduction
This standard operating procedure (SOP) describes the steps required to export data from the NMPDR in GFF3 format.
Scope
This SOP covers the procedures for uploading NMPDR data to the Central Site.
Applicable Regulations and Guidelines
| NMPDR Contract | Delivery of NMPDR SOPs |
| BRC Metrics | Production of metrics |
| GO | List of GO terms |
| Transaction Logging | NMPDR Logging requirements |
Responsibility
This SOP applies to those members of the NMPDR research team involved in uploading data. This includes the following:
- Principal Investigator
- Bioinformaticians
Definitions
- Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity of the performance of a specific function.
- Subsystem: A collection of functional roles that together implement a specific biological process or structural complex.
- FigFam: Protein families. Each family is intended to contain a set of globally similar proteins that implement the same function.
- Annotation: A tuple consisting of a date, annotator name, and textual message.
- Structured Annotation: An annotation where the text is structured. There are two kinds of structured annotations:
  - Placement of a gene within a subsystem.
  - Assignment of a function to a gene.
Process Overview
- Create the files.
- Upload the files.
- Check the upload.
Release Procedure
Creating the Files
- Choose a machine that is up to date, and create an empty directory. We will refer to this as the target directory.
- Run the SproutGFF utility, specifying the target directory as a parameter. This selects the genomes flagged as belonging to the NMPDR and creates a GFF3 file for each one.
For example, if the target directory were ~fig/nmpdr, you would use:
SproutGFF ~fig/nmpdr
This produces a directory hierarchy similar to the following.
- ~fig/nmpdr
  - Campylobacter
    - Campylobacter.coli.RM2228.gff
    - Campylobacter.coli.RM2228.xml
    - Campylobacter.lari.RM2100.gff
    - Campylobacter.lari.RM2100.xml
    - ...
  - Listeria
    - Listeria.innocua.Clip11262.gff
    - Listeria.innocua.Clip11262.xml
    - Listeria.monocytogenes.10403S.gff
    - Listeria.monocytogenes.10403S.xml
    - ...
  - Staphylococcus
    - ...
For each organism, there is a GFF file and an XML file. The XML file contains information used by BRC Central to identify the genome and its source. Each subdirectory corresponds to a genus among the NMPDR core organisms. Organisms not in the core set are excluded.
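Because every organism is expected to have both a GFF file and an XML file, a quick pairing check can catch an incomplete export before anything is uploaded. The following is a minimal sketch, not part of the NMPDR tooling: the function name `find_unpaired_gff` is hypothetical, and it assumes the two files for an organism share a base name and differ only in extension, as in the listing above.

```shell
# List every .gff file under the target directory that has no .xml partner.
# Prints one path per line; prints nothing when every pair is complete.
# (Hypothetical helper; assumes paired files differ only in extension.)
find_unpaired_gff() {
    target_dir=$1
    find "$target_dir" -type f -name '*.gff' | sort | while IFS= read -r gff; do
        xml="${gff%.gff}.xml"
        [ -f "$xml" ] || printf '%s\n' "$gff"
    done
}

# Example call (not executed here):
#   find_unpaired_gff ~fig/nmpdr
```

An empty result suggests the export is complete as far as file pairing goes. It says nothing about the validity of the file contents.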
To create a GFF3 file for an organism outside the core groups, you can use the --genome option and specify the genome ID. For example, the command below will produce a single GFF3 file for Streptomyces coelicolor A3(2).
SproutGFF --genome=100226.1 ~fig/nmpdr
In this case no hierarchy is created and the output files are placed directly in the target location.
It takes from 30 seconds to 1 minute to create the GFF3 file for a single genome. For the entire core organism set, the process generally takes 45 minutes.
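When several genomes outside the core groups are needed, the single-genome form can be wrapped in a loop. This is a sketch under stated assumptions: the function name `export_genomes` is hypothetical, and it assumes that running SproutGFF once per genome into the same target directory is acceptable, since this SOP does not say whether `--genome` accepts more than one ID.

```shell
# Export GFF3 files for a list of genome IDs into one target directory.
# Usage: export_genomes TARGET_DIR GENOME_ID [GENOME_ID ...]
# (Hypothetical wrapper; assumes one SproutGFF run per genome is acceptable.)
export_genomes() {
    target_dir=$1
    shift
    # Create the target directory if it does not exist yet.
    # Note: this does not check that an existing directory is empty.
    mkdir -p "$target_dir" || return 1
    for genome_id in "$@"; do
        # --genome is the only SproutGFF option used, as documented above.
        SproutGFF --genome="$genome_id" "$target_dir" || {
            echo "SproutGFF failed for genome $genome_id" >&2
            return 1
        }
    done
}

# Example call (not executed here):
#   export_genomes ~fig/nmpdr 100226.1
```

At 30 seconds to 1 minute per genome, a list of ten genome IDs would be expected to take roughly 5 to 10 minutes.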
Uploading the Files
Once the GFF3 files have been created, use the BRC Central Validator to validate and upload the data. The syntax of the command is as follows, where /path/to/directory/NMPDR stands for the directory in which the GFF and XML files were placed in the previous step (~fig/nmpdr in the examples above).
gff3_validator.pl -b NMPDR -d /path/to/directory/NMPDR -p CDS
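A small wrapper can guard against pointing the validator at the wrong or an empty directory. The sketch below is illustrative only: `validate_and_upload` and the `VALIDATOR` variable are hypothetical names, and the wrapper passes exactly the options shown above (`-b`, `-d`, `-p`) without assuming any others exist.

```shell
# Validate and upload one directory of GFF/XML files.
# VALIDATOR may be overridden (for example, for a dry run); by default it is
# the command named in this SOP. Both names here are hypothetical conveniences.
VALIDATOR=${VALIDATOR:-gff3_validator.pl}

validate_and_upload() {
    gff_dir=$1
    if [ ! -d "$gff_dir" ]; then
        echo "Not a directory: $gff_dir" >&2
        return 1
    fi
    # Refuse to run when the directory tree holds no .gff files at all.
    if [ -z "$(find "$gff_dir" -type f -name '*.gff' | head -n 1)" ]; then
        echo "No .gff files found under $gff_dir" >&2
        return 1
    fi
    # Same options as the documented command line.
    "$VALIDATOR" -b NMPDR -d "$gff_dir" -p CDS
}

# Example call (not executed here):
#   validate_and_upload /path/to/directory/NMPDR
```

Keeping the validator name in a variable makes it possible to rehearse the step with a harmless stand-in command before running the real upload.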
Checking the Upload
If the upload succeeded, you should be able to see the files by browsing the NMPDR directory at the BRC Central FTP site. The URL is ftp://ftp.brc-central.org/NMPDR.
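Browsing by eye works for a handful of files. For a full release, a scripted comparison against a saved directory listing may be easier. The sketch below makes several assumptions not stated in this SOP: that the FTP site allows a listing to be fetched with `curl --list-only`, and that the uploaded files sit in the one remote directory being listed (a listing covers a single directory, so a per-genus layout would need one listing per genus). The function name `missing_uploads` is hypothetical.

```shell
# Print the name of every local .gff/.xml file that is absent from a remote listing.
# LISTING_FILE holds one remote file name per line.
# (Hypothetical helper; assumes the listing covers the directory being checked.)
missing_uploads() {
    local_dir=$1
    listing_file=$2
    find "$local_dir" -type f \( -name '*.gff' -o -name '*.xml' \) |
        sed 's|.*/||' | sort |
        while IFS= read -r name; do
            # -x matches whole lines, -F treats the name as a fixed string.
            grep -qxF -- "$name" "$listing_file" || printf '%s\n' "$name"
        done
}

# Example calls (not executed here):
#   curl --silent --list-only ftp://ftp.brc-central.org/NMPDR/ > listing.txt
#   missing_uploads ~fig/nmpdr listing.txt
```

An empty result means every local file name appears in the listing. Any names printed are candidates for a repeated upload.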
