NMPDR Wiki

The Official Documentation Site for the National Microbial Pathogen Data Resource

Releasing GFF Files


The gff3_validator.pl application mentioned in this article does not exist on any of the NMPDR machines. If you know the location of the file, contact Bruce immediately.

NMPDR Standard Operating Procedure SOP006


Introduction

This standard operating procedure (SOP) describes the steps required to export data from the NMPDR in GFF3 format.

Scope

This SOP covers the procedures for uploading NMPDR data to the BRC Central site.

Applicable Regulations and Guidelines

  • NMPDR Contract: Delivery of NMPDR SOPs
  • BRC Metrics: Production of metrics
  • GO: List of GO terms
  • Transaction Logging: NMPDR Logging requirements

Responsibility

This SOP applies to those members of the NMPDR research team involved in uploading data. This includes the following:

  • Principal Investigator
  • Bioinformaticians

Definitions

  • Standard Operating Procedures (SOPs): Detailed, written instructions to achieve uniformity of the performance of a specific function.
  • Subsystem: A collection of functional roles that together implement a specific biological process or structural complex.
  • FigFam: A protein family. Each family is intended to contain a set of globally similar proteins that implement the same function.
  • Annotation: A tuple consisting of a date, annotator name, and textual message.
  • Structured Annotation: An annotation where the text is structured. There are two kinds of structured annotations:
    1. Placement of a gene within a subsystem.
    2. Assignment of a function to a gene.

Process Overview

  1. Create the files.
  2. Upload the files.
  3. Check the upload.

Release Procedure

Creating the Files

  1. Choose a machine that is up to date, and create an empty directory. We will refer to this as the target directory.
  2. Run the SproutGFF utility specifying the target directory as a parameter. This selects the genomes flagged as belonging to the NMPDR, and creates a GFF3 file for each one.

For example, if the target directory were ~fig/nmpdr, you would use:

 SproutGFF ~fig/nmpdr

This produces a directory hierarchy similar to the following.

  • ~fig/nmpdr
    • Campylobacter
      • Campylobacter.coli.RM2228.gff
      • Campylobacter.coli.RM2228.xml
      • Campylobacter.lari.RM2100.gff
      • Campylobacter.lari.RM2100.xml
      • ...
    • Listeria
      • Listeria.innocua.Clip11262.gff
      • Listeria.innocua.Clip11262.xml
      • Listeria.monocytogenes.10403S.gff
      • Listeria.monocytogenes.10403S.xml
      • ...
    • Staphylococcus
    • ...

For each organism, there is a GFF file and an XML file. The XML file contains information used by BRC Central to identify the genome and its source. Each subdirectory corresponds to a genus among the NMPDR core organisms. Organisms not in the core set are excluded.

To create a GFF3 file for an organism outside the core groups, you can use the --genome option and specify the genome ID. For example, the command below will produce a single GFF3 file for Streptomyces coelicolor A3(2).

 SproutGFF --genome=100226.1 ~fig/nmpdr

In this case no hierarchy is created and the output files are placed directly in the target location.

It takes from 30 seconds to 1 minute to create the GFF3 file for a single genome. For the entire core organism set, the process generally takes 45 minutes.
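The SOP calls for an empty target directory before SproutGFF is run. A minimal sketch of that preparation step is shown here; prepare_target is a hypothetical helper name, not part of the NMPDR tool set, and the SproutGFF call appears only in a comment because it assumes the utility is on the PATH of an up-to-date NMPDR machine.

```shell
#!/bin/sh
# Sketch only: prepare_target is a hypothetical helper, not an NMPDR utility.

# Create the target directory if it does not exist, and refuse to reuse a
# directory that already holds files, since the SOP asks for an empty one.
prepare_target() {
    target=$1
    mkdir -p "$target" || return 1
    # ls -A lists every entry except . and .. ; any output means "not empty".
    if [ -n "$(ls -A "$target")" ]; then
        echo "error: $target is not empty" >&2
        return 1
    fi
    return 0
}

# Possible usage, assuming SproutGFF is installed:
#   prepare_target ~fig/nmpdr && SproutGFF ~fig/nmpdr
```

Failing on a non-empty directory, rather than clearing it, avoids deleting files from an earlier export by accident.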

Uploading the Files

Once the GFF3 files have been created, use the BRC Central Validator to validate and upload the data. The syntax of the command is as follows, where /path/to/directory/NMPDR stands for the directory in which the GFF and XML files were placed in the previous step (~fig/nmpdr in the examples above).

 gff3_validator.pl -b NMPDR -d /path/to/directory/NMPDR -p CDS
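Because BRC Central relies on the XML file to identify each genome, it may be worth confirming before the upload that every GFF file has its XML companion. The sketch below is one way to do that; check_pairs is a hypothetical helper name, and the check assumes only the naming pattern shown in the directory listing above (same base name, .gff and .xml extensions).

```shell
#!/bin/sh
# Sketch only: check_pairs is a hypothetical helper, not an NMPDR utility.

# Print every .gff file under the given directory that has no .xml file
# with the same base name next to it. Returns non-zero if any are found.
# File names are assumed not to contain newlines.
check_pairs() {
    unpaired=$(find "$1" -type f -name '*.gff' | while IFS= read -r gff; do
        # Strip the .gff suffix and look for the matching .xml file.
        [ -f "${gff%.gff}.xml" ] || printf '%s\n' "$gff"
    done)
    if [ -n "$unpaired" ]; then
        printf '%s\n' "$unpaired"
        return 1
    fi
    return 0
}

# Possible usage:
#   check_pairs /path/to/directory/NMPDR || echo "fix the files listed above first"
```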

Checking the Upload

If the upload succeeded, the files should be visible when you browse the NMPDR directory at the BRC Central FTP site. The URL is ftp://ftp.brc-central.org/NMPDR.
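Browsing the FTP directory by eye is error-prone when many genomes are involved. The sketch below compares the local file names against a saved listing of the remote directory. It rests on assumptions: missing_from_listing is a hypothetical helper name, the listing file is assumed to hold one remote file name per line, and the layout of the remote directory (flat or split by genus) is not stated in this SOP, so the comparison uses base names only.

```shell
#!/bin/sh
# Sketch only: missing_from_listing is a hypothetical helper, not an NMPDR utility.

# Arguments: the local target directory, and a text file holding one remote
# file name per line. Prints the base name of every local .gff or .xml file
# that does not appear in the listing.
missing_from_listing() {
    find "$1" -type f \( -name '*.gff' -o -name '*.xml' \) |
    sed 's|.*/||' |      # keep only the base name
    sort -u |
    while IFS= read -r name; do
        # -x matches the whole line, -F treats the name as a fixed string.
        grep -qxF -- "$name" "$2" || printf '%s\n' "$name"
    done
}

# Possible usage, assuming curl is available and the FTP site is reachable:
#   curl --silent --list-only ftp://ftp.brc-central.org/NMPDR/ > listing.txt
#   missing_from_listing ~fig/nmpdr listing.txt
```

An empty result suggests that every local file name is present in the listing; any name printed would need to be checked by hand.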
