View on GitHub

LocalDB_with_gene_synteny

How to make a local NCBI gene database and associated tools for gene synteny analysis

GeneSyntenyDB

This will show you how to set up a workflow that allows gene synteny analysis on genomes stored at NCBI. The protocol is divided into four major parts:
1) Database creation
2) Taxonomic annotation
3) Functional annotation
4) Gene synteny

This is still very much a work in progress, but feel free to submit a github issue or contact me at gisves at dtu dot dk

Table of contents

GeneSyntenyDB

Database creation

The database is based on NCBI data and uses their ftp server. I use the non-redundant genomes found in refseq, but since the structure is the same, genbank shoould also work. First we need to download the relevant overview file called assembly_summary.txt. In this case we are interested in all archaeal genomes found in refseq:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt

This is a tabulated table of all archaeal genomes and column 12 tells us if the genome is complete. We extract a list of complete genomes using awk:

awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > complete.list

Now we can download from NCBI using:

for next in $(cat complete.list); do wget -P Complete "$next"/*genomic.gbff.gz; done

We unpack and concatenate all files to make one big genbank file:

zcat Complete/*.gbff.gz > complete.gbk

To extract all CDS and attach a unique identifier that also includes gene order found in Genbank files (features are stored according to position) we use extractCDSfromGB.py

extractCDSfromGB.py -i complete.gbk > archaea.cds.faa

The output fasta file contains all actual CDS features of the Genbank file. Each fasta header consists of >ACCESSION NUMBER_PROTEIN ID_UNIQUE ID

Plasmids database creation

Plasmids are a bit different at NCBI, so it is done with minor modifications. We download all plasmids:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/plasmid.*.genomic.gbff.gz

Concatenate

zcat *.gbff.gz > complete.gbk

To extract all CDS and attach a unique identifier that also includes gene order found in Genbank files (features are stored according to position) we use extractCDSfromGB.py

extractCDSfromGB.py -i complete.gbk > archaea.cds.faa

The output fasta file contains all actual CDS features of the Genbank file. Each fasta header consists of >ACCESSION NUMBER_PROTEIN ID_UNIQUE ID

Taxonomic annotation using Genbank files (optional)

This is not necessary for many of the following gene synteny analysis, but included nonetheless. This can be done several ways. Here I’ll use the information from Genbank files that follow NCBI syntax.

extractTaxfromGB.py -i complete.gbk > archaea_taxonomy.tsv

Taxonomic annotation using Kaiju (optional)

This is not necessary for many of the following gene synteny analysis, but included nonetheless. This can be done several ways. For this workflow I chose Kaiju because I need sensitivity and speed more than precision. See their webpage for installation instructions and how to setup your Kaiju database.

kaiju -t nodes.dmp -f kaiju_db_nr_euk.fmi -i archaea.cds.faa -o archaea.cds.faa.kaiju -z 20 -p -v

Adding full taxon path to Kaiju output

kaiju-addTaxonNames -i archaea.cds.faa.kaiju -o archaea.cds.faa.kaiju.names -t nodes.dmp -n kaiju_db_nr_euk_names.dmp -p

Functional annotation

For functional annotation we will use the slow but excellent EggNOG-mapper. See their webpage for installation instructions and how to setup their database.

emapper.py -m diamond --cpu 10 -i archaea.cds.faa -o archaea.eggnog