GeneSyntenyDB
This will show you how to set up a workflow that allows gene synteny analysis on genomes stored at NCBI.
The protocol is divided into four major parts:
1) Database creation
2) Taxonomic annotation
3) Functional annotation
4) Gene synteny
This is still very much a work in progress, but feel free to submit a github issue or contact me at gisves at dtu dot dk
Table of contents
Database creation
The database is based on NCBI data and uses their ftp server. I use the non-redundant genomes found in refseq, but since the structure is the same, genbank shoould also work. First we need to download the relevant overview file called assembly_summary.txt. In this case we are interested in all archaeal genomes found in refseq:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/assembly_summary.txt
This is a tabulated table of all archaeal genomes and column 12 tells us if the genome is complete. We extract a list of complete genomes using awk:
awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > complete.list
Now we can download from NCBI using:
for next in $(cat complete.list); do wget -P Complete "$next"/*genomic.gbff.gz; done
We unpack and concatenate all files to make one big genbank file:
zcat Complete/*.gbff.gz > complete.gbk
To extract all CDS and attach a unique identifier that also includes gene order found in Genbank files (features are stored according to position) we use extractCDSfromGB.py
extractCDSfromGB.py -i complete.gbk > archaea.cds.faa
The output fasta file contains all actual CDS features of the Genbank file. Each fasta header consists of >ACCESSION NUMBER_PROTEIN ID_UNIQUE ID
Plasmids database creation
Plasmids are a bit different at NCBI, so it is done with minor modifications. We download all plasmids:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/plasmid.*.genomic.gbff.gz
Concatenate
zcat *.gbff.gz > complete.gbk
To extract all CDS and attach a unique identifier that also includes gene order found in Genbank files (features are stored according to position) we use extractCDSfromGB.py
extractCDSfromGB.py -i complete.gbk > archaea.cds.faa
The output fasta file contains all actual CDS features of the Genbank file. Each fasta header consists of >ACCESSION NUMBER_PROTEIN ID_UNIQUE ID
Taxonomic annotation using Genbank files (optional)
This is not necessary for many of the following gene synteny analysis, but included nonetheless. This can be done several ways. Here I’ll use the information from Genbank files that follow NCBI syntax.
extractTaxfromGB.py -i complete.gbk > archaea_taxonomy.tsv
Taxonomic annotation using Kaiju (optional)
This is not necessary for many of the following gene synteny analysis, but included nonetheless. This can be done several ways. For this workflow I chose Kaiju because I need sensitivity and speed more than precision. See their webpage for installation instructions and how to setup your Kaiju database.
kaiju -t nodes.dmp -f kaiju_db_nr_euk.fmi -i archaea.cds.faa -o archaea.cds.faa.kaiju -z 20 -p -v
Adding full taxon path to Kaiju output
kaiju-addTaxonNames -i archaea.cds.faa.kaiju -o archaea.cds.faa.kaiju.names -t nodes.dmp -n kaiju_db_nr_euk_names.dmp -p
Functional annotation
For functional annotation we will use the slow but excellent EggNOG-mapper. See their webpage for installation instructions and how to setup their database.
emapper.py -m diamond --cpu 10 -i archaea.cds.faa -o archaea.eggnog