
AZ Orthofinder

Download

Repository: https://github.com/vladsaveliev/az_orthofinder

 

 

Installation

Just extract the archive. You will find scenario_1.py, scenario_2.py and test_input inside the extracted folder.
Note: you will need some third-party software installed on your system to run the tool; see the System Requirements section for details.
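For example, assuming the archive from the attachment at the bottom of this page:

unzip orthofinder_2014_02_03.zip
(the name of the extracted folder may differ; it should contain scenario_1.py, scenario_2.py and test_input)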

 

Scenario 1

The scenario_1.py script initializes a database of orthologous groups. It generates the resulting orthogroups.tsv and, for further extension by the second scenario, also produces the following intermediate results: 
— the proteomes directory with correctly adjusted proteomes, 
— the annotations directory with GB files from NCBI, 
— the intermediate/blasted.tsv file.
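After a successful run, the output directory (the -o argument) should therefore look roughly as follows; this layout is inferred from the list above, the exact contents of the subdirectories depend on the input, and log.txt is the run log mentioned in the "Starting from a step" section below:

output_test_proteomes/
    orthogroups.tsv
    proteomes/
    annotations/
    intermediate/blasted.tsv
    log.txt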
 
Usage examples:
1. Fasta protein files (optionally with annotations from Prodigal). See the example in test_input/proteins. Filenames will be taken as taxon codes. The tool uses the Internet to download GB annotations; if the Internet is unavailable, a short version of the output will be produced, containing only taxon|protein ids.

./scenario_1.py --proteomes test_input/proteins -o output_test_proteomes

2. GB annotation files (Example: test_input/gbs).

./scenario_1.py --gbs test_input/gbs -o output_test_gbs

3. A list of reference ids (Example: test_input/ids.txt). The tool will download the references from GenBank; this step requires an Internet connection.

./scenario_1.py -i test_input/ids.txt -o output_test_ids
 
4. A list of species names (Example: test_input/species.txt). In this case, the tool will search the NCBI server with a query built from the species name.
Note that you need to specify the full species name, e.g. Escherichia coli (not E. coli).
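A usage example analogous to the ones above (the output directory name here is illustrative):

./scenario_1.py --species test_input/species.txt -o output_test_species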
 
In particular, we processed E. coli and K. pneumoniae this way, using species-list files from the test_input directory:
kpneumonia_list.txt, containing the single line:
Klebsiella pneumoniae
The command line used: 

./scenario_1.py --species test_input/kpneumonia_list.txt -o kpneumonia

 
ecoli_list.txt, containing the following:
Escherichia coli
Command line: 
./scenario_1.py --species test_input/ecoli_list.txt -o ecoli
 
 

Scenario 2

The scenario_2.py script is meant to extend orthogroups. It must be run on a directory produced by the scenario_1.py script, since it extends its blasted.tsv and also reuses the proteomes and GB annotations.
Existing blast results are reused because the all-against-all blast is the most time-consuming step. 
 
The blast results could, in principle, be stored between runs in a database table; nevertheless, we decided not to rely on SQL here, because we think a plain file is clearer for users:
1. Users don't have to remember the names of their tables, so they won't overwrite or damage something important;
2. The database can be cleaned up after any tool run;
3. The results are easier to send between computers.
 
There are two possible types of scenario_2 workflow, depending on the input.
1. Input is a list of reference IDs / GIs / organism names / GB annotation files.
In this case, the resulting intermediate files and orthogroups.tsv will be the same as if you had run scenario_1.py on the larger, combined input.
 
2. Input is assemblies or proteomes generated by Prodigal (for assemblies, proteomes will be generated automatically with Prodigal). After running scenario_2 on this input, orthogroups.tsv is generated, and each orthogroup is then processed based on its kind:
1. A group that contains only annotated genes is not processed any further.
2. A group that contains both annotated and unknown genes is not processed further either, since the unknown genes can potentially be curated manually based on the annotated ones.
3. If a group contains only unknown genes, it is saved into a fasta file inside the blasted_singletones directory. Then, for each such group, one of its proteins is blasted against the public NCBI database (a local database can also be provided with the --blast-db option; otherwise, an Internet connection is used).
For each group, an XML file with the blast results is generated, and the best hits are printed to the output.
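To inspect the results of this step, something like the following can be used; the exact location of blasted_singletones and of the XML files inside the scenario_2 output directory is an assumption:

ls test_prots_new_prots/blasted_singletones/        (per-group fasta files; path is an assumption)
ls test_prots_new_prots/blasted_singletones/*.xml   (per-group blast reports; path is an assumption)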
 
Usage examples:
Appending an additional list of files (fasta, gb) to an existing output of scenario 1:

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_prots --proteomes test_input/new_proteins

You can pass assemblies instead; in this case they will be automatically annotated with Prodigal.

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_assemblies --assemblies test_input/assemblies

Or a list of reference ids (accession numbers or GIs):

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt

The existing directory (the -s1o argument) must contain an intermediate subdirectory with a blasted.tsv file and a proteomes folder from a scenario_1 run.
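A quick way to check this before running (directory name taken from the examples above):

ls test_proteomes/intermediate/blasted.tsv
ls test_proteomes/proteomes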

 

The last step is blasting new genes that didn't match any group against the NCBI database. By default, the remote database is used, but you may prefer a local one, specified with the --blastdb option. On chara: 

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt --blastdb /gpfs/group/infection_translation/orthoMCL/app/refseq-proteins/refseq_protein

 

Starting from a step

There is an optional command-line argument --start-from. It skips the earlier steps of the pipeline and starts right from the specified step. Step names can be taken from log.txt in the results folder.
./scenario_1.py -o output_test_proteomes --start-from "Parsing blast results"
or
./scenario_1.py -o output_test_proteomes --start-from 7
 
scenario_1.py steps:
1. Preparing proteomes and annotations
2. Filtering proteomes
3. Making blast database
4. Blasting
5. Parsing blast results
6. Cleaning database
7. Installing schema
8. Loading blast results into the database
9. Finding pairs
10. Dump pairs files
11. MCL
12. Saving orthogroups 
 
scenario_2.py steps:
1. Preparing input
2. Filtering new proteomes
3. Filtering proteomes
4. Making blast database
5. Blasting
6. Parsing blast results
7. Cleaning database
8. Installing schema
9. Loading blast results into the database
10. Finding pairs
11. Dumping pairs files
12. MCL
13. Saving orthogroups
14. Blasting singletones
 
Fine tuning
--prot-id-field     Fields are separated by either a bar or a space. For example, with --prot-id-field 1, fasta ids like >NC_005816.1|NP_995567.1 will lead to the protein id NP_995567.1
 
--min-length        Minimum allowed length of proteins (default: 10)
--max-percent_stop  Maximum percent of stop codons (default: 20)
--evalue            Blast e-value (default: 1e-5)

-t                  Number of threads (default: 30)

-w                  Overwrite output directory if it exists.
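A combined example using several of these options (the output directory name is illustrative):

./scenario_1.py --proteomes test_input/proteins -o output_custom --min-length 20 --evalue 1e-10 -t 8 -w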
 
 

System Requirements

The tool needs the following software installed on your system:
  • Python 2.7
  • BLAST
  • MySQL
  • MySQL Perl modules (note that src/mysql.cnf is a path from the root of the tool (az_orthofinder/src/mysql.cnf)): 

$ perl -MCPAN -e shell

cpan> o conf makepl_arg "mysql_config=src/mysql.cnf"

cpan> install Data::Dumper

cpan> install DBI

cpan> force install DBD::mysql
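To check that the modules load correctly afterwards:

$ perl -MDBI -MDBD::mysql -e 'print "modules OK\n"'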

 

The tool also requires a MySQL user orthomcl with password 1234, and a database orthomcl with the necessary privileges granted to that user. This can be set up as follows:
$ mysqld --port=3307 &   (if the MySQL server is not running)
$ mysql -u root -p
mysql> CREATE DATABASE orthomcl;
mysql> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE VIEW, CREATE, INDEX, DROP on orthomcl.* TO orthomcl@localhost;
mysql> set password for orthomcl@localhost = password("1234");
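You can then verify that the tool will be able to connect with the credentials configured above:

$ mysql -u orthomcl -p1234 orthomcl -e "SELECT 1;"
(add --port=3307 and -h 127.0.0.1 if the server listens on a non-default port)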
 
If you have any problems setting up or running the MySQL server, please let us know.
 
 

 

Attachment: orthofinder_2014_02_03.zip (12.92 MB)