## GenomeTools Scripts

```
Author: Florian Wagner
Email: florian.wagner@duke.edu
```
This notebook demonstrates the use of the scripts `extract_protein_coding_genes.py` and `extract_entrez2gene.py` from the GenomeTools package.

In [1]:
# get the installed genometools version
from pkg_resources import require

# print() with a single argument behaves the same under Python 2 and Python 3
print('Package versions')
print('----------------')
print(require('genometools')[0])

Package versions
----------------
genometools 1.1.0


### Running `extract_protein_coding_genes.py`

In [2]:
gene_annotation_file = 'Homo_sapiens.GRCh38.79.gtf.gz'
protein_coding_gene_file = 'protein_coding_genes_human.tsv.gz'

In [3]:
!curl -o "$gene_annotation_file" \
        "ftp://ftp.ensembl.org/pub/release-79/gtf/homo_sapiens/Homo_sapiens.GRCh38.79.gtf.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.3M  100 42.3M    0     0  4800k      0  0:00:09  0:00:09 --:--:-- 9342k


In [4]:
# reading annotation file from stdin
!gunzip -c "$gene_annotation_file" | extract_protein_coding_genes.py -q -a - -o - | gzip > "$protein_coding_gene_file"
!gunzip -c "$protein_coding_gene_file" | head -n 10

A1BG	19	ENSG00000121410
A1CF	10	ENSG00000148584
A2M	12	ENSG00000175899
A2ML1	12	ENSG00000166535
A3GALT2	1	ENSG00000184389
A4GALT	22	ENSG00000128274
A4GNT	3	ENSG00000118017
AAAS	12	ENSG00000094914
AACS	12	ENSG00000081760
AADAC	3	ENSG00000114771

gzip: stdout: Broken pipe
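
The output file can also be inspected from Python rather than with `head`. The following is a minimal sketch, written for Python 3, that assumes the three tab-separated columns shown above are gene symbol, chromosome, and Ensembl gene ID (the column meanings are inferred from the output, not taken from the GenomeTools documentation). The helper names `read_protein_coding_genes` and `load_protein_coding_genes` are hypothetical and not part of GenomeTools.

```python
import csv
import gzip
import io


def read_protein_coding_genes(handle):
    """Parse tab-separated rows into (symbol, chromosome, ensembl_id) tuples.

    Assumption: the first three columns have the meaning suggested by the
    output shown above; any further columns are ignored.
    """
    reader = csv.reader(handle, delimiter='\t')
    return [tuple(row[:3]) for row in reader if row]


def load_protein_coding_genes(path):
    """Read a gzip-compressed gene file, e.g. 'protein_coding_genes_human.tsv.gz'."""
    with gzip.open(path, 'rt') as handle:
        return read_protein_coding_genes(handle)


# Self-contained demo using the first three rows of the output above.
sample = io.StringIO(
    "A1BG\t19\tENSG00000121410\n"
    "A1CF\t10\tENSG00000148584\n"
    "A2M\t12\tENSG00000175899\n"
)
genes = read_protein_coding_genes(sample)
for symbol, chromosome, ensembl_id in genes:
    print(symbol, chromosome, ensembl_id)
```

Reading the file this way also avoids the harmless `gzip: stdout: Broken pipe` message, which appears only because `head` closes the pipe after ten lines.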


In [5]:
# alternatively: reading the annotation file directly
!extract_protein_coding_genes.py -a "$gene_annotation_file" -o - | gzip > "$protein_coding_gene_file"

[2015-11-20 16:58:41] INFO: Regular expression used for filtering chromosome names: "(?:\d\d?|MT|X|Y)$"
[2015-11-20 16:58:41] INFO: Parsing data...
[2015-11-20 16:59:05] INFO: done (parsed 2720535 lines).
[2015-11-20 16:59:05] INFO: 
[2015-11-20 16:59:05] INFO: Gene chromosomes (25):
[2015-11-20 16:59:05] INFO: 	1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y
[2015-11-20 16:59:05] INFO: 
[2015-11-20 16:59:05] INFO: Excluded chromosomes (225):
[2015-11-20 16:59:05] INFO: 	CHR_HG126_PATCH, CHR_HG1362_PATCH, CHR_HG142_HG150_NOVEL_TEST, CHR_HG151_NOVEL_TEST, CHR_HG1832_PATCH, CHR_HG2021_PATCH, CHR_HG2030_PATCH, CHR_HG2058_PATCH, CHR_HG2066_PATCH, CHR_HG2095_PATCH, CHR_HG2104_PATCH, CHR_HG2191_PATCH, CHR_HG2217_PATCH, CHR_HG2232_PATCH, CHR_HG2233_PATCH, CHR_HG2247_PATCH, CHR_HG2288_HG2289_PATCH, CHR_HG2291_PATCH, CHR_HG986_PATCH, CHR_HSCHR10_1_CTG1, CHR_HSCHR10_1_CTG2, CHR_HSCHR10_1_CTG4, CHR_HSCHR11_1_CTG1_2, CHR_HSCHR11_1_CTG5, CHR_HSCHR11_1_CTG6, C
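
The chromosome filter reported in the log can be tried out in isolation. The sketch below copies the regular expression from the log line above and applies it to a few chromosome names that also appear in the log. It assumes the script anchors the pattern at the start of the name (as `re.match` does); this is an inference from the logged results, since names such as `CHR_HSCHR10_1_CTG1` end in a digit and would otherwise pass the filter. The function name `is_included` is hypothetical.

```python
import re

# Pattern copied verbatim from the INFO log line above.
CHROMOSOME_PATTERN = re.compile(r'(?:\d\d?|MT|X|Y)$')


def is_included(chromosome):
    """Return True if the chromosome name passes the filter.

    Assumption: the pattern must match from the start of the name
    (re.match), so only 1-2 digit names, MT, X and Y are accepted.
    """
    return CHROMOSOME_PATTERN.match(chromosome) is not None


names = ['1', '19', 'MT', 'X', 'Y', 'CHR_HG126_PATCH', 'CHR_HSCHR10_1_CTG1']
included = [name for name in names if is_included(name)]
excluded = [name for name in names if not is_included(name)]

print('included:', included)
print('excluded:', excluded)
```

Under this assumption, the patch and alternate-contig names are rejected because they begin with `CHR_`, which agrees with the "Excluded chromosomes" list in the log.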

### Running `extract_entrez2gene.py`

In [6]:
gene2accession_file = 'gene2accession_2015-05-26_human.tsv.gz'
entrez2gene_file = 'entrez2gene_human.tsv'

In [7]:
!curl -L -o "$gene2accession_file" \
        "https://www.dropbox.com/s/ggjrvnigtrfue3x/gene2accession_human_2015-05-26.tsv.gz?dl=1"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   506    0   506    0     0    610      0 --:--:-- --:--:-- --:--:--   622
100 15.6M  100 15.6M    0     0  3339k      0  0:00:04  0:00:04 --:--:-- 5525k


In [8]:
!extract_entrez2gene.py -f "$gene2accession_file" -o "$entrez2gene_file"
!head -n 10 "$entrez2gene_file"

[2015-11-20 16:59:19] INFO: Found 53831 Entrez Gene IDs associated with 53814 gene symbols.
[2015-11-20 16:59:19] INFO: Output written to file "entrez2gene_human.tsv".
1	A1BG
2	A2M
3	A2MP1
9	NAT1
10	NAT2
11	NATP
12	SERPINA3
13	AADAC
14	AAMP
15	AANAT
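
The log reports more Entrez Gene IDs (53831) than gene symbols (53814), which suggests that some symbols are associated with more than one ID. The following is a minimal Python 3 sketch for loading the mapping and listing such symbols. It assumes the file has no header row and exactly the two tab-separated columns shown above (Entrez ID, gene symbol); the function names are hypothetical and not part of GenomeTools.

```python
import csv
import io
from collections import defaultdict


def read_entrez2gene(handle):
    """Return a dict {entrez_id: symbol} from two-column tab-separated rows.

    Assumption: no header row, and the first column is an integer Entrez ID.
    """
    reader = csv.reader(handle, delimiter='\t')
    return {int(row[0]): row[1] for row in reader if row}


def symbols_with_multiple_ids(entrez2gene):
    """Return {symbol: [entrez_ids]} for symbols shared by several IDs."""
    ids_by_symbol = defaultdict(list)
    for entrez_id, symbol in sorted(entrez2gene.items()):
        ids_by_symbol[symbol].append(entrez_id)
    return {s: ids for s, ids in ids_by_symbol.items() if len(ids) > 1}


# Self-contained demo using the first rows of the output above.
sample = io.StringIO("1\tA1BG\n2\tA2M\n3\tA2MP1\n9\tNAT1\n")
entrez2gene = read_entrez2gene(sample)
print(entrez2gene[2])
print(symbols_with_multiple_ids(entrez2gene))
```

To run this on the real output, replace the sample with `open('entrez2gene_human.tsv')`.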
