gene ontology

Madelaine Gogol

on 26 May 2010

Transcript of GO

Gene Ontology: origin and recent application
Madelaine Gogol
Journal Club
April 22, 2010

Experimentally presented using prezi.com

GO consortium members

Berkeley Bioinformatics and Ontology Project (BBOP)
British Heart Foundation
Gene Ontology Annotation @ EBI
Mouse Genome Database (MGD) and Gene Expression Database (GXD)
Rat Genome Database (RGD)
Saccharomyces Genome Database (SGD)
The Arabidopsis Information Resource (TAIR)
Institute for Genome Sciences (IGS)
The J. Craig Venter Institute (JCVI)
Zebrafish Information Network (ZFIN)

GOAL: a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any eukaryotic organism.

For simplicity, not all known gene annotations have been included in the figures.

a, Biological process ontology. This section illustrates a portion of the biological process ontology describing DNA metabolism. Note that a node may have more than one parent; for example, ‘DNA ligation’ has three parents, ‘DNA-dependent DNA replication’, ‘DNA repair’ and ‘DNA recombination’.

b, Molecular function ontology. The ontology is not intended to represent a reaction pathway, but instead reflects conceptual categories of gene-product function. A gene product can be associated with more than one node within an ontology, as illustrated by the MCM proteins. These proteins have been shown to bind chromatin and to possess ATP-dependent DNA helicase activity, and are annotated to both nodes.

c, Cellular component ontology. The ontologies are designed for a generic eukaryotic cell, and are flexible enough to represent the known differences between diverse organisms.

Why three ontologies?

"Function" has been used to describe
biochemical activities (molecular function)
biological goals (biological process)
cellular structure (cellular component)

Who funds GO?

You do, through your tax dollars. Additionally...

Direct support - R01 grant from NHGRI (National Human Genome Research Institute, NIH)
Incyte Genomics
European Union
UK Medical Research Council

Participating databases:
GXD - National Institute of Child Health and Human Development
FlyBase - NHGRI
TAIR - National Science Foundation
WormBase - NHGRI
DictyBase - NIGMS

Evidence Codes

Experimental Evidence Codes
EXP: Inferred from Experiment
IDA: Inferred from Direct Assay
IPI: Inferred from Physical Interaction
IMP: Inferred from Mutant Phenotype
IGI: Inferred from Genetic Interaction
IEP: Inferred from Expression Pattern
Computational Analysis Evidence Codes
ISS: Inferred from Sequence or Structural Similarity
ISO: Inferred from Sequence Orthology
ISA: Inferred from Sequence Alignment
ISM: Inferred from Sequence Model
IGC: Inferred from Genomic Context
RCA: Inferred from Reviewed Computational Analysis
Author Statement Evidence Codes
TAS: Traceable Author Statement
NAS: Non-traceable Author Statement
Curator Statement Evidence Codes
IC: Inferred by Curator
ND: No biological Data available
Automatically-assigned Evidence Codes
IEA: Inferred from Electronic Annotation
Obsolete Evidence Codes
NR: Not Recorded

Long, highly expressed transcripts have more reads and hence more power to detect differential expression.

If you do a typical GO analysis, categories with long or highly expressed genes will show up more often.

Some GO categories have long genes, and some have short.

Three steps

1. Call DE genes however you want

2. Model DE as a function of transcript length (or read count)

3. Incorporate DE vs. length function into statistical test of each category's significance

"we used the exactTestNB() function with r=1e6 in the edgeR package to calculate a p-value based on a Poisson exact test. These p-values were then corrected for multiple hypothesis testing with a Benjamini-Hochberg adjustment"

Although we use a Poisson exact test to determine DE, GOseq will work with any DE methodology. To illustrate this, we used three additional methods to determine differentially expressed genes in the prostate cancer data set. The first method calls all genes with a log2 fold change in number of reads greater than 3 as DE. The second method only counts reads lying within the exons of a gene. The number of reads is then divided by the total number of reads for each sample and the length of exonic sequence within each gene and multiplied by 10^9, giving RPKM transformed data (reads per kilobase of exon model per million mapped reads) [19] for each gene. As a test for DE is not specified by Mortazavi A, et al. 2008, we log2 transformed the RPKM data and used Limma [32] to fit a linear model followed by an empirical Bayes adjustment of the variances and defined a FDR cutoff of 10% to call genes as DE. The third method we considered used a negative binomial distribution to model the counts for each gene and determine if the difference in counts between samples was statistically significant [16]. This is similar to the Poisson method we used for our main analysis, but with an additional parameter to account for overdispersion.

DE = log2 fold change in number of reads > 3

RPKM + limma

Negative binomial

Other methods of identifying DE genes also show length bias, even when using RPKM

Hey, there's a bias here.

Read Count Bias (paraphrased from supplemental)

DE analysis of RNA-seq data (using p-values)
is likely to detect many genes with high read counts and small fold-changes
requires larger fold-changes from genes with low read counts to achieve the same significance level

Using read count bias in the probability weighting function would
give more weight to lowly expressed, high fold-change transcripts
increase the likelihood of detecting transcription factors
give less weight to "housekeeping genes"
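
As a rough illustration of the RPKM transformation and the read count bias described above, here is a minimal Python sketch. It is not the edgeR exactTestNB() routine quoted in the talk: it substitutes a textbook conditional binomial test (two Poisson counts, conditioned on their total) and assumes equal sequencing depth in both samples by default. The function names, the `share_a` parameter, and the gene counts and lengths are all hypothetical.

```python
from math import exp, lgamma, log


def rpkm(exon_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon model per million mapped reads."""
    return exon_reads * 1e9 / (total_mapped_reads * exon_length_bp)


def poisson_exact_p(count_a, count_b, share_a=0.5):
    """Two-sided exact p-value for two Poisson counts.

    Conditional on the total, count_a is Binomial(total, share_a) under
    the null, where share_a is sample A's share of the sequencing depth
    (0.5 means equal depth, an assumption of this sketch).
    """
    n = count_a + count_b

    def log_prob(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(share_a) + (n - i) * log(1 - share_a))

    probs = [exp(log_prob(i)) for i in range(n + 1)]
    # Sum every outcome that is no more likely than the observed one;
    # the small tolerance absorbs floating-point ties.
    cutoff = probs[count_a] * (1 + 1e-7)
    return min(1.0, sum(p for p in probs if p <= cutoff))


# Two hypothetical genes with identical RPKM in the untreated sample:
# the long gene simply collects ten times as many reads.
depth = 10_000_000
short_rpkm = rpkm(10, depth, 1_000)
long_rpkm = rpkm(100, depth, 10_000)

# Both genes double their read count after treatment.
p_short = poisson_exact_p(10, 20)
p_long = poisson_exact_p(100, 200)

print("RPKM (short, long):", short_rpkm, long_rpkm)
print("p-value (short, long):", p_short, p_long)
```

Both genes have the same RPKM and the same two-fold change, yet the long gene gets a much smaller p-value because it has more reads. That is the sense in which length normalization alone does not remove the bias.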

"Ultimately, the decision to test for either read count bias or length bias should be determined by what is relevant to the user’s research..."
A spline is a special function defined piecewise by polynomials.

Monotonicity

Why'd they use this method?

"As the functional form of this dependence is unknown, we chose to use a cubic spline. We tested other functional forms for this fit, such as a generalized linear model, but they were found to perform more poorly at the extremes of the distribution."

Basically:

Randomly sample a set of genes the same size as the set of DE genes, weighting the chance of choosing each gene by its length or read count

Count the number of sampled genes associated with a GO term of interest

Repeat many times

Calculate a P-value for each GO category

But that takes too long, so they do something called the Wallenius approximation:

Extension of the hypergeometric distribution

"The mean of the probability weightings for each gene within a category is defined as the common probability of choosing a gene within that category."

Dramatic gain in computational efficiency.
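
The resampling procedure outlined above can be sketched in a few lines of Python. This is a minimal illustration, not the goseq implementation (goseq is an R package). The gene names, the weights, the add-one p-value correction, and the function names are assumptions made for the example. In the real method the weights would come from the fitted DE vs. length function, not from hand-picked numbers.

```python
import random


def weighted_draw(genes, weights, k, rng):
    """Draw k distinct genes, each draw proportional to the remaining weights."""
    pool = list(genes)
    pool_weights = [weights[g] for g in pool]
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(pool)), weights=pool_weights, k=1)[0]
        chosen.append(pool.pop(i))
        pool_weights.pop(i)
    return chosen


def resampling_p_value(genes, weights, category, n_de, observed,
                       n_iter=2000, seed=0):
    """Fraction of random gene sets with at least `observed` category genes."""
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_iter):
        sample = weighted_draw(genes, weights, n_de, rng)
        if sum(g in category for g in sample) >= observed:
            at_least += 1
    # Add-one correction (an assumption here) so the estimate is never zero.
    return (at_least + 1) / (n_iter + 1)


# Hypothetical setup: 40 genes, a GO category made of the 10 longest ones.
genes = [f"g{i}" for i in range(40)]
category = set(genes[:10])
uniform_weights = {g: 1.0 for g in genes}
length_weights = {g: (10.0 if g in category else 1.0) for g in genes}

# Suppose 10 genes were called DE and 6 of them fall in the category.
p_uniform = resampling_p_value(genes, uniform_weights, category, n_de=10, observed=6)
p_weighted = resampling_p_value(genes, length_weights, category, n_de=10, observed=6)

print("ignoring length:", p_uniform)
print("weighting by length:", p_weighted)
```

With uniform weights the category looks strongly enriched. Once long genes are given a higher chance of being drawn, the same overlap is much less surprising. Every category needs its own set of random draws, which is why the talk notes that resampling takes too long and an approximation is used instead.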
Results

Get rid of long-gene bias

Find short-gene enriched GO terms that didn't show up before

Find terms that better match the known biology

Take home messages

"It is mathematically indisputable that all commonly used criteria for judging DE interact with gene length and read count."

When analyzing RNA-seq data, we should take into account gene length and read count

http://bioinf.wehi.edu.au/software/goseq

Data set

Effects of androgen stimulation on a human prostate cancer cell line
androgen is thought to be responsible for the promotion of prostate cancer progression
enhances growth, cellular activity
in normal prostate, androgen supports the secretory epithelial cells

10 million untreated and 7 million treated 35-mers were mapped using bowtie; multi-reads were discarded

any read overlapping part of a gene counts for that gene

of limma fame

Supplemental

Gene ontology analysis for RNA-seq: accounting for selection bias

Gene ontology: tool for the unification of biology*

"GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following:

Knowledge changes and updates lag behind.
Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
GO does not attempt to describe every aspect of biology; its scope is limited to the domains described above."

Regular people can also contribute to GO via SourceForge

is composed of