[Figure: candidate syllables for the phrase "the weather today", each labelled with a match probability between 5% and 50%. Candidates shown: the, they, there; wet, whe, wea, way; her, there, therm, ther, ter; to, too; day, dey, they, days.]

Let's ask about

**?**

The Human Cell

A mathematical insight into DNA Sequencing

Anooj Dodhia • 27 Feb 2013

"the weather today"

the | wea | ther | to | day

Inference: what if there were no windows?

Rainy, Cold, Sunny

70%, 25%, 5%

45%, 45%, 10%

An introduction to DNA

vs

ATCG

AACC

GCAT

ATGC

AACG

Added information

The probability of a mutation between C & G is much higher than between any other pair of bases

Uniform prob.

Added info.

Reality
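A minimal sketch of how the added information could change a comparison between sequences. The function names, the mutation rate `p_mutate` and the weighting `cg_boost` are all assumptions made for illustration; none of the numbers come from the talk. Each base either stays the same or mutates, and the C/G pair can be given extra weight.

```python
import math

BASES = "ATCG"

def substitution_probs(p_mutate=0.1, cg_boost=1.0):
    """Return P(observed base | true base) as a nested dict.

    p_mutate and cg_boost are hypothetical parameters: a base mutates with
    probability p_mutate, and a C<->G swap gets cg_boost times the weight
    of any other swap. cg_boost=1.0 gives the uniform model.
    """
    probs = {}
    for a in BASES:
        weights = {
            b: (cg_boost if {a, b} == {"C", "G"} else 1.0)
            for b in BASES
            if b != a
        }
        total = sum(weights.values())
        probs[a] = {b: p_mutate * w / total for b, w in weights.items()}
        probs[a][a] = 1.0 - p_mutate
    return probs

def likelihood(reference, observed, probs):
    """Probability of seeing `observed` if `reference` mutated base by base."""
    if len(reference) != len(observed):
        raise ValueError("sequences must have the same length")
    return math.prod(probs[r][o] for r, o in zip(reference, observed))

uniform = substitution_probs(cg_boost=1.0)
boosted = substitution_probs(cg_boost=10.0)
for candidate in ["AACC", "GCAT", "ATGC", "AACG"]:
    print(candidate,
          likelihood("ATCG", candidate, uniform),
          likelihood("ATCG", candidate, boosted))
```

Under the uniform model AACC and ATGC score the same against ATCG, since both differ in two positions. With the C/G weighting, ATGC scores higher, because both of its differences are C/G swaps.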

But what if...

ATCGAATCGGTCTGAAGTCGATCGATTTGAC

TTAGAATCAGTATGAAGACGATCAATATGAG

TTGGTAACCGACAGTACTGGTTGGTTATCAG

We need a more precise probability framework

Markov Chains

"A memoryless random process"

State Space

S = {A,T,C,G}

S = {codons}

Random Variable (the chain)

X = {X(1), X(2), X(3), ...}, where X(t) is the state at step t

Transition Probabilities

The probability of the future depends only on the present, and not on the past:

P[X(t+1) | X(1), X(2), ..., X(t)] = P[X(t+1) | X(t)]

Denote P[X(t+1) = b | X(t) = a] := P(a,b)
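A short sketch of a Markov chain on S = {A,T,C,G}. The transition table `P`, the uniform starting distribution and the helper names are hypothetical values chosen for illustration. Because of the memoryless property, the probability of a whole sequence is the starting probability times one P(a,b) factor per step.

```python
import random

STATES = ["A", "T", "C", "G"]

# Hypothetical transition probabilities P(a, b); each row sums to 1.
P = {
    "A": {"A": 0.4, "T": 0.2, "C": 0.2, "G": 0.2},
    "T": {"A": 0.2, "T": 0.4, "C": 0.2, "G": 0.2},
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
    "G": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

# Assumed starting distribution for X(1).
INITIAL = {s: 0.25 for s in STATES}

def sequence_probability(seq, initial, P):
    """P[X(1)=seq[0]] multiplied by P(a, b) for each consecutive pair (a, b)."""
    prob = initial[seq[0]]
    for a, b in zip(seq, seq[1:]):
        prob *= P[a][b]
    return prob

def simulate(length, initial, P, rng):
    """Draw X(1), ..., X(length); each step looks only at the current state."""
    x = rng.choices(STATES, weights=[initial[s] for s in STATES])[0]
    chain = [x]
    for _ in range(length - 1):
        x = rng.choices(STATES, weights=[P[x][s] for s in STATES])[0]
        chain.append(x)
    return "".join(chain)

print(sequence_probability("ATCG", INITIAL, P))  # 0.25 * 0.2 * 0.2 * 0.4
print(simulate(12, INITIAL, P, random.Random(0)))
```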

Hidden Markov Model

Observed Markov Chain

Underlying Markov Chain

Probabilistic Calculation

Observed: sound-bites sent to Google's servers. Underlying: Google's interpretation of our voice.

Observed: genetic code of interest. Underlying: the sequence best matching our "observed" input.

Observed: umbrellas, coats & hats. Underlying: the weather outside.

But what can we do with this model?

The Forward-Backward Algorithm
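A rough sketch of the forward-backward idea on the weather example: given what people are seen wearing or carrying, estimate the probability of each kind of weather at each step. All numbers in `INITIAL`, `TRANSITION` and `EMISSION` are invented for illustration. The code skips the rescaling a real implementation would need for long sequences.

```python
STATES = ["Rainy", "Cold", "Sunny"]

# Hypothetical model parameters (not taken from the talk).
INITIAL = {"Rainy": 0.4, "Cold": 0.4, "Sunny": 0.2}
TRANSITION = {
    "Rainy": {"Rainy": 0.6, "Cold": 0.3, "Sunny": 0.1},
    "Cold": {"Rainy": 0.3, "Cold": 0.5, "Sunny": 0.2},
    "Sunny": {"Rainy": 0.2, "Cold": 0.2, "Sunny": 0.6},
}
EMISSION = {
    "Rainy": {"umbrella": 0.7, "coat": 0.2, "hat": 0.1},
    "Cold": {"umbrella": 0.1, "coat": 0.8, "hat": 0.1},
    "Sunny": {"umbrella": 0.05, "coat": 0.05, "hat": 0.9},
}

def forward_backward(observations, states, initial, transition, emission):
    """Return, for each step t, P(hidden state at t | all observations)."""
    n = len(observations)

    # Forward pass: alpha[t][s] = P(obs[0..t], state at t = s).
    alpha = [{s: initial[s] * emission[s][observations[0]] for s in states}]
    for t in range(1, n):
        alpha.append({
            s: emission[s][observations[t]]
            * sum(alpha[t - 1][r] * transition[r][s] for r in states)
            for s in states
        })

    # Backward pass: beta[t][s] = P(obs[t+1..] | state at t = s).
    beta = [dict.fromkeys(states, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(
                transition[s][r] * emission[r][observations[t + 1]] * beta[t + 1][r]
                for r in states
            )
            for s in states
        }

    evidence = sum(alpha[-1].values())  # P(all observations)
    return [
        {s: alpha[t][s] * beta[t][s] / evidence for s in states}
        for t in range(n)
    ]

for step in forward_backward(["umbrella", "coat", "hat"],
                             STATES, INITIAL, TRANSITION, EMISSION):
    print(step)
```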

The Viterbi Algorithm
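A minimal sketch of the Viterbi algorithm on the same weather example: instead of a probability per step, it returns the single most likely sequence of hidden states. The parameter tables are hypothetical, and every probability is assumed to be non-zero so that logarithms can be taken.

```python
import math

STATES = ["Rainy", "Cold", "Sunny"]

# Hypothetical model parameters (not taken from the talk); all non-zero.
INITIAL = {"Rainy": 0.4, "Cold": 0.4, "Sunny": 0.2}
TRANSITION = {
    "Rainy": {"Rainy": 0.6, "Cold": 0.3, "Sunny": 0.1},
    "Cold": {"Rainy": 0.3, "Cold": 0.5, "Sunny": 0.2},
    "Sunny": {"Rainy": 0.2, "Cold": 0.2, "Sunny": 0.6},
}
EMISSION = {
    "Rainy": {"umbrella": 0.7, "coat": 0.2, "hat": 0.1},
    "Cold": {"umbrella": 0.1, "coat": 0.8, "hat": 0.1},
    "Sunny": {"umbrella": 0.05, "coat": 0.05, "hat": 0.9},
}

def viterbi(observations, states, initial, transition, emission):
    """Return the most likely hidden-state path for the observations."""
    # best[t][s] = log-probability of the best path that ends in s at step t.
    best = [{
        s: math.log(initial[s]) + math.log(emission[s][observations[0]])
        for s in states
    }]
    back = []  # back[t][s] = predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(
                states,
                key=lambda r: best[-1][r] + math.log(transition[r][s]),
            )
            pointers[s] = prev
            scores[s] = (
                best[-1][prev]
                + math.log(transition[prev][s])
                + math.log(emission[s][obs])
            )
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

print(viterbi(["umbrella", "coat", "hat"], STATES, INITIAL, TRANSITION, EMISSION))
```

The same recursion applies to the speech and DNA settings by swapping in syllables or codons as the hidden states.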

Big Ideas

Probability

Inference

Markov Chain

Hidden Markov Models

Deoxyribonucleic acid

3,000,000,000 base pairs - 8 x dist(earth, sun)

e.g. heart cell structure or digestive enzymes

We are interested in a sequence of speech, which we split into states, or syllables (~3 letters), and use probability to match to the closest sequence in our database, made up of {a-z, A-Z, 0-9, symbols}

We are interested in a sequence of observed DNA, which we split into states, or codons (3 letters), and use probability to match to the closest sequence in our database, made up of {A, T, C, G, '-'}
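The splitting and matching steps can be sketched as follows. This is a deliberately simplified stand-in for the hidden Markov model: each codon is assumed to match independently with a made-up probability `p_match`, and the function names and database entries are hypothetical.

```python
def split_codons(dna):
    """Split a DNA string into consecutive 3-letter codons (the states)."""
    if len(dna) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]

def match_likelihood(observed, candidate, p_match=0.9):
    """Likelihood of `observed` given `candidate`, codon by codon.

    Assumption: each codon is read correctly with probability p_match,
    independently of the others.
    """
    obs, cand = split_codons(observed), split_codons(candidate)
    if len(obs) != len(cand):
        raise ValueError("sequences must contain the same number of codons")
    matches = sum(o == c for o, c in zip(obs, cand))
    return p_match ** matches * (1 - p_match) ** (len(obs) - matches)

def closest_sequence(observed, database, p_match=0.9):
    """Return the database entry with the highest match likelihood."""
    return max(database, key=lambda c: match_likelihood(observed, c, p_match))

# Hypothetical database entries, for illustration only.
database = ["ATCGAGTCG", "TTAGAATCA", "ATGGAATGG"]
print(split_codons("ATCGAATCG"))
print(closest_sequence("ATCGAATCG", database))
```

A fuller model would also score gaps ('-') and use different probabilities for different codon substitutions.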

**Any Questions?**