Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Regular Expression

Under the hood of regular expression, and efficiency.
by

Chris Zheng

on 24 July 2013

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Regular Expression

Regular Expression
Engines
NFA vs. DFA
Tools:
Emacs, vi, Java, grep, awk, sed, Perl, PHP, Python, Ruby
Regex directed
vs.
Text directed
Example:

string="after tonight"
regex="to(nite|nighta|night)"
Tools:
awk, egrep, MySQL
Why not DFA?

capturing
ab(.+)fg abcdefg \1
lookaround
(?=Jeffrey)Jeff JeffreyJeff
non-greedy quantifiers
.? *? +?
ordered alternation
ab|cd|ef cd|ef|ab
about 24 characters long
re = 'ˆ.+([0-9][0-9])'

Copyright 2003
ˆ.+([0-9]+)
Match rules:
1. The Match That Begins Earliest Wins
2. The Standard Quantifiers Are Greedy
The dragging belly indicates your cat is too fat
re = 'cat'

The dragging belly indicates your cat is too fat.
re = 'fat|cat|belly|your'
tonight

to(ni(ght|te)|knight)
tonite|toknight|tonight
to(k?night|nite)
DFA matching is very fast.
DFA matching is very consistent.
Talking about DFA matching is very boring.
Backtracing:
First attempt greedy quantifiers, skip lazy quantifiers
Last In First Out
Pre-use compile:
DFA cost time and memory
Match speed:
DFA unrelated to the regex
What is matched:
DFA - longest leftmost match
NFA - might find something else
Chris Zheng
Coach Network
Wikipedia:

A nondeterministic finite state machine or nondeterministic finite automaton (NFA) is a finite state machine where for each pair of state and input symbol there may be several possible next states.

This distinguishes it from the deterministic finite automaton (DFA), where the next possible state is uniquely determined.

Although the DFA and NFA have distinct definitions, it may be shown in the formal theory that they are equivalent, in that, for any given NFA, one may construct an equivalent DFA, and vice-versa:
Alternation is expensive
u|v|w|x|y|z vs. [uvwxyz]

The name "McDonald’s" is said "makudonarudo" in Japanese
Start or end of string/line anchor optimization
^ $
Effienciency
Use non-capturing parentheses
(?: )
Expose:

th(?:is|at) vs. (?:this|that)
Look around:

This month is Jan|Feb...|Dec
(?=[JFMASOND])(?:Jan|Feb...|Dec)
Put the most likely alternative first

www.google.com www.wikipedia.org

(?:aero|biz|com|coop|...) to
(?:com|edu|org|net|...)
Techniques
IP Address
[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*
ˆ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
'then ......'
'1234.5678.9101112.131415’
ˆ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
'900.1.1.1'
'0.0.0.0'
^(?!0+\.0+\.0+\.0+$)
([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])$
^T.*

The name "McDonald’s" is said "makudonarudo" in Japanese
10.10.10.10
10.10.10.10
^([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])\.
([01]?\d\d?|2[0-4]\d|25[0-5])$
abcdefgh

ab(.+)e(.+)h
Full transcript