2012-11-05 outage root cause analysis
Transcript of 2012-11-05 outage root cause analysis
- restarting
- outbound traffic debugging
- found the cause of high traffic: CSS compressor
- 21:30
- patching CSS compressor
- 23:30
- anytime we restart the service, MySQL and Redis go down in less than a minute. This slows down debugging
- reverting to older version
- 22:32
- switching off Gargoyle
- 21:39
- site is up and running
- 21:36
- trying to isolate the root cause
- turning removed components back on one by one
- slight hiccup, but reverting back quickly
- 23:05
- <1 min outage, found the root cause
- removed in-development feature switches
- cleaning up
- 01:30am

Root Cause Analysis
- Site is very slow, if it works at all
- gateway timeout exception
- Prezi is down
- MySQL is very slow
- can't connect to MySQL
- Gargoyle's feature switch object is huge
- memcached can't serve more requests, 1Gbit/sec interface is saturated
- CSS compressor tries to load CSS files from Amazon S3 on every request
- CSS compressor generates huge inbound traffic
- outbound public interface saturated
- Gargoyle overloaded MySQL
- Gargoyle can't reach memcached and falls back to MySQL
- biggest and constantly growing user traffic coming to prezi.com
- + internal load balancer interface saturated
- Redis connections are slow
- dead end
- 22:00
- going to bed
- 120 feature switches and 808 selective conditions are active
- Why didn't we recognize it? There was no alerting on network interface traffic

Minor Cause
Major Cause

- site goes down
- # of open MySQL connections
- memcached server interface traffic
- beginning of the outage
- we try to restart the site
- site is running with patched Gargoyle
- fast growing traffic
- network is saturated

Action Items
- Implement our own feature switch manager
- Add network monitoring
- Separate components so they won't affect each other
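
The missing alerting on network interface traffic could be covered by something as small as the sketch below. It is only an illustration, not what was deployed: it assumes a Linux host, reads the byte counters from /proc/net/dev, and compares the busier direction against the 1 Gbit/sec link speed mentioned above. The 80% threshold, the 10-second interval, and all function names are assumptions made for the example.

```python
import time

LINK_BITS_PER_SEC = 1_000_000_000  # 1 Gbit/sec, the interface speed named in the analysis
ALERT_THRESHOLD = 0.8              # assumed: alert at 80% utilization


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def utilization(before, after, seconds, link_bits_per_sec=LINK_BITS_PER_SEC):
    """Return {interface: fraction of link capacity used} between two samples."""
    result = {}
    for name, (rx1, tx1) in after.items():
        if name not in before:
            continue
        rx0, tx0 = before[name]
        # Each direction has its own capacity on a full-duplex link,
        # so report whichever direction is busier.
        busiest_bytes = max(rx1 - rx0, tx1 - tx0)
        result[name] = busiest_bytes * 8 / (seconds * link_bits_per_sec)
    return result


def saturated_interfaces(usage, threshold=ALERT_THRESHOLD):
    """Return the interfaces whose utilization is at or above the threshold."""
    return sorted(name for name, used in usage.items() if used >= threshold)


def read_counters(path="/proc/net/dev"):
    """Read and parse the current interface counters (Linux only)."""
    with open(path) as f:
        return parse_proc_net_dev(f.read())


def monitor(interval=10):
    """Sample the counters every `interval` seconds and print an alert line."""
    previous = read_counters()
    while True:
        time.sleep(interval)
        current = read_counters()
        usage = utilization(previous, current, interval)
        for name in saturated_interfaces(usage):
            print(f"ALERT: {name} is above {ALERT_THRESHOLD:.0%} of link capacity")
        previous = current
```

Calling monitor() on the memcached and load balancer hosts would print a line whenever an interface gets close to saturation. In practice the print would be replaced by whatever alerting channel is already in use.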