2012-11-05 outage root cause analysis

On 11/05/2012 Prezi had a four-hour outage. The root cause was an overloaded component in our system, the gargoyle feature switch manager. We didn't find the real cause for a long time because our monitoring system didn't cover network traffic.

Peter Halacsy

on 8 November 2012

Transcript of 2012-11-05 outage root cause analysis

Timeline

17:25
1Gb interface saturated
17:35
public interface saturated
17:23
prezi_adminstats_check_memcachd alert
API calls overloaded the system?
stopped quota update service
stopped index building service
stopped prezi modification updated
internal network issue?
18:10
calling our datacenter support line
18:30
19:15
S3 -> too much inbound traffic?
redis server has issues?
21:00
20:15
data center bans our outgoing traffic: we are overloading their switches
20:30
circumventing the ban
restarting outbound traffic debugging
found the cause of high traffic: CSS compressor
21:30
patching CSS compressor
23:30
anytime we restart the service, mysql and redis go down in less than a minute. This slows down debugging
reverting to older version
22:32
switching off gargoyle
21:39
site is up and running
21:36
trying to isolate the root cause
turning removed components back on one by one
slight hiccup but reverting back quickly
23:05
<1 min outage, found the root cause
removed in-development feature switches
cleaning up
01:30am

Root Cause Analysis

Site is very slow, if it works at all
gateway timeout exception
Prezi is down
mysql is very slow
can't connect to mysql
gargoyle's feature switch object is huge
memcached can't serve more requests, 1Gbit/sec interface is saturated
css compressor tries to load css files from Amazon S3 on every request
css compressor generates huge inbound traffic
outbound public interface saturated
gargoyle overloaded mysql
gargoyle can't reach memcached and falls back to mysql
biggest and constantly growing user traffic coming to prezi.com + internal loadbalancer interface saturated
redis connections are slow
dead end
22:00
going to bed
120 feature switches and 808 selective conditions are active
why didn't we recognize it? there was no alerting on network interface traffic
Minor Cause
Major Cause
site goes down
# of open mysql connections
memcached server interface traffic
beginning of the outage
we try to restart the site
site is running with patched gargoyle
fast growing traffic
network is saturated

Action Items

Implement our own feature switch managers
Add network monitoring
Separate components so they won't affect each other
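
The missing piece named in the analysis was alerting on network interface traffic. A minimal sketch of such a check follows, assuming byte counters are sampled periodically from each interface (on Linux they could be read from /proc/net/dev, for example). The names CounterSample, utilization and should_alert, and the 80% threshold, are hypothetical and chosen for illustration; they are not taken from Prezi's monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of an interface's cumulative byte counters."""
    timestamp: float  # seconds
    tx_bytes: int     # bytes sent so far
    rx_bytes: int     # bytes received so far


def utilization(prev: CounterSample, curr: CounterSample,
                link_bits_per_sec: float) -> float:
    """Return the busier direction's share of link capacity (0.0 to 1.0+)."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # A counter reset would make the delta negative; treat it as no traffic.
    tx_bits = max(0, curr.tx_bytes - prev.tx_bytes) * 8
    rx_bits = max(0, curr.rx_bytes - prev.rx_bytes) * 8
    return max(tx_bits, rx_bits) / elapsed / link_bits_per_sec


def should_alert(prev: CounterSample, curr: CounterSample,
                 link_bits_per_sec: float = 1_000_000_000,
                 threshold: float = 0.8) -> bool:
    """True when the interface is at or above the assumed threshold."""
    return utilization(prev, curr, link_bits_per_sec) >= threshold


if __name__ == "__main__":
    earlier = CounterSample(timestamp=0.0, tx_bytes=0, rx_bytes=0)
    later = CounterSample(timestamp=10.0, tx_bytes=1_125_000_000, rx_bytes=100)
    # 1,125,000,000 bytes in 10 s is 900 Mbit/s on a 1 Gbit/s link.
    print(should_alert(earlier, later))
```

A check like this, run against the memcached server's 1Gbit/sec interface, is the kind of signal that would fire when the link approaches saturation, before mysql and redis symptoms appear.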