You can fix everything

InfoShare 2014
by Gábor Vészi on 26 August 2015

Transcript of You can fix everything

YOU CAN FIX EVERYTHING

Gábor Vészi
@veszig
2010
now
DURING THE OUTAGE
POST MORTEM

10 STAND UP
20 UNDERSTAND
30 FOCUS
40 GOTO 10
my first month
October 6
SHOW & TELL
October 1
I'VE JOINED
ON AVERAGE MONTHLY ~80 MINUTES
REAL TITLE:
OUTAGES & MINDSET

What is an outage?
PRIO1: WHEN 3 PEOPLE CAN'T PRESENT
Why is it hard to fix outages?
PEOPLE FORCED BEYOND LEARNED ROLES
TOO MUCH INFORMATION & IRRELEVANT NOISE
HEROISM
FIGHT THE PRIO1
SOLVE ALL YOUR PROBLEMS IN 4 EASY STEPS
DOUBT
TRUST
EASY!
FIGHT THE BYSTANDER EFFECT
http://www.businessinsider.co.id/communication-charts-around-the-world-2014-3/
Business Insider: 25 Fascinating Charts Of Negotiation Styles Around The World
(from Richard D. Lewis: When Cultures Collide)
MEASURE AND DISCUSS
SOLVE ONE SMALL ISSUE AT A TIME AND BE PERSISTENT
UNTIL THERE ARE PROBLEMS (NOT JUST IT)
11 OUTAGES IN 18 DAYS
October 13
SHOW & TELL
MINDSET
APRIL 30: OUR BIGGEST OUTAGE IN 1.5 YEARS
https://code.google.com/p/memcached/issues/detail?id=141
sentry
WHAT IS OUR GOAL AFTER A PRIO1?
POST MORTEM DOCUMENT & MEETING
GOAL: LEARN AND IMPROVE
NO BLAMING!
SWISS CHEESE MODEL
CONCRETE ACTION ITEMS
FIND PATTERNS & UNDERSTAND PRIORITIES
PRIO1 LIST
37 minutes in 6 months
(this involves a 24 minute auth outage)
WHAT IS THE BIGGEST LEARNING?
OWNERSHIP
MICROSERVICE ARCHITECTURE
DEVS ON CALL
SAME PROCESS WORKS FOR OTHER STUFF AS WELL
http://heroicimagination.org/
MILITARY MODE
START
all hands on deck
pick “commanding officer”
announce
pick communications officer
COMMUNICATE WITH THE COMPANY
5 minutes into the outage
memcached bug
https://bugs.launchpad.net/percona-server/+bug/893348
mysql bug
turn off features
revert everything
restart everything
pinpoint exactly what is slow
let's rewrite the Django database layer in production
isolate important parts of the site
issue solved...
4 days later...
OBSERVE
ORIENT
DECIDE
ACT
&
FIX YOUR OUTAGES THEN THE WORLD!