Architecting Availability on AWS

Slides for the AWS NYC meetup, 2013 Jan 31
by Tim Gross, on 16 June 2013


Architecting Availability

Diagram: EC2 instance (EBS volume) running Apache + Django/Python; EC2 instance (EBS volume) running PostgreSQL
Diagram: ELB; EC2 instances w/ EBS running Nginx, Django/Python, memcached; RDS master (Multi-AZ) and RDS slave; multiple-AZ deployment
Diagram (growth): ELB; EC2 instances w/ EBS running Nginx, Django/Python, memcached; RDS master (Multi-AZ) and RDS slave; manager node; S3; image servers (EC2 instances w/ EBS running Nginx); request paths labeled "from CDN" and "from web"
EC2 instance: mysqldump + snapshot
Minimal bootstrap environment for multi-AZ disaster recovery
<2 hr SLA target
S3 storage is eventually consistent across regions
#1. Get multi-AZ and multi-region
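S3 reads may not immediately reflect a recent write, as the eventual-consistency note above points out. A restore script in the recovery environment should therefore tolerate "not there yet." The sketch below retries with exponential backoff. `fetch` is a hypothetical stand-in for whatever actually reads the backup (for example a boto S3 get), and it is assumed to return `None` while the object is not yet visible. The backup key is also made up.

```python
import time


def read_with_backoff(fetch, key, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fetch(key) until it returns something other than None.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    raises TimeoutError if the object never becomes visible.
    """
    delay = base_delay
    for attempt in range(attempts):
        value = fetch(key)
        if value is not None:
            return value
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(delay)
            delay *= 2
    raise TimeoutError(f"{key!r} not visible after {attempts} attempts")


# Usage with a fake store that only "sees" the object on the third read.
calls = {"n": 0}


def fake_fetch(key):
    calls["n"] += 1
    return b"dump" if calls["n"] >= 3 else None


waits = []  # record the delays instead of really sleeping
data = read_with_backoff(fake_fetch, "backups/db.sql.gz", sleep=waits.append)
```

Injecting `sleep` keeps the retry policy testable without real waiting.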
Diagram: ELB; EC2 instances w/ EBS running Nginx, Django/Python; RDS master (Multi-AZ) and RDS slave; multiple-AZ deployment; S3; image servers (EC2 instances w/ EBS running Nginx); request paths labeled "from CDN" and "from web"; write-thru ElastiCache cluster (m1.medium nodes)
High I/O = 100mbs
Moderate I/O = 30mbs
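The "write-thru" label on the cache cluster suggests that writes update the cache at the moment they are saved. Below is a minimal sketch of the write-through pattern. Plain dicts act as hypothetical stand-ins for the memcached/ElastiCache client and the database, because the slides don't show the real client API.

```python
class WriteThroughCache:
    """Write-through caching: every write goes to the backing store and
    the cache in the same call, so reads can usually be served from cache."""

    def __init__(self, cache, store):
        self.cache = cache  # stand-in for a memcached/ElastiCache client
        self.store = store  # stand-in for the database

    def set(self, key, value):
        self.store[key] = value  # persist first, so the cache never holds unsaved data
        self.cache[key] = value  # then refresh the cached copy

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit
        value = self.store.get(key)  # cache miss: fall back to the database
        if value is not None:
            self.cache[key] = value  # repopulate, e.g. after a node was replaced
        return value


cache, db = {}, {}
wt = WriteThroughCache(cache, db)
wt.set("user:42:name", "Ada")
cache.clear()  # simulate losing a cache node; the data survives in the DB
```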
#2. Measure all the things!
CloudWatch is not enough
(plus it sucks)
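The slides don't say what replaced or supplemented CloudWatch. One illustrative option, assumed here and not stated by the author, is to emit application metrics in the plain-text StatsD line format over UDP. This sketch shows that approach. The host, port and metric names are hypothetical.

```python
import socket
import time
from contextlib import contextmanager


def statsd_line(name, value, kind):
    """Format one metric in the StatsD text protocol, e.g. 'web.requests:1|c'."""
    if kind not in ("c", "ms", "g"):  # counter, timer, gauge
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


class Metrics:
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, line):
        # UDP is fire-and-forget: with no collector listening, the datagram
        # is simply dropped; any socket error is swallowed so metrics never
        # take the app down.
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass

    def incr(self, name, n=1):
        self.send(statsd_line(name, n, "c"))

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = int((time.monotonic() - start) * 1000)
            self.send(statsd_line(name, elapsed_ms, "ms"))


metrics = Metrics()
metrics.incr("web.requests")
with metrics.timer("web.render"):
    pass  # handle the request here
```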
Diagram: web servers (EC2 instances running Nginx + Django); RDS master (Multi-AZ) and RDS slave; multiple-AZ deployment; S3; image servers (EC2 instances running Nginx); request path labeled "from CDN"; write-thru ElastiCache cluster (m1.large nodes); three ELBs
Tim Gross
Dev/Ops
Architecting Availability on AWS | 2013 Jan 31
Millions of Users
Working here is awesome
We're hiring!
DevOps engineer
IT/Ops engineer
(and always open to talking to talented software engineers)
Multiple AZs
ElastiCache
"Assembling" availability
Lessons learned and stories
Constraints:
High uptime, low latency (duh)
Cyclical traffic load
Write-heavy load (analytics)
Build vs Buy Threshold:
2-week feature sprints
Not enough time and staff
Disclaimer:
Works for us.
Start-up: everything done yesterday
Perfect is the enemy of good.
Numbers in mirror may be larger than they appear
In the beginning...
Intro
EBS
HAProxy: CNAME to multiple Elastic IPs
Static content (.js, .css, .png, etc.)
logrotate
#3. No EBS:
Forces you to treat instances as disposable

Implications:
The least-connections ELB algorithm applies per zone, then per instance, so you need an equal number of instances in each AZ.
EBS volumes are tied to a single AZ.
Don't worry about DB latency between zones; it's negligible.
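To see why equal instance counts matter, here is a small idealized model. It assumes traffic is first split evenly across zones and then evenly across the instances in each zone, and it is not ELB's exact algorithm. The zone names are hypothetical.

```python
from fractions import Fraction


def per_instance_share(instances_per_zone):
    """Share of total traffic each instance gets when traffic is split evenly
    across zones first, then evenly across the instances within a zone.

    An idealized model of the zone-then-instance behavior, not ELB's exact algorithm.
    """
    zones = {zone: n for zone, n in instances_per_zone.items() if n > 0}
    zone_share = Fraction(1, len(zones))
    return {zone: zone_share / n for zone, n in zones.items()}


# 2 instances in one zone and 4 in the other: each instance in the smaller
# zone carries twice the load of each instance in the larger zone.
unbalanced = per_instance_share({"us-east-1a": 2, "us-east-1b": 4})
balanced = per_instance_share({"us-east-1a": 3, "us-east-1b": 3})
```

In the unbalanced case each `us-east-1a` instance carries 1/4 of total traffic, while each `us-east-1b` instance carries only 1/8.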
When EBS goes down, it goes down *hard*
We've had:
8hr+ recovery times
Individual volumes that *never* came back from I/O failure
RDS, ELB, and ElastiCache all run on EBS volumes, but...
RDS has its own (fairly good) recovery system and multiple-AZ fail-over.
ELBs can be replaced quickly. It'd be even better if we could assign Elastic IPs to them. =(
ElastiCache nodes are moderately disposable so long as you're using consistent hashing.
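How disposable a cache node is depends on how clients map keys to nodes. With consistent hashing, a common technique in memcached clients (the slides don't name the client used here), losing one node only remaps the keys that lived on that node. The rest of the cache stays warm. Below is a minimal sketch of a hash ring. The node names are hypothetical, and MD5 is used only for key distribution, not for security.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns many points on a circle, and a
    key belongs to the first node point clockwise from the key's hash."""

    def __init__(self, nodes, points_per_node=100):
        self.points_per_node = points_per_node
        self._positions = []  # sorted hash positions on the ring
        self._owner = {}      # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(text):
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.points_per_node):
            pos = self._hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        for i in range(self.points_per_node):
            pos = self._hash(f"{node}#{i}")
            del self._owner[pos]
            self._positions.remove(pos)

    def node_for(self, key):
        # First ring position after the key's hash, wrapping around the end.
        idx = bisect.bisect(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("cache-3")  # simulate losing one cache node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that were on `cache-3` move. With naive `hash(key) % len(nodes)` placement, most keys would move when the node count changes.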
But what about...?
Diagram: SQS (async tasks); workers: EC2 instances running Django + Celery
#4. Design for graceful failure for SQS and DynamoDB

SQS and DynamoDB APIs are just HTTP: "If it locks, it blocks!"

The SQS paradox:
SQS is for asynchronous tasks.
Calls are potentially blocking.

Use non-blocking libraries for API calls, or make the calls off the main thread.
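The slides used Eventlet green threads for this. As a standard-library-only alternative (a swapped-in technique, not the author's code), this sketch hands the blocking call to a thread pool and returns immediately. Failures are recorded for a later retry instead of stalling the caller. `send` stands in for whatever function makes the actual SQS or DynamoDB request. The `failed_sends` spool is hypothetical.

```python
import logging
import queue
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger(__name__)

# Small pool of worker threads that absorb blocking HTTP calls.
_executor = ThreadPoolExecutor(max_workers=4)

# Payloads whose send failed, kept for a later retry (hypothetical local spool).
failed_sends = queue.Queue()


def send_in_background(send, payload):
    """Run a potentially blocking call (such as an SQS send) on a worker
    thread and return a Future immediately, so the caller never waits."""
    future = _executor.submit(send, payload)

    def _record_failure(fut):
        exc = fut.exception()  # the future is already done here, so this doesn't block
        if exc is not None:
            log.warning("background send failed: %r", exc)
            failed_sends.put(payload)

    future.add_done_callback(_record_failure)
    return future
```

The request path only pays for `submit()`. A slow or unreachable endpoint ties up a pool thread, not the web worker.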
Diagram: SQS; DynamoDB analytics data store; one EC2 c1.medium instance running Eventlet + boto (green threads, batch process)
Diagram: SQS; DynamoDB analytics data store; 20 EC2 m1.large instances running Celery (blocking, non-batch process)
Original implementation: this sucks, don't do this
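One plausible reason the batch design needs far less hardware is that it avoids one blocking write per message. Events are aggregated in memory and then written in batches. The event shape (`{"key": ...}`), the counter schema and `write_batch` are hypothetical. The 25-item cap matches DynamoDB's BatchWriteItem limit.

```python
from collections import Counter
from itertools import islice

# DynamoDB's BatchWriteItem accepts at most 25 put/delete requests per call.
BATCH_WRITE_LIMIT = 25


def chunks(items, size=BATCH_WRITE_LIMIT):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def flush_events(events, write_batch):
    """Collapse raw events into one counter item per key, then write the
    items in batches instead of issuing one request per event.
    Returns the number of write calls made."""
    counts = Counter(event["key"] for event in events)
    items = [{"key": key, "count": n} for key, n in sorted(counts.items())]
    calls = 0
    for chunk in chunks(items):
        write_batch(chunk)
        calls += 1
    return calls
```

For example, 60 events spread over 30 keys become 30 counter items. Those go out in 2 write calls (25 + 5) instead of 60.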
"Oh, come on, when has ELB ever--?"
ELB advantages:
- Good uptime
- Good throughput
- Low maintenance

ELB disadvantages:
- No application awareness
- When it's down, you're *down*

Crossing the build-vs-buy threshold
#5. When ELB goes down, now what?
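ELB health checks only see whether a URL answers. One way to give the load balancer some application awareness is a health endpoint that probes the app's own dependencies and returns 503 when any of them fail. This is an illustrative sketch, not what the slides describe. The check names and callables are hypothetical, and the app is plain WSGI so it could sit behind Nginx or be mounted in Django.

```python
import json


def make_health_app(checks):
    """Build a WSGI app that reports 200 only if every dependency check passes.

    `checks` maps a name to a zero-argument callable returning True when
    that dependency (database, cache, queue, ...) is usable.
    """

    def app(environ, start_response):
        results = {}
        for name, check in checks.items():
            try:
                results[name] = bool(check())
            except Exception:  # a failing check must not crash the endpoint
                results[name] = False
        status = "200 OK" if all(results.values()) else "503 Service Unavailable"
        body = json.dumps(results, sort_keys=True).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return app
```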