Jedberg - Scaling Reddit
1 > 2 > 3
S3 for logos
S3 for thumbnails
EC2 for batch processing
EC2 for entire site
Needed an easy way to distribute and upload our logo.
Didn’t want to rent another cabinet
Didn’t want to buy more servers
Reddit Gold is Launched
Why am I here?
WHAT LED REDDIT TO AWS
Is it necessary to build a scalable architecture from the beginning?
Why should we learn from
If it won’t scale, it'll fail.
The key to scaling is
finding the bottlenecks
before your users do.
Way back in 2005...
They were called back
Two UVA students applied for this thing called Y Combinator
They were rejected
Imaging and Racking Servers Is a
Reddit moved from self hosting to EC2
EC2 for Overflow
Used OpenVPN to create a secure link to our datacenter for batch processing
Started by migrating all data
Got a complete stack running on EC2
Long Friday night finishing the migration and “forklifting” the last bits
Based on Amazon public pricing, reddit open source code, and public configuration information
Motivators for moving to EC2
Outgrew data center
EC2 makes things easier, but isn’t a magic bullet.
The higher network latency and noisy neighbors will be problematic -- expect to work around them.
Scaling on EC2 is a lot like anywhere else, but you need to be more disciplined.
Webserver or Proxy?
What about event driven and non-blocking web servers?
Good for long connections
More complicated to start, but scales better
To prevent anyone from consuming too much, all resources have per-account limits. Keep track of them and get them raised before you need them. Make sure to catch the limit-exceeded exceptions too.
Keep track of those limits!
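A minimal sketch of tracking limits and catching the exception, in Python. Everything here is hypothetical: `QuotaTracker`, `QuotaExceeded`, `launch_instances` and the 80% warning threshold are illustrative names and values, not part of any AWS SDK and not reddit's code.

```python
class QuotaExceeded(Exception):
    """Raised when a request would go past the account limit."""


class QuotaTracker:
    """Tracks usage against per-account limits (all values are assumptions)."""

    def __init__(self, limits, warn_ratio=0.8):
        self.limits = dict(limits)                  # resource -> maximum allowed
        self.usage = {name: 0 for name in self.limits}
        self.warn_ratio = warn_ratio                # when to ask for a raise

    def acquire(self, resource, count=1):
        new_total = self.usage[resource] + count
        if new_total > self.limits[resource]:
            raise QuotaExceeded(f"{resource}: {new_total} > {self.limits[resource]}")
        self.usage[resource] = new_total

    def near_limit(self):
        """Resources that should have their limit raised soon."""
        return sorted(
            name for name, limit in self.limits.items()
            if self.usage[name] >= self.warn_ratio * limit
        )


def launch_instances(tracker, count):
    """Catch the limit error instead of letting it take the caller down."""
    try:
        tracker.acquire("instances", count)
        return True
    except QuotaExceeded:
        return False
```

Checking `near_limit()` from a monitoring job is one way to get limits raised ahead of time rather than finding out during a traffic spike.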
Relying on a single cloud product and expecting it to work as advertised
Bleeding edge in production
Cassandra wasn’t always perfect
No data loss, but it was a pain sometimes
Automate all the things!
Mistakes we've made
What is Reddit?
Reddit is an online community
June 23rd, 2005
Advantages to a Service Oriented Architecture
Easier capacity planning
Identify problematic code-paths more easily
Narrow down the effects of a change
More efficient local caching
Disadvantages to a Service Oriented Architecture
Postgres is still a good database
What else do you need to worry about?
1 > 2 > 3
Not having enough monitoring and using a system that isn’t “virtualization friendly”.
Need multiple dev teams, or need people to work on multiple services.
Need to come up with a common platform, otherwise work will be duplicated.
Too much overhead for a small team just starting out.
Don’t follow fads
Put a limit on everything.
Make it really really high.
Lower it or raise it as needed
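One way to read the three points above, sketched as a fixed-window counter whose limit can be changed at runtime. This is an assumption about the general idea, not reddit's implementation; `AdjustableLimit` and its defaults are made up for illustration.

```python
import time


class AdjustableLimit:
    """Per-key counter over a fixed time window with a tunable ceiling."""

    def __init__(self, limit=1_000_000, window_seconds=60, clock=time.monotonic):
        self.limit = limit                  # start really high, tune later
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so it can be tested
        self.counts = {}                    # key -> (window_start, count)

    def allow(self, key):
        now = self.clock()
        start, count = self.counts.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0           # a new window begins
        if count >= self.limit:
            self.counts[key] = (start, count)
            return False
        self.counts[key] = (start, count + 1)
        return True
```

Because `limit` is a plain attribute, it can be lowered or raised while the process is running, without a deploy.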
We used Ganglia
Backed by RRD
Makes good rollup graphs
Gives a great way to visually detect errors
Wasn’t friendly to rapidly changing infrastructure.
Going from two to three is hard
Going from one to two is harder
If possible, plan for 3 or more from the beginning.
Data is the most important
asset your business will have.
Data Gravity and you
The bigger your dataset, the harder it is to move from anywhere to anywhere
Also, how do you move that data without affecting your running application?
SQL or “NoSQL”?
MySQL, Postgres, or something else?
Unless you are really really sure of your business model...
The less schema the better
reddit’s database is literally just keys and values
Expire your data
It’s a lot easier to manage if your data is either gone or in static form
Users will almost never notice
Think of SSDs as cheap RAM, not expensive disk
Database Scaling with Sharding
You must construct additional Pylons
Picking a framework
What is Pylons?
c and g
pylons scaling == python scaling
run lots of appservers and make them independent of each other
We built our own caching
We built our own database layer
Would I use Pylons again?
Yes (although it’s called Pyramid now)
Event or thread based?
C is faster than Python (sorry)
Open Source is Good
Or, why you should never have your entire team on one airplane.
Provide an API
The business side of things
Running a site that requires user input?
Be one of the most active users
People like to see the founders participate
Moderation, cheating, spam and fraud
If you take user input, and get popular, people will cheat and spam.
If you take money, they will scam people.
Limits will help a lot, as will pattern detection.
Hard coded rules only go so far -- you need learning algorithms.
Let your users do the work for you.
What made reddit successful?
How does reddit make money?
Ask Me Anything
Not only did I run technology for reddit but I also was deeply involved in the business.
Ask me anything about running a profitable social media company.
Getting in touch
We split our writes across four master databases
Links/Accounts/Subreddits, Comments, Votes and Misc
Each has at least one slave
We avoid reading from the master if possible
Wrote our own database access layer, called the “thing” layer
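A rough sketch of the routing described above: writes go to the master for that kind of data, reads go to a replica when one exists. This is not the actual “thing” layer, which the slides do not show; the shard names, the `ShardRouter` class and the type-to-shard mapping are hypothetical.

```python
import random

# Hypothetical mapping of data types onto the four write masters.
SHARD_FOR_TYPE = {
    "link": "main", "account": "main", "subreddit": "main",
    "comment": "comments",
    "vote": "votes",
}
DEFAULT_SHARD = "misc"


class ShardRouter:
    def __init__(self, replicas, rng=random):
        self.replicas = replicas    # shard name -> list of replica names
        self.rng = rng

    def shard_for(self, thing_type):
        return SHARD_FOR_TYPE.get(thing_type, DEFAULT_SHARD)

    def for_write(self, thing_type):
        # Writes always go to the shard's master.
        return f"{self.shard_for(thing_type)}-master"

    def for_read(self, thing_type):
        # Prefer a replica; fall back to the master only if none is listed.
        shard = self.shard_for(thing_type)
        candidates = self.replicas.get(shard)
        if candidates:
            return self.rng.choice(candidates)
        return f"{shard}-master"
```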
How it works
Quorum reads / writes
Bloom Filter for fast negative lookups
Immutable files for fast writes
Fast negative lookups
Easy incremental scalability
Distributed -- No SPoF
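To illustrate why a Bloom filter gives fast negative lookups, here is a small generic version in Python. It is a teaching sketch, not Cassandra's implementation; the sizes, the hash construction and the `BloomFilter` class are assumptions.

```python
import hashlib


class BloomFilter:
    """Answers "definitely not present" or "possibly present"."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key was never added; True can be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

When `might_contain` returns False, the store can skip reading the file from disk entirely, which is where the speed comes from.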
Second class users
Logged out users always get cached content.
Akamai bears the brunt of reddit’s traffic
Logged out users are about 80% of the traffic
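The idea can be sketched at the application level as follows. On reddit this job is done by Akamai at the edge, so the in-process `PageCache` below is only a stand-in to show the logic; the class, the `render` callback and the 60 second TTL are hypothetical.

```python
import time


class PageCache:
    """Serves logged-out requests from cache; logged-in requests always render."""

    def __init__(self, render, ttl_seconds=60, clock=time.monotonic):
        self.render = render            # callable(path, user) -> html
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.pages = {}                 # path -> (expires_at, html)

    def get(self, path, user=None):
        if user is not None:
            # Logged-in users get a freshly rendered, personalized page.
            return self.render(path, user)
        now = self.clock()
        entry = self.pages.get(path)
        if entry and entry[0] > now:
            return entry[1]             # cache hit for a logged-out user
        html = self.render(path, None)
        self.pages[path] = (now + self.ttl_seconds, html)
        return html
```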
Queues are your friend
Spam processing and corrections
Sometimes users notice your data inconsistency
Not using a consistent key hashing algorithm at first.
We moved to using a consistent key hashing for memcache
We moved to Cassandra, which follows the Dynamo model, which uses a type of consistent hashing
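A minimal consistent-hash ring, to show why this matters when a cache node is added. This is a generic textbook sketch, not the memcache client reddit used and not Cassandra's partitioner; `HashRing`, the node names and the 100 virtual nodes per server are assumptions.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a point on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes        # virtual points per node, evens out load
        self.ring = []              # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_left(self.ring, (_hash(key), ""))
        return self.ring[idx % len(self.ring)][1]


ring = HashRing(["cache1", "cache2", "cache3"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add("cache4")
moved = [k for k in keys if ring.node_for(k) != before[k]]
```

Every key in `moved` now lives on the new node and the rest stay put. With plain `hash(key) % number_of_servers`, adding a server would remap most keys at once and empty the cache.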
The environment in a public cloud is inherently more variable (co-tenants, abusive or heavy users, etc.)
Make sure your code is written to handle this -- state should be kept somewhere shared and redundant, not on the instance.
Keep data in multiple Availability Zones
Avoid keeping state on a single instance
Take frequent snapshots of EBS disks
No secret keys on the instance
Different functions in different Security Groups
Not accounting for increased latency in a virtual environment.
Advantages and disadvantages to each
Thread-based: you can size capacity ahead of time
Event-based: can handle a lot more connections, but when you reach the scaling wall, you slam against it
reddit's Data Gravity problem
We had a lot of data that was ever-growing
We were so resource constrained we couldn’t move it without hurting our application
Using md5’d keys made it difficult to rebalance.
It didn’t really have a way to rebalance
Turns out it was pretty slow under high workloads