Transcript

Credits

  • Patrick -- memegenerator.com and Nickelodeon
  • OpenStack high level -- openstack.org
  • OpenStack arch concept -- openstack.org
  • OpenStack arch complicated -- solinea.com
  • Rackspace servers -- hostmethod.com
  • Xzibit -- The Internet (google images)
  • Release the Kraken -- The Internet (kickinthecornflakes.files.wordpress.com)

Jesse Keating

jesse.keating@rackspace.com

@iamjkeating

prezi.com/j2sol

Deploying the Rackspace Public Cloud

Continuous Integration

Upstream development happens

An internal merge branch and package are created daily

CI environment is (nearly) fully rebuilt and deployed

Automated testing validates the package

This is done daily -- or faster if we iterate on a sub-task
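
A minimal sketch of the daily internal merge as a play; the repo URL and branch names are invented, and in practice this is a Jenkins job plus custom software:

    - hosts: localhost
      connection: local
      vars:
        upstream: https://git.example.com/openstack/nova    # hypothetical URL
        merge_branch: "internal-{{ lookup('pipe', 'date +%Y%m%d') }}"
      tasks:
        - name: Clone upstream at the tip of development
          ansible.builtin.git:
            repo: "{{ upstream }}"
            dest: /srv/ci/nova
            version: master

        - name: Create the daily merge branch with internal patches on top
          ansible.builtin.shell: |
            git checkout -b {{ merge_branch }}
            git merge --no-edit internal-patches    # hypothetical branch
          args:
            chdir: /srv/ci/nova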

Deployments are easy -- just ask Patrick!

What is the Rackspace public cloud?

OpenStack with some of our own software on top.

This is OpenStack

This is almost what it looks like

Almost right

Goal

Multiple deploys of new Public Cloud code a day -- with sub-10-second customer-perceived impact

Pre-production

CI-validated package deployed as an upgrade to preproduction

More automated tests and human-driven tests validate integration and the upgrade process (migrations)

Package iterations for bug fixes are re-deployed to preproduction and tests are re-run

Package iterates in preproduction for two weeks (or longer) before it is considered production ready

Production Deploys

Outage windows used per-region to deploy the tested package

Post-deploy smoke tests validate continued operation and a successful deployment

Hot-patch of code/config post-deploy to address critical issues

All production regions are deployed in a given week -- sometimes longer if issues are encountered.

Deployment Process

Where does it all go?

Public cloud has Regions -- a physical collection of systems that act as part of the public cloud

Package Creation

Venvs built from git tags applied to the merged repos

Bundled together with puppet manifests based on the same git tag to form a package

Package uploaded to the torrent server

All driven by Jenkins and custom software
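
A rough sketch of the package build in playbook form; the repo URLs, tag name, and paths are hypothetical stand-ins, and the real pipeline is Jenkins plus custom software:

    - hosts: localhost
      connection: local
      vars:
        git_tag: "2014.1-rax42"                 # hypothetical release tag
        build_root: /srv/build
      tasks:
        - name: Check out the merged repo at the release tag
          ansible.builtin.git:
            repo: https://git.example.com/rax/nova-merged      # hypothetical
            dest: "{{ build_root }}/nova"
            version: "{{ git_tag }}"

        - name: Build a venv from the tagged source
          ansible.builtin.shell: |
            virtualenv {{ build_root }}/venv
            {{ build_root }}/venv/bin/pip install {{ build_root }}/nova
          args:
            creates: "{{ build_root }}/venv/bin/nova-api"

        - name: Check out the puppet manifests at the same tag
          ansible.builtin.git:
            repo: https://git.example.com/rax/puppet-manifests  # hypothetical
            dest: "{{ build_root }}/puppet"
            version: "{{ git_tag }}"

        - name: Bundle venv and manifests into one package for the torrent server
          community.general.archive:
            path:
              - "{{ build_root }}/venv"
              - "{{ build_root }}/puppet"
            dest: "{{ build_root }}/nova-{{ git_tag }}.tar.gz"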

Package Pre-stage

Can happen at any time leading up to the outage window

  • Fact file generated per-cell from region config repos
  • Fact file copied to instances and nova-computes
  • Package torrented to instances and nova-computes
  • Package extracted
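
The pre-stage, sketched as a play against the target hosts. The fact-file path, the cell_name variable, the package name, and the use of aria2c as the torrent client are all assumptions:

    - hosts: instances:nova_compute             # hypothetical group names
      become: yes
      vars:
        package: nova-2014.1-rax42.tar.gz       # hypothetical package name
      tasks:
        - name: Copy the per-cell fact file built from the region config repos
          ansible.builtin.copy:
            src: "facts/{{ cell_name }}.yaml"   # cell_name assumed set in inventory
            dest: /etc/facts.yaml

        - name: Torrent the package down (aria2c assumed installed)
          ansible.builtin.command: >
            aria2c --seed-time=0 -d /var/tmp
            http://torrent.example.com/{{ package }}.torrent
          args:
            creates: "/var/tmp/{{ package }}"

        - name: Extract the package alongside the currently running release
          ansible.builtin.unarchive:
            src: "/var/tmp/{{ package }}"
            dest: /opt/releases
            remote_src: yes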

Shell-driven Ansible playbooks

Tens of thousands of nodes to touch!

Release the Kraken!

  • Graphite poked
  • Symlink on filesystem updated to new package
  • Nagios disabled
  • Services axed
  • DB migrations run
  • Puppet run
  • Nagios alerts enabled
  • Graphite poked to end
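
That sequence, condensed into a single hedged Ansible play. Group names, service names, the Graphite events endpoint, and the symlink layout are assumptions, not the actual playbooks:

    - hosts: region_control                     # hypothetical group name
      become: yes
      tasks:
        - name: Poke Graphite to mark the start of the deploy
          ansible.builtin.uri:
            url: http://graphite.example.com/events/      # hypothetical endpoint
            method: POST
            body_format: json
            body: { what: "nova deploy start", tags: ["deploy"] }
          run_once: true
          delegate_to: localhost

        - name: Update the filesystem symlink to the new package
          ansible.builtin.file:
            src: /opt/releases/nova-2014.1-rax42          # hypothetical layout
            dest: /opt/current
            state: link
            force: yes

        - name: Silence Nagios alerts while services bounce
          community.general.nagios:
            action: silence
            host: "{{ inventory_hostname }}"
          delegate_to: nagios.example.com                 # hypothetical server

        - name: Axe the services
          ansible.builtin.service:
            name: "{{ item }}"
            state: stopped
          loop: [nova-api, nova-scheduler]

        - name: Run the DB migrations once for the region
          ansible.builtin.command: /opt/current/venv/bin/nova-manage db sync
          run_once: true

        - name: Run Puppet to converge config and bring services back up
          ansible.builtin.command: puppet apply /opt/current/puppet/site.pp

        - name: Re-enable Nagios alerts
          community.general.nagios:
            action: unsilence
            host: "{{ inventory_hostname }}"
          delegate_to: nagios.example.com

        - name: Poke Graphite to mark the end
          ansible.builtin.uri:
            url: http://graphite.example.com/events/
            method: POST
            body_format: json
            body: { what: "nova deploy end", tags: ["deploy"] }
          run_once: true
          delegate_to: localhost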

Regions

  • IAD
  • DFW
  • ORD
  • LON
  • SYD
  • HKG
  • (preproduction)
  • (continuous integration)

60-70 cloud instances for control

Cells -- smaller blocks of servers

  • 1-15+ cells per region
  • ~10 cloud instances for control
  • 200-400 hypervisors
  • 1 nova-compute VM per HV
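
That topology maps naturally onto Ansible inventory groups; the group and host names below are invented for illustration:

    # Hypothetical YAML inventory sketch for one region (IAD)
    all:
      children:
        iad_region_control:
          hosts:
            iad-control[01:65].example.com:      # 60-70 control instances
        iad_cell_c0001_control:
          hosts:
            c0001-control[01:10].example.com:    # ~10 per cell
        iad_cell_c0001_compute:
          hosts:
            c0001-compute[001:300].example.com:  # one nova-compute VM per hypervisor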

Future Improvements

Online / Offline Migrations

The majority of the work done during a migration can happen on a live DB; only a small part should lock the DB.

Reduces the time when the DB is locked and services are out.
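
One hedged way to structure that split, driving the mysql CLI from a play. The schema, table names, and credentials handling are made up, and the online DDL options assume a MySQL version that supports them:

    - hosts: db_primary                         # hypothetical group name
      tasks:
        # Online phase: runs while services stay up, well before the outage
        # window -- e.g. building a new index (assumes MySQL online DDL and
        # credentials in ~/.my.cnf).
        - name: Run the bulk of the migration against the live DB
          ansible.builtin.command: >
            mysql nova -e "ALTER TABLE instances
            ADD INDEX ix_instances_uuid (uuid), ALGORITHM=INPLACE, LOCK=NONE"

        # Offline phase: the only step inside the outage window, kept as
        # small as possible so the lock is brief -- e.g. an atomic table swap.
        - name: Run the small blocking part during the window
          ansible.builtin.command: >
            mysql nova -e "RENAME TABLE instances TO instances_old,
            instances_new TO instances"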

Orchestrated Service Outages

  • Deploy to capacity cells first
  • Queue connections at the load balancer while deploying to region control

Graceful shutdowns and deploying cells first mean minimal customer impact.

Orchestrated two-step deployment

  • Configure the control plane before computes
  • If skipping migrations, leave compute up while deploying the control plane, then deploy compute after
  • Deploy just compute or just the control plane

Reduces the time when API or compute resources are unavailable and allows for more frequent deploys. A sketch of the ordering follows.
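
A sketch of the two-step ordering, assuming hypothetical group names, wrapper scripts, and a skip_migrations flag supplied at run time:

    # Step one: the control plane, while computes stay up.
    - hosts: region_control
      tasks:
        - name: Deploy the control plane first
          ansible.builtin.command: /opt/deploy/control.sh     # hypothetical wrapper
        - name: Run migrations unless this deploy skips them
          ansible.builtin.command: /opt/current/venv/bin/nova-manage db sync
          when: not (skip_migrations | default(false))
          run_once: true

    # Step two: computes, in batches, only after control has converged.
    - hosts: nova_compute
      serial: "25%"
      tasks:
        - name: Deploy compute after the control plane
          ansible.builtin.command: /opt/deploy/compute.sh     # hypothetical wrapper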

API Service accept-but-not-process mode (sketched below)

  • Use the load balancer to remove half of the API nodes from the pool
  • Set the other half to queue-only mode
  • Do the deployment and migrations on regional control
  • Do the blocking migrations
  • Flip the enabled/disabled sets on the load balancer
  • Release the queues on the API nodes
  • Put all nodes active on the load balancer
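
The half-and-half flip as a hedged sketch; the load balancer API and the queue-mode toggle are hypothetical stand-ins for whatever the LB and API services actually expose:

    - hosts: localhost
      connection: local
      vars:
        lb_api: http://lb.example.com/api       # hypothetical LB API
        half_a: [api01, api02]
        half_b: [api03, api04]
      tasks:
        - name: Pull half A out of the pool
          ansible.builtin.uri:
            url: "{{ lb_api }}/node/{{ item }}/disable"
            method: POST
          loop: "{{ half_a }}"

        - name: Put half B into accept-but-queue mode (hypothetical toggle)
          ansible.builtin.uri:
            url: "http://{{ item }}:9000/queue_mode/on"
            method: POST
          loop: "{{ half_b }}"

        # ... deploy half A and regional control, run blocking migrations ...

        - name: Flip -- upgraded half A back in, half B out for its upgrade
          ansible.builtin.uri:
            url: "{{ lb_api }}/node/{{ item.node }}/{{ item.state }}"
            method: POST
          loop:
            - { node: api01, state: enable }
            - { node: api02, state: enable }
            - { node: api03, state: disable }
            - { node: api04, state: disable }

        - name: Release the queues and put everything active again
          ansible.builtin.uri:
            url: "http://{{ item }}:9000/queue_mode/off"
            method: POST
          loop: "{{ half_b }}"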

Potential for zero perceived customer impact