Lessons Learned from Launching CS50x
Last Monday, CS50x launched on edX, an online education platform hosting classes from Harvard, MIT, UC Berkeley, and UTexas. The CS50 team has developed a suite of apps to facilitate discussion among students and staff, assignment submission, and automatic grading (among other things), all of which went live with Monday’s course launch.
First, a timeline of our launch day. We definitely had no idea what to expect on the first day of the course, but here’s a quick rundown of my sunny Cambridge afternoon.
- 1:45pm: Wake up. Enjoy a wonderful breakfast of powdered donettes.
- 2:00pm: CS50x gets the green light to launch. Floodgates are open for all students, and a welcome email is sent to early registrants. I should probably put some pants on.
- 2:01pm: Put pants on.
- 2:02pm: 500+ users online. !@#$
- 3:00pm: 10,000+ users online. !@#$.
Needless to say, I spent the day frantically monitoring our infrastructure, which actually held up pretty nicely. When all was said and done, we had fun Day 1 facts like:
- 30,000+ student logins
- 6,000+ new discussion threads
- 8,000+ replies within threads
- 60,000+ responses to in-lecture practice questions
- 81% of questions answered correctly (woo!)
And now, here are some things we tried to keep in mind in anticipation of our launch, and some things we learned along the way.
Reduce single points of failure. If any component of your application is mission critical, make sure there’s more than one of it. For example, if you’re building a search site and your search engine service goes down on launch day, you can kiss that seed funding goodbye. The same goes for your database and application code. Instead of running everything on one server, distribute code as much as possible across a cluster of servers, so that if one server gets struck by lightning, it’s not the end of the world. Similarly, data stores should have hot-swappable, replicated copies. If Murphy’s law kicks in and a database service dies for seemingly no reason, you don’t want to be restoring from a backup you manually made a few hours ago; you want to be ready to make a seamless, transparent swap.
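We haven’t published our exact failover machinery, but the idea looks something like the sketch below: try the primary first, then fall back to a hot replica. Python with psycopg2 is used purely for illustration, and the hostnames and database names are hypothetical.

```python
import psycopg2

# Hypothetical connection strings: a primary and a hot replica.
DSNS = [
    "host=db-primary dbname=cs50app",
    "host=db-replica dbname=cs50app",
]

def connect():
    """Return a connection to the first database host that answers."""
    for dsn in DSNS:
        try:
            return psycopg2.connect(dsn, connect_timeout=3)
        except psycopg2.OperationalError:
            continue  # this host is down; try the next one
    raise RuntimeError("all database hosts are unreachable")
```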
Have a strategy to roll back. Inevitably, that quick fix you applied at 4am is going to wreak havoc on your production machines when you least expect it. When that happens, you don’t want users waiting on you to remember the difference between a git revert and a git reset. While we use git for source control, we use RPMs to version our production deployments. When we want to ship code to our servers, we do a single push to a build server, which builds an RPM containing everything from application code to system configuration files and propagates it out across the entire cluster. To revert to a previous version, all we have to do is install an old RPM, and we know that our entire system configuration is exactly as it was when we originally rolled that RPM out.
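As a rough illustration (the hostnames and package version are made up, and this isn’t our actual tooling), rolling back a cluster can be as simple as asking every server to downgrade to a known-good RPM:

```python
import subprocess

SERVERS = ["app1.example.com", "app2.example.com"]  # hypothetical hosts
PACKAGE = "cs50-apps-1.2.2"                          # hypothetical name-version

def rollback(package):
    """Downgrade every server in the cluster to a known-good RPM."""
    for host in SERVERS:
        # yum downgrade installs the named older version in place
        subprocess.check_call(["ssh", host, f"sudo yum -y downgrade {package}"])

rollback(PACKAGE)
```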
Be ready to bring up new instances. All of our apps are hosted on Amazon Web Services, which makes spinning up new servers a breeze. In anticipation of the load spike, we provisioned a cluster of EC2 servers just for the online course. Even so, throwing more (virtual) hardware at a problem isn’t a bad short-term solution if you’re in the middle of a massive traffic spike. Knowing when the load on a cluster is too high, and being able to quickly distribute that load across more servers, can limit performance degradation during a spike. With our RPM-based approach, configuring a new server is as easy as running a yum install, and the new server will be configured exactly like the others.
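For illustration, here’s roughly what bringing up a self-configuring instance can look like using boto3 (a modern AWS SDK, not necessarily what we use); the AMI ID, instance type, and package name are placeholders. The user-data script runs on first boot, so a single yum install leaves the new box configured exactly like its siblings:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Runs on first boot; one yum install configures the new server
# identically to the rest of the cluster (package name hypothetical).
USER_DATA = """#!/bin/bash
yum -y install cs50-apps
"""

ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI
    InstanceType="m3.large",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
    UserData=USER_DATA,
)
```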
Cache the crap out of everything. Okay fine, not everything. But a lot of things. The list of new posts on the discussion board? Cached. The replies to one of those posts? Cached. The students you can grade? Cached. The courses you’re enrolled in? Cached. With our apps (and most apps, really), every page contains tons of data that changes infrequently yet is seen by a large number of users. If 1000 users look at the same list of new posts, there’s no reason your code should be asking the database the same question 1000 times. It’s like that stupid knock-knock joke with the bananas and the orange. Not funny. We use a combination of Memcached (which is wonderful) and Redis (which is also wonderful) for these kinds of things, but pretty much anything that grabs something from RAM in O(1) is going to be faster than going to disk with a database query.
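One common way to do this (a sketch, not our exact code) is the cache-aside pattern: check Memcached first, and only fall back to the database on a miss. Here it’s shown with pymemcache; the key name and the database stand-in are hypothetical:

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def fetch_new_posts_from_db():
    """Stand-in for the slow database query (hypothetical)."""
    return [{"id": 1, "title": "Hello, world"}]

def new_posts():
    """Cache-aside: serve from Memcached when we can, hit the DB when we must."""
    cached = cache.get("posts:new")
    if cached is not None:
        return json.loads(cached)
    posts = fetch_new_posts_from_db()
    cache.set("posts:new", json.dumps(posts), expire=60)  # stale after a minute
    return posts
```

With a 60-second expiry, those 1000 identical requests cost the database one query per minute instead of 1000 per page load.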
Warm caches and load balancers. When your site goes live, you want to make sure that caching layers and load balancers aren’t running on empty. While many of Amazon’s web services auto-scale beautifully, you don’t want your first users to be the ones triggering the scaling, or your site will load slowly at the outset of a traffic spike. Following the same principle as having enough servers ready to go, make sure these layers are sufficiently scaled from the start, so that you hopefully don’t need to spin up too many more instances mid-spike (though you’re ready to, of course).
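On the cache side, warming can be as low-tech as scripting a few requests against your hottest pages before opening the floodgates, so the first real users find full caches instead of filling them. A sketch, with hypothetical paths and base URL:

```python
import urllib.request

HOT_PATHS = ["/courses", "/posts", "/submit"]  # hypothetical hot pages
BASE_URL = "http://localhost:8080"             # placeholder

for path in HOT_PATHS:
    # Each request populates the caches behind that page.
    urllib.request.urlopen(BASE_URL + path).read()
    print(f"warmed {path}")
```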
Assume queries can return a million rows. While developing software, it’s easy to test on a small data set. With 5 users on your test site, everything might perform nicely; it’s not until you have a large number of users that a single badly-written query can ravage your once-shiny servers. When writing a query or designing a system, keep in the back of your mind that it could one day run against a really large data set. That means making sure indexes are created in the right places and queries are limited as much as possible. Better yet, test on a large data set before deploying to the masses.
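To make those two habits concrete, here’s a toy example using sqlite3 (the table and schema are hypothetical): index the column you sort or filter on, and bound every result set.

```python
import sqlite3

db = sqlite3.connect("forum.db")  # hypothetical database
db.execute(
    "CREATE TABLE IF NOT EXISTS posts "
    "(id INTEGER PRIMARY KEY, title TEXT, created_at TEXT)"
)

# Index the column we sort on, so the query below doesn't scan every row.
db.execute("CREATE INDEX IF NOT EXISTS idx_posts_created ON posts (created_at)")

# Always bound the result set: the page only shows 25 posts, so never ask
# the database for a million of them.
rows = db.execute(
    "SELECT id, title FROM posts ORDER BY created_at DESC LIMIT 25"
).fetchall()
```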
Definitely one of the busier Mondays in recent memory!