How this event can be prevented or mitigated in times of crisis, such as COVID-19
The week of March 27, 2020, brought a record 6.6 million jobless claims. The CARES Act not only covers traditional claims but also extends coverage to workers who might not otherwise qualify for unemployment, such as contractors and “gig” workers. Surprising almost no one in IT and yet nearly everyone else, the state websites could not handle the additional load.
First the stock exchange crashed, then two weeks later, the unemployment sites did, too.
Let’s talk about why, and how to avoid this in your systems.
It’s easy to criticize the states for not load testing. But that is a distraction. Take Colorado, which, according to NPR, expected about 400 people per day in early March. On March 17, that number was 6,800 by 10 a.m. Assuming that people realistically start filing after 7 a.m. (especially the laid-off folks) and keep going until 8 p.m., that pace works out to close to 30,000 people for the day. Those figures were also from the week of March 20, which had a record-setting 3.3 million unemployed nationwide.
The following week it would be 6.6 million, almost exactly twice as many.
That leads to an estimate of roughly 150 times the expected load: close to 30,000 people against an expected 400 is about 75 times, and the following week doubled it. Do you load test for 150 times typical demand?
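The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope reconstruction: the 7 a.m. start, the 8 p.m. end, and the assumption that Colorado's load doubled in step with the national figures are assumptions taken from the reasoning above, not measured data, and the variable names are made up for illustration.

```python
# Back-of-the-envelope load estimate from the figures quoted in the article.
expected_per_day = 400      # Colorado's early-March expectation (per NPR)
claims_by_10am = 6_800      # observed on March 17
hours_elapsed = 3           # assumption: filing starts around 7 a.m.
active_hours = 13           # assumption: 7 a.m. to 8 p.m.

# Extrapolate the morning pace across the whole active day.
hourly_rate = claims_by_10am / hours_elapsed
projected_day = hourly_rate * active_hours          # close to 30,000

# How many times the expected daily load is that?
multiple_first_week = projected_day / expected_per_day   # about 75

# Assumption: state load grew in line with national claims (3.3M to 6.6M).
national_growth = 6.6 / 3.3
multiple_second_week = multiple_first_week * national_growth  # about 150

print(f"Projected daily users: {projected_day:,.0f}")
print(f"Multiple of expected load, first week: {multiple_first_week:.0f}x")
print(f"Multiple of expected load, second week: {multiple_second_week:.0f}x")
```

Changing the assumed hours moves the result, but not by enough to change the conclusion: the load was two orders of magnitude above plan.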
Think of Black Friday, or the Christmas rush. Cyber Monday sales, for example, reached $9.4 billion in the United States last year, while the daily internet sales average in the US is around $1.6 billion. That is a dollar-volume ratio of around six to one. When doing performance testing, I might test to five or 10 times expected average volume. Most of the time, if I get to 20 times, the system falls over. No one can express a precise requirement, but most leaders nod their heads sagely and agree that 20 times anticipated average volume is fine.
Until it isn’t.
These sorts of problems are what Nassim Nicholas Taleb calls “black swans” in his book The Black Swan: The Impact of the Highly Improbable. According to Taleb, a black swan is an extremely rare event with severe consequences. These events are hard to predict beforehand, but seem obvious in hindsight.
For example, it’s common wisdom that unemployment goes up linearly during a depression or a recession. Even in the Great Depression, a huge segment of the economy, such as restaurants, didn’t just all stop working at the same time. Under that assumption, users would see a slow degradation that operations would have time to address by scaling up.
For the past 20 years, any investment in handling 150 times the typical load in an unemployment system would have been seen as waste. You can even hear the executives trying to address it, saying, “If unemployment is that high, we’ve got bigger problems than unemployment.”
Here are two ways to deal with it.
Have resilience plans
How a system operates when it falls over can be as important as the failure point. For example, it might be possible to replace the homepage with a single, static page saying that the system is down due to load and suggesting a time to check back. That will at least prevent frantic phone calls that overload the toll-free line and reach operators who cannot help, because the system is locked even for them. A content delivery network (CDN) could make sure that the static page scales to 200x typical load for a fraction of the cost of building the capability in-house.
Of course, when the static page is switched off and people expect the system to work, everyone will check in at the same time.
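As a rough illustration of the static-page idea, here is a minimal sketch using only Python's standard library. It answers every request with a small fixed page, an HTTP 503 status, and a Retry-After header suggesting when to check back. The one-hour delay, the page wording, the port, and the names overload_response, OverloadHandler, and serve are all hypothetical; a real deployment would more likely put a page like this behind a CDN or load balancer than serve it from a Python process.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 3600  # hypothetical: suggest checking back in an hour

STATIC_BODY = (
    b"<html><body><h1>We are experiencing very high demand</h1>"
    b"<p>The claims system is temporarily unavailable. "
    b"Please check back in about an hour.</p></body></html>"
)


def overload_response():
    """Return (status, headers, body) for the static overload page."""
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(STATIC_BODY)),
        # Standard HTTP header telling clients how long to wait.
        "Retry-After": str(RETRY_AFTER_SECONDS),
    }
    return 503, headers, STATIC_BODY


class OverloadHandler(BaseHTTPRequestHandler):
    """Serve the same static page for every GET, whatever the path."""

    def do_GET(self):
        status, headers, body = overload_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the fallback server until interrupted (port is an assumption)."""
    HTTPServer(("", port), OverloadHandler).serve_forever()
```

Calling serve() starts the fallback server. Because the response never touches a database or login system, it stays cheap to serve even when the real application is down. It does not solve the problem described above: once the page is switched off, the backlog of users still arrives together.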
Colorado’s approach was to ask people to time their login by last name. While this might sound silly, it can certainly reduce the traffic, eliminating the “click refresh” syndrome, which actually boosts demand and hurts systems more.
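A staggering rule like this is simple to express in code. The sketch below is a hypothetical example only: the A-M and N-Z split, the day assignments, and the function names are invented for illustration and are not Colorado's actual schedule.

```python
import datetime

# Hypothetical schedule, NOT Colorado's actual rule.
# Values are datetime weekday numbers: Monday is 0, Sunday is 6.
SCHEDULE = {
    "A-M": {0, 2, 4},  # Monday, Wednesday, Friday
    "N-Z": {1, 3, 5},  # Tuesday, Thursday, Saturday
}


def group_for(last_name):
    """Map a last name to its schedule group by first letter."""
    first = last_name.strip()[:1].upper()
    if not first:
        raise ValueError("last name is empty")
    # Anything outside A-M (including non-ASCII initials) falls into N-Z.
    return "A-M" if "A" <= first <= "M" else "N-Z"


def may_file_on(last_name, day):
    """Return True if this last name is scheduled to file on the given date."""
    return day.weekday() in SCHEDULE[group_for(last_name)]


monday = datetime.date(2020, 3, 23)
print(may_file_on("Garcia", monday))   # A-M group, Monday slot
print(may_file_on("Nguyen", monday))   # N-Z group, not a Monday slot
```

Sunday is left unassigned in this sketch. Splitting users into two groups roughly halves the peak on any given day, which is modest next to a 150x surge but costs almost nothing to announce.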
The point today isn’t to recommend solutions, but thinking patterns. Lanette Creamer, a technical program manager at MediaAlpha, suggests that performance testing should include recovery and stability testing, or what to do after the crash. So think about what could happen, and have a plan to minimize the impact, even if it is awkward. Hopefully, you’ll never need it.
The next time a risk like this appears in your build, remind everyone in the room of the 2020 unemployment crisis. It is possible that after this lesson, management will learn that new features like “unemployment for gig workers” are a possibility.