There’s an interesting article at the WP’s Wonkblog, presenting one informed technician’s thoughts on the problems users are encountering when they try to access the PPACA’s portal, healthcare.gov. He makes a number of good observations, but the two I found most interesting were this:
SK: The Obama administration has said that all these problems are happening because of overwhelming traffic. How good of an explanation is that?
JB: That seems like not a very good excuse to me. In sites like these there’s a very standard approach to capacity planning. You start with some basic math. Like, in this case, you look at all the federal states and how many uninsured people they have. Out of those you think, maybe 10 percent would log in in the first day. But you model for the worst case, and that’s how you come up with your peak of how many people could try to do the same thing at the same time.
and this:
SK: What would you be doing right now if you were running healthcare.gov?
JB: First I would put some really good instrumentation in place. The problem is if you’re fighting a fire, and it’s dark, you don’t know what’s going on. In other words, you can’t manage what you can’t measure. So first I would put something in place so you can measure what’s happening.
The second thing I’d do is I’d start building a very good load testing environment, so everything could be simulated in a load test, and move faster. Really everything is about speed right now, how quickly can you find problems and fix them. Ninety percent of the effort is really finding what to fix. Making the coding changes is only about 10 percent.
Neither of those two observations is particularly profound. Just ordinary good practice.
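The arithmetic he’s describing isn’t mysterious. Here’s a back-of-envelope version of it in Python; every number below is an illustrative assumption on my part, not an actual projection for healthcare.gov:

    # Rough capacity-planning sketch. All figures are illustrative assumptions,
    # not actual projections for healthcare.gov.
    addressable_population = 30_000_000   # assumed: uninsured in federal-exchange states
    first_day_fraction = 0.10             # assume ~10% try on the first day
    first_day_visitors = addressable_population * first_day_fraction

    # Traffic isn't spread evenly; assume half of it lands in a four-hour peak window.
    peak_window_seconds = 4 * 3600
    arrivals_per_second = (first_day_visitors * 0.5) / peak_window_seconds

    # If an average session ties up server resources for ~10 minutes, Little's law
    # gives concurrent sessions = arrival rate * average session length.
    avg_session_seconds = 600
    concurrent_sessions = arrivals_per_second * avg_session_seconds

    # Then plan for the worst case: several multiples of the expected peak.
    worst_case_factor = 3
    print(f"first-day visitors:   {first_day_visitors:,.0f}")
    print(f"peak arrivals/second: {arrivals_per_second:,.0f}")
    print(f"concurrent sessions to plan for: {concurrent_sessions * worst_case_factor:,.0f}")

With these made-up inputs you get roughly a hundred arrivals per second and something on the order of 200,000 concurrent sessions to plan for. The point is that the exercise takes an afternoon, not a committee.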
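Likewise, “instrumentation plus a load test” doesn’t have to mean anything exotic to start with. Here’s a minimal sketch using only the Python standard library; the target URL, user counts, and timings are placeholders, and real tooling (and a realistic simulation of the sign-up flow) would be far more elaborate:

    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    TARGET = "https://example.com/"   # placeholder, not the real site
    CONCURRENT_USERS = 20             # assumed; scale up gradually in a test environment
    REQUESTS_PER_USER = 5

    def timed_request(url):
        """Fetch one page; return (elapsed seconds, HTTP status or error name)."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                resp.read()
                return time.monotonic() - start, resp.status
        except Exception as exc:
            return time.monotonic() - start, type(exc).__name__

    def user_session(url):
        return [timed_request(url) for _ in range(REQUESTS_PER_USER)]

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
            sessions = list(pool.map(user_session, [TARGET] * CONCURRENT_USERS))

        results = [r for session in sessions for r in session]
        timings = sorted(t for t, _ in results)
        errors = sum(1 for _, status in results if status != 200)
        print(f"requests: {len(results)}, errors: {errors}")
        print(f"median latency: {timings[len(timings) // 2]:.3f}s, worst: {timings[-1]:.3f}s")

The measurement side matters as much as the load side: if you can’t see per-request latencies and error rates while the test runs, you’re back to fighting the fire in the dark.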
They’d better hope that the problems can be solved by throwing additional hardware at the site. That’s an easy solution, relatively inexpensive in both elapsed time and labor. Changing the architecture at this point could be disastrous. It’s probably out of the question.
My experience is that software developers are strongly predisposed to continue doing what they’re accustomed to doing. Getting someone from the outside to audit the code is probably a good idea but it will take time.
It took a long time to hash out the problems with Medicare Part D.
“Based on the Medicare Part D experience, we can expect some decline in interest in the health insurance marketplaces after this first week. But there should be steady volume of website use, phone calls, and visits with counselors throughout the fall. Medicare Part D then experienced another surge of interest as the December enrollment deadline for coverage to begin on the first of the year approached.
Glitches continued with the Part D website and call center throughout the open enrollment period. But the program added both phone lines and customer service representatives and implemented other upgrades over the weeks. The website – both its functionality and the accuracy of its information – was the source of ongoing frustration for its users, but it did get better over time.
By the end of open enrollment in May 2006, over 16 million successfully enrolled for drug benefits in Part D (not counting another 6 million automatically enrolled as a result of participation in both Medicare and Medicaid). Initial glitches did not deter their enrollment. And today, Part D enjoys widespread popularity.”
http://ccf.georgetown.edu/all/how-does-acas-first-week-compare-to-medicare-part-ds/
Reuters had an article, which I cannot find on their website, that quoted an expert on the problems:
“One possible cause of the problems is that hitting “apply” on HealthCare.gov causes 92 separate files, plug-ins and other mammoth swarms of data to stream between the user’s computer and the servers powering the government website, said Matthew Hancock, an independent expert in website design. He was able to track the files being requested through a feature in the Firefox browser.
…
“They set up the website in such a way that too many requests to the server arrived at the same time,” Hancock said.
He said because so much traffic was going back and forth between the users’ computers and the server hosting the government website, it was as if the system was attacking itself.
Hancock described the situation as similar to what happens when hackers conduct a distributed denial of service, or DDOS, attack on a website: they get large numbers of computers to simultaneously request information from the server that runs a website, overwhelming it and causing it to crash or otherwise stumble. “The site basically DDOS’d itself,” he said.”
Yes, I read that article, PD. What I believe he is describing is how, in order to give the web site the look and feel people expect nowadays, the developers used a method that requires a very high degree of interactivity between host and client. It does not scale well. That’s one of the things I meant by “strongly predisposed to continue doing what they’re accustomed to doing.”
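To put that chattiness in perspective, here’s a toy calculation. The visitor figure is made up; the 92-requests-per-page number comes from the Reuters quote above:

    # Toy illustration of how per-page request counts multiply server load.
    visitors_per_hour = 250_000          # assumed, for illustration only
    requests_per_page_chatty = 92        # figure cited in the Reuters article
    requests_per_page_bundled = 10       # hypothetical, after combining/caching assets

    def requests_per_second(visitors_per_hour, requests_per_page):
        return visitors_per_hour * requests_per_page / 3600

    print(f"chatty design:  {requests_per_second(visitors_per_hour, requests_per_page_chatty):,.0f} req/s")
    print(f"bundled design: {requests_per_second(visitors_per_hour, requests_per_page_bundled):,.0f} req/s")

Same number of visitors, nearly an order of magnitude difference in what the servers have to absorb, which is the sense in which the site was “attacking itself.”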
Changing the architecture at this point could be disastrous. It’s probably out of the question.
Many IT people, though, are saying that the architecture of the software is the problem, having little if anything to do with traffic volume. They also say that adding more servers, as the government said it would do as its ‘fix’, will not help these intrinsic flaws. Mixed in with the confusion of this roll-out, the traffic numbers are being disputed daily, depending on who is spewing them. Dem officials coyly cite something like seven million signing onto the government site as of yesterday. Today, though, analysts are saying even a five million figure is too high, ratcheting that down to a scant 500-600 thousand perusing the HC site, with only around 7 thousand actually applying for the insurance. Much like PD’s post stated, there appears to be a growing consensus of computer experts who trace the glitches to the program having too many files/paperwork to download, basically clogging up the system. If that’s the case, it seems a major overhaul might be required.
This kind of reminds me of Romney’s much-touted GOTV software debut, Project ORCA, which ended up crashing and causing confusion for volunteers on the ground during that last important push; the glitches created so much havoc that they probably figured significantly in his loss.
The reason I think it’s out of the question is in my title: software takes time. Deferring the operation of the federally-operated exchanges for technical reasons is now seen as capitulation.
Other than a college class in Pascal (which felt like a dumbed-down logic class), all of my programming experience is with BASIC on a Commodore 64 almost 30 years ago. The description of a website locking up from trying to do too much at once seemed very familiar, though.
I just took a glance at the Illinois website, which I guess is just a fancy (and well-done) vehicle for directing you either to a national health insurance marketplace website or to Medicaid. I guess I’m surprised by that; I thought the exchanges would still have a state-specific structure. But I guess it explains why everybody is complaining about the national portal.
I’m hesitant to mention this but the company that was the prime contractor on the site is Canadian and the contractor for site support is Irish. Globalization!
I’d think they’d be more sensitive to the optics. Maybe no one will notice.
Well, Ireland almost has the population of Boston, and Ontario almost has the population of Los Angeles, so it should work.
At least it wasn’t left to Illinois’ state government to work out a site; pressing the “apply” button might have spit out the Social Security numbers of hundreds of applicants and a list of the governor’s favorite Elvis songs. I am still surprised, though, by this being handled nationally, where the scalability issues are at least a hundredfold what they would be at the state level. Canada’s provinces operate their public health insurance programs, not the national government.
Off-hand I’d guess that the problems with a national roll-out are between two and three orders of magnitude greater than handling the problem at the state level would be. And handling things on a state-by-state basis means more than one approach is tried at a time and it’s possible to learn from the mistakes of others.
But then there’s always Illinois.
There are some interesting assessments regarding IT problems during this week of the PPACA rollout here and here.
Also, there’s this opinion piece entitled: Why Government IT Projects Are So Prone To Failure Or Overruns.