Poor Knight Capital. A software release in the summer of 2012 went badly, and the result was a nearly immediate $460 million loss. Liquidity became an issue, and they were acquired shortly thereafter.
For the next year, it seemed like every vendor with a tool that helps with software deployment, testing, configuration management, or monitoring used them as an example. The problem was that, other than knowing that a new software release resulted in massive numbers of trades and a lot of lost money, we didn’t really know what went wrong. I was certainly guilty of this. After all, a release went spectacularly wrong, and I work on deploy and release software. As we’ll see, I can breathe a sigh of relief. Better deployment and release tools would have helped, but there’s much more to the story.
Now, thanks to this report (http://www.sec.gov/litigation/admin/2013/34-70694.pdf) from the Securities and Exchange Commission, we do know what went wrong. The SEC also announced that it fined Knight another $12 million for poor controls, bringing the total bill to $472 million. Knight has suffered enough, and I don’t want to pick on them. I do want to look at this report and see what we can learn from it and apply to our own businesses and release operations.
We must acknowledge that this was the failure of a complex system that had some safeguards in place to prevent failure. The Knight team appears to have done some good things. As Richard Cook describes in his essay and presentation “How Complex Systems Fail,” these types of failures require multiple smaller failures.* So here’s what went wrong.
I see dead code
In section B.13 of the SEC report, we find the following:
13. Upon deployment, the new RLP code in SMARS was intended to replace unused
code in the relevant portion of the order router. This unused code previously had been used for
functionality called “Power Peg,” which Knight had discontinued using many years earlier.
Despite the lack of use, the Power Peg functionality remained present and callable at the time of
the RLP deployment. The new RLP code also repurposed a flag that was formerly used to
activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag
was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.
In section 14, the report clarifies that “many years earlier” means in 2003. So functionality that was no longer needed sat around for nearly a decade. The “Power Peg” became a powder keg ready to blow and was primed in 2005 when controls over how many orders to process were refactored elsewhere.
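To make the hazard concrete, here is a minimal sketch of what a repurposed flag can look like. None of this is Knight's actual code: the function names, the order dictionary, and the "yes" flag handling are my own hypothetical stand-ins, chosen only to mirror the SEC's description. The point is that the same flag value means two different things depending on which build receives it.

```python
# Hypothetical illustration only: names and structure are assumptions,
# not taken from Knight's systems or the SEC report.

def route_order_old_build(order: dict) -> str:
    """Old build: the flag still activates the retired Power Peg path."""
    if order.get("flag") == "yes":
        return "power_peg"  # unused for years, but still callable
    return "normal"


def route_order_new_build(order: dict) -> str:
    """New build: the same flag has been repurposed to engage RLP."""
    if order.get("flag") == "yes":
        return "rlp"
    return "normal"


order = {"symbol": "XYZ", "flag": "yes"}

print(route_order_new_build(order))  # rlp
print(route_order_old_build(order))  # power_peg
```

As long as the old code path exists anywhere, setting the flag is only safe if every process that reads it is running the new build. Deleting the dead code years earlier, or choosing a brand new flag, would have removed that dependency.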
The SEC complains about a lack of written policies for testing unused code that remains callable. That strikes me as terrible guidance.