Poor Knight Capital. A software release in the summer of 2012 went badly, and the result was a nearly immediate $460 million loss. Liquidity became an issue and they were acquired shortly thereafter.
For the next year, it seemed like every vendor with a tool that helps with software deployment, testing, configuration management, or monitoring used them as an example. The problem was that, beyond knowing a new software release resulted in massive numbers of trades and a lot of lost money, we didn’t really know what went wrong. I was certainly guilty of this. After all, a release went spectacularly wrong, and I work on deploy and release software. As we’ll see, I can breathe a sigh of relief: better deployment and release tools would have helped, but there’s much more to the story.
Now, thanks to this report (http://www.sec.gov/litigation/admin/2013/34-70694.pdf) from the Securities and Exchange Commission, we do know what went wrong. The SEC also announced that it fined Knight another $12 million for poor controls, bringing the total bill to $472 million. Knight has suffered enough, and I don’t want to pick on them. I do want to look at this report and see what we can learn from it and apply to our own businesses and release operations.
We must acknowledge that the failure was of a complex system that had some safeguards in place to prevent failure. The Knight team looks to have done some good things. As Richard Cook describes in his essay and presentation “How Complex Systems Fail,” these types of failures require multiple smaller failures.* So here’s what went wrong.
I see dead code
In section B.13 of the SEC report, we find the following:
13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.
In section 14, the report clarifies that “many years earlier” means in 2003. So functionality that was no longer needed sat around for nearly a decade. The “Power Peg” became a powder keg ready to blow and was primed in 2005 when controls over how many orders to process were refactored elsewhere.
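To make the mechanism concrete, here’s a purely illustrative sketch – the names are invented and this is not Knight’s actual code – of why repurposing a flag that once guarded dead-but-callable code is so dangerous: any server still running the old build interprets the flag as an instruction to run the retired logic.

```java
/**
 * Purely illustrative; names are invented, not Knight's actual system.
 * The point: the SAME flag means two different things depending on which
 * build a server is running, so one server left on the old build revives
 * behavior that was retired years earlier.
 */
public class OrderRouter {

    private final boolean rlpFlag;   // repurposed flag: it used to mean "use Power Peg"

    public OrderRouter(boolean rlpFlag) {
        this.rlpFlag = rlpFlag;
    }

    // NEW build: the flag engages the Retail Liquidity Program code path.
    void routeOnNewBuild(String order) {
        if (rlpFlag) {
            System.out.println("RLP handling for " + order);
        }
    }

    // OLD build: the same flag engages the long-retired Power Peg path,
    // which is still present and callable.
    void routeOnOldBuild(String order) {
        if (rlpFlag) {
            System.out.println("Power Peg handling for " + order);  // dead code walking
        }
    }

    public static void main(String[] args) {
        OrderRouter router = new OrderRouter(true);  // flag flipped to "yes" at launch
        router.routeOnNewBuild("ORDER-1");  // servers on the new build behave as intended...
        router.routeOnOldBuild("ORDER-1");  // ...a server on the old build runs retired code
    }
}
```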
The SEC complains about a lack of written policies for testing unused code that remains callable. That strikes me as terrible guidance.
To learn more, you should watch the recording of the webinar: Mobile to Mainframe: The Challenges and Best Practices of Enterprise DevOps.
The wrong answer to: “How does UrbanCode do DevOps internally?”
Last week during an interview on DevOps, I was asked how UrbanCode has “done DevOps”. I spoke a bit about how we streamlined the release of our website. However, to some degree I felt that we don’t have significant enough “Ops” to really do DevOps. After all, the vast majority of the software we create is installed by our customers and runs in our customers’ environments.
Upon further reflection, it hit me: UrbanCode faces a massive Dev/Ops silo problem. “Ops” is hundreds of system administration and tool-owner teams working for different companies. The operations teams aren’t on staff at UrbanCode; they work for our customers and are responsible for delivering our product, among many others, to end users in a highly reliable way.
Time-consuming or risky upgrades decrease how often those operations teams are willing to upgrade the software. Therefore, if our goal is to continuously deliver new capabilities and fixes to our customers, we must make it fast and safe to upgrade. Our rate of delivering new versions has zero customer impact if the customers’ system administrators do not install the updates.
How we really do DevOps at UrbanCode
We work to fulfill the “Dev” side of the partnership as best we can: actively engaging Ops teams and building a development infrastructure that maximizes productivity. Given that we have been building continuous integration and delivery tools for the past twelve years, it should be no surprise that our development infrastructure is pretty solid.
The hard part is servicing our distributed operations teams to make upgrading, maintaining, and troubleshooting easier. Breaking down the barriers between development and operations is also tricky. Here are some of the practices and technologies we have learned to adopt.
Automatic Database Upgrades
Updating a database schema by hand is never fun. Early in the development of AnthillPro 3.x we put considerable effort into building a toolkit that performs all the database updates needed to move from any version to any later version. We’ve used it in every product since, and offer a uDeploy plugin for customers who want to use it for their own apps. This also helps in our test labs as we make small upgrades to prior versions.
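The toolkit itself isn’t something I can paste here, but the core idea is simple: record the current schema version in the database and apply, in order, every upgrade step newer than that version. A minimal sketch of that pattern – table and class names are invented, not the actual toolkit:

```java
// Minimal sketch of version-stepped schema upgrades: each step knows the version
// it produces, and the runner applies every step newer than the recorded version.
import java.sql.*;
import java.util.*;

public class SchemaUpgrader {

    /** One upgrade step: the schema version it migrates the database to, plus its DDL. */
    record Step(int targetVersion, String sql) {}

    private final List<Step> steps = List.of(
        new Step(1, "CREATE TABLE projects (id INT PRIMARY KEY, name VARCHAR(255))"),
        new Step(2, "ALTER TABLE projects ADD COLUMN description VARCHAR(1024)")
    );

    public void upgrade(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        int current = currentVersion(conn);
        for (Step step : steps) {
            if (step.targetVersion() <= current) continue;   // already applied
            try (Statement st = conn.createStatement()) {
                st.execute(step.sql());
                st.execute("UPDATE db_version SET version = " + step.targetVersion());
            }
            conn.commit();   // each step commits independently, so a failure stops cleanly
        }
    }

    private int currentVersion(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS db_version (version INT)");
            try (ResultSet rs = st.executeQuery("SELECT version FROM db_version")) {
                if (rs.next()) return rs.getInt(1);
            }
            st.execute("INSERT INTO db_version (version) VALUES (0)");
            conn.commit();
            return 0;   // fresh database, apply everything
        }
    }
}
```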
Centralized Agent Upgrades
Many of our products use a server-agent architecture. A central server (or cluster) decides what should be done, and agents do the work. For every central server that needs to be upgraded, there may be hundreds or thousands of agents. No matter how easy your upgrader is to run, doing any chore a few thousand times stinks. Fortunately, we make automation tools, so we taught them to upgrade themselves on command. The central server distributes new versions of the agent code to the agents, which then upgrade themselves.
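The mechanics aren’t exotic. Here’s a rough sketch of the agent side of such a scheme – class names and URL layout are invented, not our actual protocol – where an “upgrade” command makes the agent download the new code from the server and restart itself into the new version:

```java
// Rough sketch of agent self-upgrade; names and URL layout are invented.
import java.io.InputStream;
import java.net.URI;
import java.nio.file.*;

public class AgentUpgrader {

    /** Called when the central server sends an "upgrade to <version>" command. */
    public void upgradeTo(String version, URI serverUrl) throws Exception {
        // 1. Pull the new agent code down from the central server.
        Path newJar = Paths.get("agent-" + version + ".jar");
        try (InputStream in = serverUrl.resolve("/agent-dist/" + version + ".jar")
                                       .toURL().openStream()) {
            Files.copy(in, newJar, StandardCopyOption.REPLACE_EXISTING);
        }

        // 2. Launch the new version; it re-registers with the server on startup.
        new ProcessBuilder("java", "-jar", newJar.toString()).inheritIO().start();

        // 3. Let the old process exit so only the upgraded agent keeps running.
        System.exit(0);
    }
}
```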
Avoiding agent upgrades
Even centralized automatic upgrades proved to be a barrier. With a 99.5% success rate, any upgrade of thousands of agents would leave tens of servers offline. This could block releases and generally be a pain to diagnose and repair. So we moved our protocols away from serialization and have managed to make most of our application upgrades not require agent upgrades.
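The detail that matters is forward compatibility: if messages are self-describing text rather than serialized objects, an older agent simply ignores fields it doesn’t understand instead of failing to deserialize. A minimal sketch of that idea (not our actual wire format):

```java
// Minimal sketch of a version-tolerant message format; not the actual wire protocol.
import java.util.*;

public class ToleranceDemo {

    /** Parse "key=value" lines, keeping what we recognize and ignoring the rest. */
    static Map<String, String> parse(String message) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : message.split("\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                fields.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // A newer server adds a "timeout" field the old agent has never heard of.
        String command = "command=run-step\nstep=deploy-war\ntimeout=300";

        Map<String, String> fields = parse(command);

        // The old agent reads only the fields it knows about; the unknown
        // "timeout" field is carried along harmlessly rather than breaking
        // deserialization the way a serialized object with a new field would.
        System.out.println("Running " + fields.get("step")
                + " for command " + fields.get("command"));
    }
}
```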
Some customers have lengthy validation programs for new versions, which prevented them from taking minor updates with fixes to bugs that bothered them. End users would be frustrated and file tickets about bugs we’d fixed months earlier but that their system administrators had declined to take via an upgrade. Clearly we had angry customers and higher support costs because we hadn’t made our “Ops” teams nimble enough. To drive down the cost of taking fixes, we added a special patch capability that can be installed (and uninstalled) in minutes, in some cases without any downtime.
We found that the most common features and updates end users wanted were related to integrations. We introduced plugins in 2009 to decouple updating an integration from a full upgrade. Each plugin represents a single integration and is versioned, uploaded to the tools through the web interface, and transparently distributed to agents. Power users can upgrade integrations on a regular basis without any involvement from the sys-admins. Plugins run on any version of the application supporting their schema level – most plugins run on any vaguely recent version of the products. As a bonus, the standardization that plugins required considerably reduced our development effort per unit of value delivered.
From a certain perspective, plugins are architecturally similar to micro-services. Each is a discrete collection of functionality that may be pulled into a larger system which has a mostly unrelated release cadence.
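As a rough sketch of the contract that makes this work – an invented interface, not the actual uDeploy or uBuild plugin API – each plugin declares what it integrates with, its own version, and the oldest plugin-schema level it needs, so the server can accept and distribute it independently of the product’s release cycle:

```java
// Rough sketch of a plugin contract; this interface is invented for illustration,
// not the actual uDeploy/uBuild plugin API.
import java.util.Map;

public interface IntegrationPlugin {

    /** Which integration this plugin provides, e.g. "jira" or "tomcat". */
    String integrationId();

    /** The plugin's own version, bumped independently of the product. */
    int version();

    /** The oldest plugin-schema level this plugin requires from the product. */
    int requiredSchemaLevel();

    /** Execute one integration step with the configured properties. */
    void execute(String stepName, Map<String, String> properties) throws Exception;
}
```

On upload, the server only has to check the declared schema level against its own before distributing the plugin to agents, which is what lets a years-old product version keep taking new integrations.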
Developers working support
We have always had a close working relationship between development and support. From time to time, we have even rotated developers into support roles for a week or two at a time. This is a common DevOps strategy for a reason: there’s nothing like tracing a customer problem back to your own code to nudge you towards quality.
We also encouraged the sys-admins who planned upgrades over weekend and evening hours to let us know, and we made sure to have people familiar with the upgrade standing by in support. Any risk to a successful upgrade would be partially borne by the developers responsible for creating it.
Over the years, we’ve added more and more diagnostic information into the tools themselves. Some of it is highly technical, such as thread timers and memory utilization. Other diagnostics live within the problem domain – uBuild, for instance, will report on the relationships between builds to help diagnose why a build happened (or was skipped). These tools support the sys-admins in inspecting the software and provide the support team with key information.
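The “highly technical” end of that spectrum is cheap to build in. A small sketch using the JVM’s standard management beans (the problem-domain reports are obviously product specific and more involved):

```java
// Small sketch of built-in JVM diagnostics using standard java.lang.management beans.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

public class DiagnosticsReport {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long usedMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
        long maxMb = memory.getHeapMemoryUsage().getMax() / (1024 * 1024);

        // The kind of information a sys-admin or support engineer wants first.
        System.out.println("Heap: " + usedMb + " MB used of " + maxMb + " MB max");
        System.out.println("Live threads: " + threads.getThreadCount()
                + " (peak " + threads.getPeakThreadCount() + ")");
    }
}
```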
Encourage customers to have test environments
Larger customers often put in place test environments for our products where they can run upgrades against a copy of their data and play with the new features before using them in production. Running the same process against the same data prior to the official cut-over boosts confidence and facilitates more frequent updates.
Key Lessons Learned
- Yes, there is an “Ops” role for your product. Understand who that is. Consider, for example, what problems an app store solves.
- Developers must cater to anyone between them and the end user.
- Any part of an upgrade that could be automated should be.
- Decouple. It’s not enough to be multi-tier, you need to decouple upgrades and changes as much as possible to make upgrades cheap and low risk.
- If you build it, you help run it. Expose developers to support and regular communication with operations teams. More diagnostics built into the tool is a natural result.
My iPhone, now almost two years old, is better than the day I bought it. It’s banged up, and there’s a nasty scratch on the screen, but it has constantly improved. New apps are available every day, and the base capabilities are improved with regular OS upgrades.
Meanwhile, when I was car shopping last month, I encountered cars with Pandora support and those without. Even when all the physical requirements (ports or Bluetooth) were met, I was confident that most of the cars I looked at that lacked an integration with the audio streaming service would always lack that capability. Next year’s model might have it, but whatever car I purchased would constantly get worse.
Other “things” get better
Smart phones are not unique in getting better. Amazingly, there are internet-controlled thermostats that attempt to run the heat and air conditioning while you’re home and not when you’re away. Since they’re Wi-Fi enabled, they automatically take software updates that improve their algorithms. Just about everything in my entertainment center gets better too. The set-top box and gaming console both get regular updates, and the DVD player has a USB port that makes updating the firmware to support new standards a ten-minute process.
The things that update most frequently are those with screens that consumers interact with and that are connected to the internet. As we look forward to the “Internet of Things,” when more and more of our home appliances are Wi-Fi enabled, smart, and talking to one another, I expect consumers to become increasingly conditioned to the stuff in their lives just getting better.
Cars are starting to get better
The good news is that it appears cars are going to get better too. Ford has clear instructions for updating their MyTouch system, and a recent update even adds a new audio app (for audible.com). Mercedes is out with an over-the-air upgrade built around the concept of apps, and Tesla owners debate the merits of various updates in their forums.
Most manufacturers don’t offer these capabilities (or at least don’t make it obvious with a quick search). I think we can expect the trend to continue, and for the Mercedes model of an over-the-air option to become more common. The trend is also towards updating the infotainment systems first, which most clearly matches our smartphone and tablet experiences.
Some manufacturers may be tempted to stick with the status quo of not providing easy upgrades in order to nudge their customers into buying a new car sooner. The result is that their cars will have worse resale values over time, making their products effectively more expensive. Automakers that do not provide easy software upgrades will repeat the mistake of planned obsolescence.
Instead, we can expect upgrades to become more frequent and easier over time. An industry groomed on multi-year product design cycles will need to become proficient at agile development and DevOps.
What about the software that controls the brakes?
While the infotainment systems are easy candidates for frequent updates, a significant portion of the software in vehicles is safety critical. Part of Toyota’s response to a safety recall was to update the software that controls what happens when both the accelerator and brakes are depressed (brakes now win).
Software can control trade-offs such as how responsive a vehicle is to the accelerator versus fuel efficiency. If an engineer working on next year’s model finds an algorithm tweak that improves fuel efficiency, would that be pushed out to existing models? How would that impact regulatory bodies that demand efficiency gains? Would an automaker be credited for improving cars already on the road the same way they are for improving efficiency in the new model? I’ll leave that to the automakers and regulators to negotiate.
However, as we look at software that is increasingly safety critical and regulation impacting, it seems safe to assume that updates will be better tested and less frequent. Again, automakers can turn to their IT departments for a parallel. IT will talk in terms of systems of engagement (the website or mobile app) and systems of record (financial records). While the two are often connected, their release cadences can vary widely.
As the things around us, and the things we drive, become increasingly software driven and connected, we can expect that they will improve while we own them. That’s pretty great. But we can also expect growing pains as organizations used to shipping products that are rarely, if ever, updated adapt to a world in which they can respond to negative reviews of their product by releasing a patch.
We’ve covered the controversy of a DevOps Team on this blog before. DevOps Teams are tempting because many organizations realize that their Dev and Ops groups are so far apart that they need a neutral, expert group to bring them together. At the same time, there are increasing reports of DevOps teams becoming yet another silo – and one that is often arrogant and disliked. Since more silos are exactly the opposite of what we are going for, this is a frightening result.
The dangers in a DevOps Team seem to be that they will:
- End up owning a lot of things, and become a silo unto themselves
- Be overly aggressive in dictating how teams should work
A little over a week ago at the IBM Innovate conference in Orlando, an attendee shared her successful experience creating a DevOps Liaison Team. I am very intrigued by the naming here. When the team is formed with a name and charter to bring the other groups together, it is hard for it to either own systems or dictate how others work.
The other pattern that I liked was what WebMD did in their project to select and implement a deployment automation tool. They went to managers in various traditional silos (Dev, QA, Ops, etc.) and asked for a techie who:
- Did real work
- Had the respect of their peers
- Would be delegated the manager’s authority to make compromises
They ended up with a team full of skilled engineers who would work together, but then go back to their own teams once the project was done. However, they had formed solid cross-silo working relationships and could act as liaisons between their group and the others.
Both of these approaches seem to form a DevOps Team of sorts, without creating something evil. They also take a chisel to the walls between groups rather than trying to reorganize radically all at once. I’m encouraged that our industry is beginning to find some healthy patterns for enterprise DevOps adoption.
For more on non-evil DevOps teams, check out our recorded webinar: Building a DevOps Team that Isn’t Evil.
What about you? Have you had success forming a dedicated team that helps the rest of the organization grok DevOps? Have you failed? Leave your tips in the comments area below.