The CI community has made plenty of compelling arguments that fast builds facilitate continuous integration. And by “fast” we mean the time it takes to get a cup of coffee, or at least under 10 minutes.
The Situation
But what do we do when the build (not the integration tests, just compile and package) takes 30 minutes and produces large files? What if there are potentially hundreds of changes a day being introduced? Can we get some of the benefits of CI while keeping hardware and storage utilization in check?
A customer recently asked me these questions. Here’s their (slightly modified) scenario:
- Builds (mostly compile and package) take 30 minutes
- Developers commit frequently (the human side of CI is happening)
- On average there are 150 commits per day to the project
- Commits occur mostly in a 10 hour period (standard working day)
- The build products are about 3 GB in size
- Build machines and disk space are finite resources
Classic Continuous Integration
In classic CI, AnthillPro will trigger a build on every atomic commit. With 150 thirty-minute builds falling in 10 hours (75 machine-hours of work), some quick math indicates that we’ll need something like an eight machine build cluster to service all these builds. If we have the hardware, that’s great. It also means that we’ll be generating 450 GB of reusable artifacts a day. I love CI, and would be happy to argue for an eight machine build cluster to facilitate rapid feedback, but I wouldn’t want to justify that kind of storage expense to my manager. An AnthillPro Cleanup Policy can be instructed to throw out all but the most recent X unpromoted builds, but usually teams want to keep a couple of days of builds around so they have options around what to move to test servers, pass to testers, approve for release, etc. Cleanup policies help, but having a couple of terabytes of data around for the last few days of CI builds still smells like waste. We’ll need to get smarter.
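The quick math above can be reproduced with a short sketch. The function name and parameters are illustrative only, and the model assumes commits are spread evenly across the working day, which real commit traffic will not be.

```python
import math

def classic_ci_cost(commits_per_day, build_minutes, window_hours, artifact_gb):
    """Return (machines_needed, gb_per_day) when every commit gets its own build.

    Assumes commits arrive evenly over the working window (a simplification).
    """
    machine_hours = commits_per_day * build_minutes / 60
    machines_needed = math.ceil(machine_hours / window_hours)
    gb_per_day = commits_per_day * artifact_gb
    return machines_needed, gb_per_day

# Scenario from the article: 150 commits, 30 minute builds, 10 hour day, 3 GB each.
machines, gb_per_day = classic_ci_cost(150, 30, 10, 3)
print(machines, gb_per_day)  # 8 450
```

Keeping three or four days of these builds is what pushes storage past a terabyte.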
The Reactionary Response
The energetic engineer who likes the idea of CI presents it to his manager. All it will take is purchasing six new build machines and getting a couple terabytes of network storage.
The classic response of a manager is pretty easy to predict. “Get out of my office. Nightly builds will be fine.” Scalability solved. CI dead. Bummer.
A compromise might be struck with a build every X hours but the battle has been lost.
Operating within Constraints
Let’s look at a more interesting compromise. The team is given four build servers and 100 GB for storing CI builds that haven’t been promoted in any way. Hourly builds will use just a single machine but a policy of keeping all builds for at least two days will use up the bulk of the disk space. Developers will get feedback about a typical commit in about an hour – not great but better than nightly.
However, hourly builds leave 80-90% of our build machine capacity unused. Distributed build tools like Dmake or IncrediBuild could tap the rest of the machines to provide faster feedback and start moving us toward that 10 minute build.
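As a rough check on the hourly-build numbers, here is a minimal sketch. It assumes builds run only during the 10 hour working day and that each build occupies one machine for 30 minutes; the function name and parameters are made up for illustration.

```python
def hourly_build_budget(machines, build_minutes, window_hours,
                        artifact_gb, retention_days):
    """Return (idle_fraction, storage_gb) for one build per hour.

    Assumes builds are only triggered during the working window.
    """
    builds_per_day = window_hours                    # one build each hour
    busy_hours = builds_per_day * build_minutes / 60
    capacity_hours = machines * window_hours
    idle_fraction = 1 - busy_hours / capacity_hours
    storage_gb = builds_per_day * artifact_gb * retention_days
    return idle_fraction, storage_gb

idle, storage = hourly_build_budget(4, 30, 10, 3, 2)
print(idle, storage)  # 0.875 60
```

Under these assumptions about 88% of the farm sits idle, and two days of hourly builds take 60 GB of the 100 GB allowance.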
Let’s work around our disk space limitation first, though. Most of our CI builds interest us primarily for rapid status feedback, not as potential deployable or promotable releases. What if we only published artifacts from a small subset of the builds? Then we could build like crazy and keep our disk space down. So let’s introduce a checkbox to our process. If checked, we’ll publish artifacts. Otherwise we will not. Our standard CI builds will not publish any files, but manually triggered builds might, and some regularly scheduled builds (nightly or every X hours) will. Only the builds with published files will be available for promotion.
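To get a feel for how many builds can afford to tick that checkbox, a small sketch follows. It assumes the whole 100 GB allowance goes to published but unpromoted builds and that each one is kept for the two-day retention period; the function name is hypothetical.

```python
def max_publishing_builds_per_day(budget_gb, artifact_gb, retention_days):
    """How many builds per day can publish artifacts without exceeding the budget."""
    gb_per_daily_build = artifact_gb * retention_days  # each daily slot holds N days of copies
    return budget_gb // gb_per_daily_build

print(max_publishing_builds_per_day(100, 3, 2))  # 16
```

So roughly 16 publishing builds a day fit within the budget, which leaves room for the scheduled builds plus a handful of manual ones.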
To control whether the artifacts are published, add a step precondition to the Artifact Deliver steps. An example precondition script for this would be:
Logic.and(
    StepStatus.allPriorIn(new JobStatusEnum[] { JobStatusEnum.SUCCESS, JobStatusEnum.SUCCESS_WARN }),
    Property.is("publish.files", "true")
);
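For readers unfamiliar with AnthillPro scripting, the same decision can be restated in plain Python. This is only a sketch of the boolean logic, not AnthillPro's API: `should_publish`, the status strings, and the dictionary of properties are stand-ins invented for illustration.

```python
# Statuses that count as a successful prior step (mirrors the script above).
SUCCESS_STATUSES = {"SUCCESS", "SUCCESS_WARN"}

def should_publish(prior_step_statuses, properties):
    """Publish only if every prior step succeeded AND the checkbox is ticked."""
    all_prior_ok = all(s in SUCCESS_STATUSES for s in prior_step_statuses)
    return all_prior_ok and properties.get("publish.files") == "true"

print(should_publish(["SUCCESS", "SUCCESS_WARN"], {"publish.files": "true"}))  # True
print(should_publish(["SUCCESS", "FAILURE"], {"publish.files": "true"}))       # False
print(should_publish(["SUCCESS"], {}))                                         # False
```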
The files are managed now, but (assuming one build per machine at a time) our four build machines only provide a ten hour throughput of about 80 builds (four machines at two builds an hour each), which is far less than our estimate for build per commit (about 150). Fortunately, AnthillPro will handle this balancing act automatically. Once the machines are saturated, AnthillPro will start queuing build requests. When it sees two requests from the same source (the CI trigger) both waiting in the queue, it will automatically merge those requests. The net result will be that the build farm stays busy with four builds so long as there is waiting work. In the worst case scenario, feedback from a commit will be delivered in just an hour. In the normal scenario, a fully loaded build farm will get feedback about a commit turned around in under 40 minutes.
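The merging behavior can be illustrated with a toy simulation. This is a sketch of the idea only, not AnthillPro's scheduler: it assumes one commit every four minutes (150 over the day), a single merged request waiting at any time, and that a freed machine picks up that request immediately. All names are hypothetical.

```python
import heapq

def simulate_merging(commit_times, machines=4, build_minutes=30):
    """Return (latencies, builds_run) for a farm that merges queued requests."""
    free_at = [0] * machines   # min-heap: minute at which each machine is next idle
    pending = []               # commits covered by the single merged request
    latencies = []
    builds_run = 0

    def start_build(start, commits):
        nonlocal builds_run
        finish = start + build_minutes
        heapq.heapreplace(free_at, finish)  # earliest-idle machine takes the build
        latencies.extend(finish - c for c in commits)
        builds_run += 1

    for t in sorted(commit_times):
        # A machine freed up while requests were waiting: run them as one build.
        if pending and free_at[0] <= t:
            start_build(free_at[0], pending)
            pending = []
        if free_at[0] <= t:
            start_build(t, [t])   # idle machine available, build right away
        else:
            pending.append(t)     # farm saturated, merge into the waiting request
    if pending:
        start_build(free_at[0], pending)
    return latencies, builds_run

commits = range(0, 600, 4)  # 150 commits, one every 4 minutes
latencies, builds_run = simulate_merging(commits)
print(f"{builds_run} builds for {len(latencies)} commits, "
      f"slowest feedback {max(latencies)} min")
```

In this model no commit waits more than an hour, and far fewer than 150 builds are run. How close typical feedback gets to the best case depends on how evenly the running builds end up staggered, which the toy model does not try to optimize.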
Mission Accomplished
Basically, we’re able to get reasonably rapid feedback to our development team in support of continuous integration without breaking the bank on build machines or storage space. Do we get feedback to developers before context-switching penalties occur? No. We’d need to speed up the build. Can we promote just any of our CI builds? No. Skipping artifact publishing for most builds precludes that.
However, we are able to provide feedback to our developers in under an hour and have a number of builds per day available for promotion. It’s not perfect, but it’s worlds better than nightly builds.
