Continuous Integration that Doesn’t Fit in 10 Minutes

The CI community has made plenty of compelling arguments that fast builds facilitate continuous integration. By “fast” we mean the time it takes to get a cup of coffee, or at least under 10 minutes.

The Situation

But what do we do when the build (not the integration tests, just compile and package) takes 30 minutes and produces large files? What if hundreds of changes a day are potentially being introduced? Can we get some of the benefits of CI while keeping hardware and storage utilization in check?

A customer recently asked me these questions. Here’s their (slightly modified) scenario:

  • Builds (mostly compile and package) take 30 minutes
  • Developers commit frequently (the human side of CI is happening)
    • On average there are 150 commits per day to the project
    • Commits occur mostly in a 10 hour period (standard working day)
  • The build products are about 3 GB in size
  • Build machines and disk space are finite resources

Classic Continuous Integration

In classic CI, AnthillPro will trigger a build on every atomic commit. With 150 thirty-minute builds falling in 10 hours, some quick math indicates that we’ll need something like an eight-machine build cluster to service all these builds. If we have the hardware, that’s great. It also means that we’ll be generating 450 GB of reusable artifacts a day. I love CI, and would be happy to argue for an eight-machine build cluster to facilitate rapid feedback, but I wouldn’t want to justify that kind of storage expense to my manager. An AnthillPro Cleanup Policy can be instructed to throw out all but the most recent X unpromoted builds, but teams usually want to keep a couple of days of builds around so they have options about what to move to test servers, pass to testers, approve for release, etc. Cleanup policies help, but having a couple of terabytes of data from the last few days of CI builds still smells like waste. We’ll need to get smarter.
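The quick math above can be written out as a short sketch. It assumes commits are spread evenly over the working day and that each commit triggers exactly one 30-minute build; the variable names are illustrative only.

```python
import math

commits_per_day = 150   # one build per commit in classic CI
build_minutes = 30      # compile-and-package time per build
working_hours = 10      # window in which the commits arrive
artifact_gb = 3         # size of one build's products

# Total build work per day, in machine-hours.
machine_hours = commits_per_day * build_minutes / 60

# Machines needed to finish that work inside the working day.
machines_needed = math.ceil(machine_hours / working_hours)

# Artifacts produced per day if every build publishes its files.
storage_gb_per_day = commits_per_day * artifact_gb

print(machine_hours, machines_needed, storage_gb_per_day)
```

Under these assumptions the work comes to 75 machine-hours, which rounds up to eight machines, and 450 GB of artifacts per day.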

The Reactionary Response

The energetic engineer who likes the idea of CI presents it to his manager. All it will take is purchasing six new build machines and getting a couple terabytes of network storage.

The classic response of a manager is pretty easy to predict. “Get out of my office. Nightly builds will be fine.” Scalability solved. CI dead. Bummer.

A compromise might be struck with a build every X hours but the battle has been lost.

Operating within Constraints

Let’s look at a more interesting compromise. The team is given four build servers and 100 GB for storing CI builds that haven’t been promoted in any way. Hourly builds will use just a single machine but a policy of keeping all builds for at least two days will use up the bulk of the disk space. Developers will get feedback about a typical commit in about an hour – not great but better than nightly.

However, hourly builds leave 80-90% of our build machine capacity unused. Distributed build tools like Dmake or Incredibuild could tap the rest of the machines to provide faster feedback and start moving towards that 10 minute build.
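As a rough check on the hourly-build compromise, the sketch below works out the retained storage and the idle capacity. It assumes one 30-minute build per hour during the 10-hour working day and a two-day retention policy; the variable names are illustrative only.

```python
machines = 4
working_hours = 10
build_minutes = 30
artifact_gb = 3
retention_days = 2

builds_per_day = working_hours  # one build per hour

# Disk used by keeping every hourly build for the retention window.
retained_gb = builds_per_day * retention_days * artifact_gb

# Share of the farm's machine-hours that the hourly builds leave idle.
busy_machine_hours = builds_per_day * build_minutes / 60
capacity_machine_hours = machines * working_hours
unused_fraction = 1 - busy_machine_hours / capacity_machine_hours

print(retained_gb, unused_fraction)
```

That gives 60 GB of the 100 GB budget and 87.5% of capacity unused, in line with the figures quoted above.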

Let’s work around our disk space limitation, though. We are interested in most of our CI builds primarily for rapid status feedback, not as potential deployable or promotable releases. What if we only published artifacts from a small subset of the builds? Then we could build like crazy and keep our disk usage down. So let’s introduce a checkbox to our process. If it is checked, we’ll publish artifacts. Otherwise we will not. Our standard CI builds will not publish any files, but manually triggered builds might, and some regularly scheduled builds (nightly or every X hours) will. Only the builds with published files will be available for promotion.

To control whether the artifacts are published, add a step precondition to the Artifact Deliver steps. An example precondition script for this would be:

Logic.and(
  StepStatus.allPriorIn(new JobStatusEnum[] { JobStatusEnum.SUCCESS, JobStatusEnum.SUCCESS_WARN }),
  Property.is("publish.files", "true")
);

The files are managed now, but (assuming one build per machine) our four build machines only provide a ten-hour throughput of about 80 builds, which is far less than our estimate for a build per commit (about 150). AnthillPro will actually handle this balancing act automatically. Once the machines are saturated, AnthillPro will start queuing build requests. When it sees two requests from the same source (the CI trigger) both waiting in the queue, it will automatically merge those requests. The net result is that the build farm stays busy with four builds so long as there is waiting work. In the worst case, feedback from a commit will be delivered in just an hour. In the normal scenario, a fully loaded build farm will turn around feedback about a commit in under 40 minutes.
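The merging behaviour can be pictured with a toy model. This is not AnthillPro’s implementation: the `MergingQueue` class and its methods are hypothetical. The timing figures are one reading of where “under 40 minutes” and “an hour” come from, assuming build starts are evenly staggered in the normal case and all four machines start together in the worst case.

```python
from collections import OrderedDict


class MergingQueue:
    """Toy build queue: waiting requests from one source collapse into one."""

    def __init__(self):
        self._waiting = OrderedDict()  # source -> commits covered by the request

    def submit(self, source, commit):
        # A second request from the same source merges into the waiting one.
        self._waiting.setdefault(source, []).append(commit)

    def next_build(self):
        """Pop the oldest waiting request, or None if nothing is waiting."""
        if not self._waiting:
            return None
        return self._waiting.popitem(last=False)

    def __len__(self):
        return len(self._waiting)


queue = MergingQueue()
for commit in ("r101", "r102", "r103"):  # arrive while all machines are busy
    queue.submit("ci-trigger", commit)
waiting_requests = len(queue)            # one merged request, not three
source, merged_commits = queue.next_build()

machines = 4
build_minutes = 30
working_minutes = 10 * 60

# Builds the farm can complete in one working day.
throughput = machines * working_minutes // build_minutes

# Evenly staggered starts: a machine frees up every 7.5 minutes.
typical_feedback = build_minutes / machines + build_minutes

# All machines just started: wait one full build, then run one.
worst_feedback = build_minutes + build_minutes

print(waiting_requests, merged_commits, throughput, typical_feedback, worst_feedback)
```

In this model the three waiting commits are covered by a single build, which is why the farm can keep up with 150 commits despite a throughput of only 80 builds.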

Mission Accomplished

Basically, we’re able to get reasonably rapid feedback to our development team in support of continuous integration without breaking the bank on build machines or storage space. Do we get feedback to developers before context switching penalties occur? No. We’d need to speed up the build. Can we promote any CI build we like? No. Skipping artifact publishing for most builds precludes that.

However, we are able to provide feedback to our developers in under an hour and have a number of builds per day available for promotion. It’s not perfect, but it’s worlds better than nightly builds.

AnthillPro Video: User selectable agents for deploy/test

One of my favorite features in AnthillPro 3.7 is also one of the least understood. Scripted workflow properties allow the system to prompt users to supply input to a workflow based on current conditions, other property values, or even the output of a quick job.

In this video, we look at a scenario where the team wants to give the user kicking off the workflow a list of all the agents that are online in the target environment, so the user can select which agents the workflow will run on. We’ve seen this technique used quite a lot for ad-hoc testing, and occasionally in deployment scenarios where each tester has one or two machines in QA that belong to them.

The scripts mentioned in the video are available in the AnthillPro script repository.

Configure a Multiplatform Build in AnthillPro

A common build scenario is the need to compile on several different platforms. The basic build is the same everywhere, but a couple of parameters change from platform to platform. Here we look at the basics of configuring this kind of scenario in AnthillPro.

You need:

  • A build job that is parameterized
  • A workflow that iterates over that build job for each platform
  • An agent filter that directs each iteration to the appropriate platform
  • Properties on the iteration that are the build job parameters and inform the filter
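How the four pieces fit together can be sketched in a few lines. This is a conceptual model only, not AnthillPro configuration: the agent names, property keys, and the `make` command line are all hypothetical.

```python
# Per-iteration properties: they parameterize the build job AND inform the filter.
platforms = [
    {"platform": "linux", "compiler": "gcc"},
    {"platform": "windows", "compiler": "msvc"},
    {"platform": "solaris", "compiler": "cc"},
]

# Hypothetical agents, each advertising the platform it runs on.
agents = [
    {"name": "agent-lin-01", "platform": "linux"},
    {"name": "agent-win-01", "platform": "windows"},
    {"name": "agent-sol-01", "platform": "solaris"},
]


def select_agent(agents, iteration_props):
    """Agent filter: first agent whose platform matches the iteration's property."""
    for agent in agents:
        if agent["platform"] == iteration_props["platform"]:
            return agent
    raise LookupError("no agent for platform " + iteration_props["platform"])


def build_job(agent, props):
    """Parameterized build job; returns the command it would run on the agent."""
    return f"[{agent['name']}] make CC={props['compiler']} TARGET={props['platform']}"


def run_workflow(platforms, agents):
    """Workflow: iterate the same build job once per platform."""
    return [build_job(select_agent(agents, props), props) for props in platforms]


results = run_workflow(platforms, agents)
for line in results:
    print(line)
```

The point of the design is that one set of iteration properties does double duty: it supplies the build parameters and tells the filter which agent should run that iteration.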

Streamline Notification with Properties in ‘Fixed’ Selectors

AnthillPro notification schemes are made up of three pieces: a template for the email (or IM) to go out, an event selector to determine when to notify, and a recipient generator that chooses who to mail. The simplest form of recipient generator is the “Fixed” kind that just notifies some list of fixed addresses.

In 3.7, we loosened the definition of “Fixed” by allowing the values in the selector to reference properties. This has some interesting effects. For instance, we can create a single selector that mails existing team mailing lists and adapts itself to each project. The selector might be configured to mail ${property:mailing-list}@mycompany.com. If my project has a “mailing-list” property set to myProjectTeam, the notification will go to myProjectTeam@mycompany.com. Likewise, we could create a “team lead” property on each project for a scheme that emails the team lead on every failure.
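The substitution described above behaves roughly like the following sketch. AnthillPro resolves the property itself; the `resolve_recipient` function here is a hypothetical stand-in that only illustrates the effect of the `${property:...}` placeholder.

```python
import re

# Matches ${property:some-name} and captures the property name.
PLACEHOLDER = re.compile(r"\$\{property:([^}]+)\}")


def resolve_recipient(template, project_properties):
    """Replace each ${property:name} with the project's value for that name."""

    def lookup(match):
        name = match.group(1)
        if name not in project_properties:
            raise KeyError("project has no property named " + name)
        return project_properties[name]

    return PLACEHOLDER.sub(lookup, template)


template = "${property:mailing-list}@mycompany.com"
address = resolve_recipient(template, {"mailing-list": "myProjectTeam"})
print(address)  # myProjectTeam@mycompany.com
```

The same template yields a different address for every project that defines its own “mailing-list” value, which is what lets one scheme serve many projects.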

Using VMWare with Temporary Test Environments

One of the things the economic downturn has done is refocus teams on using resources efficiently. Virtualization adoption was already accelerating, and it remains an area where teams are trying to use the same hardware for a number of different purposes.

AnthillPro plays nicely in a virtualized environment. Build machines tend to always be needed, so having at least most of the build machines be always present makes sense. Machines for automated functional tests are more likely to be used off and on. We see numerous teams looking to start their VMs, run the tests, and then shut the VMs back down. The AnthillPro VMWare LabManager integration can help streamline this process.

Request Contexts

Adventures in Advanced AnthillPro Concepts

The idea of a request context is foreign to most AnthillPro users and administrators but is key to understanding how AnthillPro manages build order and consistency along a dependency hierarchy. Administrators using dependencies aggressively would be well served by understanding the request context.

Understanding “Requests”

Before we get into the request context, we should understand the idea of a build request. Immediately after the user presses the Build button, AnthillPro creates a build request. This starts the process towards creating the build, but doesn’t always immediately result in a build. For instance, if the “Force” box isn’t checked, AnthillPro might start by asking source control if there are changes. If there aren’t any changes in source control since the last build, and none of the project’s dependencies have changed, the request will terminate, having decided that a build is unneeded.
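The decision described above can be summarized as a small sketch. It models only the rule stated in this paragraph; the `BuildRequest` type and its field names are hypothetical, not part of AnthillPro.

```python
from dataclasses import dataclass


@dataclass
class BuildRequest:
    force: bool                 # the "Force" box on the build form
    source_changed: bool        # source control reports changes since last build
    dependencies_changed: bool  # at least one dependency has changed


def should_build(request):
    """Return True if the request leads to a build, False if it terminates."""
    if request.force:
        return True
    return request.source_changed or request.dependencies_changed


print(should_build(BuildRequest(force=False, source_changed=False, dependencies_changed=False)))
print(should_build(BuildRequest(force=False, source_changed=False, dependencies_changed=True)))
```

The first request terminates without a build, while the second proceeds because a dependency changed.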