How we did it
It was morning at the software engineering office in Bristol, UK, and I grabbed a fellow engineer from the cubicle next to mine.
– Can you come over here for a second? – I said in a completely casual voice.
I took him to the whiteboard hanging in my cubicle.
– You know what? – I was still speaking casually and calmly. – That solution that I proposed earlier? – I paused briefly. – It’s not going to work.
He smiled.
I often bounced ideas off Ron. He was involved in every technical aspect of what our team was doing.
– But last night I came up with a new one that will work. –
I proceeded to draw the failing solution on the whiteboard.
– I created a prototype of the previous solution and last evening I found that with a high rate of changes it becomes completely unusable. – We proceeded to discuss the shortcomings of the previously proposed system. – Here’s a new idea. – I drew it on the whiteboard.
– That will work. – Ron said confidently.
We were working on an R&D project. The previously proposed solution didn’t work. But there was something different about the new one. It was simple. It made me think “why had nobody thought of it sooner?”. That was a good sign. I figured that the best way forward would be to create a prototype of the new method first. And it wasn’t that difficult to do. The real difficulty was that, when we started the project, we didn’t know how. Once we knew how, the coding was easy.
The Goal
John joined us from the IT consulting group of the corporation. He had worked for the company for over a decade. He brought a talent for explaining complex matters in a simple way. By June 2012 he had been with Cloud Services for over a year and had advanced to the position of Technical Lead of Compute Engineering. He called for a meeting where he proposed that we change the way we worked. John wanted us to work closer with the Open Source community. That was when the project started. I like to call it Federative Continuous Integration.
By June 2012, we had already been using Open Source software. However, our mode of operation was to take a release and stay on it for months. We would make custom changes to it, and we would borrow and share changes with the community only infrequently. By that time, we had been on one release for about 9 months.
Instead of doing that, we wanted to synchronise with the community every day. It made business sense. There’s much less risk in making frequent small changes to a critical system than in making large, infrequent ones. It makes engineering a lot more predictable, and it minimises outages and delays. It makes customers, managers and shareholders happier. On the technical side, it is to our advantage to be able to consume the latest bug fixes and features developed by the community. We also wanted to contribute our changes back frequently, so that we could influence the technical direction of the community, and to take advantage of the community’s technical review process.
To put it in context, in the 2000s innovative software engineering organisations embraced the idea of implementing change in the smallest possible deltas, applied to the mission-critical system multiple times a day. The concept is called Continuous Integration (CI). It reduces the risks and unpredictability traditionally associated with software engineering. To make CI possible, each individual proposed small change undergoes extensive automatic or semi-automatic testing. If the tests pass, the change can be applied without being delayed by other proposed changes.
However, the CI concept usually applies only to in-house software used in a specific in-house environment. We, by contrast, wanted to consume software created by a global community of independent organisations and individuals, over which we had no control and which was not being developed to fit our specific in-house environment. That was a greater challenge. We wanted to do Federative Continuous Integration (FCI).
At some point I realised that not only did it make business sense, but it also had two important socioeconomic implications. Firstly, it changes the way innovation is done worldwide. Instead of happening in large leaps over months or years, it happens in small steps multiple times a day. Secondly, it changes the way organisations work with each other. Traditionally, organisations compete by innovating in secrecy from each other. In the FCI model, however, companies compete on the service they provide, while engineers from the competing companies collaborate to innovate together. As a result, the global innovation system is more efficient.
I was given the privilege to be the lead engineer on the Federative Continuous Integration project at Cloud Services Compute Engineering.
I met Jeff in Bristol, UK. He had arrived from the US to meet with us. He was leading the community Continuous Integration system for the Open Source software we were using. The software is called OpenStack. It has thousands of contributors, and as many as 50 changes are made to it daily. The CI system allows every proposed change to be comprehensively reviewed and automatically tested. We were going to integrate with that system. Jeff was sociable and chatty and gave us some good advice. But the greater the success and velocity of his team’s system, the greater the challenge it was going to be for us.
Problem
Martin was a skilful engineer involved in our tooling and Ron was a generalist involved in everything. The three of us were staring at the computer screen in silence and anticipation. Messages were scrolling. It seemed to take forever.
– ERROR – was the final message on the computer screen.
– Yes! – We exclaimed at once.
It was a great success. It was a different error than the last time, and it appeared later in the process.
We were making progress. But it was a race against time. We were trying to make the latest community software work with our custom changes and deployment system. In the meantime, the community software was changing quickly. We didn’t yet have the tooling to keep up with the rate of change. But, ironically, to develop and test the tooling, we had to have our system working with recent community software. It was a chicken-and-egg problem. We knew that, in order to keep up in the long term, we had to resolve any incompatibility between the community software and our environment within a day or so of the problem appearing. For now, we were a few weeks behind.
The problem with using community software — or upstream, as we call it — was that it was a moving target. We needed to maintain a custom configuration, a deployment system and changes on top of it, while it was continuously changing.
John came up with the idea of dividing the source code changes into three categories. Type I are the changes we contribute directly to the community. Type II are changes we create in the form of plug-ins, drivers, aspects or similar. Type III are changes we make directly on the source code in the form of in-house patches.
Categories are sorted from the most desirable to the least desirable. Most of our work would be done through changes type I. We follow the standard community processes to contribute that type of change, and we use such changes only after the community accepts them. Changes that are specific to our environment we implement as type II. Changes type III we use only if we have to. For example, we create a change type III if the change is of such high priority that we cannot wait for the community to review and accept it – such as a fix for a security vulnerability that we want on our critical systems as soon as possible. We simultaneously submit it as a change type I to the community, and once the change type I is accepted, we can drop the corresponding change type III.
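To make the taxonomy concrete, here is a minimal sketch in Python of how the three change types and the drop rule for changes type III could be modelled. All names here are hypothetical, invented for illustration; they are not part of our actual tooling, which is described later.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    TYPE_I = "contributed upstream"        # most desirable
    TYPE_II = "plug-in, driver or aspect"  # environment-specific
    TYPE_III = "in-house patch"            # least desirable, temporary


@dataclass
class Change:
    tracking_id: str          # shared by a type III patch and its type I submission
    change_type: ChangeType
    accepted_upstream: bool = False


def can_drop(patch: Change, upstream_changes: list[Change]) -> bool:
    """A change type III can be dropped once the corresponding
    change type I has been accepted by the community."""
    assert patch.change_type is ChangeType.TYPE_III
    return any(
        c.tracking_id == patch.tracking_id and c.accepted_upstream
        for c in upstream_changes
    )
```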
What makes things even more complicated is that the Open Source software we use has hundreds of configurable parameters, and not every combination gets tested by the community. We needed to make sure that the latest software works with our configuration.
We also use a custom deployment system. The deployment system is based on a tool called Chef. To use Chef, we wrote so-called “recipes” that tell Chef how to install our software on our infrastructure. We needed to make sure that the latest community software works with our Chef recipes.
Furthermore, changes type III can lead to a number of problems. They are only temporary in nature: once an equivalent change is accepted by the community, we want to drop the patch. But how do we know when we can drop a given change type III?
If the patch that ends up in the community code is identical to our change type III, automatically removing the patch is easy. However, imagine that the community comes back to us and rejects the patch because it only works for our configuration and not for somebody else’s. So we work with the community to improve the patch. By the time the patch gets accepted, the actual source code change is different from our change type III. How can we automatically drop our change type III when the change appears in the community code in a modified form?
An even more difficult scenario may occur. Imagine that we create a change type III and deploy it to production. We submit it upstream and it gets rejected because somebody from a different organisation fixed the same problem in a different way. How do we know that we can drop the change type III then?
Solution Part 1 – Knowing if upstream works for us
The idea is as follows. We run a set of automatic tests on the latest community code overnight (we initially wanted them to run on every change to the community code, but in the end we found that running them overnight is sufficient). The tests check that (a sketch of the pipeline follows the list):
(1) changes type III apply (i.e. no community developer modified the same lines of code as we did)
(2) unit tests pass (unit tests are special tests that developers create that validate individual parts of software)
(3) the code builds to a deployable kit
(4) deployment works, including changes type II and III (this deployment is done to a virtual machine using the same Chef recipes as deployment to production)
(5) functional tests pass (i.e. a test of the entire system)
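Here is a minimal sketch, in Python, of what such a nightly pipeline could look like. The commands, paths and the “last-green” tag are illustrative assumptions rather than our actual tooling; the point is the ordering of the stages and the green/red outcome.

```python
#!/usr/bin/env python3
"""Nightly check of the latest upstream code against our local changes."""
import subprocess
import sys

# One shell command per stage; the order matches the list above.
# All commands are illustrative placeholders.
STAGES = [
    # (1) our type III patches still apply cleanly on top of upstream
    ("apply patches", "git checkout upstream/master && git am patches/*.patch"),
    # (2) the project's own unit tests pass
    ("unit tests", "tox"),
    # (3) the patched source builds into a deployable kit
    ("build kit", "python setup.py sdist"),
    # (4) deployment to a throwaway VM, using the same Chef recipes as production
    ("deploy", "chef-client --local-mode --runlist 'recipe[compute]'"),
    # (5) functional tests against the deployed system
    ("functional tests", "./run_functional_tests.sh"),
]


def run_nightly() -> bool:
    for name, command in STAGES:
        if subprocess.run(command, shell=True).returncode != 0:
            print(f"RED: stage '{name}' failed")
            return False
    # Mark this upstream revision as the newest known-good one.
    subprocess.run("git tag -f last-green", shell=True)
    print("GREEN: upstream works with our changes, configuration and deployment")
    return True


if __name__ == "__main__":
    sys.exit(0 if run_nightly() else 1)
```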
If the tests pass, we mark that version of the source code as green. If any test fails, we mark it as red. In that case, we aim to resolve the failure within a day.
The last “green” code combined with our custom patches is the version of the code that’s most important to us. It appears on top of the default branch in our source code repository. It is used by default by our toolkit when we cut a new release, deploy it to a physical development environment or build a development environment in a virtual machine on a developer’s workstation.
For example, the development environment in a virtual machine on an engineer’s workstation can be configured to update itself a few hours after the upstream tests run. If the test result from the same night was green, the same night’s code will be used. If it was red, the last green one will be used, which will usually be the one from the previous night. That way, when a developer comes to work in the morning, they already have a development environment built from the last community code that works with our configuration, deployment processes and custom changes.
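Assuming the nightly job advances a tag (called “last-green” in the sketch above, a made-up name) only on nights when all five tests pass, the workstation update logic reduces to resolving that tag:

```python
import subprocess


def revision_to_build() -> str:
    """Return the upstream revision to rebuild the development
    environment from. Because the 'last-green' tag only advances on
    green nights, resolving it yields last night's code after a green
    run, and the previous green revision after a red one."""
    return subprocess.check_output(
        ["git", "rev-parse", "last-green"], text=True
    ).strip()
```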
An important feature is that all changes type II and III and all changes to our deployment tools (such as the Chef recipes) are gated using the same set of tests. That means that if any change to the local code fails the tests, we do not let it through. That way we know that if the upstream tests fail, it’s due to changes upstream, not changes to any local code.
Solution Part 2 – Source code management
Harry had impressive experience in source control and build tooling. He worked from Ireland, on the Cloud Services Automation team. We discussed the various methods we could use to store upstream code combined with our custom changes. We considered a number of solutions and set off to prototype and experiment with them. For R&D projects that is a better approach than quickly converging on one solution, because if that solution turns out to be sub-optimal, attempts to improve it often result in a patchy hack that looks like a car with wings. Instead, we were looking for a simple and elegant design. After about 6 months of experiments, we found one.
Our requirements for source control were that:
(1) it should be clearly visible which upstream version we are using and which changes type III we carry on top of it
(2) we should have all the history of the previous versions of the code
(3) we should have the ability to do hot fixes on existing releases without importing upstream if we didn’t want to
(4) we should be able to do code reviews
The source code management software we use is called git. It is very powerful, but it didn’t give us exactly what we wanted out of the box. We created a new operation in git to give us what we needed. We called it “upstream import”.
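The real implementation is the git-upstream tool linked at the end of this article. The sketch below conveys only the gist of the operation, with made-up branch names, and it glosses over history tracking and the dropping of patches that have already merged upstream.

```python
import subprocess


def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)


def upstream_import(upstream_ref: str = "upstream/master") -> None:
    """The gist of an upstream import: start a fresh branch from the
    latest upstream revision and re-apply our local (type III) patches
    on top of it."""
    git("fetch", "upstream")
    # A new branch rooted at the upstream revision we are importing.
    git("checkout", "-b", "import/next", upstream_ref)
    # Re-apply our local patches; the range refs are hypothetical.
    git("cherry-pick", "local/base..local/patches")
    # If the patches apply and the overnight tests pass, this branch
    # becomes the new default branch.
```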
We also use a web-based code review tool called Gerrit. Submitting code to Gerrit allows developers to engage in a discussion about the code before it is accepted to the main branch. Gerrit also initiates our continuous integration tests. Upstream OpenStack also uses Gerrit, so our internal process is similar to the process for contributing code upstream.
We also assign the same change ID to a change type I and its corresponding change type III. By using the same change ID, we can track when the change type I gets accepted upstream. It is then possible to automate the removal of the change type III when it’s no longer needed. This is the same change ID that Gerrit uses to track multiple versions of a patch undergoing review.
If somebody outside our organisation implements a change functionally equivalent to our change type III in a different way than we have done it, then the change type I we submitted to the community will be rejected and marked as a duplicate. Our developer can then change the ID of our change type III to match the change that has actually been accepted. That way it is still possible to automatically drop the change type III when we import the equivalent upstream change.
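Since Gerrit records the change ID as a “Change-Id:” footer in each commit message, detecting which local patches have landed upstream can be as simple as scanning the upstream log. A sketch, with function and variable names of our own invention:

```python
import subprocess


def upstream_change_ids(upstream_range: str) -> set[str]:
    """Collect the Gerrit Change-Id footers of commits that have
    landed upstream within the given revision range."""
    log = subprocess.check_output(
        ["git", "log", "--format=%B", upstream_range], text=True
    )
    return {
        line.split(":", 1)[1].strip()
        for line in log.splitlines()
        if line.startswith("Change-Id:")
    }


def patches_to_drop(local_patches: dict[str, str], upstream_range: str) -> list[str]:
    """Return the local (type III) patches whose change ID now appears
    upstream and which can therefore be dropped on the next import.
    local_patches maps a patch name to its change ID; retargeting a
    patch at a duplicate upstream fix is just editing its change ID."""
    merged = upstream_change_ids(upstream_range)
    return [name for name, cid in local_patches.items() if cid in merged]
```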
Problems resolved
We’ve overcome the problem with the community code being a moving target by running overnight tests that check that the software is compatible with our temporary custom changes, plug-ins, configuration and deployment systems. We can easily cut a release using the last version of the community code that has passed our tests combined with our custom changes. We can also rebuild the development environment on a developer’s workstation using that version of the code.
We extended our source code management software called git with a new operation called upstream import. It allows us to track upstream while maintaining custom patches on top of it. If the latest upstream code doesn’t work for us, we can easily base our changes on top of the last one that does.
We have the ability to carry temporary local patches to the community code. That allows us to quickly apply a hot fix to the production environment without having to wait for the community to accept it. We can automatically drop a local patch when the equivalent change appears upstream. We’ve achieved that by using a change ID, which allows us to associate a given local change with an upstream change, even if they are not identical.
Summary
As a result of the Federative Continuous Integration process and tooling we introduced, we have been able to keep our production environment within a week to a few weeks of the community code. We can cut a release and rebuild a development environment daily, using the last community code that works for us.
Links
The work on the source code management tools created for this project continues at: https://opendev.org/x/git-upstream
The names of the characters and organisations have been changed. The OpenStack™ Word Mark and OpenStack Logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation.