one of my main responsibilities at work is source code control and integration. This could be considered a somewhat trivial work, but these horror stories keep me motivated.
moblog: The Windows Shutdown crapfest In small programming projects, there's a central repository of code. Builds are produced, generally daily, from this central repository. Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.
In Windows, this model breaks down simply because there are far too many developers to access one central repository -- among other problems, the infrastructure just won't support it. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.
Joel on SoftwareA young Windows engineer writes:
prior to the restart effort of Longhorn, there were about seven [branches], reverse-integrating into one main branch every two or three weeks perhaps. Now, imagine several thousand developers checking in directly into seven branches. This will lead to two things:
you check in frequently, and there's a very high chance of either breaking the build, or breaking functionality in the OS, or 2., as a counter-reaction, you don't check in very often, which clearly is bad, since now you don't have a good delta history of what you did.
So this clearly didn't scale. As part of the restart effort, we decided that each team would get its own feature branch, each feature area (multiple teams) would go up to an aggregation branch, and those would lead up to the final main branch. (As such there's now north of a hundred branches in tiers, leading up to about six aggregation branches.) Teams were free to choose how many sub-feature branches they wanted, if any, and they were free to choose how often they wanted to push up their changes to the aggregation branch. As part of the reverse-integration (RI, i.e. pushing up) process, various quality gates had to pass, including performance tests. Due to how comprehensive those gates ended up being, this would take at least a day to run, plus perhaps a day or two to triage issues if any cropped up; so there was a possibly considerable cost to doing an RI in the first place. However, these gates were essential in upholding the quality of the main branch, and had they not existed, the OS would have never shipped. I suppose it's one of those 'what doesn't kill you...' type deals.