Build Battle Royale

Last week I experienced one of those "perfect storm" technical problems that can ruin any faith one has in project planning. I spent a solid week on something that I thought would take a couple hours...

Background

The setting for this drama is the (notoriously neglected) build and integration process. For those unfamiliar with this concept, it's an automated process that takes programming source code and turns it into an all-in-one file that can be installed on other computers.

The build is commonly a convoluted series of scripts that are not well documented. Institutional knowledge about the build tends to dissipate over time as developers focus mostly on other activities.

Phases

Initial Change

I had intended to convert some JSON files that describe steps in a workflow to their equivalent in C#. This was supposed to improve maintainability prior to my adding more workflows and was quite a trivial thing to do, in theory.

Unfortunately the exercise took longer than expected. E.g. 5 hrs instead of the expected 2. I did manage to have a bit of fun discovering a way to avoid string literals and thus improve maintainability that much more. (See DispatchProxy)

Test Errors

Build Failure

 The true nightmare began when I tried to run the build on the changes I had made. For some mysterious reason 2 tests were failing.

MongoDB.Driver.MongoCommandException : Command insert failed: WriteConflict.

Occasionally our build fails for no good reason (like many non-trivial build processes) so I tried running it again but still there were failing tests. After some more thought, I still could not think why JSON parsing would affect database access.

My immediate response was to reproduce the test errors on my workstation which didn't really work. Sadly, some of our tests don't work on a workstation and I was getting some errors that looked similar but not exactly the same as those on the build machine. This means fixing them may not improve the situation on the server.

I fixed them anyway, just in case, and naturally it did not help. Oh well, at least the tests work on a workstation now and that gives me some clues and more familiarity, which is often useful.

Build Server Maintenance

Being confident my changes were not the problem I assumed the build server needed to be updated because it had finally gotten too old. This of course is a dangerous exercise as build servers almost always contain 'ticking time bombs' in the form of outdated, undocumented dependencies that rear their ugly head when doing maintenance.

Bravely, I waded into the build setup process clutching the scant setup guide I had honed the last time I was in there. I reset a few things and naturally, there was an error I could not explain that occurred before the build could even get under way. Access to the source code was blocked by security.

[2023-01-21 06:19:17Z ERR  VisualStudioServices] GET request to https://devops.server.net/tfs/_apis/connectionData?connectOptions=1&lastChangeId=-1&lastChangeId64=-1 is not authorized. Details: VS30063: You are not authorized to access https://devops.server.net.

This of course was solved by a seemingly unrelated, minuscule post containing a suggestion that I overlooked a day earlier. But not before a number of troubleshooting attempts had failed.

The problem is likely a bug in the build software we use and I made sure to add the workaround to the setup guide.

Build Server Broken

Having made some minor updates, I found the build was experiencing even more problems; it was no longer able to successfully build a previously-successful revision of the code base. It seems I had inadvertently updated a dependency that was not backward compatible. These are annoying problems as they are unexpected and can take a long time to solve.

2023-01-24T19:15:03.9463495Z fatal: [localhost]: FAILED! => {"msg": "template error while templating string: No filter named 'ternary'.. String: {{ include_wowza | ternary(wowza_license_key, '') }}"}

The obscure error message had no directly-related search results. Fortunately I had experience in this part of the build and the errors that others had reported reminded me that another utility was likely out of date.

These are scary problems because solving them depends on happenstance; it was my personal experience at work, and I just happened to remember something obscure that suggested a particular 'stab in the dark' that worked.

So now I had completely rebuilt the build server and no, that did not fix the build errors that resulted from my original changes. And so the next phase began.

Runaway Builds

The particularly, unhelpful nature of our build is a run can start failing and then take an especially long time to complete. The monitoring routine gives up after an hour and, ideally, would terminate the build.

##[error]The job running on agent abc-pipeline-agent-7 ran longer than the maximum time of 60 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134
##[warning]Agent abc-pipeline-agent-7 did not respond to a cancelation request with 00:01:00.

Sadly, the way we had setup the build meant that build run continues in the background and interferes with subsequent build runs that occur before the first one completes.

This of course is not documented anywhere and I learned everything the hard way. I tried to add a feature to prevent runaway build interference but settled for learning how to control the build and documented accordingly.

The Real Problem

So here I am back to the original errors but possessing a deeper understanding of the build and tests.

Common.Infrastructure.EntitySetException : Failed to save entities of type ResourceManagement.API.Model.ResourceInstance to the underlying storage Common.Infrastructure.MongoDB.MongoWriteableCollection`1[[ResourceManagement.API.Model.ResourceInstance, ResourceManagement.API, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]]. The storage returned: Command insert failed: WriteConflict.
---- MongoDB.Driver.MongoCommandException : Command insert failed: WriteConflict.

at Common.Infrastructure.EntitySet`1.SaveAsync(CancellationToken cancellationToken) in /src/src/Common.Infrastructure/Common.Infrastructure/EntitySet.cs:line 153
at Common.Infrastructure.Context.SaveChangesAsync(CancellationToken cancellationToken, IEntitySet[] collections) in /src/src/Common.Infrastructure/Common.Infrastructure/Context.cs:line 63
at Common.Infrastructure.Context.SaveChangesAsync(CancellationToken cancellationToken, IEntitySet[] collections) in /src/src/Common.Infrastructure/Common.Infrastructure/Context.cs:line 72
at ResourceManagement.Infrastructure.ResourceManagementContext.SaveEntitiesAsync(Boolean dispatchEvents, CancellationToken cancellationToken) in /src/src/ResourceManagement/ResourceManagement.API/Infrastructure/ResourceManagementContext.cs:line 36
at Common.Infrastructure.Extensions.IContextExtensions.EnsureDataExistsAsync[TContext,TEntity](TContext context, Func`2 entitySetSelector, ICollection`1 entities, CancellationToken cancellationToken) in /src/src/Common.Infrastructure/Common.Infrastructure/Extensions/IContextExtensions.cs:line 32
at FunctionalTests.Services.ResourceManagementService.ResourceManagementTestData.InitializeDbForTests(IResourceManagementContext context) in /src/test/ServicesTests/Application.FunctionalTests/Services/ResourceManagement/ResourceManagementTestData.cs:line 24
at FunctionalTests.Services.ResourceManagementService.ResourceManagementWebApplicationFactory`2.ConfigureTestServices(IServiceCollection services) in /src/test/ServicesTests/Application.FunctionalTests/Services/ResourceManagement/ResourceManagementWebApplicationFactory.cs:line 21


The errors I saw suggested a concurrency problem with accessing the database, which seemed unrelated to loading a JSON file. After tracing the errors to the initialization process for some tests, it finally occurred to me that formerly loading the JSON files caused just enough delay to hide a race condition error.

These are nasty because different parts of the code are competing for a resource and it's not obvious what is happening. Again, these errors commonly require an educated guess at a solution however there was a twist. There was a second race condition error that looked almost identical to the first. Of course, it took me nearly a day to recognize the slight difference in the error message.

Conclusion

So that was my week. Outdated build setup instructions, buggy build server software, outdated & undocumented dependencies, runaway builds and dual race condition errors. And all over a small change that was arguably unnecessary.

In many other environments, a developer would have backed out and lived with the overheads of inferior workflow definitions and a mysterious, flaky build process. Instead, I now have a better understanding of the build, updated setup instructions, a more reliable build, and workflow definitions that are more maintainable and ready for the next wave of changes.

I think it was a net gain.

Comments

Popular Posts