Recently, I interviewed a team lead, working in a large financial services company, about ensuring a high level of quality for a large ETL migration project that his team had delivered.
Project background and challenges
The team had a difficult job on their hands. They were replacing a gnarly and troublesome legacy system, which had evolved many (undocumented) data flows over time. The new system would be mission-critical, delivering data feeds to multiple departments within their company. Operating in a regulated field, the team had to deal with all of the same constraints that Martin Thompson’s team were facing – a need for traceability, completeness and the ability to “explain your working”. On top of that, the team was distributed, split between the UK and Ukraine.
The application was to be hosted on Windows and would use Microsoft SQL Server as the data store. At the start of the project, the team evaluated various technologies for their development stack, and settled on writing the ETLs and feeds in C# with Database Projects. This stack proved a good fit for the team’s mix of OOP and SQL development skills.
A build pipeline baked in
The team set up a build pipeline from the very start. Time was invested upfront into figuring out how to write low-level tests and minimise testing dependencies between the schema, the feeds and the ETLs.
The schema objects, the ETLs and feeds each had their own independent build and test stages. Combinations of the three (that successfully passed the individual tests) were then bundled together and put through a separate integration and performance test pipeline. The entire pipeline was fully automated end-to-end: a change in version control to any one of the 3 components would first be exercised at unit level and upon success, a new deployment artefact would be bundled and automatically promoted to the next build stage. The pipeline was also portable: any developer was able to check out the code and run the whole pipeline on their local development machine. A successful run, from code change to production readiness took around 2 hours to complete.
For breaking changes, version control branching was used. With TeamCity, it was easy to clone the entire pipeline and configure the clone to execute off a branch. This meant that the branch code was already thoroughly tested prior to being merged into the trunk, leading to fewer integration breakages.
The team followed the model of the test pyramid, with a large amount of tests written at the unit level. Because the feeds and ETL code were written in pure C#, it was possible to use C# tools like MSTest to run the unit tests, while mocking out the schema. This mocking approach allowed making schema changes without breaking the feed and ETL unit test phases, and for developers to work on those components independently. Test-driven development was used to get fast feedback on the code being developed.
The integration tests were split into two stages, one for the database and one for the feeds. As well as verifying the component integration, they played the role of acceptance tests for the business requirements. There were few, relatively large integration tests, each covering a complete system requirement for a single consuming downstream system.
The team followed acceptance-test-driven development. Before a new feature was implemented, a conversation would take place between the developers, BAs and stakeholders. Afterwards, the developers produced new input data and the expected output data. As both the input and output was in the form of human-readable files, the BAs were in a position to check and sign-off the test data changes before implementation started.
At the start of the project, the team explored the possibility of abstracting the database in the integration tests, but decided not to pursue this strategy because the technical costs outweighed the yielded returns. Input data was mostly hand-crafted, and then augmented by special cases extracted from production data. The team created custom libraries for comparing expected and actual output data, which were able to ignore differences in id and timestamp fields, and allowed tolerances when comparing numeric fields. The integration test suite also included performance tests, run with data volumes of different t-shirt sizes (small/medium/large).
The path to live and releasing
By the time a change had successfully travelled through the pipeline, the team had a high level of confidence about its correctness, and would ideally liked to deliver continuously into production. The team’s tricky change management constraints meant that this was not possible; nonetheless, the pipeline made sure that there were virtually no surprises when the time came to release. Because each release contained numerous changes, a short exploratory testing phase on a production-like environment was necessary prior to each release.
The end result
The team was able to successfully replace the legacy system and the new system has been running without problems in production for more than a year now. Along the way, the team experienced no failed releases. Only a single urgent production bugfix – to fix a performance problem – was ever necessary. The problem was not uncovered during testing because of differing configurations between the test and production environments. Because the pipeline was fast enough, the team were able to put the bugfix through it just like any other change, and didn’t have to expedite it virtually untested, as often happens in such situations.