Alternative Test Data Strategies for Automation
Bringing automation to a legacy system generally involves many challenges, particularly if your technology stack is diverse. A major sore point that is frequently overlooked, however, is the challenge of having solid test data to support your tests. The availability of that data, or the lack thereof, becomes a driving design constraint for our testing activities, since we end up with data variations and cases that we simply cannot reliably test with automation. So why do we often face this test data challenge? It could be because:
1. We don't have good documentation on what exactly is in the database, what is different from production, or what is missing from our stage environment.
2. We are not alone in our stage sandbox, so we cannot control the contents or refresh activities in that environment.
3. It seems impossible to create and maintain a realistic, predictable, comprehensive data set.
Those are all quite reasonable concerns, but we have to understand the cost of accepting them. Anything that we do not test because we don't have usable data is something that is skipped in our test coverage! If we don't see a reasonable way to resolve the blocking issue, we tend to minimize that risk because we really don't want to acknowledge the potential cost. However, this doesn't actually change the existing degree of risk, and sooner or later we are just as likely to pay the price. Automated tests in particular are constrained by our ability to have predictable data verification strategies (see Kaner's Black Box Software Testing slides on Oracles for examples and further discussion).
An ideal test data environment will provide test data that has several key characteristics: External Validity, Predictability (including Availability), and Seedability (to force boundary and special cases, including negative tests). When we perform many types of tests, particularly automated tests, we generally want to have a few basic things:
• we need predictable, available data in a known state going into the test
• we need to know what we expect to see once the test has been performed, or be able to retrieve the expected values from a valid alternate source
The latter item is our test oracle, which informs our evaluation of the test results as either expected or a potential problem. When we have a system with constantly changing input data (think traffic, inventory levels, auction bids, etc.), that definitive expected result can be hard to determine predictably, so we may consider an alternate version of the code (e.g. Production) that points to the same data as a valid reference point. Conversely, we may have fairly stable data, such as the ICD (disease) codes in health care, where we can use a static set of reference values; when there is a major revision to that reference data, our strategy will need to account for the update. Often, your environment will be some blend of the two extremes, making different strategies for the stable and dynamic data sets worth considering.
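For the stable end of that spectrum, a minimal sketch might keep the expected values in a version-controlled snapshot that the tests load as their oracle. Everything in the example below is invented for illustration: the ICD-style codes, the embedded snapshot, and the lookup_diagnosis stand-in for whatever your system under test actually exposes.

```python
import csv
import io

# Hypothetical snapshot of stable reference data; in practice this would be a
# version-controlled CSV of ICD codes checked in alongside the tests.
REFERENCE_SNAPSHOT = "code,description\nA00,Cholera\nB01,Varicella\n"

def load_reference_codes():
    return {row["code"]: row["description"]
            for row in csv.DictReader(io.StringIO(REFERENCE_SNAPSHOT))}

def lookup_diagnosis(code):
    # Stand-in for the system under test; replace with a real lookup call.
    return {"A00": "Cholera", "B01": "Varicella"}.get(code)

def test_lookup_matches_stable_reference():
    for code, description in load_reference_codes().items():
        assert lookup_diagnosis(code) == description
```

When the reference data does go through a major revision, the snapshot gets updated deliberately, alongside the tests, rather than drifting silently underneath them.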
If your organization simply doesn’t have the resources to make any changes to the existing test environment or allow you to do so, you may need to get creative in figuring out what you can control, such as creating an in-memory database or local development data source for testing purposes. Your overall test strategy would then account for this limitation in your risks, particularly if this means stubbing out or not testing integration across systems that are based on certain data values.
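One minimal sketch of that kind of creativity, assuming the application resolves its database location from configuration, is to let the test run point it at a data source you fully control. The environment variable name and default path below are assumptions, not a prescription for your system.

```python
import os
import sqlite3

# Assumed pattern: the code under test resolves its data source from
# configuration, so a test run can substitute something it fully controls.
def get_connection():
    db_path = os.environ.get("APP_DB_PATH", "/var/data/app.db")  # hypothetical default
    return sqlite3.connect(db_path)

# In the automation run, set APP_DB_PATH to ":memory:" (or a local file your
# harness seeds) so nothing the tests read or write touches the shared stage data.
if __name__ == "__main__":
    os.environ["APP_DB_PATH"] = ":memory:"
    conn = get_connection()
    conn.execute("CREATE TABLE smoke_check (id INTEGER)")
    print("local test database ready:", conn)
```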
If your challenge is that there isn't good documentation or a way to find out what the data structures are, your first step is to start researching your existing system. Even if you cannot get access to the database team or the database directly, you can map out a high-level conceptual diagram of the different data objects and their relationships. Once you have the diagram, you can start showing it around and asking people where it should be changed. I have had far more success asking a person for feedback than trying to get them to squeeze in time to document something that has long been neglected. They can make suggestions and corrections without feeling the burden of possible "blame" if they don't perfectly understand every part, and their input gets you closer to a valid representation of the data. I have also found that developers who have worked on tricky areas are great sources of information about the rare exceptions and gotchas that may be embedded in stored procedures or legacy data tables.
If your challenge is simply getting the data you need for your tests, there are two strategies that I have had success with:
• Creating your own controlled test data as part of test setup and cleaning it up afterward.
• Sampling incoming "real" data for your stage environment and then comparing your test results to the values in the source system.
The first of these can work well for unit and simple integration tests, particularly below the UI of the application. It has the benefit of being highly controlled and predictable. However, making it consistently realistic can be challenging, resulting in false positives in your tests, and it is limited to what you think of for inputs and have time to create and clean up. In addition, many automation tools are not really designed to make it easy (or even possible) to know how much you need to clean up if the test fails partway through its execution. What tends to happen is that tests which are overly painful to maintain or clean up after fall out of favor over time, withering in the dead test cemetery. One way around this challenge is to use a subset of your real data to populate an in-memory local database that is refreshed each time you do your test run, which can help if the smaller data set is workable for your application under test.
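A minimal sketch of that refresh-per-run idea, assuming pytest and a SQLite in-memory database, might look like the following; the schema and seed rows are placeholders for whatever subset of real data is workable for your system.

```python
import sqlite3
import pytest

# Seed rows are re-created for every test, so a failure partway through a run
# never leaves stale data behind. The zero-stock row is a deliberate edge case.
SEED_ROWS = [(1, "widget", 3), (2, "gadget", 0)]

@pytest.fixture
def inventory_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (id INTEGER, name TEXT, on_hand INTEGER)")
    conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)", SEED_ROWS)
    yield conn            # each test gets a freshly seeded copy
    conn.close()          # nothing in the shared environment to clean up afterward

def test_out_of_stock_items_are_flagged(inventory_db):
    rows = inventory_db.execute(
        "SELECT name FROM inventory WHERE on_hand = 0").fetchall()
    assert rows == [("gadget",)]
```

Because the data lives only for the length of the run, a test that dies halfway through costs you nothing in cleanup, which keeps the maintenance burden from quietly killing the suite.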
Even if the tests run beautifully and all "pass", each test takes time to run and needs to be evaluated for its contribution to your coverage. You need to decide whether to have lots of data variations to cover the breadth of your possible data sets, which can take a lot of execution time, or limit yourself to a few key variations, accepting that you are running only a very narrow path through the code. You can get more sophisticated to increase your coverage by defining a large data set that you select from, randomly or with specific probabilities, so that over time you test many options, but this can be expensive both to set up and to maintain. (There are lots of great references for defining "all pairs" or other combinations of data values that provide reasonable coverage and help with this. They aren't bad to grasp once you see the tools in action, even if you always hated statistics class.)
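If you do go the sampled-variations route, one small sketch, with invented account types and weights, is to draw each run's inputs from the larger pool using a logged seed so that any failing combination can be replayed.

```python
import random

ACCOUNT_TYPES = ["standard", "premium", "trial", "suspended"]
WEIGHTS = [0.6, 0.2, 0.1, 0.1]   # bias toward the common case, still hit the rare ones

def pick_variations(run_seed, count=5):
    # Record run_seed with the test results so a failing draw is reproducible.
    rng = random.Random(run_seed)
    return rng.choices(ACCOUNT_TYPES, weights=WEIGHTS, k=count)

print(pick_variations(run_seed=20240101))
```

Pairwise tools generate the combinations for you; even this hand-rolled weighting at least makes the coverage trade-off explicit and visible in the test report.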
The second strategy requires a separate environment that has the same data so that you can compare the two. Some organizations do this in their manual testing by comparing production values to masked or scrubbed data in their test or staging environment. If you can be completely confident that your sampling in production won’t alter the data or the environment, this can work for automation as well, but you may have hidden issues around performance, sizes of log tables, or mixing test activity with real production activity in security / audit logs that make this approach much less feasible.
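Where that comparison is feasible, the check itself is often structural rather than value-for-value, since masking will have changed the raw values. The sketch below uses two throwaway SQLite connections to stand in for a read-only production sample and the masked stage copy; the table, rows, and aggregate are all invented.

```python
import sqlite3

def row_count(conn, table):
    # Table name is a trusted constant here, not user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Placeholders: in practice these would be read-only connections to a
# production replica and to the masked stage database.
production = sqlite3.connect(":memory:")
stage = sqlite3.connect(":memory:")
for db, names in ((production, ["Ann", "Bob", "Cho"]), (stage, ["xxx", "yyy", "zzz"])):
    db.execute("CREATE TABLE patients (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO patients VALUES (?, ?)", list(enumerate(names)))

# Masking changes values but should preserve structure, so counts and other
# aggregates remain comparable even though the names no longer match.
assert row_count(production, "patients") == row_count(stage, "patients")
```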
Another approach is to go straight to the database, and this may have value in specific cases, but it carries additional risks. If your database is large, overly normalized, or otherwise difficult to traverse, it may be far harder to know whether your oracle is indeed valid or whether it relies on exactly the same stored procedures that you are trying to test. This approach may still be good enough for your purposes (e.g. unit tests or simple integration tests), but it reduces the value of the test and limits the actual scope that your tests cover by relying on assumptions that the stored procedures work the way you think they do across all cases.
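A minimal sketch of the database-as-oracle approach, with an invented schema, shows both the convenience and the caveat: if the oracle query and the code under test ultimately share the same logic or stored procedure, the passing assertion proves less than it appears to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "OPEN"), (2, "CLOSED"), (3, "OPEN")])

def open_orders_from_oracle_query():
    # The oracle: ask the database directly for the expected value.
    return conn.execute(
        "SELECT COUNT(*) FROM orders WHERE status = 'OPEN'").fetchone()[0]

def open_orders_from_app():
    # Stand-in for the application logic (or API) actually being tested.
    return sum(1 for (status,) in conn.execute("SELECT status FROM orders")
               if status == "OPEN")

# If both sides call the same stored procedure under the covers, this passes
# without exercising anything independent of the code under test.
assert open_orders_from_app() == open_orders_from_oracle_query()
```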
If you have access to a pre-production or release staging environment, this may be a reasonable compromise source for your oracle, particularly for larger runs of test suites that you only want to do on full pre-production candidate builds. You still need to spend time identifying your test data, possibly cleaning it up or refreshing it from time to time, and you may have to coordinate with your build / release team for when the tests are run. If this seems like a big request to make, you may be able to gain additional support by pointing out the usefulness for testing the release process, emergency patches, load and performance tests, or for disaster recovery.
Few organizations feel they have the resources to provide a full enough test environment to completely mirror their production environment. Legacy applications that have grown and evolved over the years are even more likely to have limited test environments. These limitations come about for many reasons, but the constraint this puts on your test coverage and automation can be enough to kill your automation completely. The more you limit your automated tests to just what you have easy access to data for, the less value you get from them and the greater your risk of overconfidence in the system. Any strategy that supports making accurate, representative, predictable test data available will improve your test coverage, validity, and speed on every test cycle. Those savings might appear unbalanced against the pain of creating a solid test data environment, but that investment only gets more valuable with each release.