Introduction
Setting up a great test data management strategy is a crucial step for taking your test automation process to its fullest potential. However, many software professionals are still not familiar with the concept of test data management (TDM). Even those that are familiar with TDM might have a hard time putting it in practice. Why is that?
When it comes to test data management, the “what” is relatively straightforward, but we can’t say the same about the “how.” As it turns out, there are several competing methods of managing test data. Which one should you choose? As you’ll see in this post, this isn’t a one-approach-fits-all kind of situation. Each method has its unique strengths and weaknesses and might be more or less appropriate for your use case.
Today’s post will cover some of the existing test data management approaches, listing the advantages and disadvantages of each one. Let’s get started.
Replicating Data From Production
The first approach we’re going to cover in this post is perhaps the most popular one, at least for beginners. And that makes perfect sense if you think about it. When you first encounter the challenge of coming up with data to feed your testing processes, it isn’t too far-fetched to think you should just copy data from production and be done with it. It’s the easiest way to obtain data that is as realistic as possible. You just can’t get more real than production.
Not everything is a bed of roses when it comes to production data replication. Quite the opposite, actually. The easy access to data is pretty much the only advantage this method has. And what about the disadvantages? These, sadly, abound.
Here Be Dragons: Some Downsides of the Approach
Here’s the first problem: replicating data from production continues to be mostly a manual process. Sure, you can come up with scripts and automated jobs to do most of the heavy lifting for you. But keep in mind that generating the data isn’t the whole job of a TDM management solution. “Availability” is an integral part of the package. That means that the TDM tool is responsible for making sure the data is available where it’s needed, at the right time. A naive approach based on scripts might not be sufficient to manage the demands of a complex testing process, forcing you to rely on a manual process to do so.
Secondly, production replication doesn’t lend itself well to negative test cases. It’d be out of the scope of this post to give a lengthy explanation of negative testing. In a nutshell, negative test cases are tests that validate the system against invalid data. Basically, you throw faulty data at your application to check how well it can handle it. Since production data would (hopefully) be in good shape, this approach isn’t well suited to this type of testing.
Production data replication also doesn’t work…if there is not data replication for you to replicate in the first place! What should you do when you need to test an application that is still in the alpha stage of development or even a prototype? Since no one is actually using the application, there would be no production data for you to copy. That’s a severe downside of this approach since every new application will face this problem.
Here Be Dragons (For Real): Legal Implications
Finally, we have the most serious downside of this approach—data sensitivity. Data compliance is a crucial part of the modern IT landscape since companies are responsible for the data they store and manipulate. It’s up to them to protect their client’s data, ensuring it’s not abused. When replicating data from production, software organizations run the risk of failing to comply with privacy acts, such as GDPR. And that can bring catastrophic consequences, legal, financial, and reputation-wise.
Data Masking
In order to solve the downsides of production data replication (a.k.a the naive approach), test data management tools have come up with more sophisticated methods. One of the
most popular of these approaches is test data masking. As its name implies, tools that adopt this approach enable its users to apply masks to production data. Such masks will remove personally identifiable information (PII) from the data.
Data masking is an improvement over naive production data replication, for sure. But the approach is not without its downsides.
First, consider the “time” variable. Data masking doesn’t reduce the time spent generating (or rather, copying) the data for testing. On the contrary, it increases it because now you have a new added in the process. You could argue—and I’d gladly agree—that it’s time well spent, but it’s more time nonetheless.
Then, you also have to keep in mind that data masking isn’t a standalone approach on its own. Instead, it complements the previous approach by solving one of its more serious issues. The problem is data masking can’t fix every problem that the production replication approach has. For instance, if you intend to test an application still in development, for which there is no production data at all, data masking is powerless to help you.
Synthetic Data Generation
Synthetic data generation is yet another method of test data management. As its name suggests, this approach consists of generating “fake”—or synthetic—data from a data model. Tools that implement this approach are able to preserve the format of the data. The values themselves, though, are completely disconnected from any original data. What does that imply?
The implication of this is that synthetic data generation’s greatest asset is simultaneously its most significant downside. By populating the database with entirely “made-up” values, the approach dramatically reduces (virtually eliminates) the risk of exposing sensitive data. On the other hand, depending on the tool’s sophistication—or lack of—you might end up with data that feels “fake-y.” One of the goals of an excellent TDM strategy is to provide data that is as production-like as possible.
To wrap-up, let’s talk about the biggest advantage of synthetic data generation, namely: speed. Once you have a model in place, you can quickly generate data from it, effectively eliminating the time delays that plague other approaches.
Test Data Management Is More Than Test Data Generation
In this post, we’ve covered some of the most used approaches to generate test data. The list is definitely not exhaustive; there are many more methods that we didn’t cover. However, many of them are variations or combinations of the approaches we did talk about.
Another thing to keep in mind is that test data management is much more than just generating test data. TDM is responsible for ensuring the quality of the test data, its availability, and also its security. In other words: the data must be good, and it must be available at the right place, at the right time. And bad actors shouldn’t be allowed to expose it or misuse it in any way. That’s why, depending on the needs of your organization, you should consider adopting a full-fledged data compliance solution, which can not only supply your data generation needs but also make sure your data adhere to the compliance requirements you must follow.
Author Carlos Schults
This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.