5 Steps to Better Test Data Management

I always say that it's important to test in production because nothing compares to a production environment. But it wouldn't be very professional of you to test only, and directly, in production. Testing only in production usually gives the impression that you didn't care enough to test before you reached the production stage.

But I'd say that in order for you to even dare to test something in production, you need to have run a set of tests previously in a similar environment—including all the data you need for testing.

That's where test data management (TDM) comes in.

#1 Only Use the Data You Really Need

If you don't know where to go for your next vacation, just book the next flight and start packing. You might have the best experience of your life...or the worst. Likewise, if you don't know what data you really need for testing, you might just use it all. That approach has pros and cons. When you test software without knowing which scenarios you need to cover, an exact copy of the production database looks like the easiest way to start testing with real data. But you'll end up spending too much time and money waiting to get your copy of the data for testing.

When you start creating your testing process by building the list of test cases you'll need, it becomes pretty obvious how much and what type of data you're going to need. More importantly, think about your testing process as an iterative one. If you start testing the login page, you don't need to have all the information from the user for that test case, such as their birthdate or home address.

As you keep iterating, you're going to need more testing data. And as you find more bugs, you're going to need more real data. Unless you need to run stress tests, subsetting data is going to be enough for the majority of the test cases. And even if you still need to validate that the system can handle high waves of traffic, you can also generate varied static data for that purpose. More on this later.

Taking small sets of your production database should be enough for most of the tests you'll run to validate the software. You'll also reduce costs and complexity when building only the test data you really need at the moment.
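To make the subsetting idea concrete, here's a minimal sketch in Python using the standard sqlite3 module. The users and orders tables, the copy_subset function, and the in-memory "production" stand-in are all assumptions for illustration, not a real schema or tool. The point is that a subset should stay consistent: if you copy two users, you copy only their orders.

```python
import sqlite3


def build_demo_source():
    """Stand-in for production: 5 users with 2 orders each (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
    )
    for user_id in range(1, 6):
        conn.execute(
            "INSERT INTO users VALUES (?, ?)",
            (user_id, f"user{user_id}@example.test"),
        )
        for n in range(2):
            conn.execute(
                "INSERT INTO orders (user_id, total) VALUES (?, ?)",
                (user_id, 10.0 * (n + 1)),
            )
    conn.commit()
    return conn


def copy_subset(source, target, user_limit):
    """Copy the first `user_limit` users and only the orders that belong to them."""
    target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    target.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
    )
    users = source.execute(
        "SELECT id, email FROM users ORDER BY id LIMIT ?", (user_limit,)
    ).fetchall()
    target.executemany("INSERT INTO users VALUES (?, ?)", users)

    user_ids = [row[0] for row in users]
    if user_ids:
        # One "?" placeholder per selected user id keeps the query parameterized.
        placeholders = ",".join("?" * len(user_ids))
        orders = source.execute(
            f"SELECT id, user_id, total FROM orders WHERE user_id IN ({placeholders})",
            user_ids,
        ).fetchall()
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()


source = build_demo_source()
subset = sqlite3.connect(":memory:")
copy_subset(source, subset, user_limit=2)
print(subset.execute("SELECT COUNT(*) FROM users").fetchone()[0])   # 2
print(subset.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 4
```

For a login test you could go further and copy only the columns the test touches, leaving out fields such as birthdate or home address.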

#2 Avoid Having Sensitive Data for Testing

We've seen a lot of recent GDPR-related lawsuits involving big companies in Europe. Europe is taking data protection more seriously than other regions. Pretty soon, regulations like GDPR will be implemented on other continents too. If GDPR is already affecting other companies, we'd better avoid having unprotected sensitive data in our testing environments.

SOX compliance regulation fosters a separation of duties within an organization. I've worked with these types of regulations. In my experience, auditors mainly want to see that only certain people have access to the production environments. These people with privileged permissions are legally responsible for what happens with customer data.

Even with regulations in place, data is still leaked. We have to be prepared for that, so you should operate as if you expect the information you're storing will be stolen someday. Mask any data that could identify a person, or what's also called personally identifiable information (PII).

Use irreversible methods to mask data so that it can't simply be unmasked later. And make sure you're constantly checking that PII is protected. Managing test data will be simpler and easier if you create subsets of the data to fulfill your different testing needs. And you won't have to worry about giving sensitive data to developers or whoever needs production data for testing.

Ideally, try to avoid having sensitive data. But since sometimes you can't avoid it, try to keep PII data at a minimum, and securely mask the data you need to have.
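As one possible way to mask PII irreversibly, here's a minimal sketch that uses a keyed hash (HMAC-SHA256) from Python's standard library. The mask_value and mask_record functions, the field names, and the secret are assumptions for illustration. A keyed hash is one-way, but short or predictable values can still be guessed if the key leaks, so the key should never live in the test environment.

```python
import hashlib
import hmac


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a truncated keyed hash. The same input always maps
    to the same output, so joins across masked tables keep working."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict, pii_fields: set, secret: bytes) -> dict:
    """Return a copy of the record with only the PII fields masked."""
    return {
        key: mask_value(str(value), secret) if key in pii_fields else value
        for key, value in record.items()
    }


# Hypothetical key and record; in practice the key comes from a secrets store.
secret = b"demo-only-key"
customer = {"id": 42, "email": "joe@example.test", "plan": "premium"}
masked = mask_record(customer, pii_fields={"email"}, secret=secret)
print(masked["id"], masked["plan"])  # 42 premium
print(masked["email"])               # 16 hex characters instead of the email
```

Because the masking is deterministic, you can also rerun it and compare results, which helps with the "constantly checking" part.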

#3 Build Synthetic Data for Better Efficiency

Even if you've decided to mask sensitive data, you want to make the security gap as small as possible by not including sensitive data in your tests at all, even in masked form. One way to improve security is to replace real data (like credit card numbers) with autogenerated dummy data. That's synthetic data, and it will help you get more efficient results in testing.

You can take advantage of the synthetic data approach by using more realistic data than just dummy data. For example, you might have a user called Joe in your records, but for testing purposes, you decided Joe will be called Jeremy. This gives you a chance to run machine learning experiments where you can learn more about "Jeremy's" preferences without knowing that Jeremy is actually Joe. You're protecting Joe, even if the data is leaked or misused.

Synthetic data makes real data more shareable because you only have the data you need. Why would you need to know a person's name if you're just trying to replicate a bug in production? You're only interested in knowing which paths through the system's workflow the user took. What matters is why the data ended up in a certain state that caused the software to break. You can then decide either to ignore the person's name or replace it with other "real" data.

If you need to have large amounts of data for performance testing, you can use synthetic data to double the size of the database. Along with the previous benefits we discussed, synthetic data makes your tests more efficient by only using the data you need to cover specific test scenarios.
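Here's a minimal sketch of generating synthetic records with Python's standard random module. The field names, the FIRST_NAMES list, and the example.test email domain are assumptions for illustration. The card numbers are just random digits (they aren't checked against the Luhn algorithm), so they stand in for the shape of the data, not for real cards.

```python
import random
import string

# Hypothetical pool of replacement names.
FIRST_NAMES = ["Jeremy", "Ana", "Li", "Omar", "Sofia"]


def synthetic_user(rng, user_id):
    """Build one fake user whose fields look plausible but map to no real person."""
    name = rng.choice(FIRST_NAMES)
    suffix = "".join(rng.choices(string.ascii_lowercase, k=6))
    return {
        "id": user_id,
        "name": name,
        "email": f"{name.lower()}.{suffix}@example.test",
        # 16 random digits starting with 4; dummy data only.
        "card_number": "4" + "".join(rng.choices(string.digits, k=15)),
    }


def synthetic_users(count, seed=0):
    """Generate `count` users. A fixed seed makes the data set reproducible."""
    rng = random.Random(seed)
    return [synthetic_user(rng, user_id) for user_id in range(1, count + 1)]


users = synthetic_users(1000, seed=7)
print(len(users))  # 1000
```

Seeding the generator means two people can rebuild the same data set from the same number, and raising count is all it takes to grow the volume for a performance test.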

#4 Create Test Data As a Self-Service Model

DBAs are often in charge of generating testing data. They know the best ways to do it and what data is shareable among teams (as I explained in the previous section), and sometimes they're the only ones who have access to production databases. When this happens, the DBAs become a bottleneck, and the time spent in the testing stage increases.

That's why you should create test data as a self-service model. It's not just so you don't constantly interrupt DBAs when a developer or tester needs data. The ability to automatically have testing data will let you parallelize the boring task of manually generating data for testing. Do you need to reduce the testing time? Fine. Create more subsets of testing data in parallel and distribute the test cases.

Another benefit of having a self-service model is that you can easily drop and re-create environments on demand. By doing this, you get repeatable, predictable results when preparing testing data. It's also easier to include TDM in your CI/CD pipeline, which brings you closer each day to one-click deployments.
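Dropping and re-creating an environment on demand can be as small as one function that anyone on the team (or a pipeline step) can call. Here's a minimal sketch with Python's sqlite3 module. The SCHEMA, SEED_USERS, and reset_test_db names are assumptions for illustration; a real setup would point at whatever database your tests use.

```python
import sqlite3

# Hypothetical schema and seed data for a test environment.
SCHEMA = """
DROP TABLE IF EXISTS users;
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
"""
SEED_USERS = [(1, "Jeremy"), (2, "Ana")]


def reset_test_db(conn):
    """Drop and re-create the tables, then load the known seed data."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", SEED_USERS)
    conn.commit()


conn = sqlite3.connect(":memory:")
reset_test_db(conn)

# A test run leaves extra data behind...
conn.execute("INSERT INTO users (id, name) VALUES (3, 'Leftover')")
conn.commit()

# ...and a reset brings the environment back to the same starting state.
reset_test_db(conn)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```

Because every run starts from the same state, a failing test points at the code and not at whatever data the previous run left behind.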

Creating a self-service model is far from an easy task. So it's important that DBAs, developers, and testers work together to create this model. Not all of them have the same needs and objectives. Join experience, knowledge, and skill to create a better model for data testing.

#5 Keep Testing Data up to Date

Last but not least, keep your testing data up to date. Your software will continue evolving, so the test scenarios and the data they need will keep changing over time too. Some test scenarios will become obsolete, and so will their data. Try to always keep the house clean by making sure you're only generating testing data that you really need and that is still relevant.

This process takes discipline, and good communication within the team always helps. Developers need to inform everyone which tests are no longer needed and when it's OK to remove them. And either DBAs or testers need to keep confirming that the data they're using for tests is still valid and relevant.
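Part of that confirmation can be automated. Here's a minimal sketch, assuming you keep a simple mapping of test cases to the data sets they use. The find_stale_datasets function and all the names in the example are made up for illustration.

```python
def find_stale_datasets(datasets, test_cases):
    """Return the data sets that no remaining test case references.

    `datasets` is a list of data set names; `test_cases` maps each test name
    to the list of data sets it needs. The result is sorted for stable output.
    """
    used = set()
    for needed in test_cases.values():
        used.update(needed)
    return sorted(set(datasets) - used)


# Hypothetical inventory: the coupons feature was removed, but its data remains.
datasets = ["users_small", "orders_2019", "legacy_coupons"]
test_cases = {
    "test_login": ["users_small"],
    "test_checkout": ["users_small", "orders_2019"],
}
stale = find_stale_datasets(datasets, test_cases)
print(stale)  # ['legacy_coupons']
```

A report like this doesn't replace the conversation within the team, but it gives everyone a concrete list to talk about.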

Keeping data fresh might seem like common sense. But I've seen delivery pipelines where tests continue to grow, even though some of the features no longer exist. Sometimes we get too extreme about trying to have a high percentage of test coverage, which isn't efficient.

Having up-to-date testing data will help you have higher quality TDM.

Benefits: Better Test Results With Better Testing Data

I'd say that testing is the most important stage of any software release life cycle. The more quickly you can verify that everything is still working, the better. Always keep the mindset that parallelizing testing will help you to speed up the process. For that, you need to have better test data quality, and it's not always necessary to have an exact replica of what you have in production. In fact, if you don't, it may help you in the cost, security, or speed departments.

It's important that you start by defining what you truly need and iterate from that. Automation helps with repetitive and boring tasks, but you still need to account for the human side of generating data for testing purposes.

TDM helps you provide only the data you need, on time and securely.

Author: Christian Melendez

Further reading suggestions: Holistic Test Data Management