Security of infrastructure. Confidence in data. You can have it all.
Data breaches have companies reevaluating how to protect sensitive information while keeping it useful to their teams for a variety of business-critical functions. Data masking, a way to keep useful data secure and out of the wrong hands, has rapidly become a best-in-class solution for companies today.
Incorporating data masking with the right security policies gives companies a powerful option for using data in a secure and compliant way. In the past, companies would often delete data entirely in an attempt to protect against a security breach, losing all of its value for future use. Or they would use data without the proper security measures, putting themselves at risk of non-compliance and breaches.
When done correctly, masked data looks and works like real information, with data fields and properties preserved in the new, fictitious values, across all sources and databases. It provides your business with peace of mind that data is safe while leaving it viable for dev teams and non-production environments.
Data masking is a necessary tool to have in your security arsenal, but companies need to educate themselves properly to pull it off successfully.
Here are four things to consider:
1. Determine data relationships: explicit or implicit?
Production data can have both explicit and implicit relationships between different data elements. Distinguishing the two is key to maintaining the integrity of data as well as making sure data is readily available for frequent and efficient access.
An explicit relationship might be a customer that exists across several tables or systems with a unique customer ID that ties records together and allows integrations to work properly. If the customer ID is a sensitive data element, then it must be replaced with the same fictitious value wherever it appears across the various tables. The replacement must also be unique—two distinct customers must never be replaced with the same fictitious value.
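A minimal sketch of this consistent-and-unique replacement rule (the function name, ID format, and sample tables below are illustrative, not from any particular masking product):

```python
import itertools

_replacements: dict[str, str] = {}   # original ID -> fictitious ID
_counter = itertools.count(1)

def mask_customer_id(original_id: str) -> str:
    """Return the same fictitious ID every time the same original appears,
    and a distinct fictitious ID for each distinct original."""
    if original_id not in _replacements:
        _replacements[original_id] = f"CUST-{next(_counter):06d}"
    return _replacements[original_id]

orders   = [{"customer_id": "C-829"}, {"customer_id": "C-113"}]
invoices = [{"customer_id": "C-829"}]

for table in (orders, invoices):
    for row in table:
        row["customer_id"] = mask_customer_id(row["customer_id"])

# The same customer gets the same masked value in both tables,
# so joins and integrations keep working:
assert orders[0]["customer_id"] == invoices[0]["customer_id"]
# Distinct customers get distinct masked values:
assert orders[0]["customer_id"] != orders[1]["customer_id"]
```

The in-memory lookup table keeps replacement consistent within one masking run; point 4 below touches on keeping it consistent over time.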
An implicit relationship, on the other hand, can be illustrated with addresses. If the postal code is randomized, it may be necessary to adjust the city, state, county/parish, or other components to match. If a person is tied to that address, then a number such as a social security number or social insurance number may be tied to the state or province and may require adjustment as well.
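The address example can be sketched like this. The tiny ZIP-to-city lookup is illustrative only; a real masking tool would draw on a full postal database:

```python
import random

# Illustrative lookup of valid US ZIP codes and their city/state.
ZIP_DIRECTORY = {
    "10001": ("New York", "NY"),
    "60601": ("Chicago", "IL"),
    "94103": ("San Francisco", "CA"),
}

def mask_address(record: dict) -> dict:
    """Replace the ZIP with a randomly chosen valid one, then adjust the
    implicitly related fields (city, state) so the address stays coherent."""
    new_zip = random.choice(list(ZIP_DIRECTORY))
    city, state = ZIP_DIRECTORY[new_zip]
    return {**record, "zip": new_zip, "city": city, "state": state}

masked = mask_address({"zip": "02134", "city": "Boston", "state": "MA"})
# The masked city/state always match the masked ZIP:
assert (masked["city"], masked["state"]) == ZIP_DIRECTORY[masked["zip"]]
```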
2. Preserve statistical distributions
Retaining the statistical distribution of values is sometimes required when masking data. Consider the management of patient records at a hospital, for example. As the patient is admitted, records are generated: treatment codes, clinical findings, diagnostic test results, pre- and postoperative care, patient progress, and medications. Each record generates date and time stamps. For security reasons, these dates and times must be anonymized. For statistical reasons, they may need to stay within a set time frame (shifted only by a bounded number of days, hours, or minutes) and to preserve the original intervals between events, or vary them only slightly. Retaining the statistical distribution of values allows analysis of treatment efficiency to remain accurate without jeopardizing the patient's identity by exactly matching the original records.
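One common way to meet both requirements is a bounded, per-patient date shift: every timestamp for a given patient moves by the same offset, so intervals between events are preserved exactly while no date matches the original. A sketch, assuming a secret masking key (the key and patient IDs below are placeholders):

```python
import hashlib
from datetime import datetime, timedelta

SECRET = b"rotate-me"   # assumed masking key, not from the article

def patient_offset(patient_id: str, max_days: int = 30) -> timedelta:
    """Derive a stable shift in the range +/- max_days for this patient."""
    digest = hashlib.sha256(SECRET + patient_id.encode()).digest()
    days = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return timedelta(days=days)

def shift(ts: datetime, patient_id: str) -> datetime:
    return ts + patient_offset(patient_id)

admitted = datetime(2024, 3, 1, 9, 30)
surgery  = datetime(2024, 3, 3, 14, 0)
a, s = shift(admitted, "P-42"), shift(surgery, "P-42")

# Intervals between a patient's events are preserved exactly:
assert s - a == surgery - admitted
# The shift stays within the allowed time frame:
assert abs((a - admitted).days) <= 30
```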
3. Want to pass the test? Retain proper formatting.
Many systems also require data to adhere to proper formats in order to pass validity tests in downstream systems—email addresses, credit card numbers, and social security numbers are common examples. Masked data must be created in a format that meets these validation tests. It must also be able to pass more complex tests, such as the Luhn algorithm used to compute the proper check digit for credit card numbers and Canadian social insurance numbers.
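For example, a masking tool that generates fictitious credit card numbers must append a Luhn check digit so the result passes downstream validation. A self-contained sketch (the fictitious 15-digit body below is a standard test value, not real card data):

```python
def luhn_check_digit(partial: str) -> str:
    """Check digit that makes `partial + digit` pass the Luhn test."""
    total = 0
    # Walk right to left; double every second digit. The rightmost digit of
    # the partial number is doubled, since the check digit will occupy the
    # final, undoubled position.
    for i, d in enumerate(int(c) for c in reversed(partial)):
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(number: str) -> bool:
    total = 0
    for i, d in enumerate(int(c) for c in reversed(number)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

body = "411111111111111"            # fictitious 15-digit card body
card = body + luhn_check_digit(body)
assert luhn_valid(card)             # the masked number passes validation
```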
4. Data consistency is key
Repeatability is important not only across a data set but also over time. Consider a refresh of development databases with a new extract from production. From a test and development point of view, it is vital to have the data stay consistent across a refresh so that test cases, automated testing tools, and the like can find the same records again. Customer names, numbers, emails, or other identifiers in the production data should be “anonymized” into the same fictitious replacement each time a refresh is performed. This allows uninterrupted work on the same masked data after the refresh is performed, along with whatever new data might have been generated in production since the last refresh.
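One common way to get this repeatability is deterministic masking with a keyed hash: as long as the key is kept stable (and secret), every refresh maps the same production value to the same fictitious value, with no lookup table to persist. A sketch, assuming a hypothetical email-masking helper (the key and domain are placeholders):

```python
import hashlib
import hmac

MASKING_KEY = b"keep-this-secret"   # assumed key; keep it stable across refreshes

def mask_email(email: str) -> str:
    """Map an email to the same fictitious address on every refresh."""
    tag = hmac.new(MASKING_KEY, email.lower().encode(),
                   hashlib.sha256).hexdigest()[:10]
    return f"user-{tag}@example.com"

# Re-running the masking job yields identical replacements, so saved test
# cases and automated tools keep finding the same records after a refresh:
assert mask_email("jane.doe@acme.com") == mask_email("jane.doe@acme.com")
# Distinct inputs still map to distinct fictitious addresses:
assert mask_email("jane.doe@acme.com") != mask_email("john@acme.com")
```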