4 Secrets for Archiving Stale Data Efficiently

To deal with stale data and comply with defensible disposition requirements, organizations need an effective solution. There are 4 secrets to efficiently identifying and cleaning up stale data...

Rob Sobers

Last updated October 14, 2022

The mandate to every IT department these days seems to be: “do more with less.” The basic economic concept of scarcity is hitting home for many IT teams, not only in terms of headcount, but storage capacity as well. Teams are being asked to fit a constantly growing stockpile of data into an often-fixed storage infrastructure.

So what can we do given the constraints? The same thing we do when we see our own PC’s hard drive filling up – identify stale, unneeded data and archive or delete it to free up space and dodge the cost of adding new storage.

Stale Data: A Common Problem

A few weeks ago, I had the opportunity to attend VMworld Barcelona and talk to several storage admins. The great majority were concerned about finding an efficient way to identify and archive stale data. Unfortunately, most of the conversations ended with: “Yes, we have lots of stale data, but we don’t have a good way to deal with it.”

The problem is that most existing off-the-shelf solutions decide what is eligible for archiving based on a file’s “lastmodifieddate” attribute, but this method is neither accurate nor very efficient.

Why is this?

Automated processes like search indexers, backup tools, and anti-virus programs are known to update this attribute, but we’re only concerned with human user activity. The only way to know whether humans are modifying data is to track what they’re doing, i.e., to gather an audit trail of access activity. What’s more, if you rely on checking “lastmodifieddate” then, well, you have to actually check it. This means looking at every single file every time you do a crawl.
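To make the cost of the crawl approach concrete, here is a minimal Python sketch of what a "lastmodifieddate" scan looks like. The function name crawl_for_stale and the 90-day default are assumptions for illustration, not taken from any particular product. Note that it has to stat every file on every run, and it has no way to tell who or what made the last change.

```python
import os
import time

SECONDS_PER_DAY = 86_400


def crawl_for_stale(root, max_age_days=90, now=None):
    """Return (stale_paths, files_checked) by stat-ing every file under root.

    Hypothetical helper: the name and the 90-day default are illustrative.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale, checked = [], 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            checked += 1  # every file costs one stat call on every crawl
            # st_mtime changes whenever anything writes the file,
            # whether a person or an automated process did it.
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale), checked
```

The files_checked count always equals the total number of files, which is the inefficiency described above: the work grows with the size of the share, not with the amount of actual activity.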

With unstructured data growing about 50% year over year and with about 70% of the data becoming stale within 90 days of its creation, the accurate identification of stale data represents not only a huge challenge but also a massive opportunity to reduce costs.

4 Secrets for Archiving Stale Data Efficiently

1. The Right Metadata

In order to accurately and efficiently identify stale data, we need to have the right metadata – metadata that reflects the reality of our data, and that can answer questions accurately. It is not only important to know which data hasn’t been used in the last 3 months, but also to know who touched it last, who has access to it, and if it contains sensitive data. Correlating multiple metadata streams provides the appropriate context so storage admins can make smart, metadata-driven decisions about stale data.
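As a rough illustration of what correlating metadata streams could look like, the Python sketch below merges three hypothetical inputs (last human access, permissions, and content classification, each keyed by file path) into one record per file. The FileContext type, the field names, and the input shapes are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileContext:
    """One merged record per file (hypothetical shape)."""
    path: str
    last_human_access: Optional[datetime] = None
    last_user: Optional[str] = None
    allowed_groups: set = field(default_factory=set)
    sensitive_tags: set = field(default_factory=set)


def correlate(activity, permissions, classification):
    """Merge three path-keyed dicts into one FileContext per path.

    activity:       path -> (timestamp, user) of the last human access
    permissions:    path -> iterable of groups with access
    classification: path -> iterable of sensitivity tags (e.g. "PII")
    """
    paths = set(activity) | set(permissions) | set(classification)
    merged = {}
    for path in sorted(paths):
        when, user = activity.get(path, (None, None))
        merged[path] = FileContext(
            path=path,
            last_human_access=when,
            last_user=user,
            allowed_groups=set(permissions.get(path, ())),
            sensitive_tags=set(classification.get(path, ())),
        )
    return merged


def is_stale(ctx, now, max_age=timedelta(days=90)):
    """Treat a file with no recorded human access as stale (an assumption)."""
    return ctx.last_human_access is None or now - ctx.last_human_access > max_age
```

With everything in one record, a single lookup can answer "is it stale, who touched it last, who can reach it, and is it sensitive?" instead of requiring four separate reports.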

2. An Audit Trail of Human User Activity

We need to understand the behavior of our users, how they access data, what data they access frequently, and what data is never touched. Rather than continually checking the “lastmodifieddate” attribute of every single data container or file, an audit trail gives you a list of known changes by human users. This audit trail is crucial for quick and accurate scans for stale data, but also proves vital for forensics, behavioral analysis, and help desk use cases (“Hey! Who deleted my file?”).
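Here is a small Python sketch of the audit-trail idea, assuming events arrive as (path, user, timestamp) tuples and that automated processes run under known service accounts. Both the event shape and the account names are hypothetical. Instead of visiting every file, it only walks the list of recorded events and ignores anything a service account did.

```python
from datetime import datetime, timedelta

# Hypothetical service accounts whose activity should not count as human use.
SERVICE_ACCOUNTS = {"svc_backup", "svc_indexer", "svc_antivirus"}


def last_human_touch(events, service_accounts=SERVICE_ACCOUNTS):
    """Map each path to (timestamp, user) of its most recent human event."""
    latest = {}
    for path, user, when in events:
        if user in service_accounts:
            continue  # automated activity is not evidence of use
        if path not in latest or when > latest[path][0]:
            latest[path] = (when, user)
    return latest


def stale_paths(known_paths, events, now, max_age=timedelta(days=90)):
    """Return paths with no human event inside the max_age window."""
    latest = last_human_touch(events)
    cutoff = now - max_age
    return sorted(
        p for p in known_paths if p not in latest or latest[p][0] < cutoff
    )
```

The same last_human_touch mapping also answers the help desk question: the user stored next to each timestamp is the last person known to have touched the file.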

3. Granular Data Selection

Not all data is created equal. HR data might have different archiving criteria than Finance data or Legal data. Each distinct data set might require a different set of rules, so it’s important to have as many axes to pivot on as possible. For example, you might need to select data based on its last access date as well as the sensitivity of the content (e.g., PII, PCI, HIPAA) or the profile of the users who use the data most often (C-level vs. help desk).

The ability to slice and dice data granularly, and to determine with confidence which data a specific operation will affect (and how), will make storage pros’ lives much easier.
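One way to picture granular selection is as a small set of per-data-set rules. The Python sketch below is only an illustration: the ArchiveRule type, the folder prefixes, the idle periods, and the exempt tags are all made-up values, not recommended policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ArchiveRule:
    path_prefix: str                    # which data set the rule covers
    max_idle: timedelta                 # idle time before a file is eligible
    skip_tags: frozenset = frozenset()  # content labels that exempt a file


# Hypothetical policy; the first matching rule wins, so order matters.
RULES = [
    ArchiveRule("/hr/", timedelta(days=365), frozenset({"PII"})),
    ArchiveRule("/finance/", timedelta(days=180), frozenset({"PCI"})),
    ArchiveRule("/", timedelta(days=90)),  # fallback for everything else
]


def select_for_archive(files, now, rules=RULES):
    """Pick files to archive.

    files: iterable of (path, last_human_access, tags) tuples.
    """
    selected = []
    for path, last_access, tags in files:
        rule = next((r for r in rules if path.startswith(r.path_prefix)), None)
        if rule is None or rule.skip_tags & set(tags):
            continue  # no rule applies, or the content is exempt
        if now - last_access > rule.max_idle:
            selected.append(path)
    return selected
```

Because the function returns the list of affected paths without touching anything, it can be run first as a preview to see exactly which data a rule change would sweep up.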

4. Automation

Lastly, there needs to be a way to automate data selection, archival, and deletion. Stale data identification cannot consume more IT resources; otherwise the storage savings diminish. As we mentioned at the start, IT is always trying to do more with less, and intelligent automation is the key. The ability to automatically identify and archive or delete stale data based on metadata will make this a sustainable and efficient task that can save time and money.
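As a very simplified picture of the automation step, the Python sketch below takes a list of files already selected as stale and moves them under an archive location, keeping their relative folder structure. The function name and the dry-run default are assumptions for this example; a real job would also need to handle permissions, name collisions, and logging.

```python
import os
import shutil


def archive_files(paths, source_root, archive_root, dry_run=True):
    """Move each path under archive_root, keeping its path relative to source_root.

    Returns the list of (source, destination) pairs. With dry_run=True
    (the default) nothing is moved, so the plan can be reviewed first.
    """
    plan = []
    for src in paths:
        rel = os.path.relpath(src, source_root)
        dst = os.path.join(archive_root, rel)
        plan.append((src, dst))
        if not dry_run:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
    return plan
```

Defaulting to a dry run is a deliberate choice here: a scheduled job can report what it would move long before anyone lets it move anything.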

Interested in how you can save time and money by using automation to turbocharge your stale data identification? Request a demo of the Varonis Data Transport Engine.

Photo credit: austinevan

Rob Sobers is a software engineer specializing in web security and is the co-author of the book Learn Ruby the Hard Way.
