Historical Versions of Files | Storage Networking Fundamentals: An Introduction to Storage Devices, Subsystems, Applications, Management, and File Systems (Vol 1)

While business continuity may be the most pressing application of historical data management, there are several other compelling reasons to keep copies of historical versions of files:

Intellectual property documentation and protection
Compliance with government regulations
Human fallibility (protein robot malfunction)
Multiuse applications

Intellectual property includes many different things such as patents, copyrights, marketing plans, corporate plans, and various other corporate documents and data files. Disputes over the rights to own, access, and control intellectual property are settled based on accurate records and documentation proving when a company or person invented or acquired it. There are many good reasons to keep all electronic documents related to a company's intellectual property.

Corporate tax law in many countries requires data to be retained for a number of years in support of possible tax audits and disputes. Today, however, newer government requirements, such as the Sarbanes-Oxley Act in the United States, mandate the retention of electronic records of all kinds of corporate data, including e-mails and instant messaging. Other regulations regarding customer and patient privacy require stored data to be secure from theft and improper access. The full impact of these new requirements on IT organizations is not clear, but they are likely to create new practice areas in IT for data archiving or ILM.

Another reason to store historical versions of files is the fallibility of human beings (protein robots). It continues to be true that the largest risk to data is a person. When users make mistakes on files they are working on, they sometimes prefer to start over again from the last wayward point of departure rather than undoing that which they have wrought.

NOTE

Of course, systems administrators never, ever make mistakes in the process of managing systems and data. But, if they did, their miscues might go unnoticed by the rest of the organization through amazing feats of skill in the operations of their snapshot or backup systems.

The notion of errors also applies to automated processes that may give disappointing results or have bugs. A process that modifies a data file improperly could be run again by starting over with the previous version of the data.

Creating Historical File Versions

Historical file versions are created when a storage process runs and makes a separate copy of a file. In other words, a spare copy is made and reserved for other purposes besides primary storage. Historical file versions are normally created through backup systems and point-in-time snapshot products.

Data Archiving

Unlike point-in-time copy applications, data archiving usually does not require access to individual files from a specific point in time. Instead, the particular version is what is wanted, regardless of when the file was actively being used.

Data archiving is the practice of making data copies solely for the purpose of accessing long-lasting historical versions of files.

NOTE

There is no agreement on the definition of data archiving within the industry. Many people use the term archive to mean backup. This is a little bit like referring to a river as an ocean, but that's storage talk for you.

The problem is that, amazingly enough, there is no common term for the unnamed thing that makes long-lasting copies of files that can be easily accessed if they ever need to be. ILM seems to be emerging as a million-dollar acronym that encompasses the concept and extends it in a few similar dimensions.

So, I like to use terms like historical version or archival copy as opposed to the shorter term archive. Hopefully these work for you without getting in the way of your understanding.

Creating Historical Versions with Backup

One of the most common ways to create historical versions is with a backup system. Depending on the rotation algorithms and operations schedules, backups sweep file systems, backing up copies of files when they are changed and occasionally backing up complete volumes.

Snapshot products also create historical versions of files either by capturing the complete volume contents or by preserving older versions when they create point-in-time data sets.

But neither backup nor snapshot techniques can be relied on to create long-lasting (three to seven years) archival copies of files, because storage space in both systems is typically reused at some point. Backup tapes are periodically reused according to the tape rotation schedule. Snapshot disk capacity is eventually recycled as part of regular capacity management and system maintenance.
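The reuse problem can be made concrete with a little arithmetic. The following is a minimal Python sketch, assuming a simple round-robin rotation in which every backup run writes exactly one tape; the function names and the example numbers are hypothetical and do not describe any particular backup product.

```python
def days_until_overwrite(tapes_in_rotation: int, days_between_backups: int) -> int:
    """Return how many days a tape's contents survive before the tape is reused.

    Assumes round-robin rotation: a tape written on day 0 is written again
    after every other tape in the rotation has been used once.
    """
    if tapes_in_rotation < 1 or days_between_backups < 1:
        raise ValueError("both arguments must be positive")
    return tapes_in_rotation * days_between_backups


def meets_archive_goal(tapes_in_rotation: int, days_between_backups: int,
                       goal_years: int) -> bool:
    """Check whether the rotation alone keeps a copy for the archival goal."""
    goal_days = goal_years * 365  # leap days ignored in this sketch
    return days_until_overwrite(tapes_in_rotation, days_between_backups) >= goal_days


# Hypothetical example: 20 tapes, one written per week, three-year goal.
print(days_until_overwrite(20, 7))
print(meets_archive_goal(20, 7, 3))
```

Under these assumptions a rotation would need well over a hundred weekly tapes to reach a three-year window, which is why archival copies are usually handled outside the normal rotation.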

Currently the best way to keep file versions for an extended period of time is through backup technology, where data can be stored off-line for historical purposes. If it were not for maintenance problems, tape media would be perfect for the task. Tape has relatively large capacities, and multiple versions of a file can be stored on a single tape. Also, data stored on tape does not interfere with production online processing.

One way to create long-term archival copies of data is to make copies of backup tapes. The problem with this approach is that the backup metadata may be erased when the original tape is rewritten by the backup system. In that case, it is important to make sure a copy of the backup database is on each tape or is otherwise available to facilitate locating file versions on tape.

Another way to create long-lasting copies of files on tape is to create different pools for archival copies. These tapes can be managed separately, outside the normal rotation scheme, and could be created through tape-to-tape copy processes or could be separate, special-purpose backup jobs.

It is also possible to install separate backup software systems for archiving purposes. Normal day-to-day backup processes would use one backup system and archiving processes would run periodically with the archiving system. It is essential to practice diligent tape management if you do this to ensure the two tape systems do not become confused and mixed.

The backup system may be able to write to optical media in addition to tape. Storing historical versions on optical media is not a bad idea, but there may be problems finding optical media that is both fast enough and large enough for the job. Considering the relative lack of success the optical disk industry has had in its history, it is not clear that suitable optical systems and media will always be available, even though it would appear to be an excellent technology for historical data archiving.

Finally, historical versions of files can be accessed only if three things are still available: a device that can read the media, a system that can use the device, and software that runs on that system and understands the data format. That translates into archiving the complete backup system as well as the tapes. Either that, or copies must be made periodically from old media to new media as backup equipment is updated. This tape-to-tape copy approach has merit as a way to keep refreshing the media that data copies are stored on, but it is also likely to be the sort of work that is easily skipped or overlooked.

Creating Historical Versions with Snapshot Technology

Whereas tape is designed to sit on shelves or in libraries, point-in-time snapshots take up storage space in disk subsystems, which eventually leads to capacity-full problems, performance problems, or both. It could be argued that whole volume snapshots do not have this problem because they do not reside on production disks. However, the capacity needed for a whole volume snapshot can be enormous, and it is unrealistic to think that there would be many such versions spinning on disk drives within an organization.

One approach is to implement snapshots in a way that stores snapshot data on second tier storage. This could be using delta-volume snapshots, where the old overwritten data is written to a second tier storage subsystem. It also could be done with file system snapshots where the older, overwritten blocks of a file are relocated to second tier storage, so that their block locations in the primary file system can be returned to the free space pool.
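The delta-volume idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any real product: the class name, the single active snapshot, and the use of a dictionary to stand in for the second tier subsystem are all hypothetical.

```python
class DeltaSnapshotVolume:
    """Toy block volume with one copy-on-write snapshot kept on a second tier."""

    def __init__(self, blocks):
        self.primary = list(blocks)   # live production blocks
        self.second_tier = {}         # block index -> preserved old contents
        self.snapshot_active = False

    def take_snapshot(self):
        """Start a new point-in-time view; nothing is copied yet."""
        self.second_tier.clear()
        self.snapshot_active = True

    def write(self, index, data):
        """Overwrite a block, preserving its old contents on first overwrite."""
        if self.snapshot_active and index not in self.second_tier:
            # Only the first overwrite after the snapshot moves data to tier two.
            self.second_tier[index] = self.primary[index]
        self.primary[index] = data

    def read_snapshot(self, index):
        """Read a block as it looked when the snapshot was taken."""
        if not self.snapshot_active:
            raise RuntimeError("no snapshot has been taken")
        # Unchanged blocks are still read from primary storage.
        return self.second_tier.get(index, self.primary[index])


vol = DeltaSnapshotVolume(["a", "b", "c"])
vol.take_snapshot()
vol.write(1, "B1")
vol.write(1, "B2")
print(vol.primary[1], vol.read_snapshot(1))
```

The design point is that second tier capacity grows only with the number of distinct blocks overwritten since the snapshot, not with the size of the volume.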

Accessing Historical File Versions

Storage and retrieval are different but related processes. The following sections discuss techniques and technologies used to access historical versions of files.

Restoring Files from Backup Tapes

Traditionally, the most common way of accessing different versions of files is through a backup system restore process. This involves using an interface presented by the backup system and locating historical files on backup tapes. Most backup applications have a number of different views that are generated from information in the backup system metadata (database or log files). For instance, the administrator can search for files based on the day they were backed up or based on their location in the file system.
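The two kinds of search mentioned above, by backup date and by file system location, can be illustrated with a small Python sketch. It assumes the backup metadata can be reduced to a flat list of records; the `CatalogEntry` fields, the function names, and the sample tape labels are hypothetical and are not taken from any real backup application.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CatalogEntry:
    path: str            # location of the file in the file system
    backed_up_on: date   # day the backup job copied this version
    tape_label: str      # tape that holds this version


def find_versions(catalog, path):
    """Return all backed-up versions of one file, newest first."""
    matches = [e for e in catalog if e.path == path]
    return sorted(matches, key=lambda e: e.backed_up_on, reverse=True)


def find_backed_up_on(catalog, day):
    """Return every entry written by backups that ran on a given day."""
    return [e for e in catalog if e.backed_up_on == day]


# Hypothetical sample catalog for illustration only.
SAMPLE_CATALOG = [
    CatalogEntry("/home/ann/report.doc", date(2004, 3, 1), "TAPE-007"),
    CatalogEntry("/home/ann/report.doc", date(2004, 3, 8), "TAPE-012"),
    CatalogEntry("/home/bob/plan.xls", date(2004, 3, 8), "TAPE-012"),
]

for entry in find_versions(SAMPLE_CATALOG, "/home/ann/report.doc"):
    print(entry.backed_up_on, entry.tape_label)
```

A search on a path that was never backed up returns an empty list.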

NOTE

One of the more common requests is to restore a file that was deleted by mistake. When that happens, the process of finding the last file version is complicated by the fact that users will sometimes swear on a stack of religious books that the file was always located in a particular directory that actually never contained it. That's one of the reasons administrators prefer file system snapshots: users can restore their own files and waste their own time.

Once a particular file version has been identified, the backup software indicates one or more tapes that can be used to restore it. If a tape library is being used and one of the tapes is available in the library, this restore process can be mostly automated. File or path names will probably have to be changed to allow both the current version and the historical version to coexist in the same file system.
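One way to let the current and historical versions coexist is to embed the backup date in the restored file's name. The following Python sketch shows one possible naming convention; the function name and the date-stamp format are assumptions made for illustration, and real backup software may rename or redirect restored files differently.

```python
from datetime import date
from pathlib import PurePosixPath


def restored_name(original_path: str, backed_up_on: date) -> str:
    """Build a restore target that does not collide with the current file.

    The backup date is inserted between the file's base name and its
    extension, so the restored copy lands in the same directory.
    """
    p = PurePosixPath(original_path)
    stamp = backed_up_on.isoformat()  # for example 2004-03-01
    return str(p.with_name(f"{p.stem}.{stamp}{p.suffix}"))


print(restored_name("/home/ann/report.doc", date(2004, 3, 1)))
```

Putting the date in the name also records which version was restored, which helps when several historical versions are pulled back for comparison.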

Accessing Snapshot Versions of Files

Point-in-time snapshot technology is also used to access historical versions of files. Unlike backup systems, which tend to be independent of operating systems and file systems, delta volume and file system snapshots can be closely integrated with operating system functions.

File system snapshot facilities allow users to view different versions of files from different virtual network volumes and select the one they want to work on. End-user access to snapshot data can be displayed through familiar name space interfaces, such as a mount point (drive letter) or folder.

Whole volume snapshot technology is the least useful for accessing historical file versions due to the size of the snapshot volume and the challenges in keeping multiple copies of the files available. Dedicated backup systems that back up whole volume archives could be used to locate data from tape and restore it.