6.6 Data Archiving Requirement Pattern

Basic Details

Related patterns:	Data longevity
Anticipated frequency:	Zero to three requirements, rarely more
Pattern classifications:	Affects database: Yes

Applicability

Use the data archiving requirement pattern to specify the moving or copying of data from one place of permanent storage to another.

Do not use the data archiving requirement pattern to specify the regular backing up of a database. It's reasonable to assume that whatever database product is being used supports the regular backing up of its data (and restoring it in the event of trouble). If you want requirements to this effect, specify them for the information storage (that is, database) infrastructure. (See the "Information Storage Infrastructure" section in Chapter 7, "Data Entity Requirement Patterns.")

Discussion

Archiving is one of those Information Technology terms that's uncomfortably vague, with tattered boundaries. Many people have different interpretations of it. Here's a definition of its meaning in this book:

Archiving: The moving or copying of a body of data from one medium of permanent storage to another. A "body of data" archived can be any subset of the available data. The backing up of a whole database is expressly excluded: it is not archiving.

This is, like many definitions, rather stark and cold. It's expressed in narrow functional terms: not technical, exactly, but neither does it hint at what business purposes archiving might serve. So what might we want to use it for? Archiving is commonly used for the following purposes (and possibly for more than one simultaneously):

Historic, to create an offline record of data that's due to be deleted because it's no longer needed in the online system.
Performance, by minimizing the quantity of data to be searched and processed and by allowing data to be duplicated somewhere else so that heavy work on one copy doesn't affect the performance of another.
Noninterference, to create a copy of data that can be worked on and mucked around with without affecting the original data.
Security, because it's impossible to improperly access information that's no longer present.

More obscure uses for archiving include

Proof of existence, so that we can prove that certain data was present at the time an archive copy was made.
Expiry of permission, if the data belongs to someone else and our authority to continue using it has ended. This might occur if the data is associated with a third-party product or belongs to a company on whose behalf we have been doing processing but no longer.

Sometimes the benefits of using data archiving don't reveal themselves until the system's being designed, which is fine. So it's not always necessary for the requirements to specify data archiving, even if the system ends up using it. Write a data archiving requirement if you have a specific need. It's worth pointing out that this requirement pattern delves into technical matters more than we like to in requirements, but it's important to understand the implications of what we specify because they can profoundly affect the nature of the whole system. This is especially true of the "Extra Requirements" section, which deals with some of those implications.

A straightforward commercial system could have a database that will happily store all the data it ever accumulates in its lifetime and whose data conveniently stays in the same place all the time. Such a system might have no need for data archiving. Also, disk space is so plentiful and machines so powerful that the need to delete unwanted data has all but disappeared in many organizations, to the extent that the thought of doing so is alien to many developers. The cost of building software to purge data becomes harder to justify (but, surely, it's still the right thing to do!).

Data Archiving Model

Data archiving requirements are good at elucidating the details of all the kinds of archiving a system needs, but they can be made clearer by painting an overall picture of what's going on: what data goes from where to where? A relatively simple diagram like this one does the job:

[Figure: example data archiving model diagram, with cylinders representing permanent storage]

It can help us decide what archiving is needed and help us spot omissions. As such it deserves to be treated with respect, which is more likely if we give it a fancy name: a data archiving model. In most systems, data is usually in the right place at the right time, but only data administrators know the whys and wherefores of the copying and movement of data. That's not good enough: everyone, especially testers and auditors, should be able to tell what's going on.

Use whatever format best suits what you have to say. This diagram uses a cylinder for all permanent storage, so it doesn't influence the choice of storage media, but it uses cylinder size to suggest relative quantities of data. Supplement the diagram with textual explanations if they'll make it clearer.

Avoid mentioning specific technology in a data archiving requirement, as far as possible. If you can, leave it up to the development team to pick the most suitable storage media and products. But place whatever realistic demands you wish upon the characteristics of the storage medium, such as the ease of reading it or use of nonproprietary formats. Ordinarily data is archived to machine-readable media, but it needn't be. Purely human-readable media is sometimes acceptable, and there are circumstances where it is preferable. Reports printed on paper are harder to tamper with undetectably than most digital media (unless you go to the trouble of using digital signatures and the like), and they can be read with no technology at all. Filing away piles of paper may seem old-fashioned, but it has its merits. Then there are intermediate technologies such as microfiche, which require low-tech readers but surely have more chance than a digital medium of being readable fifty years hence. (I can't help but reflect that our digital era will leave to posterity fewer enduring artifacts than the books and scrolls and tablets of earlier ages.)

Data archiving is often neglected, both in specifications and in systems themselves, partly because it's not prominently visible and partly because most archiving is unimportant when a system is new. It takes time for data to grow old. It's usually quite legitimate (and not detrimental) to give archiving-related requirements a relatively low priority and not to worry about them in initial versions of a system. The building of systems is dominated by short-term priorities: everyone wants to get their hands on a new system and start using it as quickly as possible. Data archiving is all about the long term, which is why it's put on the back burner and why functions for manipulating very old data are all but unheard of in normal systems.

Content

A data archiving requirement should contain the following:

Data description What information is to be moved or copied? State as clearly as possible the criteria for selecting the data. It may be narrow (one type, from a single database table, say) or broad (many types); it may involve a small quantity of data (possibly none at all, sometimes) or lots. The nature and quantity of data to archive can have a major bearing on the implementation (and type of storage media), but the requirement need not concern itself with that.
Move or copy? Is the original data to be left where it is, or is it to be removed? The word "move" implies that the original data is deleted, but you might want to say so explicitly, to forestall misunderstandings.
Origin Where does the data being moved or copied originally reside? It is most commonly a database.
Destination Where is the data to be moved or copied to? It might be an offline medium or another database. Keep an open mind about the storage medium, as far as possible.

In most cases, archiving involves one origin and one destination, but it can be more complex. Data in one place could be split up and saved separately, especially if it is segregated (as described in the multiness requirement pattern in Chapter 10), for example to create an archive tape for each company in a multicompany system.

Conversely, data from several places could be loaded into a single destination, for example if we have a consolidated reporting database into which data archived from several other systems is placed. A situation like this involves various other complications, so analyze it properly and specify additional requirements as appropriate.
Frequency How often should the archiving be done? This might also encompass the time of day, though try to express this in terms of intention rather than a specific time (for example, on the first day of each month, in the wee small hours when system activity is at its lowest). Frequencies can vary enormously, from once a year to every few seconds, which has a huge bearing on the nature of the implementation. If the archiving is manually initiated, the frequency is indicative only, to assist the devising of operational procedures.
Initiator What starts the archiving process? Should it run automatically, or only when a person requests it? Or doesn't it matter (as far as the requirement is concerned)?
Purpose Why is this data being archived? It might be because it is no longer needed online, to improve performance, or for any of the other reasons listed above.

One further item might be needed in rare circumstances, but avoid it if possible:

Archive format Say as little as you need to, and be as nonspecific as you can. Stating the archive format might be necessary if you're producing archives to be read by another system or to satisfy some external standard. It's possible to offer multiple archiving formats (perhaps pluggably, as per the extendability requirement pattern, also in Chapter 10).

Template(s)

Summary	Definition
«Data summary» archiving	«Data description» shall be [moved]/[copied] from «Data origin» to «Data destination» «Frequency». «Initiator description». [The purpose of this is to «Archiving purpose».]
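The template's placeholders can also be captured as a small record, which makes it easy to check that no item from the "Content" list has been left out. The following Python sketch is only an illustration: the class name `ArchivingRequirement`, its field names, and the sample values are assumptions made for this example, not part of the pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchivingRequirement:
    """One data archiving requirement; fields mirror the template placeholders."""
    data_summary: str
    data_description: str
    move: bool                     # True = move (original deleted), False = copy
    origin: str
    destination: str
    frequency: str
    initiator: str
    purpose: Optional[str] = None  # the bracketed purpose sentence is optional

    def summary(self) -> str:
        return f"{self.data_summary} archiving"

    def definition(self) -> str:
        verb = "moved" if self.move else "copied"
        text = (f"{self.data_description} shall be {verb} from {self.origin} "
                f"to {self.destination} {self.frequency}. {self.initiator}.")
        if self.purpose:
            text += f" The purpose of this is to {self.purpose}."
        return text


# Hypothetical sample values, loosely based on the customer order example.
req = ArchivingRequirement(
    data_summary="Customer order",
    data_description="Fulfilled customer orders",
    move=True,
    origin="the online order database",
    destination="an offline storage medium",
    frequency="once a month",
    initiator="The archiving process may be started manually or automatically",
    purpose="reduce the quantity of transactional data in the online system",
)
print(req.summary())
print(req.definition())
```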

Example(s)

Summary	Definition
Customer order archiving	Customer orders and all details pertaining to each order shall be eligible to be moved to an offline storage medium a configurable number of days (expected to be of the order of 90 days) after the whole order has been fulfilled, and actually moved the next time the order archiving process is run thereafter. The resulting offline storage media shall be retained indefinitely. This requirement makes no judgment about how the order archiving process is to be initiated; it may be manual or automatic. The purpose of this is to reduce the quantity of transactional data in the online system (to assist performance) and to reduce the adverse impact of unauthorized access to the online data.
Reporting database archiving	All changes in the data in the Web order system (including new data) shall be archived to the order reporting system. This archiving shall be performed sufficiently frequently that any update to the Web order database shall reach the order reporting database within 60 minutes. The purpose of this is to keep the contents of the order reporting database reasonably up-to-date.
Whole company archiving	It shall be possible to create an offline archive of all data belonging to a nominated company. The purpose of this is to allow a company to obtain a copy of its data, particularly if it wants to end its participation in the services run by the system.
Customer statement archiving	For every printed statement sent to a customer, another printed copy shall be produced for archiving purposes. A statement for archiving shall be produced within one lapsed month from the printing of the original statement, and it must contain the same information as the original. All statements printed at one time shall be sorted by customer number. If more than one statement is printed for one customer, they shall be in statement number sequence. This is for legal reasons: to be able to defend the content of any statement disputed by a customer.
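The eligibility rule in the customer order archiving example can be illustrated with a short sketch. It assumes that each order records the date on which it was wholly fulfilled (or nothing, if it is not yet fulfilled); the function name `is_eligible_for_archiving` and the constant `ARCHIVE_AFTER_DAYS` are hypothetical names chosen for this illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Configurable; the example requirement expects a value of the order of 90 days.
ARCHIVE_AFTER_DAYS = 90


def is_eligible_for_archiving(fulfilled_on: Optional[date], today: date,
                              archive_after_days: int = ARCHIVE_AFTER_DAYS) -> bool:
    """Return True once the whole order has been fulfilled for the configured number of days."""
    if fulfilled_on is None:  # the order is not yet wholly fulfilled
        return False
    return today - fulfilled_on >= timedelta(days=archive_after_days)
```

An eligible order is not moved at once: as the requirement says, it is actually moved the next time the order archiving process runs, so a process of this kind would apply the check to every candidate order on each run.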

Extra Requirements

A data archiving requirement might deserve extra requirements for some of the following things:

Keep track of archives. Cabinets full of offline media aren't much use if you can't tell what each one contains. How do you know what data you have stored offline in archives? What data is stored where? How do you find what you want? Being able to answer these questions demands "indexes" of the archives' contents, which can either be stored on the archive media themselves or in the main system. The latter makes it easier to locate what you want. The level of detail needed in the "indexes" depends on how easily you want to be able to find specific items of data. They could, for example, go so far as to record which archives contain data for which customers. Some archives are simple enough in their content to be managed manually (or by using a spreadsheet).
View archived data. Archived data is of no use unless you can access it. How are we going to look at what's in our archives? This is described further in the "Viewing Archived Data" subsection later in this section.
Reload archived data. One way to view archived data is to load it back into the original system and use the normal functions it has for viewing (or doing other things with) data of this kind. See the "Reloading Archived Data" subsection for more.
Allow rearchiving. If you're serious about keeping your archived data in a usable form, you should be able to transfer an offline copy to another offline medium. Most media degrade over time. And most eventually become obsolete, so you need to copy the data stored on them before you dump the last device capable of reading them (such as your last big reel tape reader or your last eight-inch diskette drive).
Deal with unreliable archiving. An archive process can be unreliable if the destination is remote from the origin (that is, dependent on unreliable communications) or if the destination medium has inadequate capacity (or it asks you to load an extra medium, and you don't have one). When moving data, the system should not remove it from the source until it has definitely been stored at the destination, and it's reasonable for any data archiving requirement to assume this implicitly. A "move" could separate the copy and the remove stages, with the remove being conditional on acknowledgment that the copy succeeded (or delayed, to give time to take another copy if the first failed). Acceptance of the copy might demand more than merely that it was produced successfully; it could also include successful delivery to its recipient. It could also involve creating more than one copy.
Allow access control. Once data is placed on an offline medium, the system's own access control mechanisms cease to protect it. (Also recognize that anyone capable of creating offline media might be able thereby to gain access to data they otherwise couldn't.) Data can be protected by encrypting it, but this raises its own headaches of tools for accessing it, management of encryption keys, and so on, which is too much for this requirement pattern to cope with. You must work this out for yourself.
Prevent copies of data being modified. If we're archiving data into a database, we may want to prevent the software that uses it there (which might be the main system's software) from modifying it, if we want it to be purely a faithful copy of the original data. In that case, write requirements to this effect.
Take over an old system's archives. If we're replacing an old system, are we taking responsibility for the data archives it produced during its life? Does that mean the new system must be able to access these archives? If so, write requirements specifying what we want. One option is to write a utility to convert the old archives into a format used by the new system (a kind of rearchiving, as described above). This achieves the goal without necessarily involving the main new system.
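The "copy, confirm, then remove" approach described under "Deal with unreliable archiving" can be sketched as follows. This is a minimal illustration that assumes the data to archive is a single file and that comparing SHA-256 checksums is an acceptable form of acknowledgment; the function name `move_to_archive` is hypothetical, and a real system might also require confirmation of delivery or a second copy before removing anything.

```python
import hashlib
import shutil
from pathlib import Path


def _sha256(path: Path) -> str:
    """Checksum a file in chunks so that large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_to_archive(source: Path, destination: Path) -> None:
    """Copy source to destination, verify the copy, and only then delete the source."""
    shutil.copy2(source, destination)          # if this raises, the source is untouched
    if _sha256(source) != _sha256(destination):
        destination.unlink(missing_ok=True)    # discard the bad copy; keep the source
        raise OSError(f"archive copy of {source} could not be verified")
    source.unlink()                            # remove only after the copy is confirmed
```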

These features are remarkably rare in commercial systems. The reasons aren't hard to find: they're unsexy, infrequently used, obscure, hard to implement, and unimportant while the system is young. In short, they're too much work for too little return. Then, because systems tend not to have them, no one expects them. Nevertheless, a system that has gaps in its functionality as a result can't be regarded as well-rounded and complete. And perhaps their omission contributes in a tiny way to dissatisfaction with a system and with systems in general.

Viewing Archived Data

There are several possible approaches to take to viewing archived data:

Reload the data into the main system, which will already have a wide range of familiar functions for viewing it. This is the subject of the next subsection, "Reloading Archived Data."
Store the data in an easily viewable format so that commonly available software can view it (a Web browser, or a plain text editor). There can be a trade-off here between convenient viewing by a person and convenient reading by a machine. Decide which is more important. For example, we could archive data as HTML pages to be easily viewable (but more problematical for a machine to extract the raw data), or we could use a more formal structure that suits a machine but is harder for a person to read.

There is scope for more creative answers to this dilemma, but to express one in requirements without it sounding like a solution (without mentioning specific technology, say), you must be nippy on your feet. A good basis for an answer is to store the data in a properly structured format and to write a definition of a way to transform it into a human-readable form. (For example, XML could be the structured format and an XSLT style sheet the definition, to facilitate converting it to HTML.) It's good practice to store any auxiliary definitions on the archive medium, too, so that it becomes self-sufficient.
Write special software to view the archive. This might sound over the top, but you might encounter circumstances where it's appropriate, for example if the archive must be viewable by an external party to whom you can't or don't want to provide the main system's software. Also consider the platform(s) on which this software is to run: make it as widely usable as possible. Again, for self-sufficiency, place the software on the archive medium itself if there's room for it.
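As an illustration of the "structured format plus transformation definition" idea in the second approach, the sketch below writes archived orders as XML and adds a standard `xml-stylesheet` processing instruction naming a style sheet that would be stored on the same archive medium. The element names, the sample order fields, and the file name `orders.xsl` are all assumptions made for this example; the style sheet itself is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Any, Iterable, Mapping


def build_archive_xml(orders: Iterable[Mapping[str, Any]],
                      stylesheet: str = "orders.xsl") -> str:
    """Serialize orders as XML that refers to an accompanying XSLT style sheet."""
    root = ET.Element("orders")
    for order in orders:
        el = ET.SubElement(root, "order", id=str(order["id"]))
        ET.SubElement(el, "customer").text = str(order["customer"])
        ET.SubElement(el, "total").text = str(order["total"])
    # The processing instruction tells suitable software which style sheet to apply.
    header = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              f'<?xml-stylesheet type="text/xsl" href="{stylesheet}"?>\n')
    return header + ET.tostring(root, encoding="unicode")
```

The raw data stays convenient for a machine to read, while software that honors the processing instruction can present it in human-readable form.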

Functions for viewing archived data can usually be treated as low priority: to be built after the mad scramble to get the main system working is over. After all, they're of no use until our data begins to age, unless our system has taken responsibility for old data belonging to an earlier system it replaces. The danger here is that these functions will be forgotten about and never built: more pressing demands on resources will keep coming along. (Strictly speaking, you could forget about archive access functions altogether until you need them, if the chances of actually using them are low. This might be a valid attitude if archives are there only as a last resort, for instance in case your company is sued.)

Information outlives the system that stores it. What happens to requirements for viewing archives once the system shuffles off its mortal coil? Should they not continue in effect? This is an interesting philosophical issue. When specifying a system, there is a tendency to see it as being born into a pristine new world and being able to live for ever. This ignores its responsibility to take ownership of any system(s) it replaces or to make it easier to hand over information to any system that comes afterwards. The latter may seem an awfully long way off, but a system is properly specified only once and that might be the only opportunity to think about it.

Reloading Archived Data

Accessing archived data by loading it back into the original system sounds like the obvious thing to do, but it's fraught with practical problems, so don't ask for it unless you know what you're doing (and you're aware how much trouble you might be causing). In short, reloading old data can confuse a system. It will probably want to archive it off again (delete it!) at the first opportunity, and it might interfere with other functions (such as statistics). One way around some of these difficulties is to load the data into a separate database, distinct from the main database, which might or might not be looked after by the main system.

Another cause of trouble for reloading archives is changes in the structure of the database during the system's life. For instance, data to be reloaded might be missing values for columns that have been added subsequently. The software for reloading must be able to cope with these inconsistencies, which, unless you're careful, might mean having to enhance it whenever the structure of the database is changed. If you want your archives to be reloadable even if the database is modified, say so in the requirements, though it may involve significant ongoing effort to achieve it. This is perhaps another argument against reloading archives back into the original system as a means to view them.
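One way reloading software might cope with columns added after an archive was written is to keep a table of default values for them. The sketch below is only an illustration: the column names, the defaults, and the function name `adapt_archived_row` are hypothetical, and the table of defaults is exactly the kind of thing that must be maintained whenever the database structure changes.

```python
from typing import Any, Dict, Mapping

# Hypothetical current structure of the table being reloaded.
CURRENT_COLUMNS = ("order_id", "customer_id", "total", "currency", "loyalty_tier")

# Hypothetical columns added after some archives were written, with the value to assume.
DEFAULTS_FOR_ADDED_COLUMNS: Dict[str, Any] = {
    "currency": "USD",
    "loyalty_tier": None,
}


def adapt_archived_row(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the archived row reshaped to the current structure, filling added columns."""
    adapted: Dict[str, Any] = {}
    for column in CURRENT_COLUMNS:
        if column in row:
            adapted[column] = row[column]
        elif column in DEFAULTS_FOR_ADDED_COLUMNS:
            adapted[column] = DEFAULTS_FOR_ADDED_COLUMNS[column]
        else:
            # A column that has always existed is missing: refuse rather than guess.
            raise KeyError(f"archived row lacks required column {column!r}")
    return adapted
```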

Here are a couple of requirements related to restoring archived data:

Summary	Definition
Restore offline transactions	It shall be possible to restore selected transactions from offline storage to online storage in a convenient manner. "In a convenient manner" means that the number of offline media (for example, tapes) that must be loaded must be small. For example, having to load a daily tape for each day of the two-year life of a customer in order to restore that customer's history is not considered convenient. Criteria for selecting the transactional data to restore must include: (a) a given customer ID and date range (with all activity back to the customer's registration being a special option); (b) a given company ID and date/time range.
No inquiries load from offline storage	No inquiry shall directly request the loading of data from any offline storage medium. The purpose of this requirement is to prevent casual use of any inquiry from asking for the manual intervention that loading an offline medium is likely to involve.
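Restoring by customer and date range without loading many media depends on an index of the archives' contents of the kind described under "Keep track of archives." The sketch below assumes a hypothetical index with one entry per medium per customer, recording the range of dates it covers; the names `ArchiveIndexEntry` and `media_to_load` are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ArchiveIndexEntry:
    medium_label: str   # for example, the label written on a tape
    customer_id: int
    first_date: date    # earliest transaction for this customer on the medium
    last_date: date     # latest transaction for this customer on the medium


def media_to_load(index: List[ArchiveIndexEntry], customer_id: int,
                  start: date, end: date) -> List[str]:
    """Return the labels of media holding the customer's data within [start, end]."""
    labels = {
        entry.medium_label
        for entry in index
        if entry.customer_id == customer_id
        and entry.first_date <= end
        and entry.last_date >= start
    }
    return sorted(labels)
```

Keeping such an index in the main system lets an operator see how many media a restore would need before anything is loaded, which is one way of judging whether the restore is "convenient" in the sense of the requirement.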

Considerations for Development

The "Extra Requirements" section touches upon many technical considerations, more so than is normal in a requirement pattern.

Consider the format in which to store offline data. Keep it as open and accessible as possible, not tied to any particular storage product, in case you (or whoever accesses it) don't have that product at the time (perhaps because it's no longer available by then). Consider flat files in some generic format, such as XML, perhaps with an XSLT style sheet included, to make the data self-displaying by suitable software.

Considerations for Testing

The testing of archiving can vary considerably, depending on the nature and format of the destination and the frequency of archiving. You might need to simulate a relatively long time period before the system is due to archive any data at all. (See the data longevity requirement pattern earlier in this chapter for a couple of suggested ways.)

It's impractical to test retrieval of old data following upgrades to the system that affect the structure of data until such an upgrade occurs. Then you can have some fun! Get hold of some data stored offline, and attempt to reload it. If you don't have access to data from a live system, you need some forethought. Create some offline data during earlier testing, and store long-term the test archives you create for each version of the system.

If you're allowed to reload data from an archive into the main system, do so. Test whether the system deletes the reloaded data at the first opportunity. This can be irritating if whoever reloaded the data hasn't had the chance to look at it. If it's unclear (unspecified) how the system should behave in this situation, decide whether you think what it actually does seems sensible.