8.2. U.K. Government Dossier on Iraq

Even when care is taken to remove comments and tracked changes, other data may remain hidden in the dark corners of a Word document, and it can still reveal more than its authors would prefer.

This was the case with a dossier prepared by the office of U.K. Prime Minister Tony Blair in February 2003, detailing the impact of Iraq's intelligence and security services on the United Nations weapons inspections that were taking place at the time. The document was used to support the argument that inspections were not working and that military action against Iraq was justified. Such an important document was bound to attract close scrutiny.

Glen Rangwala, a faculty member at Cambridge University, thought the text looked familiar. After some cross-checking in the library, he discovered that large sections of it had been lifted from an article published in September 2002 by Ibrahim al-Marashi, a graduate student in the United States. Text had clearly been cut and pasted from the original work, as evidenced by the original author's grammatical errors being carried through to the dossier. Some sentences had been modified, but in every case the new version was more strongly worded. Additional text had been taken from two other authors. None of the copied text was attributed to its original author. Rangwala's original analysis (http://www.casi.org.uk/discuss/2003/msg00457.html) makes for very interesting reading.

The report of such blatant plagiarism caught the attention of Richard M. Smith in the United States. He noticed that the dossier had been posted on the 10 Downing Street web site as a Microsoft Word document. There was an outside chance that it might contain some clues about the people involved in its preparation, so he downloaded a copy and started poking around. The file is available on his web site: http://www.computerbytesman.com/privacy/blair.doc.

Opening it up in Word showed that it had been properly sanitized. No evidence was left from the Track Changes feature and no comments could be retrieved. But Smith decided to delve a little deeper. He happened to know that a Word document contains a hidden revision log that represents its history, including the names of the people who worked on it and the names of the files that it was saved as. He was able to extract the log from the dossier, as shown here:

     Rev. #1: "cic22" edited file "C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd"
     Rev. #2: "cic22" edited file "C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd"
     Rev. #3: "cic22" edited file "C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd"
     Rev. #4: "JPratt" edited file "C:\TEMP\Iraq - security.doc"
     Rev. #5: "JPratt" edited file "A:\Iraq - security.doc"
     Rev. #6: "ablackshaw" edited file "C:\ABlackshaw\Iraq - security.doc"
     Rev. #7: "ablackshaw" edited file "C:\ABlackshaw\A;Iraq - security.doc"
     Rev. #8: "ablackshaw" edited file "A:\Iraq - security.doc"
     Rev. #9: "MKhan" edited file "C:\TEMP\Iraq - security.doc"
     Rev. #10: "MKhan" edited file "C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc"

This short block of text is a treasure trove of information that he and Rangwala were able to dissect (http://www.computerbytesman.com/privacy/blair.htm). cic22 is a reference to a government office called the Communications Information Centre. The word phamill in the first three file paths looks like the name of a person; and JPratt, ablackshaw, and MKhan are clearly names. It took only a few calls to news reporters to figure out the role of each individual. Paul Hamill was a Foreign Office official, John Pratt worked in 10 Downing Street, Alison Blackshaw was the personal assistant to Blair's Press Secretary, and Murtaza Khan was a junior press officer in Downing Street. So not only was the document full of plagiarized text, but there was clear evidence that the Prime Minister's press office had played a major role in its preparation.

The affair of the so-called dodgy dossier became a major embarrassment for the government. The foreign secretary was hauled in front of a House of Commons select committee, where even he admitted that the affair was a complete Horlicks (a colorful British euphemism). Things quickly went from bad to worse with a controversial piece of reporting from the BBC alleging that Downing Street's press officers had changed the original intelligence assessments to suit their political agenda. The tragic suicide of a senior government scientist involved in the report, and the subsequent public inquiry, ensured that the dossier remained in the headlines for months, even as the events of the war itself unfolded.

The revision log tells us one more thing. The file paths indicate that the document was edited on Windows systems, which is not surprising. However, note that several of the paths begin with A:, the drive letter conventionally assigned to a floppy disk drive. We can see that Pratt and Blackshaw both saved the document to a floppy, perhaps in preparation for passing it to someone else. Thanks to the select committee hearings, we now know the recipient of that disk was none other than Colin Powell, U.S. Secretary of State, who used the dossier in his address to the United Nations as justification for the invasion of Iraq.

These seemingly mundane details in a file revision log reflect actions at the highest level of government that eventually led nations to war. This is a dramatic illustration of the power of Internet forensics and how simple tools can have an immense impact.

8.2.1. Extracting Word Revision Logs

Word documents use a proprietary format that is extremely complex. It has to represent not just the text of a document but also the details of how it is formatted, and it can include images, embedded spreadsheets, and a host of other objects. Somewhere in the midst of all that is the revision log. Rather than try to recover that specific information, I will show you a general approach that extracts most text strings in a document. Look through the output and it is usually easy to spot the revision log.

The approach is to use the standard Unix program strings, which I discuss in Chapter 3 in the context of dissecting email attachments. Running strings on a Word document will display the text of the document along with various other pieces of information. Here is the output from a very simple Word document, with a few duplicate lines edited out:

     % strings HelloWord.doc
     jbjbq
     Hello Word
     Hello Word
     Robert Jones
     Normal
     Robert Jones
     Microsoft Word 10.1
     Craic Computing LLC
     Hello Word
     Title
     Microsoft Word Document
     NB6W
     Word.Document.8

That reveals the content of the document: the phrase "Hello Word," along with the author's name, the organization that owns the software, the title, and the version of Word that was used. But it does not include anything that looks like a filename. By default, strings looks only for ASCII characters encoded as single 7-bit bytes, which is the standard way of encoding regular text in binary documents. For various reasons, mostly to do with representing characters from non-ASCII alphabets, Word saves certain text in other encodings. Recent versions of strings let you specify an alternate encoding with the -e option; read the man page on your system to see whether you have this capability. Running strings -el, for example, reveals any text encoded as 16-bit little-endian characters, while -eb does the same for big-endian text.
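The separate passes over a document look something like this (a hypothetical session; the -td option prefixes each string with its decimal byte offset in the file, which makes it possible to merge the results of the different passes back into their original order):

     % strings -td -es HelloWord.doc     # single 7-bit-byte characters (the default)
     % strings -td -el HelloWord.doc     # 16-bit little-endian characters
     % strings -td -eb HelloWord.doc     # 16-bit big-endian characters

To save you the hassle of running the program multiple times and collating the output by hand, I have written a Perl script that does that for you and presents the resulting text in the proper order. This is shown in Example 8-1.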

Example 8-1. superstrings.pl
     #!/usr/bin/perl -w

     die "Usage: $0 <word doc>\n" unless @ARGV == 1;

     # Collect the strings found under each encoding, keyed by their
     # byte offset within the file (reported by the -t d option) so
     # that duplicates collapse and the output can be sorted
     my %hash = ( );
     foreach my $encoding ('s', 'b', 'B', 'l', 'L') {
        my $text = `strings -td -e$encoding $ARGV[0]`;
        foreach my $line (split /\n/, $text) {
           if($line =~ /^\s*(\d+)\s+(.*)$/) {
              $hash{$1} = $2;
           }
        }
     }

     # Output the strings in order of their position in the file
     foreach my $offset (sort { $a <=> $b } keys %hash) {
        printf "%s\n", $hash{$offset};
     }

Running this on my example document produces about twice as many lines of output. Most of these are related to document formatting, but included among them is the following:

     Robert JoneseMacintoshHD:Users:jones:Documents:Craic:Writing:Forensics:Examples:HelloWord.doc

This is the path to the example document on the Macintosh that I am using to write this book. The output that results from running the script on the Iraq dossier includes this block:

     cic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
     cic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
     cic22JC:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
     JPratt C:\TEMP\Iraq - security.doc
     JPratt A:\Iraq - security.doc
     ablackshaw!C:\ABlackshaw\Iraq - security.doc
     ablackshaw#C:\ABlackshaw\A;Iraq - security.doc
     ablackshaw A:\Iraq - security.doc
     MKhan C:\TEMP\Iraq - security.doc
     MKhan(C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc

With the exception of an arbitrary character between the username and the file path on each line, this block is identical to the revision log shown in the previous section. Try running the script on your own Word documents or, more interestingly, on those that you have received as email attachments.
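For instance, if you save the script from Example 8-1 as superstrings.pl and make it executable, examining an attachment is a one-line job (the file name here is hypothetical):

     % ./superstrings.pl quarterly_report.doc | less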

If you want to look further afield, Google can supply you with a wide variety of source material. Including the phrase filetype:doc in your query will limit the results to Word documents. The number of these that are publicly available is astounding. The query press release filetype:doc reports around 695,000 matches. This combination of Google searches and revision log parsing could be very productive for investigative journalists with a basic knowledge of Unix.

8.2.2. Discovering Plagiarism

Plagiarism is a widespread problem, and the growth of the Internet and the capabilities of search engines have made it easier than ever. College students copy essays from Internet sites, scientists try to pass off the results of other researchers as their own in grant applications and papers, and journalists steal content from their colleagues to include in their own dispatches. The Iraq dossier is an unusually bold example, given its high profile and its brazen copying of text from other sources without attribution.

Detecting plagiarism is difficult. The Iraq dossier was only revealed because Glen Rangwala was familiar with the area and recognized the text from another paper. Automated detection in the most general sense is extremely difficult. Some success has been achieved in the area of college papers, due in part to the volume of examples that are available. Several companies, such as Turnitin.com, offer online services that screen all papers submitted by students of subscribing institutions. The risk that attempted plagiarism will be detected can be an effective deterrent, regardless of how well the detection software might perform on any given example.

On a more basic level, you can use Google to identify similar documents on the Web. The related: syntax lets you search for content that is similar to a specific page. A search for related:www.computerbytesman.com/privacy/blair.doc returns around 30 pages. Most of these quote from the dossier or describe the copying of its content from elsewhere, but some way down the list is the original article from which most of the text was copied (http://meria.idc.ac.il/journal/2002/issue3/jv6n3a1.html). The measure of similarity that Google uses to relate texts is not ideal for this purpose, and it can return a fair amount of seemingly unrelated material. But if two or more pages with very similar content have been indexed by Google, then a related: search with any one of them should identify the other examples.

The downside of this approach is that the text you want to search with must be available on the Web and already be indexed by Google prior to your search. Unfortunately, you cannot create a new page, post it on a web site, and submit a related: search that refers to it. Google appears to look for that page in its existing index, rather than fetching it from the original site. If it fails to retrieve the page, then it returns results based simply on the URL, which is not going to be what you expect.

Having discovered two documents that appear to be related, the next step is to identify the identical or similar text. This is a difficult problem in and of itself. If the files are essentially carbon copies of each other, then the Unix utility diff might be useful, but in most cases it fails completely. diff was designed for comparing highly structured text, such as source code listings and computer output, and it cannot handle the diversity in the way text is laid out in regular documents.
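A contrived example illustrates the problem. Because diff compares whole lines, two files that contain the same words wrapped at different points appear to have nothing in common:

     % cat one.txt
     The quick brown fox
     jumped over the lazy dog.
     % cat two.txt
     The quick brown
     fox jumped over the lazy dog.
     % diff one.txt two.txt
     1,2c1,2
     < The quick brown fox
     < jumped over the lazy dog.
     ---
     > The quick brown
     > fox jumped over the lazy dog.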

The comparison of arbitrary text and the alignment of similar, but non-identical, sentences are hard problems that continue to attract the interest of computer scientists. A related problem, DNA and protein sequence comparison, lies at the heart of bioinformatics. Algorithms based on dynamic programming have proven very useful in both fields, although their performance characteristics have led to the development of faster, more approximate methods.

The Perl script shown in Example 8-2 is a very simple implementation of dynamic programming as applied to text comparison. It has a number of significant limitations, but it serves as a useful way to find identical text within two arbitrary documents.

It splits the text of the two documents into two arrays of words, eliminating punctuation and words of less than four characters. It then takes each word from the first document in turn and looks for a match in the second document. For every match it finds, it initiates a comparison of the arrays starting at that point, stepping forward through the arrays one word at a time and increasing the score of the matching segment whenever the words are identical. Conceptually, this is like taking a diagonal path through a matrix in which each element represents the comparison of word i from array 0 with word j from array 1. All matching segments with greater than a minimum score are saved, overlapping segments are resolved, and finally the text that comprises each of these is output. A more complete implementation would be able to handle insertions or deletions of words in one text relative to the other.

Example 8-2. compare_text.pl
     #!/usr/bin/perl -w

     die "Usage: $0 <file1> <file2>\n" unless @ARGV == 2;

     my $minscore = 5;

     my @words0 = ( );
     my @words1 = ( );

     loadWords($ARGV[0], \@words0);
     loadWords($ARGV[1], \@words1);

     my %segment = ( );
     my $maxscore = 0;
     my $maxi0 = 0;
     my $maxi1 = 0;

     # Use every matching pair of words as the seed for a diagonal
     # comparison, keeping the best score for each segment end point
     for(my $i0 = 0; $i0 < @words0; $i0++) {
        my $word0 = $words0[$i0];
        for(my $i1 = 0; $i1 < @words1; $i1++) {
           if(lc $words1[$i1] eq lc $word0) {
              ($maxscore, $maxi0, $maxi1) =
                  matchDiagonal(\@words0, \@words1, $i0, $i1);
              if(exists $segment{$maxi0}{$maxi1}) {
                 if($maxscore > $segment{$maxi0}{$maxi1}) {
                    $segment{$maxi0}{$maxi1} = $maxscore;
                 }
              } else {
                 $segment{$maxi0}{$maxi1} = $maxscore;
              }
           }
        }
     }

     # Report every matching segment that scores above the threshold
     foreach my $maxi0 (sort { $a <=> $b } keys %segment) {
        foreach my $maxi1 (sort { $a <=> $b } keys %{$segment{$maxi0}}) {
           $maxscore = $segment{$maxi0}{$maxi1};
           if($maxscore >= $minscore) {
              printf "%s\n\n",
                 traceBack(\@words0, \@words1, $maxi0, $maxi1, $maxscore);
           }
        }
     }

     sub matchDiagonal {
        # Extend an initial word match along both word arrays
        my ($words0, $words1, $i0, $i1) = @_;
        my $maxscore = 0;
        my $maxi0 = $i0;
        my $maxi1 = $i1;
        my $score = 0;
        my $j1 = $i1;
        for(my $j0 = $i0; $j0 < @$words0; $j0++) {
           if(lc $words0->[$j0] eq lc $words1->[$j1]) {
              $score++;
              if($score > $maxscore) {
                 $maxscore = $score;
                 $maxi0 = $j0;
                 $maxi1 = $j1;
              }
           } else {
              $score--;
           }
           if($score < 0) {
              $score = 0;
              last;
           }
           $j1++;
           last if($j1 >= @$words1);
        }
        ($maxscore, $maxi0, $maxi1);
     }

     sub traceBack {
        # Trace back from the maximum score to reconstruct the matching string
        my ($words0, $words1, $maxi0, $maxi1, $score) = @_;
        my @array0 = ( );
        my @array1 = ( );
        my $i1 = $maxi1;
        for(my $i0 = $maxi0; $i0 >= 0; $i0--) {
           push @array0, $words0->[$i0];
           push @array1, $words1->[$i1];
           if(lc $words0->[$i0] eq lc $words1->[$i1]) {
              $score--;
           }
           last if($score == 0);
           $i1--;
           last if($i1 < 0);
        }

        # Mark mismatched word pairs as ((word0/word1))
        my @array = ( );
        for(my $i = 0; $i < @array0; $i++) {
           if(lc $array0[$i] eq lc $array1[$i]) {
              push @array, $array0[$i];
           } else {
              push @array, sprintf "((%s/%s))", $array0[$i], $array1[$i];
           }
        }
        join ' ', reverse @array;
     }

     sub loadWords {
        # Read in the text word by word - skip short words
        my ($filename, $words) = @_;
        my $minsize = 4;
        open INPUT, "< $filename" or die "Unable to open file: $filename\n";
        while(<INPUT>) {
           $_ =~ s/[^a-zA-Z0-9]+/ /g;
           $_ =~ s/^\s+//;
           foreach my $word (split /\s+/, $_) {
              if(length $word >= $minsize) {
                 push @$words, $word;
              }
           }
        }
        close INPUT;
     }

To use the script, you first need to extract the plain text from the two documents; pass it the text itself, not, for example, the HTML source of a web page. The removal of punctuation and short words improves the quality of the comparison but makes the output more difficult to read. Word differences within matching segments are shown within two sets of parentheses, which enclose the non-matching words.
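You can save a document as plain text from within Word or a browser, or use a command-line converter. Here is one hypothetical session, assuming the third-party tools antiword (for Word files) and lynx (for web pages) are installed:

     % antiword blair.doc > dossier.txt
     % lynx -dump http://meria.idc.ac.il/journal/2002/issue3/jv6n3a1.html > marashi.txt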

Applying the program to text files saved from the Iraq dossier Word document and from a web page containing the al-Marashi paper on which it was based produces a large number of matching segments, indicating the extent of the plagiarism in that case. Here are some examples from that output:

     % ./compare_text.pl marashi.txt dossier.txt
     [...]
     Jihaz Hamaya Khas Special Protection Apparatus charged with
     protecting Presidential Offices Council Ministers
     [...]
     informants external activities include ((monitoring/spying)) Iraqi
     ((embassies/diplomats)) abroad collecting overseas intelligence
     ((aiding/supporting)) ((opposition/terrorist)) ((groups/organisations))
     hostile regimes conducting sabotage subversion terrorist operations
     against
     [...]
     shifting directors these agencies establish base security
     ((organization/organisation)) substantial period time
     [...]

Some of the differences represent simple replacement of the original American English spelling with the British spelling of the same word; for example, organization has been replaced with organisation throughout the document. Most troubling are examples like the second block in this output, where the original text aiding opposition groups has been replaced with the more strongly worded phrase supporting terrorist organisations.

Even though this script has serious limitations, it provides a simple way to compare two text files, to display similar blocks of text, and to highlight small but possibly significant differences between them.

8.2.3. The Right Way to Distribute Documents

Most of these Word document problems could have been prevented if the authors had converted the files to PDF before distributing them. All of the Word-specific revision logs, comments, and edits would have been stripped out as part of that process. PDF files do have hidden information of their own, but it is typically limited to identifying the software used to create the file, unless the author has explicitly added comments and the like using Adobe Acrobat. The publicity surrounding the various Word document disclosures in recent years has prompted many governments to require that documents be converted to PDF prior to publication.

But for many other purposes, it still makes sense to transfer documents in Word format, especially in the business world. Many situations arise where two or more parties need to revise and comment on the wording of a document, and Word documents remain the most convenient way to do this. So how should you sanitize a document?

Most of the issues can be dealt with by removing any identifying information in the program preferences, such as your name, and then either avoiding Track Changes or being careful to accept or reject all outstanding edits and comments before final release.

If the text styling and layout of the document are relatively simple, then a quick and effective solution is to save the document in Rich Text Format (RTF). This is a subset of the native Word format that can represent most documents, but it has the advantage of not carrying the hidden metadata along with it.

Microsoft has proven to be quite forthcoming about metadata and its removal. One of several Knowledge Base articles on its web site, entitled "How to minimize metadata in Word 2003" (http://support.microsoft.com/kb/825576), details 20 different types of metadata that can be removed manually or by setting certain preferences. Microsoft also offers a Remove Hidden Data add-in for recent versions of Word, Excel, and PowerPoint that makes the process less burdensome.

Tools are also available from third-party vendors. Most of these are targeted at law firms, and some can be integrated with mail servers to apply security policies automatically to all outgoing email. Workshare Protect, iScrub, and Metadata Scrubber are three commercial tools.

Be aware that any complex document format may contain hidden metadata. At the very least, be sure to check for identifying information in the various menu items available in the software used to create or view the document. To be thorough, run the script shown in Example 8-1 on the document file and look for hidden strings. Always understand the hidden information that your documents carry with them.
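As a final check before a document leaves your hands, filtering the script's output for Windows-style file paths can flag leftover revision history (a hypothetical session; the document name is made up):

     % ./superstrings.pl press_release.doc | grep '\\'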


