2.4 Complex Data Structures | Mastering Perl for Bioinformatics

Different algorithms require different data structures. Using references in Perl, it is possible to build very complex data structures.

This section gives a short introduction to some of the possibilities, such as a hash with array values and a two-dimensional array of hashes. See the recommended reading in Section 2.9 of this chapter for books and sections of the Perl manual that are very helpful.

Perl uses the basic data types of scalar, array, and hash, plus the ability to declare scalar references to those basic data types, to build more complex structures. For instance, an array must have scalar elements, but those scalar elements can be references to hashes, in which case you have effectively created an array of hashes.
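As a minimal sketch of the array-of-hashes idea just described, the following short program stores two hash references in an ordinary array. The gene names and lengths are invented purely for illustration; they are not taken from any real data set.

```perl
use strict;
use warnings;

# Hypothetical records, invented for illustration only.
my @genes = (
    { name => 'geneA', bases => 1200 },
    { name => 'geneB', bases => 850 },
);

# Each array element is a scalar that happens to be a hash reference.
for my $gene (@genes) {
    print "$gene->{name} is $gene->{bases} bases long\n";
}

# ref() reports what kind of thing a reference points to.
my $kind = ref $genes[0];
print "Element 0 is a $kind reference\n";
```

The array itself still holds only scalars; it is the references stored in those scalars that make it behave like an array of hashes.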

2.4.1 Hash with Array Values

A common example of a complex data structure is a hash with array values. Using such a data structure, you can associate a list of items with each keyword. The following code shows an example of how to build and manage such a data structure. Assume you have a set of human genes, and for each human gene, you want to manage an array of organisms that are known to have closely related genes. Of course, each such array of related organisms can be a different length:

 use Data::Dumper;
 %relatedgenes = (  );
 $relatedgenes{'stromelysin'} = [
     'C.elegans',
     'Arabidopsis thaliana'
 ];
 $relatedgenes{'obesity'} = [
     'Drosophila',
     'Mus musculus'
 ];
 # Now add a new related organism to the entry for 'stromelysin'
 push( @{$relatedgenes{'stromelysin'}}, 'Canis' );
 print Dumper(\%relatedgenes);

This program prints the following, although the order of the hash keys may differ from run to run (the Data::Dumper module is described in more detail later; type perldoc Data::Dumper for the details of this useful way to print complex data structures):

 $VAR1 = {
           'stromelysin' => [
                              'C.elegans',
                              'Arabidopsis thaliana',
                              'Canis'
                            ],
           'obesity' => [
                          'Drosophila',
                          'Mus musculus'
                        ]
         };

The tricky part of this short program is the push. The first argument to push must be an array. In the program, this array is @{$relatedgenes{'stromelysin'}}. Examining this array from the inside out, you can see that it refers to the value of the hash with key stromelysin: $relatedgenes{'stromelysin'}. You know that the values of this %relatedgenes hash are references to anonymous arrays. This hash value is contained within a block of curly braces, which returns the reference to the anonymous array: {$relatedgenes{'stromelysin'}}, and the block is preceded by an @ sign that dereferences the anonymous array: @{$relatedgenes{'stromelysin'}}.
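The same @{...} dereference is what you would use to read the arrays back out. The following sketch loops over the hash and prints each list of organisms. It also pushes onto a key that does not exist yet, which makes Perl create the anonymous array automatically (autovivification). The key hypothetical_gene is made up for this example.

```perl
use strict;
use warnings;

my %relatedgenes = (
    stromelysin => [ 'C.elegans', 'Arabidopsis thaliana' ],
    obesity     => [ 'Drosophila', 'Mus musculus' ],
);

# Pushing onto a key that does not exist yet creates the anonymous
# array automatically. 'hypothetical_gene' is a made-up key.
push @{ $relatedgenes{'hypothetical_gene'} }, 'Canis';

# Sort the keys so that the output order is predictable.
for my $gene (sort keys %relatedgenes) {
    my @organisms = @{ $relatedgenes{$gene} };    # copy of the array
    my $count     = scalar @organisms;
    print "$gene ($count): ", join(', ', @organisms), "\n";
}
```

Sorting the keys is a design choice made here only to give repeatable output; Perl itself makes no promise about hash key order.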

2.4.2 Two-Dimensional Array of Hashes

As another example, say you have data from a microarray experiment in which each location on a plate can be identified by an x and y location; each location is also associated with a particular gene and has a set of reported measurements. You can implement this particular data as a two-dimensional array, each entry of which is a (reference to a) hash whose keys are gene names and whose values are (references to) arrays of the measurements. Here's how you can initialize one of the entries of that two-dimensional array:

 $array[3][4]{'stromelysin'} = [3, 4, 5];

The position on the plate is represented by an entry in the two-dimensional array such as $array[3][4]. The fact that the entry is a hash is shown by the reference to a particular key with {'stromelysin'}. That the value for that key is an array is shown by the assignment to that key $array[3][4]{'stromelysin'} of the anonymous array [3, 4, 5]. To print out the array associated with the key stromelysin, you have to remember to tell Perl that the value for that key is an array by surrounding the expression with curly braces preceded by an @ sign, @{$array[3][4]{'stromelysin'}}:

 $array[3][4]{'stromelysin'} = [3, 4, 5];
 print "The scores for plate position 3, 4 were @{$array[3][4]{'stromelysin'}}\n";

This prints:

 The scores for plate position 3, 4 were 3 4 5
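If only a few plate positions have been filled in, most entries of the two-dimensional array are undefined. The following sketch shows one way to walk such a sparse structure and report every position that holds data. The second entry, at position 0, 2, is hypothetical and added only so that the loop has more than one thing to find.

```perl
use strict;
use warnings;

my @array;
$array[3][4]{'stromelysin'} = [3, 4, 5];
$array[0][2]{'obesity'}     = [7, 1];    # hypothetical second entry

my @report;
for my $x (0 .. $#array) {
    my $row = $array[$x] or next;         # skip rows never assigned
    for my $y (0 .. $#{$row}) {
        my $cell = $row->[$y] or next;    # skip empty positions
        for my $gene (sort keys %{$cell}) {
            push @report, "($x, $y) $gene: @{ $cell->{$gene} }";
        }
    }
}
print "$_\n" for @report;
```

Copying each row and cell reference into a short-lived variable keeps the inner expressions readable and avoids creating empty entries by accident while reading.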

A common Perl trick is to dereference a complex data structure by enclosing the whole thing in curly braces and preceding it with the correct symbol: $, @, or %. So, take a moment and reread the last example. Do you see how the following:

 $array[3][4]{'stromelysin'}

looks up the value for a key in a hash? Do you see how the expression:

 @{$array[3][4]{'stromelysin'}}

makes it clear that the value for that hash key is an array? Similarly, if the value for that hash key were a reference to a scalar, you could say:

 ${$array[3][4]{'stromelysin'}}

and if the value for that hash key were a reference to a hash, you could say:

 %{$array[3][4]{'stromelysin'}}
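To see all three forms side by side, the following sketch stores one reference of each kind at the same plate position. The keys scores, note, and details, and the values stored under them, are hypothetical names chosen for this illustration.

```perl
use strict;
use warnings;

my @array;
my $note = 'rerun needed';    # hypothetical annotation

$array[3][4]{'scores'}  = [3, 4, 5];           # reference to an array
$array[3][4]{'note'}    = \$note;              # reference to a scalar
$array[3][4]{'details'} = { dye => 'Cy3' };    # reference to a hash

# The leading symbol matches the kind of data the reference points to.
my @scores  = @{ $array[3][4]{'scores'} };
my $text    = ${ $array[3][4]{'note'} };
my %details = %{ $array[3][4]{'details'} };

print "Scores: @scores\n";
print "Note: $text\n";
print "Dye: $details{dye}\n";
```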

2.4.3 Complex Data Structures

References give you a fair amount of flexibility. For example, your data structures can combine references to different types of data. You can have an anonymous array such as in the following short program:

 $gene = [
     # hash of basic information about the gene name, discoverer,
     #  discovery date and laboratory.
     {
         name       => 'antiaging',
         reference  => [ 'G. Mendel', '1865'],
         laboratory => [ 'Dept. of Genetics', 'Cornell University', 'USA']
     },
     # scalar giving priority
     'high',
     # array of local work history
     ['Jim', 'Rose', 'Eamon', 'Joe']
 ];
 print "Name is ", ${$gene->[0]}{'name'}, "\n";
 print "Priority is ", $gene->[1], "\n";
 print "Research center is ", ${${$gene->[0]}{'laboratory'}}[1], "\n";
 print "These individuals worked on the gene: ", "@{$gene->[2]}", "\n";

This program produces the output:

 Name is antiaging
 Priority is high
 Research center is Cornell University
 These individuals worked on the gene: Jim Rose Eamon Joe

Let's examine this code to understand how it works; it contains most of the points made in this chapter.

$gene is a reference to an anonymous array of three elements. Each element of that array can therefore be referred to by either:

 $$gene[0]
 $$gene[1]
 $$gene[2]

or equivalently (and our choice in this code) by:

 $gene->[0]
 $gene->[1]
 $gene->[2]

To be specific, the first element is a reference to an anonymous hash, the second element is a scalar string high, and the third element is a reference to an anonymous workgroup array.

The plot thickens when you examine the anonymous hash that is referenced by the first array element. It has three keys, one of which, name, has a simple scalar value. The other two keys have values that are references to anonymous arrays of scalar strings.

So, this certainly qualifies as a complex data structure!

When you place one of the reference elements of the $gene anonymous array within a block of curly braces, the block returns a reference that must be dereferenced appropriately. To refer to the entire hash at the beginning of the array, say:

 %{$gene->[0]}

As in the program code, the scalar value that is the second element of the array needs no dereferencing and is accessed simply as:

 $gene->[1]

The third part of this data structure is an anonymous array, which we can refer to in total as:

 @{$gene->[2]}

This is also done in the program code.
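When the elements of a structure are of mixed types like this, Perl's built-in ref function can tell you which dereference applies to each one. The following sketch uses a shortened copy of the $gene structure; ref returns an empty string for a plain scalar that is not a reference.

```perl
use strict;
use warnings;

# Shortened copy of the $gene structure from the program above.
my $gene = [
    { name => 'antiaging' },
    'high',
    ['Jim', 'Rose', 'Eamon', 'Joe'],
];

my @kinds;
for my $element (@{$gene}) {
    my $type = ref $element;    # 'HASH', 'ARRAY', or '' for a non-reference
    push @kinds, $type eq '' ? 'plain scalar' : $type;
}
print join(', ', @kinds), "\n";
```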

Now, let's finish by looking into the first element of the $gene anonymous array. This is a reference to an anonymous hash. One of the keys of that hash has a simple scalar string value, which is referenced with:

 ${$gene->[0]}{name}

as was done in the program code. To make sure we understand this, let's write it out:

 ${$gene->[0]}{name}
     is $ hashref    {name}
     is 'antiaging'

{$gene->[0]} is a block containing a reference to an anonymous hash. It is then used as is typical for a hash reference: it's preceded by a $ and followed by the key name in curly braces and so resolves to a lookup of the key name in the anonymous hash.

The most intricate dereference in this program is that which digs out the name of the research center:

 ${${$gene->[0]}{laboratory} }[1]
     is ${$ hashref    {laboratory} }[1]
     is $  arrayref                  [1]
     is 'Cornell University'

Here, {$gene->[0]} is a block containing a reference to an anonymous hash. The value for the key laboratory is retrieved from that anonymous hash; the value is a reference to an anonymous array. Finally, that array reference, ${$gene->[0]}{laboratory}, is enclosed in a block of curly braces, preceded by a $, and followed by the array index 1 in square brackets, which dereferences the anonymous array and returns its second element, Cornell University.

Note that the last expression can also be written as:

 $gene->[0]->{laboratory}->[1]
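The following sketch checks that the block form and the arrow form retrieve the same value from the $gene structure. It also shows a third spelling: Perl lets you omit the arrow between adjacent subscripts, so only the first arrow after $gene is required.

```perl
use strict;
use warnings;

my $gene = [
    {
        name       => 'antiaging',
        reference  => [ 'G. Mendel', '1865' ],
        laboratory => [ 'Dept. of Genetics', 'Cornell University', 'USA' ],
    },
    'high',
    ['Jim', 'Rose', 'Eamon', 'Joe'],
];

my $with_blocks  = ${ ${ $gene->[0] }{laboratory} }[1];
my $with_arrows  = $gene->[0]->{laboratory}->[1];
my $short_arrows = $gene->[0]{laboratory}[1];    # arrows between subscripts omitted

print "$with_blocks\n$with_arrows\n$short_arrows\n";
```

All three expressions reach the same element; which one to use is largely a matter of readability.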

You see how the use of references within blocks enables you to dereference some rather deeply nested data structures. I urge you to take the time to understand this example and to use the resources listed in Section 2.9.