2.2 References | Mastering Perl for Bioinformatics

Many computer languages provide variables that allow you to refer to, or point at, other values. So, instead of a variable containing data such as a string or number of interest, the variable contains the location of the data; it tells you where to go to get the value you want. In Perl, a scalar value that refers to another value is called a reference, and the value being pointed at is called a referent.

References allow you to do many useful things in Perl; you can define multidimensional arrays and other more complex data structures and avoid copying large amounts of data (for instance, when passing arguments into subroutines). Using references can make your programs faster, more efficient, and shorter. References have a number of uses, as you'll see in the next sections.

2.2.1 References to Scalars

Here's an example of a reference:

    $peptide = 'EIQADEVRL';
    $peptideref = \$peptide;
    print "Here is what's in the reference:\n";
    print $peptideref, "\n";
    print "Here is what the reference is pointing to:\n";
    print ${$peptideref}, "\n";
    print $$peptideref, "\n";

This Perl code produces the following output:

    Here is what's in the reference:
    SCALAR(0x80fe4ac)
    Here is what the reference is pointing to:
    EIQADEVRL
    EIQADEVRL

What's going on here?

First, a string value of EIQADEVRL is assigned to the scalar variable $peptide. Next, a backslash operator is used before the $peptide variable to return a reference to the variable. This reference is saved in the scalar variable $peptideref.

The next lines of code show what this example really does. When you print out the (actual) value of the reference variable $peptideref, you get the value:

 SCALAR(0x80fe4ac)

This says that the reference variable $peptideref is pointing to a scalar value (which is the value of the scalar variable $peptide). It also gives a hexadecimal number that specifies where in the computer memory the value for that variable resides.

The 0x at the beginning of the number says that it is a hexadecimal number. ^[2] Hexadecimal (base 16) numbers are a common way to write locations in computer memory. The exact location in the computer memory where this $peptide value resides is almost never of practical importance to you. However, it can help when debugging code that uses references, and so it is displayed when you print the value of a reference as we've just done. (The Perl ref function, which we'll use later, reports just the type of the referent, such as SCALAR.)

^[2] Recall that hexadecimal numbers use 16 digits, 0 through 9 followed by a through f, so that the decimal (base 10) numbers 10 through 15 are written a through f in hexadecimal, and decimal 16 is written 10.

2.2.1.1 Dereferencing

Finally, our code fragment performs the essential task of dereferencing a reference. In Perl a reference to a scalar variable can be dereferenced by surrounding it with curly braces {} and prepending another dollar sign to it. ${$peptideref} returns the value the reference variable is pointing at. The value being pointed at is the same as the value of the $peptide variable, which has the value 'EIQADEVRL', so ${$peptideref} also has the value 'EIQADEVRL'.

Surrounding a reference with curly braces before prepending the appropriate symbol ( $ for scalar, @ for array, % for hash) is generally the best way to dereference reference variables. As you start using more intricate references, you'll find that it's often the only way to dereference properly. However, for simple reference variables, it is possible to omit the additional curly braces. So, our example shows both ways of dereferencing our scalar reference:

    ${$peptideref}
    $$peptideref

In Perl, every reference must be dereferenced properly in the program (in other words, by the programmer) to be useful. Perl doesn't automatically dereference for you, nor can it figure out when you want a reference or when you want the value that the reference is pointing to. So, it's up to you to specify that you want the value of a reference by prepending a % , @ , or $ to hash, array, or scalar references, respectively. (And, as just pointed out, you often need to surround the reference with curly braces, although for simple references, they can be omitted.)
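Dereferencing works for writing as well as reading. The following is a minimal sketch, assuming the same peptide string as above; the appended residue and the variable names $with_braces and $without_braces are only illustrative:

```perl
use strict;
use warnings;

my $peptide    = 'EIQADEVRL';
my $peptideref = \$peptide;          # reference to the named scalar

# Reading through the reference: both forms give the referent's value.
my $with_braces    = ${$peptideref};
my $without_braces = $$peptideref;

# Writing through the reference changes the original variable,
# because the reference and the variable lead to the same storage.
${$peptideref} .= 'K';               # append one (illustrative) residue

print "Read through the reference: $with_braces, $without_braces\n";
print "The variable now holds:     $peptide\n";
```

Because the two variables captured the value before the append, they still hold the original string, while $peptide itself reflects the change made through the reference.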

2.2.1.2 Anonymous data

A scalar constant can also be referenced, as in the following code:

    $peptideref = \'EIQADEVRL';
    print "Here is what's in the reference:\n";
    print $peptideref, "\n";
    print "Here is what the reference is pointing to:\n";
    print ${$peptideref}, "\n";

This produces the output:

    Here is what's in the reference:
    SCALAR(0x80fe4a0)
    Here is what the reference is pointing to:
    EIQADEVRL

In this case the reference points directly to a location in memory in which the string value EIQADEVRL is being stored.

Compare this code with the previous example. There, the reference was to an existing variable that held a scalar value. Think of it as a scalar value with a "name" that is the already existing variable. Here, the reference is to a scalar value alone. This scalar value isn't contained in any variable; it has no name. Thus, it's called an anonymous referent, which can only be used via the reference to it.

You may well ask, "Why bother?" Anonymous scalars are, for most practical purposes, not any more desirable than simple scalar variables. However, anonymous data structures, and references to them, are frequently useful, as you shall see.

2.2.2 References of References

It is sometimes useful to have references of references. Since a reference is just a variable containing a scalar value, it's possible to make a reference to a reference:

    $value = 'ACGAAGCT';
    $refvalue = \$value;
    $refrefvalue = \$refvalue;
    print $value, "\n";
    print $$refvalue, "\n";
    print $$$refrefvalue, "\n";

This prints out:

    ACGAAGCT
    ACGAAGCT
    ACGAAGCT

(Notice that here I've omitted the surrounding curly braces from around the references.) You can also apply several levels of reference at one go:

    $value = 'ACGAAGCT';
    $refrefrefvalue = \\\$value;
    print $value, "\n";
    print $$$$refrefrefvalue, "\n";

This prints out:

    ACGAAGCT
    ACGAAGCT
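When references are nested, the built-in ref function can help you keep track of what each level points to. The sketch below assumes the same DNA string as above; the $kind_of_* variable names are only illustrative:

```perl
use strict;
use warnings;

my $value       = 'ACGAAGCT';
my $refvalue    = \$value;        # reference to a scalar
my $refrefvalue = \$refvalue;     # reference to a reference

# ref returns an empty string for a non-reference, and otherwise
# a string naming the kind of thing the reference points to.
my $kind_of_value  = ref($value);
my $kind_of_ref    = ref($refvalue);
my $kind_of_refref = ref($refrefvalue);

print "ref(\$value)       = '$kind_of_value'\n";
print "ref(\$refvalue)    = '$kind_of_ref'\n";
print "ref(\$refrefvalue) = '$kind_of_refref'\n";

# Each level of indirection needs its own dereference.
my $recovered = ${ ${$refrefvalue} };
print "Recovered value: $recovered\n";
```

A reference to a scalar reports SCALAR, while a reference to another reference reports REF, which is a quick way to check how many levels of dereferencing remain.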

2.2.3 References to Arrays

References to arrays obey pretty much the same syntax as references to scalars. You make a reference to an array by prepending a backslash to the @ sign; you dereference the array by surrounding the reference variable with curly braces and prepending an @ sign, as in the following example:

    @pentamers = ('cggca', 'tgatc', 'ttggc');
    $arrayref = \@pentamers;
    print "Here is what's in the reference:\n";
    print $arrayref, "\n";
    print "Here is what the reference is pointing to:\n";
    print "@{$arrayref}\n";
    print "Here is the second value in the array:\n";
    print ${$arrayref}[1], "\n";

This Perl code produces the following output:

    Here is what's in the reference:
    ARRAY(0x80fe4c4)
    Here is what the reference is pointing to:
    cggca tgatc ttggc
    Here is the second value in the array:
    tgatc

An important point to remember here is that Perl doesn't automatically know if you want the data a reference is pointing to or the reference variable itself. If it's pointing to a scalar value, it's up to you to prepend the $ sign to the reference in order to dereference the value. Similarly, as in this example, if a reference is pointing to an array value, it's up to you to prepend the @ sign to the reference to dereference the value; @{$arrayref} is correct; @$arrayref is also okay.

On the other hand, if you want the value of one element of the referenced array, you prepend a dollar sign because the value will be a scalar value. Recall that to get the scalar value of one element of an array you use a dollar sign; for example, $array[0] . Similarly, to get the scalar value of one element from an array with a reference, you prepend a dollar sign; for example, ${$arrayref}[0] or $$arrayref[0] .
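The same dereferencing forms work wherever an ordinary array would. Here is a small sketch, assuming the pentamers array from above; the extra pentamer pushed onto the array is made up for illustration:

```perl
use strict;
use warnings;

my @pentamers = ('cggca', 'tgatc', 'ttggc');
my $arrayref  = \@pentamers;

my $count      = scalar @{$arrayref};   # number of elements
my $last_index = $#{$arrayref};         # index of the last element
my $first      = ${$arrayref}[0];       # one element, so a leading $

# Array functions accept the dereferenced array too; this modifies
# @pentamers itself, not a copy.
push @{$arrayref}, 'aacgt';

print "count=$count last_index=$last_index first=$first\n";
print "The original array now holds: @pentamers\n";
```

Note that $count and $last_index were taken before the push, so they describe the array as it was at that point.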

2.2.3.1 The arrow operator

References to arrays (and references to hashes and subroutines) can be dereferenced using another syntax that's popular and important to learn. If $arrayref is a reference to an array, then to dereference the second element (for instance) of that array, you can say either:

 $$arrayref[1]

or, equivalently:

 $arrayref->[1]

The following code fragment shows this:

    @pentamers = ('cggca', 'tgatc', 'ttggc');
    $arrayref = \@pentamers;
    print "Here is the second element of the pentamers array:\n";
    print $$arrayref[1], "\n";
    print "And here it is again:\n";
    print $arrayref->[1], "\n";

This code prints out:

    Here is the second element of the pentamers array:
    tgatc
    And here it is again:
    tgatc

The arrow operator appears between the name of the reference to an array and the square brackets and subscript. It works similarly with hashes and with subroutines, as you'll see later.

As a convenient shortcut, the arrow operator may be omitted between adjacent subscripts (square brackets or curly braces). Thus, if:

 $array = [ [ 'Dennis', 'Drayna' ], [ 'Callum', 'Bell' ] ];

the following are synonymous:

    print $$array[1][1];
    print $array->[1][1];
    print $array->[1]->[1];

Here's the output:

 BellBellBell

I'll show more examples of this shortcut later in this chapter.
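To see both spellings side by side, here is a minimal sketch that walks a two-level structure like the one above. The variable names $names and @lines are only illustrative:

```perl
use strict;
use warnings;

# A reference to an anonymous array of anonymous arrays:
# each inner array holds a given name and a family name.
my $names = [ [ 'Dennis', 'Drayna' ], [ 'Callum', 'Bell' ] ];

my @lines;
for my $row ( 0 .. $#{$names} ) {
    my $given  = $names->[$row]->[0];   # explicit arrow between subscripts
    my $family = $names->[$row][1];     # arrow omitted between subscripts
    push @lines, "$family, $given";
}

print "$_\n" for @lines;
```

Only the arrow between two subscripts is optional; the first arrow, directly after the reference variable, is still needed in this form.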

2.2.3.2 Anonymous arrays

You can create an anonymous array by surrounding a list with square brackets. (A mnemonic device to remember this bit of syntax is that square brackets are also used with arrays to refer to a particular element, as in $arr[4] .) You can then create a reference to the anonymous array like so:

    $pentamers = ['cggca', 'tgatc', 'ttggc'];
    print "The third and last element of the array is ", $pentamers->[2], "\n";

This gives the output:

 The third and last element of the array is ttggc

In this case, $pentamers is a reference to an (anonymous) array. The third element can equally well be printed using $$pentamers[2] . The entire array is named by prepending an @ sign:

    $pentamers = ['cggca', 'tgatc', 'ttggc'];
    print "The third and last element of the array is $$pentamers[2]\n";
    print "The entire array is: @$pentamers\n";

This produces the output:

    The third and last element of the array is ttggc
    The entire array is: cggca tgatc ttggc
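Square brackets always build a new array, which makes them a handy way to copy an existing one. The following sketch, with the illustrative names $alias and $copy, contrasts a reference to a named array with a reference to an anonymous copy of it:

```perl
use strict;
use warnings;

my @pentamers = ('cggca', 'tgatc', 'ttggc');

my $alias = \@pentamers;       # reference to the named array itself
my $copy  = [ @pentamers ];    # reference to a new anonymous array of copies

$alias->[0] = 'CGGCA';         # changes @pentamers
$copy->[1]  = 'TGATC';         # changes only the anonymous copy

print "original: @pentamers\n";
print "copy:     @$copy\n";
```

The copy is shallow: if the elements were themselves references, both arrays would still share the data those elements point to.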

2.2.4 References to Hashes

References to hashes also follow the same rules as references to scalars and arrays. You make a reference to a hash by prepending a backslash to the % sign; you dereference by prepending the percent sign to the dollar sign on the reference variable:

    %geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes');
    $hashref = \%geneticmarkers;
    print "Here is what's in the reference:\n";
    print $hashref, "\n";
    print "Here is what the reference is pointing to:\n";
    foreach $k (keys %$hashref) {
        print "key\t$k\t\tvalue\t$$hashref{$k}\n";
    }
    print "Dereferencing using the arrow operator:\n";
    foreach $k (keys %$hashref) {
        print "key\t$k\t\tvalue\t$hashref->{$k}\n";
    }

This Perl code produces the following output:

    Here is what's in the reference:
    HASH(0x80fe4c4)
    Here is what the reference is pointing to:
    key        topiary              value        yes
    key        curly                value        yes
    key        hairy                value        no
    Dereferencing using the arrow operator:
    key        topiary              value        yes
    key        curly                value        yes
    key        hairy                value        no

Notice that the keys are printed in a different order than they were specified: hashes do not preserve the order of their keys. (Also recall that, in a double-quoted string, \t prints a tab character.)

If you want one value of the referenced hash, you prepend a dollar sign because the value will be a scalar value. To get one value of a hash, you use a dollar sign, e.g., $geneticmarkers{'curly'} ; to get one value from a reference to a hash, you also use a dollar sign, e.g., $$hashref{'curly'} .

The arrow operator -> works with hashes the way it works with arrays. With hashes, the arrow operator is placed between the name of the reference to the hash and the curly braces. To illustrate:

    %geneticmarkers = ('curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes');
    $hashref = \%geneticmarkers;
    print "For key 'curly' the value is '", $$hashref{'curly'}, "'\n";
    print "For key 'curly' the value is '", $hashref->{'curly'}, "'\n";

This prints:

    For key 'curly' the value is 'yes'
    For key 'curly' the value is 'yes'

2.2.4.1 Anonymous hashes

You can create an anonymous hash by surrounding a list with curly braces. (The mnemonic device to remember this bit of syntax is that curly braces are also used with hashes to refer to a particular key, as in $hash{'curly'} .) You can then create a reference to the anonymous hash like so:

    $geneticmarkers = {'curly' => 'yes', 'hairy' => 'no', 'topiary' => 'yes'};
    print "Here is what is in the anonymous hash:\n";
    foreach $k (keys %$geneticmarkers) {
        print "key\t$k\tvalue\t$geneticmarkers->{$k}\n";
    }

This gives the output:

    Here is what is in the anonymous hash:
    key        topiary        value        yes
    key        curly          value        yes
    key        hairy          value        no

In this case, $geneticmarkers is a reference to an (anonymous) hash. The values can equally well be printed using $$geneticmarkers{$k} or $geneticmarkers->{$k} .
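Anonymous hashes and anonymous arrays can be nested to build larger structures. The sketch below uses made-up gene names and motifs purely for illustration; it stores an anonymous array of motifs under each key of an anonymous hash:

```perl
use strict;
use warnings;

# Hypothetical data: each gene name maps to an anonymous array of motifs.
my $motifs = {
    geneA => [ 'CCGA', 'TTAA' ],
    geneB => [ 'GGCC' ],
};

# Dereference the inner array in order to add to it.
push @{ $motifs->{geneB} }, 'ATAT';

my @report;
for my $gene ( sort keys %$motifs ) {    # sort for a repeatable order
    my $n = scalar @{ $motifs->{$gene} };
    push @report, "$gene has $n motifs, first is $motifs->{$gene}[0]";
}

print "$_\n" for @report;
```

Sorting the keys sidesteps the unpredictable key order mentioned earlier, which matters whenever output needs to be reproducible.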

Curly braces can also be used for blocks and for subroutine definitions. The Perl interpreter can occasionally get confused as to which of these constructs is meant, although it's rare. To be clear, you can put a plus sign + in front of an anonymous hash to specify that it is an anonymous hash and not a block:

    $anonhash = +{ 'one' => 1, 'two' => 2 };
    print "The old $$anonhash{'one'} $anonhash->{'two'}\n";

This prints:

 The old 1 2
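One place where the plus sign is commonly seen is inside the block of a map, where Perl has to guess whether an opening curly brace starts a block or an anonymous hash. This is a minimal sketch with made-up sequences; the keys seq and len are illustrative:

```perl
use strict;
use warnings;

my @sequences = ('ACGT', 'GGCCA');

# The outer braces are map's block; the leading + marks the inner
# braces as an anonymous hash constructor rather than another block.
my @records = map { +{ seq => $_, len => length($_) } } @sequences;

print "$_->{seq} has length $_->{len}\n" for @records;
```

Each element of @records is then a reference to its own anonymous hash.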

2.2.5 References to Subroutines

References to subroutines are yet another way to reference in Perl. This may seem a little odd. References to scalars, arrays, and hashes are references to data structures. But references to subroutines? A subroutine isn't a data structure, so how did this come about?

There are two reasons why references to subroutines make sense the same way that references to data structures make sense. The first reason is that just as variables are managed with Perl's symbol tables, so also are subroutine definitions managed by the symbol table. In Chapter 3, you'll see the deliberate manipulation of a symbol table to make subroutine definitions on the fly. In this sense, subroutines, hashes, arrays, and scalars all refer to data that has a name.

The second reason is that references to subroutines are sometimes a great tool to use when writing a program. There are times when you might apply one of a number of different subroutines depending on the program logic and the input, and using references to subroutines can make this kind of code easier to write. That's the real justification for just about everything you might find in the toolbox that we call a programming language, right? (I admit it sometimes seems that sheer orneriness was the motivation.)

References to subroutines follow the same rules as references to scalars and arrays. Recall that a subroutine name may optionally be prepended with the ampersand sign & when it is called. ^[3] Thus, these two are equivalent:

    findmotif('ATTAATTTTCCGATC');
    &findmotif('ATTAATTTTCCGATC');

^[3] The ampersand was required in older versions of Perl.

To make a reference to a subroutine, you prepend a backslash to the ampersand:

 $subref = \&findmotif;

You dereference a subroutine one of two ways: by prepending an ampersand to the subroutine reference, like so:

    &$subref();

or by using the arrow operator, like so:

    $subref->();

This is demonstrated by the following code fragment (which includes a subroutine definition):

    print "Mark 1:\n";
    findmotif('ATTAATTTTCCGATC');

    print "Mark 2:\n";
    &findmotif('ATTAATTTTCCGATC');

    print "Mark 3:\n";
    $subref = \&findmotif;
    &$subref('ATTAATTTTCCGATC');

    print "Mark 4:\n";
    $subref = \&findmotif;
    $subref->('ATTAATTTTCCGATC');

    print "Mark 5:\n";
    $subref2 = \findmotif;
    &$subref2('ATTAATTTTCCGATC');

    sub findmotif {
        my($input) = @_;

        if($input =~ /CCGA/) {
            print "I found CCGA!\n";
        }else{
            print "No motif\n";
        }
    }

This produces the output:

    Mark 1:
    I found CCGA!
    Mark 2:
    I found CCGA!
    Mark 3:
    I found CCGA!
    Mark 4:
    I found CCGA!
    Mark 5:
    Not a CODE reference at - line 17.

This code defines a little subroutine findmotif that looks for a short motif in DNA sequence data. The first two calls to the subroutine simply demonstrate that you can call subroutines with or without a leading ampersand & . The third calls the subroutine by means of a reference to the subroutine, as just described. The fourth call is by means of a reference to the subroutine using the alternative arrow operator. Finally, the fifth call produces an error; the problem is just a syntactical one; it tries to take a reference to a subroutine by prepending the backslash to the name of the subroutine without including the leading ampersand.

It's useful to remind the gentle reader that the runtime error produced by that fifth call to findmotif occurs only if you don't use the use strict directive (as you are encouraged always to do). Without use strict, the program fails only when it reaches that bad call. With use strict, Perl rejects the bare name findmotif when it compiles the program, so the program complains and fails immediately, before any of it runs. What if that call isn't made until several hours into the running program (which is not an uncommon running time in bioinformatics)? use strict can save a lot of time and effort.

2.2.5.1 Anonymous subroutines

You can create an anonymous subroutine by giving the keyword sub followed by the subroutine body within the usual curly braces. An anonymous subroutine definition is just like a normal subroutine definition, except that the name of the subroutine is omitted. Because the definition is an expression that usually appears inside a statement, such as an assignment, you must follow it with a semicolon. (Recall that subroutine definitions normally are not followed by a semicolon, as with the subroutine findmotif in the previous example.)

You can create a reference to the anonymous subroutine like so:

    $findmotif = sub {
        my($input) = @_;

        if($input =~ /CCGA/) {
            print "I found CCGA!\n";
        }else{
            print "No motif\n";
        }
    };

    $findmotif->('ATTAATTTTCCGATC');
    &$findmotif('ATTAATTTTCCGATC');

This gives the output:

    I found CCGA!
    I found CCGA!

In this case, $findmotif is a reference to an (anonymous) subroutine. The subroutine reference was dereferenced and called twice to show the use of the two alternative choices of syntax: the prepended ampersand and the arrow operator.
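Subroutine references become especially handy when the subroutine to call is chosen at runtime. One common way to arrange this is a hash that maps names to anonymous subroutines, often called a dispatch table. The sketch below is only an illustration; the operation names and the helper apply_operation are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical dispatch table: operation name => anonymous subroutine.
my %operation = (
    'length'  => sub { my ($dna) = @_; return length $dna; },
    'revcomp' => sub {
        my ($dna) = @_;
        my $rc = reverse $dna;            # scalar context: reverse the string
        $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
        return $rc;
    },
);

# Look up the requested subroutine reference and call it with the arrow operator.
sub apply_operation {
    my ($name, $dna) = @_;
    my $code = $operation{$name};
    die "Unknown operation '$name'\n" unless ref($code) eq 'CODE';
    return $code->($dna);
}

my $len = apply_operation('length',  'ATTAATTTTCCGATC');
my $rc  = apply_operation('revcomp', 'ATTAATTTTCCGATC');

print "length:  $len\n";
print "revcomp: $rc\n";
```

Adding a new operation then means adding one entry to the hash, with no change to the code that does the calling.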

2.2.5.2 Passing references to subroutines

Perl flattens all arguments to a subroutine into a single list of scalars. This makes it impossible to tell where one array ends and the next begins when you pass, say, two arrays to a subroutine, as the following example illustrates:

    @aminoacids1 = ('E', 'V', 'L');
    @aminoacids2 = ('D', 'T', 'Y');

    printacids(@aminoacids1, @aminoacids2);

    sub printacids {
        my(@aa1, @aa2) = @_;

        print "Amino acids 1\n";
        print "@aa1\n";
        print "Amino acids 2\n";
        print "@aa2\n";
    }

This prints out:

    Amino acids 1
    E V L D T Y
    Amino acids 2

As you can see, the elements of both arrays are passed to the subroutine by means of the special array @_ , and Perl assigns this entire array to the first local array @aa1 .

In order to pass an arbitrary list of any combination of scalars, arrays, or hashes to a subroutine, it's necessary to pass the values as references. Here's how to fix the previous example:

    @aminoacids1 = ('E', 'V', 'L');
    @aminoacids2 = ('D', 'T', 'Y');

    printacids(\@aminoacids1, \@aminoacids2);

    sub printacids {
        my($aa1, $aa2) = @_;

        print "Amino acids 1\n";
        print "@$aa1\n";
        print "Amino acids 2\n";
        print "@$aa2\n";
    }

This prints out:

    Amino acids 1
    E V L
    Amino acids 2
    D T Y

In this version, the subroutine is passed references to the arrays. Inside the subroutine, the references are collected in the variables $aa1 and $aa2 and are dereferenced to print out their contents using the forms @$aa1 and @$aa2 .

Even when you're passing just one scalar to a subroutine, you might want to pass a reference. Say you have the DNA sequence of human chromosome 1 in a variable $chrom1. You want to pass this sequence into a subroutine that searches for restriction enzyme sites. A problem can arise because passing a variable into a subroutine typically involves copying the data into the subroutine's own variables; for a whole chromosome, that copy uses up a significant portion of your computer's memory.

By passing a reference to the DNA sequence data, you avoid making a copy of the data, and your program will use less memory. It will also run much faster because copying large strings is a fairly time-consuming process for a program.

Here's a simple example of how to pass a scalar reference to a subroutine:

    my $chrom1 = getchrom('1');  # assume we read in human chromosome 1 here

    # Pass a reference to the sequence, not the sequence itself
    my @enzyme_sites = findrestrictionenzymes(\$chrom1, 'HindIII');

    sub findrestrictionenzymes {
        my($seqref, $re) = @_;  # $seqref is a reference to a scalar string
                                # $re contains the name of a restriction enzyme

        # ... program logic follows, where $$seqref is the sequence data ...
    }

Writing programs is a type of engineering, and engineering always seems to come back to the idea of tradeoffs. The downside of passing references to subroutines is that anything the subroutine does to the referenced data stays in effect after the subroutine has exited. This "action at a distance" needs to be treated with care, so as not to modify data unintentionally.
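The tradeoff can be made concrete with a small sketch. Both helper subroutines below are hypothetical; one changes the caller's data through the reference, and the other only reads through it and returns a new value:

```perl
use strict;
use warnings;

# Hypothetical helper: uppercases the caller's sequence in place.
sub uppercase_in_place {
    my ($seqref) = @_;
    $$seqref = uc $$seqref;     # assignment through the reference
    return;
}

# Hypothetical helper: reads through the reference, leaves the referent alone.
sub uppercased_copy {
    my ($seqref) = @_;
    return uc $$seqref;         # uc returns a new string
}

my $dna = 'attaattttccgatc';

my $upper     = uppercased_copy(\$dna);
my $after_one = $dna;           # snapshot: unchanged by the read-only helper

uppercase_in_place(\$dna);      # from here on, $dna itself is changed

print "returned copy:     $upper\n";
print "after read-only:   $after_one\n";
print "after in-place:    $dna\n";
```

Which style to choose depends on the data: in-place modification avoids building a second large string, while returning a new value keeps the caller's data safe from surprises.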

2.2.5.3 Returning references from subroutines

You'll see in Chapter 3 how the subroutine called new returns a reference to an anonymous data structure declared within the subroutine. Until then, I'll defer a detailed discussion of how this works; the bottom line is that a subroutine can return a reference because a reference is "really" just a scalar value.
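As a small preview, here is a sketch of a subroutine that returns a reference to a hash declared inside it. The subroutine make_sequence_record and its keys are hypothetical, and this is not the new method from Chapter 3:

```perl
use strict;
use warnings;

# Hypothetical helper that builds a record and returns a reference to it.
sub make_sequence_record {
    my ($id, $seq) = @_;
    my %record = ( id => $id, seq => $seq, len => length $seq );
    return \%record;    # the hash stays alive as long as a reference to it exists
}

my $first  = make_sequence_record('seq1', 'ACGT');
my $second = make_sequence_record('seq2', 'GGCCA');

print "$first->{id}: $first->{seq} ($first->{len} bases)\n";
print "$second->{id}: $second->{seq} ($second->{len} bases)\n";
```

Each call creates a fresh %record, so the two returned references point to separate hashes.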

2.2.6 Symbolic Versus Hard References

There are two kinds of references, hard and symbolic. Hard references actually point to locations in computer memory.

For example, given a scalar:

    $name = 'Joel';

a hard reference to it is defined like so:

    $nameref = \$name;

and the values associated with the hard reference $nameref can be printed like so:

    print '$nameref has the value ', $nameref, ' and points to the referent ',
        $$nameref, "\n";

This prints:

 $nameref has the value SCALAR(0x80fe4ac) and points to the referent Joel

Symbolic references refer to a name, not an address. As a brief example, let's say we have four array variables @mark1, @mark2, @mark3, and @mark4. It is possible to have another variable that is set to one of these variable names; let's say the variable is called $arrayname and it's set to the value mark3, and that is the array we want to access.

You can place the $arrayname variable in a block. Because a block returns the value of its last expression, this block returns the string mark3 . You can then place the special array symbol @ in front of the block, and Perl will recognize this as meaning the @mark3 array. Here is a demonstration of how this works:

    @mark1 = ( 'a1', 'a2', 'a3', 'a4' );
    @mark2 = ( 'b1', 'b2', 'b3', 'b4' );
    @mark3 = ( 'c1', 'c2', 'c3', 'c4' );
    @mark4 = ( 'd1', 'd2', 'd3', 'd4' );

    $arrayname = 'mark3';

    print "@{$arrayname}\n";

This program prints out the result:

 c1 c2 c3 c4

Symbolic references are avoided by some programmers and used frequently by others; you may sometimes come across them, or even find yourself using them. They are used in the AUTOLOAD methods that install methods at runtime, which you'll learn about in the later chapters.
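Note that use strict disallows symbolic references unless you explicitly turn that check off. A common alternative, sketched below under the assumption that the arrays can live inside one hash, is to key anonymous arrays by the name you would otherwise have used as a variable name; the hash %mark is made up for this example:

```perl
use strict;
use warnings;

# Hash of anonymous arrays, keyed by what would have been the variable name.
my %mark = (
    mark1 => [ 'a1', 'a2', 'a3', 'a4' ],
    mark2 => [ 'b1', 'b2', 'b3', 'b4' ],
    mark3 => [ 'c1', 'c2', 'c3', 'c4' ],
    mark4 => [ 'd1', 'd2', 'd3', 'd4' ],
);

my $arrayname = 'mark3';
my @chosen = @{ $mark{$arrayname} };    # a hard reference looked up by name
print "@chosen\n";

# Under use strict, using the string itself as an array reference is a
# runtime error; eval traps it here so the message can be shown.
my @unused = eval { @{$arrayname} };
my $error  = $@;
print "Symbolic dereference failed: $error";
```

The hash lookup gives the same select-by-name convenience while keeping the protection that use strict provides.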