6.4 SCANTEXT: Processing Bytes | Itanium® Architecture for Programmers: Understanding 64-Bit Processors and EPIC Principles

Modern computers are used to do much more than perform mathematical functions. Many applications analyze character data, which can be stored and examined one byte at a time, as continuous strings, or even as groups of records.

One might suppose that the extract and deposit instructions would be ideal for working with byte data. Given that accessing memory is slower than using registers, it might be tempting to read data as quad words and then analyze the bytes individually, using loops.
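The tempting approach can be modeled in C as a rough sketch. The helper names `extract_byte` and `count_byte_in_quad` are hypothetical and do not come from the book; the shift and mask stand in for an unsigned extract of an 8-bit field from a register. Which byte index corresponds to which character of the text depends on byte order, which this sketch deliberately ignores.

```c
#include <stdint.h>

/* Hypothetical helper: return byte number `index` (0 = least significant)
   of a 64-bit quad word, using a shift and a mask. This loosely models
   an unsigned extract of an 8-bit field at bit position 8*index. */
static unsigned extract_byte(uint64_t quad, unsigned index)
{
    return (unsigned)((quad >> (8u * index)) & 0xFFu);
}

/* Hypothetical helper: count the bytes equal to `target` within one
   quad word by looping over its eight byte positions. */
static unsigned count_byte_in_quad(uint64_t quad, unsigned char target)
{
    unsigned count = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (extract_byte(quad, i) == target)
            count++;
    }
    return count;
}
```

For example, `count_byte_in_quad(quad, 0x20)` would report how many of the eight bytes held in `quad` are ASCII spaces.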

Different operating systems, separated principally into the little-endian camp (e.g., most Linux implementations) and the big-endian camp (e.g., most Unix implementations), store and interpret groups of bytes in opposite ways (Section 2.5). A byte-oriented string or data file will be transportable across systems, and a source program may be syntactically acceptable to assemblers across systems. Yet the program behavior when reading groups of bytes could be radically different on an Itanium system running Linux versus HP-UX.

Despite a facility to indicate the endian preference of a system (see psr.be in Appendix D.7), notice that the parameters in the extract and deposit instructions are all immediate values that cannot be dynamically adjusted. You would have to write two entirely separate routines with parameters appropriate to big- or little-endian data within quad words, and test the endian flag (psr.be) to branch to the appropriate block of code.
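A rough C model of that two-routine structure follows. All function names are hypothetical, and the host byte-order probe merely stands in for testing psr.be; nothing here is taken from the book's code. The point is that the k-th character of the text sits at a different bit position in the quad word depending on byte order, so each routine needs its own position arithmetic.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if this host stores the least significant byte first.
   (A stand-in for testing the processor's endian flag.) */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Character k of the text when the quad was loaded little-endian:
   the first byte in memory lands in the low-order bits. */
static unsigned char text_byte_le(uint64_t quad, unsigned k)
{
    return (unsigned char)(quad >> (8u * k));
}

/* Character k of the text when the quad was loaded big-endian:
   the first byte in memory lands in the high-order bits. */
static unsigned char text_byte_be(uint64_t quad, unsigned k)
{
    return (unsigned char)(quad >> (8u * (7u - k)));
}

/* Load eight bytes as one quad word in host order, then branch to
   the routine that matches the host's byte order.
   Assumes `mem` points to at least eight readable bytes and k < 8. */
static unsigned char text_byte(const unsigned char *mem, unsigned k)
{
    uint64_t quad;
    memcpy(&quad, mem, sizeof quad);
    return host_is_little_endian() ? text_byte_le(quad, k)
                                   : text_byte_be(quad, k);
}
```

In C the shift amount can be computed at run time, so the two routines collapse to two short expressions. With immediate-only field positions, each of the eight byte positions would need its own instruction in each routine.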

We are going to avoid that complexity by loading individual bytes in our program for this section. Suppose that we wish to engage in rudimentary textual analysis, perhaps just to count the number of words in a line of text. We consider words to be separated by spaces, and the end of the line is a zero, or NUL, byte.

The program in Figure 6-3, SCANTEXT, counts the total number of words and characters within those words, including adjacent punctuation.

Figure 6-3 SCANTEXT: An illustration of processing string data

    // SCANTEXT      Text Analyzer
    // This program will count the number of characters
    // (including punctuation but not spaces) in a sentence
    // and find how many words it contained.

             SPACE   = 0x20           // ASCII code for <SP>

             .data                    // Declare storage
             .align  8                // Desired alignment
    QUADS:   .skip   2*8              // Space for results
    TEXT:    stringz "The faster I run the behinder I get."

             .text                    // Section for code
             .align  32               // Desired alignment
             .global main             // These three lines
             .proc   main             //  mark the mandatory
    main:                             //   'main' program entry
             .body                    // Now we really begin...

    first:   mov     r20 = 0          // Gr20 = character count
             mov     r21 = 0          // Gr21 = word count
             addl    r14 = @gprel(TEXT),gp    // Gr14 --> TEXT
             addl    r15 = @gprel(QUADS),gp;; // Gr15 --> QUADS

    next:    ld1     r22 = [r14],1;;  // Get a character; bump
             cmp.eq  p6,p7 = r0,r22   // Null code marks end
       (p6)  br.cond.spnt.few nomore;; //  of our work
             cmp.eq  p6,p7 = SPACE,r22;; // End of word?
       (p7)  add     r20 = 0x1,r20    // No: count a character
       (p6)  add     r21 = 0x1,r21    // Yes: count a word
             br.cond.sptk.few next;;  // Go back for more

    nomore:  add     r21 = 0x1,r21;;  // The last word
             st8     [r15] = r20,8;;  // Number of characters
             st8     [r15] = r21      // Number of words

    done:    mov     r8 = 0           // Signal all is normal
             br.ret.sptk.many b0;;    // Back to command line
             .endp   main             // Mark end of procedure

This illustrative program expects to find a NUL byte (which is appended by the stringz directive) marking the end of the input text. Two address pointers are used, register r14 to step through the string in the main loop, and register r15 to access the information units where the results of program operation will be stored. For each character, a three-way case structure determines whether it is a null (end of line), a space (end of word), or any other character. Multiple adjacent spaces in the stored form of the string would lead to an overestimate of the number of words. The sample sentence in SCANTEXT contains eight words and 29 characters, including the period. The add instruction at the label nomore ensures that we count the last word of the sentence.
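The control flow just described can be cross-checked against a small C model. This is only a sketch of the logic, not a translation endorsed by the book: the `scan_result` structure and the `scan_text` function are hypothetical names, and the struct fields stand in for registers r20 and r21.

```c
#define SPACE 0x20              /* ASCII code for <SP>, as in the listing */

/* Hypothetical result record: the two values SCANTEXT stores at QUADS. */
struct scan_result {
    unsigned long chars;        /* models r20: characters within words */
    unsigned long words;        /* models r21: number of words         */
};

/* Scan a NUL-terminated line one byte at a time. */
static struct scan_result scan_text(const char *text)
{
    struct scan_result r = {0, 0};

    for (;;) {
        char c = *text++;       /* get a character; bump the pointer */
        if (c == '\0')
            break;              /* NUL: end of line                  */
        if (c == SPACE)
            r.words++;          /* space: end of a word              */
        else
            r.chars++;          /* anything else: count a character  */
    }
    r.words++;                  /* count the last word               */
    return r;
}
```

Calling `scan_text("The faster I run the behinder I get.")` yields 29 characters and 8 words, matching the figures above. The model also reproduces the overcounting: `scan_text("a  b")` reports three words because each of the two adjacent spaces is counted as a word boundary.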

Determining whether a character is part of a word, or instead signals the end of a word, is a classic application of an if…then…else structure ideally handled by Itanium predication.
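C has no predicate registers, but the idea can be approximated in a branch-free sketch. The function name `classify` and the local variables named after p6 and p7 are illustrative assumptions only. One comparison produces a pair of complementary truth values, and both additions are always performed, with exactly one of them adding 1.

```c
/* Hypothetical model of the predicated if...then...else in the loop body.
   p6 and p7 here are ordinary integers that are always complementary,
   imitating the predicate pair written by a single compare. */
static void classify(unsigned char c,
                     unsigned long *chars, unsigned long *words)
{
    unsigned long p6 = (c == 0x20);   /* 1 if the byte is a space   */
    unsigned long p7 = !p6;           /* 1 otherwise                */

    *chars += p7;                     /* "No: count a character"    */
    *words += p6;                     /* "Yes: count a word"        */
}
```

Whether a C compiler emits branch-free machine code for this function depends on the compiler and target; the sketch only illustrates the complementary-predicate pattern, in which neither arm of the decision requires a branch.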