Strings

	The Ruby Way, by Hal Fulton

Atoms were once thought to be fundamental, elementary building blocks of nature; protons were then thought to be fundamental, then quarks. Now we say the string is fundamental.

David Gross, professor of theoretical physics, Princeton University

We offer an anecdote here. In the early 1980s, a computer science professor started out his data structures class with a single question. He didn't introduce himself or state the name of the course; he didn't hand out a syllabus or give the name of the textbook. He walked to the front of the class and asked, "What is the most important data type?"

There were one or two guesses. Someone guessed, "Pointers." He brightened, but said no, that wasn't it. Then he offered his opinion: The most important data type was character data.

He had a valid point. Computers are supposed to be our servants, not our masters, and character data has the distinction of being human readable. (Some humans can read binary data easily, but we will ignore them.) The existence of characters (and thus strings) enables communication between humans and computers. Every kind of information we can imagine, including natural language text, can be encoded in character strings.

What do we find ourselves wanting to do with strings? We want to concatenate them, tokenize them, analyze them, perform searches and substitutions, and more. Ruby makes most of these tasks easy.

Performing Specialized String Comparisons

Ruby has built-in ideas about comparing strings; comparisons are done lexicographically as we have come to expect (that is, based on character set order). But if we want, we can introduce rules of our own for string comparisons, and these can be of arbitrary complexity.

As an example, suppose that we want to ignore the English articles a, an, and the at the front of a string, and we also want to ignore most common punctuation marks. We can do this by overriding the built-in method <=>, which is called for <, <=, >, and >= (see Listing 2.1).

Listing 2.1 Specialized String Comparisons

 class String
   alias old_compare <=>

   def <=>(other)
     a = self.dup
     b = other.dup
     # Remove punctuation
     a.gsub!(/[\,\.\?\!\:\;]/, "")
     b.gsub!(/[\,\.\?\!\:\;]/, "")
     # Remove initial articles
     a.gsub!(/^(a |an |the )/i, "")
     b.gsub!(/^(a |an |the )/i, "")
     # Remove leading/trailing whitespace
     a.strip!
     b.strip!
     # Use the old <=>
     a.old_compare(b)
   end
 end

 title1 = "Calling All Cars"
 title2 = "The Call of the Wild"

 # Ordinarily this would print "yes"
 if title1 < title2
   puts "yes"
 else
   puts "no"         # But now it prints "no"
 end

Note that we save the old <=> with an alias and then call it at the end. This is because if we tried to use the < method, it would call the new <=> rather than the old one, resulting in infinite recursion and a program crash.

Note also that String's == operator doesn't call the <=> method; String defines == itself rather than relying on the version mixed in from Comparable. This means that if we need to check equality in some specialized way, we will have to override the == method separately. But in this case, == works as we want it to anyhow.
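Where redefining <=> for every string in the program is more than is wanted, the same rules can be applied through a sort key instead. The following is only a minimal sketch; the helper name title_key is hypothetical and is not part of Ruby or of Listing 2.1.

```ruby
# title_key is a hypothetical helper: it builds the text that is
# actually compared, following the same rules as Listing 2.1.
def title_key(title)
  key = title.gsub(/[,.?!:;]/, "")      # remove punctuation
  key = key.sub(/\A(a|an|the) /i, "")   # remove one initial article
  key.strip                             # remove leading/trailing whitespace
end

titles = ["Calling All Cars", "A Tale of Two Cities", "The Call of the Wild"]
sorted = titles.sort_by { |t| title_key(t) }
puts sorted
# The Call of the Wild
# Calling All Cars
# A Tale of Two Cities
```

Because only the sort key changes, ordinary string comparisons elsewhere in the program are unaffected.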

Tokenizing a String

The split method will parse a string and return an array of tokens. It accepts two parameters: a delimiter and a field limit (an integer).

The delimiter defaults to whitespace. Actually, it uses $; or the English equivalent $FIELD_SEPARATOR. If the delimiter is a string, the explicit value of that string is used as a token separator.

 s1 = "It was a dark and stormy night."
 words = s1.split          # ["It", "was", "a", "dark", "and",
                           #  "stormy", "night."]
 s2 = "apples, pears, and peaches"
 list = s2.split(", ")     # ["apples", "pears", "and peaches"]
 s3 = "lions and tigers and bears"
 zoo = s3.split(/ and /)   # ["lions", "tigers", "bears"]

The limit parameter places an upper limit on the number of fields returned, according to these rules:

If it is omitted, trailing null entries are suppressed.
If it is a positive number, the number of entries will be limited to that number (stuffing the rest of the string into the last field as needed). Trailing null entries are retained.
If it is a negative number, there is no limit to the number of fields, and trailing null entries are retained.

These three rules are illustrated here:

 str = "alpha,beta,gamma,,"
 list1 = str.split(",")     # ["alpha","beta","gamma"]
 list2 = str.split(",",2)   # ["alpha", "beta,gamma,,"]
 list3 = str.split(",",4)   # ["alpha", "beta", "gamma", ","]
 list4 = str.split(",",8)   # ["alpha", "beta", "gamma", "", ""]
 list5 = str.split(",",-1)  # ["alpha", "beta", "gamma", "", ""]
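A positive limit is handy when only the first delimiter is meaningful. The sketch below assumes a made-up "key=value" line whose value may itself contain the delimiter; parse_pair is a hypothetical name used only for illustration.

```ruby
# parse_pair is a hypothetical helper, not a built-in method.
def parse_pair(line)
  # A limit of 2 stops splitting after the first "=",
  # so any later "=" stays inside the value.
  key, value = line.split("=", 2)
  [key, value]
end

p parse_pair("query=a=1&b=2")      # ["query", "a=1&b=2"]
p "a,b,,".split(",")               # ["a", "b"]
p "a,b,,".split(",", -1)           # ["a", "b", "", ""]
```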

Formatting a String

String formatting is done in Ruby as it is in C: with the sprintf method. It takes a string and a list of expressions as parameters and returns a string. The format string contains essentially the same set of specifiers that are available with C's sprintf (or printf).

 name = "Bob"
 age = 28
 str = sprintf("Hi, %s... I see you're %d years old.", name, age)

You might ask why we would use this instead of simply interpolating values into a string using the #{expr} notation. The answer is that sprintf makes it possible to do extra formatting such as specifying a field width, specifying a maximum number of decimal places, adding or suppressing leading zeroes, left-justifying, right-justifying, and more.

 str = sprintf("%-20s  %3d", name, age)

The String class has a method %, which will do much the same thing. It takes a single value or an array of values of any type.

 str = "%-20s  %3d" % [name, age]  # Same as previous example
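As a small illustration of the extra control that format specifiers give, the sketch below lines up a name, a quantity, and a price in fixed-width columns. The method name report_line and the sample values are invented for this example.

```ruby
# report_line is a hypothetical helper used only for illustration.
def report_line(name, qty, price)
  # %-10s : left-justified in 10 columns
  # %5d   : right-justified integer in 5 columns
  # %8.2f : 8 columns wide, 2 decimal places
  sprintf("%-10s %5d %8.2f", name, qty, price)
end

puts report_line("Widget", 7, 3.5)
puts report_line("Gadget", 120, 19.999)
puts sprintf("%05d", 42)     # 00042 (leading zeroes)
puts "%+d" % 42              # +42   (explicit sign)
```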

We also have the methods ljust, rjust, and center; these take a length for the destination string and pad with spaces as needed.

 str = "Moby-Dick"
 s1 = str.ljust(13)           # "Moby-Dick    "
 s2 = str.center(13)          # "  Moby-Dick  "
 s3 = str.rjust(13)           # "    Moby-Dick"

For more information, see any reference.

Controlling Uppercase and Lowercase

Ruby's String class offers a rich set of methods for controlling case. We offer an overview of them here.

The downcase method will convert a string to all lowercase. Likewise upcase will convert it to all uppercase.

 s1 = "Boston Tea Party"
 s2 = s1.downcase             # "boston tea party"
 s3 = s2.upcase               # "BOSTON TEA PARTY"

The capitalize method will capitalize the first character of a string while forcing all the remaining characters to be lowercase.

 s4 = s1.capitalize           # "Boston tea party"
 s5 = s2.capitalize           # "Boston tea party"
 s6 = s3.capitalize           # "Boston tea party"

The swapcase method will exchange the case of each letter in a string.

 s7 = "THIS IS AN ex-parrot."
 s8 = s7.swapcase             # "this is an EX-PARROT."

Each of these has its in-place equivalent (upcase!, downcase!, capitalize!, swapcase!).

There are no built-in methods for detecting case, but this is easy to do with regular expressions.

 if string =~ /[a-z]/
   puts "string contains lowercase characters"
 end
 if string =~ /[A-Z]/
   puts "string contains uppercase characters"
 end
 if string =~ /[A-Z]/ and string =~ /[a-z]/
   puts "string contains mixed case"
 end
 if string[0..0] =~ /[A-Z]/
   puts "string starts with a capital letter"
 end

Note that all these methods ignore locale.
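The regular-expression tests above can be wrapped in small predicate methods. This is only a sketch: the method names are hypothetical, and like the tests above they consider only the ASCII letters A-Z and a-z.

```ruby
# Hypothetical predicate helpers; ASCII letters only.
def has_lower?(str)
  !(str =~ /[a-z]/).nil?
end

def has_upper?(str)
  !(str =~ /[A-Z]/).nil?
end

def mixed_case?(str)
  has_lower?(str) && has_upper?(str)
end

def capitalized?(str)
  !(str =~ /\A[A-Z]/).nil?     # \A anchors at the start of the string
end

p mixed_case?("Boston Tea Party")   # true
p mixed_case?("BOSTON")             # false
p capitalized?("boston")            # false
```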

Accessing and Assigning Substrings

In Ruby, substrings can be accessed in several different ways. Normally the bracket notation is used, as for an array; but the brackets can contain a pair of Fixnums, a range, a regex, or a string. Each case is discussed in turn.

If a pair of Fixnum values is specified, they are treated as an offset and a length, and the corresponding substring is returned:

 str = "Humpty Dumpty"
 sub1 = str[7,4]         # "Dump"
 sub2 = str[7,99]        # "Dumpty" (overrunning is OK)
 sub3 = str[10,-4]       # nil (length is negative)

It is important to remember that these are an offset and a length (number of characters), not beginning and ending offsets.

A negative index counts backward from the end of the string. In this case, the index is one-based, not zero-based. The length is still added in the forward direction.

 str1 = "Alice"
 sub1 = str1[-3,3]   # "ice"
 str2 = "Through the Looking-Glass"
 sub3 = str2[-13,4]  # "Look"

A range can be specified. In this case, the range is taken as a range of indices into the string. Ranges can have negative numbers, but the numerically lower number must still be first in the range. If the range is backward or if the initial value is outside the string, nil is returned.

 str = "Winston Churchill"
 sub1 = str[8..13]    # "Church"
 sub2 = str[-4..-1]   # "hill"
 sub3 = str[-1..-4]   # nil
 sub4 = str[25..30]   # nil

If a regular expression is specified, the string matching that pattern will be returned. If there is no match, nil will be returned.

 str = "Alistair Cooke"
 sub1 = str[/l..t/]   # "list"
 sub2 = str[/s.*r/]   # "stair"
 sub3 = str[/foo/]    # nil

If a string is specified, that string will be returned if it appears as a substring (or nil if it doesn't).

 str = "theater"
 sub1 = str["heat"]  # "heat"
 sub2 = str["eat"]   # "eat"
 sub3 = str["ate"]   # "ate"
 sub4 = str["beat"]  # nil
 sub5 = str["cheat"] # nil

Finally, in the trivial case, a single Fixnum as index will yield an ASCII code (or nil if out of range).

 str = "Aaron Burr"
 ch1 = str[0]     # 65
 ch2 = str[1]     # 97
 ch3 = str[99]    # nil

It is important to realize that the notations we have described here will serve for assigning values as well as for accessing them.

 str1 = "Humpty Dumpty"
 str1[7,4] = "Moriar"     # "Humpty Moriarty"
 str2 = "Alice"
 str2[-3,3] = "exandra"   # "Alexandra"
 str3 = "Through the Looking-Glass"
 str3[-13,13]  = "Mirror" # "Through the Mirror"
 str4 = "Winston Churchill"
 str4[8..13] = "H"        # "Winston Hill"
 str5 = "Alistair Cooke"
 str5[/e$/] ="ie Monster" # "Alistair Cookie Monster"
 str6 = "theater"
 str6["er"] = "re"        # "theatre"
 str7 = "Aaron Burr"
 str7[0] = 66             # "Baron Burr"

Assigning to an expression evaluating to nil will have no effect.

Substituting in Strings

You've already seen how to perform simple substitutions in strings. The sub and gsub methods provide more advanced pattern-based capabilities. There are also sub! and gsub!, which are their in-place counterparts.

The sub method will substitute the first occurrence of a pattern with the given substitute string or the given block.

 s1 = "spam, spam, and eggs"
 s2 = s1.sub(/spam/,"bacon")               # "bacon, spam, and eggs"
 s3 = s2.sub(/(\w+), (\w+),/,'\2, \1,')    # "spam, bacon, and eggs"
 s4 = "Don't forget the spam."
 s5 = s4.sub(/spam/) { |m| m.reverse }     # "Don't forget the maps."
 s4.sub!(/spam/) { |m| m.reverse }
 # s4 is now "Don't forget the maps."

As this example shows, the special symbols \1, \2, and so on can be used in a substitute string. However, special variables such as $& (or the English version $MATCH) cannot be relied on there, because the substitute string is evaluated before the match is made.

If the block form is used, the special variables can be used. However, if all you need is the matched string, it will be passed into the block as a parameter. If it isn't needed at all, the parameter can of course be omitted.

The gsub method (global substitution) is essentially the same except that all matches are substituted rather than just the first.

 s5 = "alfalfa abracadabra"
 s6 = s5.gsub(/a[bl]/,"xx")     # "xxfxxfa xxracadxxra"
 s5.gsub!(/[lfdbr]/) { |m| m.upcase + "-" }
 # s5 is now "aL-F-aL-F-a aB-R-acaD-aB-R-a"

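The method Regexp.last_match returns the match data for the most recent match (the same object as $~); its element 0 is the matched text, which is what $& or $MATCH contains.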
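The block form is most useful when the replacement has to be computed from the match. As a minimal sketch (the method name to_celsius and the sample sentence are made up for illustration), the block below reads the captured digits from $1 and returns the converted value:

```ruby
# to_celsius is a hypothetical helper used only for this example.
def to_celsius(text)
  # (-?\d+) captures the number; F\b requires an "F" that ends the token
  text.gsub(/(-?\d+)F\b/) do
    celsius = (($1.to_i - 32) * 5.0 / 9.0).round
    "#{celsius}C"
  end
end

puts to_celsius("Low 32F, high 212F")   # Low 0C, high 100C
```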

Searching a String

Besides the techniques for accessing substrings, there are other ways of searching within strings. The index method will return the starting location of the specified substring, character, or regex. If the item isn't found, the result is nil.

 str = "Albert Einstein"
 pos1 = str.index(?E)     # 7
 pos2 = str.index("bert") # 2
 pos3 = str.index(/in/)   # 8
 pos4 = str.index(?W)     # nil
 pos5 = str.index("bart") # nil
 pos6 = str.index(/wein/) # nil
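The index method also accepts an optional starting offset as a second argument, which makes it possible to walk through every occurrence. The helper below is only a sketch; all_positions is a hypothetical name.

```ruby
# all_positions is a hypothetical helper, not a built-in method.
def all_positions(str, sub)
  positions = []
  pos = str.index(sub)
  while pos
    positions << pos
    pos = str.index(sub, pos + 1)   # resume just past the last hit
  end
  positions
end

p all_positions("Albert Einstein", "in")   # [8, 13]
p all_positions("abracadabra", "abra")     # [0, 7]
p all_positions("abracadabra", "xyz")      # []
```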

The method rindex (right index) will start from the right side of the string (that is, from the end). The numbering, however, proceeds from the beginning as usual.

 str = "Albert Einstein"
 pos1 = str.rindex(?E)     # 7
 pos2 = str.rindex("bert") # 2
 pos3 = str.rindex(/in/)   # 13 (finds rightmost match)
 pos4 = str.rindex(?W)     # nil
 pos5 = str.rindex("bart") # nil
 pos6 = str.rindex(/wein/) # nil

The include? method simply tells whether the specified substring or character occurs within the string.

 str1 = "mathematics"
 flag1 = str1.include? ?e         # true
 flag2 = str1.include? "math"     # true
 str2 = "Daylight Saving Time"
 flag3 = str2.include? ?s         # false
 flag4 = str2.include? "Savings"  # false

The scan method will repeatedly scan for occurrences of a pattern. If called without a block, it will return an array. If the pattern has more than one (parenthesized) group, the array will be nested.

 str1 = "abracadabra"
 sub1 = str1.scan(/a./)
 # sub1 now is ["ab","ac","ad","ab"]
 str2 = "Acapulco, Mexico"
 sub2 = str2.scan(/(.)(c.)/)
 # sub2 now is [ ["A","ca"], ["l","co"], ["i","co"] ]

If a block is specified, the method will pass the successive values to the block:

 str3 = "Kobayashi"
 str3.scan(/[^aeiou]+[aeiou]/) do |x|
   print "Syllable: #{x}\n"
 end

This code will produce the following output:

 Syllable: Ko
 Syllable: ba
 Syllable: ya
 Syllable: shi
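When the pattern contains groups, the block receives one value per group. A minimal sketch of using this to collect key=value pairs follows; the method name parse_pairs and the input format are invented for the example.

```ruby
# parse_pairs is a hypothetical helper used only for illustration.
def parse_pairs(text)
  pairs = {}
  # Two groups in the pattern, so the block gets two values per match
  text.scan(/(\w+)=(\w+)/) { |key, value| pairs[key] = value }
  pairs
end

p parse_pairs("host=alpha port=8080 mode=fast")
# {"host"=>"alpha", "port"=>"8080", "mode"=>"fast"}
```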

Converting Between Characters and ASCII Codes

In Ruby, a character is already an integer.

 str = "Martin"
 print str[0]        # 77

If a Fixnum is appended directly onto a string, it is converted to a character.

 str2 = str << 111   # "Martino"

The method length can be used for finding the length of a string. A synonym is size.

 str1 = "Carl"
 x = str1.length     # 4
 str2 = "Doyle"
 x = str2.size       # 5

Processing a Line at a Time

A Ruby string can contain newlines. For example, a small enough file can be read into memory and stored in a single string. The default iterator each will process such a string one line at a time.

 str = "Once upon\na time...\nThe End\n"
 num = 0
 str.each do |line|
   num += 1
   print "Line #{num}: #{line}"
 end

This code produces three lines of output:

 Line 1: Once upon
 Line 2: a time...
 Line 3: The End

The method each_with_index could also be used in this case.
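On current Ruby versions, String no longer responds to each, and each_line does the same job. The sketch below assumes such a version; number_lines is a hypothetical helper name.

```ruby
# number_lines is a hypothetical helper used only for illustration.
def number_lines(text)
  # with_index(1) numbers the lines starting from 1
  text.each_line.with_index(1).map do |line, num|
    "Line #{num}: #{line.chomp}"
  end
end

puts number_lines("Once upon\na time...\nThe End\n")
# Line 1: Once upon
# Line 2: a time...
# Line 3: The End
```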

Processing a Byte at a Time

Because Ruby isn't fully internationalized at the time of this writing, a character is essentially the same as a byte. To process these in sequence, use the each_byte iterator.

 str = "ABC"
 str.each_byte do |char|
   print char, " "
 end
 # Produces output: 65 66 67

Appending an Item onto a String

The append operator << can be used to append strings onto another string. It is stackable in that multiple operations can be performed in sequence on a given receiver.

 str = "A"
 str << [1,2,3].to_s << " " << (3.14).to_s
 # str is now "A123 3.14"

If a Fixnum in the range 0 to 255 is specified, it will be converted to a character.

 str = "Marlow"
 str << 101 << ", Christopher"
 # str is now "Marlowe, Christopher"

Removing Trailing Newlines and Other Characters

Often we want to remove extraneous characters from the end of a string. The prime example is a newline on a string read from input.

The chop method will remove the last character of the string (typically, a trailing newline character). If the character before the newline is a carriage return (\r), it will be removed also. The reason for this behavior is the discrepancy between different systems' concept of what a newline is. On some systems such as UNIX, the newline character is represented internally as a linefeed (\n). On others, such as DOS and Windows, it is stored as a carriage return followed by a linefeed (\r\n).

 str = gets.chop         # Read string, remove newline
 s2 = "Some string\n"    # "Some string" (no newline)
 s3 = s2.chop!           # s2 is now "Some string" also
 s4 = "Other string\r\n"
 s4.chop!                # "Other string" (again no newline)

Note that the in-place version of the method (chop!) will modify its receiver.

It is also very important to note that in the absence of a trailing newline, the last character will be removed anyway.

 str = "abcxyz"
 s1 = str.chop           # "abcxy"

Because a newline might not always be present, the chomp method might be a better alternative.

 str = "abcxyz"
 str2 = "123\n"
 s1 = str.chomp          # "abcxyz"
 s2 = str2.chomp         # "123"

There is also a chomp! method as we would expect.

If a parameter is specified for chomp, that string will be removed from the end of the receiver (if it is present there) rather than the default record separator. Note that if the specified string appears only in the middle of the receiver, it is ignored.

 str1 = "abcxyz"
 str2 = "abcxyz"
 s1 = str1.chomp("yz")   # "abcx"
 s2 = str2.chomp("x")    # "abcxyz"

Trimming Whitespace from a String

The strip method will remove whitespace from the beginning and end of a string. Its counterpart strip! will modify the receiver in place.

 str1 = "\t  \nabc  \t\n"
 str2 = str1.strip         # "abc"
 str3 = str1.strip!        # "abc"
 # str1 is now "abc" also

Whitespace, of course, consists mostly of blanks, tabs, and end-of-line characters.

If we want to remove whitespace only from the beginning of a string, it is better to do it another way. Here we do substitution with the sub method. (Here \s matches a whitespace character.)

 str1 = "\t  \nabc  \t\n"
 # Remove from beginning of string
 str2 = str1.sub(/^\s*/,"")   # "abc  \t\n"

However, note that removing whitespace from the end of a string is problematic. If we only remove spaces and tabs, we are fine; but if we try to remove a newline, we run into difficulties. This is because a newline is considered to mark the end of a string; the dollar sign ($) will match the earliest newline even if multiline mode is being used. So the naive method of using $ won't work. Here we show a technique that will work even for newlines; it is unconventional but effective.

 str3 = str1.reverse.sub(/^[ \t\n]*/,"").reverse
 # Reverse the string; remove the whitespace; reverse it again
 # Result is "\t  \nabc"
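A less roundabout alternative, assuming a Ruby version that supports the \z anchor (which matches only at the very end of the string, regardless of newlines) and the rstrip and lstrip methods, might look like the following sketch. The helper name trim_end is hypothetical.

```ruby
# trim_end is a hypothetical helper used only for illustration.
def trim_end(str)
  # \s+ followed by \z: a run of whitespace that reaches the true end
  str.sub(/\s+\z/, "")
end

str1 = "\t  \nabc  \t\n"
p trim_end(str1)    # "\t  \nabc"
p str1.rstrip       # "\t  \nabc"
p str1.lstrip       # "abc  \t\n"
```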

Repeating Strings

In Ruby, the multiplication operator (or method) is overloaded to enable repetition of strings. If a string is multiplied by n, the result is n copies of the original string concatenated together.

 yell = "Fight! "*3    # "Fight! Fight! Fight! "
 ruler = "+" + ("."*4+"5"+"."*4+"+")*3
 # "+....5....+....5....+....5....+"

Expanding Variables in Input

This is a case of the use-mention distinction that is so common in computer science: Am I using this entity or only mentioning it? Suppose that a piece of data from outside the program (for example, from user input) is to be treated as containing a variable name or expression. How can we evaluate that expression?

The eval method comes to our rescue. Suppose that we want to read in a variable name and tell the user what its value is. The following fragment demonstrates this idea:

 alpha=10
 beta=20
 gamma=30
 print "Enter a variable name: "
 str = gets.chop!
 result = eval(str)
 puts "#{str} = #{result}"

If the user enters alpha, for instance, the program will respond with alpha = 10.

However, we will point out a potential danger here. It is conceivable that a malicious user could enter a string specifically designed to run an external program and produce side effects that the programmer never intended or dreamed of. For example, on a UNIX system, one might enter %x[rm -rf *] as an input string. When the program evaluated that string, it would recursively remove all the files under the current directory!

For this reason, you must exercise caution when doing an eval of a string you didn't build yourself. (This is particularly true in the case of Web-based software that is accessible by anyone on the Internet.) For example, you could scan the string and verify that it didn't contain backticks, the %x notation, the method name system, and so on.
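When the set of names the user may ask about is known in advance, one cautious alternative is to avoid eval altogether and look the name up in a table. This is a sketch of that idea rather than a general replacement for eval; the constant KNOWN_VALUES and the method lookup are hypothetical names.

```ruby
# KNOWN_VALUES and lookup are hypothetical names for this illustration.
KNOWN_VALUES = { "alpha" => 10, "beta" => 20, "gamma" => 30 }

def lookup(input)
  name = input.strip
  if KNOWN_VALUES.key?(name)
    "#{name} = #{KNOWN_VALUES[name]}"
  else
    # Anything not in the table is reported, never executed
    "Unknown name: #{name.inspect}"
  end
end

puts lookup("alpha\n")          # alpha = 10
puts lookup("%x[rm -rf *]")     # Unknown name: "%x[rm -rf *]"
```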

Embedding Expressions Within Strings

The #{ } notation makes embedding expressions within strings easy. You need not worry about converting, appending, and concatenating; you can interpolate a variable value or other expression at any point in a string.

 puts "#{temp_f} Fahrenheit is #{temp_c} Celsius"
 puts "The discriminant has the value #{b*b - 4*a*c}."
 puts "#{word} is #{word.reverse} spelled backward."

Some shortcuts for global, class, and instance variables can be used so that the braces can be dispensed with.

 print "$gvar = #$gvar and ivar = #@ivar."

Note that this technique isn't applicable for single-quoted strings (because their contents aren't expanded), but it does work for double-quoted documents and regular expressions.

Parsing Comma-separated Data

Comma-delimited data is common in computing. It is a kind of lowest common denominator of data interchange, which is used (for example) to transfer information between incompatible databases or applications that know no other common format.

We assume here that we have a mixture of strings and numbers and all strings are enclosed in quotes. We further assume that all characters are escaped as necessary (commas and quotes inside strings, and so on).

The problem becomes simple because this data format looks suspiciously like a Ruby array of mixed types. In fact, we can simply add brackets to enclose the whole expression, and we have an array of items.

 string = gets.chop!
 # Suppose we read in a string like this one:
 # "Doe, John", 35, 225, "5'10\"", "555-0123"
 data = eval("[" + string + "]")   # Convert to array
 data.each { |x| puts "Value = #{x}" }

This fragment will produce the following output:

 Value = Doe, John
 Value = 35
 Value = 225
 Value = 5'10"
 Value = 555-0123
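Because eval will run whatever the line contains, it is only appropriate for input that is fully trusted. For input from elsewhere, a small scanner built on scan can do the same job under the same assumptions (strings quoted, embedded quotes escaped with a backslash). The following is a minimal sketch, not a complete parser; parse_fields is a hypothetical name.

```ruby
# parse_fields is a hypothetical helper used only for illustration.
def parse_fields(line)
  # First alternative: a double-quoted field with backslash escapes.
  # Second alternative: a bare field (no commas or whitespace).
  line.scan(/"((?:\\.|[^"\\])*)"|([^,\s]+)/).map do |quoted, bare|
    if quoted
      quoted.gsub(/\\(.)/) { $1 }     # undo backslash escapes
    elsif bare =~ /\A-?\d+\z/
      bare.to_i
    elsif bare =~ /\A-?\d+\.\d+\z/
      bare.to_f
    else
      bare
    end
  end
end

line = %q{"Doe, John", 35, 225, "5'10\"", "555-0123"}
parse_fields(line).each { |x| puts "Value = #{x}" }
```

Unlike the eval version, nothing in the input is ever executed; unrecognized bare fields are simply returned as strings.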

Converting Strings to Numbers (Decimal and Otherwise)

Frequently, we need to capture a number that is embedded in a string. For the simple cases, we can use to_f and to_i to convert to floating point numbers and integers, respectively. Each will ignore extraneous characters at the end of the string, and each will return a zero if no number is found.

 num1 = "237".to_i                     # 237
 num2 = "50 ways to leave...".to_i     # 50
 num3 = "You are number 6".to_i        # 0
 num4 = "no number here at all".to_i   # 0
 num5 = "3.1416".to_f                  # 3.1416
 num6 = "0.6931 is ln 2".to_f          # 0.6931
 num7 = "ln 2 is 0.6931".to_f          # 0.0
 num8 = "nothing to see here".to_f     # 0.0

Octal and hexadecimal can similarly be converted with the oct and hex methods as shown in the following. Signs are optional as with decimal numbers.

 oct1 = "245".oct                      # 165
 oct2 = "245 Days".oct                 # 165
 # Leading zeroes are irrelevant.
 oct3 = "0245".oct                     # 165
 oct4 = "-123".oct                     # -83
 # Non-octal digits cause a halt
 oct5 = "23789".oct                    # 159 (only "237" is read)
 hex1 = "dead".hex                     # 57005
 # Uppercase is irrelevant
 hex2 = "BEEF".hex                     # 48879
 # Non-hex letter/digit causes a halt
 hex3 = "beefsteak".hex                # 48879
 hex4 = "0x212a".hex                   # 8490
 hex5 = "unhexed".hex                  # 0

There is no bin method to convert from binary, but you can write your own (see Listing 2.2). Notice that it follows all the same rules of behavior as oct and hex.

Listing 2.2 Converting from Binary

 class String
   def bin
     val = self.strip
     pattern = /^([+-]?)(0b)?([01]+)(.*)$/
     parts = pattern.match(val)
     return 0 if not parts
     sign = parts[1]
     num  = parts[3]
     eval(sign+"0b"+num)
   end
 end

 a = "10011001".bin       # 153
 b = "0b10011001".bin     # 153
 c = "0B1001001".bin      # 0
 d = "nothing".bin        # 0
 e = "0b100121001".bin    # 9
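On Ruby versions where to_i accepts a base argument, much the same result can be had without eval. The sketch below assumes such a version; the helper names bin_value and strict_bin are hypothetical. Integer with a base is shown as a stricter variant that rejects malformed input instead of stopping at the first bad character.

```ruby
# bin_value and strict_bin are hypothetical helpers for illustration.
def bin_value(str)
  str.strip.to_i(2)        # lenient: stops at the first non-binary digit
end

def strict_bin(str)
  Integer(str, 2)          # strict: the whole string must be valid
rescue ArgumentError
  nil
end

p bin_value("10011001")      # 153
p bin_value("0b10011001")    # 153
p bin_value("nothing")       # 0
p bin_value("0b100121001")   # 9
p bin_value("-101")          # -5
p strict_bin("10011001")     # 153
p strict_bin("100121001")    # nil
```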

Encoding and Decoding `rot13` Text

The rot13 method is perhaps the weakest form of encryption known to humankind. Its historical use is simply to prevent people from accidentally reading a piece of text. It is commonly seen in Usenet; for example, a joke that might be considered offensive might be encoded in rot13, or you could post the entire plot of Star Wars: Episode II the day before the premiere.

The encoding method consists simply of rotating a string through the alphabet, so A becomes N, B becomes O, and so on. Lowercase letters are rotated in the same way; digits, punctuation, and other characters are ignored. Because 13 is half of 26 (the size of our alphabet), the function is its own inverse; applying it a second time will decrypt it.

The following is an implementation as a method added to the String class. We present it without further comment.

 class String
   def rot13
     self.tr("A-Ma-mN-Zn-z","N-Zn-zA-Ma-m")
   end
 end

 joke = "Y2K bug"
 joke13 = joke.rot13    # "L2X oht"
 episode2 = "Fcbvyre: Nanxva qbrfa'g trg xvyyrq."
 puts episode2.rot13

Obscuring Strings

Sometimes we don't want strings to be immediately legible. For example, passwords shouldn't be stored in plain text, no matter how tight the file permissions are.

The standard method crypt uses the standard function of the same name in order to DES-encrypt a string. It takes a "salt" value as a parameter (similar to the seed value for a random number generator).

A trivial application for this is shown in the following, where we ask for a password that Tolkien fans should know.

 coded = "hfCghHIE5LAM."
 puts "Speak, friend, and enter!"
 print "Password: "
 password = gets.chop
 if password.crypt("hf") == coded
   puts "Welcome!"
 else
   puts "What are you, an orc?"
 end

There are other conceivable uses for hiding strings. For example, we sometimes want to hide strings inside a file so that they aren't easily read. Even a binary file can have readable portions easily extracted by the UNIX strings utility or the equivalent, but a DES encryption will stop all but the most determined crackers.

It is worth noting that you should never rely on encryption of this nature for a server-side Web application. That is because a password entered on a Web form is still transmitted over the Internet in plaintext. In a case like this, the easiest security measure is the Secure Sockets Layer (SSL). Of course, you could still use encryption on the server side, but for a different reason: to protect the password as it is stored rather than during transmission.

Counting Characters in Strings

The count method will count the number of occurrences of any set of specified characters.

 s1 = "abracadabra"
 a  = s1.count("c")      # 1
 b  = s1.count("bdr")    # 5

The string parameter is similar to a very simple regular expression. If it starts with a caret, the list is negated.

 c = s1.count("^a")      # 6
 d = s1.count("^bdr")    # 6

A hyphen indicates a range of characters.

 e = s1.count("a-d")     # 9
 f = s1.count("^a-d")    # 2
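On current Ruby versions, count also accepts more than one set, in which case only characters that belong to every set are counted. The sketch below relies on that behavior; the sample strings are arbitrary.

```ruby
s1 = "abracadabra"

vowels     = s1.count("aeiou")           # 5
# Lowercase letters that are NOT vowels (intersection of the two sets)
consonants = s1.count("a-z", "^aeiou")   # 6

s2 = "hello world"
both = s2.count("lo", "o")               # 2 (only "o" is in both sets)

p vowels, consonants, both
```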

Reversing a String

A string can be reversed very simply with the reverse method (or its in-place counterpart reverse!).

 s1 = "Star Trek"
 s2 = s1.reverse         # "kerT ratS"
 s1.reverse!             # s1 is now "kerT ratS"

Suppose that you have a sentence and need to reverse the word order (rather than character order). Use the %w operator to make it an array of words, reverse the array, and then use join to rejoin them.

 words = %w( how now brown cow )
 # ["how", "now", "brown", "cow"]
 words.reverse.join(" ")
 # "cow brown now how"

This can be generalized with String#split, which allows you to divide the words based on your own pattern.

 phrase = "Now here's a sentence"
 phrase.split(" ").reverse.join(" ")
 # "sentence a here's Now"

Removing Duplicate Characters

Runs of duplicate characters can be removed using the squeeze method.

 s1 = "bookkeeper"
 s2 = s1.squeeze         # "bokeper"
 s3 = "Hello..."
 s4 = s3.squeeze         # "Helo."

If a parameter is specified, only those characters will be squeezed.

 s5 = s3.squeeze(".")    # "Hello."

This parameter follows the same rules as the one for the count method (see "Counting Characters in Strings"); that is, it understands the hyphen and the caret.

There is also a squeeze! method.

Removing Specific Characters from Within a String

The delete method will remove characters from a string if they appear in the list of characters passed as a parameter.

 s1 = "To be, or not to be"
 s2 = s1.delete("b")            # "To e, or not to e"
 s3 = "Veni, vidi, vici!"
 s4 = s3.delete(",!")           # "Veni vidi vici"

This parameter follows the same rules as the one for the count method (see "Counting Characters in Strings"); that is, it understands the hyphen and the caret.

There is also a delete! method.

Printing Special Characters

The dump method will provide explicit printable representations of characters that might ordinarily be invisible or print differently.

 s1 = "Listen" << 7 << 7 << 7   # Add three ASCII BEL characters
 puts s1.dump                   # Prints: Listen\007\007\007
 s2 = "abc\t\tdef\tghi\n\n"
 puts s2.dump                   # Prints: abc\t\tdef\tghi\n\n
 s3 = "Double quote: \""
 puts s3.dump                   # Prints: Double quote: \"

Generating Successive Strings

On rare occasions we might want to find the successor value for a string; for example, the successor for "aaa" is "aab" (then "aac", "aad", and so on).

Ruby provides the method succ for this purpose.

 droid = "R2D2"
 improved  = droid.succ         # "R2D3"
 pill  = "Vitamin B"
 pill2 = pill.succ              # "Vitamin C"

We don't recommend the use of this feature unless the values are predictable and reasonable. If you start with a string that is esoteric enough, you will eventually get strange and surprising results.

There is also an upto method that will apply succ repeatedly in a loop until the desired final value is reached.

 "Files, A".upto "Files, X" do |letter|
   puts "Opening: #{letter}"
 end
 # Produces 24 lines of output

Again, we stress that this isn't used very frequently, and you use it at your own risk. Also we want to point out that there is no corresponding predecessor function at the time of this writing.
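To give a feel for the "strange and surprising" cases, here are a few successor values as produced by current Ruby versions; the strings themselves are arbitrary examples.

```ruby
p "az".succ        # "ba"      (carry into the next letter)
p "zz".succ        # "aaa"     (carry adds a new character)
p "a9".succ        # "b0"      (digits carry into letters)
p "Zz".succ        # "AAa"     (case is preserved per position)
p "1999zzz".succ   # "2000aaa"
p "***".succ       # "**+"     (no alphanumerics: next character code)

# upto follows the same carrying rules
p "a8".upto("b2").to_a   # ["a8", "a9", "b0", "b1", "b2"]
```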

Calculate the Levenshtein Distance Between Two Strings

The concept of distance between strings is important in inductive learning (AI), cryptography, protein research, and in other areas.

The Levenshtein distance (see Listing 2.3) is the minimum number of modifications needed to change one string into another, using three basic modification operations: del (deletion), ins (insertion), and sub (substitution). A substitution is also considered to be a combination of a deletion and insertion (indel). There are various approaches to this, but we will avoid getting too technical. Suffice it to say that this Ruby implementation allows you to provide optional parameters to set the cost for the three types of modification operations, and defaults to a single indel cost basis (cost of insertion=cost of deletion).

Listing 2.3 Levenshtein Distance

 class String
   def levenstein(other, ins=2, del=2, sub=1)
     # ins, del, sub are weighted costs
     return nil if self.nil?
     return nil if other.nil?
     dm = []        # distance matrix
     # Initialize first row values
     dm[0] = (0..self.length).collect { |i| i * ins }
     fill = [0] * (self.length - 1)
     # Initialize first column values
     for i in 1..other.length
       dm[i] = [i * del, fill.flatten]
     end
     # populate matrix
     for i in 1..other.length
       for j in 1..self.length
         # critical comparison
         dm[i][j] = [
              dm[i-1][j-1] +
                (self[j-1] == other[i-1] ? 0 : sub),
              dm[i][j-1] + ins,
              dm[i-1][j] + del
            ].min
       end
     end
     # The last value in matrix is the
     # Levenstein distance between the strings
     dm[other.length][self.length]
   end
 end

 s1 = "ACUGAUGUGA"
 s2 = "AUGGAA"
 d1 = s1.levenstein(s2)    # 9
 s3 = "pennsylvania"
 s4 = "pencilvaneya"
 d2 = s3.levenstein(s4)    # 7
 s5 = "abcd"
 s6 = "abcd"
 d3 = s5.levenstein(s6)    # 0

Now that we have the Levenshtein distance defined, it's conceivable that we could define a similar? method, giving it a threshold for similarity.

    class String
      def similar?(other, thresh=2)
        # "Similar" means a distance no greater than the threshold
        self.levenstein(other) <= thresh
      end
    end

    if "polarity".similar?("hilarity")   # distance is 2 (two substitutions)
      puts "Electricity is funny!"
    end

Of course, it would also be possible to pass the three weighted costs in to the similar? method so that they could in turn be passed on to the levenstein method. We have omitted these for simplicity.
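The pass-through just mentioned might look something like the following minimal sketch. It is written as a pair of standalone methods so that it runs on its own; the names weighted_distance and similar_strings? are hypothetical and are not part of the listing, the keyword defaults simply mirror the listing's costs, and the comparison treats a distance equal to the threshold as similar (an assumption).

```ruby
# Hypothetical helper (not from Listing 2.3): weighted edit distance
# for turning `from` into `to`, keeping only two matrix rows in memory.
def weighted_distance(from, to, ins: 2, del: 2, sub: 1)
  # Row 0: building a prefix of `to` from nothing takes insertions only.
  prev = (0..to.length).map { |j| j * ins }
  from.each_char.with_index(1) do |ch, i|
    curr = [i * del]   # emptying a prefix of `from` takes deletions only
    to.each_char.with_index(1) do |other_ch, j|
      curr << [
        prev[j - 1] + (ch == other_ch ? 0 : sub),  # match or substitute
        curr[j - 1] + ins,                         # insert other_ch
        prev[j] + del                              # delete ch
      ].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical predicate: any cost keywords are handed straight through.
def similar_strings?(a, b, thresh = 2, **costs)
  weighted_distance(a, b, **costs) <= thresh
end

puts similar_strings?("polarity", "hilarity")                          # true
puts similar_strings?("kitten", "sitting", 2, ins: 1, del: 1, sub: 1)  # false
```

With unit costs the routine reduces to the classic edit distance, which is why the second call fails a threshold of 2: "kitten" and "sitting" are three edits apart.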

Using Strings as Stacks and Queues

These routines make it possible to treat a string as a stack or a queue (see Listing 2.4), adding the operations shift, unshift, push, pop, rotate_left, and rotate_right. The operations are implemented both at the character and the word level. These have proved useful in one or two programs that we have written, and they might be useful to you also. Use your imagination.

There might be some confusion as to what is returned by each method. In the case of a retrieving operation such as pop or shift, the return value is the item that was retrieved. In a storing operation such as push or unshift, the return value is the new string. All rotate operations return the value of the new string. And we will state the obvious: Every one of these operations modifies its receiver, although none of them is marked with an exclamation point as suffix.
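The return-value conventions described above can be illustrated with Ruby's built-in String methods alone. This is only a sketch of the convention, not the listing's code: slice!, prepend, and << stand in for shift/pop, unshift, and push, and the method name queue_demo is hypothetical.

```ruby
# Assumption: built-in String methods stand in for the listing's
# shift, pop, unshift, and push. Every call below mutates str.
def queue_demo
  str = +"Hello"          # unary plus guarantees a mutable string
  results = {}
  # Retrieving operations return the item that was removed.
  results[:shifted] = str.slice!(0)     # "H"; str is now "ello"
  results[:popped]  = str.slice!(-1)    # "o"; str is now "ell"
  # Storing operations return the new string (the receiver itself),
  # so a copy is taken here to record each intermediate value.
  results[:unshifted] = str.prepend("J").dup   # "Jell"
  results[:pushed]    = (str << "y").dup       # "Jelly"
  results[:final]     = str
  results
end

p queue_demo
```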

Listing 2.4 String as Queues

    class String

      def shift
        # Removes first character from self and
        #   returns it, changing self
        return nil if self.empty?
        item = self[0]
        # \A with /m removes the first character even
        #   when it is a newline
        self.sub!(/\A./m, "")
        return nil if item.nil?
        item.chr
      end

      def unshift(other)
        # Adds last character of provided string to
        #   front of self
        newself = other.to_s.dup.pop.to_s + self
        self.replace(newself)
      end

      def pop
        # Pops last character off self and
        #   returns it, changing self
        return nil if self.empty?
        item = self[-1]
        self.chop!
        return nil if item.nil?
        item.chr
      end

      def push(other)
        # Pushes first character of provided
        #   string onto end of self
        newself = self + other.to_s.dup.shift.to_s
        self.replace(newself)
      end

      def rotate_left(n=1)
        n = 1 unless n.kind_of? Integer
        n.times do
          char = self.shift
          self.push(char)
        end
        self
      end

      def rotate_right(n=1)
        n = 1 unless n.kind_of? Integer
        n.times do
          char = self.pop
          self.unshift(char)
        end
        self
      end

      @@first_word_re = /^(\w+\W*)/
      @@last_word_re = /(\w+\W*)$/

      def shift_word
        # Shifts first word off of self
        #   and returns it; changes self
        return nil if self.empty?
        self =~ @@first_word_re
        newself = $' || ""       # $' is POSTMATCH
        self.replace(newself) unless $'.nil?
        $1
      end

      def unshift_word(other)
        # Adds provided string to front of self
        newself = other.to_s + self
        self.replace(newself)
      end

      def pop_word
        # Pops and returns last word off
        #   self; changes self
        return nil if self.empty?
        self =~ @@last_word_re
        newself = $` || ""       # $` is PREMATCH
        self.replace(newself) unless $`.nil?
        $1
      end

      def push_word(other)
        # Pushes provided string onto end of self
        newself = self + other.to_s
        self.replace(newself)
      end

      def rotate_word_left
        word = self.shift_word
        self.push_word(word)
      end

      def rotate_word_right
        word = self.pop_word
        self.unshift_word(word)
      end

      alias rotate_Left rotate_word_left
      alias rotate_Right rotate_word_right

    end

    # ------------

    str = "Hello there"
    puts str.rotate_left                 # "ello thereH"
    puts str.pop                         # "H"
    puts str.shift                       # "e"
    puts str.rotate_right                # "ello ther"
    puts str.unshift("H")                # "Hello ther"
    puts str.push("e")                   # "Hello there"
    puts str.push_word(", pal!")         # "Hello there, pal!"
    puts str.rotate_Left                 # "there, pal!Hello "
    puts str.pop_word                    # str is "there, pal!"
                                         # result is "Hello "
    puts str.shift_word                  # str is "pal!"
                                         # result is "there, "
    puts str.unshift_word("Hi there, ")  # "Hi there, pal!"
    puts str.rotate_Right                # "pal!Hi there, "
    puts str.rotate_left(4)              # "Hi there, pal!"

    puts "Trying again..."
    str = "pal! Hi there, "
    puts str.rotate_left(5)              # "Hi there, pal! "

Note that the [] operator with a range can be used to gain a window onto a string that is being rotated. Each of the two fragments below loops until it is interrupted.

    str = ".....duck....*...*..*..........*......*..."
    loop do
      print str.rotate_left[0..7], "\r"
    end

    # speed reading
    string = "See Bill run. Run Bill run! See Jane sit. Jane sees Bill."
    loop { print string.rotate_Left[0..4], "\r" }
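When the original string should be left untouched, much the same window can be produced without the methods from Listing 2.4. The following is a sketch only: it assumes a Ruby version that provides Array#rotate, and the helper name rotating_windows is hypothetical. It returns a fixed number of windows instead of looping forever.

```ruby
# Hypothetical helper: collects `steps` successive windows of `width`
# characters, rotating a copy of the characters and leaving `text` alone.
def rotating_windows(text, width, steps)
  chars = text.chars
  (1..steps).map { |n| chars.rotate(n).join[0, width] }
end

rotating_windows("duck...", 4, 3).each { |window| puts window }
# Prints:
#   uck.
#   ck..
#   k...
```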

Creating an Abbreviation or Acronym

Suppose that we have a string and we want to create an abbreviation from the initial letters of each word in it. The code fragment shown in Listing 2.5 accomplishes that. We have added a threshold value such that any word with fewer than that number of letters will be ignored. The threshold value defaults to zero, which includes all words.

Note that this uses the shift_word method, defined in "Using Strings as Stacks and Queues."

Listing 2.5 Acronym Creator

    class String
      def acronym(thresh=0)
        acro = ""
        str = self.dup.strip
        while !str.empty?
          word = str.shift_word
          break if word.nil?      # no word at the front of str
          # Count letters only; shift_word also returns
          # any trailing spaces and punctuation
          letters = word[/\w+/]
          if letters.length >= thresh
            acro += letters[0,1].upcase
          end
        end
        acro
      end
    end

    s1 = "Same old, same old"
    puts s1.acronym       #  "SOSO"

    s2 = "three-letter abbreviation"
    puts s2.acronym       #  "TLA"

    s3 = "light amplification by stimulated emission of radiation"
    puts s3.acronym       #  "LABSEOR"
    puts s3.acronym(3)    #  "LASER"

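Here is a less readable but perhaps more instructive version of much the same method. It splits on whitespace only, so a hyphenated word counts as a single word.

    class String
      def acro(thresh=0)
        self.split.find_all { |w| w.length >= thresh } .
             collect { |w| w[0,1].upcase } .join
      end
    end

Don't overlook the trailing dot after the find_all block; it tells Ruby that the expression continues on the next line.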
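A third variation, offered only as a sketch, uses scan to pick out runs of word characters, so that hyphenated words are split and punctuation never counts toward a word's length. The standalone method name acronym_of is hypothetical and is not part of Listing 2.5.

```ruby
# Hypothetical standalone helper (not the listing's String#acronym).
def acronym_of(text, thresh = 0)
  text.scan(/\w+/)                         # runs of word characters only
      .select { |w| w.length >= thresh }   # ignore words shorter than thresh
      .map { |w| w[0].upcase }             # initial letter of each word
      .join
end

puts acronym_of("three-letter abbreviation")   # TLA
puts acronym_of("light amplification by stimulated emission of radiation", 3)   # LASER
```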

Encoding and Decoding Base64 Strings

Base64 is frequently used to convert machine-readable data into a text form with no special characters in it. For example, newsgroups that handle binary files, such as program executables, frequently will use base64.

The easiest way to do a base64 encode/decode is to use the built-in features of Ruby. The Array class has a pack method that will return a base64 string (given the parameter "m"). The String class has a method unpack that will likewise unpack the string (decoding the base64).

    str = "\007\007\002\abdce"
    new_string = [str].pack("m")          # "BwcCB2JkY2U=\n"
    original   = new_string.unpack("m")   # ["\a\a\002\abdce"]

Note that pack appends a newline to the encoded string and that unpack returns an array.
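If the trailing newline or the array wrapper is unwanted, the following sketch shows one way around both. It assumes a Ruby version that supports the "m0" directive (base64 with no line feeds) and String#unpack1; the helper names encode64_bare and decode64_bare are hypothetical.

```ruby
# Hypothetical helpers built on pack/unpack1.
def encode64_bare(data)
  [data].pack("m0")     # "m0" emits no line feeds at all
end

def decode64_bare(text)
  text.unpack1("m")     # unpack1 returns the first value, not an array
end

encoded = encode64_bare("\007\007\002\abdce")
puts encoded                                                 # BwcCB2JkY2U=
puts decode64_bare(encoded) == "\007\007\002\abdce"          # true
```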

Encoding and Decoding Strings (uuencode/uudecode)

The uu in these names means UNIX-to-UNIX. The uuencode and uudecode utilities are a time-honored way of exchanging data in text form (similar to the way base64 is used).

    str = "\007\007\002\abdce"
    new_string = [str].pack("u")          # "(!P<\"!V)D8V4`\n"
    original   = new_string.unpack("u")   # ["\a\a\002\abdce"]
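Longer input shows the line structure of the format: each output line carries up to 45 bytes of data and begins with a character that encodes that line's byte count. The sketch below assumes String#unpack1 is available, and the helper name uu_roundtrip is hypothetical.

```ruby
# Hypothetical helper: encodes, decodes, and returns both results.
def uu_roundtrip(data)
  encoded = [data].pack("u")
  decoded = encoded.unpack1("u")   # unpack1 returns a string, not an array
  [encoded, decoded]
end

encoded, decoded = uu_roundtrip("a" * 50)
puts encoded.lines.length       # 2 (45 bytes on the first line, 5 on the second)
puts encoded.lines[0][0]        # M (32 + 45 = 77, the code for "M")
puts encoded.lines[1][0]        # % (32 + 5 = 37, the code for "%")
puts decoded == "a" * 50        # true
```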