6.3. Strings and Operators

6.3.1. Standard Type Operators

In Chapter 4, we introduced a number of operators that apply to most objects, including the standard types. We will take a look at how some of those apply to strings. For a brief introduction, here are a few examples using strings:

>>> str1 = 'abc'
>>> str2 = 'lmn'
>>> str3 = 'xyz'
>>> str1 < str2
True
>>> str2 != str3
True
>>> str1 < str3 and str2 == 'xyz'
False

When using the value comparison operators, strings are compared lexicographically (ASCII value order).
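To make the ordering concrete, here is a minimal sketch that looks at the numeric character codes behind a comparison. The helper name first_difference is made up for this illustration and is not part of Python; the print() calls are written so the snippet should also run on Python 3.

```python
def first_difference(s1, s2):
    """Return the first pair of differing characters, or None if
    one string is a prefix of the other (or they are equal)."""
    for c1, c2 in zip(s1, s2):
        if c1 != c2:
            return c1, c2
    return None

# Uppercase letters have smaller codes than lowercase letters,
# so a string starting with 'Z' sorts before one starting with 'a'.
print(ord('Z'))             # 90
print(ord('a'))             # 97
print('Zebra' < 'apple')    # True

# The comparison is decided by the first position where the strings differ.
print(first_difference('abc', 'abd') == ('c', 'd'))   # True
print('abc' < 'abd')        # True

# When one string is a prefix of the other, the shorter one is smaller.
print('abc' < 'abcd')       # True
```

A practical consequence is that sorting mixed-case strings with the default comparison places every capitalized word ahead of every lowercase one.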

6.3.2. Sequence Operators

Slices ( `[ ]` and `[ : ]` )

Earlier, in Section 6.1.1, we examined how to access an individual element or a group of elements from a sequence. We will apply that knowledge to strings in this section. In particular, we will look at:

Counting forward
Counting backward
Default/missing indexes

For the following examples, we use the single string 'abcd'. Provided in the figure is a list of positive and negative indexes that indicate the position in which each character is located within the string itself.

Using the length operator, we can confirm that its length is 4:

>>> aString = 'abcd'
>>> len(aString)
4

When counting forward, indexes start at 0 to the left and end at one less than the length of the string (because we started from zero). In our example, the final index of our string is:

final_index = len(aString) - 1
            = 4 - 1
            = 3

We can access any substring within this range. The slice operator with a single argument will give us a single character, and the slice operator with a range, i.e., using a colon ( : ), will give us multiple consecutive characters. Again, for any range [start:end], we will get all characters starting at offset start up to, but not including, the character at end. In other words, for every index x in the range [start:end], start <= x < end.

>>> aString[0]
'a'
>>> aString[1:3]
'bc'
>>> aString[2:4]
'cd'
>>> aString[4]
Traceback (innermost last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range

Any single index outside our valid index range (in our example, 0 to 3 when counting forward) results in an error. Above, our access of aString[2:4] was valid because it returns the characters at indexes 2 and 3, i.e., 'c' and 'd', but a direct access to the character at index 4 was invalid.
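One detail worth making explicit is that the error applies to single-index access only. Slices are more forgiving: bounds that run past the end of the string are quietly clamped. The following sketch, using the same 'abcd' string, illustrates the difference (the variable names are chosen for this example only).

```python
aString = 'abcd'

# A single index outside the valid range raises IndexError.
try:
    aString[4]
    index_failed = False
except IndexError:
    index_failed = True

# A slice whose bounds run past the end is clamped rather than rejected.
clamped = aString[2:100]    # same as aString[2:4]
empty = aString[10:20]      # nothing in range, so the result is empty

print(index_failed)     # True
print(repr(clamped))    # 'cd'
print(repr(empty))      # ''
```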

When counting backward, we start at index -1 and move toward the beginning of the string, ending at the negative of the string's length. The final index (the first character) is located at:

final_index = -len(aString)
            = -4

>>> aString[-1]
'd'
>>> aString[-3:-1]
'bc'
>>> aString[-4]
'a'

When either the starting or the ending index is missing, it defaults to the beginning or the end of the string, respectively.

>>> aString[2:]
'cd'
>>> aString[1:]
'bcd'
>>> aString[:-1]
'abc'
>>> aString[:]
'abcd'

Notice how the omission of both indices gives us a copy of the entire string.
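The relationship between the two counting directions and the default bounds can be checked directly. The snippet below is a small sketch along those lines; index_map is a name invented here, listing each character with its positive and negative index.

```python
aString = 'abcd'
n = len(aString)

# Each negative index -k names the same character as the positive index n - k.
for k in range(1, n + 1):
    assert aString[-k] == aString[n - k]

# Missing bounds behave like 0 (start) and len(aString) (end).
assert aString[2:] == aString[2:n]
assert aString[:-1] == aString[0:n - 1]
assert aString[:] == aString[0:n]

# (positive index, negative index, character) for every position
index_map = [(i, i - n, aString[i]) for i in range(n)]
print(index_map)
# [(0, -4, 'a'), (1, -3, 'b'), (2, -2, 'c'), (3, -1, 'd')]
```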

Membership (`in, not in`)

The membership question asks whether a (sub)string appears in another string. True is returned if the substring appears in the string and False otherwise. Note that the membership operators report only whether a substring is present, not where it occurs. To locate a substring, use the string methods or string module functions find() or index() (and their brethren rfind() and rindex()).

Below are a few more examples of strings and the membership operators. Note that prior to Python 2.3, the in (and not in) operators for strings only allowed a single character check, such as the second example below (is 'n' a substring of 'abcd'). In 2.3, this was opened up to all strings, not just characters.

>>> 'bc' in 'abcd'
True
>>> 'n' in 'abcd'
False
>>> 'nm' not in 'abcd'
True
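To contrast a yes-or-no membership test with the methods that report a position, consider the following sketch. It uses only the built-in string methods find() and index(); the variable names are illustrative.

```python
text = 'abcd'

present = 'bc' in text          # answers only "is it there?"
position = text.find('bc')      # offset of the first match
missing = text.find('nm')       # find() signals "not found" with -1

# index() behaves like find() but raises ValueError when nothing matches.
try:
    text.index('nm')
    raised = False
except ValueError:
    raised = True

print(present)      # True
print(position)     # 1
print(missing)      # -1
print(raised)       # True
```

Because find() returns -1 on failure and -1 is itself a valid (negative) index, it is usually safer to test membership with in and reserve find() or index() for cases where the offset is actually needed.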

In Example 6.1, we will be using the following predefined strings found in the string module:

>>> import string
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'

Example 6.1 is a small script called idcheck.py that checks for valid Python identifiers. As we now know, Python identifiers must start with an alphabetic character or an underscore. Any succeeding characters may be alphanumeric or underscores.

Example 6.1. ID Check (`idcheck.py`)

Tests for identifier validity. First symbol must be alphabetic and remaining symbols must be alphanumeric. This tester program only checks identifiers that are at least two characters in length.

1  #!/usr/bin/env python
2
3  import string
4
5  alphas = string.letters + '_'
6  nums = string.digits
7
8  print 'Welcome to the Identifier Checker v1.0'
9  print 'Testees must be at least 2 chars long.'
10 myInput = raw_input('Identifier to test? ')
11
12 if len(myInput) > 1:
13
14     if myInput[0] not in alphas:
15         print '''invalid: first symbol must be
16             alphabetic'''
17     else:
18         for otherChar in myInput[1:]:
19
20             if otherChar not in alphas + nums:
21                 print '''invalid: remaining
22                    symbols must be alphanumeric'''
23                 break
24         else:
25             print "okay as an identifier"

The example also shows use of the string concatenation operator ( + ) introduced later in this section.

Running this script several times produces the following output:

$ python idcheck.py
Welcome to the Identifier Checker v1.0
Testees must be at least 2 chars long.
Identifier to test? counter
okay as an identifier
$
$ python idcheck.py
Welcome to the Identifier Checker v1.0
Testees must be at least 2 chars long.
Identifier to test? 3d_effects
invalid: first symbol must be alphabetic

Let us take apart the application line by line.

Lines 3-6

Import the string module and use some of the predefined strings to put together valid alphabetic and numeric identifier strings that we will test against.

Lines 8-12

Print the salutation and prompt for user input. The if statement on line 12 filters out all identifiers or candidates shorter than two characters in length.

Lines 14-16

Check to see if the first symbol is alphabetic. If it is not, display the output indicating the result and perform no further processing.

Lines 17-18

Otherwise, loop to check the other characters, starting from the second symbol to the end of the string.

Lines 20-23

Check to see if each remaining symbol is alphanumeric. Note how we use the concatenation operator (see below) to create the set of valid characters. As soon as we find an invalid character, display the result and perform no further processing by exiting the loop with break.

Core Tip: Performance

In general, repeatedly evaluating the same operation or function call inside a loop, when its result never changes, hurts performance.

i = 0
while i < len(myString):
    print 'character %d is:' % i, myString[i]
    i += 1

The loop above wastes valuable time recalculating the length of string myString. This function call occurs for each loop iteration. If we simply save this value once, we can rewrite our loop so that it is more productive.

length = len(myString)      # computed once, before the loop
i = 0
while i < length:
    print 'character %d is:' % i, myString[i]
    i += 1

The same idea applies to this loop from Example 6.1:

for otherChar in myInput[1:]:
    if otherChar not in alphas + nums:

The for loop beginning on line 18 contains an if statement that concatenates a pair of strings. These strings do not change throughout the course of the application, yet this calculation must be performed for each loop iteration. If we save the new string first, we can then reference that string rather than make the same calculations over and over again:

alphnums = alphas + nums
for otherChar in myInput[1:]:
    if otherChar not in alphnums:
        :

Lines 24-25

It may be somewhat premature to show you a for-else loop statement, but we are going to give it a shot anyway. (For a full treatment, see Chapter 8.) The else clause of a for loop is optional and, if provided, executes only if the loop runs to completion without being exited by break. In our application, if all remaining symbols check out okay, then we have a valid identifier name. A message saying so is displayed, completing execution.
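Since for-else is easy to misread, here is a minimal standalone sketch of the same pattern. The function first_invalid and its parameters are hypothetical names introduced for this illustration; they do not appear in idcheck.py.

```python
def first_invalid(candidate, allowed):
    """Return the first character of candidate that is not in allowed,
    or None if every character is acceptable."""
    for ch in candidate:
        if ch not in allowed:
            bad = ch
            break           # leaving via break skips the else clause
    else:
        bad = None          # runs only when the loop was never broken
    return bad

print(first_invalid('abc1', 'abc'))     # 1
print(first_invalid('cab', 'abc'))      # None
```

Note that the else clause also runs when the loop body never executes at all, as with an empty candidate string.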

This application is not without its flaws, however. One problem is that the identifiers tested must have length greater than 1. Our application "as is" is not reflective of the true range of Python identifiers, which may be of length 1. Another problem with our application is that it does not take into consideration Python keywords, which are reserved names that cannot be used for identifiers. We leave these two tasks as exercises for the reader (see Exercise 6-2).

Concatenation ( `+` )

Runtime String Concatenation

We can use the concatenation operator to create new strings from existing ones. We have already seen the concatenation operator in action above in Example 6.1. Here are a few more examples:

>>> 'Spanish' + 'Inquisition'
'SpanishInquisition'
>>>
>>> 'Spanish' + ' ' + 'Inquisition'
'Spanish Inquisition'
>>>
>>> s = 'Spanish' + ' ' + 'Inquisition' + ' Made Easy'
>>> s
'Spanish Inquisition Made Easy'
>>>
>>> import string
>>> string.upper(s[:3] + s[20])    # archaic (see below)
'SPAM'

The last example illustrates using the concatenation operator to put together a pair of slices from string s, the "Spa" from "Spanish" and the "M" from "Made." The extracted slices are concatenated and then sent to the string.upper() function to convert the new string to all uppercase letters. String methods were added to Python back in 1.6 so such examples can be replaced with a single call to the final string method (see example below). There is really no longer a need to import the string module unless you are trying to access some of the older string constants which that module defines.

Note: Although string concatenation is easier for beginners to learn, we recommend against using it when performance counts. The reason is that for every string that is part of a concatenation, Python has to allocate new memory for all strings involved, including the result. Instead, we recommend that you either use the string format operator ( % ), as in the examples below, or put all of the substrings in a list and use a single join() call to put them all together:

>>> '%s %s' % ('Spanish', 'Inquisition')
'Spanish Inquisition'
>>>
>>> s = ' '.join(('Spanish', 'Inquisition', 'Made Easy'))
>>> s
'Spanish Inquisition Made Easy'
>>>
>>> # no need to import string to use string.upper():
>>> ('%s%s' % (s[:3], s[20])).upper()
'SPAM'
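The difference matters most when a string is assembled piece by piece in a loop. The following sketch contrasts the two approaches; build_with_plus and build_with_join are illustrative names, and no timing figures are claimed here.

```python
def build_with_plus(words):
    """Concatenate with +, creating a new intermediate string on each pass."""
    result = ''
    for word in words:
        if result:
            result = result + ' '
        result = result + word
    return result

def build_with_join(words):
    """Hand all the pieces to a single join() call."""
    return ' '.join(words)

words = ['Spanish', 'Inquisition', 'Made', 'Easy']

# Both approaches produce the same text; only the amount of work differs.
assert build_with_plus(words) == build_with_join(words)
print(build_with_join(words))   # Spanish Inquisition Made Easy
```

With join(), the separator is written once and the loop disappears, which also removes the special case for the first word.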

Compile-Time String Concatenation

The above syntax using the addition operator performs the string concatenation at runtime, and its use is the norm. There is a less frequently used syntax that is more of a programmer convenience feature. Python's syntax allows you to create a single string from multiple string literals placed adjacent to each other in the body of your source code:

>>> foo = "Hello" 'world!'
>>> foo
'Helloworld!'

It is a convenient way to split up long strings without unnecessary backslash escapes. As you can see from the above, you can mix quotation types on the same line. Another good thing about this feature is that you can add comments too, like this example:

>>> f = urllib.urlopen('http://'  # protocol
... 'localhost'                   # hostname
... ':8000'                       # port
... '/cgi-bin/friends2.py')       # file

As you can imagine, here is what urlopen() really gets as input:

>>> 'http://' 'localhost' ':8000' '/cgi-bin/friends2.py'
'http://localhost:8000/cgi-bin/friends2.py'
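The same technique works anywhere an expression may span several lines, for example inside parentheses. The sketch below uses made-up text purely to illustrate the syntax, and it also shows a pitfall that follows from the same rule.

```python
# Hypothetical message text, chosen only to illustrate the syntax.
usage = ('usage: idcheck.py '     # program name
         '[identifier]\n'         # optional argument
         "Reports whether the "   # quotation types can be mixed
         'identifier is valid.')  # final piece

# Adjacent literals are joined into one string when the code is compiled.
print(usage)

# Pitfall: a forgotten comma silently merges two list items into one.
names = ['alpha' 'beta', 'gamma']
print(len(names))   # 2
print(names[0])     # alphabeta
```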

Regular String Coercion to Unicode

When concatenating regular and Unicode strings, regular strings are converted to Unicode before the operation occurs:

>>> 'Hello' + u' ' + 'World' + u'!'
u'Hello World!'

Repetition ( `*` )

The repetition operator creates a new string by concatenating multiple copies of the same string:

>>> 'Ni!' * 3
'Ni!Ni!Ni!'
>>>
>>> '*'*40
'****************************************'
>>>
>>> print '-' * 20, 'Hello World!', '-' * 20
-------------------- Hello World! --------------------
>>> who = 'knights'
>>> who * 2
'knightsknights'
>>> who
'knights'

As with any standard operator, the original variable is unmodified, as indicated in the final dump of the object above.
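As a closing illustration, repetition and concatenation combine naturally when building fixed-width output. The helper below is a sketch only; banner and its parameters are names invented for this example.

```python
def banner(title, width=40, fill='-'):
    """Place title between two equal runs of the fill character."""
    side = (width - len(title) - 2) // 2    # room left on each side
    return fill * side + ' ' + title + ' ' + fill * side

line = banner('Hello World!')
print(line)
print(len(line))            # 40

# Multiplying by zero or by a negative number gives an empty string.
print(repr('Ni!' * 0))      # ''
print(repr('Ni!' * -2))     # ''
```

When the space left over is odd, the integer division rounds down, so the result can be one character narrower than the requested width.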