Section 23.2. Regular Expressions | Linux Application Development (paperback) (2nd Edition)

23.2. Regular Expressions

Regular expressions, as used in sed, awk, grep, vi, and countless other Unix programs through the years, have become a major part of the Unix programming environment. They are also available for use within C programs. This section explains how to use them and then presents a simple file parser using these functions.

23.2.1. Linux Regular Expressions

Regular expressions have two flavors: basic regular expressions (BREs) and extended regular expressions (EREs). They correspond (roughly) to the grep and egrep commands. Both forms of regular expressions are explained in the grep man page, in the POSIX.2 standard [IEEE, 1993], in A Practical Guide to Red Hat Linux 8 [Sobell, 2002], and in other places, so we do not describe their syntax here, only the function interface that allows you to use regular expressions from within your programs.

23.2.2. Regular Expression Matching

POSIX specifies four functions to provide regular expression handling:

    #include <regex.h>

    int regcomp(regex_t *preg, const char *regex, int cflags);
    int regexec(const regex_t *preg, const char *string, size_t nmatch,
                regmatch_t pmatch[], int eflags);
    void regfree(regex_t *preg);
    size_t regerror(int errcode, const regex_t *preg, char *errbuf,
                    size_t errbuf_size);

Before you can compare a string to a regular expression, you need to compile it with the regcomp() function. The regex_t *preg holds all the state for the regular expression. You need one regex_t for each regular expression that you wish to have available concurrently. The regex_t structure has only one member on which you should rely: re_nsub, which specifies the number of parenthesized subexpressions in the regular expression. Consider the rest of the structure opaque.
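As a minimal sketch of the compile step, the following helper compiles a pattern and reports re_nsub. The name count_subexpressions() and its -1 error convention are our own assumptions for illustration; they are not part of the POSIX interface.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not part of POSIX): returns the number of
 * parenthesized subexpressions in pattern, or -1 if the pattern
 * cannot be compiled. */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;              /* bad pattern or out of memory */

    nsub = (long) re.re_nsub;   /* the one member we may rely on */
    regfree(&re);               /* release the compiled expression */
    return nsub;
}
```

Called with "(a)(b(c))" and REG_EXTENDED, the helper should report three subexpressions; an unbalanced ERE such as "(a" should fail to compile.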

The cflags argument determines many things about how the regular expression regex is interpreted. It may be zero, or it may be the bitwise OR of any of the following four items:

 REG_EXTENDED

If set, use ERE syntax instead of BRE syntax.

 REG_ICASE

If set, do not differentiate between upper- and lowercase.

 REG_NOSUB

If set, do not keep track of substrings. The regexec() function then ignores the nmatch and pmatch arguments.

 REG_NEWLINE

If REG_NEWLINE is not set, the newline character is treated essentially the same as any other character. The ^ and $ characters match only the beginning and end of the entire string, not adjacent newline characters. If REG_NEWLINE is set, you get the same behavior as you do with grep, sed, and other standard system tools; ^ anchors both to the beginning of a string and to the character after a newline (technically, it matches zero-length strings following a newline character); $ anchors to the end of the string and to newline characters (technically, it matches a zero-length string preceding the newline character); and . does not match a newline character.
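The effect of these flags is easiest to see by compiling the same pattern with different cflags values. The sketch below assumes a small helper of our own, matches(), which always adds REG_NOSUB because it only needs a yes-or-no answer; the name and return convention are illustrative only.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches pattern, 0 if it
 * does not (or if regexec() fails), and -1 if pattern cannot be
 * compiled with the given cflags. */
int matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we never look at match offsets here */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches("^second$", REG_EXTENDED, "first\nsecond\n") reports no match, because ^ and $ anchor only to the ends of the whole string; adding REG_NEWLINE to the flags makes the same call succeed. Likewise, "first.second" matches "first\nsecond" only when REG_NEWLINE is not set.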

A typical invocation looks like this:

    if ((rerr = regcomp(&p, "(^(.*[^\\])#.*$)|(^[^#]+$)",
                        REG_EXTENDED|REG_NEWLINE))) {
        /* compilation failed: the expression is badly formed or
         * memory ran out; rerr can be passed to regerror() */
    }

This ERE finds lines of a file that are not commented out, or that are, at most, partially commented out, by # characters not prefixed with \ characters. This kind of regular expression might be useful as part of a simple parser for an application's configuration file.

Even if you are compiling an expression that you know is good, you should still check for errors. regcomp() returns zero for a successful compilation and a nonzero error code for an error. Most errors involve invalid regular expressions of one sort or another, but another possible error is running out of memory. The regerror() function, described later in this section, turns the error code into a readable message.

    #include <regex.h>

    int regexec(const regex_t *preg, const char *string, size_t nmatch,
                regmatch_t pmatch[], int eflags);

The regexec() function tests a string against a compiled regular expression. The eflags argument may be zero, or it may be the bitwise OR of any of the following symbols:

 REG_NOTBOL

If set, the beginning of the string is not treated as the beginning of a line, so ^ does not match there. A ^ still matches immediately after a newline character as long as REG_NEWLINE was set in the call to regcomp().

 REG_NOTEOL

If set, the end of the string is not treated as the end of a line, so $ does not match there. A $ still matches immediately before a newline character as long as REG_NEWLINE was set in the call to regcomp().
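REG_NOTBOL matters when the pointer you pass to regexec() is not the true start of the text, for example when you resume a search partway through a string. The following sketch counts non-overlapping matches that way; count_matches() is a name we made up for this example, and always compiling as an ERE is an assumption of the sketch.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of an ERE in
 * text. Returns -1 if the pattern cannot be compiled. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == 0) {
            /* empty match at p: step one byte so we make progress */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += m.rm_eo;       /* resume just past this match */
        }
        /* p is no longer the real start of the string, so ^ must
         * not match at p from now on */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

Searching "abab" for "^ab" this way finds one match; without REG_NOTBOL on the second call, the ^ would wrongly match again at the resumed position.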

An array of regmatch_t structures is used to represent the location of subexpressions in the regular expression:

    #include <regex.h>

    typedef struct {
        regoff_t rm_so;  /* byte offset within string of start of match */
        regoff_t rm_eo;  /* byte offset within string of the first
                            character after the end of the match */
    } regmatch_t;

The first regmatch_t element describes the entire string that was matched; note that any newline, including a trailing newline, is included in this entire string, regardless of whether REG_NEWLINE is set or not.

The following array elements describe the parenthesized subexpressions, ordered by the position of each subexpression's opening parenthesis in the regular expression. (In C code, element i is equivalent to the replacement expression \i in sed or awk.) Subexpressions that do not match have a value of -1 in their regmatch_t.rm_so member.
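As a small sketch of how rm_so and rm_eo are used, the helper below pulls two subexpressions out of a "key = value" line. The function name, the pattern, and the buffer-based interface are all assumptions made for this illustration.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: splits a line of the form "key = value" into
 * two caller-supplied buffers. Returns 0 on success and -1 if the
 * line does not match or the pattern cannot be compiled. */
int split_key_value(const char *line, char *key, size_t keysz,
                    char *val, size_t valsz)
{
    regex_t re;
    regmatch_t m[3];    /* whole match plus two subexpressions */
    int rc;

    if (regcomp(&re, "^([A-Za-z_]+) *= *([^ ].*)$", REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, line, 3, m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    /* rm_eo - rm_so is the length of each submatch in bytes;
     * snprintf() truncates if the buffer is too small */
    snprintf(key, keysz, "%.*s",
             (int) (m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);
    snprintf(val, valsz, "%.*s",
             (int) (m[2].rm_eo - m[2].rm_so), line + m[2].rm_so);
    return 0;
}
```

Because the offsets are relative to the string passed to regexec(), the same string pointer must be used when copying the submatches out.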

This code matches a string against a regular expression with subexpressions, and prints out all the subexpressions that match:

    /* match.c */

    #include <alloca.h>
    #include <sys/types.h>
    #include <regex.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    void do_regerror(int errcode, const regex_t *preg) {
        char *errbuf;
        size_t errbuf_size;

        /* first ask regerror() how large the message is */
        errbuf_size = regerror(errcode, preg, NULL, 0);
        errbuf = alloca(errbuf_size);

        regerror(errcode, preg, errbuf, errbuf_size);
        fprintf(stderr, "%s\n", errbuf);
    }

    int main(void) {

        regex_t p;
        regmatch_t *pmatch;
        int rerr;
        const char *regex = "(^(.*[^\\])#.*$)|(^[^#]+$)";
        char string[BUFSIZ+1];
        int i;

        if ((rerr = regcomp(&p, regex, REG_EXTENDED | REG_NEWLINE))) {
            do_regerror(rerr, &p);
            exit(1);    /* p is not usable after a failed compile */
        }

        /* one element for the whole match, one per subexpression */
        pmatch = alloca(sizeof(regmatch_t) * (p.re_nsub+1));

        printf("Enter a string: ");
        if (!fgets(string, sizeof(string), stdin)) {
            regfree(&p);
            exit(1);
        }

        if ((rerr = regexec(&p, string, p.re_nsub+1, pmatch, 0))) {
            if (rerr == REG_NOMATCH) {
                /* regerror can handle this case, but in most cases
                 * it is handled specially
                 */
                printf("String did not match %s\n", regex);
            } else {
                do_regerror(rerr, &p);
            }
        } else {
            /* match succeeded */
            printf("String matched regular expression %s\n", regex);
            for (i = 0; i <= (int) p.re_nsub; i++) {
                /* print the matching portion(s) of the string */
                if (pmatch[i].rm_so != -1) {
                    char *submatch;
                    size_t matchlen = pmatch[i].rm_eo - pmatch[i].rm_so;
                    submatch = malloc(matchlen+1);
                    if (!submatch) {
                        perror("malloc");
                        break;
                    }
                    strncpy(submatch, string+pmatch[i].rm_so,
                            matchlen);
                    submatch[matchlen] = '\0';
                    printf("matched subexpression %d: %s\n", i,
                           submatch);
                    free(submatch);
                } else {
                    printf("no match for subexpression %d\n", i);
                }
            }
        }
        regfree(&p);
        exit(0);
    }

In the sample regular expression given in match.c, there are three subexpressions: The first is an entire line containing text followed by a comment character, the second is the text in that line that precedes the comment character, and the third is an entire line containing no comment character. Because pmatch[0] describes the whole match, the subexpressions are pmatch[1] through pmatch[3]. For a line with text followed by a comment, pmatch[1] and pmatch[2] are filled in and pmatch[3].rm_so is set to -1; for a line with no comment characters, pmatch[3] is filled in and pmatch[1].rm_so and pmatch[2].rm_so are set to -1; and for a line whose only comment character is at the very beginning, neither alternative matches, so regexec() returns REG_NOMATCH.
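The three cases can be checked with a short sketch that reuses the expression from match.c and reports which alternative matched. The helper name which_alternative() and its return codes are our own invention for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if line matched the first
 * alternative (text followed by a comment), 3 if it matched the
 * third subexpression (no comment character), 0 if nothing matched,
 * and -1 on any error. */
int which_alternative(const char *line)
{
    regex_t re;
    regmatch_t m[4];    /* whole match plus three subexpressions */
    int rc;

    if (regcomp(&re, "(^(.*[^\\])#.*$)|(^[^#]+$)",
                REG_EXTENDED | REG_NEWLINE) != 0)
        return -1;
    rc = regexec(&re, line, 4, m, 0);
    regfree(&re);

    if (rc == REG_NOMATCH)
        return 0;
    if (rc != 0)
        return -1;
    if (m[1].rm_so != -1)
        return 1;       /* m[2] holds the text before the # */
    if (m[3].rm_so != -1)
        return 3;
    return -1;          /* should not happen for this expression */
}
```

A line such as "name = value # comment\n" selects the first alternative, "name = value\n" selects the third subexpression, and "# comment\n" does not match at all.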

Whenever you are done with a compiled regular expression, you need to free it to avoid a memory leak. You must use the regfree() function to free it, not the free() function:

    #include <regex.h>

    void regfree(regex_t *preg);

The POSIX standard does not explicitly specify whether you need to use regfree() each time you call regcomp() or only after the final time you call regcomp() on one regex_t structure. Therefore, regfree() your regex_t structures between uses to avoid memory leaks.
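A minimal sketch of that advice follows: one regex_t is reused for several patterns, with a regfree() after each successful regcomp(). The helper name is hypothetical, and the sketch assumes that a failed regcomp() leaves nothing for the caller to release, so it does not call regfree() in that case.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts how many of the n ERE patterns match
 * text, reusing a single regex_t. Patterns that fail to compile are
 * skipped. */
int count_matching_patterns(const char *const patterns[], size_t n,
                            const char *text)
{
    regex_t re;
    size_t i;
    int hits = 0;

    for (i = 0; i < n; i++) {
        /* assumption: no regfree() is needed after a failed compile */
        if (regcomp(&re, patterns[i], REG_EXTENDED | REG_NOSUB) != 0)
            continue;

        if (regexec(&re, text, 0, NULL, 0) == 0)
            hits++;

        /* free before the next regcomp() reuses the structure */
        regfree(&re);
    }
    return hits;
}
```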

Whenever you get a nonzero return code from regcomp() or regexec(), the regerror() function can provide a detailed message explaining what went wrong. It writes as much as possible of an error message into a buffer and returns the size of the total message. Because you do not know beforehand how big the error message might be, you first ask for its size, then allocate the buffer, and then use the buffer, as demonstrated in match.c. Because that kind of error handling gets old fast, and because you need to include that error handling code at least twice (once after regcomp() and once after regexec()), we recommend that you write your own wrapper around regerror(), as the do_regerror() function in match.c does.
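If the message has to outlive the function that produced it, a heap-allocated variant of the same two-step pattern may be more convenient than alloca(). This is only a sketch; regex_strerror() is a name we chose for illustration, not a standard function.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed copy of the message for
 * errcode, or NULL if memory cannot be allocated. The caller must
 * free() the result. */
char *regex_strerror(int errcode, const regex_t *preg)
{
    /* a zero-size buffer makes regerror() report the size it needs,
     * including the terminating '\0' */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need);

    if (buf != NULL)
        regerror(errcode, preg, buf, need);
    return buf;
}
```

A caller would typically print the returned string to stderr and then free() it.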

23.2.3. A Simple grep

Grep is a popular utility, specified by POSIX, which provides regular expression searching in text files. Here is a simple (not POSIX-compliant) version of grep implemented using the standard regular expression functions:

    /* grep.c */

    #include <alloca.h>
    #include <ctype.h>
    #include <popt.h>
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MODE_REGEXP         1
    #define MODE_EXTENDED       2
    #define MODE_FIXED          3

    void do_regerror(int errcode, const regex_t *preg) {
        char *errbuf;
        size_t errbuf_size;

        /* first ask regerror() how large the message is */
        errbuf_size = regerror(errcode, preg, NULL, 0);
        errbuf = alloca(errbuf_size);

        regerror(errcode, preg, errbuf, errbuf_size);
        fprintf(stderr, "%s\n", errbuf);
    }

    int scanFile(FILE * f, int mode, const void * pattern,
                 int ignoreCase, const char * fileName,
                 int * maxCountPtr) {
        long lineLength;
        char * line;
        int match;
        int rc;
        char * chptr;
        char * prefix = "";

        if (fileName) {
            prefix = alloca(strlen(fileName) + 4);
            sprintf(prefix, "%s: ", fileName);
        }

        lineLength = sysconf(_SC_LINE_MAX);
        line = alloca(lineLength);

        while (fgets(line, lineLength, f) && (*maxCountPtr)) {
            /* if we don't have a final '\n' we didn't get the
               whole line */
            if (line[strlen(line) - 1] != '\n') {
                fprintf(stderr, "%sline too long\n", prefix);
                return 1;
            }

            if (mode == MODE_FIXED) {
                if (ignoreCase) {
                    /* the cast keeps tolower() well defined for
                       characters with the high bit set */
                    for (chptr = line; *chptr; chptr++)
                        *chptr = tolower((unsigned char) *chptr);
                }
                match = (strstr(line, pattern) != NULL);
            } else {
                match = 0;
                rc = regexec(pattern, line, 0, NULL, 0);
                if (!rc)
                    match = 1;
                else if (rc != REG_NOMATCH)
                    do_regerror(rc, pattern);
            }

            if (match) {
                printf("%s%s", prefix, line);
                if (*maxCountPtr > 0)
                    (*maxCountPtr)--;
            }
        }

        return 0;
    }

    int main(int argc, const char ** argv) {
        const char * pattern = NULL;
        regex_t regPattern;
        const void * finalPattern = NULL;
        int mode = MODE_REGEXP;
        int ignoreCase = 0;
        int maxCount = -1;
        int rc;
        int regFlags;
        const char ** files;
        poptContext optCon;
        FILE * f;
        char * chptr;
        struct poptOption optionsTable[] = {
                { "extended-regexp", 'E', POPT_ARG_VAL,
                  &mode, MODE_EXTENDED,
                  "pattern for match is an extended regular "
                  "expression", NULL },
                { "fixed-strings", 'F', POPT_ARG_VAL,
                  &mode, MODE_FIXED,
                  "pattern for match is a basic string (not a "
                  "regular expression)", NULL },
                { "basic-regexp", 'G', POPT_ARG_VAL,
                  &mode, MODE_REGEXP,
                  "pattern for match is a basic regular expression",
                  NULL },
                { "ignore-case", 'i', POPT_ARG_NONE, &ignoreCase, 0,
                  "perform case insensitive search", NULL },
                { "max-count", 'm', POPT_ARG_INT, &maxCount, 0,
                  "terminate after N matches", "N" },
                { "regexp", 'e', POPT_ARG_STRING, &pattern, 0,
                  "regular expression to search for", "pattern" },
                POPT_AUTOHELP
                { NULL, '\0', POPT_ARG_NONE, NULL, 0, NULL, NULL }
        };

        optCon = poptGetContext("grep", argc, argv, optionsTable, 0);
        poptSetOtherOptionHelp(optCon, "<pattern> <file list>");

        if ((rc = poptGetNextOpt(optCon)) < -1) {
            /* an error occurred during option processing */
            fprintf(stderr, "%s: %s\n",
                    poptBadOption(optCon, POPT_BADOPTION_NOALIAS),
                    poptStrerror(rc));
            return 1;
        }

        files = poptGetArgs(optCon);
        /* if we weren't given a pattern it must be the first
           leftover */
        if (!files && !pattern) {
            poptPrintUsage(optCon, stdout, 0);
            return 1;
        }

        if (!pattern) {
            pattern = files[0];
            files++;
        }

        regFlags = REG_NEWLINE | REG_NOSUB;
        if (ignoreCase) {
            regFlags |= REG_ICASE;
            /* convert the pattern to lower case; this doesn't matter
               if we're ignoring the case in a regular expression, but
               it lets strstr() handle -i properly */
            chptr = alloca(strlen(pattern) + 1);
            strcpy(chptr, pattern);
            pattern = chptr;

            while (*chptr) {
                *chptr = tolower((unsigned char) *chptr);
                chptr++;
            }
        }

        switch (mode) {
        case MODE_EXTENDED:
            regFlags |= REG_EXTENDED;
            /* fall through: EREs are compiled like BREs */
        case MODE_REGEXP:
            if ((rc = regcomp(&regPattern, pattern, regFlags))) {
                do_regerror(rc, &regPattern);
                return 1;
            }
            finalPattern = &regPattern;
            break;

        case MODE_FIXED:
            finalPattern = pattern;
            break;
        }

        /* files is NULL when -e supplied the pattern and no file
           names were given */
        if (!files || !*files) {
            rc = scanFile(stdin, mode, finalPattern, ignoreCase, NULL,
                          &maxCount);
        } else if (!files[1]) {
            /* this is handled separately because the file name should
               not be printed */
            if (!(f = fopen(*files, "r"))) {
                perror(*files);
                rc = 1;
            } else {
                rc = scanFile(f, mode, finalPattern, ignoreCase, NULL,
                              &maxCount);
                fclose(f);
            }
        } else {
            rc = 0;

            while (*files) {
                if (!(f = fopen(*files, "r"))) {
                    perror(*files);
                    rc = 1;
                } else {
                    rc |= scanFile(f, mode, finalPattern, ignoreCase,
                                   *files, &maxCount);
                    fclose(f);
                }
                files++;
                if (!maxCount) break;
            }
        }

        if (mode != MODE_FIXED)
            regfree(&regPattern);
        poptFreeContext(optCon);

        return rc;
    }