Hack 15 Manipulate Files with sed

If you've ever had to change the formatting of a file, you know that it can be a time-consuming process.

Why waste your time making manual changes to files when Unix systems come with many tools that can very quickly make the changes for you?

2.4.1 Removing Blank Lines

Suppose you need to remove the blank lines from a file. This invocation of grep will do the job:

% grep -v '^$' letter1.txt > tmp ; mv tmp letter1.txt

The pattern ^$ anchors to both the start and the end of a line with no intervening characters, which is the regexp definition of a blank line. The -v option reverses the search, printing all nonblank lines, which are then written to a temporary file, and the temporary file is moved back to the original.

Never redirect grep's output to the same file it is reading. The shell truncates the output file before grep gets a chance to read it, so the file will end up empty.
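The effect is easy to reproduce on a throwaway file. The following is a minimal sketch, assuming a POSIX shell and the mktemp utility; the variable names and the sample text are made up for illustration.

```shell
# Create a scratch file containing two text lines and one blank line.
demo=$(mktemp)
printf 'one\n\ntwo\n' > "$demo"

# The shell empties "$demo" before grep starts, so grep has nothing to read.
# Some grep implementations also report an error here, hence "|| true".
grep -v '^$' "$demo" > "$demo" 2>/dev/null || true

# Count the bytes left in the file.
size=$(wc -c < "$demo" | tr -d ' ')
rm -f "$demo"
printf '%s\n' "$size"
```

The byte count printed at the end shows that nothing survived the redirect.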

You can rewrite the preceding example in sed as:

% sed '/^$/d' letter1.txt > tmp ; mv tmp letter1.txt

'/^$/d' is actually a sed script. sed's normal mode of operation is to read each line of input, process it according to the script, and then write the processed line to standard output. In this example, /^$/ is a regular expression matching a blank line, and the trailing d is a sed function that deletes the line. Blank lines are deleted and all other lines are printed. Again, the results are redirected to a temporary file, which is then moved back over the original file.
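If you use this idiom often, it can be wrapped in a small shell function. This is only a sketch, assuming a POSIX shell and mktemp; the function name strip_blank_lines and the sample letter are hypothetical.

```shell
# Hypothetical helper: delete empty lines from the file named in $1.
strip_blank_lines() {
    tmp=$(mktemp) || return 1
    # "&&" (rather than ";") keeps mv from running if sed fails,
    # so a failed run cannot replace the original with a partial file.
    sed '/^$/d' "$1" > "$tmp" && mv "$tmp" "$1"
}

# Build a small sample file, clean it, and show the result.
letter=$(mktemp)
printf 'Dear Sir,\n\nThank you.\n\nRegards\n' > "$letter"
strip_blank_lines "$letter"
cleaned=$(cat "$letter")
rm -f "$letter"
printf '%s\n' "$cleaned"
```

Joining the two commands with && instead of a semicolon is a design choice: the original file is replaced only when sed exits successfully.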

2.4.2 Searching with sed

sed can also do the work of grep:

% sed -n '/$USER/p' *

This command will yield the same results as:

% grep '$USER' *

The -n (no-print, perhaps) option prevents sed from outputting each line. The pattern /$USER/ matches lines containing the literal text $USER (the single quotes keep the shell from expanding the variable), and the p function prints matched lines to standard output, overriding -n.
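The quoting matters here. The sketch below, which assumes a POSIX shell, compares the single-quoted search with a double-quoted one; the sample file and the variable who (a stand-in for $USER) are hypothetical.

```shell
# Sample input: one line holds the literal text $USER, one holds a name.
data=$(mktemp)
printf 'echo $USER\nowner: dru\n' > "$data"

# Single quotes: the shell passes $USER through untouched, so both tools
# search for those five literal characters.
sed_literal=$(sed -n '/$USER/p' "$data")
grep_literal=$(grep '$USER' "$data")

# Double quotes: the shell substitutes the variable before sed runs.
who=dru
sed_expanded=$(sed -n "/$who/p" "$data")

rm -f "$data"
printf '%s\n%s\n%s\n' "$sed_literal" "$grep_literal" "$sed_expanded"
```

The double-quoted form is only safe when the variable's value contains no slashes or regular expression metacharacters.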

2.4.3 Replacing Existing Text

One of the most common uses for sed is to perform a search and replace on a given string. For example, to change all occurrences of 2003 into 2004 in a file called date, give the search string and its replacement in the format 's/oldstring/newstring/', like so:

% sed 's/2003/2004/' date
Copyright 2004
...
This was written in 2004, but it is no longer 2003.
...

Almost! Notice that the last 2003 remains unchanged. This is because without the g (global) flag, sed will change only the first occurrence on each line. This command will give the desired result:

% sed 's/2003/2004/g' date

Search and replace takes other flags too. To output only changed lines, use:

% sed -n 's/2003/2004/gp' date

Note the use of the -n flag to suppress normal output and the p flag to print changed lines.
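To see the flags side by side, here is a small sketch assuming a POSIX shell; the sample text is invented and merely resembles the date file.

```shell
# Invented sample: line 3 contains the year twice, line 4 not at all.
dates=$(mktemp)
printf 'Copyright 2003\n\nWritten in 2003, not 2003.\nNo year here.\n' > "$dates"

# Without g, only the first match on each line changes (show line 3).
first_only=$(sed 's/2003/2004/' "$dates" | sed -n '3p')

# With g, every match on the line changes.
global=$(sed 's/2003/2004/g' "$dates" | sed -n '3p')

# With -n and p, only the lines that were actually changed are printed.
changed=$(sed -n 's/2003/2004/gp' "$dates")

rm -f "$dates"
printf '%s\n%s\n%s\n' "$first_only" "$global" "$changed"
```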

2.4.4 Multiple Transformations

Perhaps you need to perform two or more transformations on a file. You can do this in a single run by specifying a script with multiple commands:

% sed 's/2003/2004/g;/^$/d' date

This performs both substitution and blank line deletion. Use a semicolon to separate the two commands.
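The same script can also be split across several -e options, which some people find easier to read. A brief sketch, assuming a POSIX sed and invented input supplied on standard input:

```shell
# One script, commands separated by a semicolon.
one_script=$(printf '2003\n\nend\n' | sed 's/2003/2004/g;/^$/d')

# The same two commands, each given with its own -e option.
two_exprs=$(printf '2003\n\nend\n' | sed -e 's/2003/2004/g' -e '/^$/d')

printf '%s\n%s\n' "$one_script" "$two_exprs"
```

Both forms apply the commands to each line in the order given.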

Here is a more complex example that translates HTML tags of the form <font> into PHP bulletin board tags of the form [font]:

% cat index.html
<title>hello
</title>
% sed 's/<\(.*\)>/[\1]/g' index.html
[title]hello
[/title]

How did this work? The script searched for an HTML tag using the pattern '<.*>'. Angle brackets match literally. In a regular expression, a dot (.) represents any character and an asterisk (*) means zero or more of the previous item. Escaped parentheses, \( and \), capture the matched pattern lying between them and place it in a numbered buffer. In the replace string, \1 refers to the contents of the first buffer. Thus the text between the angle brackets in the search string is captured into the first buffer and written back inside square brackets in the replace string. sed takes full advantage of the power of regular expressions to copy text from the pattern to its replacement.
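Numbered buffers are not limited to one per pattern. The sketch below, assuming a POSIX shell and sed, reruns the tag translation on a two-line input like index.html and then uses two buffers to swap a pair of words; the second input is invented.

```shell
# Each tag sits on its own line, so .* cannot run past the tag's own >.
tags=$(printf '<title>hello\n</title>\n' | sed 's/<\(.*\)>/[\1]/g')

# Two groups: \1 and \2 can be written back in any order.
swapped=$(printf 'hello world\n' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/')

printf '%s\n%s\n' "$tags" "$swapped"
```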

% cat index1.html
<title>hello</title>
% sed 's/<\(.*\)>/[\1]/g' index1.html
[title>hello</title]

This time the same command fails because the pattern .* is greedy and grabs as much as it can, matching all the way up to the last > on the line. To prevent this behavior, we need to match zero or more of any character except <. Recall that [...] is a regular expression that lists characters to match, but if the first character is the caret (^), the match is negated. Thus the regular expression [^<] matches any single character other than <. You can modify the previous example as follows:

% sed 's/<\([^<]*\)>/[\1]/g' index1.html
[title]hello[/title]
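The difference between the two patterns can be checked directly on a single line. This sketch assumes a POSIX shell and sed. The third pattern, which excludes > instead of <, is a variation that does not appear in the examples above; it is included only for comparison on an invented input containing a stray >.

```shell
line='<title>hello</title>'

# Greedy: .* runs to the last > on the line.
greedy=$(printf '%s\n' "$line" | sed 's/<\(.*\)>/[\1]/g')

# Bounded: [^<]* cannot cross into the next tag.
bounded=$(printf '%s\n' "$line" | sed 's/<\([^<]*\)>/[\1]/g')

# Variation (assumption, not from the text): stop at the first > instead.
stray='<b>a > b</b>'
stop_at_close=$(printf '%s\n' "$stray" | sed 's/<\([^>]*\)>/[\1]/g')

printf '%s\n%s\n%s\n' "$greedy" "$bounded" "$stop_at_close"
```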

Remember, grep will perform a case-insensitive search if you provide the -i flag. Standard sed, unfortunately, does not have such an option, although some implementations, such as GNU sed, accept an I flag as an extension. To search for title in a case-insensitive manner portably, form regular expressions using [...], each listing a character of the word in both upper- and lowercase forms:

% sed 's/[Tt][Ii][Tt][Ll][Ee]/title/g' title.html
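Typing those brackets by hand gets tedious for longer words. As a rough sketch, assuming a POSIX shell and awk, the pattern can be generated instead; the function name ci_pattern is hypothetical, and it is only meant for purely alphabetic words.

```shell
# Hypothetical helper: turn "title" into [Tt][Ii][Tt][Ll][Ee].
# Only suitable for alphabetic words; other characters are not escaped.
ci_pattern() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            out = out "[" toupper(c) tolower(c) "]"
        }
        print out
    }'
}

pattern=$(ci_pattern title)

# Double quotes let the shell insert the generated pattern into the script.
normalized=$(printf '<TITLE>Hi</Title>\n' | sed "s/$pattern/title/g")

printf '%s\n%s\n' "$pattern" "$normalized"
```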

2.4.5 See Also

man grep
man sed
man re_format (regular expressions)
"sed & Regular Expressions" (http://main.rtfiber.com.tw/~changyj/sed/)
Cool sed tricks (http://www.wagoneers.com/UNIX/SED/sed.html)
The sed FAQ (http://doc.ddart.net/shell/sedfaq.htm)
The sed Script Archive (http://sed.sourceforge.net/grabbag/scripts/)