Searching and Replacing Throughout Multiple Documents with Sed
Back in Chapter 6, we talked about sed and how to use it to search and replace throughout files, one file at a time. Although we're sure you're still coming down off of the power rush from doing that, we'll now show you how to combine sed with shell scripts and loops. In doing this, you can take your search-and-replace criteria and apply them to multiple documents. For example, you can search through all of the .html documents in a directory and make the same change to all of them. In this example (Figure 17.2), we strip out all of the <BLINK> tags, which are offensive to some HTML purists.
Figure 17.2. Create a script to search and replace in multiple documents.
Before you get started, you might have a look at Chapter 6 for a review of sed basics and Chapter 10 for a review of scripts and loops.
To Search and Replace Throughout Multiple Documents:
Use the editor of your choice to create a new script. Name the file whatever you want.
#! /bin/sh
Start the shell script with the name of the program that should run the script.
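for i in `ls -1 *.htm*`
Start a loop. In this case, the loop will process all of the .htm or .html documents in the current directory. Note that the ls command is enclosed in backquotes (`), not regular single quotes. The backquotes tell the shell to run ls and hand its output to the loop as the list of filenames.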
do
Indicate the beginning of the loop content.
cp $i $i.bak
Make a backup copy of each file before you change it. Remember, Murphy is watching you.
sed "s/<\/*BLINK>//g" $i > $i.new
Specify your search criteria and replacement text. A lot is happening in this line, but don't panic. From the left, this command contains sed followed by
- ", which starts the command.
- s/, which tells sed to substitute (that is, to search for something and replace it).
- <, which is the first character to be searched for.
- \/, which allows you to search for the /. (The \ escapes the / so the / can be used in the search.)
- *, which specifies zero or more of the previous character (/). This takes care of both the opening and closing tags (with and without a / at the beginning).
- BLINK>, which indicates the rest of the text to search for. Note that this searches only for capital letters. You'll want to add a line if your HTML document might use lowercase tags.
- //, in which the first / ends the search section and the second / ends the replace section (there's nothing between them because the tag will be replaced with nothing).
- g, which tells sed to make the change in all occurrences (globally), not just in the first occurrence on each line.
- ", which closes the command.
- $i is replaced with each filename in turn as the loop runs.
- > $i.new indicates that the output is redirected to a new filename. (See Code Listing 17.3)
mv $i.new $i
Move the new file back over the old file.
echo "$i is done!"
Optionally, print a status message onscreen, which can be reassuring if there are a lot of files to process.
done
Indicate the end of the loop.
Save and close out of your script.
Try it out.
Remember to make your script executable with chmod u+x and the filename, and then run it with ./thestinkinblinkintag. In our example, we'll see the "success reports" for each of the HTML documents processed (Code Listing 17.3).
Code Listing 17.3. You can even use sed to strip out bad HTML tags, as shown here.
[ejr@hobbes scripting]$ more thestinkinblinkintag
#! /bin/sh
for i in `ls -1 *.htm*`
do
cp $i $i.bak
sed "s/<\/*BLINK>//g" $i > $i.new
mv $i.new $i
echo "$i is done!"
done
[ejr@hobbes scripting]$ chmod u+x thestinkinblinkintag
[ejr@hobbes scripting]$ ./thestinkinblinkintag
above.htm is done!
file1.htm is done!
file2.htm is done!
html.htm is done!
temp.htm is done!
[ejr@hobbes scripting]$
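If you'd like to see what the sed expression does before you turn it loose on real files, a quick sketch like the following might help. The sample line is made up for illustration, and the second -e expression is only one possible way to catch lowercase tags; it isn't part of the script above.

```shell
#!/bin/sh
# Made-up sample line with both uppercase and lowercase BLINK tags.
sample='<P><BLINK>Sale!</BLINK> Ends <blink>soon</blink>.</P>'

# The expression from the script: removes <BLINK> and </BLINK> only.
upper_only=$(printf '%s\n' "$sample" | sed "s/<\/*BLINK>//g")

# Assumption: a second expression handles the lowercase forms as well.
both_cases=$(printf '%s\n' "$sample" |
    sed -e "s/<\/*BLINK>//g" -e "s/<\/*blink>//g")

printf '%s\n' "$upper_only"
printf '%s\n' "$both_cases"
```

The first result still contains the lowercase tags, which is the limitation noted in the step-by-step breakdown. The second result has both sets removed. Mixed-case tags such as <Blink> would need yet another expression or a bracketed pattern.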
You could perform any number of other operations on the files within the loop, if you wanted. For example, you could strip out other codes, use tidy as shown in the previous section, replace a former Webmaster's address with your own, or automatically insert comments and last-update dates.
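As one illustration of those other operations, here is a minimal sketch of the same loop adapted to replace a Webmaster's address. The two addresses, the replace_address function name, and the temporary demo directory are all hypothetical, chosen only so the example can run on its own. This variant also lets the shell expand *.htm* directly instead of calling ls, quotes the filenames, and skips leftover .bak and .new files, which would otherwise match *.htm* on a second run.

```shell
#!/bin/sh
# Hypothetical addresses, for illustration only.
# The dot in the old address is escaped so sed treats it literally.
old='oldmaster@example\.com'
new='webmaster@example.com'

# Replace $old with $new in every .htm/.html file in directory $1.
replace_address() {
    for i in "$1"/*.htm*
    do
        case $i in
            *.bak|*.new) continue ;;   # skip backups from earlier runs
        esac
        [ -f "$i" ] || continue        # skip if the pattern matched nothing
        cp "$i" "$i.bak"               # back up first, as in the original script
        # Use | as the sed delimiter so no slashes need escaping.
        sed "s|$old|$new|g" "$i" > "$i.new"
        mv "$i.new" "$i"
        echo "$i is done!"
    done
}

# Try it on a throwaway directory rather than on real documents.
demo_dir=$(mktemp -d)
printf '%s\n' '<A HREF="mailto:oldmaster@example.com">Mail us</A>' \
    > "$demo_dir/index.html"
replace_address "$demo_dir"
```

Once you're happy with the results in the demo directory, you could call replace_address with the directory that holds your own documents.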