#79 Verifying Spelling on Web Pages

This script, webspell, is an amalgamation of ideas presented in earlier scripts, particularly Script #27, Adding a Local Dictionary to Spell, which demonstrates how to interact with the aspell spelling utility and how to filter its reported misspellings through your own list of additional acceptable words. It relies on the lynx program to pull all the text out of the HTML of a page, either local or remote, and then feeds the resulting text to ispell, aspell, or an equivalent spelling program.

The Code

    #!/bin/sh

    # webspell - Uses the spell feature + lynx to spell-check either a
    #   web page URL or a file.

    # Inevitably you'll find that there are words it flags as wrong but
    #   you think are fine. Simply save them in a file, one per line, and
    #   ensure that 'okaywords' points to that file.

    okaywords="$HOME/bin/.okaywords"
    tempout="/tmp/webspell.$$"
    trap "/bin/rm -f $tempout" 0

    if [ $# -eq 0 ] ; then
      echo "Usage: webspell file|URL" >&2; exit 1
    fi

    for filename
    do
      if [ ! -f "$filename" -a "$(echo $filename|cut -c1-7)" != "http://" ]
      then
        continue       # picked up directory in '*' listing
      fi

      # Adjust the ispell line below to produce just a list of misspelled words
      lynx -dump "$filename" | tr ' ' '\n' | sort -u | \
        grep -vE "(^[^a-z]|')" | \
        ispell -a | awk '/^\&/ { print $2 }' | \
        sort -u > $tempout

      if [ -r $okaywords ] ; then
        # If you have an okaywords file, screen okay words out
        grep -vif $okaywords < $tempout > ${tempout}.2
        mv ${tempout}.2 $tempout
      fi

      if [ -s $tempout ] ; then
        echo "Probable spelling errors: ${filename}"
        cat $tempout | paste - - - - | sed 's/^/  /'
      fi
    done

    exit 0

How It Works

Using the helpful lynx command, this script extracts just the text from each of the specified pages and then feeds the result to a spell-checking program (ispell in this example, though it works just as well with aspell or another spelling program). See Script #25, Checking the Spelling of Individual Words, for more information about different spell-checking options in Unix.

Notice the file existence test in this script too:

    if [ ! -f "$filename" -a "$(echo $filename|cut -c1-7)" != "http://" ]

It can't just fail if the given name isn't a regular file, because $filename might actually be a URL, so the test becomes rather complex: an argument is skipped only if it is neither an existing file nor a string whose first seven characters are http://. When given filenames, the script works properly with invocations like webspell *, though you'll get better results with a filename wildcard that matches only HTML files. Try webspell *html instead.
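The same file-or-URL decision can also be written with a case statement, which some readers find easier to follow. The following is only a sketch: the helper name is_checkable is hypothetical and not part of the original script, and accepting https:// addresses is an added assumption (the original test recognizes only http://).

```shell
#!/bin/sh
# is_checkable: hypothetical helper, not part of the original webspell.
# Succeeds (exit status 0) if the argument looks like a web URL or is an
# existing regular file; fails otherwise (for example, for a directory).
is_checkable()
{
  case "$1" in
    http://*|https://*) return 0 ;;   # assumption: https is accepted too
  esac
  [ -f "$1" ]                         # otherwise it must be a regular file
}

# Small demonstration with made-up arguments
for name in "http://www.example.com/" "https://www.example.com/" "/nonexistent/dir"
do
  if is_checkable "$name" ; then
    echo "check: $name"
  else
    echo "skip:  $name"
  fi
done
```

Inside the main loop of webspell, a test like this could stand in for the combined [ ... -a ... ] expression, with continue in the failing branch.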

Whichever spell-checking program you use, you'll need to ensure that the result of the following line is a list only of misspelled words, with none of the spell-checking utility's special formatting included:

    ispell -a | awk '/^\&/ { print $2 }' | \

This spell line is just one part of a fairly complex pipeline. The pipeline extracts the text from the page, translates it to one word per line (the tr invocation), sorts the words, and ensures that each one appears only once (sort -u). After the sort operation, grep screens out every line that doesn't begin with a lowercase letter (that is, punctuation, HTML tags, capitalized words, and other content), along with any word that contains an apostrophe. The next stage runs the data stream through the spell utility, using awk to extract the misspelled word from the oddly formatted ispell output. The results go through another sort -u invocation, are screened against the okaywords list with grep, and are formatted for attractive output with paste (which produces four words per line in this instance).
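To see what the individual stages do without needing lynx or ispell installed, the pipeline can be tried on canned input. This is a sketch under stated assumptions: the function names candidate_words and misspelled_words are hypothetical, the sample ispell -a lines are hand-written to imitate its "& word count offset: suggestions" format, and LC_ALL=C is set so that sort order and the [a-z] range behave predictably.

```shell
#!/bin/sh
# Assumption: force the C locale so sort order and [a-z] are predictable.
LC_ALL=C; export LC_ALL

# Stage 1 (hypothetical name): one word per line, duplicates removed, then
# drop lines that don't start with a lowercase letter or contain an apostrophe.
candidate_words()
{
  tr ' ' '\n' | sort -u | grep -vE "(^[^a-z]|')"
}

# Stage 2 (hypothetical name): keep only the "&" lines of ispell -a style
# output and print their second field, which is the misspelled word.
misspelled_words()
{
  awk '/^&/ { print $2 }' | sort -u
}

echo "Candidate words:"
printf '%s\n' "the quick brwn fox isn't <b>quick</b> Fox" | candidate_words

echo "Misspelled words:"
# Hand-written sample imitating ispell -a output; not captured from a real run.
misspelled_words <<'EOF'
@(#) International Ispell Version 3.1.20
*
& brwn 3 5: brown, brew, born
& teh 4 10: the, tea, ten, eh
EOF
```

Note that the okaywords screening step uses grep -vif, which matches substrings: an okay word such as web would also suppress webspell. Adding the -x flag to require whole-line matches is one possible tightening, though that is a suggestion rather than what the script does.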

Running the Script

This script can be given one or more web page URLs or a list of HTML files. To check the spelling of all HTML files in the current directory, for example, use *.html as the argument.

The Results

    $ webspell http://www.clickthrustats.com/index.shtml *.html
    Probable spelling errors: http://www.clickthrustats.com/index.shtml
      cafepress     microurl        signup  urlwire
    Probable spelling errors: 074-contactus.html
      webspell      werd

In this case, the script checked a web page on the network from the Click-ThruStats.com site and five local HTML pages, finding the errors shown.

Hacking the Script

It would be a simple change to have webspell invoke the shpell utility presented in Script #26, but correcting very short words can be dangerous, because they might overlap phrases or the content of an HTML tag, JavaScript snippet, and so forth. Some caution is probably in order.

Also worth considering, if you're obsessed with keeping misspellings from creeping into your website: by correcting genuine misspellings and adding valid words to the okaywords file, you can reduce the output of webspell to nothing, and then drop it into a weekly cron job to catch and report new misspellings automatically.
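One way the weekly job might look is sketched below. Everything specific here is an assumption made for illustration: the wrapper name weeklyspell, the function report_if_any, the $HOME/website directory, and the Monday 06:00 schedule. Because cron normally mails any output a job produces to the job's owner, staying silent when there is nothing to report is all that is needed; the wrapper only adds a dated header.

```shell
#!/bin/sh
# weeklyspell - hypothetical wrapper around webspell for use from cron.
# Prints a dated report only when the wrapped command produces output.

report_if_any()
{
  output="$("$@" 2>&1)"          # run the given command, capture its output
  if [ -n "$output" ] ; then
    echo "Spelling report for $(date +%Y-%m-%d):"
    echo "$output"
  fi
}

# Only run the check if webspell is actually installed on this system.
if command -v webspell >/dev/null 2>&1 ; then
  report_if_any webspell *.html
fi

# Example crontab entry (assumed schedule and paths: Mondays at 06:00):
#   0 6 * * 1  cd $HOME/website && $HOME/bin/weeklyspell
```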




Wicked Cool Shell Scripts. 101 Scripts for Linux, Mac OS X, and Unix Systems
ISBN: 1593270127
Year: 2004
Pages: 150
Authors: Dave Taylor