Quite often, HTML pages contain hyperlinks or references to e-mail addresses. The <A HREF=mailto:address@server> tag is used for specifying a hyperlink for sending e-mail. Clicking on such a link generally will cause an e-mail client to pop up and allow the user to compose a message to the recipient specified in the mailto: hyperlink.
E-mail addresses can also appear in HTML comments, especially those that specify who is in charge of maintaining a particular section or page. Extracting e-mail addresses from HTML is a simple task: all it involves is searching for the @ character within the HTML text and reading off the address around it.
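As a rough illustration of how little work this takes, the following Python sketch pulls address-like strings out of a block of HTML. The function name extract_emails, the sample page, and the regular expression are assumptions made for this example; the pattern is deliberately simple and does not cover every valid address form.

```python
import re

# Deliberately simple pattern (an assumption for this sketch):
# a local part, the @ character, then a domain containing at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(html):
    """Return the unique address-like strings in html, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical page fragment: one mailto: hyperlink and one HTML comment.
sample = """<A HREF=mailto:webmaster@example.com>Contact</A>
<!-- page maintained by jdoe@example.org -->"""

print(extract_emails(sample))
```

Because the search works on the raw text, it picks up addresses in mailto: links and in comments alike. A site owner can run the same kind of check against his or her own pages to see which addresses are exposed.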
Unsolicited bulk e-mail, unsolicited commercial e-mail, junk e-mail, or spam (whatever you call it) wastes a great deal of bandwidth and annoys Internet users. At Foundstone, there are days when we receive as many junk e-mails as e-mails from valid users.
Recently, "e-mail harvesting" has become a profitable activity for some companies that operate bulk e-mail servers and, for a fee, send out mass e-mailings of advertisements for small businesses. CDs containing millions of gathered and categorized e-mail addresses are sold in certain places. Such e-mail lists and directories are compiled by Web crawler programs written specifically to gather e-mail addresses from Web pages.
Essentially, all e-mail harvesting programs operate on the same principle: download the HTML code of a Web page, extract e-mail addresses by looking for patterns containing the @ character, recursively follow the other hyperlinks within the Web page, and repeat the process.
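The loop just described can be sketched in a few lines of Python, which helps show why addresses published on a Web page are so easily collected. Everything named here is an assumption for illustration: the harvest function, the fetch callable, the max_pages limit, and the in-memory PAGES dictionary that stands in for a Web site so that the sketch runs without network access. The sketch also uses a queue rather than true recursion, which visits the same pages without risking deep call stacks.

```python
import re
from urllib.parse import urljoin

# Simplified patterns (assumptions for this sketch, not a full HTML parser).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
HREF_RE = re.compile(r"href\s*=\s*[\"']?([^\"'\s>]+)", re.IGNORECASE)


def harvest(start_url, fetch, max_pages=50):
    """Walk pages reachable from start_url and collect address-like strings.

    fetch is any callable that returns the HTML for a URL, or None if the
    page is unavailable.
    """
    queue = [start_url]
    visited = set()
    found = set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        # Step 1: extract anything that looks like an e-mail address.
        found.update(EMAIL_RE.findall(html))
        # Step 2: queue the other hyperlinks on the page for the same treatment.
        for href in HREF_RE.findall(html):
            if href.lower().startswith("mailto:"):
                continue  # already captured by EMAIL_RE
            queue.append(urljoin(url, href))
    return found


# Hypothetical two-page site held in memory in place of real downloads.
PAGES = {
    "http://site.example/": (
        '<a href="about.html">About</a> '
        '<a href="mailto:info@site.example">Mail us</a>'
    ),
    "http://site.example/about.html": (
        '<!-- maintainer: admin@site.example --> <a href="/">Home</a>'
    ),
}

print(sorted(harvest("http://site.example/", PAGES.get)))
```

The visited set keeps the walk from looping when pages link back to each other, as the two sample pages do.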