Hypertext and hypermedia DM can be characterized as mining data that includes text, hyperlinks, text markup, and various other forms of hypermedia information. As such, it is closely related to both Web mining and multimedia mining; these are covered separately in this section, but in practice the three overlap considerably in content and applications. While the WWW is substantially composed of hypertext and hypermedia elements, there are other kinds of hypertext/hypermedia data sources, including online catalogues, digital libraries, online information DBs, and hyperlink and inter-document structures.
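To make the kinds of data involved concrete, the following minimal sketch separates a single HTML page into visible text, hyperlink targets, and markup element names using Python's standard `html.parser` module. The class name `HypertextExtractor`, the helper `extract`, and the sample page are illustrative assumptions, not part of any system discussed in this section.

```python
from html.parser import HTMLParser


class HypertextExtractor(HTMLParser):
    """Collects text, hyperlinks, and markup tags from one HTML page (illustrative)."""

    def __init__(self):
        super().__init__()
        self.text = []   # visible text fragments
        self.links = []  # hyperlink targets (href values of <a> elements)
        self.tags = []   # markup element names, in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            # attrs is a list of (name, value) pairs; keep non-empty href values
            self.links.extend(v for name, v in attrs if name == "href" and v)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def extract(html):
    """Return the text, links, and tags found in an HTML string."""
    parser = HypertextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return {
        "text": " ".join(parser.text),
        "links": parser.links,
        "tags": parser.tags,
    }


# Hypothetical sample page, such as an entry in an online catalogue
page = '<h1>Catalogue</h1><p>See the <a href="/items/42">item page</a></p>'
features = extract(page)
```

Each of the three outputs feeds a different kind of analysis: the text supports content-based methods, the links describe inter-document structure, and the tags carry markup information.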
Some of the important DM techniques used for hypertext and hypermedia DM include classification (supervised learning), clustering (unsupervised learning), semi-supervised learning, and social network analysis. In classification, or supervised learning, the process starts by reviewing training data in which items are marked as belonging to a certain class or group. These data are the basis on which the algorithm is trained (Chakrabarti, 2000). Clustering, or unsupervised learning, differs in that it uses no labeled training data: it builds hierarchies of documents based on similarity and organizes the documents according to those hierarchies. Semi-supervised learning and social network analysis are other methods that are important to hypermedia-based DM. Semi-supervised learning applies when both labeled and unlabeled documents are available and there is a need to learn from both. Social network analysis, which examines networks formed through collaborative association (Larson, 1996; Mizruchi, Mariolis, Schwartz, & Mintz, 1986), is also applicable because the Web can itself be regarded as a social network. Other research in the area of hypertext DM includes work on distributed hypertext resource discovery (Chakrabarti, van den Berg, & Dom, 1999).
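The contrast between classification and clustering can be illustrated with a small sketch. It assumes a simple bag-of-words document representation with cosine similarity, a nearest-centroid rule for classification, and greedy agglomerative merging for clustering. These are illustrative choices, not the specific algorithms of the works cited above, and all function names and sample documents are hypothetical.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased term-frequency vector for a document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train_centroids(labeled_docs):
    """Supervised step: accumulate term counts of the training documents per class."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids


def classify(text, centroids):
    """Assign the class whose centroid is most similar to the document."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def agglomerate(docs):
    """Unsupervised step: repeatedly merge the two most similar clusters.

    No labels are used. The returned merge history describes a hierarchy
    over the document indices.
    """
    clusters = {i: bag_of_words(d) for i, d in enumerate(docs)}
    members = {i: (i,) for i in clusters}
    merges = []
    while len(clusters) > 1:
        a, b = max(
            ((x, y) for x in clusters for y in clusters if x < y),
            key=lambda pair: cosine(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((members[a], members[b]))
        clusters[a] = clusters[a] + clusters.pop(b)
        members[a] = members[a] + members.pop(b)
    return merges


# Classification needs documents already marked with a class.
training = [
    ("football match goal score", "sports"),
    ("team wins league match", "sports"),
    ("stock market shares fall", "finance"),
    ("bank raises interest rates market", "finance"),
]
centroids = train_centroids(training)
predicted = classify("goal in the final match", centroids)

# Clustering receives only the documents themselves.
hierarchy = agglomerate([
    "football match goal",
    "football match score",
    "stock market shares",
    "stock market crash",
])
```

The classifier can only return labels that appear in its training data, whereas the clustering routine produces groupings with no names attached; interpreting those groups is left to the analyst.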