Chapter 17

Overview

interMedia is a collection of features, tightly integrated into the Oracle database, that enable you to load rich content into the database, securely manage it, and ultimately, deliver it for use in an application. Rich content is widely used in most web applications today, and includes text, data, images, audio files, and video.

This chapter focuses on one of my favorite components of interMedia: interMedia Text. I find that interMedia Text is typically an under-utilized technology. This stems from a lack of understanding of what it is, and what it does. Most people understand at a high level what interMedia Text can do, and how to 'text-enable' a table. Once you delve beneath the surface though, you'll discover that interMedia Text is a brilliantly crafted feature of the Oracle database.

After a quick overview of interMedia's history we will:

Discuss the potential uses of interMedia Text, such as searching for text, indexing data from many different sources, and searching XML applications, amongst others.
Take a look at how the database actually implements this feature.
Cover some of the mechanisms of interMedia Text such as indexing, the ABOUT operator, and section searching.

A Brief History

During my work on a large development project in 1992, I had my first introduction to interMedia Text or, as it was called then, SQL*TextRetrieval. At that time, one of my tasks was to integrate a variety of different database systems in a larger, distributed network of databases. One of these database products was proprietary in every way possible. It lacked a SQL interface for database definition and for access to the large amounts of textual data it held. Our job was to write a SQL interface to it.

Sometime in the middle of this development project, our Oracle sales consultant delivered information about the next generation of Oracle's SQL*TextRetrieval product, to be called TextServer3. One of the advantages of TextServer3 was that it was highly optimized for the client-server environment. Along with TextServer3 came a somewhat obscure C-based interface, but at least now I had the ability to store all of my textual data within an Oracle database, and also to access other data within the same database via SQL. I was hooked.

In 1996, Oracle released the next generation of TextServer, called ConText Option, which was dramatically different from previous versions. No longer did I have to store and manage my textual content via a C or Forms-based API. I could do everything from SQL. ConText Option provided many PL/SQL procedures and packages that enabled me to store text, create indexes, perform queries, perform index maintenance, and so on, and I never had to write a single line of C. Of the many advances ConText Option delivered, two in my opinion are most noteworthy. First and foremost, ConText Option was no longer on the periphery of database integration. It shipped with Oracle7, and was a separately licensed, installable option to the database, and for all practical purposes, was tightly integrated with the Oracle7 database. Secondly, ConText Option went beyond standard text-retrieval, and offered linguistic analysis of text and documents, enabling an application developer to build a system that could 'see' beyond just the words and actually exploit the overall meaning of textual data. And don't forget that all of this was accessible via SQL, which made use of these advanced features dramatically easier.

One of the advanced features of the Oracle8i database is the extensibility framework. Using the services provided, developers were now given the tools to create custom, complex data types, and also craft their own core database services in support of these data types. With the extensibility framework, new index types could be defined, custom statistics collection methods could be employed, and custom cost and selectivity functions could be integrated into an Oracle database. Using this information, the query optimizer could now intelligently and efficiently access these new data types. The ConText Option team recognized the value of these services, and went about creating the current product, interMedia Text, which was first released with Oracle8i in 1999.

Uses of interMedia Text

There are countless ways in which interMedia Text can be exploited in applications, including:

Searching for text - You need to quickly build an application that can efficiently search textual data.
Managing a variety of documents - You have to build an application that permits searching across a mix of document formats, including text, Microsoft Word, Lotus 1-2-3, and Microsoft Excel.
Indexing text from many data sources - You need to build an application that manages textual data not only from an Oracle database, but also from a file system as well as the Internet.
Building applications beyond just text - Beyond searching for words and phrases, you are tasked to build a knowledge base with brief snapshots or 'gists' about each document, or you need to classify documents based upon the concepts they discuss, rather than just the words they contain.
Searching XML applications - interMedia Text gives the application developer all the tools needed to build systems which can query not only the content of XML documents, but also perform these queries confined to a specific structure of the XML document.

And, of course, the fact that this functionality is in an Oracle database means that you can exploit the inherent scalability and security of Oracle, and apply this to your textual data.

Searching for Text

There are, of course, a number of ways to search for text within the Oracle database without using the interMedia functionality. In the following example, we create a simple table, insert a couple of rows, and then use the standard INSTR function and LIKE operator to search the text column in the table:

SQL> create table mytext
  2  ( id      number primary key,
  3    thetext varchar2(4000)
  4  )
  5  /

Table created.

SQL> insert into mytext
  2  values( 1, 'The headquarters of Oracle Corporation is ' ||
  3             'in Redwood Shores, California.');

1 row created.

SQL> insert into mytext
  2  values( 2, 'Oracle has many training centers around the world.');

1 row created.

SQL> commit;

Commit complete.

SQL> select id
  2    from mytext
  3   where instr( thetext, 'Oracle') > 0;

        ID
----------
         1
         2

SQL> select id
  2    from mytext
  3   where thetext like '%Oracle%';

        ID
----------
         1
         2

Using the SQL INSTR function, we can search for the occurrence of a substring within another string. Using the LIKE operator we can also search for patterns within a string. There are many times when the use of the INSTR function or LIKE operator is ideal, and anything more would be overkill, especially when searching across fairly small tables.

However, these methods of locating text will typically result in a full table scan, and they tend to be very expensive in terms of resources. Furthermore, they are actually fairly limited in functionality. They would not, for example, be of use if you needed to build an application that answered questions such as:

Such queries just scratch the surface of what cannot be done via traditional means, but which can easily be accomplished through the use of interMedia Text. In order to demonstrate how easily interMedia can answer questions such as those posed above, we first need to create a simple interMedia Text index on our text column:
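A minimal sketch of that step follows, assuming interMedia Text is installed and the current user is allowed to create indexes of this type. The index name mytext_idx is chosen here purely for illustration.

```sql
-- Create an interMedia Text index on the THETEXT column of MYTEXT.
-- MYTEXT_IDX is an illustrative name; CTXSYS.CONTEXT is the index type.
create index mytext_idx
    on mytext( thetext )
    indextype is ctxsys.context;
```

Apart from the INDEXTYPE clause, the statement reads like any other CREATE INDEX.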

By creating an index of the new index type, CTXSYS.CONTEXT, we have 'Text-enabled' our existing table. We can now make use of the variety of operators that interMedia Text supports, for sophisticated handling of textual content. The following examples demonstrate the use of the CONTAINS query operator to answer three of the above four questions (don't worry about the intricacies of the SQL syntax for now, as this will be explained a little later):
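As a sketch of the kind of query CONTAINS makes possible, the statements below run against the mytext rows inserted earlier and assume a CTXSYS.CONTEXT index exists on the thetext column. The search terms, the proximity distance of 10, and the score label 1 are illustrative choices, not necessarily the exact queries the text has in mind.

```sql
-- Proximity search: rows where 'Oracle' appears near 'Corporation'
-- (here, within a span of 10 tokens).
select id
  from mytext
 where contains( thetext, 'near((Oracle,Corporation),10)' ) > 0;

-- Relevance ranking: rows containing either term, best match first.
-- The label 1 ties SCORE(1) to the matching CONTAINS call.
select score(1), id
  from mytext
 where contains( thetext, 'oracle or california', 1 ) > 0
 order by score(1) desc;

-- Stem search: rows containing a word that shares the linguistic
-- root of 'train' (the $ prefix is the stem operator).
select id
  from mytext
 where contains( thetext, '$train' ) > 0;
```

In each case CONTAINS returns a score, and a row qualifies when that score is greater than zero. None of these searches could be expressed with INSTR or LIKE alone.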
