8.4 Java Specification Request (JSR)

Internet-Enabled Business Intelligence
By William A. Giovinazzo
Chapter 8.  Java


All languages, with one glaring exception, have their dialects. When I was an undergraduate, it took me several years to decide what I wanted to do with my life. (Oh no, here comes another story!) During that time, I took a variety of classes, one of which was Italian. My mother was Sicilian and my father's family was from Calabria. The Italian spoken in our home was consequently a combination of the two dialects. When I went to school, we were taught the Florentine dialect, a variation that lacked, in my opinion, the passion and flair of the more southern tongues to which I had grown accustomed. In many cases, the two were so different that they were barely recognizable as the same language. Though they began as a common language, the dialects were separated long enough that they evolved in different ways.

Languages such as Cobol, C, Smalltalk, and even Fortran all have their own dialects. Programming languages, just like human speech, change and evolve as people use them. The way one vendor or operating system implements C++ deviates from another. At times, these deviations are referred to as extensions. Roughly defined, extensions are a vendor's way of taking something standard and making it specific to its own environment. As a result, software developers become locked into a particular platform. This is the antithesis of all of Java's objectives.

Java is the one glaring exception to this rule of language. The reason for this is Sun Microsystems. Wisely, Sun has stated quite simply to the market, "Java is ours and this is how it works." There are no extensions to the language. Java, the Popeye of programming languages, is what it is and that is all that it is. Now, I said Sun was wise in this decision, and this isn't because I am looking to get a job from Scott McNealy. Rather, it avoids the problems associated with different vendors modifying the language to the extent that it loses its portability and platform-independence. While the need for control is recognized, there is also a realization that environments, situations, and attitudes change. A word or phrase may have one meaning in one generation and a completely different connotation in another. Programming languages evolve in the same way. The original definition of a language could not possibly anticipate environmental changes or new applications. JSRs are the way in which the language can grow to meet changing conditions.

A JSR is a formal request to extend the Java specification to include new functionality within the language. A JSR is adopted through the Java Community Process (JCP). The JCP is composed of the following steps:

  1. JSR Review Community members define a specification, which is approved by the Executive Committee (EC). The EC is composed of stakeholders and members of the Java community. Currently there are two executive committees: one responsible for desktop/server technologies and one responsible for consumer/embedded technologies.

  2. Public Review The JSR draft is reviewed by the public at large. An Expert Group further revises the JSR based on the response from the public.

  3. Proposed Final Draft A final draft is composed based on the revisions from the public review. This serves as the basis for the Reference Implementation (RI) and the Technology Compatibility Kit (TCK). The RI is a proof of concept prototype that demonstrates the capabilities of the specification. The TCK is a package that assists in testing whether a specific implementation is compliant with the specification.

  4. Final Release The RI and the TCK are completed, and the specification is sent to the EC for approval.

  5. Maintenance Review During this phase, the RI and the TCK are updated to meet enhancements, revisions, and requests for clarification and interpretation.

The importance of this process is that it controls what is and isn't Java. By having a central committee, the controlled evolution of Java maintains the design objectives of the language. In the following subsections, we discuss two JSRs specifically related to IEBI. Since these JSRs are not yet in final release, this discussion merely serves as an introduction to what is being proposed, without delving into the details of the implementations.

8.4.1 JSR-73: THE JAVA DATA MINING API

In Chapter 3, we discussed data mining and the different types of data mining applications. To date, the interfaces to these applications have been vendor-dependent, which is contrary to Java's objective of platform-independence. To maintain this independence, the Java community is in the process of establishing a standard Application Program Interface (API) for data mining systems that is free of any specific implementation. JSR-73, although not yet finalized, defines the Java Data Mining (JDM) API. Client applications written against this single API run against any back-end system that supports the standard. Although the standard has a required baseline functionality, there are optional packages within the standard. For example, some vendors may support classification models using decision tree-based algorithms, while others support only clustering using the k-means algorithm. As one can well imagine, the breadth of data mining functions and algorithms requires this type of approach. How an individual vendor implements the API is at the discretion of the vendor. A vendor can implement JDM as an API native to the application or develop a driver/adapter scheme. In the latter, the driver/adapter mediates between the JDM layer and multiple vendor products.
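The driver/adapter scheme can be sketched as follows. None of these types come from JSR-73 itself; the names (`MiningEngine`, `JdmFacade`, and the vendor adapters) are hypothetical stand-ins that only illustrate how a single client-facing interface can front multiple vendor engines:

```java
// Hypothetical sketch of the driver/adapter scheme: one client-facing
// facade dispatches to whichever vendor adapter is registered.
import java.util.HashMap;
import java.util.Map;

interface MiningEngine {                    // role played by a vendor back end
    String buildModel(String functionName);
}

class VendorAAdapter implements MiningEngine {
    public String buildModel(String functionName) {
        return "vendorA:" + functionName;   // would call vendor A's native API
    }
}

class VendorBAdapter implements MiningEngine {
    public String buildModel(String functionName) {
        return "vendorB:" + functionName;   // would call vendor B's native API
    }
}

public class JdmFacade {
    private static final Map<String, MiningEngine> engines = new HashMap<>();
    static {
        engines.put("A", new VendorAAdapter());
        engines.put("B", new VendorBAdapter());
    }
    // The client codes to this one method regardless of the back end.
    public static String build(String vendor, String function) {
        return engines.get(vendor).buildModel(function);
    }
    public static void main(String[] args) {
        System.out.println(build("A", "clustering"));
        System.out.println(build("B", "classification"));
    }
}
```

The client never sees the adapters; swapping back ends means registering a different adapter, not rewriting the client.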

We begin the data mining process by preparing the data used to build a model. The validity of the conclusions we draw from the data mining operation, the quality of the output, is a function of the quality of the data that goes into the model. For this reason, the preparation of the data is perhaps the most critical step in the process. We have seen this when discussing the data warehouse: The majority of the work in building a data warehouse is the piece that prepares the data, the extraction, transformation, cleansing, and loading. The same is true for data mining. Generally, about 80 percent of the work in data mining involves preparation of the data. Just as we do in data warehousing, we need to cleanse and normalize that data. In some instances, we may need to define new attributes that are functions or combinations of others.
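As a minimal, hypothetical illustration of this preparation step, the sketch below cleanses one attribute with a simple range rule and derives a new attribute as a function of two others; the rules and attribute names are invented for the example:

```java
// Illustrative data-preparation helpers: one cleansing rule and one
// derived attribute. Both rules are invented for this sketch.
public class DataPrep {
    // Replace an out-of-range age with a sentinel (simple cleansing rule).
    public static int cleanseAge(int age) {
        return (age < 0 || age > 120) ? -1 : age;
    }
    // Derive a new attribute as a function of two existing attributes.
    public static double debtToIncome(double debt, double income) {
        return income == 0 ? 0.0 : debt / income;
    }
    public static void main(String[] args) {
        System.out.println(cleanseAge(200));            // out of range: -1
        System.out.println(debtToIncome(20000, 80000)); // derived: 0.25
    }
}
```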

The next step is to build a model. A model is a scaled down version, or representation, of something else. Because it is a simplification of the actual thing, we are able to analyze and work with the model more easily than with the modeled object. Whatever we learn from the model, we can then apply to the actual thing.

I have to believe that every American male spent a good deal of his adolescence building models. Oh those sweet afternoons spent with nothing more than a box of plastic pieces, a few toothpicks and a tube of model glue. (This was in the days before sniffing.) It was sheer delight to watch those odd shapes form various hot rods, airplanes, and battleships. The only problem was that I wasn't getting my homework done. Sister Robert Francis was quick to point out that I was probably never going to amount to anything by frittering away my time. Then again, she never put together a 1969 Corvette complete with windows that really rolled down!

One year, for Christmas, my older brother did it. He gave me the veritable Mona Lisa of models, the Mount Everest of miniatures. He gave me the "Visible V-8"! It was truly the model that all other models hoped to be. When finished, the engine would run like a real V-8. The pistons would move up and down; the lifts would lift; the fan would fan; the crankshaft would crank. What made this model special was the clear plastic engine block that allowed you to see all the parts working together. Of course, it didn't run on gasoline. Instead, a small electric motor turned the crankshaft and made it look like the engine was running. How astounded my family and friends would be when, once the V-8 was assembled, they could watch for hours the workings of an internal combustion engine. How they would marvel at my skill and expertise. Women would want me! Men would want to be like me!! Mothers would want their daughters to marry me!!! I'd show Sister Francis. All I needed to do was to figure out how to put the thing together.

By modeling a V-8, I grew to understand how an engine works. I couldn't very well have put together a real V-8 engine in my room when I was 11, although some in my neighborhood tried it in their garages. So I took something else, something that was a representation of that engine, a model. I learned from that model how an engine works. I also learned that I didn't want to be an auto mechanic.

Models are similar in the data mining world. The models we built as children were, in a sense, a compact representation of something. Through this representation, I discovered something about the real world; in that way the two kinds of models are alike. Note, however, that there is a difference in how these two types of models take us through this process of discovery. The child's model reduces reality, yet it still looks very much like the real-world object. Data mining, however, creates fully transformed, reduced versions of the data that may look nothing like the actual data. Consider the decision trees we discussed in Chapter 3. We extracted from the data a series of if-then rules that created our tree structure. This is more than just a simplification of the data. Looking at the actual data itself would give us no insight into the knowledge hidden within its complexity.

While we can see that both a child's model and a data mining model seek to describe something, they do so in very different ways. Yet, this is still only one type of data mining model: descriptive. Data mining models can be descriptive, predictive, or both. The descriptive model assists us in understanding the complexities found in the data. The predictive model helps us look into the future and predict some future result. We will see this come into play in later chapters when we discuss the use of data mining in the personalization of our Web site. The data mining model describes the behavior of certain customers and customer types. Based on this description, we will attempt to predict their behavior.

As we can see, the data mining model is very different from the child's model. The child's model is constructed; the data mining model is discovered. The actual process of discovering this model is data mining. Data mining is very different from most applications. In most applications, we create a process that models the real world, the purchasing of a product or the servicing of a customer request, and run data of specific occurrences of these events through the model. In these environments, the data is dynamic and the model is static. In data mining, we reverse this: The data is static and the model is dynamic. We take the static data and from that derive a model. If the data changes to reflect a change in behavior, our model changes.

We can see this process of discovery as we look at the JDM process. We specify function settings that describe the type of problem we are trying to solve. Is it a classification problem or a clustering situation? We then define a task that builds the model using the input data and function settings. The output of the task, the discovery process, is the model itself. Models learn in two ways. The first, supervised learning, requires a known value to be predicted, referred to as the target. The second is unsupervised learning, which uses no such target. Examples of unsupervised learning include clustering and association rules.
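The flow just described might be sketched as follows. The class names here are illustrative, not JSR-73's actual types: function settings state the problem, a task combines them with input data, and supervised settings carry a target while unsupervised settings do not:

```java
// Hypothetical sketch of the JDM flow: settings -> task -> model.
public class MiningFlow {
    enum Function { CLASSIFICATION, CLUSTERING, ASSOCIATION }

    static class FunctionSettings {
        final Function function;
        final String target;          // null for unsupervised learning
        FunctionSettings(Function function, String target) {
            this.function = function;
            this.target = target;
        }
        boolean isSupervised() { return target != null; }
    }

    // Stand-in for the task that performs the discovery step.
    static String buildTask(FunctionSettings settings, String inputData) {
        return settings.function + " model built from " + inputData;
    }

    public static void main(String[] args) {
        FunctionSettings sup = new FunctionSettings(Function.CLASSIFICATION, "churned");
        FunctionSettings unsup = new FunctionSettings(Function.CLUSTERING, null);
        System.out.println(sup.isSupervised());    // true: a target is named
        System.out.println(unsup.isSupervised());  // false: no target
        System.out.println(buildTask(unsup, "CUSTOMERS"));
    }
}
```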

To ensure that the model is truly predicting results and has not merely learned the input data set, we test the model with other known data sets. This gives us a sense of the accuracy of the model. Supervised models and some unsupervised models, such as clustering, can be applied to data to predict target values or make assignments to categories or clusters.

8.4.1.1 Model Building

The first step in the data mining process is for the client to build a model. As we have discussed throughout this chapter, Java is object-oriented. We take advantage of this approach in the construction of our model. We define function settings as noted above. Function settings are built through a series of objects whose attributes are set and methods invoked. In a sense, we objectify the data mining process.

If we think of the data mining process as a group of objects working in conjunction with one another, the first object we encounter is the process, or data mining task, itself. We therefore begin the process with the creation of a data mining task object. When we do this, we define the physical data, mining function settings, algorithm settings, and finally the task. In some instances, we may wish to define map attributes. This is the object that is the embodiment of our data mining process. Envision it as the tool by which we manipulate the data mining process. We use this object to specify the asynchronous processing of tasks. As you can well imagine, data mining tasks can be quite lengthy. It is beneficial, therefore, to be able to run these tasks asynchronously. We can also use this object to terminate or interrupt tasks that are currently processing. The data mining task object also receives the input parameters to the data mining process. There are different kinds of tasks: build, test, apply, import, and export.
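The asynchronous behavior described above can be sketched with Java's standard `java.util.concurrent` package: the task is submitted, a handle to it is returned immediately, and that handle can later be used to wait for, or cancel, the work. The task body below is a trivial stand-in for a real build, test, or apply operation:

```java
// Asynchronous execution of a long-running task using the standard
// java.util.concurrent API; the task body is a stand-in for real mining.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class AsyncTaskDemo {
    public static String runTask(String kind) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // submit() returns immediately; the Future is our handle to the task.
        Future<String> handle = pool.submit(() -> kind + " task completed");
        try {
            return handle.get(5, TimeUnit.SECONDS); // block only for the demo
        } catch (Exception e) {
            handle.cancel(true);  // this is also how a running task is terminated
            return "task interrupted";
        } finally {
            pool.shutdown();
        }
    }
    public static void main(String[] args) {
        System.out.println(runTask("build"));
    }
}
```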

The data mining task object receives two forms of input. The first type of input received by the data mining task is the data to be mined. The physical data object defines the layout of the data. Note that the JDM API allows the inclusion of table and multirecord data types. A table is a simple table in which each record contains a case. Each table column contains a specific variable of the analysis, such as age, gender, or income. In a multirecord data type, the columns of the record assume a specific role. Columns that assume a role contain such elements as customer IDs, sequence numbers, or item names.

The second type of input the task object receives is the commands, or settings, that define how we are going to mine the data. The settings fall into two groups: function settings and algorithm settings. Function settings are the high-level specification for the construction of a model. These are defined by the function settings object. The function settings are high-level enough that the client can identify the type of results desired without having to specify a particular algorithm. A client can specify functions of classification, approximation, attribute importance, association rules, and clustering models. Some parameters for function settings are optional. When parameters are omitted, or when no specific algorithm is named, the back-end data mining process automatically selects the ones that best support the task defined by the client's input parameters. Algorithm settings provide specifications for the specific algorithm that is to be used by the model. These are defined in the algorithm settings object.
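A hypothetical sketch of the two levels of settings, in which the algorithm may be omitted and chosen by the back end (the function and algorithm names here are illustrative, not mandated by the specification):

```java
// Illustrative only: function settings name the problem; algorithm
// settings are optional and defaulted by the back end when absent.
public class SettingsDemo {
    static String chooseAlgorithm(String function, String requestedAlgorithm) {
        if (requestedAlgorithm != null) {
            return requestedAlgorithm;      // advanced user fine-tuned the task
        }
        // Back end picks a default that supports the requested function.
        switch (function) {
            case "classification": return "decision-tree";
            case "clustering":     return "k-means";
            default:               return "engine-default";
        }
    }
    public static void main(String[] args) {
        System.out.println(chooseAlgorithm("clustering", null));        // defaulted
        System.out.println(chooseAlgorithm("clustering", "o-cluster")); // explicit
    }
}
```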

The JDM API, as of this writing, recognizes four data mining functions or algorithms. Each function has its own distinct strengths and weaknesses. The JDM API data mining functions are as follows:

  1. Association Used extensively in marketing for situations such as market basket analysis. This type of data mining operation finds an association between data items. There is the old example of a high correlation between the purchase of disposable diapers and beer. Association rules are a form of unsupervised learning.

  2. Classification Used extensively in customer segmentation and credit analysis. This type of data mining places a record in specified groups. Classification is a form of supervised learning.

  3. Approximation A regression-based method used to predict continuous values, where model quality is measured by the difference between the predicted and actual data sets. This method is based on the concept of regression toward a mean, first introduced by Francis Galton approximately 100 years ago. Approximation is a form of supervised learning.

  4. Clustering Used in retailing when retailers would like to understand similarities in a customer base, such as customer churn. In this function, subjects that share common behaviors and characteristics are grouped together. Clustering is a form of unsupervised learning.

As we can see, these data mining functions fall into two basic categories. The first is descriptive; the data mining method attempts to provide an insight into the characteristics of the data in a more concise manner. The second is predictive. This type of mining predicts some future behavior.
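To make one of these functions concrete, here is a tiny one-dimensional k-means with two clusters, the algorithm mentioned earlier for clustering. This is a sketch of the algorithm family only, not JDM API code: each point is assigned to the nearer center, and each center is then recomputed as the mean of its points:

```java
// Minimal 1-D k-means with k = 2: assign points to the nearer center,
// then recompute each center as the mean of its assigned points.
import java.util.Arrays;

public class KMeans1D {
    public static double[] cluster(double[] points, double c0, double c1, int iterations) {
        for (int it = 0; it < iterations; it++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (double p : points) {
                if (Math.abs(p - c0) <= Math.abs(p - c1)) { sum0 += p; n0++; }
                else                                       { sum1 += p; n1++; }
            }
            if (n0 > 0) c0 = sum0 / n0;   // new center: mean of its cluster
            if (n1 > 0) c1 = sum1 / n1;
        }
        return new double[] { c0, c1 };
    }
    public static void main(String[] args) {
        // Two obvious age groups; the centers converge to 23.0 and 63.0.
        double[] ages = { 21, 23, 25, 61, 63, 65 };
        System.out.println(Arrays.toString(cluster(ages, 20, 70, 10)));
    }
}
```

Real engines handle many dimensions, choose starting centers, and pick k; the assignment/recompute loop is the heart of the method.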

The separation of the algorithm settings from the mining function settings provides for two types of users. One might wonder about the need for two types of settings. The JDM API provides for the needs of many types of users and clients. The function settings serve the needs of the majority of users. More advanced users, those with greater knowledge of the data mining process, can fine-tune data mining tasks with the algorithm settings.

8.4.1.2 Model Testing

The next step in the data mining process is to test the model we have built. Note that testing is applicable to supervised models, those models that have a target. It is not a required step, but helps assess the accuracy of the model. The testing phase evaluates how well the model predicts outcomes. It is important to perform the test with a known data set that is different from the data used to construct the model. We need to understand the predictive qualities of our model. If we test the model with the same data set we used to build the model, we could have a model that has simply memorized the data used in construction. When testing the model with a second known data set, we are performing an out-of-sample test. An out-of-sample test tells us how well the model is able to predict against an unknown data set.
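The out-of-sample test reduces to simple arithmetic: apply the model to a held-out data set with known targets and measure the fraction predicted correctly. The predictions below are invented stand-ins for model output; the accuracy calculation is the point:

```java
// Out-of-sample accuracy: fraction of held-out cases predicted correctly.
public class OutOfSampleTest {
    public static double accuracy(int[] predicted, int[] actual) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == actual[i]) correct++;
        }
        return (double) correct / actual.length;
    }
    public static void main(String[] args) {
        int[] predicted = { 1, 0, 1, 1, 0 };  // model output on held-out cases
        int[] actual    = { 1, 0, 0, 1, 0 };  // known targets
        System.out.println(accuracy(predicted, actual)); // 4 of 5 correct: 0.8
    }
}
```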

With the JDM API, the mining task accepts as input mining data and the mining model. The format of the input data is the same as the input to the model during the construction phase. The results of the test are stored in the test result object. The format of the output is dependent on the type of mining model used. This makes sense. One would not expect the output of an approximation, for example, to be in the same format as the output of a classification model.

8.4.1.3 Applying the Model to Data (Scoring)

The final step in the data mining process is to apply the model to the actual data. Typically, applying is not used for association rules. The data to which we apply the model is possibly a previously unseen data set. This is the actual data for which we intend to make predictions. As such, the data must be preprocessed in the same or a similar way as the build data. By similar, we mean that the data must contain, at a minimum, the same set of attributes used to build the model. For example, if we built a model that uses gender, age, and income to perform a prediction, then the actual data to which the model is applied must contain gender, age, and income. The main emphasis is on the preprocessing. In other words, we need to use the same statistics from the build data transformations on the score and test data.
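The phrase "use the same statistics from the build data" can be made concrete. In this hypothetical helper, the mean and standard deviation computed on the build set are frozen and reused, unchanged, to normalize score-time values:

```java
// Score-time preprocessing reuses statistics frozen from the build set.
public class ScorePrep {
    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }
    // Normalize a score-time value using the build set's mean and std dev,
    // not statistics recomputed on the score data.
    public static double zscore(double value, double buildMean, double buildStdDev) {
        return (value - buildMean) / buildStdDev;
    }
    public static void main(String[] args) {
        double[] buildIncome = { 30000, 50000, 70000 };
        double buildMean = mean(buildIncome);               // 50000.0, frozen
        System.out.println(zscore(70000, buildMean, 10000)); // 2.0
    }
}
```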

The results of the data mining operation are stored in the location specified by the data mining task object. They typically comprise one result per case. For example, we may use classification to predict whether a particular type of individual belongs to one political party or another. In another instance, we may be interested in the probability that a particular household will purchase a product or service. In this case, the data mining operation returns a probability for an entire household, not for a particular individual. In either case, the JDM API allows the user to specify the content of the results returned by the data mining process.

8.4.2 JSR-69: THE JAVA OLAP API

As of this writing, JSR-69 has not been fully described. The OLAP Council's Multidimensional API (MDAPI), however, approximates what we expect to see. The MDAPI provides clients with an object-oriented interface to multidimensional databases. Through this interface, client applications are able to connect to a multidimensional database and query its metadata and data. In keeping with the objectives of the Java language, the MDAPI hides the idiosyncrasies of the underlying database, providing portability and system-independence. In Chapter 3, we discussed OLAP and multidimensionality, so we will not go into the basic principles here. In this section, we specifically discuss the OLAP Council's MDAPI.

We noted how Java abstracts the underlying platform, hiding the variations between platforms. The MDAPI, in the same way, uses an object-oriented approach to abstract the implementation details of the underlying multidimensional database. The application manipulates the objects that are part of the API. There are four basic types of objects in the MDAPI: session, metadata, queries, and drivers. The individual vendors implement objects, but their specification is defined by the OLAP Council. The only exception to this rule is the session object, which is implemented by the OLAP Council. Figure 8.8 presents the relationship between the objects in the MDAPI.

Figure 8.8. MDAPI data model.


Understand that the MDAPI is not a piece of software or a thing purchased from the OLAP Council. The MDAPI is primarily a specification. The OLAP Council publishes the specification, and the OLAP vendors provide the implementation. As part of this implementation, vendors are allowed to extend the MDAPI to include special database features and functions. At first glance, this may seem to be contrary to the basic portability functions of Java. The objects in the interface, however, are implemented as Java interfaces. By specifying the MDAPI as a set of interfaces, the OLAP Council provides vendors flexibility in implementing the underlying methods.

In a typical OLAP session, an application establishes a connection to a database and retrieves the metadata to understand the analysis space represented by the multidimensional database. The application then requests some data, performs some analysis, and possibly queries more data. Each action is performed by the application with a specific MDAPI object.

We establish a connection to the database via the session object. In Figure 8.8, we see that the session object is the root of the MDAPI. It is through this object that applications access the multidimensional database. As we said above, this is the only object that is actually implemented by the OLAP Council and is therefore vendor-independent. The session object is the first MDAPI object to be installed on a platform; subsequent installations of MDAPI objects register with the session object. In this way, the session object is cognizant of all MDAPI instances on a system. This enables the session object to load any driver installed on the system.

The session object method getDriverByName returns an instance of the driver object specified by the driver name. The driver class is the vendor's specific implementation of the MDAPI. It is through the driver that we establish a connection to the database. The connection remains open until it is specifically closed. Since the connection is subordinate to the session, the connection will also be closed if the session terminates. As shown in Figure 8.8, a single session can support zero to many simultaneous connections to a single multidimensional database. The connection object is the object through which the application interacts with the multidimensional database. It provides for metadata navigation, description of the data source capabilities and policies, and the maintenance of the connection itself.
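The registration scheme described above resembles JDBC's DriverManager and can be sketched as follows. The types here are illustrative stand-ins for the MDAPI's session and driver objects, not the specification's actual classes:

```java
// Sketch of driver registration: installed drivers register with the
// session, which hands any of them back by name (cf. JDBC's DriverManager).
import java.util.HashMap;
import java.util.Map;

interface MdDriver {
    String connect(String database);   // would open a vendor connection
}

public class SessionDemo {
    private static final Map<String, MdDriver> drivers = new HashMap<>();

    // Each installed driver registers itself with the session object,
    // so the session is aware of every driver on the system.
    public static void register(String name, MdDriver driver) {
        drivers.put(name, driver);
    }

    public static MdDriver getDriverByName(String name) {
        return drivers.get(name);
    }

    static {
        // A hypothetical vendor driver registering at installation time.
        register("acmeOlap", db -> "connected to " + db + " via acmeOlap");
    }

    public static void main(String[] args) {
        MdDriver driver = getDriverByName("acmeOlap");
        System.out.println(driver.connect("sales_cube"));
    }
}
```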

We create multidimensional analysis spaces, or hypercubes, with the MDAPI via the connection object. We could in fact think of the connection object as a hypercube. It is composed of dimensions of which one and only one is a measures dimension. Each cell contains values identified by value descriptors and defined by a combination of one member from each of the dimensions. To query this cube, we create member queries.

To understand a member query a bit better, let's stop and consider a dimension. A dimension bounds an analysis space and is represented, when drawn, by an axis. The axis represents a scale, which can be either a nominal scale or an interval scale. All possible values of this scale are the domain of dimension values. In forming a member query, we specify a subset of values from this domain to return for each of the dimensions. If we specify a set of members when creating the member query, the results of the query are returned in the order specified. We may also define the ordering of the returned values.
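A member query's subset-and-order behavior can be sketched in a few lines; the dimension domain and member names below are invented for the example:

```java
// Sketch of a member query: select a subset of a dimension's domain,
// returned in the order the members were requested.
import java.util.ArrayList;
import java.util.List;

public class MemberQueryDemo {
    public static List<String> select(List<String> domain, List<String> requested) {
        List<String> result = new ArrayList<>();
        for (String member : requested) {
            if (domain.contains(member)) result.add(member);  // keep request order
        }
        return result;
    }
    public static void main(String[] args) {
        List<String> timeDomain = List.of("Q1", "Q2", "Q3", "Q4");
        // The result honors the requested order, not the domain order.
        System.out.println(select(timeDomain, List.of("Q4", "Q2")));
    }
}
```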

The MDAPI is not JSR-69, although it approximates what we expect to see in the JSR. The MDAPI applies to a variety of languages, not just Java. However, some basic considerations had to be made in establishing the MDAPI to map it to the Java environment. For example, although Java methods may receive zero to many parameters, they return only one value. The MDAPI therefore has been defined so that there are no methods that return multiple output parameters.

In the past, one of the values returned by a procedure was a status code. Java, as well as several other modern languages, provides for raising an exception. An application can build a set of exceptions that provide specific information on the nature of the error, which is passed to the exception handler. The Java MDAPI method will always raise an exception to indicate that an error has occurred.

As we can see in Figure 8.8, there are many instances in which there are one-to-many relationships. Consider, for example, the relationship between the hierarchy object and the level object. In such cases, the application has to know which and how many objects there are in the relationship. The Java MDAPI implementation uses a collection class to represent these relationships. A collection class is a type of container that holds the objects on the many side of the relationship. Once the objects are contained in the collection, the application can interact with them as necessary.
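The hierarchy-to-level relationship can be sketched with a standard Java collection: the parent holds its children, and the application asks the collection which and how many levels exist. These classes are illustrative, not the MDAPI's actual types:

```java
// One-to-many relationship represented with a standard collection:
// a hierarchy holds its levels, and the application inspects the list.
import java.util.ArrayList;
import java.util.List;

public class HierarchyDemo {
    static class Level {
        final String name;
        Level(String name) { this.name = name; }
    }

    static class Hierarchy {
        private final List<Level> levels = new ArrayList<>();
        void addLevel(Level level) { levels.add(level); }
        // The collection tells the application which and how many levels exist.
        List<Level> getLevels() { return levels; }
    }

    public static int levelCount() {
        Hierarchy time = new Hierarchy();
        time.addLevel(new Level("Year"));
        time.addLevel(new Level("Quarter"));
        time.addLevel(new Level("Month"));
        return time.getLevels().size();
    }

    public static void main(String[] args) {
        System.out.println(levelCount());  // the time hierarchy has 3 levels
    }
}
```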

