Connectivity and Data Formats


One major feature of distributed systems is that they are distributed: components reside at different locations on different computers. These components and other pieces of software, such as a Web browser or a database server, need to communicate with each other across process boundaries. This communication is made possible by the middleware, formats, and protocols that components use. Another interesting challenge is that you might have to integrate a number of legacy components that still perform their tasks perfectly well; discarding and rewriting them in order to use new technologies might not be a cost-effective option.

Sockets

The great-granddaddy of connection mechanisms is the socket. Most modern connectivity solutions use sockets somewhere under the hood, but these days they're well hidden. Sockets themselves are an abstraction designed to hide the complexities of transmitting data over a network using TCP/IP, the protocol that underpins the Internet.

In a socket-based architecture, one process (typically the server) creates a network endpoint on a machine and waits for clients to connect to it. When a client connects, the server process receives data sent by the client, performs some processing, and then possibly sends a response back. A socket endpoint comprises the TCP/IP address of the computer and a port number (a positive integer). Any client that knows the TCP/IP address of the server machine and the port number can attempt to connect to the server process. The server process can examine the address of the client making the request (the TCP/IP address is passed as part of the protocol) and can accept or reject the request as it sees fit.

If you've written a program that uses sockets, you'll know that a lot of decisions are left up to the programmer's discretion. For example, what happens if a second request is sent to a server while the server is already handling a request? The answer is that the server should be implemented using multiple threads. One thread should sit and wait for a client request, and when a request arrives it should create and dispatch a new thread to handle the client while the original thread waits for the next request.
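A minimal sketch of this pattern in Java follows; the port number and the echo behavior are illustrative, not part of any particular server described here.

import java.io.*;
import java.net.*;

// A minimal multithreaded server: the main thread blocks in accept(),
// and each incoming connection is handed to a new worker thread.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        ServerSocket endpoint = new ServerSocket(8080); // port is illustrative
        while (true) {
            final Socket client = endpoint.accept();    // wait for the next client
            new Thread(new Runnable() {
                public void run() {
                    try {
                        BufferedReader in = new BufferedReader(
                            new InputStreamReader(client.getInputStream()));
                        PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                        String line;
                        while ((line = in.readLine()) != null) {
                            out.println(line);          // echo each line back
                        }
                    } catch (IOException e) {
                        // a real server would log the failure
                    } finally {
                        try { client.close(); } catch (IOException ignored) {}
                    }
                }
            }).start();                                 // dispatch and return to accept()
        }
    }
}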

Another issue that the programmer is expected to handle is that the client must know the format of the data that the server requires. If the server expects an integer and the client sends it a string, the server will try to interpret the string as an integer and get it wrong. Even if the client knows that the server expects an integer, there is still plenty of room for confusion. How big (in bytes) is an integer? Are the client and server running on machines that use a big endian or little endian processor architecture?

Big Endian vs. Little Endian

The terms big endian and little endian refer to the byte ordering used by processors. A processor using big endian byte ordering stores the most significant byte of any multibyte value in the lowest memory address used by the data, the second most significant byte in the next location, and so on. A little endian processor stores the least significant byte in the lowest memory address used by the data, and so on.

For example, suppose a processor uses two-byte integers. The decimal integer 32000 is represented in binary as 01111101 00000000. A big endian processor stores these two bytes in that order, with the most significant byte (01111101) at the lower address; a little endian processor stores them reversed: 00000000 01111101. If the value 32000 is transmitted byte by byte from a little endian computer to a big endian computer, the big endian machine will wrongly interpret the bytes as the value 125. Therefore, when you send data over a network, you should always be sure that it's transmitted using an agreed-upon byte-ordering scheme understood by both the sending and receiving computers.

Many UNIX machines are big endian, and big endian byte ordering is the convention used for transmitting data over the Internet. The Intel processors used by PCs running Microsoft Windows are little endian.
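The standard Java libraries shield you from most of this: DataOutputStream always writes multibyte values in big endian (network) order, and java.nio.ByteBuffer lets you state the byte order explicitly. A small sketch illustrating the 32000-versus-125 confusion described above:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        short value = 32000;                        // 0x7D00

        ByteBuffer big = ByteBuffer.allocate(2);    // big endian is the default
        big.putShort(value);                        // bytes in order: 7D 00

        ByteBuffer little = ByteBuffer.allocate(2);
        little.order(ByteOrder.LITTLE_ENDIAN);
        little.putShort(value);                     // bytes in order: 00 7D

        // Reading the little endian bytes as if they were big endian
        // yields the wrong answer described above: 0x007D = 125.
        System.out.println(ByteBuffer.wrap(little.array()).getShort()); // prints 125
    }
}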

For those who are curious, the terms big endian and little endian can be traced back to Jonathan Swift's Gulliver's Travels. The Big-endians of Lilliput were so called because they cracked their boiled eggs at the big end; they were considered rebels by the Small-endians, who were commanded by the King of Lilliput to crack their eggs at the small end.

In general, binary data can be awkward to handle at the socket level because of the many ways it can be interpreted. Many designers and developers avoid using binary data and instead prefer to convert data into character streams, although streams present their own problems.

Another major concern with sockets is a lack of atomicity. When a client sends a large volume of data to a server over a socket, it might appear to the client as if it is sending a single piece of data. However, the vagaries of most modern operating system schedulers allow processes to be interrupted and suspended while they're performing input/output (I/O) operations. Behind the scenes, therefore, a single client operation that sends several kilobytes of data to a server might actually be broken down by the operating system into a series of smaller transmissions.

The client is not aware of this fact, but the server might be. The first chunk of data will be read, and the server might not realize that more is to follow; it will begin processing using the incomplete information it has received, once again making a mess of things! To counter this problem, designers end up defining data streams that contain additional information indicating how big a request is so the server can make sure it reads all of it. In short, when you use sockets, you can end up spending more time and effort worrying about the mechanics of data transmission than defining the actual business logic that needs to be performed.
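A common remedy, sketched below in Java, is to prefix each message with its length and have the receiver keep reading until that many bytes have arrived (DataInputStream.readFully performs that loop internally):

import java.io.*;

// Length-prefixed framing: the sender writes a 4-byte length followed by
// the payload; the receiver reads the length and then blocks until it has
// read exactly that many bytes, however the OS splits the transmission.
public class Framing {
    public static void sendMessage(DataOutputStream out, byte[] payload)
            throws IOException {
        out.writeInt(payload.length);   // big endian length prefix
        out.write(payload);
        out.flush();
    }

    public static byte[] receiveMessage(DataInputStream in) throws IOException {
        int length = in.readInt();      // how many bytes to expect
        byte[] payload = new byte[length];
        in.readFully(payload);          // loops until the whole message arrives
        return payload;
    }
}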

Remote Procedure Calls

To free designers from the cumbersome task of handling sockets, a different mechanism was needed. One such mechanism is the Remote Procedure Call (RPC), which is a further abstraction of the network. (See Figure 1-6.) The purpose of the RPC is to make a request to a remote server look exactly the same as a call to a local procedure.

RPCs work by intercepting procedure calls using a proxy object (described by the Gang of Four, although proxies predate their design patterns book by a number of years) that packages up any parameters into a format suitable for transmission over the network and sends this data to a server (probably via a socket). A stub on the server receives the request, unpacks the data, and then invokes the corresponding procedure in the server process. Any return values are packed, sent from the server through the stub and the proxy back to the client, and then unpacked.

Figure 1-6. The RPC architecture (with proxy and stub pseudocode)

Most platforms that support RPCs provide tools that allow the proxy and stub code to be generated from a specification of the procedure that is to be remoted. For example, Microsoft supplies the MIDL compiler, which takes the definition of remote procedures in an Interface Definition Language (IDL) file and produces source code in C for the proxy and stub objects. Client code is compiled and linked with the proxy code, and the server is compiled and linked with the stub. The server itself must supply a real implementation of the remote procedure specified in the IDL file.
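For illustration only, a remote procedure definition in an IDL file might look something like this; the interface name and the UUID are invented placeholders:

// math.idl -- illustrative sketch only; the uuid is a placeholder
[
    uuid(12345678-1234-1234-1234-123456789abc),
    version(1.0)
]
interface MathService
{
    long Add([in] long a, [in] long b);
}

Running such a file through the IDL compiler produces the proxy and stub source code, leaving only the body of Add for the server author to write.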

For transparency, RPCs often use a name service that allows the proxy to locate the RPC server by name rather than requiring network addresses to be hardcoded into the application. Using a name service permits the server to be relocated (for example, in the event of hardware failure) without needing to rebuild the client. Another advantage of a name service is that it can support advanced features such as load balancing by redirecting client requests to one of a number of servers that implement the same service.

The act of converting data into a portable format suitable for transmission over a network is called marshaling. The act of unpacking the data at the other end and converting it back into its original binary representation is called unmarshaling. IDL allows you to define how data structures should be marshaled if you don't want to use the default format. You should understand one important point at this juncture: Several vendors and consortia have implemented their own RPC mechanisms and marshaling schemes. They are all similar, but there are incompatibilities between them, so using RPCs does not guarantee portability across platforms.

Object RPCs

The original RPC mechanism was designed when procedural languages such as C predominated and object orientation was struggling to gain acceptance. As a result, RPCs were closely aligned with procedural semantics rather than objects. Object RPCs (ORPCs) extend RPCs into the world of objects: they allow entire objects, rather than just individual procedure calls, to be accessed remotely. ORPCs bring in a whole new range of opportunities and design issues.

CORBA

Having a set of similar but incompatible frameworks for performing RPCs was clearly not good for portability. The Object Management Group (OMG) decided to avoid making the same mistakes with ORPCs. It worked with its member organizations to come up with a common set of principles and techniques for allowing objects to be accessed remotely. The result was the Common Object Request Broker Architecture (CORBA), shown in Figure 1-7.

CORBA defines its own dialect of IDL. It extends the use of interfaces and adds support for objects in a language-independent manner. CORBA also dictates how objects should communicate, using an object request broker (ORB). An ORB is a piece of middleware that locates CORBA-compliant objects and is responsible for marshaling and unmarshaling data as the data is transmitted over the network.

Figure 1-7. CORBA

With the advent of language bindings that map common (and some not-so-common) programming languages into IDL, cross-language interoperability was achieved. A server application can be written in one language, and the client can be written in another. The ORB handles the communication between the client and the server and the marshaling and unmarshaling of data, and the proxy (actually called a client stub by CORBA) and stub (called a server skeleton) code generated from the IDL definition of the server handles communication with the ORB. All you need to do when you generate the stub and skeleton is to specify which language (or languages) you want to generate code for.
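As a rough illustration (the module and operation names are invented), a fragment of CORBA IDL might look like this; the IDL compiler for your chosen language binding generates the client stub and server skeleton from it:

// account.idl -- an illustrative CORBA IDL fragment
module Bank {
    interface Account {
        double balance();
        void deposit(in double amount);
    };
};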

The original CORBA specification defined how clients and servers should communicate with an ORB, but it was vague about how one ORB should communicate with another if it were ever necessary to do so. (ORBs originally used their own proprietary over-the-wire formats.) This rendered ORBs incompatible with one another. (Who says history never repeats itself?) The CORBA 2.0 specification plugged this gap by defining the Internet Inter-ORB Protocol (IIOP), which specifies the wire formats and messages that ORBs should use when communicating with other ORBs over TCP/IP (the network protocol used by the Internet and most intranets). Most modern ORBs implement IIOP.

An additional feature that CORBA provides is server activation. In the original world of RPCs, a server process that implemented an RPC had to be started manually in order for a client to find it, connect to it, and use it. CORBA defines activation policies that allow the ORB to start an object server on demand.

Distributed COM

Microsoft began its foray into the world of ORPCs with Distributed COM (DCOM). DCOM's functionality is similar to that of CORBA, except it is highly tuned and optimized for the Windows family of operating systems. (The German company Software AG created a version of DCOM for Linux platforms.) DCOM is essentially incompatible with CORBA, although COM-CORBA bridges are available if you need to combine the two systems. DCOM has provided the foundation for COM+, which is an important technology for building distributed applications under Windows and Microsoft .NET.

Remote Method Invocation

Java has its own native ORPC mechanism called Remote Method Invocation (RMI). RMI is optimized for Java and uses its own internal formats and mechanisms for marshaling data, and it is incompatible with most other RPC mechanisms (apart from CORBA using the RMI-IIOP protocol, which is an IIOP-conformant implementation of RMI). As a result, it can be difficult (but not impossible) to mix Java objects developed using the standard Java Development Kit (JDK) with non-Java objects in a distributed system.

Serialization

The JDK includes its own name service: the RMI registry. (Version 1.2 of the JDK also includes tnameserv, which is a simple CORBA name server.) RMI uses serialization to marshal and unmarshal data. Serialization is just another term for the conversion of data into a portable binary representation. The binary version of the data can then be transmitted and reconstituted at the receiving end. Serialization is also used by Java for saving objects to a disk file or a database. Java is designed to work on multiple platforms, so even though RMI does not always interoperate well with other RPC services, it does guarantee cross-platform compatibility for applications developed using Java. A Java client running under Windows can communicate with an RMI server running under UNIX, and you don't need to worry about issues such as big endian versus little endian byte ordering.
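A brief sketch of serialization in action, converting an object to a byte stream and reconstituting it (a String is used here simply because it implements Serializable):

import java.io.*;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        // Serialize: convert the object into a portable byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject("any Serializable object");   // String implements Serializable
        out.close();

        // Deserialize: reconstitute the object at the receiving end.
        ObjectInputStream in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray()));
        Object copy = in.readObject();
        System.out.println(copy);                     // prints: any Serializable object
    }
}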

Reference and Value Objects

Although Java objects can be complex, you can serialize most of them with little or no difficulty. Java has the Serializable interface, which a class must implement in order for its objects to allow serialization. The Serializable interface is actually just a marker interface that indicates that the class supports serialization; you do not need to write any additional code.

Objects instantiated from classes that implement the Serializable interface are copied by value when they're referenced as parameters or return types to RMI method calls. This means that if an RMI client obtains a serializable object as a return value from an RMI method call, the client actually receives a copy of the original data. Any changes that the client makes to this copy will not be reflected in the original object in the server. On the other hand, if a class implements the java.rmi.Remote interface (typically by extending java.rmi.server.UnicastRemoteObject), objects of that class are passed by reference between an RMI server and the client. Changes made by the client will be transmitted as RMI method calls to the original object residing in the address space of the server, thereby changing the state of the original object.
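A hedged sketch of the two cases, with invented class and interface names:

import java.io.Serializable;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Passed by VALUE: the client receives its own serialized copy,
// so changes the client makes are not seen by the server.
public class AccountSummary implements Serializable {
    public String owner;
    public double balance;
}

// Passed by REFERENCE: the client receives a proxy, and each method
// call travels back over RMI to the object living in the server.
interface Account extends Remote {
    double getBalance() throws RemoteException;
    void deposit(double amount) throws RemoteException;
}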

The decision about whether to pass remote objects by value or by reference is an important one that you should make on a case-by-case basis. It might make sense for objects whose data does not change (and that the client will want to browse) to be marshaled by value because this results in a single transfer of data over the network. Objects whose state can be changed are better marshaled by reference, although this can result in numerous small network exchanges as individual pieces of state are modified. Design patterns are available if you need an object to be mutable but want to avoid the overhead of repeated network calls. One example of such a pattern is a value object with a batch update method that propagates the entire object back to the RMI server after a number of changes have been made.
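A sketch of that pattern, reusing the invented AccountSummary class from the previous example: the client edits its local serializable copy and then propagates every change in a single remote call.

import java.rmi.Remote;
import java.rmi.RemoteException;

// The client obtains an AccountSummary copy, makes several local edits,
// and then sends all of them back in one network round trip.
interface AccountManager extends Remote {
    AccountSummary getSummary(String owner) throws RemoteException;
    void update(AccountSummary changed) throws RemoteException;  // batch update
}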

The Web

The explosion of the Internet and the increased access to network bandwidth have allowed more and more companies to consider using the Web as a transport for e-commerce. The same distributed design principles apply equally to local intranet solutions.

HTTP

Hypertext Transfer Protocol (HTTP) is the network protocol of the World Wide Web. HTTP is most commonly used by Web servers for receiving a request from a Web browser and responding with an HTML stream that contains information that the Web browser can render and display to the user. Much of the information that passes over HTTP is text-based, although this protocol can also be used to transmit binary data if the Web browser and the Web server use an agreed-upon format. (HTTP specifies a number of common formats that can be used.)
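For example, fetching a page over HTTP takes only a few lines of Java (the URL is a placeholder):

import java.io.*;
import java.net.*;

public class HttpGet {
    public static void main(String[] args) throws IOException {
        // Issue an HTTP GET request and print the HTML response.
        URL url = new URL("http://www.example.com/");   // placeholder URL
        BufferedReader in = new BufferedReader(
            new InputStreamReader(url.openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}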

Although HTTP is suitable for transmitting text and surfing the Web, it has its limitations. An increasing number of companies that want to use the Internet as a conduit for sending business data require RPC calls over the Internet, which raw HTTP is not so good at.

Web Services

Web services are one way of addressing the need to make RPC calls over the Internet. You can think of a Web service as a component, or black box, that provides some useful facility to clients or consumers. Just as DCOM is often thought of as "COM with a longer wire," you can think of a Web service as a component with a truly global reach.

A Web service can be implemented in a variety of languages. Currently, the .NET Framework allows you to develop Web services using C++, Microsoft JScript, C#, J#, and Visual Basic .NET, and other languages will likely be available in the future. The Web server listens for incoming Web service requests and directs them to the appropriate Web service code. The Web server is also responsible for converting these requests (which arrive using HTTP) into a form that appears to the Web service code to be a local procedure call. In other words, the Web server acts like a server-side stub or CORBA skeleton.

As far as the consumer is concerned, the language used by the Web service, and even how the Web service performs its tasks, is not important. The consumer's view of a Web service is as an interface that exposes a number of well-defined methods. The consumer calls these methods using the standard Internet protocols, passing parameters in eXtensible Markup Language (XML) format and receiving responses in XML format.

XML has become a widely accepted standard for data transmission. It is well understood and extremely portable. The fact that XML is text-based makes it convenient to transmit over HTTP. There are the issues of marshaling and unmarshaling data to and from the required XML format, but the complexity of this process can be hidden using client-side proxies.

SOAP

SOAP is the protocol used by Web service consumers for sending requests to and receiving responses from Web services. SOAP is a lightweight protocol built on top of HTTP. It is possible to exchange SOAP messages over other protocols, but as of summer 2002 only the HTTP bindings for SOAP have been defined. SOAP defines an XML grammar for specifying the names of methods that a consumer wants to invoke on a Web service, for defining the parameters and return values, and for describing the types of parameters and return values.
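As a rough illustration of this grammar (the method name, namespace, and parameter are invented), the body of a SOAP 1.1 request looks something like this:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope
    xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <!-- The method to invoke and its parameters, expressed in XML -->
    <GetPrice xmlns="http://example.com/stockservice">
      <Symbol>MSFT</Symbol>
    </GetPrice>
  </soap:Body>
</soap:Envelope>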

When a client calls a Web service, it must specify the method and parameters using this XML grammar. Most tools for building applications that consume Web services create a client-side proxy that makes a call to a Web service appear to the client like a local procedure call. The proxy converts any parameters into the appropriate XML format and then calls the Web service using HTTP on behalf of the client. Any return values (which are passed back as XML) are unmarshaled into the native format expected by the client. Figure 1-8 shows a Web service consumer invoking a Web service using SOAP.

Figure 1-8. The Web service and the consumer

SOAP is becoming an industry standard. Its function is to improve cross-platform interoperability. The strength of SOAP lies in its simplicity, as well as the fact that it is based on two other industry-standard technologies, HTTP and XML.
