Marshalling is a generic term for gathering data from one process and converting it into a format that can be used either for storage or for transmission to another process (correspondingly, unmarshalling involves taking the converted data and recreating the objects). In RMI, marshalling is done either via serialization or externalization.

Marshalling and unmarshalling occupy a strange role in designing a distributed application. On the one hand, the means by which you perform marshalling and unmarshalling is a technical detail: once you've decided to send information to another process, how you do so shouldn't be a primary design consideration. On the other hand, it's a very important technical detail, and the way you do it can often make or break an application.

6.1.1 Use Value Objects to Separate Marshalling Code from Your Application Logic

Value objects are objects that contain data and very little behavior, aside from a few convenience constructors. They are encapsulations of information intended solely for communication between processes (they are more like structs in C or C++ than full-fledged objects). The idea behind using value objects is that by specifying remote interfaces, and then using data objects that play no further computational role, you separate the protocol definition (i.e., how the processes communicate) from the computational class structures (i.e., the objects and classes that the client or server needs to use to function effectively).

Building separate objects for data transmission might seem at odds with standard object-oriented practices. And, to some extent, it is. But it's not as contrary as you might think. Consider writing a stock-trading application. Your stock-trading application probably has a notion of a purchase order. But it's not a single class; each part of the application deals with different aspects of the purchase order:
These are all aspects of the "purchase order idea." But they're very different, and each layer of the application deals with only one of them. Even object-oriented purists might find using three different classes to represent "purchase orders" a reasonable design decision.

Defining value objects gives you five main benefits:
6.1.2 Use Flat Hierarchies When Designing Value Objects

The first rule of thumb when designing a suite of value objects is this: avoid inheritance. There are two basic reasons for this. The first is efficiency. When you send an instance of a class over the wire, you also send information about the class hierarchy. If you run Example 6-1, you'll see that in Java Development Kit (JDK) 1.4 the cost of one extra level of inheritance is 44 bytes (regardless of whether you use serialization or externalization).

Example 6-1. FlatHierarchies.java

```java
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class FlatHierarchies {
    public static void main(String[] args) {
        try {
            System.out.println("An instance of A takes " + getSize(new A()));
            System.out.println("An instance of B takes " + getSize(new B()));
            System.out.println("An instance of C takes " + getSize(new C()));
            System.out.println("An instance of D takes " + getSize(new D()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Marshals the argument into an in-memory stream and returns the byte count.
    private static int getSize(Object arg) throws IOException {
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(byteArrayOutputStream);
        oos.writeObject(arg);
        oos.close();
        byte[] bytes = byteArrayOutputStream.toByteArray();
        return bytes.length;
    }

    // A and B use serialization; B adds one level of inheritance.
    protected static class A implements Serializable { }

    protected static class B extends A { }

    // C and D use externalization; D adds one level of inheritance.
    protected static class C implements Externalizable {
        public void readExternal(ObjectInput oi) {}
        public void writeExternal(ObjectOutput oo) {}
    }

    protected static class D extends C {
        public void readExternal(ObjectInput oi) {}
        public void writeExternal(ObjectOutput oo) {}
    }
}
```

Forty-four bytes (and the extra CPU time to marshal and demarshal them) might not seem significant. But this overhead happens with every single remote call, and the cumulative effects are significant.

The second reason for avoiding inheritance hierarchies is simple: inheritance hierarchies rarely mix well with any form of object persistence.
It's just too easy to get confused, or to fail to account for all the information in the hierarchy. The deeper the hierarchy is, the more likely you will run into problems.

6.1.3 Be Aware of How Externalization and Serialization Differ with Respect to Superclasses

One of the biggest differences between externalization and serialization involves how they handle superclass state. If a class implements Serializable but not Externalizable and is a subclass of another class that implements Serializable, the subclass is only responsible for marshalling and demarshalling locally declared fields (either a method in the superclass or the default serialization algorithm will be called to handle the fields declared in the superclass).

If, on the other hand, a class implements Externalizable, it is responsible for marshalling and demarshalling all object state, including fields defined in any and all superclasses. This behavior can be very convenient if you want to override a superclass's marshalling code (for example, this lets you repair broken marshalling code in a library). Implementing Externalizable will let you do so quite easily.

But externalization's requirement that you explicitly marshal and demarshal superclass state doesn't really mesh well with inheritance. Either you break encapsulation and let the subclass handle the superclass's state, or the subclass has to call a superclass method to marshal and demarshal the superclass state. Both options can lead to error-prone code. In the first case, if you add or remove a field in the superclass, you then have to remember to modify the subclass. In the second case, you have to rely on programmer discipline to call the superclass method; the compiler doesn't enforce this at all (and forgetting to call the superclass method can be a source of subtle bugs).

And finally, there's one related problem in the way externalization is defined.
Consider the following code:

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class A implements Externalizable {
    // Some class state is declared.

    public A() { }

    public void writeExternal(ObjectOutput oo) throws IOException {
        // Correct implementation
    }

    public void readExternal(ObjectInput oi) throws IOException, ClassNotFoundException {
        // Correct implementation
    }
}

public class B extends A {
    // New state variables are defined,
    // but methods from Externalizable are not implemented.
}
```

This code will compile. And instances of B can be marshalled and demarshalled. The problem is that the only state written out is the state defined in A. B has the public methods the Externalizable interface requires; it's just that the implementations are inherited from A.

6.1.4 Don't Marshal Inner Classes

Inner classes aren't as simple as they appear. When an inner class is compiled it gains at least one new variable, which is a reference to the outer class, and it can potentially gain many more, depending on which local variables are accessed and whether they're declared as final. For example, consider the following class definition, which declares an inner class inside the test method:

```java
public class InnerClasses {
    public static void main(String[] args) {
        new InnerClasses().test("hello");
    }

    public void test(final String string) {
        // A local class that captures the enclosing instance and the final parameter.
        class StringPrinter {
            public void print() {
                System.out.println(string);
                System.exit(0);
            }
        }
        new StringPrinter().print();
    }
}
```

When you compile this, and then decompile the resulting inner class, [2] you find the following:
```java
class InnerClasses$1$StringPrinter {
    private final String val$string;
    private final InnerClasses this$0;

    InnerClasses$1$StringPrinter(InnerClasses p0, String p1) { }

    public void print() { }
}
```

The inner class has two "additional" fields. Having automatically generated fields (with automatically generated names) in a class you will marshal is a recipe for disaster: if you don't know the variable names or types in advance, all you can do is rely on default serialization and hope everything works out. What makes it even worse in this case is that the names of the fields aren't part of the Java specification (the names of the fields depend on the compiler).

6.1.5 Always Explicitly Initialize Transient Fields Inside Marshalling Methods

The transient keyword is very useful when you're writing marshalling code. Very often, a first pass at making an object implement the Serializable interface consists of the following two steps:
Many programmers will look at this and wonder if it's efficient enough. But very few will wonder whether it's correct. And the sad truth is that it is often incorrect. The reason is that the transient keyword really means "Serialization shouldn't pay attention to this field at all." [3] And that has serious consequences when you combine it with the notion of serialization's extralinguistic constructor (as discussed earlier). If you implement the Serializable interface by following the previous two steps, here's what will happen when your object is deserialized:
The net result is that none of the transient fields will be initialized. This is rarely the expected outcome, and can result in bugs that are subtle and hard to track down.

The SerializationTest class illustrates the problem in detail. It has four integers (a, b, c, and d), which are initialized in various ways. Of these integers, only a is not declared as a transient variable. When an instance of SerializationTest is serialized and then deserialized, only a will have the "correct" value. b, c, and d will all be set to 0 (which is the default value for integers in the Java language specification):

```java
import java.io.Serializable;

public class SerializationTest implements Serializable {
    private int a = 17;           // Value is preserved by serialization
    private transient int b = 9;  // Value is not preserved by serialization
    private transient int c;
    private transient int d;

    {
        // Initialization blocks are ignored by the deserialization algorithm.
        c = 12;
    }

    private SerializationTest() {
        // Won't be called by the deserialization algorithm
        d = 421;
    }

    public void printState() {
        System.out.println("a is " + a);
        System.out.println("b is " + b);
        System.out.println("c is " + c);
        System.out.println("d is " + d);
    }
}
```

6.1.6 Always Set the serialVersionUID

serialVersionUID is a class invariant that the RMI runtime uses to validate that the classes on both sides of the wire are the same. Here's how it works: the first process marshals an object and sends it over the wire. As part of the marshalling process, the serialVersionUID of all relevant classes is also sent. The receiving process compares the serialVersionUIDs that were sent with the serialVersionUIDs of the local classes. If they aren't equal, the RMI runtime will throw an instance of UnmarshalException (and the method call will never even reach your code on the server side).

If you don't specify serialVersionUID, you will run into two problems.
The first is that the system's value for serialVersionUID is generated at runtime (not at compile time), and generating it can be expensive. The second, more serious problem is that the automatically generated values of serialVersionUID are created by hashing together all the fields and methods of the class and are therefore extraordinarily sensitive to minor changes.

For these reasons, whenever you define a class that will be marshalled, you should always set the serialVersionUID, as in the following example:

```java
public class ClassWhichWillBeMarshalled implements Externalizable {
    public static final long serialVersionUID = 1L;
    // . . .
}
```

6.1.7 Set Version Numbers Independently of serialVersionUID

serialVersionUID is a very coarse-grained versioning control. If one Java Virtual Machine (JVM) is using a class with a serialVersionUID that has been set to 1, and the other JVM is using a later version of the class with a serialVersionUID that has been set to 2, the call never reaches your server code because the RMI runtime in the server's JVM will throw an instance of UnmarshalException (as demonstrated earlier).

This level of protection is often overkill. Instead of having the RMI runtime reject the call, you should have your code look at the data that was sent over the wire, realize that there is a versioning problem, and behave appropriately. The following scenario illustrates the problem:
This problem is easy to solve: simply use a second static variable to indicate the "actual version" of the class, and then use it to implement a robust versioning scheme, as in the following code:

```java
public class ClassWhichWillBeMarshalled implements Externalizable {
    public static final long serialVersionUID = 1L;
    public static final int actualVersion = 1;
    // . . .
}
```

6.1.8 Never Use Default Serialization

The serialization algorithm is a very simple and robust algorithm. In pseudocode, it consists of the following five steps:
This last step is often referred to as default serialization. It's what you get if you do nothing beyond adding the words "implements Serializable" to your class definition. And it's such a bad idea that you should never use it. [4]
The problem is that default serialization encodes the exact structure of your class, down to the names of the fields, into the output stream, and it does so in a way that completely prevents any form of versioning. Suppose you want to change the internal representation of your data inside the object, but you still want to maintain some level of backward compatibility (for example, "instances serialized with the old program can still be read in with the new program"). If you just use default serialization, this is actually quite hard to achieve.

Suppose, on the other hand, you implement a very simple versioning scheme such as the one in the following code snippet:

```java
public class ClassWhichWillBeMarshalled implements Serializable {
    public static final long serialVersionUID = 1L;
    public static final int actualVersion = 1;
    // . . .

    private void writeObject(java.io.ObjectOutputStream out) throws IOException {
        out.writeInt(actualVersion);
        out.defaultWriteObject();
    }

    // defaultReadObject() can also throw ClassNotFoundException.
    private void readObject(java.io.ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.readInt();
        in.defaultReadObject();
    }
}
```

This actually does use the default serialization algorithm. But before it invokes the default serialization algorithm, it handles the reading and writing of a version number (which is actually a class-level static variable). And the nice thing is that when you version the object, it becomes quite easy to read everything in and handle the data appropriately. The code to do so looks a lot like the following:

```java
private void readObject(java.io.ObjectInputStream in)
        throws IOException, ClassNotFoundException {
    int version = in.readInt();
    switch (version) {
        case 1:
            handleVersionOne(in);
            break;
        // . . .
    }
}
```

There is one slight fly in the ointment here. Suppose version two is grossly incompatible with version one. For example, suppose you renamed all the variables to conform to a new corporate variable naming standard (or switched from being Serializable to Externalizable).
The solution is simple: you can use the readFields() method on ObjectInputStream to read in the name/value pairs for your original fields (and then handle setting the values yourself). The following code example shows you how to read in the serialization information using this technique:

```java
private static class A implements Serializable {
    private static final int DEFAULT_VALUE_FOR_A = 14;
    private int a = DEFAULT_VALUE_FOR_A;

    private void readObject(java.io.ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        // Read the stream's fields by name; fall back to the default if "a" is absent.
        ObjectInputStream.GetField getFields = in.readFields();
        a = getFields.get("a", DEFAULT_VALUE_FOR_A);
    }
}
```
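The two ideas in this section (a version number written ahead of the fields, and readFields() on the way back in) can be combined. The following is a minimal sketch rather than a definitive implementation: the Quote class, its fields, the default price, and the roundTrip helper are all hypothetical names made up for illustration, and the stream is an in-memory byte array instead of an RMI call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class VersionedRoundTrip {

    // Hypothetical value object; the class and field names are illustrative only.
    public static class Quote implements Serializable {
        public static final long serialVersionUID = 1L;
        public static final int actualVersion = 1;
        private static final int DEFAULT_PRICE = -1; // assumed fallback value

        private String symbol;
        private int price = DEFAULT_PRICE;

        public Quote(String symbol, int price) {
            this.symbol = symbol;
            this.price = price;
        }

        public String getSymbol() { return symbol; }
        public int getPrice() { return price; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.writeInt(actualVersion); // version number first, as in the text
            out.defaultWriteObject();
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            int version = in.readInt();
            switch (version) {
                case 1: {
                    // Read fields by name so a later version can map or default them.
                    ObjectInputStream.GetField fields = in.readFields();
                    symbol = (String) fields.get("symbol", (Object) null);
                    price = fields.get("price", DEFAULT_PRICE);
                    break;
                }
                default:
                    throw new InvalidObjectException("Unknown version: " + version);
            }
        }
    }

    // Marshals the object into a byte array and demarshals a copy from it.
    public static Quote roundTrip(Quote original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(original);
        out.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        return (Quote) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        Quote copy = roundTrip(new Quote("ACME", 42));
        System.out.println(copy.getSymbol() + " " + copy.getPrice());
    }
}
```

A later version of such a class would add another case to the switch and translate the old field names into its new representation there.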
6.1.9 Always Unit-Test Marshalling Code

Because marshalling in RMI is always based on streams, it's very easy to write unit tests for your marshalling code. And doing so can save you a lot of headaches as you modify your codebase. All you need to do is use streams that map to byte arrays in memory. For example, the following code creates a deep copy of an object and then makes sure the deep copy is equal to the original instance:

```java
public static boolean testSerialization(Serializable object) throws Exception {
    Object secondObject = makeDeepCopy(object);
    boolean hashCodeComparison = (object.hashCode() == secondObject.hashCode());
    boolean equalsComparison = object.equals(secondObject);
    return hashCodeComparison && equalsComparison;
}

// Marshals the object into an in-memory byte array, then demarshals a copy from it.
private static Object makeDeepCopy(Serializable object) throws Exception {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    ObjectOutputStream objectOutputStream = new ObjectOutputStream(byteArrayOutputStream);
    objectOutputStream.writeObject(object);
    objectOutputStream.flush();
    byte[] bytes = byteArrayOutputStream.toByteArray();
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
    ObjectInputStream objectInputStream = new ObjectInputStream(byteArrayInputStream);
    Object deepCopy = objectInputStream.readObject();
    return deepCopy;
}
```

If you insert tests for each value object in your codebase, and then run them occasionally (for example, as unit tests in either Cactus or JUnit), you'll catch errors as they occur (and before they cost you significant time and energy).

6.1.10 Profile Before Customizing

From an engineering perspective, any project contains two very different types of costs: people costs and performance costs. People costs are what working on a particular piece of code implies for the development team; they're measured in terms of features not implemented, extra testing that might be required, and (in the long term) maintenance overhead.
Performance costs deal with the runtime overhead of a program; they're measured in terms of application performance, network overhead, and machine resource utilization.

The rule of thumb for getting projects in on time is very simple: don't write bad code. But don't trade people costs for performance costs unless you have to. Following this rule doesn't guarantee that your project will finish on time. But if you willfully ignore it, I guarantee that your project won't. Thus, whenever you're tempted to customize a working piece of code, some variant of the following scenario should play inside your head:
This applies to marshalling and demarshalling in a very straightforward way. You should always make sure you can version your marshalled objects because it's very hard to retrofit versioning into your codebase (therefore, not doing so is bad code). And you should write unit tests because they're immediately valuable and will save you time even in the short term.

But you shouldn't customize your code any more than that. Use either serialization or externalization (whichever is appropriate), and use defaultWriteObject() and defaultReadObject() until it's absolutely clear that you have to perform further customization. Unless your marshalling code is wedging the network or the CPU, you probably won't need to do either of these things, which should make your Internal Product Manager very happy.

6.1.11 Consider Using Byte Arrays to Store Marshalling Results

Marshalling converts an object graph into a set of bytes. The bytes are usually simply pushed into a stream and sent over a wire (or to a file or database). But there's no reason why another level of indirection can't be inserted into the process, as in the following code example:

```java
public class SerializeIntoBytes implements Externalizable {
    private byte[] _bytes;

    public void writeExternal(ObjectOutput out) throws IOException {
        if (null == _bytes) {
            createByteArray();
        }
        out.writeInt(_bytes.length);
        out.write(_bytes);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        int byteArrayLength = in.readInt();
        _bytes = new byte[byteArrayLength];
        // readFully blocks until the whole array is filled; read() may return early.
        in.readFully(_bytes);
        restoreStateFromByteArray();
    }

    protected void createByteArray() {
        // The "real" marshalling goes on in here.
    }

    protected void restoreStateFromByteArray() {
        // The "real" demarshalling goes on in here.
    }
}
```

The first thing this idiom does is enable you to reuse the end result of marshalling an object more than once. Consider, for example, the following scenario:
The authentication key in this scenario has two crucial features: it's being sent over the wire many times, and it doesn't change. Session keys have similar properties. And distributed event systems frequently have objects that aren't as long-lived but are sent to many recipients. In cases such as these, the savings that result from not having to marshal the object repeatedly can often be significant.
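To make the saving concrete, here is a small sketch of the caching idiom applied to a hypothetical authentication key. The AuthKey class, its two fields, and the marshalCount counter (which exists only to make the effect visible) are assumptions introduced for illustration. The "real" marshalling here uses DataOutputStream, which is one reasonable choice among several.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

public class CachedMarshalling {

    // Hypothetical immutable key; names and fields are illustrative only.
    public static class AuthKey implements Externalizable {
        private String user;
        private long issuedAt;
        private byte[] cachedBytes;   // result of the "real" marshalling, reused
        public int marshalCount;      // instrumentation for this demo only

        public AuthKey() { }          // public no-arg constructor required by Externalizable

        public AuthKey(String user, long issuedAt) {
            this.user = user;
            this.issuedAt = issuedAt;
        }

        public String getUser() { return user; }

        public void writeExternal(ObjectOutput out) throws IOException {
            if (cachedBytes == null) {
                cachedBytes = createByteArray();
            }
            out.writeInt(cachedBytes.length);
            out.write(cachedBytes);
        }

        public void readExternal(ObjectInput in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            cachedBytes = bytes;
            restoreStateFromByteArray(bytes);
        }

        // The "real" marshalling: runs once per key, not once per call.
        private byte[] createByteArray() throws IOException {
            marshalCount++;
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(buffer);
            data.writeUTF(user);
            data.writeLong(issuedAt);
            data.flush();
            return buffer.toByteArray();
        }

        // The "real" demarshalling.
        private void restoreStateFromByteArray(byte[] bytes) throws IOException {
            DataInputStream data = new DataInputStream(new ByteArrayInputStream(bytes));
            user = data.readUTF();
            issuedAt = data.readLong();
        }
    }

    public static byte[] marshal(Object object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(object);
        out.close();
        return buffer.toByteArray();
    }

    public static Object unmarshal(byte[] bytes) throws IOException, ClassNotFoundException {
        return new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject();
    }

    public static void main(String[] args) throws Exception {
        AuthKey key = new AuthKey("alice", 1000L);
        marshal(key);                       // first send: performs the real marshalling
        byte[] second = marshal(key);       // second send: reuses the cached bytes
        AuthKey copy = (AuthKey) unmarshal(second);
        System.out.println("marshalCount = " + key.marshalCount);
        System.out.println("user = " + copy.getUser());
    }
}
```

The cache is only safe because the key never changes after construction; a mutable object would need to discard cachedBytes whenever its state is modified.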
Another scenario in which byte arrays can be useful is when you want to postpone demarshalling for a while. Suppose we change the previous code a little by removing restoreStateFromByteArray() from the demarshalling process. The new version of readExternal looks like the following:

```java
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    int byteArrayLength = in.readInt();
    _bytes = new byte[byteArrayLength];
    // readFully blocks until the whole array is filled; read() may return early.
    in.readFully(_bytes);
}
```

This new class isn't very useful as is; it requires some other instance, somewhere, to call restoreStateFromByteArray() before it can be said to be fully demarshalled. In essence, you've decoupled the "interpretation" of the bytes from the "transmission" of the bytes.

This type of decoupling can be useful in two distinct scenarios. The first scenario involves asynchronous messaging. If the client application doesn't need a return value, and doesn't need to find out about demarshalling errors, an RMI call can return a little faster because it won't have to wait for demarshalling to complete before returning. And the server can postpone interpreting the bytes until it needs to handle the fully demarshalled object (or it can perform demarshalling in a background thread).

But, to be honest, supporting asynchronous messaging, while important for certain classes of applications, is not a major reason for postponing interpretation. The major reason for postponing interpretation is that interpreting the bytes might involve loading classes that aren't available in the local process. When you interpret the bytes, you're creating instances of classes. If you don't have those classes in the process that's interpreting the bytes, instances of ClassNotFoundException will be thrown.

Let's consider a distributed event service again. Suppose you've built an event service based on a publish/subscribe metaphor using event channels. The event service is a server that lives on your network and is available on a 24/7 basis.
Client applications register for certain types of events (based on an event channel), and server applications drop off events in a particular channel. For any given event, the following processing sequence occurs:
There are two important points here. The first is that the event service doesn't change the event object at all. In that case, repeatedly marshalling the object (each time it's sent to another client) is wasted work; a piece of infrastructure such as an event service should use byte arrays to avoid that performance hit.

The second point is that the event service might have been written two years ago, and might not actually have most of the event objects on its classpath. This is perfectly OK: the event service doesn't need to know about those objects to deliver the events.

If you were actually building an event service, you might consider creating an envelope class. The envelope class would contain information about the event (information that helps the event service deliver the event) and would have a byte array with event-specific information that the recipient will demarshal. Using an envelope class makes the interface a little cleaner but doesn't alter the main idea (that explicitly storing and handling the byte array and controlling when the "interpretation step" happens can be useful).
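One way such an envelope might look is sketched below. Everything here is an assumption made for illustration: the Envelope class, its channel field, the openPayload() method, and the toBytes/fromBytes helpers are hypothetical names, and the "event service" is simulated by marshalling and demarshalling in memory. The point it tries to show is that the relay only reads the channel and never interprets the payload bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class EnvelopeSketch {

    // Hypothetical envelope: routing information plus an opaque, pre-marshalled event.
    public static class Envelope implements Externalizable {
        public static final long serialVersionUID = 1L;
        private String channel;
        private byte[] payload;

        public Envelope() { }   // public no-arg constructor required by Externalizable

        public Envelope(String channel, Serializable event) throws IOException {
            this.channel = channel;
            this.payload = toBytes(event);   // the event is marshalled exactly once
        }

        public String getChannel() { return channel; }

        // The "interpretation step": only the final recipient needs the event classes.
        public Object openPayload() throws IOException, ClassNotFoundException {
            return fromBytes(payload);
        }

        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(channel);
            out.writeInt(payload.length);
            out.write(payload);
        }

        public void readExternal(ObjectInput in) throws IOException {
            channel = in.readUTF();
            payload = new byte[in.readInt()];
            in.readFully(payload);           // bytes are stored, not interpreted
        }
    }

    public static byte[] toBytes(Object object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buffer);
        out.writeObject(object);
        out.close();
        return buffer.toByteArray();
    }

    public static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        return new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject();
    }

    public static void main(String[] args) throws Exception {
        // Publisher side: wrap the event and send the envelope.
        byte[] fromPublisher = toBytes(new Envelope("prices", "ACME 42"));

        // Event service side: demarshal the envelope and route on the channel only.
        Envelope atService = (Envelope) fromBytes(fromPublisher);
        byte[] toSubscriber = toBytes(atService);

        // Subscriber side: open the payload, which is where event classes get loaded.
        Envelope atSubscriber = (Envelope) fromBytes(toSubscriber);
        System.out.println(atSubscriber.getChannel() + ": " + atSubscriber.openPayload());
    }
}
```

In this sketch the event is a plain String for brevity; with a real event class, only the subscriber's process would need that class available when it calls openPayload().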