These utilities represent significant enhancements over the ones we developed earlier in the book. We can use our own choice of element names, handle multiple logical documents, and have data types converted for us automatically. However, there are still a few things we could improve.
Additional Data Types
Although I add more types in later chapters, there's still a good chance that you'll want the utilities to support additional data types. For example, you might want the utilities to handle dates in MM/DD/YYYY or DD/MM/YYYY format. The architecture and implementation are both designed to let you add new derived DataCell classes without much effort. In addition to coding the class, you need only change the RecordHandler's createDataCell method and the BBCommonFileDescription.xsd schema.
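As a rough illustration of what such an extension might look like, here is a minimal sketch of a date-handling cell class and the corresponding factory change. The class names DataCell and createDataCell come from the utilities discussed in this book, but everything else here (the simplified base class, the DateCell name, and its conversion logic) is hypothetical and much smaller than the real implementation:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical, greatly simplified stand-in for the book's DataCell hierarchy.
abstract class DataCell {
    protected String value;           // raw field text from the CSV row
    abstract String toXMLString();    // value as it should appear in the XML document
    void setValue(String v) { value = v; }
}

// Hypothetical new derived class: accepts MM/DD/YYYY input and emits
// ISO 8601 (suitable for an XML Schema xs:date).
class DateCell extends DataCell {
    private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("MM/dd/yyyy");
    @Override
    String toXMLString() {
        return LocalDate.parse(value, IN).toString();  // e.g. "2004-07-31"
    }
}

public class DateCellDemo {
    // Sketch of the one factory change you would make in the RecordHandler's
    // createDataCell method: map the new type name to the new class.
    static DataCell createDataCell(String type) {
        if ("date".equals(type)) return new DateCell();
        throw new IllegalArgumentException("unsupported type: " + type);
    }

    public static void main(String[] args) {
        DataCell cell = createDataCell("date");
        cell.setValue("07/31/2004");
        System.out.println(cell.toXMLString());  // prints 2004-07-31
    }
}
```

The remaining work, updating BBCommonFileDescription.xsd so that "date" is a legal type name in the file description document, is a schema edit rather than a code change.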
Variety of Record Types
What if your CSV file has rows with different formats instead of rows that all share the same format? For example, what if you have a row for a purchase order header followed by rows for line items, instead of one row per line item with the header information repeated in each row? This leads us quite nicely into the next chapter, which deals with this type of file organization, though with fixed-length records. We'll discuss this enhancement a bit at the end of Chapter 8.
Efficiency and Performance
As I have said repeatedly, while efficiency and performance are important, in this book I'm putting more emphasis on simplicity, clarity, maintainability, and reusability. However, if we need to wring some efficiency out of the code, there are a few prime targets we can examine first.
The perceptive reader will notice that when processing source CSV files we scan each input record twice: once to find the record terminator(s) as we read the input row, then a second time as we parse the columns out of the row. We could modify the parsing algorithm in the CSVRecordReader's parseRecord method so that the input comes directly from our file read operation rather than from a record buffer we have already read from disk. However, in doing so we would lose the ability to reuse the RecordReader's readRecordVariableLength method, and we would add a bit of complexity to the parseRecord method.
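The single-pass idea can be sketched as follows. This is not the book's CSVRecordReader.parseRecord; it is a simplified illustration (it ignores quoted fields and escaped delimiters entirely) meant only to show how field parsing and terminator detection can be combined into one scan over the stream:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Minimal single-pass sketch: fields are parsed directly as characters
// arrive from the Reader, so each record is scanned only once.
public class SinglePassParser {
    static List<String> parseRecord(Reader in) throws IOException {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == ',') {                 // end of field
                fields.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {         // end of record
                break;
            } else if (c != '\r') {         // tolerate CRLF terminators
                field.append((char) c);
            }
        }
        fields.add(field.toString());       // final field of the record
        return fields;
    }

    public static void main(String[] args) throws IOException {
        Reader r = new StringReader("PO123,ACME Corp,100.00\r\n");
        System.out.println(parseRecord(r));  // prints [PO123, ACME Corp, 100.00]
    }
}
```

The cost of this approach is exactly the one noted above: terminator handling now lives inside the parser, so a generic readRecordVariableLength routine can no longer be shared with the other readers.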
Another area of inefficiency is the repeated copying of Cell Buffer contents. This is probably most notable in the Java implementation, where we store the data as a byte array but frequently convert to and from String for formatting and for use with the DOM API methods. However, we also do conversions in C++ because the MSXML DOM API methods deal with string data as VARIANTs or BSTRs. For the Java implementation we could maintain both a byte array and a String or StringBuffer for the Cell Buffer contents. We could read data into the byte array from disk and convert to String as part of the toXML method, and do the reverse when processing XML as the source. This would require changes only to the DataCell class and its derived classes.
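The dual-representation idea for the Java side might look something like this. The class and method names here are hypothetical (this is not the book's DataCell code); the point is simply to show lazy conversion with caching, so that whichever representation was not supplied is computed at most once:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: keep both a byte array and a String for the Cell
// Buffer contents, converting lazily and caching the result so repeated
// accesses don't trigger repeated conversions.
public class DualBuffer {
    private byte[] bytes;      // representation as read from (or written to) disk
    private String string;     // cached representation for DOM/formatting use

    void setFromDisk(byte[] data) {
        bytes = data;
        string = null;         // invalidate the cached String
    }

    void setFromXML(String data) {
        string = data;
        bytes = null;          // invalidate the cached byte array
    }

    String asString() {        // e.g. called from toXML
        if (string == null) string = new String(bytes, StandardCharsets.UTF_8);
        return string;
    }

    byte[] asBytes() {         // e.g. called when writing the flat file
        if (bytes == null) bytes = string.getBytes(StandardCharsets.UTF_8);
        return bytes;
    }

    public static void main(String[] args) {
        DualBuffer cell = new DualBuffer();
        cell.setFromDisk("100.00".getBytes(StandardCharsets.UTF_8));
        System.out.println(cell.asString());  // prints 100.00
        cell.setFromXML("WIDGET");
        System.out.println(new String(cell.asBytes(), StandardCharsets.UTF_8));  // prints WIDGET
    }
}
```

As noted above, a change along these lines would be confined to the DataCell class and its derived classes; the rest of the architecture would be untouched.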
For C++ and MSXML, perhaps the most direct way to eliminate many of the conversions is to deal natively with the string data as VARIANTs and BSTRs rather than converting to and from char arrays. I have not taken that approach since one of my goals has been to try to keep the code at least somewhat portable. However, if you want to try to optimize the code (or your code) for a Microsoft environment, this is certainly one thing to investigate.
However, all of these efficiency tweaks are probably insignificant compared with the DOM overhead, and I'm not sure the performance gains would be worth the effort. The DOM API is an essential part of the architecture and the implementations, and there isn't much we can do about DOM performance without radically changing the design. In most cases it would probably be more cost effective to buy faster hardware than to redesign the utilities. The economics of using a DOM API with a royalty-free license, as opposed to trying to develop a more efficient approach that doesn't use one, are just too compelling.