XML::SAXDriver::CSV Perl Module


CSV format used to be the format of choice for small information interchange tasks between two applications. Just insert the data with each field separated by a comma and send. Here is a small sample of a CSV file that contains accounts receivable information:

    Account Num,Name,Address,Balance
    1,Mark Riehl,1600 Main Street,.95
    2,Ilya Sterin,1299 Pine Street,.95

Note that each CSV record is on a separate line delimited by a newline. Usually (but not always), CSV files have a column title that appears on the first line of the file.

The receiving application splits each record on its delimiter, a comma, and then processes the resulting fields. Several potential problems are associated with CSV files. First, the application must know the order of the fields in the file to make any sense of the data. With XML, because every field is described by its tag name (and possibly attributes), the data does not have to be in any particular order, as long as both ends exchange data formatted as XML. Second, CSV files can be confusing if the data contains embedded commas. A parsing application (whether it is written in Perl, C/C++, or Java) that simply splits on commas treats every comma as a field separator, even a comma that belongs inside a field. Finally, CSV files don't have the notion of a data type, so applications that use CSV files treat all columns as plain text. XML, by contrast, has a facility for specifying data types (for example, text, float, integer, and so forth) by using XML schemas. Even with these disadvantages, CSV files are widely used and will continue to be for some time, so we need to support them. Let's take a look at how easily we can develop a Perl program to convert a CSV file to XML.
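To make the embedded-comma problem concrete, here is a minimal sketch that compares a naive split with a quote-aware split. The record is hypothetical (it is not taken from the sample files), and it assumes the exporting application wraps fields that contain commas in double quotes. The sketch uses the core Text::ParseWords module; note that parse_line also treats single quotes and backslashes specially, so a dedicated CSV parser such as Text::CSV from CPAN is likely a better fit for real data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module; understands quoted fields

# Hypothetical record: the name field contains an embedded comma,
# so the exporter has wrapped it in double quotes.
my $record = '1,"Riehl, Mark",1600 Main Street';

# Naive approach: every comma is treated as a field separator.
my @naive = split /,/, $record;

# Quote-aware approach: commas inside double quotes stay in the field.
# The second argument (0) tells parse_line to drop the quote characters.
my @fields = parse_line(',', 0, $record);

printf "naive split: %d fields\n", scalar @naive;
printf "parse_line:  %d fields\n", scalar @fields;
print  "name field:  $fields[1]\n";
```

The naive split breaks the name into two pieces and reports four fields, whereas the quote-aware split keeps the name intact and reports three.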

Converting a CSV File to XML

Let's assume that you've recently developed an application that processes XML data and updates the address book for the corporate mail server. It has been widely accepted throughout the company. Recently, your company merged with another company and you've been asked to consolidate the address book data from several of their legacy mail servers into one corporate database. The problem is that the legacy mail servers were in place long before anyone had heard of XML, so they can export data only in CSV format.

So, for this task, you will be given all the exported CSV address book data, but you need to convert it to XML before your new application can process it. There are several ways to solve this problem. One solution is to use the XML::SAXDriver::CSV module, so let's take a closer look at how to do this.

XML::SAXDriver::CSV Perl Module-Based Example

This example demonstrates the use of the XML::SAXDriver::CSV module, written by Ilya Sterin (one of the authors), to convert our data from CSV to XML. This module supports fast stream-based conversions using a simple SAX-like interface. Because the XML::SAXDriver::CSV module is SAX2-compliant, its object properties resemble those of any SAX2 parser. That is one of the benefits of a SAX interface in the Perl modules: after you're familiar with the interface, you'll find that there are numerous applications for it. This module also provides options that can be used to customize the CSV to XML conversion. These module options are listed at the end of this section.

To solve this problem, we'll need to perform the following steps:

  1. Identify the input data format (that is, fields in the CSV file).

  2. Design the format of the output XML document.

  3. Develop the Perl program to convert between the two defined formats.

Granted, these steps seem like common sense (and for the most part, they are), but you would be surprised how often people jump right into writing code. Things go a lot more smoothly when both the input and output formats are defined. Let's take a look at the format of the incoming CSV file.

CSV Input Data Format

The CSV file exported from our legacy mail system contains customer address book information. Sometimes in an application such as this, you won't have the opportunity to design the content of the CSV file. It may come in only one format that may or may not support the format you're planning to use. So, you may need to perform some manipulation (for example, delete fields, reorder fields, and so forth). Listing 7.1 shows the input CSV file from the legacy mail server that contains two sample address book entries.

Listing 7.1 Sample address book records in CSV format. (Filename: ch7_address_book.csv)
    First Name,Last Name,Nick,Title,Business Name,Address,City,State,Zip,Phone Number 1,Phone Number 2
    Ilya,Sterin,listerin,CTO,Unravelnet Software,3044 Perl Dr.,Farmington Hills,MI,48334,247-555-1212,247-555-1213
    Mark,Riehl,mark,Systems Developer,Software Company,4488 XML Street,New Jersey,NJ,08736,255-545-8585,255-886-1432

The first row of the file contains the field names that identify the data in each column of the CSV file. As mentioned earlier, each row of the CSV file represents one record. Now that we know the format of the input data, our next step is to design the structure of the XML output file.

CSV files and column headings

Note that our example has column headings; however, this varies from application to application (that is, there aren't any rules that say they're required). So, don't count on always having column headings to define the fields; sometimes, the column heading titles are defined in separate files. I've included column headings to illustrate a particular feature of the Perl module.

XML Output File Format

Our conversion application names the XML elements after the field names that appear in the first row of the CSV file. So, we need to verify that the column heading names are the element names that we'd like to use in the generated XML file. If not, we can either change the names of the column headings in the CSV file or perform a mapping in the conversion program.
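If the headings are not the names we want, the mapping option might look something like the following sketch. The preferred names in the hash and the map_headings function are hypothetical, chosen only for illustration; headings that are not listed in the hash pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical mapping from CSV column headings to the element names
# we would prefer in the generated XML. Unlisted headings are kept as-is.
my %element_name_for = (
    'Nick'           => 'Nickname',
    'Phone Number 1' => 'Primary_Phone',
    'Phone Number 2' => 'Secondary_Phone',
);

# Return the preferred element name for each heading, in the same order.
sub map_headings {
    my @headings = @_;
    my @mapped;
    for my $heading (@headings) {
        push @mapped, exists $element_name_for{$heading}
            ? $element_name_for{$heading}
            : $heading;
    }
    return @mapped;
}

my @mapped = map_headings('First Name', 'Nick', 'Phone Number 1');
print join(',', @mapped), "\n";
```

Keeping the mapping in one hash means the CSV export itself never has to be edited, which helps when the legacy system regenerates the file.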

We're almost at the point where we can start discussing the program that performs the conversion between CSV and XML. Before we do that, let's take a look at a sample record in XML, based on the format we just discussed. The sample XML file is shown in Listing 7.2. Note that the root element named address_book has one child record element. The record element has multiple children. Remember, this is considered a single record because the file contains only one record element.

Listing 7.2 Sample XML file containing one record from the CSV file. (Filename: ch7_address_book.xml)
    <address_book>
      <record>
        <First_Name>Ilya</First_Name>
        <Last_Name>Sterin</Last_Name>
        <Nick>listerin</Nick>
        <Title>CTO</Title>
        <Business_Name>Unravelnet Software</Business_Name>
        <Address>3044 Perl Dr.</Address>
        <City>Farmington Hills</City>
        <State>MI</State>
        <Zip>48334</Zip>
        <Phone_Number_1>247-555-1212</Phone_Number_1>
        <Phone_Number_2>247-555-1213</Phone_Number_2>
      </record>
    </address_book>

As you can see, the column names that contain spaces in the CSV file have those spaces replaced with underscores (_) when they are used as tag names. This is the default behavior of XML::SAXDriver::CSV: it replaces any illegal XML element name character with an underscore. The substitution character is user-defined (as you'll see in the discussion of Listing 7.3).
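The substitution described above can be approximated in a few lines of Perl. This is a simplified sketch, not the module's actual implementation: the function name to_element_name is hypothetical, and the rule only allows ASCII letters, digits, underscores, periods, and hyphens, which is narrower than what the XML specification permits.

```perl
use strict;
use warnings;

# Turn a CSV column heading into a usable XML element name.
# $sub_char is the replacement character; it defaults to an underscore.
sub to_element_name {
    my ($heading, $sub_char) = @_;
    $sub_char = '_' unless defined $sub_char;

    # Replace every character outside the simplified legal set.
    (my $name = $heading) =~ s/[^A-Za-z0-9_.-]/$sub_char/g;

    # An element name may not begin with a digit, period, or hyphen,
    # so prefix an underscore in that case.
    $name = '_' . $name if $name =~ /^[0-9.-]/;

    return $name;
}

print to_element_name('First Name'), "\n";
print to_element_name('Business Name', '-'), "\n";
```

With the default character, "First Name" becomes "First_Name"; passing a hyphen as the second argument turns "Business Name" into "Business-Name".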

CSV to XML Conversion Using the XML::SAXDriver::CSV Perl Module

Let's now take a look at a simple Perl program that performs the CSV to XML conversion required of the address book data. The program is shown in Listing 7.3. Let's take a closer look at the program and walk through each of the major sections.

Listing 7.3 CSV to XML conversion program. (Filename: ch7_csv_xml_app.pl)
     1.   use strict;
     2.   use XML::SAXDriver::CSV;
     3.   use XML::Handler::YAWriter;
     4.   use IO::File;
     5.
     6.   my $input_file = shift;
     7.
     8.   my $csv = XML::SAXDriver::CSV->new();
     9.
    10.   my $writer = XML::Handler::YAWriter->new(
    11.       Output => IO::File->new(">ch7_csv_to_xml.xml"),
    12.       Pretty => {PrettyWhiteIndent => 1,
    13.                  PrettyWhiteNewline => 1});
    14.
    15.   $csv->parse(Source => {SystemId => $input_file},
    16.               Handler => $writer,
    17.               Declaration => {Version => '1.0'},
    18.               Dynamic_Col_Headings => 1);

Initialization

Lines 1-6: The opening section of the program has the standard pragma statement (use strict). For this program, we need to use the following three modules:

  • XML::SAXDriver::CSV

  • XML::Handler::YAWriter

  • IO::File

All three modules are required because they work together in this particular application. The XML::SAXDriver::CSV Perl module is a SAX driver, so it requires a SAX2 handler to process the SAX-generated events. The XML::Handler::YAWriter Perl module serves as that handler; it is a writer module that also formats and outputs the XML data. You can also write your own custom handlers if this one does not serve your purpose. Finally, the IO::File module is used to create a file object that is provided as input to the XML::Handler::YAWriter module.
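To give a feel for what a custom handler involves, here is a minimal sketch of a handler class that counts elements instead of writing XML. The class name ElementCounter and its count_of method are hypothetical. The sketch assumes the usual Perl SAX convention in which each event method receives a hash reference (with the element name under the Name key), and it simulates the event stream by hand rather than connecting to a real driver.

```perl
use strict;
use warnings;

{
    # Hypothetical handler class: counts how often each element name occurs.
    package ElementCounter;

    sub new {
        my ($class) = @_;
        return bless { counts => {} }, $class;
    }

    # Called once per opening tag; $element->{Name} holds the tag name.
    sub start_element {
        my ($self, $element) = @_;
        $self->{counts}{ $element->{Name} }++;
        return;
    }

    # Events this handler does not care about are accepted and ignored.
    sub start_document { return }
    sub end_document   { return }
    sub end_element    { return }
    sub characters     { return }

    # Report how many times an element with the given name was opened.
    sub count_of {
        my ($self, $name) = @_;
        return $self->{counts}{$name} || 0;
    }
}

# Simulate the events a driver might send for two address book records.
my $handler = ElementCounter->new;
$handler->start_document({});
$handler->start_element({ Name => 'records' });
for my $first_name ('Ilya', 'Mark') {
    $handler->start_element({ Name => 'record' });
    $handler->start_element({ Name => 'First_Name' });
    $handler->characters({ Data => $first_name });
    $handler->end_element({ Name => 'First_Name' });
    $handler->end_element({ Name => 'record' });
}
$handler->end_element({ Name => 'records' });
$handler->end_document({});

print "record elements seen: ", $handler->count_of('record'), "\n";
```

Because a handler is just an object that responds to these event methods, the same driver can feed a writer, a counter, or a database loader without any change to the parsing code.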

After all the modules are loaded, we use shift() to retrieve the name of the input file that will be provided as a command-line argument for this program.

     1.   use strict;
     2.   use XML::SAXDriver::CSV;
     3.   use XML::Handler::YAWriter;
     4.   use IO::File;
     5.
     6.   my $input_file = shift;

Creating the Required Objects

Lines 8-13: Now that we have the name of the input file, we next need to create an XML::SAXDriver::CSV object by calling the new() function. You can initialize the properties of the XML::SAXDriver::CSV object by passing them as a hash of key/value pairs. Any properties set at this time are global for this object instance, and any method called on this instance uses those global values. Global values can be reset or changed by modifying the value that was assigned to the property.

     8.   my $csv = XML::SAXDriver::CSV->new();
     9.
    10.   my $writer = XML::Handler::YAWriter->new(
    11.       Output => IO::File->new(">ch7_csv_to_xml.xml"),
    12.       Pretty => {PrettyWhiteIndent => 1,
    13.                  PrettyWhiteNewline => 1});

You can also override a global value by passing the property in a call to another function; the override is localized to that call and lasts only until the function returns. This relationship is illustrated in Figure 7.1.

Figure 7.1. The object property's scope.
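The scoping rule illustrated in Figure 7.1 can be sketched with a toy class. ToyConverter and its convert method are hypothetical and are not part of XML::SAXDriver::CSV; they only mimic the pattern of object-wide defaults that a single call can override.

```perl
use strict;
use warnings;

{
    # Hypothetical class used only to illustrate property scope.
    package ToyConverter;

    # Properties passed to new() become global for this instance.
    sub new {
        my ($class, %properties) = @_;
        return bless { SubChar => '_', %properties }, $class;
    }

    # Properties passed to convert() override the globals for this call
    # only; the object itself is left unchanged.
    sub convert {
        my ($self, %overrides) = @_;
        my %effective = (%$self, %overrides);
        return $effective{SubChar};
    }
}

my $converter = ToyConverter->new(SubChar => '-');

my $global_value   = $converter->convert();                 # uses the global
my $local_value    = $converter->convert(SubChar => '.');   # local override
my $after_override = $converter->convert();                 # global again

# Changing the stored value resets the global for later calls.
$converter->{SubChar} = '_';
my $after_reset = $converter->convert();

print "$global_value $local_value $after_override $after_reset\n";
```

The override passed to the second call does not leak into the third, while the direct assignment changes what every later call sees.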

After we initialize our XML::SAXDriver::CSV object, the next step is to create and initialize an XML::Handler::YAWriter object, so that we can use it as a handler. The XML::Handler::YAWriter module's new method is where we customize how the XML data is output. For example, we can specify the output type (such as a file or a scalar), custom escaping, and formatting.

The Output property is assigned a file handle for ch7_csv_to_xml.xml, which is opened for writing and receives the generated XML data. Pretty is an anonymous hash that contains the pretty-printing settings; we set PrettyWhiteIndent and PrettyWhiteNewline to true (1) to place each element on its own line and indent it according to its depth level.

Converting from CSV to XML

Lines 15-18: Now all the setup work is finished, and we're ready to call the XML::SAXDriver::CSV parse function to perform the conversion.

    15.   $csv->parse(Source => {SystemId => $input_file},
    16.               Handler => $writer,
    17.               Declaration => {Version => '1.0'},
    18.               Dynamic_Col_Headings => 1);

As I mentioned before, most properties resemble those of the SAX2 parsers that were discussed in Chapter 3, "Event-Driven Parser Modules," although a few additional capabilities have been added to enable customization of the conversion process. One of the properties used in this example is Dynamic_Col_Headings. This property tells the conversion processor to use the values in the first row (that is, the column headings) as the XML element names. Remember, some CSV files may not have the column names in the first row, so verify this before using this property. If any of the column names contain an illegal character (for example, a space), it will be replaced with an underscore (_) by default. If, for some reason, you don't want to use an underscore as the replacement character, you can specify a different replacement character by setting the SubChar property.
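To illustrate what using the first row as element names amounts to, the following sketch performs a comparable conversion with core Perl only. It is not how XML::SAXDriver::CSV is implemented: the function names csv_to_xml and escape_xml are hypothetical, the records and record element names are simply copied from the sample output, and the field splitting relies on Text::ParseWords, which does not cope with apostrophes in the data.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);   # core module; splits quoted fields

# Escape the characters that are special in XML character data.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Convert CSV text to XML, using the first row as the element names.
# $sub_char replaces characters that are not allowed in an element name.
sub csv_to_xml {
    my ($csv_text, $sub_char) = @_;
    $sub_char = '_' unless defined $sub_char;

    # Split into lines and skip blank ones.
    my @lines = grep { /\S/ } split /\r?\n/, $csv_text;

    # The first row supplies the headings, which become element names.
    my @names;
    for my $heading (parse_line(',', 0, shift @lines)) {
        (my $name = $heading) =~ s/[^A-Za-z0-9_.-]/$sub_char/g;
        push @names, $name;
    }

    # Every remaining row becomes one record element.
    my $xml = "<records>\n";
    for my $line (@lines) {
        my @fields = parse_line(',', 0, $line);
        $xml .= "  <record>\n";
        for my $i (0 .. $#names) {
            my $value = defined $fields[$i] ? $fields[$i] : '';
            $xml .= "    <$names[$i]>" . escape_xml($value) . "</$names[$i]>\n";
        }
        $xml .= "  </record>\n";
    }
    return $xml . "</records>\n";
}

print csv_to_xml("First Name,Last Name\nIlya,Sterin\nMark,Riehl\n");
```

Unlike the SAX driver, this sketch builds the whole document in memory, so it is suited only to small files; the stream-based module avoids that limitation.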

CSV to XML Conversion Program Output

When we run the CSV to XML conversion program, the file ch7_csv_to_xml.xml is created. The contents of the ch7_csv_to_xml.xml file are shown in Listing 7.4. Remember, the conversion program takes the name of the input CSV file as a command-line argument.

Listing 7.4 Results of the conversion from CSV to XML. (Filename: ch7_csv_to_xml.xml)
    <?xml version="1.0" encoding="UTF-8"?>
    <records>
      <record>
        <First_Name>Ilya</First_Name>
        <Last_Name>Sterin</Last_Name>
        <Nick>listerin</Nick>
        <Title>CTO</Title>
        <Business_Name>Unravelnet Software</Business_Name>
        <Address>3044 Perl Dr.</Address>
        <City>Farmington Hills</City>
        <State>MI</State>
        <Zip>48334</Zip>
        <Phone_Number_1>247-555-1212</Phone_Number_1>
        <Phone_Number_2>247-555-1213</Phone_Number_2>
      </record>
      <record>
        <First_Name>Mark</First_Name>
        <Last_Name>Riehl</Last_Name>
        <Nick>mark</Nick>
        <Title>Systems Developer</Title>
        <Business_Name>Software Company</Business_Name>
        <Address>4488 XML Street</Address>
        <City>New Jersey</City>
        <State>NJ</State>
        <Zip>08736</Zip>
        <Phone_Number_1>255-545-8585</Phone_Number_1>
        <Phone_Number_2>255-886-1432</Phone_Number_2>
      </record>
    </records>

As you can see, with fewer than 10 lines of Perl source code, we converted a CSV file to an XML file. These drivers were designed to be this easy to use; however, they are also very flexible and can support more complex situations if required. For example, if the XML::SAXDriver::CSV module's options don't satisfy your requirements, you can easily write your own customized handler class. This will be demonstrated later in this chapter.



XML and Perl
ISBN: 0735712891
Year: 2002
Pages: 145