GOTCHA #21 Default performance of Data.ReadXML

The System.Data.DataSet class provides great flexibility for disconnected access to data. Its capabilities to transform data into XML and to read data from XML come in very handy.

One major problem with using XML is performance. Consider the simple XML document in Example 3-3.

Example 3-3. A simple XML document

 <root>
     <row>
         <item1>0</item1>
         <item2>0.703552246887028</item2>
         <item3>0.993569961746023</item3>
         <item4>0.147870197961046</item4>
         <item5>0.740130904009627</item5>
     </row>
     <row>
         <item1>1</item1>
         <item2>0.378916004383432</item2>
         <item3>0.143134204737439</item3>
         <item4>0.419504510434114</item4>
         <item5>0.403854837363518</item5>
     </row>
 ...

The root element contains a number of row elements. Each row contains five elements named item1, item2, etc. Each item contains a value of type double.

If I have 100 rows in this document, it takes 90 milliseconds to read the XML document into the DataSet using ReadXML().^[*] If I have 1,000 rows, it takes 200 milliseconds. Not too bad. But if I have 5,000 rows, it takes 5,700 milliseconds. Finally, if I have 10,000 rows, it takes an objectionable 24,295 milliseconds (about 24 seconds). The time grows much faster than the row count: doubling the rows from 5,000 to 10,000 more than quadruples the time.

^[*] Thanks to Ruby Hjelte for bringing this to my attention during a recent project.

Interestingly, if I use the System.Xml.XmlDocument parser class to parse the XML document, it doesn't take that long. So what's the problem with ReadXML()?
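To try this comparison yourself, a minimal sketch along the following lines may help. It is not the code used for the timings above. The class name ParseVersusInfer and its helper methods are hypothetical, the document is built in memory rather than read from a file, and it uses Stopwatch, which assumes .NET 2.0 or later. Actual timings will vary with the machine and the .NET version.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Xml;

public static class ParseVersusInfer
{
    // Hypothetical helper: builds a document shaped like Example 3-3.
    public static string BuildXml(int rows)
    {
        Random random = new Random(0);
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < rows; i++)
        {
            sb.Append("<row><item1>").Append(i).Append("</item1>");
            for (int item = 2; item <= 5; item++)
            {
                sb.Append("<item").Append(item).Append('>')
                  .Append(XmlConvert.ToString(random.NextDouble()))
                  .Append("</item").Append(item).Append('>');
            }
            sb.Append("</row>");
        }
        return sb.Append("</root>").ToString();
    }

    // Times a plain parse of the document with XmlDocument.
    public static long TimeXmlDocument(string xml)
    {
        Stopwatch watch = Stopwatch.StartNew();
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    // Times DataSet.ReadXml with no schema loaded, so the schema is inferred.
    public static long TimeDataSet(string xml, out int rowCount)
    {
        Stopwatch watch = Stopwatch.StartNew();
        DataSet ds = new DataSet();
        ds.ReadXml(new StringReader(xml));
        watch.Stop();
        rowCount = ds.Tables[0].Rows.Count;
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string xml = BuildXml(10000);
        int rowCount;
        long parseMs = TimeXmlDocument(xml);
        long readMs = TimeDataSet(xml, out rowCount);
        Console.WriteLine("XmlDocument parse: {0} ms", parseMs);
        Console.WriteLine("DataSet.ReadXml ({0} rows): {1} ms", rowCount, readMs);
    }
}
```

Changing the argument to BuildXml() lets you see how each approach scales with the number of rows.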

It turns out that ReadXML() spends most of its time not in parsing the XML document, but in analyzing it to understand its format. In other words, it tries to infer a schema from the XML. So you can achieve a significant speedup by preloading the schema into the DataSet before reading the XML. You can obtain the schema in several ways. For instance, you can ask the sender of the document to provide you with the schema; you can create it manually; or you can use the xsd.exe tool to generate it.
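One possible reading of "create it manually" is to build the schema in code instead of writing an .xsd file. The sketch below assumes the document shape of Example 3-3: a DataSet named root, a table named row, and five columns declared as double. The class name SchemaInCode and its methods are hypothetical and do not come from the example project.

```csharp
using System;
using System.Data;
using System.IO;

public static class SchemaInCode
{
    // Describes the structure of Example 3-3 up front, so ReadXml has nothing to infer.
    public static DataSet CreateEmptyDataSet()
    {
        DataSet ds = new DataSet("root");        // matches the <root> element
        DataTable table = ds.Tables.Add("row");  // matches each <row> element
        for (int i = 1; i <= 5; i++)
        {
            table.Columns.Add("item" + i, typeof(double));
        }
        return ds;
    }

    // Reads the data into the predefined schema; IgnoreSchema tells ReadXml
    // to use the schema already present in the DataSet.
    public static DataSet Load(TextReader xml)
    {
        DataSet ds = CreateEmptyDataSet();
        ds.ReadXml(xml, XmlReadMode.IgnoreSchema);
        return ds;
    }

    public static void Main()
    {
        string xml =
            "<root><row><item1>0</item1><item2>0.5</item2><item3>0.25</item3>" +
            "<item4>0.125</item4><item5>0.75</item5></row></root>";
        DataSet ds = Load(new StringReader(xml));
        Console.WriteLine("Rows read: {0}", ds.Tables["row"].Rows.Count);
    }
}
```

With XmlReadMode.IgnoreSchema, data that does not match the predefined schema is discarded rather than added, so this approach fits best when you know the document format in advance.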

Example 3-4 shows the optimization realized when reading an XML document with 10,000 rows. It alternates between reading the XML without knowing its format and loading the format from an .xsd (XML Schema Definition) file before reading the data.

Example 3-4. Speedup due to preloading schema

C# (DataSetXMLSpeed)

 using System;
 using System.Data;

 namespace ReadingXML
 {
     class Test
     {
         // Times ReadXml, optionally preloading the schema first.
         private static void timeRead(bool fetchSchema)
         {
             DataSet ds = new DataSet();
             int startTick = Environment.TickCount;

             if (fetchSchema)
             {
                 // Preload the schema so ReadXml does not have to infer it
                 ds.ReadXmlSchema(@"..\..\data.xsd");
             }

             ds.ReadXml(@"..\..\data.xml");

             int endTick = Environment.TickCount;
             Console.WriteLine(
                 "Time taken to read {0} rows is {1} ms",
                 ds.Tables[0].Rows.Count,
                 (endTick - startTick));
         }

         [STAThread]
         static void Main(string[] args)
         {
             Console.WriteLine("Reading XML into DataSet");
             timeRead(false);

             Console.WriteLine(
                 "Reading XML into DataSet after reading Schema");
             timeRead(true);
         }
     }
 }

VB.NET (DataSetXMLSpeed)

 Module Test
     ' Times ReadXml, optionally preloading the schema first.
     Private Sub timeRead(ByVal fetchSchema As Boolean)
         Dim ds As DataSet = New DataSet
         Dim startTick As Integer = Environment.TickCount

         If fetchSchema Then
             ' Preload the schema so ReadXml does not have to infer it
             ds.ReadXmlSchema("..\data.xsd")
         End If

         ds.ReadXml("..\data.xml")

         Dim endTick As Integer = Environment.TickCount
         Console.WriteLine( _
          "Time taken to read {0} rows is {1} ms", _
          ds.Tables(0).Rows.Count.ToString(), _
          (endTick - startTick).ToString())
     End Sub

     Sub Main()
         Console.WriteLine("Reading XML into DataSet")
         timeRead(False)

         Console.WriteLine( _
             "Reading XML into DataSet after reading Schema")
         timeRead(True)
     End Sub
 End Module

In this example you read the data.xml file containing 10,000 rows in the format discussed in Example 3-3. In the first run, you load the DataSet with the raw XML document. In the second run, you preload the DataSet with the data.xsd schema file, then ask the program to read the XML document. The data.xsd file was generated using the xsd.exe tool from the .NET command prompt as follows:

 xsd data.xml
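If running xsd.exe is not convenient, a similar schema can probably be produced in code by letting a DataSet infer it once from a small sample and saving the result with WriteXmlSchema(). The following is a sketch under that assumption, not the approach used for the timings in this gotcha. The class name SchemaFromSample and its methods are hypothetical, and strings stand in for the data.xml and data.xsd files.

```csharp
using System;
using System.Data;
using System.IO;

public static class SchemaFromSample
{
    // Infers the schema once from a small sample and returns it as XSD text.
    public static string InferSchema(string sampleXml)
    {
        DataSet sample = new DataSet();
        sample.ReadXml(new StringReader(sampleXml), XmlReadMode.InferSchema);
        StringWriter writer = new StringWriter();
        sample.WriteXmlSchema(writer);
        return writer.ToString();
    }

    // Preloads the saved schema, then reads the full document.
    public static DataSet LoadWithSchema(string schema, string xml)
    {
        DataSet ds = new DataSet();
        ds.ReadXmlSchema(new StringReader(schema));
        ds.ReadXml(new StringReader(xml));
        return ds;
    }

    public static void Main()
    {
        string row = "<row><item1>0</item1><item2>0.5</item2><item3>0.25</item3>" +
                     "<item4>0.125</item4><item5>0.75</item5></row>";
        string schema = InferSchema("<root>" + row + "</root>");
        DataSet ds = LoadWithSchema(schema, "<root>" + row + row + row + "</root>");
        Console.WriteLine("Rows read: {0}", ds.Tables[0].Rows.Count);
    }
}
```

The cost of inference is paid only on the small sample; the schema text can then be saved to a file and reused for every large document in the same format.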

The time taken for each of these approaches is shown in the output in Figure 3-2.

Figure 3-2. Output showing the speedup from Example 3-4

Reading the XML document cold takes about 25 seconds, while reading it after preloading the schema takes just over half a second.

How does this differ in .NET 2.0 Beta 1? The speed of execution of ReadXML() has significantly improved in .NET 2.0. For the case of 10,000 rows without preloading the schema, it takes only around 1,000 ms. The time taken after preloading the schema was less than 420 ms. It still helps to preload the schema.

IN A NUTSHELL

Preload the schema into the DataSet before calling ReadXML(). This eliminates the time ReadXML() would otherwise spend inferring the schema from the XML document, and the difference in performance becomes more significant as the XML file size grows.
