Creating a Data File | NetBeans™ IDE Field Guide: Developing Desktop, Web, Enterprise, and Mobile Applications (2nd Edition)

Our next task is to create an XML file containing the Kennedy family tree in the appropriate format. I started by entering the data in a genealogy package, taking the information from public sources such as the web site of the Kennedy museum. The package I use is called The Master Genealogist, and like all such software it is capable of outputting the data in GEDCOM 5.5 format. This is a file containing records that look something like this (it's included in the downloads for this chapter as kennedy.ged):

  0 @I1@ INDI
  1 NAME John Fitzgerald/Kennedy/
  1 SEX M
  1 BIRT
  2 DATE 29 MAY 1917
  2 PLAC Brookline, MA, USA
  1 DEAT
  2 DATE 22 NOV 1963
  2 PLAC Dallas, TX, USA
  2 NOTE Assassinated by Lee Harvey Oswald.
  1 NOTE Educated at Harvard University.
  2 CONT Elected Congressman in 1945
  2 CONT aged 29; served three terms in the House of Representatives.
  2 CONT Elected Senator in 1952. Elected President in 1960, the
  2 CONT youngest ever President of the United States.
  1 FAMS @F1@
  1 FAMC @F2@

This isn't XML, of course, but it is a hierarchic data file containing tagged data, so it is a good candidate for converting into XML that looks like the document below. This doesn't conform to the GEDCOM 6.0 data model or schema, but it's a useful starting point.

  <INDI ID="I1">
    <NAME>John Fitzgerald/Kennedy/</NAME>
    <SEX>M</SEX>
    <BIRT>
      <DATE>29 MAY 1917</DATE>
      <PLAC>Brookline, MA, USA</PLAC>
    </BIRT>
    <DEAT>
      <DATE>22 NOV 1963</DATE>
      <PLAC>Dallas, TX, USA</PLAC>
      <NOTE>Assassinated by Lee Harvey Oswald.<BR/></NOTE>
    </DEAT>
    <NOTE>Educated at Harvard University.
    Elected Congressman in 1945
    aged 29; served three terms in the House of Representatives.
    Elected Senator in 1952. Elected President in 1960, the
    youngest ever President of the United States.
    </NOTE>
    <FAMS REF="F1"/>
    <FAMC REF="F2"/>
  </INDI>

Each record in a GEDCOM file has a unique identifier (in this case I1 - that's letter I, digit one), which is used to construct cross-references between records. Most of the information in this record is self-explanatory, except the <FAMS> and <FAMC> fields: <FAMS> is a reference to a <FAM> record representing a family in which this person is a parent, and <FAMC> is a reference to a family in which this person is a child.

The first stage in processing data is to do this conversion into XML, a process which we will examine in the next section.

Converting GEDCOM Files to XML

The obvious way to translate GEDCOM to XML is to write a program that takes a GEDCOM file as input and produces an XML file as output. However, there's a smarter way: why not write a GEDCOM parser which looks just like a SAX-compliant XML parser, so that any program that can handle SAX input can read GEDCOM directly, just by switching parsers? In particular, many XSLT processors can take input from a SAX-compliant parser, so this enables you to feed GEDCOM straight into a stylesheet.
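The idea of presenting GEDCOM as a stream of SAX events can be sketched in a few lines. The Python below is only an illustration and is not the GedcomParser class from the download: the function name gedcom_to_sax, the GED wrapper element, and the simplified line pattern are all assumptions, and ANSEL decoding is ignored entirely. The standard library's XMLGenerator plays the part of the receiving ContentHandler.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# level, optional @ID@, tag, optional value (simplified, assumed pattern)
LINE = re.compile(r"^\s*(\d+)\s+(?:@(\w+)@\s+)?(\w+)(?:\s(.*))?$")


def gedcom_to_sax(lines, handler, root="GED"):
    """Emit SAX events for GEDCOM lines, nesting elements by level number."""
    handler.startDocument()
    handler.startElement(root, AttributesImpl({}))
    open_tags = []  # stack of open tag names; its length is the current depth
    for raw in lines:
        m = LINE.match(raw.rstrip("\r\n"))
        if m is None:
            continue  # this sketch silently skips blank or malformed lines
        level, xref, tag, value = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        # Close every element that sits at this level or deeper.
        while len(open_tags) > level:
            handler.endElement(open_tags.pop())
        attrs = {}
        if xref:
            attrs["ID"] = xref
        if value and re.fullmatch(r"@\w+@", value.strip()):
            # A value such as @F1@ is a cross-reference, not text.
            attrs["REF"] = value.strip().strip("@")
            value = None
        handler.startElement(tag, AttributesImpl(attrs))
        open_tags.append(tag)
        if value:
            handler.characters(value)
    while open_tags:
        handler.endElement(open_tags.pop())
    handler.endElement(root)
    handler.endDocument()


sample = [
    "0 @I1@ INDI",
    "1 NAME John Fitzgerald/Kennedy/",
    "1 BIRT",
    "2 DATE 29 MAY 1917",
    "1 FAMS @F1@",
]
out = io.StringIO()
gedcom_to_sax(sample, XMLGenerator(out, encoding="utf-8"))
xml_text = out.getvalue()
```

Because the function talks only to the handler interface, any other SAX consumer could be passed in place of XMLGenerator, which is the property the approach relies on.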

Equally, many XSLT processors can send the result tree to a user-specified ContentHandler in the form of a stream of SAX events, so if we write a SAX2-compatible ContentHandler, our XSLT processor can also output GEDCOM files. This suddenly means we can write stylesheets to transform one GEDCOM file into another, without the hassle of creating an XML file as an intermediate form. An example of such a transformation would be one that removes all the living people from a GEDCOM file.

A SAX2 parser for GEDCOM 5.5 is supplied with the sample files for this chapter on the Wrox web site; it is named GedcomParser. GEDCOM 5.5 uses an archaic character set called ANSEL, so GedcomParser is accompanied by another class, AnselInputStreamReader, which translates the ANSEL characters into Unicode.

Similarly, on the output side, there is a SAX2 ContentHandler called GedcomOutputter, which in turn translates Unicode to ANSEL using an AnselOutputStreamWriter. We won't be using this in any of our examples, however. This structure is shown in Figure 11-2.

Figure 11-2

If you do want to see the XML, you can always feed the GEDCOM into a stylesheet that does an identity transformation, the simplest being identity.xsl:

  <xsl:transform
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      version="1.0">
    <xsl:template match="/">
      <xsl:copy-of select="."/>
    </xsl:template>
  </xsl:transform>

This is how I created the data file kennedy55.xml which is supplied in the download files.

If you want to convert your own GEDCOM file mytree.ged into XML, you can do it using Saxon by entering the command (all on one line):

  java net.sf.saxon.Transform -x GedcomParser mytree.ged identity.xsl >mytree.xml

The -x option on the command line causes Saxon to use the class GedcomParser as its XML parser. It doesn't matter that this isn't actually parsing XML; it's enough that it implements the SAX2 interface and thus pretends to be an XML parser.

The XML this produces is a direct mechanical translation of the original GEDCOM 5.5 file into XML. It is included in the download files for this chapter as kennedy55.xml. The next stage in the processing is to convert this so that it conforms to the GEDCOM XML 6.0 schema. This is obviously another job for XSLT. Let's now look at the stylesheet that achieves this conversion, which is in the download as ged55-to-6.xsl.

Converting From GEDCOM 5.5 to 6.0

The stylesheet ged55-to-6.xsl doesn't handle the full job of GEDCOM conversion, but it does handle the subset that we're using in this application. It starts like this:

  <xsl:transform
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      version="2.0">
    <!-- This stylesheet converts from the XML representation of GEDCOM 5.5
         to the GEDCOM 6.0 XML beta specification -->
    <xsl:strip-space elements="*"/>
    <xsl:output method="xml" indent="yes" encoding="iso-8859-1"/>
    <!-- import the schema for the result vocabulary -->
    <xsl:import-schema schema-location="gedSchema.xsd"/>

I'm going to use a schema-aware stylesheet to tackle this conversion. I won't be using a schema for the input vocabulary (because I haven't written one), but I will be using the schema for the result document. I will also be validating the result document against this schema. The main effect of this (for the time being) is that mistakes in the stylesheet that cause incorrect output to be generated are reported immediately, and pinpointed to the line in the stylesheet that caused the error. As I developed this stylesheet, this happened dozens of times before I got it right, and diagnosing the errors proved far easier than using the conventional approach of generating the output, inspecting it for obvious faults, and then running it through a separate validation phase. I'll give some examples of this later on.

This does mean that to run this example yourself, you will need to install a schema-aware processor. At the time of writing, the only schema-aware XSLT 2.0 processor available is the commercial version of Saxon, version 8.0 or later, which you can find at http://www.saxonica.com/. You can easily edit the stylesheet to remove the <xsl:import-schema> declaration and the «validation="strict"» attribute on the <xsl:result-document> instruction, and it will then work with a basic XSLT 2.0 processor. However, later stylesheets in this chapter make rather deeper use of schema-aware transformation.

There is no «namespace» attribute on the <xsl:import-schema> declaration, because the schema has no target namespace.

Top-Level Processing

We can now get on with the top-level processing logic:

  <xsl:param name="submitter" select="'Michael Kay'"/>
  <xsl:template match="/">
    <xsl:result-document validation="strict">
      <GEDCOM>
        <HeaderRec>
          <FileCreation
              Date="{format-date(current-date(), '[D1] [MN,*-3] [Y0001]')}"/>
          <Submitter>
            <Link Target="ContactRec" Ref="Contact-Submitter"/>
          </Submitter>
        </HeaderRec>
        <xsl:call-template name="families"/>
        <xsl:call-template name="individuals"/>
        <xsl:call-template name="events"/>
        <ContactRec Id="Contact-Submitter">
          <Name><xsl:value-of select="$submitter"/></Name>
        </ContactRec>
      </GEDCOM>
    </xsl:result-document>
  </xsl:template>

This template rule establishes the outline of the result tree. The containing <GEDCOM> element will contain: a header record, which we generate here and now; then a set of family records, a set of individual records, and a set of events, which must appear in that order; and finally a contact record to indicate the originator of the data set, which must be present because the mandatory <Submitter> element in the header refers to it. The name of the submitter is defined by a stylesheet parameter, so you can set a different value if you use this stylesheet on your own data files. (The reason this field is called «Submitter» is historic: GEDCOM was originally designed so that members of the LDS church could submit details of their ancestors to the church authorities.)

The instruction <xsl:result-document validation="strict"> causes the result tree to be validated. The system will do this by looking in the imported schemas for an element declaration of the outermost element in the result tree (the <GEDCOM> element) and then ensuring that the rest of the result tree conforms to this element declaration. In the case of Saxon, this validation is done on the fly: each element is validated as soon as it is written to the result tree, which means that any validation errors can be reported in relation to the stylesheet instruction that wrote the incorrect data.

In the header I have generated only those fields that are mandatory. These include the file creation date, which must be in the format «DD MMM YYYY». This can easily be generated in XSLT 2.0 using the combination of the current-date() and the format-date() functions. The current-date() function is in XPath, and is described in Chapter 10 of XPath 2.0 Programmer's Reference; the format-date() function is in XSLT, and is in Chapter 7 of this book.
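To make the effect of the picture «[D1] [MN,*-3] [Y0001]» concrete, here is a rough Python equivalent: day without padding, upper-case month name cut to three letters, four-digit year. The function name gedcom_date is invented for this illustration, and the month table is spelled out rather than taken from the locale so that the result does not depend on system settings.

```python
from datetime import date

# Upper-case three-letter month abbreviations, as used in GEDCOM dates.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def gedcom_date(d):
    """Format a date as 'D MMM YYYY', e.g. 13 JAN 2004."""
    return f"{d.day} {MONTHS[d.month - 1]} {d.year:04d}"


example = gedcom_date(date(2004, 1, 13))
```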

Creating Family Records

The <FamilyRec> elements in the result document correspond one-to-one with the <FAM> elements in the input, except that the event information is not included (it is output separately in <EventRec> elements, later). For example, the input element

  <FAM ID="F4">
    <HUSB REF="I3"/>
    <WIFE REF="I4"/>
    <CHIL REF="I2"/>
  </FAM>

is translated to the output element

  <FamilyRec Id="F4">
    <HusbFath>
      <Link Target="IndividualRec" Ref="I3"/>
    </HusbFath>
    <WifeMoth>
      <Link Target="IndividualRec" Ref="I4"/>
    </WifeMoth>
    <Child>
      <Link Target="IndividualRec" Ref="I2"/>
    </Child>
  </FamilyRec>

Here is the code to do this:

  <xsl:template name="families">
    <xsl:apply-templates select="/*/FAM"/>
  </xsl:template>
  <xsl:template match="FAM">
    <FamilyRec Id="{@ID}">
      <xsl:apply-templates select="HUSB, WIFE, CHIL"/>
    </FamilyRec>
  </xsl:template>
  <xsl:template match="FAM/HUSB">
    <HusbFath>
      <Link Target="IndividualRec" Ref="{@REF}"/>
    </HusbFath>
  </xsl:template>
  <xsl:template match="FAM/WIFE">
    <WifeMoth>
      <Link Target="IndividualRec" Ref="{@REF}"/>
    </WifeMoth>
  </xsl:template>
  <xsl:template match="FAM/CHIL">
    <Child>
      <Link Target="IndividualRec" Ref="{@REF}"/>
    </Child>
  </xsl:template>

One point worth noting here is the use of «select="HUSB, WIFE, CHIL"» to ensure that the elements of the family appear in the right order in the output. The GEDCOM 6.0 schema is very strict about the order of elements, whereas GEDCOM 5.5 was more liberal. This expression selects a sequence containing zero-or-one HUSB elements, zero-or-one WIFE elements, and zero-or-more CHIL elements, and processes them in that order.

If the input GEDCOM file is invalid, for example if a FAM contains more than one WIFE element, then the output file will also be invalid, and this will cause a validation error to be reported by the XSLT processor.
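The reordering achieved by «select="HUSB, WIFE, CHIL"» can be mimicked outside XSLT by visiting the child elements tag by tag rather than in document order. The following Python sketch does this with ElementTree; the function name family_rec and the deliberately shuffled input are made up for the illustration, and no schema validation is attempted.

```python
import xml.etree.ElementTree as ET

# GEDCOM 5.5 tag -> GEDCOM 6.0 wrapper element, as in the templates above
ROLE_ELEMENT = {"HUSB": "HusbFath", "WIFE": "WifeMoth", "CHIL": "Child"}


def family_rec(fam):
    """Build a FamilyRec element with members in HUSB, WIFE, CHIL order."""
    rec = ET.Element("FamilyRec", {"Id": fam.get("ID")})
    for tag in ("HUSB", "WIFE", "CHIL"):  # fixed output order
        for member in fam.findall(tag):   # document order within one tag
            wrapper = ET.SubElement(rec, ROLE_ELEMENT[tag])
            ET.SubElement(wrapper, "Link",
                          {"Target": "IndividualRec", "Ref": member.get("REF")})
    return rec


# The children are out of order on purpose.
fam = ET.fromstring(
    '<FAM ID="F4"><CHIL REF="I2"/><WIFE REF="I4"/><HUSB REF="I3"/></FAM>')
rec = family_rec(fam)
```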

Creating Individual Records

The code for mapping <INDI> records in the source to <IndividualRec> records in the result tree is similar in principle to the code for family records, though a little bit more complicated.

  <xsl:template name="individuals">
    <xsl:apply-templates select="/*/INDI"/>
  </xsl:template>
  <xsl:template match="INDI">
    <IndividualRec Id="{@ID}">
      <xsl:apply-templates select="NAME, SEX, REFN, NOTE, CHAN"/>
    </IndividualRec>
  </xsl:template>
  <xsl:template match="INDI/NAME">
    <IndivName>
      <xsl:analyze-string select="." regex="/(.*?)/">
        <xsl:matching-substring>
          <xsl:text> </xsl:text>
          <NamePart Type="surname" Level="1">
            <xsl:value-of select="regex-group(1)"/>
          </NamePart>
          <xsl:text> </xsl:text>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </IndivName>
  </xsl:template>

Note the code here for extracting the surname from the name using the new <xsl:analyze-string> instruction in XSLT 2.0. In GEDCOM 5.5 the surname is tagged by enclosing it between «/» characters; in 6.0, it is enclosed in a nested <NamePart> element. The 6.0 specification also allows tagging of other parts of the name, for example as a given name, a title, a generation suffix (such as «Jr») and so on; but as such fields aren't marked up in our source data, we can't generate them.
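The way <xsl:analyze-string> walks through the name, alternating between matching and non-matching substrings, can be imitated with an ordinary regular-expression scan. This Python sketch uses the same pattern «/(.*?)/»; the function name split_name and the (kind, text) pair representation are invented here for illustration.

```python
import re

SURNAME = re.compile(r"/(.*?)/")  # same pattern as the stylesheet


def split_name(name):
    """Return (kind, text) pairs; kind is 'surname' or 'text'."""
    parts = []
    pos = 0
    for m in SURNAME.finditer(name):
        if m.start() > pos:
            # text between matches: the non-matching substring
            parts.append(("text", name[pos:m.start()]))
        parts.append(("surname", m.group(1)))
        pos = m.end()
    if pos < len(name):
        parts.append(("text", name[pos:]))
    return parts


parts = split_name("John Fitzgerald/Kennedy/")
```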

  <xsl:template match="INDI/SEX">
    <Gender>
      <xsl:apply-templates/>
    </Gender>
  </xsl:template>
  <xsl:template match="INDI/REFN">
    <ExternalID Type="REFN" Id="{.}"/>
  </xsl:template>
  <xsl:template match="INDI/CHAN">
    <Changed Date="{DATE}" Time="00:00"/>
  </xsl:template>
  <xsl:template match="NOTE">
    <Note>
      <xsl:apply-templates/>
    </Note>
  </xsl:template>
  <xsl:template match="CONT">
    <xsl:text>&#x0a;</xsl:text>
    <xsl:value-of select="."/>
  </xsl:template>

The rules for NOTE elements apply to such elements wherever they appear in a GEDCOM file, which is why the patterns specify «match="NOTE"» rather than «match="INDI/NOTE"»; for other elements, the rules may be specific to their use within an <INDI> record.

In the original GEDCOM file a NOTE can contain multiple lines, which are arranged like this:

  1 NOTE Educated at Harvard University. Elected Congressman in 1945
  2 CONT aged 29; served three terms in the House of Representatives.
  2 CONT Elected Senator in 1952. Elected President in 1960, the
  2 CONT youngest ever President of the United States.

In the direct conversion to XML, the note appears like this (except that there is no newline before the first <CONT> start tag):

  <NOTE>Educated at Harvard University. Elected Congressman in 1945
  <CONT>aged 29; served three terms in the House of Representatives.</CONT>
  <CONT>Elected Senator in 1952. Elected President in 1960, the</CONT>
  <CONT>youngest ever President of the United States.</CONT>
  </NOTE>

The GEDCOM 6.0 specification allows only plain text in a <NOTE> element (it provides other elements for more complex information, such as a transcript of a will). So the ged55-to-6 conversion stylesheet preserves the line endings by inserting a newline character wherever a <CONT> element appeared. The final result is:

  <Note>Educated at Harvard University. Elected Congressman in 1945
  aged 29; served three terms in the House of Representatives.
  Elected Senator in 1952. Elected President in 1960, the
  youngest ever President of the United States.
  </Note>

The result isn't always satisfactory, because different genealogy packages that produce GEDCOM 5.5 vary widely in how they handle newlines and whitespace: but it works in this case.
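The handling of continuation lines amounts to joining the note's own text and each <CONT> child with a newline in front of every continuation. A small Python sketch of that flattening follows; the name note_text is hypothetical, and the sketch ignores any stray text between <CONT> elements, which the stylesheet removes with <xsl:strip-space> when it is whitespace only.

```python
import xml.etree.ElementTree as ET


def note_text(note):
    """Flatten a 5.5-style NOTE element: each CONT child starts a new line."""
    pieces = [note.text or ""]
    for cont in note.findall("CONT"):
        pieces.append("\n" + (cont.text or ""))
    return "".join(pieces)


note = ET.fromstring(
    "<NOTE>Educated at Harvard University. Elected Congressman in 1945"
    "<CONT>aged 29; served three terms in the House of Representatives.</CONT>"
    "<CONT>Elected Senator in 1952. Elected President in 1960, the</CONT>"
    "<CONT>youngest ever President of the United States.</CONT>"
    "</NOTE>"
)
flattened = note_text(note)
```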

A typical individual record after conversion looks like this:

  <IndividualRec Id="I2">
    <IndivName>Jaqueline Lee
      <NamePart Type="surname" Level="1">Bouvier</NamePart>
    </IndivName>
    <IndivName>
      <NamePart Type="surname" Level="1">Kennedy</NamePart>
    </IndivName>
    <IndivName>
      <NamePart Type="surname" Level="1">Onassis</NamePart>
    </IndivName>
    <Gender>F</Gender>
    <ExternalID Type="REFN" Id="2"/>
    <Changed Date="13 JAN 2004" Time="00:00"/>
  </IndividualRec>

GEDCOM 6.0 allows all the parts of an individual's name to be tagged indicating the type of the name, but it doesn't require it, and in our source data, there isn't enough information to achieve this. The <ExternalID> allows external reference numbers to be recorded: for example, it might be a stable reference number used to identify this record in a particular database. As with names, there's no limit on how many reference numbers can be stored; the idea is that the «Type» attribute distinguishes them.

Creating Event Records

The event records in the result tree correspond to events associated with individuals and families in the source data. As we've seen, the 6.0 data model treats events as first-class objects, which are linked to the individuals who participated in the event.

Our sample data set only includes a few different kinds of event: birth, marriage, divorce, death, and burial, and in the stylesheet we'll confine ourselves to handling these five, plus the other common event of baptism. We also handle the general EVEN tag which is used in GEDCOM 5.5 for miscellaneous events. It should be obvious how the code can be extended to handle other events.

  <xsl:template name="events">
    <xsl:apply-templates mode="event"
        select="/GED/INDI/(BIRT|BAPM|DEAT|BURI) |
                /GED/FAM/(MARR|DIV)"/>
    <xsl:apply-templates select="/GED/(INDI|FAM)/EVEN"/>
  </xsl:template>
  <xsl:template match="*" mode="event">
    <xsl:variable name="id">
      <xsl:value-of select="../@ID"/>
      <xsl:text>-</xsl:text>
      <xsl:number count="*"/>
    </xsl:variable>
    <EventRec Id="{$id}">
      <xsl:copy-of select="$event-mapping/*[name()=name(current())]/@*"/>
      <xsl:apply-templates select="." mode="participants"/>
      <xsl:apply-templates select="DATE, PLAC, NOTE"/>
    </EventRec>
  </xsl:template>
  <xsl:variable name="event-mapping">
    <BIRT Type="birth" VitalType="birth"/>
    <BAPM Type="baptism" VitalType="birth"/>
    <DEAT Type="death" VitalType="death"/>
    <BURI Type="burial" VitalType="death"/>
    <MARR Type="marriage" VitalType="marriage"/>
    <DIV Type="divorce" VitalType="divorce"/>
  </xsl:variable>
  <xsl:template match="EVEN">
    <xsl:variable name="id">
      <xsl:value-of select="../@ID"/>
      <xsl:text>-</xsl:text>
      <xsl:number count="*"/>
    </xsl:variable>
    <EventRec Id="{$id}" Type="{TYPE}">
      <xsl:apply-templates select="." mode="participants"/>
      <xsl:apply-templates select="DATE, PLAC, NOTE"/>
    </EventRec>
  </xsl:template>

This code identifies all the subelements of <INDI> and <FAM> that refer to events, and then processes these, creating one <EventRec> in the output for each. The identifier for the event is computed from the identifier of the containing <INDI> or <FAM> element plus a sequence number, and the attributes of the event are obtained from a look-up table based on the original element name. In the 6.0 model, the type of event (for example death or burial) is indicated by the «Type» attribute, whose values are completely open-ended. The optional «VitalType» attribute allows each event to be associated with one of the four key events of birth, death, marriage, and divorce: this means, for example, that the date of publication of an obituary can be used as an approximation for the date of death if no more accurate date is available, and that the announcement of banns can similarly be used to estimate the date of marriage.

The next two templates are used to generate the participants in an event. The first handles events associated with an individual, the second events associated with a couple (which come from the FAM record):
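The two ingredients of an event record, the computed identifier and the attribute look-up, are easy to see in isolation. In this Python sketch the dictionary mirrors the look-up table in the stylesheet, and the sequence number is the position of the event among all child elements of the record, which is what <xsl:number count="*"/> yields. The function name event_attributes and the skeletal <INDI> record are assumptions made for the example; the real record has content in each child.

```python
import xml.etree.ElementTree as ET

# Mirrors the look-up table held in the stylesheet variable
EVENT_MAPPING = {
    "BIRT": {"Type": "birth", "VitalType": "birth"},
    "BAPM": {"Type": "baptism", "VitalType": "birth"},
    "DEAT": {"Type": "death", "VitalType": "death"},
    "BURI": {"Type": "burial", "VitalType": "death"},
    "MARR": {"Type": "marriage", "VitalType": "marriage"},
    "DIV": {"Type": "divorce", "VitalType": "divorce"},
}


def event_attributes(record):
    """Return the attribute set for each event child of an INDI or FAM."""
    events = []
    # position counts every child element, event or not, starting at 1
    for position, child in enumerate(record, start=1):
        if child.tag in EVENT_MAPPING:
            attrs = {"Id": f"{record.get('ID')}-{position}"}
            attrs.update(EVENT_MAPPING[child.tag])
            events.append(attrs)
    return events


# Hypothetical record skeleton: only the element order matters here.
indi = ET.fromstring(
    '<INDI ID="I2"><NAME/><SEX/><REFN/><CHAN/><BIRT/><DEAT/></INDI>')
events = event_attributes(indi)
```

Supporting a further event type in this model is a matter of adding one entry to the table, which corresponds to adding one element to the look-up variable and one name to the select expression in the stylesheet.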

  <xsl:template match="INDI/*" mode="participants">
    <Participant>
      <Link Target="IndividualRec" Ref="{../@ID}"/>
      <Role>principal</Role>
    </Participant>
  </xsl:template>
  <xsl:template match="FAM/*" mode="participants">
    <Participant>
      <Link Target="IndividualRec" Ref="{../HUSB/@REF}"/>
      <Role>husband</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="{../WIFE/@REF}"/>
      <Role>wife</Role>
    </Participant>
  </xsl:template>

This leaves the handling of the date and place of the event. Both are potentially very complex information items. Dates, however, have changed little between GEDCOM 5.5 and 6.0, so they can be carried over unchanged.

  <xsl:template match="DATE">
    <Date><xsl:apply-templates/></Date>
  </xsl:template>

For the places where events occurred, we can try to be a bit more clever. Many of the events in our data set occurred in the United States, and have a PLAC record of the form «somewhere, XX, USA» where XX is a two-letter code identifying a state. This format is predictable because The Master Genealogist captures place names in a structured way and generates this comma-separated format on output. We can recognize places that follow this pattern, and use the regular-expression handling capability of XSLT 2.0 to generate a more structured <Place> element. This records the country as USA, and the state as the two-letter code preceding the country name; anything before the state abbreviation is tokenized using commas as the delimiter, and the sequence of tokens is output in reverse order (note the calls on reverse() and tokenize()) using individual <PlacePart> elements in the output.

  <xsl:template match="PLAC">
    <Place>
      <xsl:choose>
        <xsl:when test="matches(., '^.*,\s*[A-Z]{2},\s*USA\s*$')">
          <xsl:analyze-string select="."
              regex="^(.*),\s*([A-Z]{{2}}),\s*USA\s*$">
            <xsl:matching-substring>
              <PlaceName>
                <PlacePart Type="country" Level="1">USA</PlacePart>
                <PlacePart Type="state" Level="2">
                  <xsl:value-of select="regex-group(2)"/>
                </PlacePart>
                <xsl:for-each select="reverse(tokenize(regex-group(1), ','))">
                  <PlacePart Level="{5+position()}">
                    <xsl:value-of select="normalize-space(.)"/>
                  </PlacePart>
                </xsl:for-each>
              </PlaceName>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:message>Error: string "<xsl:value-of select="."/>"
                does not match regex</xsl:message>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:when>
        <xsl:otherwise>
          <PlaceName><xsl:value-of select="."/></PlaceName>
        </xsl:otherwise>
      </xsl:choose>
    </Place>
  </xsl:template>

The effect of these rules is that we end up with event records of the form:

  <EventRec Id="I2-5" Type="birth" VitalType="birth">
    <Participant>
      <Link Target="IndividualRec" Ref="I2"/>
      <Role>principal</Role>
    </Participant>
    <Date>28 JUL 1929</Date>
    <Place>
      <PlaceName>
        <PlacePart Type="country" Level="1">USA</PlacePart>
        <PlacePart Type="state" Level="2">NY</PlacePart>
        <PlacePart Level="6">Long Island</PlacePart>
        <PlacePart Level="7">Southampton</PlacePart>
      </PlaceName>
    </Place>
  </EventRec>

The names "Long Island" and "Southampton" are classified as levels 6 and 7 because we don't know enough about them to classify them more accurately: levels up to 5 have reserved meanings, whereas 6 and above are available for arbitrary purposes. The ordering of levels is significant: higher levels are intended to represent a finer granularity of place name, which is why we have reversed the order of the original components of the name.
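The place-name logic (match the «somewhere, XX, USA» pattern, split what precedes the state on commas, reverse it, and number the pieces from level 6 upward) can be traced with a short Python sketch. The function name place_parts and the tuple layout (type, level, text) are invented for this illustration; a place that does not match the pattern is returned as a single untyped part, corresponding to the <xsl:otherwise> branch.

```python
import re

# Same pattern as the stylesheet: "<anything>, XX, USA"
US_PLACE = re.compile(r"^(.*),\s*([A-Z]{2}),\s*USA\s*$")


def place_parts(place):
    """Return (type, level, text) tuples for a GEDCOM 5.5 PLAC value."""
    m = US_PLACE.match(place)
    if m is None:
        return [(None, None, place)]  # unstructured fallback
    parts = [("country", 1, "USA"), ("state", 2, m.group(2))]
    tokens = reversed(m.group(1).split(","))  # coarsest component first
    for offset, token in enumerate(tokens, start=1):
        # " ".join(token.split()) trims and collapses whitespace,
        # like normalize-space()
        parts.append((None, 5 + offset, " ".join(token.split())))
    return parts


parts = place_parts("Southampton, Long Island, NY, USA")
```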

Debugging the Stylesheet

This completes the presentation of the stylesheet used to convert the data from GEDCOM 5.5 to 6.0 format. I'd like to add some notes, however, from my experience of developing this stylesheet. This was the first time I had developed a real stylesheet using a schema-aware XSLT processor, and it may be worth sharing what I learnt.

Actually, it was the first time anyone had developed a schema-aware stylesheet; the necessary features in Saxon were still hot from the oven while I was writing it.

The vast majority of my errors in coding this stylesheet, unless they were basic XSLT or XPath errors, were detected as a result of the on-the-fly validation of the result document against its schema. These errors included:

Leaving out required attributes
Misspelling element names (for example, ExternalId for ExternalID)
Generating elements in the wrong order
Placing an element at the wrong level of nesting
Generating an invalid value for an attribute

With the current version of Saxon, none of these errors are detected at stylesheet compile time, but they are all reported while executing the stylesheet, and in nearly all cases the error message identifies exactly where the stylesheet is wrong. For example, if the code in the initial template is changed to read

  <Submitter>
    <Link Ref="Contact-Submitter"/>
  </Submitter>

then the transformation fails with the message:

  Error at element Link on line 24 of ged55-to-6.xsl:
    Required attribute Target is missing

This process caught quite a few basic XSLT coding errors. For example, I originally wrote:

  <xsl:template match="FAM/CHIL">
    <Child>
      <Link Target="IndividualRec" Ref="@REF"/>
    </Child>
  </xsl:template>

in which the curly braces around «@REF» have been omitted. This resulted in the error message:

  Error at element Link on line 61 of ged55-to-6.xsl:
    The value '@REF' is not a valid NCName

The error message arises because in the absence of curly braces, the system has tried to use «@REF» as the literal value of the «Ref» attribute, and this is not allowed because the attribute is defined in the schema to have type IDREF, which is a subtype of NCName. An NCName cannot contain an «@» character.

Similarly, errors in the picture of the format-date() function call were picked up because they resulted in a string that did not match the picture defined in the schema for the StandardDate type.

However, schema validation of the result tree will not pick up all errors. I had some trouble, for example, getting the regular expression for matching place names right, but the errors simply resulted in the output file containing an empty <Place> element, which is allowed by the schema.