Modeling a Family Tree | NetBeans™ IDE Field Guide: Developing Desktop, Web, Enterprise, and Mobile Applications (2nd Edition)

Genealogical data is complex for two main reasons:

We want to record all the facts that we know about our ancestors, and many of these facts will not fit into a rigidly predefined schema. For those facts that follow a regular pattern, however, we want to use a structured representation so that we can analyze the data.
The information we have is never complete, and it is never 100% accurate. Genealogy is always work-in-progress, and the information we need to manage includes everything from original source documents and oral evidence to the conjectures of other genealogists (not to mention Aunt Maud) whom we may or may not trust. In this respect it is similar to other investigative applications like crime detection and medical diagnosis.

One caveat before we start. Throughout this book I have been talking about tree models of XML, and I have been using words like parent and child, ancestor and descendant, in the context of these data trees. Don't imagine, though, that we can use this tree structure to represent a family tree directly. In fact, a family tree is not really a tree at all, because most children in real life have two parents, unlike XML elements where one parent is considered sufficient.

The structure of the family tree is quite different from the document tree used to represent it. And in this chapter, words like parent and child have their everyday meaning!

The GEDCOM Data Model

The established standard for representing genealogical data is known as GEDCOM, and data in this format is routinely exchanged between software packages and posted on the Internet. I will show some examples of this format later in the chapter.

In earlier editions of this book I devised my own way of translating this into XML. However, in December 2002 the LDS Church (which maintains the GEDCOM specification) published a beta-release specification of GEDCOM XML, version 6.0. Although this is not yet widely supported by software products, it is this vocabulary that I shall use in this chapter. The specification is available at http://www.familysearch.org/GEDCOM/GedXML60.pdf . A great deal of further information about the use of XML in genealogy can be found on the XML Cover Pages at http://xml.coverpages.org/genealogy.html .

The GEDCOM XML spec includes a DTD rather than a schema. The DTD has been extracted from the specification and published as a freestanding file at http://xml.coverpages.org/GEDCOMv60BetaDTD-Brown.txt. I have copied this for convenience as file gedXML.dtd in the download file for this chapter.

In defining version 6 of GEDCOM, the designers decided to do two things at the same time: to change the syntax of the data representation, so that it used XML instead of GEDCOM's earlier proprietary tagging syntax, and to change the data model, to fix numerous problems that had inhibited accurate data exchange between different software packages for years.

The three main objects in the new model are individuals, events, and families.

It might seem obvious what an individual is, but serious genealogists know that identifying individuals is actually one of the biggest problems: is the Henry Kay who was born in Stannington in 1833 the same individual as the Henry Kay who married Emma Barber in Rotherham in 1855? (If you happen to know, please tell me.)

For this reason, the data is actually centered around the concept of an Event. The main events of interest are births, marriages, and deaths, but there are many others: for example, emigration, writing a will, and a mention in a published book can all be treated as events. In earlier times, births and deaths were not systematically recorded, but baptisms and burials were, so these events assume a special importance. Events have a number of attributes:

The date of the event. There are many complexities involved in recording historical dates, due to the use of different calendars, partial legibility, and varying precision.
The place of the event. Again, this is not a simple data element. Places change their names over time, and place names are themselves structured information, with a structure that varies from one country to another. (Some software packages like to pretend that every event happens in a "city", but they are wrong. Even in the limited data used in this chapter, we have two deaths that occurred in the air, over international waters.)
The participants in the event. There may be any number of participants, and each has a role. For example, if the event is a marriage, then everyone who is known to have been present at the wedding can be regarded as a participant. Obvious roles include that of the bride, the groom, and the witnesses; but many records also record the names of the father of the bride and the father of the groom, and this information has obvious genealogical significance. Moreover, it's important to record it even if it seems redundant, because it may help to resolve questions that are raised later when conflicting evidence emerges.
Evidence for the event. This includes references to source information recording the event, and may include copies or transcripts of original documents.

Here is an example of an event from the Kennedy data set. I have included some additional information beyond that in the data we are using, to show some of the additional possibilities in the data model.

  <EventRec Id="F1-6" Type="marriage" VitalType="marriage">
    <Participant>
      <Link Target="IndividualRec" Ref="I1"/>
      <Role>husband</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="I2"/>
      <Role>wife</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="I19"/>
      <Role>best man</Role>
    </Participant>
    <Date>12 SEP 1953</Date>
    <Place>
      <PlaceName>
        <PlacePart Type="country" Level="1">USA</PlacePart>
        <PlacePart Type="state" Level="2">RI</PlacePart>
        <PlacePart Type="city" Level="4">Newport</PlacePart>
      </PlaceName>
    </Place>
  </EventRec>

This event is the marriage of John F. Kennedy to Jacqueline Lee Bouvier. Of course, the record only makes sense by following the links to the participating individuals.
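Following the links can be sketched in a few lines of Python using only the standard library. This is an illustration rather than part of the GEDCOM tooling: the two IndividualRec entries are minimal stand-ins written as plain-string names, the function name participants is hypothetical, and the record for I19 is deliberately left out to show what an unresolved link looks like.

```python
import xml.etree.ElementTree as ET

# Cut-down data set: the event from the text plus two stand-in individuals.
# I19 (the best man) is intentionally absent from this sample.
SAMPLE = """\
<GEDCOM>
  <IndividualRec Id="I1"><IndivName>John F. Kennedy</IndivName></IndividualRec>
  <IndividualRec Id="I2"><IndivName>Jacqueline Lee Bouvier</IndivName></IndividualRec>
  <EventRec Id="F1-6" Type="marriage" VitalType="marriage">
    <Participant>
      <Link Target="IndividualRec" Ref="I1"/>
      <Role>husband</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="I2"/>
      <Role>wife</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="I19"/>
      <Role>best man</Role>
    </Participant>
    <Date>12 SEP 1953</Date>
  </EventRec>
</GEDCOM>
"""


def participants(root, event_id):
    """Return (role, name) pairs for one event, following each Link by its Ref."""
    names = {
        rec.get("Id"): rec.findtext("IndivName", default="(unnamed)")
        for rec in root.iter("IndividualRec")
    }
    event = next(e for e in root.iter("EventRec") if e.get("Id") == event_id)
    pairs = []
    for part in event.findall("Participant"):
        ref = part.find("Link").get("Ref")
        role = part.findtext("Role", default="(no role)")
        # A link may point at a record that is not in this file.
        pairs.append((role, names.get(ref, f"(unresolved {ref})")))
    return pairs


root = ET.fromstring(SAMPLE)
for role, name in participants(root, "F1-6"):
    print(f"{role}: {name}")
```

Reporting an unresolved reference instead of failing is in keeping with the data: genealogical files are routinely incomplete.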

The properties of an individual include:

Name (another potentially very complex data element, given the variety of conventions used in different places at different times). This element can be repeated, because a person can have different names at different times.
Gender (male, female, or unknown: the model does not recognize this as an attribute that can change over time).
Personal information: an open-ended set of information items about the person, each tagged with the type of information, optional date and place fields, and the actual information content. Certain types of personal information such as occupation, nationality, religion, and education are specifically recognized in the specification, but the list is completely open-ended.

The third fundamental object in the GEDCOM model is the family. A family is defined as a social group in which one individual takes the role of husband/father, another takes the role of wife/mother, and the others take the role of children. Any of the individuals may be absent or unknown, and the model is flexible as to the exact nature of the relationships: the parents, for example, are not necessarily married, and the children are not necessarily the biological children of the parents. An individual may be a member of several families, either consecutively or concurrently (membership in a family is not governed by dates).

There are actually three ways of representing relationships in the model. One way is through families, as described above. The second is through events: a birth event may record the person being born, the mother, and the father as participants in the event with corresponding roles. For certain key events, there are fixed roles with defined names (principal, mother, and father in this case). The third way is to use the properties of an individual: one can record as a property of an individual, for example, that his godfather was Winston Churchill.

These variations are provided to reflect the variety of ways in which genealogical data becomes available. The genealogical research process starts by collecting raw data, which is usually data either about events or about individuals, and gradually builds from this to draw inferences about the identity of individuals and the way in which they relate to each other in families. The model has the crucial property that it allows imprecise information to be captured: for example, you can record that A and B were cousins without knowing precisely how they were related, and you can record that someone was the second of five children without knowing who the other children were. The ability to record this kind of information makes XML ideally suited to genealogical data management.
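As a rough illustration of the second route, the sketch below derives parent links from birth events alone. The sample event is invented, the value "birth" for the Type attribute is an assumption, and parents_from_births is a hypothetical helper name; only the role names principal, mother, and father come from the description above.

```python
import xml.etree.ElementTree as ET

# Invented sample: a birth event that names the mother but not the father.
SAMPLE = """\
<GEDCOM>
  <EventRec Id="E1" Type="birth">
    <Participant>
      <Link Target="IndividualRec" Ref="I3"/>
      <Role>principal</Role>
    </Participant>
    <Participant>
      <Link Target="IndividualRec" Ref="I2"/>
      <Role>mother</Role>
    </Participant>
  </EventRec>
</GEDCOM>
"""


def parents_from_births(root):
    """Map each child's Id to the parents recorded as participants in a birth event."""
    found = {}
    for event in root.iter("EventRec"):
        if event.get("Type") != "birth":  # assumed Type value for birth events
            continue
        roles = {
            part.findtext("Role"): part.find("Link").get("Ref")
            for part in event.findall("Participant")
        }
        child = roles.get("principal")
        if child is not None:
            # Only record the parents that the source actually names.
            found[child] = {r: roles[r] for r in ("mother", "father") if r in roles}
    return found


relations = parents_from_births(ET.fromstring(SAMPLE))
print(relations)
```

The unknown father is simply absent from the result, which mirrors the way the model tolerates partial information.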

Apart from individual, event, and family, there are five other top-level object types in the GEDCOM model, but we won't be dealing with them in this chapter:

A group is a collection of individuals related in some arbitrary way (for example, the individuals who were staying at a particular address on the night of a census).
A contact is typically another genealogist, for example one who collaborates in the research on the individuals in this data set.
A source is a document from which information has been obtained, such as a parish register or a will. It might also be a secondary source such as a published obituary.
A repository is a place where source documents may be found, for example a library or a web site, or the bottom drawer of your filing cabinet.
An LDS Ordinance is an event of specific interest to the Church of Jesus Christ of Latter-day Saints (often called the Mormons), which is the organization that created the GEDCOM standard.

I'm not going to spend time discussing whether this is the perfect way of representing genealogical information. Many people have criticized the data model, either on technical grounds or from the point of view of political correctness. The new model in version 6 corrects many of the faults of the established version, without departing from it as radically as some people would have liked.

I would have liked to see some further changes, for example some explicit ability to associate personal names with events rather than with individuals (I have an ancestor who is named Ada on her birth certificate, but who was baptized as Edith). But with luck, the amount of change in the GEDCOM model is enough to fix the worst faults, yet not so extensive that software products will need wholesale rewriting before they can support it.

Creating a Schema for GEDCOM 6.0

Because genealogical data is a perfect example of semi-structured data (it includes the full spectrum from raw images and sound recordings, through transcribed text, to fully structured and hyperlinked data) it is an ideal candidate for using an XML Schema to drive validation of the data and to produce XSLT stylesheets that are schema-aware. I have therefore produced a schema for a subset of this DTD, which I introduce in the next section.

My first step was to load the DTD into XMLSpy and convert it to a schema. This required a bit of pre-processing: I found that XMLSpy didn't like the xml:lang attributes defined in the DTD, so I edited them to change the name to xml-lang. The first-cut schema produced by XMLSpy is included as rawschema1.xsd. The options I used for the conversion are shown in Figure 11-1.

Figure 11-1

However, having chosen these options, I then made many changes to the schema, and in retrospect, I might well have got to the final result just as quickly by choosing a different option. I've included the full schema in the download file gedXML.xsd, but in this chapter I'm only going to show those parts that we are actually using in this application.

The modifications I made to the automatically-generated schema are of two kinds:

Structural changes: for example promoting anonymous types to top-level named types, and using a common type where two elements have the same structure. In particular, I extracted the elements defined as children of the document element <GEDCOM>, for example <IndividualRec> and <EventRec>, and made these into top-level element declarations.
Defining additional constraints that can be expressed in a schema but not in a DTD. I have concentrated my efforts on those elements and attributes that are actually used in this example application.

An interesting feature of this data is that the schema is very permissive. For example, it specifies a default format for dates in the form «DD MMM YYYY» (such as «18 APR 1924»), which has long been the convention used by genealogists. However, it doesn't insist that the date of an event takes this form. It's quite OK, for example, to replace the last digit of the year by a question mark, perhaps to reflect the fact that the digit is difficult to decipher on an original manuscript. There are certain approved conventions such as preceding the date with «ABT» to indicate that the date is approximate, or «EST» to say that it is estimated, but there are no absolute rules. The golden rule in genealogy is that when you find information in a source document, you should be able to transcribe it as faithfully to the original as you possibly can, and a schema that imposes restrictions on your ability to do this is considered a bad thing. If you find an old church register in which a date of baptism is recorded as Septuagesima 1582, then you should be able to enter that in your database. I'll come back to the modeling of dates in the schema in the Dates section later in this chapter.

In GEDCOM, there is no formal way of linking one file to another. XML, of course, creates wonderful opportunities to define how your family tree links to someone else's. But the linking isn't as easy as it sounds (nothing is, in genealogy) because of the problems of maintaining version integrity between two datasets that are changing independently. So I'll avoid getting into that area, and stick to the model that the whole family tree is in one XML document.

The GEDCOM 6.0 Schema

Let's now take a quick look at some aspects of the XML Schema that I created for GEDCOM 6.0. In principle, because it's converted from the DTD, it covers all aspects of the specification; however, in improving the schema to describe the specification more precisely and more usefully, I concentrated on the parts that we are actually using in the application in this chapter: in particular, the three main object types (individual, event, and family) and the three main properties (date, place, and personal name).

Individuals

Here is the element declaration for an <IndividualRec>:

  <xs:element name="IndividualRec">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="IndivName" type="IndivNameType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Gender" type="GenderType" minOccurs="0"/>
        <xs:element name="DeathStatus" type="xs:string" minOccurs="0"/>
        <xs:element name="PersInfo" type="PersInfoType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="AssocIndiv" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Link" type="LinkType"/>
              <xs:element name="Association" type="xs:string"/>
              <xs:element name="Note" type="NoteType"
                          minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="Citation" type="CitationType"
                          minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="DupIndiv" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Link" type="LinkType"/>
              <xs:element name="Note" type="NoteType"
                          minOccurs="0" maxOccurs="unbounded"/>
              <xs:element name="Citation" type="CitationType"
                          minOccurs="0" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="ExternalID" type="ExternalIDType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Submitter" type="SubmitterType" minOccurs="0"/>
        <xs:element name="Note" type="NoteType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Evidence" type="EvidenceType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Enrichment" type="EnrichmentType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Changed" type="ChangedType"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="Id" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>

IndivName gives the name of the individual. Gender has the obvious meaning; DeathStatus is for recording information such as "died in infancy" when no specific death event is known. PersInfo allows recording of arbitrary personal information such as occupation and religion. AssocIndiv is for links to related individuals where the relationships cannot be expressed directly through Family objects (for example, links to godparents). DupIndiv is interesting: it allows an assertion that this IndividualRec refers to the same individual as another IndividualRec. This is very useful when combining data sets compiled by different genealogists; merging the two records into one can be very difficult if there are inconsistencies in the data, and it can prove very difficult to unmerge the data later if they are found to be different individuals after all. ExternalID is for reference numbers that identify the individual in external databases; Submitter is the person who created this record; Note is for arbitrary comments; Evidence says where the information came from; Enrichment is for inline documentation such as photographs or transcripts of original documents, and Changed is for a change history of this record.
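One way to consume DupIndiv assertions without physically merging records is to group the identifiers they connect. The sketch below does this with a small union-find structure; it is only an illustration, the sample records and the function name duplicate_groups are invented, and it assumes each DupIndiv carries a Link whose Ref names the other record.

```python
import xml.etree.ElementTree as ET

# Invented sample: I1 is asserted to duplicate I7, and I7 to duplicate I12.
SAMPLE = """\
<GEDCOM>
  <IndividualRec Id="I1">
    <DupIndiv><Link Target="IndividualRec" Ref="I7"/></DupIndiv>
  </IndividualRec>
  <IndividualRec Id="I7">
    <DupIndiv><Link Target="IndividualRec" Ref="I12"/></DupIndiv>
  </IndividualRec>
  <IndividualRec Id="I9"/>
  <IndividualRec Id="I12"/>
</GEDCOM>
"""


def duplicate_groups(root):
    """Group IndividualRec ids that DupIndiv links declare to be the same person."""
    parent = {rec.get("Id"): rec.get("Id") for rec in root.iter("IndividualRec")}

    def find(x):
        # Walk up to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for rec in root.iter("IndividualRec"):
        for dup in rec.findall("DupIndiv"):
            ref = dup.find("Link").get("Ref")
            if ref in parent:  # ignore links to records outside this file
                parent[find(rec.get("Id"))] = find(ref)

    groups = {}
    for ident in parent:
        groups.setdefault(find(ident), []).append(ident)
    return sorted(sorted(group) for group in groups.values())


groups = duplicate_groups(ET.fromstring(SAMPLE))
print(groups)
```

Because the original records stay untouched, withdrawing an assertion later only means deleting one DupIndiv element, which avoids the unmerging problem described above.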

Most of these fields are optional and repeatable. Something I haven't captured in this schema is that the GEDCOM spec also says the structure is extensible; arbitrary namespaced elements may be inserted at any point in the structure. This is typically used to contain information specific to a particular product vendor, so that GEDCOM can be used to exchange data between users of that product with no loss of information.

I chose to make IndividualRec a top-level element declaration in the schema. This isn't needed for validation, since in a GEDCOM file the IndividualRec will always be a child of the <GEDCOM> element. However, it makes this type available in stylesheets, which is a great convenience: for example, I can write a function whose parameter is declared as <xsl:param name="indi" as="schema-element(IndividualRec)"/>.

Having made IndividualRec a top-level element declaration, there seems to be nothing that would be gained by naming its complex type as a top-level type definition. In general, the only types that are worth naming as top-level types are those that are used in more than one place, or at least look likely to be used in more than one place, and that isn't the case here.

For most of the child elements of IndividualRec, I chose to use a local element declaration referring to a global type. There's nothing absolute about this. In many cases I could equally have used a reference to another global element declaration. Since many elements such as Date and Note appear in more than one place, referring to a global element declaration would make sense, and an accident of the DTD conversion is that this has been done for simple types but not for complex types. When it comes to writing an XSLT stylesheet, it's important that where a data element such as Date appears in several places, it should either use a global element declaration or a global type definition, but one of these is probably sufficient, and either will do.

There are no substitution groups in this model. They aren't needed, because the model has chosen to use generic elements like <PersInfo> rather than specialized types such as <occupation> and <religion>. The need for substitution groups generally arises when there are many elements that are structurally interchangeable.

Events

An event record has this structure:

  <xs:element name="EventRec">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Participant" type="ParticipantType"
                    maxOccurs="unbounded"/>
        <xs:element name="Date" type="DateType" minOccurs="0"/>
        <xs:element name="Place" type="PlaceType" minOccurs="0"/>
        <xs:element name="Religion" type="xs:string" minOccurs="0"/>
        <xs:element name="ExternalID" type="ExternalIDType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Submitter" type="SubmitterType" minOccurs="0"/>
        <xs:element name="Note" type="NoteType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Evidence" type="EvidenceType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Enrichment" type="EnrichmentType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Changed" type="ChangedType"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="Id" type="xs:ID" use="required"/>
      <xs:attribute name="Type" type="xs:string" use="required"/>
      <xs:attribute name="VitalType" type="VitalTypeType"/>
    </xs:complexType>
  </xs:element>

Note how many of the fields are the same as those for an IndividualRec. Doing bottom-up data analysis, you would probably come to the conclusion that IndividualRec and EventRec should be defined as extensions of some common abstract type. It wouldn't do any harm to inherit all eight GEDCOM objects from some base type (called GEDCOMObject, say), but I can't say that the stylesheets in this chapter would have benefited from it. Really, an abstract type like this is only useful if there are operations that you want to perform at this level. A practical difficulty is that with XML Schema, types can only be extended by adding fields at the end, whereas here, the shared fields come after the type-specific fields.

Families

The third object type we will look at is the family. Here is the definition:

  <xs:element name="FamilyRec">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="HusbFath" type="ParentType" minOccurs="0"/>
        <xs:element name="WifeMoth" type="ParentType" minOccurs="0"/>
        <xs:element name="Child" type="ChildType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="BasedOn" type="BasedOnType" minOccurs="0"/>
        <xs:element name="ExternalID" type="ExternalIDType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Submitter" type="SubmitterType" minOccurs="0"/>
        <xs:element name="Note" type="NoteType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Evidence" type="EvidenceType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Enrichment" type="EnrichmentType"
                    minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="Changed" type="ChangedType"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="Id" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>

Again, many of the fields are common with the other two object types. The types ParentType and ChildType play a crucial role in linking the data, so we'd better open them up:

  <xs:complexType name="ChildType">
    <xs:sequence>
      <xs:element name="Link" type="LinkType"/>
      <xs:element name="ChildNbr" type="xs:positiveInteger" minOccurs="0"/>
      <xs:element name="RelToFath" type="xs:string" minOccurs="0"/>
      <xs:element name="RelToMoth" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="ParentType">
    <xs:sequence>
      <xs:element name="Link" type="LinkType"/>
      <xs:element name="FamilyNbr" type="xs:positiveInteger" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

A <Child> element (of type ChildType) represents the participation of an individual in a family in the role of child. The <Link> identifies the individual concerned. The <ChildNbr> represents the position of that child in the family (1 for the eldest child, and so on): this allows for the fact that some of the children may be unknown. The <RelToFath> and <RelToMoth> elements allow for detail about the relationship of the child to the father and mother; for example, the child may be the natural child of one parent and the adopted child of the other.

A <HusbFath> or <WifeMoth> element (of type ParentType) represents the participation of an individual in a family in the role of parent. The <FamilyNbr> element provides a sequence number; for example, it allows you to say that this family is the man's second marriage, which is useful if the dates of the marriages are not known.
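To make the role of <ChildNbr> concrete, here is a small Python sketch that lays out the children of one family by birth order and leaves a gap wherever a child is not recorded. The family record, the RelToFath value "adopted", and the function name child_slots are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented family: the first and third children are known, the second is not.
SAMPLE = """\
<FamilyRec Id="F1">
  <HusbFath>
    <Link Target="IndividualRec" Ref="I1"/>
    <FamilyNbr>2</FamilyNbr>
  </HusbFath>
  <WifeMoth>
    <Link Target="IndividualRec" Ref="I2"/>
  </WifeMoth>
  <Child>
    <Link Target="IndividualRec" Ref="I5"/>
    <ChildNbr>3</ChildNbr>
  </Child>
  <Child>
    <Link Target="IndividualRec" Ref="I4"/>
    <ChildNbr>1</ChildNbr>
    <RelToFath>adopted</RelToFath>
  </Child>
</FamilyRec>
"""


def child_slots(family):
    """List child ids by birth order; None marks a position with no recorded child."""
    numbered = {}
    for child in family.findall("Child"):
        nbr = child.findtext("ChildNbr")
        if nbr is not None:  # children without a ChildNbr cannot be placed
            numbered[int(nbr)] = child.find("Link").get("Ref")
    if not numbered:
        return []
    return [numbered.get(n) for n in range(1, max(numbered) + 1)]


family = ET.fromstring(SAMPLE)
print(child_slots(family))
```

Note that the order of the <Child> elements in the document is not relied upon; the position comes entirely from <ChildNbr>.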

Now let's look quickly at the three most common (and difficult) data types used for properties of these objects: dates, places, and personal names.

Dates

As we've seen, GEDCOM allows any character string to be used as a date. However, much of the presentation of data depends on analyzing dates wherever possible. How is this dilemma resolved?

The DateType referenced from the Event record is a complex type, defined like this:

  <xs:complexType name="DateType">
    <xs:simpleContent>
      <xs:extension base="GeneralDate">
        <xs:attribute name="Calendar" type="xs:string"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

That is to say, it is a complex type with simple content: the content is a GeneralDate, and the optional attribute indicates which calendar is used. The GeneralDate can be any character string, but certain formats such as «DD MMM YYYY» are recommended.

As far as validation is concerned, there isn't much point in defining a schema data type for the pattern «DD MMM YYYY». However, it turns out that it can be useful to define this type even if it isn't used for validation. We can define the GEDCOM date format as a union type like this:

  <xs:simpleType name="GeneralDate">
    <xs:union memberTypes="StandardDate xs:string"/>
  </xs:simpleType>

  <xs:simpleType name="StandardDate">
    <xs:restriction base="xs:string">
      <xs:pattern value=
        "[0-9]?[0-9]\s(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s[0-9]{4}"/>
    </xs:restriction>
  </xs:simpleType>

This type is meaningless from the point of view of validation: all strings will be considered valid. But the effect is that a date that conforms to the «DD MMM YYYY» pattern will be labeled as a StandardDate, while one that doesn't will be labeled only as an xs:string. This will prove useful when we write our stylesheets, because it becomes very easy to separate standard dates from non-standard dates when we want to perform operations like date formatting and sorting. In fact, I could have usefully split dates into three categories: simple exact dates like «4 MAR 1920»; inexact dates that conform to the GEDCOM syntax, such as «BEF JAN 1866» (meaning some time before January 1866); and arbitrary character strings whose interpretation is left purely to the reader.
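The three-way split can be tried out outside the schema. The Python sketch below mirrors the StandardDate pattern, written with explicit alternation between the month names, and adds a second pattern for qualified dates. It is an approximation under stated assumptions: the qualifier list (ABT, EST, BEF) contains only the prefixes mentioned in this chapter and is not the full GEDCOM grammar, Python's \s is slightly broader than the schema's, and the names classify, to_date, and chronological are hypothetical.

```python
import re
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
MONTH_ALT = "|".join(MONTHS)

# Same shape as the StandardDate pattern; schema patterns match the whole
# value, so fullmatch() is used below rather than search().
STANDARD = re.compile(rf"([0-9]?[0-9])\s({MONTH_ALT})\s([0-9]{{4}})")

# Assumed subset of qualified dates: a prefix, then an optional day and/or month.
QUALIFIED = re.compile(
    rf"(ABT|EST|BEF)\s(?:(?:[0-9]?[0-9]\s)?(?:{MONTH_ALT})\s)?[0-9]{{4}}"
)


def classify(text):
    """Label a date string as 'standard', 'qualified', or 'free text'."""
    if STANDARD.fullmatch(text):
        return "standard"
    if QUALIFIED.fullmatch(text):
        return "qualified"
    return "free text"


def to_date(text):
    """Convert a standard date to datetime.date, or return None."""
    m = STANDARD.fullmatch(text)
    if m is None:
        return None
    try:
        return date(int(m.group(3)), MONTHS.index(m.group(2)) + 1, int(m.group(1)))
    except ValueError:
        # "31 FEB 1900" fits the pattern but is not a real calendar date.
        return None


def chronological(values):
    """Sort standard dates chronologically; everything else keeps its order at the end."""
    return sorted(values, key=lambda v: (to_date(v) is None, to_date(v) or date.min))


for sample in ["18 APR 1924", "BEF JAN 1866", "Septuagesima 1582", "192?"]:
    print(f"{sample!r}: {classify(sample)}")
print(chronological(["Septuagesima 1582", "12 SEP 1953", "4 MAR 1920"]))
```

As with the union type, nothing is rejected here: a string that fits neither pattern is kept verbatim and simply sorts after the dates that can be interpreted.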

Places

Place names have an internal structure, but the structure is highly variable. In many cases components of the place name may be missing, and the part that is missing may be the major part rather than the minor part. For example, you might know that someone was born in Wolverton, England, without knowing which of the three towns of that name it refers to. The GEDCOM schema allows the place name to be entered as unstructured text, but also allows individual components of the name to be marked up using a <PlacePart> element which can carry two attributes: Type, which can take values such as Country, City, or Parish to indicate what kind of place this is, and Level, which is a number that represents the relationship of this part of the place name to the other parts.
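A short Python sketch shows how both forms of place name might be turned into display text. It assumes, based only on the Kennedy marriage example earlier in the chapter (country at Level 1, city at Level 4), that a higher Level means a more specific part; the function name format_place is hypothetical, and mixed content combining loose text with <PlacePart> elements is not handled.

```python
import xml.etree.ElementTree as ET

STRUCTURED = """\
<Place>
  <PlaceName>
    <PlacePart Type="country" Level="1">USA</PlacePart>
    <PlacePart Type="state" Level="2">RI</PlacePart>
    <PlacePart Type="city" Level="4">Newport</PlacePart>
  </PlaceName>
</Place>
"""

UNSTRUCTURED = "<Place><PlaceName>Wolverton, England</PlaceName></Place>"


def format_place(place):
    """Render a Place element as text, most specific part first."""
    name = place.find("PlaceName")
    if name is None:
        return ""
    parts = name.findall("PlacePart")
    if not parts:
        # Unstructured name: return the text with whitespace normalized.
        return " ".join("".join(name.itertext()).split())
    # Assumption: a higher Level is more specific; parts without a Level sort last.
    ordered = sorted(parts, key=lambda p: int(p.get("Level", "0")), reverse=True)
    return ", ".join(p.text.strip() for p in ordered if p.text)


print(format_place(ET.fromstring(STRUCTURED)))
print(format_place(ET.fromstring(UNSTRUCTURED)))
```

Gaps in the Level numbering, such as the missing Level 3 here, need no special treatment because only the relative order of the parts is used.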

Personal Names

As with place names, personal names have a highly variable internal structure. The name can be written simply as a character string (within an <IndivName> element), or the separate parts can be tagged using <NamePart> elements. As with place names, these have a completely open-ended structure. The Type attribute can be used to identify the name part as, for example, a surname or generation suffix, and the Level attribute can be used to indicate its relative importance, for example when used as a key for sorting and indexing.