XML applications should be designed around elements and attributes. You can use a schema or a DTD to constrain which elements and attributes are allowed where and what their legal content is. You may also impose additional constraints that affect the content of the document but cannot be expressed in a schema. For example, you might require that the ID attribute of an Employee element must be the actual ID of a current or past employee. All of these are constraints on the content and structure of the document. They generally reflect the semantics of a particular application domain.
Despite some early hype about search engines that understood web pages because you used a Shoe tag instead of an LI tag (some of which I was guilty of myself, I freely admit), XML is not a semantic language. It is a syntactic language. Semantics are properly defined by the individual applications built out of XML rather than by XML itself. With the almost negligible exception of xml:lang, there is nothing semantic in XML. XML is only syntax.
It is your role as a developer to define the semantics that are appropriate for your application using the underlying XML syntax. However, it is not your role as a developer to change, add to, or restrict XML's underlying syntax. Doing so destroys XML's value proposition of a compatible, interoperable data format. Once you have decided to use XML, you have committed to supporting all of XML: tags, PCDATA, attributes, CDATA sections, document type declarations, comments, processing instructions, entity references, character references, and so on. You do not have the right to throw away any of this. Your application must handle all of it. Fortunately, this is not hard to do because the XML parser handles all this for you. Changing the definition of XML actually requires a lot more work because you can't rely on standard parsers. You have to write your own.
It takes no more effort on your part to allow CDATA sections, comments, processing instructions, and so on in your documents than it does to forbid them. If you don't care about these, you can freely ignore them when processing a document that contains them. You just shouldn't say that they are disallowed.
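As a minimal sketch of this point, the following Python snippet uses the standard library's xml.etree.ElementTree to read three documents that carry the same information in different syntax. The Employee and Name elements, the display-hint processing instruction, and the read_employee helper are made up for illustration. The application code never mentions CDATA sections, comments, processing instructions, or character references, yet it accepts all of them.

```python
import xml.etree.ElementTree as ET

# Three hypothetical documents carrying the same information.
PLAIN = '<Employee ID="E42"><Name>Smith &amp; Sons</Name></Employee>'

DECORATED = (
    '<?display-hint compact?>'
    '<!-- exported by a hypothetical tool -->'
    '<Employee ID="E42">'
    '<Name><![CDATA[Smith & Sons]]></Name>'
    '<!-- a comment the application does not care about -->'
    '</Employee>'
)

CHAR_REF = '<Employee ID="E42"><Name>Smith &#38; Sons</Name></Employee>'


def read_employee(xml_text):
    """Return (ID, name) from the parsed tree, ignoring how it was encoded."""
    root = ET.fromstring(xml_text)
    return root.get("ID"), root.findtext("Name")


# The parser absorbs the syntactic differences; each result is the same pair.
results = [read_employee(doc) for doc in (PLAIN, DECORATED, CHAR_REF)]
```

Because the default tree builder simply drops the comments and the processing instruction, ignoring them costs the application nothing.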
The classic example of what not to do is SOAP. SOAP explicitly requires that documents contain neither a document type declaration nor any processing instructions. There are a number of problems with these requirements, most notably:
Perhaps worst of all is the pollution of the XML environment such subsetting engenders. Because SOAP is so completely broken with respect to normal XML processing, vendors are pushing special-purpose parsers that process SOAP documents but not all well-formed XML 1.0 documents. Furthermore, as I write this, some members of the SOAP community are lobbying the W3C to bless their subset and not require XML parsers to support the pieces of XML syntax they disapprove of. This is an interoperability disaster.
There are reasons the SOAP specification chose to forbid processing instructions and document type declarations. Forbidding document type declarations means that all content is present in a single document. This eliminates the possibility of external entities that launch multiple connections to remote servers or even enable a denial-of-service attack. Forbidding processing instructions helps eliminate covert channels. It means that all information must be passed through the SOAP vocabulary that all processors should understand. However, forbidding these constructs also eliminates many important uses. There are other ways to mitigate these problems that don't require limiting the syntax of XML.
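One possible mitigation, offered here only as an illustration and not as something the SOAP specification or this text prescribes, is to keep accepting document type declarations but configure the parser so that it never fetches external general entities. The sketch below assumes Python's standard xml.sax module, where this behavior is controlled by the feature_external_ges feature. The note element, the entity names, the example.invalid URL, and the text_without_external_fetch helper are all hypothetical.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical document: a DOCTYPE with one internal and one external entity.
DOCUMENT = b"""<!DOCTYPE note [
  <!ENTITY local "internal text">
  <!ENTITY remote SYSTEM "http://example.invalid/payload.xml">
]>
<note>&local;|&remote;</note>"""


class TextCollector(ContentHandler):
    """Collects the character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def text_without_external_fetch(xml_bytes):
    """Parse a document, DOCTYPE included, without loading external entities."""
    parser = xml.sax.make_parser()
    # Assumption: turning this feature off is sufficient for the threat at hand.
    parser.setFeature(feature_external_ges, False)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.chunks)


text = text_without_external_fetch(DOCUMENT)
```

The document type declaration is still legal input and the internal entity is still expanded. Only the remote fetch is declined, which is a parser configuration decision and not a change to the syntax the application accepts.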
A better approach, in my opinion, is taken by XML-RPC, which neither requires nor forbids document type declarations and processing instructions. It is (perhaps unintentionally) agnostic about these constructs. You can use them if you wish, but generic XML-RPC servers and clients will mostly ignore them. The parser may read the document type declaration and use it to resolve external entity references and supply default attribute values, but this is completely transparent to the program receiving data from the parser, as it should be.
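The transparency described here can be sketched with Python's standard xml.etree.ElementTree. The Employee and Department elements, the status attribute, the dept entity, and the summarize helper are hypothetical, and the example uses only an internal DTD subset. One document relies on the document type declaration for an entity and a default attribute value; the other spells everything out. The program receiving data from the parser sees the same thing in both cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose internal DTD subset declares an entity
# and a default value for the status attribute.
WITH_DTD = """<!DOCTYPE Employee [
  <!ENTITY dept "Research">
  <!ATTLIST Employee status CDATA "active">
]>
<Employee ID="E42"><Department>&dept;</Department></Employee>"""

# The same information written out with no document type declaration.
WITHOUT_DTD = (
    '<Employee ID="E42" status="active">'
    '<Department>Research</Department>'
    '</Employee>'
)


def summarize(xml_text):
    """Return the root element's attributes and the Department text."""
    root = ET.fromstring(xml_text)
    return dict(root.attrib), root.findtext("Department")


# The parser resolves the entity and supplies the default attribute,
# so the application cannot tell the two encodings apart.
same = summarize(WITH_DTD) == summarize(WITHOUT_DTD)
```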
Restricting the syntax an application is willing to parse confuses the role of parser and client application. The parser is responsible for working with tags, entity references, CDATA sections, document type declarations, and so forth and translating all of this into labeled structures for the client program. The client program should only operate on the output of the parser. It should not require the parser to do anything other than what it would normally do. Restricting the syntax an application is willing to accept (as opposed to the structure it will accept) prevents it from using general-purpose tools and makes your job as a developer much harder for no good reason. Properly designed XML applications neither notice nor care how the information is syntactically encoded because the parser handles all that work for them. Why make your job harder? Allow all legal XML syntax in your XML applications.