Importing XML

Functions for Importing XML

Import

You can import XML data into the Wolfram Language using the standard Import function, which has the following syntax.

Import[file]            import format determined by file extension
Import[file,format]     import from a specific format

Importing files.

The first argument specifies the file to be imported. You can also specify an optional second argument to control the form of the output. For importing XML data, the relevant file formats are "XML", "ExpressionML", and "MathML".

With "XML" as the import format, all XML formats, including ExpressionML and MathML, are returned as symbolic XML expressions.

With "ExpressionML" format, ExpressionML is returned as the corresponding cell expression.

With "MathML" format, MathML is returned as the corresponding typeset box expression.

A simple MathML equation.
Importing this file returns the equation as a box expression.
With "XML" format, the equation is imported as a symbolic XML expression.

If Import is used with only one argument, the Wolfram Language processes the data in the file based on its file extension. Any file with a .xml extension is imported as XML. For ExpressionML or MathML, formats supported by the Wolfram Language, the file will be interpreted in the appropriate way. All other XML formats are imported as symbolic XML.

Import a file with the .mml extension.
Display the box expression as conventional mathematical notation using DisplayForm.
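
The file-based imports described above can be sketched as follows. This is a minimal illustration, not the original example: the scratch file name equation.mml, its location in $TemporaryDirectory, and the MathML content are all assumptions made for the sketch.

```wolfram
(* Hypothetical scratch file; the name and location are illustrative only *)
file = FileNameJoin[{$TemporaryDirectory, "equation.mml"}];

(* Write a small MathML document to disk as plain text *)
Export[file,
  "<math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow></math>",
  "Text"];

(* One argument: the format is chosen from the .mml extension *)
boxes = Import[file];

(* Explicit "XML" format: the same file returned as symbolic XML *)
symbolic = Import[file, "XML"];

(* Render the imported box expression as typeset mathematics *)
DisplayForm[boxes]
```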

Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to Import.

ImportString

Use the standard ImportString function to import XML data from a string.

ImportString[string,format]     import from a string using a specific format

Importing strings.

For importing XML data, the relevant formats are "XML", "ExpressionML", and "MathML".

With "XML" as the import format, all XML formats, including ExpressionML and MathML, are returned as symbolic XML expressions.

With "ExpressionML" format, ExpressionML is returned as the corresponding cell expression.

With "MathML" format, MathML is returned as the corresponding typeset box expression.

A simple XML expression converted to symbolic XML using ImportString.
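
A conversion of this kind might look like the following sketch. The XML string is a made-up example, and the pattern assumes the usual XMLElement[tag, attributes, children] layout of symbolic XML.

```wolfram
(* A small, hypothetical XML string *)
xml = "<a><b>text</b></a>";

(* Convert the string to symbolic XML *)
symbolic = ImportString[xml, "XML"]

(* Pull the character data out of every <b> element by pattern matching *)
Cases[symbolic, XMLElement["b", _, {s_String}] :> s, Infinity]
```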

Import a simple MathML expression. The MathML markup is automatically converted to a Wolfram Language box expression.
Using "XML" format will prevent the MathML markup from being interpreted.
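
The two behaviors can be compared side by side. The MathML string below is an illustrative assumption, not the original example.

```wolfram
(* Hypothetical MathML markup for x squared *)
mathml = "<math xmlns='http://www.w3.org/1998/Math/MathML'><msup><mi>x</mi><mn>2</mn></msup></math>";

(* "MathML" format: the markup is interpreted and returned as a box expression *)
ImportString[mathml, "MathML"]

(* "XML" format: the markup is left uninterpreted and returned as symbolic XML *)
ImportString[mathml, "XML"]
```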

Control the details of the import process, such as how to treat whitespace, whether to recognize entities, or whether to validate against a DTD, by specifying options to ImportString.

XMLGet

The XMLGet function can be used to import an XML document as symbolic XML. XMLGet[file] is equivalent to Import[file,"XML"].

XMLGet exists only in the XML`Parser` context. You must use the full name of the function, XML`Parser`XMLGet, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.
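
Both ways of calling the function are sketched below. The scratch file sample.xml and its content are assumptions for illustration; evaluate the inputs in order so that the context path is extended before the short name is read.

```wolfram
(* Hypothetical scratch file used only for this sketch *)
file = FileNameJoin[{$TemporaryDirectory, "sample.xml"}];
Export[file, "<a><b/></a>", "Text"];

(* Call the function by its full name *)
XML`Parser`XMLGet[file]

(* Add the context to the context path, if it is not already there *)
If[! MemberQ[$ContextPath, "XML`Parser`"], AppendTo[$ContextPath, "XML`Parser`"]];

(* The short name now resolves to XML`Parser`XMLGet *)
XMLGet[file]
```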

The advantage of using XMLGet is that it accepts a pre-initialized parser object as its second argument.

XMLGet[file,xmlParserObject]     import using a pre-initialized parser

Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.

You can also specify options for XMLGet. The options for XMLGet are the same as the ones for Import. However, the syntax is slightly different. The option can be specified directly in the XMLGet function, such that

XMLGet[file,option1->value1,option2->value2,...]

is equivalent to

Import[file,"XML",option1->value1,option2->value2,...].

XMLGetString

The XMLGetString function can be used to import an XML string as symbolic XML. XMLGetString[string] is equivalent to ImportString[string,"XML"].

XMLGetString exists only in the XML`Parser` context. Use the full name of the function, XML`Parser`XMLGetString, when doing an evaluation. To use the function without the context name prefix, add the XML`Parser` context to your context path.

The advantage of using XMLGetString is that it accepts a pre-initialized parser object as its second argument.

XMLGetString[string,xmlParserObject]     import from a string using a pre-initialized parser

Initializing the parser involves loading a DTD into memory either from a URL or a local file. This only needs to be done once in each kernel session. Subsequent references to the DTD are processed much faster. For more information on initializing the parser, see InitializeXMLParser.

Pre-initialize the parser, XHTMLParser, according to the XHTML DTD located at the specified URI.
Import an XML string. The string is validated with respect to the DTD stored in XHTMLParser by setting "ValidateAgainstDTD"->True. Valid->True in the output indicates that the input string was valid XML with respect to the XHTML DTD.

You can also specify options for XMLGetString. The options for XMLGetString are the same as those for ImportString. However, the syntax is slightly different. The options can be specified directly in the XMLGetString function, such that

XMLGetString[string,option1->value1,option2->value2,...]

is equivalent to

ImportString[string,"XML",option1->value1,option2->value2,...].
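
The stated equivalence can be checked with a short sketch. The input string and the choice of option are assumptions for illustration; if the two forms are indeed equivalent, the final comparison should give True.

```wolfram
(* Hypothetical input string with extra whitespace *)
str = "<a>  padded   text  </a>";

(* Option given directly to XMLGetString *)
viaParser = XML`Parser`XMLGetString[str, "NormalizeWhitespace" -> False];

(* The same option given to ImportString after the "XML" format *)
viaImport = ImportString[str, "XML", "NormalizeWhitespace" -> False];

(* Compare the two symbolic XML expressions *)
viaParser === viaImport
```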

Entities and Validation

An XML document can contain any characters included in the Unicode character set.

When importing an XML document into the Wolfram Language, all numeric Unicode character entity references are automatically resolved into the corresponding Wolfram Language character.
Entities that are not built into XML are resolved according to the rules present in the DTD.

Import can also validate the XML data to ensure that it conforms to a content model defined by a DTD. If the document is well formed, a symbolic XML expression will be returned. If the document is not valid, warning messages will be issued and the document wrapper will indicate the invalid nature of the document with the option Valid->False.

You can control how entities are treated and whether the document is validated by using the options for Import.

Import Options

Introduction

The standard options of Import give you more control over the import process. The syntax for specifying an option is

Import[file,option->value].

The following options are available for importing XML data:

"NormalizeWhitespace"

This option controls how whitespace is processed. Whitespace is defined as a space, tab, or newline character.

option                   value       effect
"NormalizeWhitespace"    True        all the whitespace inside an element is normalized (default)
                         False       all the whitespace in the original XML document is preserved
                         Automatic   ignorable whitespace is removed and non-ignorable whitespace is preserved

Values for "NormalizeWhitespace".

Normalizing whitespace means that all leading and trailing whitespace is stripped and any interior whitespace is reduced to a single whitespace character. "NormalizeWhitespace"->True is the default setting for this option.

Whitespace is ignorable when it occurs in places where character data is not permitted according to the content model specified by the DTD. The primary use of ignorable whitespace is to add indentation for formatting purposes.

Whitespace handling with the default setting "NormalizeWhitespace"->True.
"NormalizeWhitespace"->False preserves the whitespace as it appears in the original string.

If "NormalizeWhitespace"->False is specified, pattern matching on the resulting symbolic XML expression may become problematic because of the intervening whitespace.
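
The effect on pattern matching can be seen in a small sketch. The XML string is a made-up example; the pattern looks for the normalized text "some text" inside an <a> element.

```wolfram
(* Hypothetical string with leading, trailing, and repeated interior whitespace *)
str = "<a>   some    text   </a>";

(* Default setting: whitespace inside the element is normalized *)
normalized = ImportString[str, "XML"];

(* Whitespace preserved exactly as it appears in the string *)
preserved = ImportString[str, "XML", "NormalizeWhitespace" -> False];

(* A pattern written against the normalized text *)
pattern = XMLElement["a", _, {"some text"}];

(* Count matches in each result; the preserved form keeps the extra spaces *)
Count[normalized, pattern, Infinity]
Count[preserved, pattern, Infinity]
```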

"AllowRemoteDTDAccess"

This option controls whether the parser may access the network in order to retrieve DTDs.

option                    value   effect
"AllowRemoteDTDAccess"    True    the parser will automatically access the network to retrieve DTDs
                          False   remote DTDs will not be retrieved, but local DTDs can still be used

Values for "AllowRemoteDTDAccess".

If "AllowRemoteDTDAccess"->False and the document refers to a remote DTD, the parse will fail and an error message will be generated, unless the option "ReadDTD" is also set to False.

"AllowUnrecognizedEntities"

This option determines what the parser will do if undefined entity references are encountered in the XML document.

option                         value       effect
"AllowUnrecognizedEntities"    True        any undefined entities are wrapped in special entity delimiter characters, and no error messages are reported
                               False       an error message is reported and the parse fails
                               Automatic   an error message is reported for any unrecognized entity, and the entity is wrapped in special entity delimiter characters (default)

Values for "AllowUnrecognizedEntities".

This contains an undefined entity called "dogs". If "AllowUnrecognizedEntities" is False, then an error message is reported and the parse fails.
With the default setting Automatic, an error message is reported, and the entity is wrapped in special entity delimiter characters. This does not interrupt the importing and parsing of the XML data.
With "AllowUnrecognizedEntities"->True, any undefined entities are wrapped in special entity delimiter characters and no error messages are reported.
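
The three settings can be tried on one string, as in this sketch. The surrounding markup is an assumption; only the undefined entity name dogs comes from the text above.

```wolfram
(* Hypothetical string that refers to the undefined entity &dogs; *)
str = "<a>cats and &dogs;</a>";

(* False: an error message is reported and the parse fails *)
ImportString[str, "XML", "AllowUnrecognizedEntities" -> False]

(* Default (Automatic): a message is reported, and the entity is kept in wrapped form *)
ImportString[str, "XML"]

(* True: the entity is kept in wrapped form and no message is reported *)
ImportString[str, "XML", "AllowUnrecognizedEntities" -> True]
```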

"ReadDTD"

This option determines whether an external DTD subset is read or not. The most important uses of a DTD are to define a content model for validation and to define character entities.

option       value   effect
"ReadDTD"    True    external DTDs are read (default)
             False   external DTDs are ignored

Values for "ReadDTD".

Since reading the DTD can directly affect the contents of the document, "ReadDTD"->True is the default setting. Setting "ReadDTD"->False can improve efficiency, but make this change only if you are certain that no information is required from the DTD.

Setting "ReadDTD"->False is the only way to prevent the parser from attempting to read the DTD. "AllowRemoteDTDAccess"->False will prevent network access and "ValidateAgainstDTD"->False will prevent validation from happening, but neither will prevent an error caused by the parser failing to read the DTD.
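
As a sketch of this setting, the following imports a document whose DOCTYPE declaration points at a remote DTD without reading that DTD. The XHTML content is an illustrative assumption.

```wolfram
(* Hypothetical XHTML document whose DOCTYPE refers to a remote DTD *)
doc = "<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd'>
<html xmlns='http://www.w3.org/1999/xhtml'>
<head><title>title</title></head>
<body><p>text</p></body>
</html>";

(* The external DTD is ignored, so the parser does not try to retrieve it *)
ImportString[doc, "XML", "ReadDTD" -> False]
```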

"ReadDTD" is ignored if you are using a pre-initialized parser. For more information on pre-initialized parsers, see InitializeXMLParser.

"ValidateAgainstDTD"

This option determines whether the XML document is validated or not.

option                  value       effect
"ValidateAgainstDTD"    True        a validation attempt will be made on import even if there is no DOCTYPE declaration in the XML document
                        False       no validation attempt will be made on import
                        Automatic   a validation attempt will be made on import only if there is a DOCTYPE declaration in the XML document (default)

Values for "ValidateAgainstDTD".

If the document is valid, the parser will set the XMLObject["Document"] option "Valid"->True. If the document is invalid, the parser will generate validity error messages and will set "Valid"->False.

Parse a document that is not valid by setting "ValidateAgainstDTD" to True. The parser generates error messages.
If the document is valid, then no messages are generated and "Valid"->True is included in the output.
Parsing with "ValidateAgainstDTD" set to False generates no error messages, nor does it add a "Valid" option to XMLObject["Document"].
With "ValidateAgainstDTD" set to True, validation is attempted even if there is no DOCTYPE declaration.
For validation only when there is a DOCTYPE declaration, use "ValidateAgainstDTD"->Automatic. When no DTD is specified, the parser does not attempt to validate the XML string.
Here the parser tries to validate the input string because a DTD is specified explicitly.

Even when using a pre-initialized parser, "ValidateAgainstDTD"->Automatic will not validate unless there is a DOCTYPE declaration in the document.
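
The validation settings can be explored with documents that carry their own internal DTD, as in this sketch. The element names and the DTD are made up for illustration and are not the original examples.

```wolfram
(* Hypothetical internal DTD: <a> must contain exactly one empty <b> *)
dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";

valid = dtd <> "<a><b/></a>";          (* conforms to the content model *)
invalid = dtd <> "<a><b/><b/></a>";    (* violates it: two <b> children *)
noDoctype = "<a><b/></a>";             (* no DOCTYPE declaration at all *)

(* Validation requested explicitly for a conforming document *)
ImportString[valid, "XML", "ValidateAgainstDTD" -> True]

(* Validation requested explicitly for a nonconforming document *)
ImportString[invalid, "XML", "ValidateAgainstDTD" -> True]

(* No validation attempt is made *)
ImportString[invalid, "XML", "ValidateAgainstDTD" -> False]

(* Automatic: no DOCTYPE declaration, so no validation attempt is made *)
ImportString[noDoctype, "XML", "ValidateAgainstDTD" -> Automatic]
```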

"IncludeDefaultedAttributes"

This option determines whether attributes that are specified by the DTD as default attributes are included in the symbolic XML expression. "IncludeDefaultedAttributes"->False is the default setting because the default values for attributes are known to application developers and it is unnecessary to include the values in the symbolic XML expression. Setting "IncludeDefaultedAttributes"->True will include the values.

option                          value   effect
"IncludeDefaultedAttributes"    True    default attributes in the DTD are included in the symbolic XML expression
                                False   default attributes are not included (default)

Values for "IncludeDefaultedAttributes".

Assign a variable to represent the XML fragment.
Convert the XML fragment into symbolic XML.
To include default attributes in the imported symbolic XML, set "IncludeDefaultedAttributes" to True.
Including default attributes in the expression is not the same as validation; thus, they can be included even with "ValidateAgainstDTD"->False.
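
The steps above might look like the following sketch. The fragment, including the attribute name kind and its default value, is an assumption made for illustration.

```wolfram
(* Hypothetical fragment: the internal DTD gives attribute kind the default value plain *)
fragment = "<!DOCTYPE a [<!ELEMENT a EMPTY><!ATTLIST a kind CDATA 'plain'>]><a/>";

(* Default setting: defaulted attributes are not included *)
ImportString[fragment, "XML"]

(* Include attributes that take their value from the DTD default *)
ImportString[fragment, "XML", "IncludeDefaultedAttributes" -> True]

(* The same request with validation switched off *)
ImportString[fragment, "XML",
  "IncludeDefaultedAttributes" -> True, "ValidateAgainstDTD" -> False]
```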

"IncludeEmbeddedObjects"

This option determines the treatment of comments and processing instructions that occur inside the document tree.

option                      value                    effect
"IncludeEmbeddedObjects"    All                      all the embedded objects will be included in the document tree
                            None                     no embedded objects are included (default)
                            Comments                 only embedded comments are included
                            ProcessingInstructions   only embedded processing instructions are included

Values for "IncludeEmbeddedObjects".

Set a variable to represent a simple XML fragment to facilitate further examples.
"IncludeEmbeddedObjects"->All includes all the embedded objects in the document tree.
The default setting of "IncludeEmbeddedObjects" is None since comments and processing instructions are not intended to affect applications using the XML document. Including them may hamper pattern matching.
Using the "ProcessingInstructions" or "Comments" settings will include only the embedded processing instructions or comments, respectively. Setting "IncludeEmbeddedObjects" to {"Comments","ProcessingInstructions"} includes a list of the embedded comments and processing instructions.
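
A sketch of these settings follows. The fragment is a made-up example, and the string form "Comments" is taken from the paragraph above.

```wolfram
(* Hypothetical fragment with one comment and one processing instruction *)
fragment = "<a><!-- a comment --><?target instruction?><b/></a>";

(* Include every embedded object in the document tree *)
ImportString[fragment, "XML", "IncludeEmbeddedObjects" -> All]

(* Default (None): comments and processing instructions are dropped *)
ImportString[fragment, "XML"]

(* Keep only the embedded comments *)
ImportString[fragment, "XML", "IncludeEmbeddedObjects" -> "Comments"]
```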

"IncludeNamespaces"

This option determines how namespaces are handled.

option                 value       effect
"IncludeNamespaces"    True        specify the explicit namespace for each element and attribute
                       False       no namespace information is reported
                       Automatic   the namespace is determined by scoping (default)
                       Unparsed    used for handling documents that use namespaces in a nonstandard way

Values for "IncludeNamespaces".

Set a variable to represent a simple XML fragment with namespaces.

<root xmlns="http://mynamespace.com"
xmlns:same="http://mynamespace.com"
xmlns:foo="http://anothernamespace.com">
<child attr1="a" same:attr2="b" foo:attr3="c"/>
<foo:child/>
<same:child/>
</root>

True

"IncludeNamespaces"->True reports the namespace information for each element and attribute via a list, {namespace,localname}. This form is more verbose, but more faithful to the data model of the XML document. This form may also be easier to use for pattern matching.

False

"IncludeNamespaces"->False only reports the local name of each element or attribute. This setting makes the symbolic XML expression easier to read, but restricts its use to applications with only a single namespace. The names of all the child elements appear to be identical when parsed this way, so this option value cannot be trusted whenever multiple namespaces are used.

Automatic

With the default value "IncludeNamespaces"->Automatic, the namespace is determined by means of scoping. If the namespace of an element is the same as the default namespace, then the name is represented as a single string for the local name. If the namespace of an element is different, then the name is represented by a list with the structure {namespace,localname}.

For example, the only element whose name is represented by a two-string list is the one in namespace http://anothernamespace.com. The other elements are implicitly contained in the http://mynamespace.com namespace. Attributes are not compacted since, according to the W3C specification, the attributes and the elements have different namespace scoping.

Unparsed

Some documents use names in a non-namespace-compliant fashion, because the XML namespace recommendation, which extends XML, was made after the initial XML recommendation. "IncludeNamespaces"->"Unparsed" is provided to allow parsing of these documents. The name is always represented as the exact single string that appears in the XML file. Unless absolutely necessary, this option value should not be used.
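
The four settings can be compared on the fragment shown earlier in this section, as in the sketch below. The variable names are assumptions, and the attribute values are written with single quotes so that the fragment fits inside a Wolfram Language string.

```wolfram
(* The namespace fragment from this section, stored in a variable *)
fragment = "<root xmlns='http://mynamespace.com'
  xmlns:same='http://mynamespace.com'
  xmlns:foo='http://anothernamespace.com'>
  <child attr1='a' same:attr2='b' foo:attr3='c'/>
  <foo:child/>
  <same:child/>
</root>";

(* Explicit {namespace, localname} pairs for every element and attribute *)
ImportString[fragment, "XML", "IncludeNamespaces" -> True]

(* Local names only *)
ImportString[fragment, "XML", "IncludeNamespaces" -> False]

(* Default (Automatic): namespaces are determined by scoping *)
scoped = ImportString[fragment, "XML"];

(* List the elements whose names are still given as {namespace, localname} *)
Cases[scoped, XMLElement[{ns_String, name_String}, _, _] :> {ns, name}, Infinity]

(* Names kept exactly as written in the source, prefixes included *)
ImportString[fragment, "XML", "IncludeNamespaces" -> "Unparsed"]
```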

"PreserveCDATASections"

This option controls whether the distinction between CDATA sections and regular character data is maintained on import. CDATA sections are meant as a convenience for document authors; for most applications they should not be treated differently from ordinary data. Preserving CDATA sections can make pattern matching difficult, so the default setting is False.

option                     value   effect
"PreserveCDATASections"    True    information about CDATA sections is preserved
                           False   information about CDATA sections is removed (default)

Values for "PreserveCDATASections".

Here is an example of the default behavior of "PreserveCDATASections".
To preserve CDATA sections, specify "PreserveCDATASections"->True.
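
Both settings can be tried on a single fragment, as in this sketch. The fragment is a made-up example.

```wolfram
(* Hypothetical fragment whose character data sits inside a CDATA section *)
fragment = "<a><![CDATA[1 < 2 & 3 > 2]]></a>";

(* Default (False): the CDATA section is treated as ordinary character data *)
ImportString[fragment, "XML"]

(* True: the result records that the text came from a CDATA section *)
ImportString[fragment, "XML", "PreserveCDATASections" -> True]
```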