.net - C# - XmlDocument vs XDocument behaviour on invalid hexadecimal character

We are receiving xml documents that we process using XDocument, but which contain hex entity expressions. The documents have likely been created by a process invoking XmlDocument. The XDocument.Parse() method rejects these documents.

XmlDocument seems to handle these without complaining - generating xml documents and encoding control code chars it is given, and parsing them from xml documents.

XDocument throws an exception in both cases.

Example:

static void Main(string[] args)
{
    string message = "Hello, \x1EWorld!"; // string with control code 1E encoded.

    // This block completes - create an xml document incorporating the message string
    XmlDocument xmlDoc = new XmlDocument();
    XmlElement root = xmlDoc.CreateElement("greeting");
    xmlDoc.AppendChild(root);
    root.InnerText = message;
    Console.WriteLine(xmlDoc.OuterXml);

    // Outputs: <greeting>Hello, &#x1E;World!</greeting>

    // This block fails - XDocument creation of document containing control-code character x1E
    try
    {
        XDocument xdoc = new XDocument(
            new XElement("greeting", message)
        );
        Console.WriteLine(xdoc.ToString());
    }
    catch (Exception ex)
    {
        Console.WriteLine($"XDocument creation error: {ex}");
    }

    // This block completes - XmlDocument load document containing an &#x1E; entity expression
    string xmlWithEscapedHexEntity = xmlDoc.OuterXml; // <greeting>Hello, &#x1E;World!</greeting>
    xmlDoc = new XmlDocument();
    xmlDoc.LoadXml(xmlWithEscapedHexEntity);
    Console.WriteLine(xmlDoc.OuterXml);

    // This block fails - XDocument parse document containing an &#x1E; entity expression
    try
    {
        XDocument xDoc = XDocument.Parse(xmlWithEscapedHexEntity);
        Console.WriteLine(xDoc.ToString());
    }
    catch (Exception ex)
    {
        Console.WriteLine($"XDocument parse failure: {ex}");
    }

    Console.ReadLine();
}

Is there a way that we can have XDocument ignore any hex entity codes, or replace with single space chars? Otherwise, we will have to pre-process the documents with a RegEx to replace such expressions with spaces.

Why does this difference exist?


Asked Feb 7 at 14:32 by Neil Moss; edited Feb 7 at 16:02 by dbc.

"XmlDocument seems to handle these without complaining - generating xml documents and encoding control code chars it is given" - well, you say "generating xml documents", but really "generating files that look like XML, but aren't actually valid XML". It sounds like XDocument is actually complying with the spec here... en.wikipedia.org/wiki/Valid_characters_in_XML – Jon Skeet Commented Feb 7 at 15:19
I concur with @JonSkeet. Additionally, the most reliable format for data feeds is XML enforced by an XSD. An XSD plays the role of a data contract between sender and receiver. It will guarantee proper data format/shape, data types, cardinality, encoding, and enforce data quality. – Yitzhak Khabinsky Commented Feb 7 at 15:40
@jonskeet Is this a defect in XmlDocument? It seems odd that it doesn't adhere to the spec, and that it behaves differently to XDocument. A quick check confirms it's the same behaviour in .NET Framework 4.7.2 and .NET 8. – Neil Moss Commented Feb 7 at 15:49
@NeilMoss: Yes, I'd consider it a defect in XmlDocument. – Jon Skeet Commented Feb 7 at 16:29
Two additional remarks to what has been said: (a) &#x1E; is allowed in XML 1.1 but not in XML 1.0. However, as far as I'm aware, Microsoft has never implemented XML 1.1 in any of its products. (b) You are showing us an XmlDocument (essentially, a DOM) being constructed programmatically, not being parsed from source XML. Generally, when you construct a DOM programmatically, you get much less validation than when you parse from source XML. – Michael Kay Commented Feb 7 at 17:11


1 Answer


The difference here is that LINQ to XML strictly enforces the Character Range constraint of the Extensible Markup Language (XML) 1.0 (Fourth Edition), while the older XmlDocument apparently does not:

Character Range

[2]       Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

As you can see, #x1E is not in the above range, meaning an XML document that contains this character is not, strictly speaking, well-formed according to the version of the XML standard supported by .NET [1].
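To make the Char production concrete, here is a minimal sketch that tests individual code points against it. The class and method names (XmlCharCheck, IsValidXml10CodePoint, Demo) are hypothetical and only for illustration; XmlConvert.IsXmlChar is the framework's own check for a single UTF-16 code unit and is printed alongside for comparison.

```csharp
using System;
using System.Xml;

public static class XmlCharCheck
{
    // Hypothetical helper: true if the code point matches production [2] Char of XML 1.0.
    public static bool IsValidXml10CodePoint(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);

    // Prints the verdict of the manual range test and of XmlConvert.IsXmlChar
    // for a few sample code points, including the #x1E from the question.
    public static void Demo()
    {
        foreach (int cp in new[] { 0x9, 0x1E, 0x20, 0xFFFE })
        {
            bool manual = IsValidXml10CodePoint(cp);
            bool framework = XmlConvert.IsXmlChar((char)cp);
            Console.WriteLine($"U+{cp:X4}: manual={manual}, XmlConvert.IsXmlChar={framework}");
        }
    }
}
```

Calling XmlCharCheck.Demo() lists each sample code point with both verdicts. Note that XmlConvert.IsXmlChar takes a single char, so code points above #xFFFF have to be checked as surrogate pairs with XmlConvert.IsXmlSurrogatePair instead.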

If you don't want this, you could create versions of XDocument.Parse() and XDocument.ToString() that set XmlReaderSettings.CheckCharacters and XmlWriterSettings.CheckCharacters to false:

using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XDocumentExtensions
{
    static readonly XmlReaderSettings noCheckedCharacterParseSettings = new() { CheckCharacters = false, };
    static readonly XmlReaderSettings checkedCharacterParseSettings = new() { CheckCharacters = true, };

    public static XDocument Parse(string xml, bool checkCharacters) =>
        Parse(xml, checkCharacters ? checkedCharacterParseSettings : noCheckedCharacterParseSettings);
    
    public static XDocument Parse(string xml, XmlReaderSettings settings)
    {
        using var reader = new StringReader(xml);
        using var xmlReader = XmlReader.Create(reader, settings);
        return XDocument.Load(xmlReader);
    }

    static readonly XmlWriterSettings noCheckedCharacterToStringSettings = new() { CheckCharacters = false, Indent = true, OmitXmlDeclaration = true, };
    static readonly XmlWriterSettings checkedCharacterToStringSettings = new() { CheckCharacters = true, Indent = true, OmitXmlDeclaration = true, };
    
    public static string ToString(this XNode node, bool checkCharacters) =>
        node.ToString(checkCharacters ? checkedCharacterToStringSettings : noCheckedCharacterToStringSettings);
    
    public static string ToString(this XNode node, XmlWriterSettings settings)
    {
        using var writer = new StringWriter();
        using (var xmlWriter = XmlWriter.Create(writer, settings))
        {
            node.WriteTo(xmlWriter);
        }
        return writer.ToString();
    }
}

Then modify your code as follows:

XDocument xdoc = new XDocument(
    new XElement("greeting", message)
);
Console.WriteLine(xdoc.ToString(checkCharacters : false));

And

XDocument xDoc = XDocumentExtensions.Parse(xmlWithEscapedHexEntity, checkCharacters : false);
Console.WriteLine(xDoc.ToString(checkCharacters : false));

And you will be able to parse and format XML that is malformed purely due to containing invalid XML characters without any exceptions being thrown. Demo fiddle here.

That being said, I don't really recommend doing this, as the XML you generate will not be accepted by any receiving system that requires strict conformance with the XML standard. If you would prefer to remove invalid characters from your XML text, see:

How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?
Remove all hexadecimal characters before loading string into XML Document Object?
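The regex pre-processing mentioned in the question could look roughly like the sketch below. It is not taken from either linked question: the class name InvalidCharRefScrubber and the method ReplaceInvalidCharRefs are hypothetical, and it only handles numeric character references (such as &#x1E; or &#30;), not invalid characters embedded directly in the text. It also assumes the input has no CDATA sections or comments containing text that merely looks like a character reference, since a regex cannot tell those apart.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvalidCharRefScrubber
{
    // Matches hexadecimal (&#x1E;) and decimal (&#30;) numeric character references.
    static readonly Regex NumericCharRef =
        new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));", RegexOptions.Compiled);

    // Replaces every reference whose code point falls outside the XML 1.0 Char
    // production with the given replacement text; valid references are left untouched.
    public static string ReplaceInvalidCharRefs(string xml, string replacement = " ")
    {
        if (xml == null) throw new ArgumentNullException(nameof(xml));

        return NumericCharRef.Replace(xml, m =>
        {
            bool isHex = m.Groups[1].Success;
            string digits = isHex ? m.Groups[1].Value : m.Groups[2].Value;
            var style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;

            // Values too large to parse are treated as invalid as well.
            bool parsed = int.TryParse(digits, style, CultureInfo.InvariantCulture, out int cp);
            return parsed && IsValidXml10CodePoint(cp) ? m.Value : replacement;
        });
    }

    // Production [2] Char of XML 1.0.
    static bool IsValidXml10CodePoint(int cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

With this in place, XDocument.Parse(InvalidCharRefScrubber.ReplaceInvalidCharRefs(xmlWithEscapedHexEntity)) should no longer hit the character check for the example document, because the offending reference has been replaced with a space before the parser sees it.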

Now, as for why this difference exists? Hard to say for sure, but a check of the reference source shows that XmlDocument uses XmlTextReader to parse its XML. Both these types are very old (dating to .NET 1.1), and XmlTextReader was deprecated in .NET 2.0:

Starting with the .NET Framework 2.0, we recommend that you use the XmlReader class instead.

My guess is that Microsoft simply didn't implement character range checking in the initial .NET 1.1 XmlTextReader and XmlTextWriter implementations, then later did so when they introduced XmlReader and XmlWriter in .NET 2, and LINQ to XML in .NET 3.5. And, while a fair amount of guidance about early .NET versions was lost when MSDN links were retired, I did find the MSDN page Creating XML Readers, archived in 2013 but no longer actively available, that alludes to incomplete conformance checking by XmlTextReader:

By using the Create method and the XmlReaderSettings class you get the following benefits:

Take full advantage of all the new features added to the XmlReader class in the .NET Framework 2.0 release. There are certain features, such as better conformance checking and compliance to the XML 1.0 recommendation, that are available only on XmlReader objects created by the Create method.

So it may be that Microsoft guidance stating that character conformance checking was not fully implemented in .NET 1.1 existed 10 or 15 years ago, but has since been lost.

Update

If you need a Parse() function that strips invalid XML characters whether embedded directly in the XML text or hex-encoded as character entities, you could use the following extension methods:

using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XDocumentExtensions
{
    static readonly XmlReaderSettings noCheckedCharacterParseSettings = new() { CheckCharacters = false, };
    
    public static XDocument ParseAndRemoveInvalidXmlCharacters(string xml, char? fallback = null)
    {
        ArgumentNullException.ThrowIfNull(xml);

        // From testing it seems that CheckCharacters=false only allows invalid character entities whose value falls outside the range from the standard
        // https://www.w3.org/TR/2006/REC-xml-20060816/#NT-Char
        //    [2]       Char       ::=      #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]  /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
        // Invalid characters directly embedded in the character stream must be stripped out manually.
        using var reader = new StringReader(xml.RemoveInvalidXmlCharacters(fallback));
        using var xmlReader = XmlReader.Create(reader, noCheckedCharacterParseSettings);
        return XDocument.Load(xmlReader).RemoveInvalidXmlCharacters(fallback);
    }
    
    public static TXObject RemoveInvalidXmlCharacters<TXObject>(this TXObject node, char? fallback = null) where TXObject : XObject
    {
        switch (node)
        {
            case XText text:
                text.Value = text.Value.RemoveInvalidXmlCharacters(fallback);   
                break;
            case XAttribute attribute:
                attribute.Value = attribute.Value.RemoveInvalidXmlCharacters(fallback);
                break;
            // Pass the fallback down so that nested nodes get the same replacement character.
            case XComment comment:
                comment.Value = comment.Value.RemoveInvalidXmlCharacters(fallback);
                break;
            case XDocument doc:
                doc.Root?.RemoveInvalidXmlCharacters(fallback);
                break;
            case XElement element:
                foreach (var attr in element.Attributes())
                    attr.RemoveInvalidXmlCharacters(fallback);
                foreach (var child in element.Nodes())
                    child.RemoveInvalidXmlCharacters(fallback);
                break;
            case XContainer container: // Any other container type
                foreach (var child in container.Nodes())
                    child.RemoveInvalidXmlCharacters(fallback);
                break;
            // Not done: XDocumentType, XProcessingInstruction
        }
        return node;
    }
    
    public static string RemoveInvalidXmlCharacters(this string xmlText, char? fallback = null)
    {
        ArgumentNullException.ThrowIfNull(xmlText);

        StringBuilder? sb = null;

        for (int i = 0; i < xmlText.Length; i++)
        {
            if (XmlConvert.IsXmlChar(xmlText[i]))
            {
                if (sb != null)
                    sb.Append(xmlText[i]);
            }
            else if (i < xmlText.Length - 1 && XmlConvert.IsXmlSurrogatePair(xmlText[i+1], xmlText[i])) // Yes this order is correct.
            {
                if (sb != null)
                    sb.Append(xmlText, i, 2);
                i++;
            }
            else
            {
                if (sb == null)
                {
                    sb = new();
                    sb.Append(xmlText, 0, i);
                }
                if (fallback != null)
                    sb.Append(fallback.Value);
            }
        }
        
        return sb?.ToString() ?? xmlText;
    }
}

And then do:

var xdoc = new XDocument(
    new XElement("greeting", message)
).RemoveInvalidXmlCharacters();

var xDoc = XDocumentExtensions.ParseAndRemoveInvalidXmlCharacters(xmlWithInvalidCharacters);

Demo fiddle #2 here.

[1] While .NET officially supports only the XML 1.0 (4th Edition) standard, the 5th Edition has a similar constraint:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

As noted by Michael Kay in the comments, character references such as &#x1E; are allowed by XML 1.1; however, .NET never implemented support for that XML version.
