Efficient algorithm for comparing XML nodes


Question

I want to determine whether two different child nodes within an XML document are equal. Two nodes should be considered equal if they have the same set of attributes and child nodes and all child nodes are equal, too (i.e. the whole subtree should be equal).

The input document might be very large (up to 60 MB, more than 100,000 nodes to compare), so performance is an issue.

What would be an efficient way to check for the equality of two nodes?

Example:

<w:p>
  <w:pPr>
    <w:spacing w:after="120"/>
  </w:pPr>
  <w:r>
    <w:t>Hello</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:spacing w:after="240"/>
  </w:pPr>
  <w:r>
    <w:t>World</w:t>
  </w:r>
</w:p>

This XML snippet describes paragraphs in an OpenXML document. The algorithm would be used to determine whether a document contains a paragraph (w:p node) with the same properties (w:pPr node) as another paragraph earlier in the document.

One idea I have would be to store the nodes' outer XML in a hash set. (Normally I would first have to build a canonical string representation in which attributes and child nodes are always sorted the same way, but I can expect my nodes to be in such a form already.)
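A minimal sketch of this first idea, assuming the nodes are already loaded as XmlNode objects and that equal subtrees serialize to identical strings, as stated above. The class and method names are made up for illustration, and a Dictionary stands in for a hash set because .NET 2.0 has no HashSet<T>:

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class OuterXmlDuplicates
{
    // Returns every node whose OuterXml matches that of an earlier node.
    // Only valid if equal subtrees always serialize to the same string
    // (same attribute order, prefixes, and whitespace).
    public static List<XmlNode> FindDuplicates(XmlNodeList nodes)
    {
        Dictionary<string, XmlNode> seen = new Dictionary<string, XmlNode>();
        List<XmlNode> duplicates = new List<XmlNode>();
        foreach (XmlNode node in nodes)
        {
            string key = node.OuterXml;
            if (seen.ContainsKey(key))
            {
                duplicates.Add(node);   // same serialized form seen before
            }
            else
            {
                seen.Add(key, node);    // first occurrence of this form
            }
        }
        return duplicates;
    }
}
```

The string keys make the comparison itself trivial, at the cost of serializing every candidate subtree and keeping the resulting strings in memory.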

Another idea would be to create an XmlNode object for each node and write a comparer which compares all attributes and child nodes.
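A rough sketch of what such a comparer might look like on .NET 2.0. The class name XmlNodeDeepComparer is hypothetical, and the equality rules are assumptions made for illustration: names are compared by local name and namespace URI, attribute order is ignored, child order matters, and namespace declarations count as ordinary attributes.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical comparer; the rules below are one possible definition of equality.
public sealed class XmlNodeDeepComparer : IEqualityComparer<XmlNode>
{
    public bool Equals(XmlNode x, XmlNode y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        if (x.NodeType != y.NodeType
            || x.LocalName != y.LocalName
            || x.NamespaceURI != y.NamespaceURI)
        {
            return false;
        }

        // Non-element nodes (text, CDATA, comments, ...) compare by value.
        if (x.NodeType != XmlNodeType.Element) return x.Value == y.Value;

        // Attributes: same count, each matched by local name and namespace.
        if (x.Attributes.Count != y.Attributes.Count) return false;
        foreach (XmlAttribute a in x.Attributes)
        {
            XmlAttribute b = y.Attributes[a.LocalName, a.NamespaceURI];
            if (b == null || a.Value != b.Value) return false;
        }

        // Children: pairwise equal, in document order.
        XmlNode cx = x.FirstChild;
        XmlNode cy = y.FirstChild;
        while (cx != null && cy != null)
        {
            if (!Equals(cx, cy)) return false;
            cx = cx.NextSibling;
            cy = cy.NextSibling;
        }
        return cx == null && cy == null;
    }

    public int GetHashCode(XmlNode node)
    {
        if (node == null) return 0;
        unchecked
        {
            int hash = node.LocalName.GetHashCode() * 31 + node.NamespaceURI.GetHashCode();
            if (node.NodeType != XmlNodeType.Element)
            {
                return hash * 31 + (node.Value ?? "").GetHashCode();
            }
            // Addition keeps the attribute part independent of attribute order.
            foreach (XmlAttribute a in node.Attributes)
            {
                hash += a.LocalName.GetHashCode() ^ a.Value.GetHashCode();
            }
            // Children are folded in document order.
            for (XmlNode c = node.FirstChild; c != null; c = c.NextSibling)
            {
                hash = hash * 31 + GetHashCode(c);
            }
            return hash;
        }
    }
}
```

An instance can be passed to a Dictionary<XmlNode, ...> constructor so that lookups use the hash first and fall back to the full comparison only for nodes that land in the same bucket.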

My environment is C# (.NET 2.0); any feedback and further ideas are very welcome. Maybe somebody already has a good solution?

EDIT: Microsoft's XmlDiff API can actually do that, but I was wondering whether there is a more lightweight approach. XmlDiff seems to always produce a diffgram and to always build a canonical node representation first, neither of which I need.

EDIT2: I finally implemented my own XmlNodeEqualityComparer based on the suggestion made here. Thanks a lot!!!!

Thanks, divo

7/31/2013 9:02:02 PM

Accepted Answer

I'd recommend against rolling your own hash function and relying instead on the GetHashCode method of the built-in XNodeEqualityComparer (part of LINQ to XML, so it requires .NET 3.5 or later). It takes attributes and descendant nodes into account when computing the result, and could save you some time too.

Your code would look like the following:

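using System.Collections.Generic;
using System.Xml.Linq;

XNodeEqualityComparer comparer = new XNodeEqualityComparer();
XDocument doc = XDocument.Load("XmlFile1.xml");
Dictionary<int, XNode> nodeDictionary = new Dictionary<int, XNode>();

foreach (XNode node in doc.Elements("doc").Elements("node"))
{
    int hash = comparer.GetHashCode(node);
    XNode candidate;
    if (!nodeDictionary.TryGetValue(hash, out candidate))
    {
        // First node with this hash: remember it.
        nodeDictionary.Add(hash, node);
    }
    else if (comparer.Equals(candidate, node))
    {
        // Equal hashes only suggest equality; Equals has confirmed it.
        // A duplicate has been found. Execute your logic here
        // ...
    }
    // Otherwise two different nodes share a hash (a collision);
    // this node is not recorded.
}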

My XmlFile1.xml is:

<?xml version="1.0" encoding="utf-8" ?>
<doc>
  <node att="A">Blah</node>
  <node att="A">Blah</node>
  <node att="B">
    <inner>Innertext</inner>
  </node>
  <node>Blah</node>
  <node att="B">
    <inner>Different</inner>
  </node>
</doc>

nodeDictionary ends up holding one node per distinct hash, computed with XNodeEqualityComparer's GetHashCode method. A hash that is already present in the dictionary marks a candidate duplicate. Because two different nodes can share a hash code, the candidate should be confirmed with the comparer's Equals method before it is treated as a duplicate.

I think this should be fast enough for your needs.
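Since equal hash codes do not guarantee equal nodes, one alternative is to hand the comparer to a HashSet<XNode> and let the set do both the hashing and the equality check. This is only a sketch: the class and method names are invented for illustration, and both HashSet<T> and XNodeEqualityComparer need .NET 3.5 or later.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical helper, not part of LINQ to XML.
public static class XNodeDuplicates
{
    // Returns every element that is deep-equal to an earlier element.
    public static List<XElement> FindDuplicates(IEnumerable<XElement> elements)
    {
        // The set calls comparer.GetHashCode first and comparer.Equals
        // only for nodes that fall into the same bucket.
        HashSet<XNode> seen = new HashSet<XNode>(new XNodeEqualityComparer());
        List<XElement> duplicates = new List<XElement>();
        foreach (XElement element in elements)
        {
            if (!seen.Add(element))
            {
                duplicates.Add(element);
            }
        }
        return duplicates;
    }
}
```

For the sample file this could be called as XNodeDuplicates.FindDuplicates(doc.Elements("doc").Elements("node")).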

12/5/2008 2:00:47 PM

It is very challenging even to define the problem correctly: when are two XML documents equal?

There are many reasons for this:

  1. An XML document is a tree that may have different textual representations.
  2. Whitespace-only nodes may or may not be considered in a comparison
  3. Comment nodes may or may not be considered in a comparison
  4. PI nodes may or may not be considered in a comparison
  5. Lexical differences, e.g. an empty element written as a self-closing tag or as a start tag followed by an end tag
  6. Different prefixes may be associated with the same namespace in the two documents
  7. A namespace node may be shown as defined on a node of doc1 and as not defined but inherited from the parent of the corresponding node in doc2
  8. Quotes may be used around an attribute in doc1 but apostrophes may be used in doc2
  9. Entities may be used in doc1 but they may be pre-expanded in doc2
  10. The two documents may have different but semantically equivalent DTDs
  11. Etc.

Therefore it seems naive and unrealistic to try to produce a correct implementation of a function for equality comparison of two XML documents.

My recommendation is to use the deep-equal() function with a compliant XPath 2.0 engine.
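.NET ships no XPath 2.0 engine, so the following sketch does not call deep-equal(). As a stand-in it uses XNode.DeepEquals from LINQ to XML (.NET 3.5 or later), a different function with its own rules, only to illustrate points 2 and 8 from the list: the answer depends on how the documents are loaded. The class and method names are made up for illustration.

```csharp
using System.Xml.Linq;

// Hypothetical demo class.
public static class EqualityChoices
{
    // Quote style is a purely lexical difference and disappears on parsing.
    public static bool QuotesMatter()
    {
        XElement a = XElement.Parse("<p a=\"1\"/>");
        XElement b = XElement.Parse("<p a='1'/>");
        return !XNode.DeepEquals(a, b);
    }

    // Whitespace-only text nodes are dropped with LoadOptions.None
    // and kept with LoadOptions.PreserveWhitespace.
    public static bool WhitespaceMatters(LoadOptions options)
    {
        XElement compact = XElement.Parse("<p><r/></p>", options);
        XElement indented = XElement.Parse("<p>\n  <r/>\n</p>", options);
        return !XNode.DeepEquals(compact, indented);
    }
}
```

The same pair of documents compares as equal or unequal depending on the load option, which is why the comparison rules need to be pinned down before any implementation is chosen.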


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow