NodesManager::FullBottomUp #4
Comments
No need to apologize. I uploaded this project to git precisely with the hope that it would be useful to others, so I'm glad that you were able to benefit from some of my work. I did not write all, or even most, of this library. That credit goes to a few French computer scientists in Inria about 15 years ago. My contribution to their efforts was to add some (relatively crude) text diffing functions and get the project to compile successfully on a modern C++ toolchain, along with patching a few memory leaks and making general bug fixes. At the time I only superficially dove into the underlying algorithm, though I do have a fair handle on it.

There are two issues here. The first is the code that I introduced that you mentioned in #3. But that is not the only reason it does not detect the id attribute as unique and use that to its advantage. The source code claims to use the DTD to extract unique ids, but this functionality was never actually implemented: the original code from Inria had this functionality commented out. It's partially my fault that I did not take the time to see why it was commented out and whether it could be fixed, but in any case it's not relevant to your use case.

There is no DTD for HTML5 and there never will be. DTDs are insufficiently expressive to be able to validate even the strictest XHTML5 polyglot markup. Some people have gone through the trouble of writing an XML Schema for XHTML5 though I can't comment on how good it is. Briefly skimming the Xerces-C++ documentation and code, it looks like it is possible to extract uniqueness constraint information from an XML Schema, and that could theoretically be used to produce better diffs. For that matter, the structure, even neglecting the uniqueness constraints, would vastly improve the diffing process: In a valid XHTML5 document there can be only one

All that said, I don't know that I will ever have the time to implement such extensive changes to improve the algorithm.
Are you able to compile the source? If not, I could make a build for you that naively assumes that all attributes named |
Thank you very much for the quick reply! I will just have to assume certain cases during my port. If at any time you feel inclined to improve on this library, let me know; I still believe this library is a treasure which should be maintained and updated. You may close this issue. The function itself is logically sound; my problem seems to be with the underlying XML parser or the registerSubTree function, as mentioned in #3.
Is this function https://github.com/fdintino/xydiff/blob/master/src/Diff_NodesManager.cpp#L626
If a match has been found https://github.com/fdintino/xydiff/blob/master/src/Diff_NodesManager.cpp#L652 due to a unique id attribute, the function will attempt to match all parent nodes of that element, up to the root of the documents. This leads to uneven matches, for example:
tested against
The algorithm will correctly match the node entrar, but because the trees are of different sizes, the parent matching will then pair the body element of the first document with the html element of the second document.
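To make the concern concrete, here is a minimal C++ sketch of that kind of ancestor propagation. It is not the xydiff implementation: `Node`, `Tree`, `Matches`, `propagateToAncestors`, and the two small trees are all hypothetical names and data invented for illustration, and the sketch assumes ancestors are paired level by level without any check on element names.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal tree node; not the node type used by xydiff.
struct Node {
    std::string name;
    Node* parent;
    Node(const std::string& n, Node* p) : name(n), parent(p) {}
};

// Owns the nodes so that the raw parent pointers stay valid.
struct Tree {
    std::vector<std::unique_ptr<Node>> nodes;
    Node* add(const std::string& name, Node* parent) {
        nodes.push_back(std::make_unique<Node>(name, parent));
        return nodes.back().get();
    }
};

using Matches = std::map<const Node*, const Node*>;

// Starting from an already matched pair, walk both parent chains in
// lockstep and pair the ancestors level by level. Assumption: no name
// check is performed, which is what lets mismatched elements pair up.
void propagateToAncestors(const Node* a, const Node* b, Matches& matches) {
    while (a != nullptr && b != nullptr) {
        matches[a] = b;
        a = a->parent;
        b = b->parent;
    }
}

// Hypothetical example: the matched <a> element sits one level deeper
// in the first document than in the second.
std::string partnerOfBodyInExample() {
    Tree first, second;
    Node* html1 = first.add("html", nullptr);
    Node* body1 = first.add("body", html1);
    Node* div1  = first.add("div", body1);
    Node* a1    = first.add("a", div1);

    Node* html2 = second.add("html", nullptr);
    Node* body2 = second.add("body", html2);
    Node* a2    = second.add("a", body2);

    Matches matches;
    propagateToAncestors(a1, a2, matches);
    // Name of the node in the second tree that body of the first tree
    // ended up paired with.
    return matches.at(body1)->name;
}
```

In this made-up example the extra `div` shifts every ancestor by one level, so `partnerOfBodyInExample()` returns `"html"`: the `body` of the first tree is paired with the `html` of the second, which is the kind of uneven match described above.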
Because of #3, the current implementation does not detect the id attribute as unique, hence no match will be done here. Does this not lead to an incorrect diff output?
I know this project is old and hard to understand. I'm sorry for the inconvenience, and thank you very much for taking the time to read this.