We encountered issues very similar to https://bugs.openjdk.java.net/browse/JDK-8207760 in the processing of UTF-16 astral characters (surrogate pairs): when a pair happened to be split across a buffer boundary, it was not resolved to the correct character. This caused exceptions and prevented some documents from being indexed, with error messages like:
ERROR: 'org.xml.sax.SAXException: java.io.IOException: Invalid UTF-16 surrogate detected: d83c ?
java.io.IOException: Invalid UTF-16 surrogate detected: d83c ?'
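
The failure mode can be sketched in plain Java. This is an illustration only, not the actual Xalan or JDK serializer code: the class `SurrogateBoundaryDemo` and both methods are hypothetical names, and the chunked `char[][]` input simply stands in for successive buffer fills. The naive decoder treats each buffer in isolation and rejects a high surrogate that ends a buffer; the second decoder carries the pending high surrogate over to the next buffer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a surrogate pair split across buffers. */
class SurrogateBoundaryDemo {

    /** Decodes each chunk in isolation; a pair split across chunks is rejected. */
    static List<Integer> decodeNaive(char[][] chunks) throws IOException {
        List<Integer> codePoints = new ArrayList<>();
        for (char[] chunk : chunks) {
            for (int i = 0; i < chunk.length; i++) {
                char c = chunk[i];
                if (!Character.isHighSurrogate(c)) {
                    codePoints.add((int) c); // anything else passes through unchanged
                } else if (i + 1 < chunk.length && Character.isLowSurrogate(chunk[i + 1])) {
                    codePoints.add(Character.toCodePoint(c, chunk[++i]));
                } else {
                    // The low surrogate may simply be in the next chunk,
                    // but this decoder never looks there.
                    throw new IOException("Invalid UTF-16 surrogate detected: "
                            + Integer.toHexString(c) + " ?");
                }
            }
        }
        return codePoints;
    }

    /** Remembers a trailing high surrogate and completes it from the next chunk. */
    static List<Integer> decodeCarrying(char[][] chunks) throws IOException {
        List<Integer> codePoints = new ArrayList<>();
        char pendingHigh = 0; // 0 means "no high surrogate waiting"
        for (char[] chunk : chunks) {
            for (char c : chunk) {
                if (pendingHigh != 0) {
                    if (!Character.isLowSurrogate(c)) {
                        throw new IOException("Invalid UTF-16 surrogate detected: "
                                + Integer.toHexString(pendingHigh) + " ?");
                    }
                    codePoints.add(Character.toCodePoint(pendingHigh, c));
                    pendingHigh = 0;
                } else if (Character.isHighSurrogate(c)) {
                    pendingHigh = c;
                } else if (Character.isLowSurrogate(c)) {
                    throw new IOException("Invalid UTF-16 surrogate detected: "
                            + Integer.toHexString(c) + " ?");
                } else {
                    codePoints.add((int) c);
                }
            }
        }
        if (pendingHigh != 0) {
            throw new IOException("Invalid UTF-16 surrogate detected: "
                    + Integer.toHexString(pendingHigh) + " ?");
        }
        return codePoints;
    }

    public static void main(String[] args) throws IOException {
        // U+1F389 is encoded as the pair D83C DF89; here the pair straddles two chunks.
        char[][] chunks = { {'a', '\uD83C'}, {'\uDF89', 'b'} };
        for (int cp : decodeCarrying(chunks)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
        try {
            decodeNaive(chunks);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the same input, `decodeCarrying` yields three code points, while `decodeNaive` fails on `d83c` because the matching low surrogate sits at the start of the next chunk.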
We moved to a fork of a fork of Xalan, and crafted some additional patches:
SHA-512 sums:
470879d3e5397fe716e344d046eafa8022187b3c3d4abe414f2f119bbbf600e78f310a052f3e4813893df7a372a02b6bbb96ad64c836a7014eff4cd951a08436 *fedoragsearch-2.9.1-src.zip
bb8b17fef72b06c0f6f391d95f4e03add6891b8563a1be07f43bae679d2eeb70d0ef750ba6626e43545963c2af44825c23f50a32ac56e33a185bf486bd2cc672 *fedoragsearch-2.9.1.zip
b55af6e9804a335816ede04e9f10e4020d29510a8b5674a092a3444870222c33209c6310d8eb17e325c5935d7d8531beccd2897619dcf48a5d3538f5236eb9e5 *fedoragsearch.war
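
One way to check a downloaded artifact against the sums above, sketched in Java using only the JDK's `MessageDigest`. The class `Sha512Check` and its methods are hypothetical helpers for illustration and are not part of fedoragsearch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: compares a file's SHA-512 digest with an expected hex string. */
class Sha512Check {

    /** Streams the file through SHA-512 and returns the digest as lowercase hex. */
    static String sha512Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-512");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true when the file's digest equals the expected sum, ignoring case. */
    static boolean matches(Path file, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha512Hex(file).equalsIgnoreCase(expectedHex.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java Sha512Check <file> <expected-sha512-hex>");
            System.exit(2);
        }
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(args[0] + ": " + (ok ? "OK" : "FAILED"));
        System.exit(ok ? 0 : 1);
    }
}
```

For example, pass `fedoragsearch.war` as the first argument and the corresponding sum from the list as the second.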