[Plone5.2-rc2/Python3.6/c.solr8.0.0a1] Parsing error xmlSAX2Characters: huge text node #239
I was able to solve that problem using the
@NicolasGoeddel thanks for reporting this and providing a fix. This is highly appreciated. I'd be more than happy to review and merge a PR if you would care to open one. :)
I will take a look into how PRs work. I have never done one. It seems like I have to fork first, make a branch, and such things.
@NicolasGoeddel awesome! Yes, you can fork the repo and then do a pull request, or check out the repository from the collective. For the latter option, I would have to add you to the Plone collective. I'd be more than happy to do so if you are ok with it.
There is a problem with parsing huge XML output from Solr's extraction handler.
When I want to index a PDF file with nearly 3000 pages of text, Solr extracts that text and returns an XML response that is handled by collective.solr.indexer.BinaryAdder. The problem here is etree.parse(response), which does not work with big text nodes. I guess it needs to be changed to etree.iterparse(), but that is a bigger change.
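A minimal sketch of the kind of fix this would involve, assuming the response is parsed with lxml (the function names and the `response` argument are illustrative, not collective.solr's actual code): lxml's parser accepts a `huge_tree` flag that lifts libxml2's limit on single text nodes, and `iterparse` accepts the same flag for streaming.

```python
from lxml import etree

def parse_extract_response(response):
    # huge_tree=True lifts libxml2's limit on single text nodes, which
    # is what raises "xmlSAX2Characters: huge text node" here.
    parser = etree.XMLParser(huge_tree=True)
    return etree.parse(response, parser)

def iter_extract_response(response):
    # Alternative: stream the reply instead of building the whole tree
    # at once; lxml's iterparse accepts huge_tree as well.
    for _event, element in etree.iterparse(response, huge_tree=True):
        yield element
        element.clear()  # free each element once it has been handled
```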
It would be nicer if collective.solr extracted and indexed a binary object in one single step. I don't know if this is possible with Solr's API. At the moment collective.solr extracts all the text of a binary blob using Solr, saves that text into a dictionary, and sends it back to Solr for indexing. That does not look very efficient in my opinion. Maybe you know of a simple change to do both things together without that step in between.
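For reference, Solr's ExtractingRequestHandler can index the extracted content directly when extractOnly is not set, which would avoid the round-trip described above. A rough sketch of what a single-step call could look like; the URL, core name, document id, and field names are assumptions, not collective.solr's actual configuration:

```python
import requests

# Hypothetical single-step call: post the binary to /update/extract and
# let Solr run Tika and index the result in one request. Everything
# below (URL, id, field names) is illustrative.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/plone/update/extract"

with open("big-document.pdf", "rb") as pdf:
    reply = requests.post(
        SOLR_EXTRACT_URL,
        params={
            "literal.id": "doc-1",            # literal.* sets regular field values
            "fmap.content": "SearchableText", # map Tika's content field to the schema
            "uprefix": "ignored_",            # prefix unknown Tika fields
            "commit": "true",
        },
        files={"file": pdf},
    )
reply.raise_for_status()
```

Whether this fits collective.solr depends on whether all the other field values of an indexed object can be passed as literal.* parameters alongside the blob.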
For your information, this is the whole warning: