robert-v-simon changed the title from "potential memory leak in READER node.attributes" to "too large memory footprint in READER node.attributes when millions of child nodes are present" on May 8, 2015
This is because Reader#attr_nodes (and Reader#namespaces) use xmlTextReaderExpand (which reads all child nodes) and then just pull the properties (attributes) off the root node to return the array. It really should be using xmlTextReaderMoveToNextAttribute and constructing its own XML::Attr objects.
The action item here is to explore re-implementing Reader#attribute_hash to use xmlTextReaderMoveToNextAttribute to assemble the attribute hash (and also do this for Reader#namespaces).
Environment:
I've a 5GB XML file with the following structure:
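A hypothetical skeleton, assuming only the element names that appear in this report (the real structure may differ):

```xml
<ROOT>
  <CONTENT_A attr1="..." attr2="...">...</CONTENT_A>
  <CONTENT_B attr1="...">...</CONTENT_B>
  <BATCH_CHANGE attr1="...">
    <!-- millions of child nodes, each carrying attributes -->
    ...
  </BATCH_CHANGE>
</ROOT>
```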
total count of CONTENT_A: 1,261,642
total count of CONTENT_B: 10,707,587
I use the READER to go through those XML files and analyse the data of CONTENT_A and CONTENT_B; for that I also need to consider values stored in node attributes within the CONTENT_A and CONTENT_B nodes.
The following reader code will explode when it reaches the first `<BATCH_CHANGE>` node:
The following reader code works fine for the entire document but doesn't keep the key of the attribute, which would be essential for my analysis:
It appears that node.attributes also looks at subsequent nodes, which causes the memory footprint to grow beyond what Ruby can handle, while node.attribute_count and node.attribute_at() read just the local node's attribute data and therefore behave as expected.
As there is no node.attribute_key_at() available, I currently exclude the troublesome node from the node.attributes lookup, which lets the reader go through the XML file as follows:
As the last code example works, there seems to be a problem with node.attributes when millions of child nodes that also contain attributes are present. Strangely, only node.attributes is affected, while node.attribute_count and node.attribute_at() work fine.