robert-v-simon changed the title from "potential memory leak in READER node.attributes" to "too large memory footprint in READER node.attributes when millions of child nodes are present" on May 8, 2015
This is because Reader#attr_nodes (and Reader#namespaces) use xmlTextReaderExpand (which reads all child nodes) and then just pull the properties (attributes) off the root node to return the array. It really should be using xmlTextReaderMoveToNextAttribute and constructing its own XML::Attr objects.
The action item here is to explore re-implementing Reader#attribute_hash to use xmlTextReaderMoveToNextAttribute to assemble the attribute hash (and also do this for Reader#namespaces).
Environment:
I've a 5GB XML file with the following structure:
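A hypothetical skeleton, assuming only the element names that appear in this report (the real structure may differ):

```xml
<ROOT>
  <CONTENT_A attr1="..." attr2="...">...</CONTENT_A>
  <CONTENT_B attr1="...">...</CONTENT_B>
  <BATCH_CHANGE attr1="...">
    <!-- millions of child nodes, each carrying attributes -->
    ...
  </BATCH_CHANGE>
</ROOT>
```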
total count of CONTENT_A: 1,261,642
total count of CONTENT_B: 10,707,587
I use the READER to go through those XML files and analyse the data of CONTENT_A and CONTENT_B; for that I also need to consider values stored in node attributes within the CONTENT_A and CONTENT_B nodes.
The following reader code will explode when it reaches the first `<BATCH_CHANGE>` node:
The following reader code works fine for the entire document but doesn't keep the key of the attribute, which would be essential for my analysis:
It appears that node.attributes also looks at subsequent nodes, which causes the memory footprint to grow beyond what Ruby can handle, while node.attribute_count and node.attribute_at() read just the local node's attribute data and therefore behave as expected.
As there is no node.attribute_key_at() available, I currently exclude the troublesome node from the node.attributes lookup, which lets the reader go through the XML file as follows:
As the last code example works, there seems to be a problem with node.attributes when millions of child nodes that also contain attributes are present. Strangely, only node.attributes is affected, while node.attribute_count and node.attribute_at() work fine.