Extremely large documents #85
Heh, yeah. A document-based interface will almost never work for such an XML file. 😇
That's correct. There's a pseudo-streaming interface buried in there, but it's not public and it's not really a great choice to make public. The good news is that I've been slowly working on a ground-up rewrite that I believe will be measurably faster (early benchmarks show that it has the possibility of being on par with libxml2). The bad news is that it's so early that I haven't even done any kind of release for it. If you have a bit of free time, I'd really appreciate it if you could clone it. There's a demo utility that will simply run the parser as fast as it can, counting the number of tokens it sees. I'd love to know what it says for your Very Big File:
Future

Supporting some kind of Serde-like annotations would be very neat. Combined with the concept of streaming, I'd love to have some pseudo-Rust like this:

```rust
#[derive(sxd::Deserialize)]
struct Page {
    #[sxd::attribute]
    id: String,
    content: Vec<sxd::Value>,
}

let mut p = Parser::new();
p.enter_element();
let pg = p.deserialize::<Page>();
```

For some pseudo-XML:

```xml
<wrapper>
  <page id="abc"></page> <!-- repeated -->
</wrapper>
```

Potentially some other things to allow access to the string interning that would exist to reduce the number of copies further.
Cool, yes I'll try to check it out this week and report back my results. I think your future design would work fine in my case:
I think it would be fun (for some definition of "fun" 😇) if we could generate Rust types straight from the XSD:

```xml
<!-- Our root element -->
<element name="mediawiki" type="mw:MediaWikiType">
  ...
</element>

<complexType name="MediaWikiType">
  <sequence>
    <element name="siteinfo" type="mw:SiteInfoType"
             minOccurs="0" maxOccurs="1" />
    <element name="page" type="mw:PageType"
             minOccurs="0" maxOccurs="unbounded" />
    <element name="logitem" type="mw:LogItemType"
             minOccurs="0" maxOccurs="unbounded" />
  </sequence>
  <attribute name="version" type="string" use="required" />
  <attribute ref="xml:lang" use="required" />
</complexType>
```

I assume this means that the root is a `mediawiki` element with an optional `siteinfo`, followed by any number of `page` and `logitem` elements.
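Roughly, I'd picture the generator spitting out something like this (every type name here is made up for illustration; none of it exists in the sxd crates today):

```rust
// Hypothetical output of an XSD-to-Rust generator for the snippet above.
struct MediaWiki {
    /// <attribute name="version" type="string" use="required" />
    version: String,
    /// <attribute ref="xml:lang" use="required" />
    lang: String,
    /// minOccurs="0" maxOccurs="1"
    siteinfo: Option<SiteInfo>,
    /// minOccurs="0" maxOccurs="unbounded"
    pages: Vec<Page>,
    /// minOccurs="0" maxOccurs="unbounded"
    log_items: Vec<LogItem>,
}

struct SiteInfo { /* fields from mw:SiteInfoType */ }
struct Page { /* fields from mw:PageType */ }
struct LogItem { /* fields from mw:LogItemType */ }
```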
If it's intended to be streamed, perhaps iterable elements can be borrowed and only valid during the inner iteration, which could allow the whole stream to be zero-copy by filling the data structure with pointers directly into an underlying buffer.
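To make that concrete, I'm imagining something along these lines (purely illustrative; not an API either crate offers):

```rust
// Illustrative only: each Page borrows string slices that point straight into
// the parser's internal buffer, so nothing is copied out of it.
struct Page<'buf> {
    id: &'buf str,
    content: Vec<&'buf str>,
}

struct PageStream { /* parser state + input buffer */ }

impl PageStream {
    // The returned Page borrows from `self`, so it is only usable until the
    // next call to `next_page` overwrites the buffer it points into.
    fn next_page(&mut self) -> Option<Page<'_>> {
        // ...tokenize the next <page> element out of the current buffer...
        None
    }
}
```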
Yep, completely agree. Amusingly, it shouldn't be terrible to implement the first few passes of that. Basically you'd parse the XML for the XSD and then generate some Rust code based on that. I'm sure there are gotchas (circular data structures are the first that come to mind).
This is how the internals of the new parser are implemented, but it's not really possible for Rust to express this in a generic manner right now (that requires generic associated types).

One interesting thing is that the command I suggested you run uses two fixed buffers of 16 MiB each (input, output). There are some environment variables you can set to adjust those. I'd expect that you could set them down to 1 KiB without much performance impact, and even down to 16 bytes and still be functional.

Another problem comes about exactly in your case: if your buffer is N bytes and you want to look at something that is N+1 bytes, there's no way to do it. The revised parser approaches this by yielding values in pieces, so a single logical value can span multiple buffer refills.

We use the string interning to handle things like ensuring close tags match open tags and that attributes aren't repeated.
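To spell out the shape that needs generic associated types, and one possible way to represent values that outgrow the buffer (both are sketches of the idea, not the parser's actual types):

```rust
// The "lending iterator" shape: Item borrows from the iterator itself, so each
// item is only valid until the next call to `next`. Expressing this generically
// is what requires generic associated types.
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;

    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// One way a value larger than the buffer could be surfaced: as a run of
// Partial pieces terminated by a Complete piece.
enum Streaming<T> {
    Partial(T),
    Complete(T),
}
```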
Running under WSL with
It read from the SSD drive at about 100 MB/s.
If you're generating the specific Rust code for an XSD, do you really need GATs? Or string interning for tags and attributes?
I'm a bit surprised, and I'm of mixed emotions here. 100 MB/s is pretty reasonable to me; how do you feel about that rough speed? For the use case of parsing XML from over the network, that should be fast enough here in 2021. It's not fully saturating the IO of a local disk, however, so there might be some tweaks to improve that speed for cases like yours.
Perhaps. It's all about what level the generated code operates at and what it has to reimplement. For example, the validation layer (e.g. opening and closing element names match, no duplicate attribute names, etc.) uses string interning. The generated code could reimplement that at the cost of... reimplementing it. I haven't done deep thinking on this :-)
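As a rough illustration of why interning helps there (a toy sketch, not the parser's internals): once names are interned, the checks become integer comparisons and set lookups instead of repeated string comparisons.

```rust
use std::collections::{HashMap, HashSet};

// Toy interner: each distinct name gets a small integer id.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u32>,
}

impl Interner {
    fn intern(&mut self, name: &str) -> u32 {
        let next = self.ids.len() as u32;
        *self.ids.entry(name.to_owned()).or_insert(next)
    }
}

// "Does the close tag match the open tag?" becomes an integer comparison,
// and "is this attribute repeated?" becomes a set insert of an integer id.
fn close_matches_open(i: &mut Interner, open: &str, close: &str) -> bool {
    i.intern(open) == i.intern(close)
}

fn attribute_is_duplicate(seen: &mut HashSet<u32>, attr_id: u32) -> bool {
    !seen.insert(attr_id)
}
```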
Good observation. Perhaps due to blocking on IO? I'll try with some different buffer sizes; 16 MB is pretty big, but maybe the issue is waiting for the IO to go through. Perhaps there's an opportunity to queue up the next buffer read while we're consuming the current one. Another complication is reading through the WSL syscall emulation, so I'll also try running directly on Windows. Unfortunately I don't have a Linux setup on this system.
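One cheap way to try the overlapping-IO idea (just a sketch, not something the demo tool does): have a background thread fill buffers and hand them over a bounded channel, so the next read is in flight while the current buffer is being parsed.

```rust
use std::fs::File;
use std::io::Read;
use std::sync::mpsc;
use std::thread;

// Reader thread fills 16 MiB buffers and sends them over a bounded channel;
// the consumer parses one buffer while the next read is already underway.
fn read_in_background(mut file: File) -> mpsc::Receiver<Vec<u8>> {
    let (tx, rx) = mpsc::sync_channel(2); // at most 2 buffers in flight
    thread::spawn(move || loop {
        let mut buf = vec![0u8; 16 * 1024 * 1024];
        match file.read(&mut buf) {
            Ok(0) | Err(_) => break, // EOF or read error: stop producing
            Ok(n) => {
                buf.truncate(n);
                if tx.send(buf).is_err() {
                    break; // consumer went away
                }
            }
        }
    });
    rx
}
```

The parsing loop would then just iterate over the receiver while the next read proceeds in the background.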
A friend mentioned:
I'm a macOS user, and my WSL knowledge is light at best. Sounds like straddling the boundary might cause some degradation though. I'm not sure of the
That did seem to help; it read at about 150 MB/s this time.
Either way is fast enough for me, but what I'm looking for is a way to run XPath queries on it.
Just in case you are unaware, there is another Rust streaming XML parser named xml-rs. It reads at 3 MB/s though xD
Hi! I recently got into my head an idea to play around with offline copies of Wikipedia. The Wikimedia foundation very helpfully provides downloadable dumps of the full content of Wikipedia. The dump itself is a 20 GB file that unzips to one .xml file that measures 78 GB in size. You read that right: 1 XML file, 84,602,863,258 bytes. You can probably see where this is going...
Alas, I am many gigabytes short of fitting this entire mountain of a document in memory at once, let alone twice plus overhead (as a string and parsed). If I have any hope of consuming this thing with precision (as opposed to regex, shudder), I believe a streaming parser and query engine will be necessary; however, I did not see a streaming interface in the `sxd_document::parser` docs or in `sxd_xpath`. Is that a correct assessment? Have you considered building a streaming interface to handle such cases? (In my experience, opting for a streaming solution can lead to the fastest implementation even when memory pressure is not a concern; such a use case may be valuable just for the speed.) Thoughts?