Automating the collection of examples #94
Comments
That sounds brilliant.
I don't think this is quite accurate. What you get from the PMC OA subset is what PMC normalized. Nothing in that subset is as it was delivered by the publisher... nothing. PMC converts every single XML document it receives to comply with PMC style, even those submitted to us in the JATS DTD. If the recommendations are for a part of the document that PMC doesn't need to standardize for archiving or display purposes, then sure, we'll pass it through unchanged and you can monitor the uptake. But so far, I haven't seen much that PMC wouldn't make some effort to standardize in our output, so I think all you'll really be monitoring is the degree of PMC's uptake.
Good points, Laura. What about making more of those normalization steps (and the accompanying tools) public? At least for XML supplied by CC BY publishers, this would seem possible.
If you can get it to run (it's 9 years old), Stefano Mazzocchi's Gadget is a nice tool for analysing elements/attributes and their contents in large quantities of XML (e.g. everything in the PMC OA Subset).
So far, most of the examples we have discussed have been identified manually. I am thinking about a systematic approach to collecting examples for sets of tags that we consider.
One way to go about that would be to mine PMC's OA Subset (which can be downloaded in bulk) for uses of specific tags and to condense that (perhaps along with any manually provided examples from outside PMC's OA Subset) into some basic usage patterns (think tag-level dialects) that we could use as a basis for discussing best practices and distilling recommendations.
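A minimal sketch of what such mining could look like, assuming the bulk download has been unpacked into a local directory of per-article XML files. The function name `tag_usage_patterns` and the choice to describe a "usage pattern" as the parent element plus the set of attribute names are illustrative assumptions, not anything agreed in this thread. Files that the standard-library parser cannot read are simply skipped.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_usage_patterns(xml_paths, tag):
    """Count occurrences of `tag`, grouped by (parent tag, attribute names).

    The grouping key is a hypothetical notion of a tag-level "dialect":
    where the element sits and which attributes it carries.
    A root element has no parent and is therefore never counted.
    """
    patterns = Counter()
    for path in xml_paths:
        try:
            tree = ET.parse(path)
        except ET.ParseError:
            # Skip files the parser rejects (e.g. undefined entities).
            continue
        # Visit every element and inspect its direct children, so that
        # the parent of each match is known without extra bookkeeping.
        for parent in tree.iter():
            for child in parent:
                if child.tag == tag:
                    key = (parent.tag, tuple(sorted(child.attrib)))
                    patterns[key] += 1
    return patterns
```

Called as `tag_usage_patterns(paths, "contrib").most_common(20)`, with `paths` being any iterable of file paths, this would list the most frequent contexts for `<contrib>`. Namespaced attributes such as `xlink:href` show up in the keys in ElementTree's `{namespace-uri}name` form.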
I can think of a number of effects that this may have:
I have started to explore this, but the tools I know are not well suited for this kind of analysis on such a corpus (I am running a `grep` overnight!), so I would welcome your ideas in this regard.