Revision ID wrong #203

Closed
jimbowhales opened this issue Feb 14, 2020 · 2 comments

@jimbowhales

In cases like

<revision>
      <id>73841108</id>
      <parentid>73290874</parentid>
      <timestamp>2019-08-14T00:23:07Z</timestamp>
      <contributor>
        <username>Ttle-recll</username>
        <id>1525028</id>
      </contributor>

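WikiExtractor takes the contributor ID as the revision ID, because the second branch below assigns revid again for every <id> tag that follows the page ID. A likely fix would be to change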

        elif tag == 'id' and not id:
            id = m.group(3)
        elif tag == 'id' and id:
            revid = m.group(3)

to

        elif tag == 'id' and not id:
            id = m.group(3)
        elif tag == 'id' and not revid:
            revid = m.group(3)

but I haven't tested if this fails in some other case.
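To make the difference between the two branches concrete, here is a minimal sketch that replays both variants over the revision quoted above. The regex, the function name `scan_ids`, the surrounding `<page>` wrapper, and the page ID `12345` are all hypothetical stand-ins for illustration; this is not WikiExtractor's actual parsing loop.

```python
import re

# Simplified stand-in for a line-based tag scanner (an assumption,
# not the regex WikiExtractor really uses).
TAG_RE = re.compile(r'<(\w+)>([^<]*)</\1>')

# The revision from the report, wrapped in a made-up page with a made-up page ID.
SAMPLE = """\
<page>
  <title>Example</title>
  <id>12345</id>
  <revision>
    <id>73841108</id>
    <parentid>73290874</parentid>
    <timestamp>2019-08-14T00:23:07Z</timestamp>
    <contributor>
      <username>Ttle-recll</username>
      <id>1525028</id>
    </contributor>
  </revision>
</page>
"""


def scan_ids(text, patched):
    """Return (page_id, revid) using the original or the proposed condition."""
    page_id = None
    revid = None
    for line in text.splitlines():
        m = TAG_RE.search(line)
        if not m or m.group(1) != 'id':
            continue
        value = m.group(2)
        if not page_id:
            # First <id> seen: the page ID.
            page_id = value
        elif patched:
            # Proposed condition: only the first <id> after the page ID counts.
            if not revid:
                revid = value
        else:
            # Original condition: every later <id> overwrites revid.
            revid = value
    return page_id, revid


print(scan_ids(SAMPLE, patched=False))  # ('12345', '1525028')
print(scan_ids(SAMPLE, patched=True))   # ('12345', '73841108')
```

With the original condition the contributor ID wins because it is the last `<id>` in the page; with the proposed one the first `<id>` after the page ID is kept. The sketch handles a single page only, so it says nothing about whether both variables are reset between pages in the real code.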

@HjalmarrSv

Thanks!

Looks like an improvement. It should work nicely as long as the tag order does not change and no new <id> tags are inserted before the revision ID. The downside of a partially implemented XML reader is that the order matters.
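For comparison, a rough sketch of how a full XML parser sidesteps the ordering concern by addressing each ID by its path rather than by its position. It uses the standard library's `xml.etree.ElementTree`; the function name `ids_by_path`, the sample page, and the page ID `12345` are hypothetical, and the sample deliberately puts `<contributor>` before the revision `<id>`. Real dumps declare an XML namespace on the root element, which this sketch assumes away.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with the contributor block moved ahead of the revision ID.
REORDERED = """\
<page>
  <title>Example</title>
  <id>12345</id>
  <revision>
    <contributor>
      <username>Ttle-recll</username>
      <id>1525028</id>
    </contributor>
    <id>73841108</id>
    <parentid>73290874</parentid>
  </revision>
</page>
"""


def ids_by_path(page_xml):
    """Return (page_id, revid, contributor_id), looked up by element path."""
    page = ET.fromstring(page_xml)
    page_id = page.findtext('id')                        # direct child of <page>
    revid = page.findtext('revision/id')                 # direct child of <revision>
    contributor_id = page.findtext('revision/contributor/id')
    return page_id, revid, contributor_id


print(ids_by_path(REORDERED))  # ('12345', '73841108', '1525028')
```

The trade-off is that this parses a whole page into a tree, which may or may not suit a tool that streams very large dumps line by line.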

HjalmarrSv added a commit to HjalmarrSv/wikiextractor that referenced this issue Feb 28, 2020
@attardi
Owner

attardi commented Mar 1, 2020

Accepted.

@attardi attardi closed this as completed Mar 1, 2020