Replace Wikiextractor? #13

Open
TheOriginalSoni opened this issue Feb 17, 2024 · 2 comments

Comments

@TheOriginalSoni
Contributor

Wikiextractor seems to have bugs, and it will limit us to Python 3.10 or earlier when building the index.

Can we either fix the bug and use the patched version of Wikiextractor, or replace it? Alternatively, we could look at better-maintained codebases and see whether they offer better support.

@TheOriginalSoni
Contributor Author

import mwparserfromhell

# Sample wikitext with section headings, wikilinks, and a template.
mediawiki_text = """
== Section 1 ==
This is some [[content]] in [[link|section 1]].

== {{Section 2}} ==
This is some content in section 2.
"""

# strip_code() removes the markup and keeps the visible text; templates are dropped.
ans = mwparserfromhell.parse(mediawiki_text).strip_code().strip()
#'Section 1 \nThis is some content in section 1.\n\n  \nThis is some content in section 2.'

The code above uses mwparserfromhell, a library that is still under active development.
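A rough sketch of how a cleaner like this might slot into index building, assuming the input is a MediaWiki XML export. The function name `iter_clean_pages`, the `clean` parameter, and the `SAMPLE` document are hypothetical and exist only for illustration; the dump is read with the standard library, so nothing here depends on Wikiextractor.

```python
import io
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_clean_pages(source, clean=lambda text: text):
    """Yield (title, cleaned_text) for each <page> in a MediaWiki XML export.

    `source` is a filename or binary file object. `clean` is any callable
    that turns wikitext into plain text (hypothetical hook, identity by default).
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = _local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and text is not None:
                yield title, clean(text)
            title, text = None, None
            elem.clear()  # release the finished page's subtree


# Tiny made-up export, only for demonstration.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>== Heading ==\nSome [[link|text]].</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, page_text in iter_clean_pages(io.BytesIO(SAMPLE)):
        print(page_title, repr(page_text))
```

To strip markup with mwparserfromhell, one could pass clean=lambda t: mwparserfromhell.parse(t).strip_code().strip(), which is the same call as in the snippet above. Keeping the cleaner as a parameter means the parsing library can be swapped without touching the dump-reading loop.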

@TheOriginalSoni
Contributor Author

Pywikibot, from the Wikimedia Foundation, might also be a solid alternative.
