Dev #296

Open · wants to merge 3 commits into master

207 changes: 40 additions & 167 deletions README.md
@@ -1,196 +1,69 @@
# WikiExtractor
[WikiExtractor.py](http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) is a Python script that extracts and cleans text from a [Wikipedia database backup dump](https://dumps.wikimedia.org/), e.g. https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for English.
## 2.1 wikiextractor

The tool is written in Python and requires Python 3 but no additional library.
**Warning**: problems have been reported on Windows due to poor support for `StringIO` in its Python implementation.
### 2.1.1 Introduction

For further information, see the [Wiki](https://github.com/attardi/wikiextractor/wiki).
Generate training data from the Wikipedia corpus.

# Wikipedia Cirrus Extractor
### 2.1.2 GitHub link

`cirrus-extractor.py` is a version of the script that performs extraction from a Wikipedia Cirrus dump.
Cirrus dumps contain text with already expanded templates.
[https://github.com/attardi/wikiextractor](https://github.com/attardi/wikiextractor)

Cirrus dumps are available at:
[cirrussearch](http://dumps.wikimedia.org/other/cirrussearch/).
### 2.1.3 Environment setup

# Details
#### 2.1.3.1 Python environment

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.
Install a Python environment; Anaconda with Python 3.7 is recommended.

In order to speed up processing:
conda create -n wikiextractor python=3.7

- multiprocessing is used for dealing with articles in parallel
- a cache is kept of parsed templates (only useful for repeated extractions).
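To picture these two speed-ups, here is a minimal, illustrative sketch. The helper names `parse_template` and `clean_article` are hypothetical, not the real wikiextractor internals: a process pool cleans articles in parallel, and a per-process cache memoizes parsed templates so repeated extractions skip the parsing cost.

```python
# Minimal, illustrative sketch of the two speed-ups above; the helper
# names are hypothetical and not the real wikiextractor internals.
from functools import lru_cache
from multiprocessing import Pool

@lru_cache(maxsize=None)
def parse_template(template_source: str) -> str:
    """Cache parsed templates so repeated extractions skip this step."""
    return template_source.strip()  # placeholder for real template parsing

def clean_article(raw_text: str) -> str:
    """Placeholder for cleaning one article: expand templates, strip markup."""
    expanded = parse_template(raw_text)   # cached across repeated calls
    return " ".join(expanded.split())

if __name__ == "__main__":
    articles = ["First   article text.", "Second article   text."]
    with Pool(processes=4) as pool:       # articles handled in parallel
        print(pool.map(clean_article, articles))
```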
Activate the environment:

## Installation
conda activate wikiextractor

The script may be invoked directly:
#### 2.1.3.2 Clone the code

python -m wikiextractor.WikiExtractor <Wikipedia dump file>
git clone https://github.com/attardi/wikiextractor.git

It can also be installed from `PyPI` by doing:
**Note**: the two `pdb`-related lines in `./wikiextractor/extract.py` need to be commented out.

pip install wikiextractor

or locally with:
#### 2.1.3.3 Install dependencies

(sudo) python setup.py install
pip install wikiextractor

The installer also installs two scripts for direct invocation:
### 2.1.4 Download the raw corpus

wikiextractor (equivalent to python -m wikiextractor.WikiExtractor)
extractPage (to extract a single page from a dump)
#### 2.1.4.1 Links

## Usage
Link: [https://dumps.wikimedia.org/zhwiki](https://dumps.wikimedia.org/zhwiki)
Replacing zhwiki with enwiki gives the English corpus, replacing it with frwiki gives the French corpus, and so on.
For the full list of language codes, see the [**ISO 639-1 language list**](https://baike.baidu.com/item/ISO%20639-1/8292914?fr=aladdin).
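Since each language only changes the `xxwiki` prefix, a small helper can build the download URL from an ISO 639-1 code. This is a sketch assuming the `latest/…-latest-pages-articles.xml.bz2` path used for English elsewhere in this README:

```python
# Build the "latest pages-articles" dump URL for an ISO 639-1 language code,
# following the pattern of the zhwiki / enwiki links in this README.
def dump_url(lang_code: str) -> str:
    wiki = f"{lang_code}wiki"
    return (f"https://dumps.wikimedia.org/{wiki}/latest/"
            f"{wiki}-latest-pages-articles.xml.bz2")

print(dump_url("zh"))  # Chinese
print(dump_url("en"))  # English
print(dump_url("fr"))  # French
```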

### Wikiextractor
The script is invoked with a Wikipedia dump file as an argument:
#### 2.1.4.2 Download the corpus file

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]
Taking English as an example, download the following file:
[https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2)

The option `--templates` extracts the templates to a local file, which can be reloaded to reduce the time to perform extraction.
### 2.1.5 Run the extraction command

The output is stored in several files of similar size in a given directory.
Each file will contain several documents in this [document format](https://github.com/attardi/wikiextractor/wiki/File-Format).
After placing the downloaded corpus in the project root directory, run the following command:

```
usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
[--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE] [--processes PROCESSES]
[-q] [--debug] [-a] [-v]
input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

<doc id="" url="" title="">
...
</doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as json objects, one per line, with
the following structure

{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
input XML wiki dump file

optional arguments:
-h, --help show this help message and exit
--processes PROCESSES
Number of processes to use (default 79)

Output:
-o OUTPUT, --output OUTPUT
directory for extracted files (or '-' for dumping to stdout)
-b n[KMG], --bytes n[KMG]
maximum bytes per output file (default 1M)
-c, --compress compress output files using bzip
--json write output in json format instead of the default <doc> format

Processing:
--html produce HTML output, subsumes --links
-l, --links preserve links
-ns ns1,ns2, --namespaces ns1,ns2
accepted namespaces
--templates TEMPLATES
use or create file containing templates
--no-templates Do not expand templates
--html-safe HTML_SAFE
use to produce HTML safe output within <doc>...</doc>

Special:
-q, --quiet suppress reporting progress info
--debug print debug info
-a, --article analyze a file containing a single article (debug option)
-v, --version print program version
python -m wikiextractor.WikiExtractor \
-b 100M \
--processes 4 \
--json \
-o data \
<downloaded Wikipedia dump>.bz2
```
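After a `--json` run such as the command above, every file under the `-o` directory (here `data`) contains one JSON object per line with the fields listed in the help text (`id`, `revid`, `url`, `title`, `text`). A minimal reader sketch, assuming that layout and the uncompressed default output:

```python
# Read every document extracted with "--json -o data": each output file
# holds one JSON object per line with id, revid, url, title and text.
import json
from pathlib import Path

def iter_documents(output_dir="data"):
    for path in sorted(Path(output_dir).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)

for doc in iter_documents():
    print(doc["id"], doc["title"], len(doc["text"]))
```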

Saving templates to a file will speed up performing extraction the next time,
assuming template definitions have not changed.

Option `--no-templates` significantly speeds up the extractor, avoiding the cost
of expanding [MediaWiki templates](https://www.mediawiki.org/wiki/Help:Templates).
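A sketch of the two-run pattern this enables (the dump, template file, and output directory names are illustrative): the first run creates the template file via `--templates`, and later runs pass the same file to reload the definitions instead of re-collecting them.

```python
# Sketch of the two-run pattern described above; the dump, template file
# and output directory names are illustrative.
import subprocess

dump = "enwiki-latest-pages-articles.xml.bz2"

# First run: collect template definitions and save them to templates.txt.
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor",
     "--templates", "templates.txt", "-o", "out_first", dump],
    check=True,
)

# Later runs: reload templates.txt instead of preprocessing the whole dump again.
subprocess.run(
    ["python", "-m", "wikiextractor.WikiExtractor",
     "--templates", "templates.txt", "-o", "out_later", dump],
    check=True,
)
```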

For further information, visit [the documentation](http://attardi.github.io/wikiextractor).

### Cirrus Extractor

~~~
usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
[-v]
input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

<doc id="" url="" title="" language="" revision="">
...
</doc>

positional arguments:
input Cirrus Json wiki dump file

optional arguments:
-h, --help show this help message and exit

Output:
-o OUTPUT, --output OUTPUT
directory for extracted files (or '-' for dumping to
stdout)
-b n[KMG], --bytes n[KMG]
maximum bytes per output file (default 1M)
-c, --compress compress output files using bzip

Processing:
-ns ns1,ns2, --namespaces ns1,ns2
accepted namespaces

Special:
-q, --quiet suppress reporting progress info
-v, --version print program version
~~~

### extractPage
Extract a single page from a Wikipedia dump file.

~~~
usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
input XML wiki dump file

optional arguments:
-h, --help show this help message and exit
--id ID article number
--template template number
-v, --version print program version
~~~
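For example, a single article can be pulled out with the installed `extractPage` script; in this sketch the article id `12` and the dump filename are placeholders:

```python
# Pull one article out of a dump with the installed extractPage script;
# the article id and dump filename below are placeholders.
import subprocess

subprocess.run(
    ["extractPage", "--id", "12", "enwiki-latest-pages-articles.xml.bz2"],
    check=True,
)
```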

## License
The code is made available under the [GNU Affero General Public License v3.0](LICENSE).
`-o` specifies the output directory, `--processes` specifies the number of processes to use (default 1), `-b` controls the size of each generated file (default 1M; the larger the file, the more entries it holds), and the final argument is the name of the compressed raw corpus file to process. After the program finishes, the output directory contains several subdirectories, each holding a number of generated files.

## Reference
If you find this code useful, please cite it in publications as:
| Parameter | Meaning |
| ------- | ---------------------- |
| `-o` | output directory |
| `-b` | maximum size of each generated file |
| `--processes` | number of processes |
| `--json` | write output in JSON format |

~~~
@misc{Wikiextractor2015,
author = {Giuseppe Attardi},
title = {WikiExtractor},
year = {2015},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/attardi/wikiextractor}}
}
~~~

4 changes: 2 additions & 2 deletions wikiextractor/extract.py
@@ -26,7 +26,7 @@
from html.entities import name2codepoint
import logging
import time
import pdb # DEBUG
#import pdb # DEBUG

# ----------------------------------------------------------------------

@@ -82,7 +82,7 @@ def clean(extractor, text, expand_templates=False, html_safe=True):
    if expand_templates:
        # expand templates
        # See: http://www.mediawiki.org/wiki/Help:Templates
        pdb.set_trace() # DEBUG
        #pdb.set_trace() # DEBUG
        text = extractor.expandTemplates(text)
    else:
        # Drop transclusions (template, parser functions)