Kanji usage frequency

Datasets built from various Japanese language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

Building the datasets

You'll need Node.js 18 or later.

See scripts section in package.json.

Aozora:

aozora:download - use crawler/scraper to collect the data
aozora:gaiji:extract - extract gaiji notations data from scraped pages. Gaiji refers to kanji charasters which are replaced with images in the documents, because Shift-JIS encoding cannot represent them
aozora:gaiji:replacements - build gaiji replacements file - produces only partial results, which may need to be manually completed
aozora:clean - clean the scraped pages (apply gaiji replacements)
aozora:count - create the dataset

Wikipedia:

News:

See Astro docs and the scripts section in package.json.

Name		Name	Last commit message	Last commit date
Latest commit History 179 Commits
.github/workflows		.github/workflows
.vscode		.vscode
data		data
data2015		data2015
public		public
scripts		scripts
src		src
.editorconfig		.editorconfig
.gitignore		.gitignore
.prettierignore		.prettierignore
.prettierrc.cjs		.prettierrc.cjs
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.txt		LICENSE.txt
README.md		README.md
astro.config.ts		astro.config.ts
package-lock.json		package-lock.json
package.json		package.json
tailwind.config.cjs		tailwind.config.cjs
tsconfig.json		tsconfig.json