KMNIST Dataset

KMNIST is a dataset, adapted from Kuzushiji Dataset, as a drop-in replacement for MNIST dataset, which is the most famous dataset in the machine learning community. Just change the setting of your software from MNIST to KMNIST. We provide three types of datasets, namely Kuzushiji-MNIST、Kuzushiji-49、Kuzushiji-Kanji, for different purposes.

KMNIST Dataset is created by ROIS-DS Center for Open Data in the Humanities (CODH), based on Kuzushiji Dataset created by National Institute of Japanese Literature. Please refer to the license.

GitHub: Repository for Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji

The 10 classes of Kuzushiji-MNIST, with the first column showing each character's modern hiragana counterpart.

Information about Kuzushiji research is available in 2nd CODH Seminar: Kuzushiji Challenge - Future of Machine Recognition and Human Transcription and Kuzushiji Challenge!. Moreover, Kaggle has many examples about how dataset can be used.

Kaggle: Kuzushiji-MNIST | Kaggle

Datasets

Kuzushiji-MNIST Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST.
Kuzushiji-49 As the name suggests, Kuzushiji-49 has 49 classes (28x28 grayscale, 270,912 images), is a much larger, but imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark.
Kuzushiji-Kanji Kuzushiji-Kanji is an imbalanced dataset of total 3832 Kanji characters (64x64 grayscale, 140,426 images), ranging from 1,766 examples to only a single example per class.

Links for downloading the datasets are summarized in the following GitHub repository.

GitHub: Repository for Kuzushiji_MNIST, Kuzushiji49, and Kuzushiji_Kanji

Citation

Please consider citing the following paper when you publish research results using KMNIST Dataset.

Papers Citing this Dataset

What is Mahalo Button?

License

Creative Commons License
"KMNIST Dataset" (created by ROIS-DS Center for Open Data in the Humanities (CODH)), adapted from "Kuzushiji Dataset" (created by National Institute of Japanese Literature and others) is licensed under a Creative Commons Attribution Share-Alike 4.0 International License.

We suggest to use the following attribution when you use the data.

"KMNIST Dataset" (created by CODH), adapted from "Kuzushiji Dataset" (created by NIJL and others), doi:10.20676/00000341

News

2019-02-04

Among KMNIST datasets, Kuzushiji-MNIST and Kuzushiji-49 was updated to fix problems in the dataset. The previous version had problematic images such as black images due to problems in image processing, but we removed those images and added new images to create new datasets.Old datasets were overwritten by new datasets.

2018-12-08

KMNIST Dataset was released and presented in Second Workshop on Machine Learning for Creativity and Design at Neural Information Processing Systems (NeurIPS 2018).

2018-12-03

The paper was submitted: Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, David Ha, "Deep Learning for Classical Japanese Literature", arXiv:1812.01718.