KMNIST Dataset

Adapted from the Kuzushiji Dataset, the KMNIST Dataset is a drop-in replacement for the MNIST dataset. If your software can read the MNIST dataset, you can easily test it on the KMNIST dataset by changing the dataset setting. We provide three datasets for different purposes: Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji.
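As an illustration of what "reading the MNIST dataset" involves, the following is a minimal sketch of a reader for the IDX layout used by the original MNIST files (a 4-byte header, big-endian dimension sizes, then raw unsigned bytes). The function name read_idx and the file name in the trailing comment are assumptions made for this example, not part of the dataset distribution.

```python
import gzip
import struct

import numpy as np


def read_idx(path):
    """Read an unsigned-byte IDX file (optionally gzip-compressed) into a NumPy array.

    Hypothetical helper for illustration; not part of the KMNIST distribution.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        # Header: two zero bytes, one dtype code, one byte giving the number of dimensions.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError("expected an unsigned-byte IDX file")
        # Each dimension size is a 4-byte big-endian unsigned integer.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


# Example call (the file name is an assumption; use whichever file you downloaded):
# images = read_idx("train-images-idx3-ubyte.gz")
```

A loader written this way does not depend on which dataset produced the file, so switching from MNIST to Kuzushiji-MNIST should only require changing the path.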

GitHub: Repository for Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji

The 10 classes of Kuzushiji-MNIST, with the first column showing each character's modern hiragana counterpart.

Information about Kuzushiji research is available in "2nd CODH Seminar: Kuzushiji Challenge - Future of Machine Recognition and Human Transcription" and in "Kuzushiji Challenge!".

Datasets

Kuzushiji-MNIST: A drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST.
Kuzushiji-49: As the name suggests, Kuzushiji-49 has 49 classes (28x28 grayscale, 270,912 images). It is a much larger but imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark.
Kuzushiji-Kanji: An imbalanced dataset of 3,832 Kanji characters in total (64x64 grayscale, 140,426 images), ranging from 1,766 examples to only a single example per class.
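For the NumPy format, a loader along the following lines may be enough, and counting labels per class is a quick way to see the imbalance described above. This is a sketch under stated assumptions: the archive key "arr_0" is the name np.savez gives to an unnamed array and is assumed here, and the file name in the trailing comment and both function names are hypothetical.

```python
import numpy as np


def load_npz_array(path, key="arr_0"):
    """Load one array from a .npz archive.

    The default key "arr_0" is an assumption: it is the name np.savez
    assigns to an array that was saved without an explicit name.
    """
    with np.load(path) as archive:
        return archive[key]


def class_counts(labels, num_classes):
    """Return the number of examples per class as an array of length num_classes."""
    return np.bincount(labels, minlength=num_classes)


# Example calls (the file name is an assumption; use whichever file you downloaded):
# labels = load_npz_array("k49-train-labels.npz")
# counts = class_counts(labels, num_classes=49)
# print(counts.min(), counts.max())
```

On an imbalanced dataset, the gap between the smallest and largest counts suggests reporting per-class or balanced accuracy rather than plain accuracy.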

Links for downloading the datasets are summarized in the following GitHub repository.

GitHub: Repository for Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji

License

"KMNIST Dataset" (created by ROIS-DS Center for Open Data in the Humanities (CODH)), adapted from "Kuzushiji Dataset" (created by National Institute of Japanese Literature and others), is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

We suggest using the following attribution when you use the data.

"KMNIST Dataset" (created by CODH), adapted from "Kuzushiji Dataset" (created by NIJL and others), doi:10.20676/00000341

News

2018-12-08

The KMNIST Dataset was released and presented at the Second Workshop on Machine Learning for Creativity and Design at Neural Information Processing Systems (NeurIPS 2018).

2018-12-03

The paper was submitted to arXiv: Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, David Ha, "Deep Learning for Classical Japanese Literature", arXiv:1812.01718.