Adapted from the Kuzushiji Dataset, the KMNIST dataset is a drop-in replacement for the MNIST dataset. If your software can already read MNIST, you can test it on KMNIST simply by pointing it at the KMNIST files. We provide three datasets for different purposes: Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji.
Information about Kuzushiji research is available from "2nd CODH Seminar: Kuzushiji Challenge - Future of Machine Recognition and Human Transcription" and "Kuzushiji Challenge!".
|Kuzushiji-MNIST||Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST.|
|Kuzushiji-49||As the name suggests, Kuzushiji-49 has 49 classes (28x28 grayscale, 270,912 images). It is a much larger but imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark.|
|Kuzushiji-Kanji||Kuzushiji-Kanji is an imbalanced dataset of 3,832 Kanji characters in total (64x64 grayscale, 140,426 images), with class sizes ranging from 1,766 examples down to a single example.|
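Because Kuzushiji-MNIST is provided in the original MNIST format, a reader written for MNIST's IDX files should work on it unchanged. The following is a minimal sketch of such a reader, using only the Python standard library. The function names `parse_idx_ubyte` and `read_idx_ubyte` are hypothetical, the assumption that files may be gzip-compressed is ours, and the demo uses a synthetic buffer rather than real KMNIST data.

```python
import gzip
import struct


def parse_idx_ubyte(raw: bytes):
    """Parse an unsigned-byte IDX buffer; return (shape, flat_values)."""
    # Header: two zero bytes, a data-type code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    header_end = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, raw[4:header_end])
    count = 1
    for dim in shape:
        count *= dim
    data = raw[header_end:]
    if len(data) != count:
        raise ValueError("payload size does not match header")
    return shape, data


def read_idx_ubyte(path):
    """Read an IDX file from disk (assumes a .gz suffix means gzip)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        return parse_idx_ubyte(f.read())


# Synthetic stand-in for an image file: two blank 28x28 grayscale images.
demo = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28) + bytes(2 * 28 * 28)
shape, pixels = parse_idx_ubyte(demo)
```

For a real image file, `shape` would be `(number_of_images, 28, 28)` and image `i` would occupy `pixels[i * 784:(i + 1) * 784]`; a label file has a single dimension with one byte per example.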
Links for downloading the datasets are summarized in the following GitHub repository.
"KMNIST Dataset" (created by ROIS-DS Center for Open Data in the Humanities (CODH)), adapted from "Kuzushiji Dataset" (created by National Institute of Japanese Literature and others), is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
We suggest using the following attribution when you use the data.
"KMNIST Dataset" (created by CODH), adapted from "Kuzushiji Dataset" (created by NIJL and others), doi:10.20676/00000341
The KMNIST Dataset was released and presented at the Second Workshop on Machine Learning for Creativity and Design at Neural Information Processing Systems (NeurIPS 2018).