KMNIST is a dataset, adapted from Kuzushiji Dataset, as a drop-in replacement for MNIST dataset, which is the most famous dataset in the machine learning community. Just change the setting of your software from MNIST to KMNIST. We provide three types of datasets, namely Kuzushiji-MNIST、Kuzushiji-49、Kuzushiji-Kanji, for different purposes.
KMNIST Dataset is created by ROIS-DS Center for Open Data in the Humanities (CODH), based on Kuzushiji Dataset created by National Institute of Japanese Literature. Please refer to the license.
GitHub: Repository for Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji
Information about Kuzushiji research is available in 2nd CODH Seminar: Kuzushiji Challenge - Future of Machine Recognition and Human Transcription and Kuzushiji Challenge!. Moreover, Kaggle has many examples about how dataset can be used.
Kaggle: Kuzushiji-MNIST | Kaggle
Datasets
Kuzushiji-MNIST | Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST. |
---|---|
Kuzushiji-49 | As the name suggests, Kuzushiji-49 has 49 classes (28x28 grayscale, 270,912 images), is a much larger, but imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark. |
Kuzushiji-Kanji | Kuzushiji-Kanji is an imbalanced dataset with a total of 3,832 Kanji characters (64x64 grayscale, 140,424 images), ranging from 1,768 examples to only a single example per class. |
Links for downloading the datasets are summarized in the following GitHub repository.
GitHub: Repository for Kuzushiji_MNIST, Kuzushiji49, and Kuzushiji_Kanji
Citation
Please consider citing the following paper when you publish research results using KMNIST Dataset.
Papers Citing this Dataset
License
"KMNIST Dataset" (created by ROIS-DS Center for Open Data in the Humanities (CODH)), adapted from "Kuzushiji Dataset" (created by National Institute of Japanese Literature and others) is licensed under a Creative Commons Attribution Share-Alike 4.0 International License.
We suggest to use the following attribution when you use the data.
"KMNIST Dataset" (created by CODH), adapted from "Kuzushiji Dataset" (created by NIJL and others), doi:10.20676/00000341
News
2024-05-31
We corrected the description of Kuzushiji-Kanji dataset. The number of images is 140,424 (previously written as 140,426), and the number of examples of the largest class is 1,768 (previously written as 1,766). The dataset itself remains the same.
2019-02-04
Among KMNIST datasets, Kuzushiji-MNIST and Kuzushiji-49 was updated to fix problems in the dataset. The previous version had problematic images such as black images due to problems in image processing, but we removed those images and added new images to create new datasets.Old datasets were overwritten by new datasets.
2018-12-08
KMNIST Dataset was released and presented in Second Workshop on Machine Learning for Creativity and Design at Neural Information Processing Systems (NeurIPS 2018).
2018-12-03
The paper was submitted: Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, David Ha, "Deep Learning for Classical Japanese Literature", arXiv:1812.01718.