Datasets

List of Datasets

Dataset of Pre-Modern Japanese Text

Pre-Modern Japanese Text, owned by the National Institute of Japanese Literature, consists of image and text data released as open data. Some books also include summaries, transcriptions, and tagging data.

Dataset of Edo Cooking Recipes

Cookbooks from the Edo period, drawn from the Dataset of Pre-Modern Japanese Text, were curated into recipe datasets through transcription, translation into modern Japanese, and structuring into a recipe format.

Kuzushiji Dataset

As a by-product of transcription for the Dataset of Pre-Modern Japanese Text (PMJT), the shapes and coordinates of old Japanese characters (Kuzushiji) were compiled into a separate dataset for training both machines and humans.

KMNIST Dataset

Adapted from the Kuzushiji Dataset, the KMNIST Dataset is a drop-in replacement for the MNIST dataset. We provide three datasets for different purposes: Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji.
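
To illustrate what "drop-in replacement" can mean in practice, the following is a minimal sketch of preprocessing code written only against the MNIST data layout, so that it would not need to change when the data source is swapped. It assumes the MNIST-style layout of 28x28 grayscale images with pixel values 0-255 and integer labels 0-9, which is not stated above and most plausibly applies to the Kuzushiji-MNIST variant; the other variants may differ. The function names and the placeholder images are hypothetical and are not part of any official loader.

```python
from typing import List, Sequence

# Assumption (not stated in the text above): MNIST-style layout, i.e.
# 28x28 grayscale images with pixel values 0-255 and labels 0-9.
IMAGE_SIDE = 28
NUM_CLASSES = 10

Image = Sequence[Sequence[int]]


def check_mnist_layout(images: Sequence[Image], labels: Sequence[int]) -> None:
    """Raise ValueError if the data does not follow the assumed MNIST layout."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    for image in images:
        if len(image) != IMAGE_SIDE or any(len(row) != IMAGE_SIDE for row in image):
            raise ValueError("each image must be 28x28")
        if any(not 0 <= pixel <= 255 for row in image for pixel in row):
            raise ValueError("pixel values must be in 0..255")
    if any(not 0 <= label < NUM_CLASSES for label in labels):
        raise ValueError("labels must be in 0..9")


def flatten_and_scale(image: Image) -> List[float]:
    """Flatten one image row by row and scale pixels to the range [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


# Placeholder data standing in for either dataset (not real samples).
blank = [[0] * IMAGE_SIDE for _ in range(IMAGE_SIDE)]
white = [[255] * IMAGE_SIDE for _ in range(IMAGE_SIDE)]
images, labels = [blank, white], [0, 9]

check_mnist_layout(images, labels)
features = [flatten_and_scale(image) for image in images]
print(len(features[0]))  # 784 values per image
```

Because the functions depend only on the array layout and not on where the images came from, a pipeline built this way should accept either dataset without modification, provided the layout assumption holds.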

Seal Script Dataset

The Seal Script Dataset is a machine-learning-friendly dataset of "Tensho" character images cropped from old character dictionaries from Japan and China, intended for use in interpreting seals.

Kokatsuji Dataset

The Kokatsuji Dataset is a dataset of Kokatsuji (old movable type) created by applying data-driven approaches to books printed with Kokatsuji, such as "Sagabon".

Collection of Facial Expressions (KaoKore)

The project aims to build research infrastructure for art history by collecting facial expressions for comparative style studies from Japanese Emaki (illustrated scrolls), and potentially from works of art across the globe.

Ukiyo-e Face Datasets

This project introduces machine learning and data science methodologies to Ukiyo-e research and constructs a new digital research infrastructure for Japanese culture.

Edo Shopping Guide

The Edo Shopping Guide is derived from "Edo Kaimono Hitori Annai", published in the Edo period. Advertisements were cropped from the books and annotated with each merchant's name, business, address, and logo to create a visual database of merchants in and around the city of Edo.

Edo Sightseeing Guide

The Edo Sightseeing Guide is derived from tourism guides published in the Edo period. Illustrations were cropped from the books and annotated with names, keywords, and geographic information to create a visual tourism guide to Edo.

Edo Maps

Edo Maps is a project to construct geographic information infrastructure for the urban space of Edo by extracting place names from the "Edo Kiriezu" old maps and compiling information from old documents of the Edo period.

Han ID Dataset

The Han (Domain) ID Dataset assigns unique identifiers (IDs) to domains of the Edo and Meiji periods while also defining their names and variant names. It further incorporates a novel "attention score" based on the number of research papers published about each domain. Additionally, a new Province/Region ID Dataset was created, linking domains and provinces by identifiers.

Historical Administrative Boundary Dataset

This dataset assigns identifiers to historically existing municipalities and links them to the historical evolution of municipal boundaries. It covers municipalities established after the 1889 City and Village Systems Act and provides an “animated historical map” that visualizes boundary changes over time on web maps.

Nihon Rekishi Chimei Taikei Placename Dataset

This dataset compiles place name entries and related information from Heibonsha's “Encyclopedia of Japanese Historical Place Names” (50 volumes total), accessible at Japan Knowledge.

Premodern Village Boundary Dataset

This dataset is derived from the territorial boundaries and point data of premodern villages during the late Edo period (around the Keio era), created by Kenichi Honda. For kokudaka (rice yield), it references the Old Tax and Domain Survey Register Database (National Museum of Japanese History).

Dataset of Modern Magazines

Based on digitized magazines published in the early to mid-Meiji period (modern magazines), we release machine learning datasets for OCR and develop OCR software (Kindai-OCR).