"Ehon Musierami" Pre-Modern Japanese Text Dataset (NIJL)

The ROIS-DS Center for Open Data in the Humanities (CODH) is developing humanities research in the era of open science. Our work includes 'data-driven humanities', which analyzes humanities resources using state-of-the-art technology from computer science and statistics, and 'big data in the humanities', which uses datasets created from humanities research for transdisciplinary research.

List of Datasets

Dataset of Pre-Modern Japanese Text

The Dataset of Pre-Modern Japanese Text, owned by the National Institute of Japanese Literature, consists of image and text data released as open data. Some books also have summary, transcription, and tagging data.

Dataset of Edo Cooking Recipes

Cookbooks from the Edo period, drawn from the Dataset of Pre-Modern Japanese Text, were curated into recipe datasets through transcription, translation into modern Japanese, and structuring into a recipe format.

Kuzushiji Dataset

As a by-product of transcription for the Dataset of Pre-Modern Japanese Text (PMJT), the shapes and coordinates of old Japanese characters (Kuzushiji) were compiled into a separate dataset for training both machines and humans.

KMNIST Dataset

Adapted from the Kuzushiji Dataset, the KMNIST dataset is a drop-in replacement for the MNIST dataset. We provide three datasets for different purposes: Kuzushiji-MNIST, Kuzushiji-49, and Kuzushiji-Kanji.

Collection of Facial Expressions (KaoKore)

The project aims to build research infrastructure for art history by collecting facial expressions for the comparative study of style from Japanese emaki (illustrated scrolls), and potentially from works of art across the globe.

Ukiyo-e Face Datasets

The project introduces machine learning and data science methodologies to Ukiyo-e research and constructs a new digital research infrastructure on Japanese culture.

Edo Shopping Guide

The Edo Shopping Guide is derived from "Edo Kaimono Hitori Annai," published in the Edo period. Advertisements were cropped from the books and annotated with each merchant's name, business, address, and logo to create a visual database of merchants in and around the city of Edo.

Edo Sightseeing Guide

The Edo Sightseeing Guide is derived from tourism guides published in the Edo period. Illustrations were cropped from the books and annotated with names, keywords, and geographic information to create a visual tourism guide of Edo.

Edo Maps Beta

Edo Maps Beta is a project to construct a geographic information infrastructure for the urban space of Edo by extracting place names from Edo Kiriezu and reconstructing information from old documents of the Edo period.

Geoshape Repository

Geoshape repository is a data repository for sharing the geographic shape of a geographic entity. It includes "Historical Municipal Boundaries Dataset Beta Version" about the historical change of municipal boundaries since 1920 and "Village Boundaries Dataset" of 2015.

Seal Script Dataset

Seal Script Dataset is a machine learning-friendly dataset of "Tensho" character images cropped from old dictionaries of characters from Japan and China to be used for the interpretation of seals.

Dataset of Modern Magazines

Based on the results of digitization of magazines published in the early to mid-Meiji period (modern magazines), we release machine learning datasets for OCR and develop OCR software (Kindai-OCR).

List of Projects

miwo: App for AI Kuzushiji Recognition

I want to read the kuzushiji material! But I can't read them! "miwo" is an app that helps such people. Just take a picture of the material with the camera, press a button, and the AI will convert the kuzushiji into modern characters. Welcome to the world of kuzushiji documents.

KuroNet Kuzushiji Recognition

A multi-character (one-page) Kuzushiji recognition service developed using AI (machine learning). A service for IIIF (International Image Interoperability Framework) images has also been released to transcribe kuzushiji images from around the world.


Soan is a service that allows users to input modern Japanese text to generate and share kuzushiji images. The software and service enable anyone to digitally typeset old movable type (kokatsuji) from "Sagabon," which are among the most beautiful books in the history of Japanese publishing.

Historical Big Data

Historical big data is a project for the seamless analysis of environment and society from the past to the present by structuring various records written by humans.

Edo+150 Projects

On November 9, 1867, the restoration of imperial rule symbolized the end of the Edo period. 150 years have passed since then, and now is the time to revive the information space of Edo, using open data about the 260 years of the Edo period and taking advantage of state-of-the-art technology such as artificial intelligence (AI).

edomi - Data Portal for the Historical Edo

To create a vantage point over the various data about Edo as a city and Edo as a period, we are building a data portal that structures and integrates data on historical Edo in response to present-day needs.

Bukan Complete Collection

The project aims to comprehensively analyze the collection of "Bukan" books, which were best sellers throughout the 200 years of the Edo period, and to construct a core information platform on the Edo period covering human and geospatial information about the Daimyo (lords) and the Shogunate government.

Image Collation Service for Differential Reading

Read two images, overlay them, and emphasize the difference. A service useful for "spot the difference" on images with partial difference, such as the collation of images between different versions of woodblock-printed books.
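
The overlay-and-emphasize idea described above can be sketched in a few lines. This is a minimal illustration only, not the service's actual implementation: it assumes two grayscale images of identical size that are already aligned, represented as lists of rows with values 0-255. The function name overlay_difference and the threshold parameter are hypothetical, introduced here for illustration.

```python
def overlay_difference(image_a, image_b, threshold=32):
    """Overlay two aligned grayscale images and mark pixels that differ.

    Both images are lists of rows of integers in 0-255 (an assumption of
    this sketch). Returns (overlay, mask): overlay holds RGB tuples,
    mask holds booleans that are True where the images differ.
    """
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")

    overlay, mask = [], []
    for row_a, row_b in zip(image_a, image_b):
        overlay_row, mask_row = [], []
        for a, b in zip(row_a, row_b):
            differs = abs(a - b) > threshold
            mask_row.append(differs)
            # Matching pixels show the averaged grey of both images;
            # differing pixels are emphasized in pure red.
            grey = (a + b) // 2
            overlay_row.append((255, 0, 0) if differs else (grey, grey, grey))
        overlay.append(overlay_row)
        mask.append(mask_row)
    return overlay, mask


page_v1 = [[0, 0], [255, 255]]
page_v2 = [[0, 200], [255, 250]]
overlay, mask = overlay_difference(page_v1, page_v2)
```

Real collation of woodblock-printed pages would also need to register (align) the two scans before comparing them, which this sketch leaves out.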


We develop a platform for assigning identifiers to place names and sharing gazetteers.


By integrating geographic information science (GIS) and natural language processing (NLP), we develop a geo-tagging system for automatically transforming text to maps.

North China Railway Archive

A research database on the North China Railway Company that links the company's promotional stock photographs with its transportation network, for studying the company's activities through the themes and locations of the photographs.


Memorygraph is a camera app that supports same-composition photography. We use now-and-then photography, before-and-after photography, fixed-point photography, and pilgrimage photography for cultural heritage, tourism, and recovery from disasters.

Digital Silk Road

A digital humanities research project that creates digital archives of cultural heritage through collaboration between informatics and the humanities.

List of Software

IIIF-based Image Delivery and Case Studies

The use of IIIF (International Image Interoperability Framework) for image delivery in large-scale image databases ranging from the humanities to the natural sciences, with the long-term goal of contributing to international communities.

IIIF Curation Platform

Focusing on the concept of "curation," we build a next generation IIIF platform that is open and user-driven.

IIIF Curation Viewer

An open-source IIIF image viewer that takes advantage of IIIF Image API and IIIF Presentation API and implements our proposed specifications such as Curation API, Timeline API and Cursor API.
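
For readers unfamiliar with the IIIF Image API mentioned above, the sketch below shows how an image request URL is assembled from its path segments (identifier, region, size, rotation, quality, format). It illustrates the URL pattern from the IIIF specification and is not code from IIIF Curation Viewer. The helper name iiif_image_url and the example server URL are assumptions made for illustration. The default size "max" follows Image API 2.1 and later; servers on older versions expect "full".

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded so that any slashes inside it are not
    mistaken for path separators.
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Request a 300x400 pixel region whose top-left corner is at (100, 200).
# The server URL and identifier are placeholders, not a real endpoint.
url = iiif_image_url("https://example.org/iiif", "book1/page 3",
                     region="100,200,300,400")
```

Because a region is addressed purely through the URL, a curation tool can reference a cropped detail of a page without copying or re-hosting the image.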

IIIF Curation Finder

A IIIF Search tool for searching curations created by IIIF Curation Viewer and publishing new curations by re-editing.

IIIF Curation Board

A whiteboard tool for organizing curations created by IIIF Curation Viewer.


A Flask web application for storing JSON documents, with some special functions for JSON-LD.

Canvas Indexer

A Flask web application that crawls Activity Streams for IIIF Canvases and offers a search API.

Curation Tracer

A Flask web application for IIIF resource usage analytics with regard to IIIF Curations.

ICP Docker

Scripts for installing IIIF Curation Platform on a Docker environment.


JavaScript-based visual differencing tool.


JavaScript-based visual differencing tool for an image sequence.


OCR system for modern Japanese documents trained on the modern magazine dataset.