"Ehon Musierami" Pre-Modern Japanese Text Dataset (NIJL)

The Center for Open Data in the Humanities (CODH) conducts research and development to enhance access to humanities data using state-of-the-art technology in informatics and statistics. At the same time, it builds data platforms based on the idea of open science to promote trans-disciplinary participation by people with diverse backgrounds, thereby opening up new possibilities for data-driven research frameworks in digital humanities.

Important News

We are recruiting project researchers! (deadline January 24, 2018)

December 26, 2017 The Database of Pre-modern Japanese Text has grown from 701 to 1,767 books, and from 158,533 to 329,702 images.

November 9, 2017 Edo+150 Project, Bukan Complete Collection, and Old Character Challenge! were released (all are in Japanese).

JADH2018 / TEI2018 will be held at Hitotsubashi Hall (National Institute of Informatics). Details will be announced later.

List of Datasets

Dataset of Pre-Modern Japanese Text

Image and text data of pre-modern Japanese texts owned by the National Institute of Japanese Literature are released as open data. In addition, some texts come with description, transcription, and tagging data.

Dataset of PMJT Character Shapes

As a by-product of transcription work on the Dataset of Pre-Modern Japanese Text (PMJT), the shapes and coordinates of old Japanese characters were compiled into another dataset, intended as training data to make both machines and humans smarter.

Dataset of Edo Cooking Recipes

Cookbooks from the Edo period included in the Dataset of Pre-Modern Japanese Text were curated to create recipe datasets through transcription, translation into modern Japanese, and structuring into a recipe format.

Bukan Complete Collection

The project aims to comprehensively analyze the collection of "Bukan" books, which were best sellers throughout 200 years of the Edo period, and to build a core information platform about the Edo period covering human and geospatial information about Daimyo (lords) and the Shogunate government.

Dataset of Modern Magazines

Modern magazines are digitized and released as image datasets. The n2i project is constructing a dataset of modern documents in order to develop OCR for them.

List of Projects

Digital Silk Road

Digital humanities research project about creating digital archives of cultural heritage based on collaboration between informatics and humanities.

Edo+150 Projects

On November 9, 1867, the restoration of imperial rule symbolized the end of the Edo period. 150 years have passed since then, and now is the time to revive the information space of Edo, using open data about the 260 years of the Edo period and taking advantage of state-of-the-art technology such as artificial intelligence (AI).

Kuzushiji Challenge!

Old books of the Edo period were written in Kuzushiji (old Japanese characters), which modern Japanese people can rarely read without effort. Can AI, then, learn to read Kuzushiji in place of humans? We release the Character Shape dataset to the world as training data for machine learning, to tackle the grand challenge of "Kuzushiji vs. AI" with the collective power of people.

Digital Archive of North China Railway

A prototype research database for clarifying the activities of North China Railway, through mapping its railway network and investigating the themes and locations of photographs taken along the railroads by the company as part of its promotional activities.

Memory Platform / Memorygraph

Memorygraph is a new photographic technique for creating layers of memories, and the project aims to develop the Memorygraph app for use in field work on cultural heritage, tourism, and recovery from disasters.


A project that aims at integrating geographic information science (GIS) and natural language processing (NLP) to develop a geo-tagging system that transforms text to maps automatically.


The Geoshape repository is a data repository that releases geometries of geographic features. It includes the "Historical Administrative Boundaries Dataset Beta Version," which shows historical changes in administrative boundaries since 1920.

List of Software

IIIF-based Image Delivery and Case Studies

The use of IIIF (International Image Interoperability Framework) for image delivery in large-scale image databases ranging from humanities to natural sciences, with the long-term goal of contributing to the international community.

IIIF Curation Viewer

An open-source (MIT-licensed) image viewer that takes advantage of IIIF Image API and IIIF Presentation API, and proposes and implements new standards such as Curation API, Timeline API and Cursor API.
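
As a rough illustration of what a viewer built on the IIIF Image API does when it requests an image, the sketch below assembles a request URL following the general pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}` defined by that API. The function name `iiif_image_url`, the server `https://example.org/iiif`, and the identifier `book1/page001` are hypothetical and chosen only for this example; this is not code from the IIIF Curation Viewer.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble an IIIF Image API request URL (illustrative helper).

    region   -- "full" or "x,y,w,h" in pixels
    size     -- "max", "w," (width only), ",h" (height only) or "w,h"
    rotation -- clockwise degrees as a string, e.g. "0" or "90"
    quality  -- "default", "color", "gray" or "bitonal"
    fmt      -- file extension such as "jpg" or "png"
    """
    # The identifier is percent-encoded so that a "/" inside it is not
    # mistaken for a separator between URL path segments.
    encoded_id = quote(identifier, safe="")
    return (f"{base_url.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, used only for illustration:
# crop a 300x400 pixel region at (100, 200) and scale it to 150 px wide.
url = iiif_image_url("https://example.org/iiif", "book1/page001",
                     region="100,200,300,400", size="150,")
print(url)
```

Because the region and size are part of the URL itself, a curation tool can refer to a cropped detail of a page simply by storing such a URL, without copying the image. Note that `size="max"` follows recent versions of the Image API; older servers may expect `full` instead.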


A minimal Flask web application that provides API access for storing and retrieving JSON documents.