The Center for Open Data in the Humanities (CODH), part of the Joint Support-Center for Data Science Research at the Research Organization of Information and Systems, has the following missions for promoting data-driven research and forming a collaborative center for humanities research.
- 1. We establish a new discipline of data science-driven humanities, or digital humanities, and build a cross-organizational center of excellence through the promotion of openness.
- 2. We develop "deep access" to the content of humanities data using state-of-the-art technologies from informatics and statistics.
- 3. We aggregate, process and deliver humanities knowledge from Japan to the world through collaboration across organizations and countries.
- 4. We promote citizen science and open innovation based on open data and applications.
On April 1, 2016, the Joint Support-Center for Data Science Research in the Research Organization of Information and Systems opened a preliminary office for the Center for Open Data in the Humanities (CODH), and on April 1, 2017, CODH officially started. Asanobu KITAMOTO is the director of the center, and the National Institute of Informatics and The Institute of Statistical Mathematics serve as core institutes for research and support activities.
In the humanities research community, the data science approach based on large-scale open data is still immature. We cannot expect the use of open data to increase significantly merely because it has been released. CODH will therefore develop and release databases and tools by adopting the methodology of Digital Humanities (Computers and Humanities), a field that is growing rapidly in the global research community, and will also hold seminars and tutorials to promote the use of research resources.
CODH is built on the concept of “triadic co-creation,” in which three types of stakeholders, namely humanities scholars, machines (computer scientists), and citizens, collaborate to advance data creation, analysis and utilization. In particular, we focus on redefining the division of roles between humans and machines by providing large-scale open training data for machine learning (artificial intelligence).
Thanks to the development of machine learning technology in recent years, some work can be transferred from humans to machines, but before machines can do that work, we need to provide training data. The lack of such open training data delays the introduction of machines to humanities research, so CODH will work on building an infrastructure for open training data, on which researchers and citizens can learn about and co-create such data.
We have already released several open datasets covering books, characters, and recipes. They were created collaboratively under the “Project to Build an International Collaborative Research Network for Pre-modern Japanese Texts” (NIJL-NW Project), promoted mainly by the National Institute of Japanese Literature.
First, the Dataset of Pre-Modern Japanese Text (PMJT) provides image data of 701 pre-modern books in a downloadable format. Because a DOI (Digital Object Identifier) is assigned to each book, image data can be uniquely identified even when multiple books share the same title.
Second, the Dataset of PMJT Character Shapes contains 403,242 instances of old Japanese characters (called Kuzushi-ji) covering 3,999 character types. The dataset can be used not only as learning material for humans but also as training data for machines, for example to develop optical character recognition (OCR) software. These datasets are also used in our “Kuzushi-ji Challenge!” campaign, which invites experts and citizens to develop the best artificial intelligence software for recognizing old Japanese characters.
Third, the Dataset of Edo Cooking Recipes is a collection of recipes from the cookbook “Tamago Hyakuchin,” which contains more than 100 egg recipes. We translated the original recipes into modern Japanese and structured them for home cooks. Moreover, the recipe data was released on Japan's largest recipe service, “Cookpad,” which triggered unexpected excitement among citizens.
Sharing data as open data is one form of scholarly publication in the age of open science. To find best practices for data publication in the humanities, we are now focusing on IIIF (International Image Interoperability Framework) as an infrastructure for image-related projects.
IIIF has recently been adopted rapidly by museums and libraries around the world as a means of interoperable, high-resolution image delivery. CODH is one of these adopters and uses IIIF for the Dataset of Pre-Modern Japanese Text and for the Digital Silk Road Digital Archive of Toyo Bunko Rare Books. Although widely adopted, IIIF is still an immature specification that lacks some functions important for humanities research.
CODH focuses on the use case of collecting interesting images from IIIF content all over the world. We proposed a new specification called the Curation API and developed a reference implementation called IIIF Curation Viewer, which we have applied to PMJT curation and IIIF global curation.
Cropping images and collecting them under a theme is a basic task in humanities research, and sharing the result is one form of data publication, valuable as material for subsequent research. We are now studying the potential of this approach in the field of art history.
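The cropping step can be illustrated with the region parameter of the IIIF Image API, whose URL pattern is `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The following is a minimal sketch, not CODH's implementation: the server URL, identifiers, and the `collect_under_theme` structure are hypothetical and do not reproduce the Curation API format. It assumes the size keyword `max` (Image API 2.1 and 3.0; older servers use `full`).

```python
from urllib.parse import quote


def iiif_region_url(base_url, identifier, x, y, w, h,
                    size="max", rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API URL that crops a pixel region from one image."""
    if x < 0 or y < 0 or w <= 0 or h <= 0:
        raise ValueError("region needs a non-negative origin and a positive size")
    region = f"{x},{y},{w},{h}"
    # Percent-encode the identifier so slashes inside it are not read as path separators.
    encoded_id = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def collect_under_theme(theme, crops):
    """Group cropped-image URLs under a theme.

    Hypothetical structure for illustration only; it is not the Curation API format.
    """
    return {"theme": theme, "images": [iiif_region_url(**crop) for crop in crops]}


if __name__ == "__main__":
    # Hypothetical server and identifiers.
    crops = [
        {"base_url": "https://example.org/iiif", "identifier": "book1/page3",
         "x": 100, "y": 200, "w": 300, "h": 400},
        {"base_url": "https://example.org/iiif", "identifier": "book2/page7",
         "x": 0, "y": 0, "w": 512, "h": 512},
    ]
    for url in collect_under_theme("faces", crops)["images"]:
        print(url)
```

Because each crop is only a URL, a themed collection stays small and points back to the source images instead of copying them.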
CODH also deals with other types of open data, such as photographs, maps (geographic information) and magazines. To organize and utilize those datasets, CODH develops web and mobile applications for researchers and citizens. Programs such as seminars, tutorials and training courses help to publicize and disseminate open data in the humanities. Moreover, involving more researchers and citizens in data creation and analysis activities, or citizen science, is a related challenge. CODH's activities naturally cross disciplines, and go beyond them in the sense of trans-disciplinary science.
National Institute of Informatics
The Institute of Statistical Mathematics