Text mining in historical newspapers to develop new research methods

Published on Tuesday, 04 July 2017

The aim of the project “Impresso: Media monitoring of the past. Mining 200 years of historical newspapers” is to link digitised corpora of newspapers from Switzerland, Luxembourg, France and Germany and to develop new methods to analyse them.

Over the next three years, the Luxembourg Centre for Contemporary and Digital History (C²DH) at the University of Luxembourg will work with the DHLAB at the École polytechnique fédérale de Lausanne (EPFL) and the Institute for Computational Linguistics at the University of Zurich on this project. The project will receive 1.7 million Swiss francs (1.55 million euros) in funding from the Swiss National Science Foundation (SNSF).

Improve digital technologies for research

Historical newspapers represent a wealth of archival material, and many have already been digitised. However, conducting research using these sources raises a number of problems, including insufficient text searchability as a result of poor text recognition and missing metadata, the relative isolation of digitised newspapers within their respective archives, search functions that are difficult to use, and poorly designed user interfaces. Recent progress in text analysis has also opened up new possibilities for conducting research on large collections of texts.

The project will develop “deep learning” methods, a subfield of machine learning, in order to correct errors in text recognition, improve the identification of people, institutions and places, and enhance this entity recognition using external data repositories. The C²DH will be responsible for developing a user interface that will incorporate new search functions and facilitate the critical analysis of the newspaper corpora. This may include providing information on the provenance of the data and the quality of automatically generated annotations, as well as indicating any gaps in the inventory.

A comprehensive and collaborative project

To boost the relevance of the project for history, the humanities and social sciences in general, the C²DH will coordinate a series of workshops that will provide a forum for users and developers to exchange their ideas. “Further links between history, computer science and design will be developed via an associated C²DH-based research project on resistance to European unification in the late 19th and early 20th centuries,” explains Dr Marten Düring, who coordinates the project at the University of Luxembourg. “Finally, the project will also be used for University teaching, giving young scholars the opportunity to explore automated methods for the extraction and representation of information from historical sources.”

The project will not only lead to academic publications; at the end of the project, the individual processing, analysis and storage systems will also be made available on an open source basis for others to reuse and develop.

Associated project partners include the Luxembourg National Library, the Swiss National Library, the Swiss newspapers Le Temps and Neue Zürcher Zeitung, Swiss archives, and researchers from the University of Lausanne. In Luxembourg the project will be coordinated by Dr Marten Düring, Dr Lars Wieneke and Prof. Andreas Fickers, in collaboration with Daniele Guido and Estelle Bunout.