System for collecting and processing biobibliographic data
The Institute of Literary Research of the Polish Academy of Sciences approached ImpiCode with the task of creating an advanced tool for storing, organising and updating bibliographical data about Polish writers and literary scholars.
We developed an online platform designed to fit the complexity of the modeled data, implemented workflows between different users, and imported data from the sources supplied by the client.
One of these sources is an online lexicon, from which we downloaded data using web scraping tools. The data was then parsed, organised, split into atomized structures, and imported into the system database.
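As a rough illustration of what splitting an entry into atomized structures can look like, here is a minimal Python sketch. The entry layout, the sample names, the `PersonEntry` class and the `parse_entry` function are all hypothetical and are not taken from the lexicon or from the actual system; the real entries are considerably more complex.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical entry layout, assumed only for this sketch:
#   "SURNAME Given names (birth-death); role, role"
ENTRY_RE = re.compile(
    r"^(?P<surname>\S+)\s+"
    r"(?P<given>[^()]+?)\s*"
    r"\((?P<born>\d{4})-(?P<died>\d{4})?\);\s*"
    r"(?P<roles>.+)$"
)


@dataclass
class PersonEntry:
    """One scraped entry split into separately stored fields."""
    surname: str
    given_names: str
    born: int
    died: Optional[int]
    roles: List[str]


def parse_entry(line: str) -> PersonEntry:
    """Parse one scraped text line into a PersonEntry."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognised entry layout: {line!r}")
    died = match.group("died")
    return PersonEntry(
        surname=match.group("surname"),
        given_names=match.group("given"),
        born=int(match.group("born")),
        died=int(died) if died else None,
        roles=[role.strip() for role in match.group("roles").split(",")],
    )


if __name__ == "__main__":
    # Invented sample line, not real lexicon data.
    print(parse_entry("KOWALSKI Jan (1901-1980); poet, critic"))
```

Keeping each field separate, rather than storing the entry as one block of text, is what makes later editing and reorganizing of individual pieces of data practical.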
One of the biggest challenges in this project was parsing the second supplied data source. We were given scans of a 10-volume lexicon in PDF, with a text layer obtained using OCR software, and a list of outdated entries in QRT format (the format used by QR Tekst, a Polish text editor from the 1990s). Because of the many OCR inaccuracies, we had to extract the text from the QRT files and merge it with the PDF version, using custom merge rules drawn up by analyzing the differences between the QRT and PDF texts.
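The general idea of such a merge can be sketched in Python with the standard `difflib` module. This is only an illustration under stated assumptions, not the rules used in the project: the function name `merge_texts`, the similarity threshold and the three rules in the comments are hypothetical. The sketch assumes that small word-level differences are OCR errors, so the clean QRT word wins, while larger differences are genuine content changes, so the newer PDF text wins.

```python
from difflib import SequenceMatcher

# Hypothetical threshold: above it, two differing spans are treated as
# the same text with OCR noise rather than as a real content change.
SIMILARITY_THRESHOLD = 0.6


def merge_texts(ocr_text: str, qrt_text: str) -> str:
    """Merge noisy OCR text with clean but outdated QRT text, word by word."""
    ocr_words = ocr_text.split()
    qrt_words = qrt_text.split()
    matcher = SequenceMatcher(a=ocr_words, b=qrt_words, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_span = ocr_words[i1:i2]
        qrt_span = qrt_words[j1:j2]
        if tag == "equal":
            merged.extend(ocr_span)
        elif tag == "replace":
            similarity = SequenceMatcher(
                a=" ".join(ocr_span), b=" ".join(qrt_span), autojunk=False
            ).ratio()
            # Assumed rule 1: near-identical spans differ only by OCR
            # errors, so take the clean QRT wording; otherwise keep OCR.
            if similarity >= SIMILARITY_THRESHOLD:
                merged.extend(qrt_span)
            else:
                merged.extend(ocr_span)
        elif tag == "delete":
            # Assumed rule 2: text present only in the OCR layer is
            # newer material, so keep it.
            merged.extend(ocr_span)
        # Assumed rule 3: text present only in QRT ("insert") is outdated
        # material dropped from the newer edition, so it is skipped.
    return " ".join(merged)


if __name__ == "__main__":
    # Invented sample strings, not real lexicon data.
    print(merge_texts("Jan Kowa1ski wrote poems and essays",
                      "Jan Kowalski wrote poems"))
```

In practice, rules of this kind would be refined by inspecting real differences between the two sources, for example typical OCR confusions between similar-looking characters.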
Since its deployment, the platform has supported the Institute's team in revising and updating data with features such as change tracking, text markup, a WYSIWYG editor, and tools for reorganizing stored structures.