The Hellenic Optical Character Recognition Team, or simply Hellenic OCR Team, is an innovative crowdsourcing initiative. It represents the first distributed scientific platform that aims exclusively at the processing and analysis of parliamentary data. Over time, the Team expanded its scope to cover the study of government data as well. To this end, it has developed a wide spectrum of methods and software tools of its own to advance the interoperability of systems and data sources.
Established in late 2017, the Team includes, as of summer 2020, 35 experts resident in eight different countries. The technical and scientific backgrounds of its members are quite diverse, ranging from law to engineering and from political science to linguistics. Members come from academia, the public sector and the private sector, and the Team maintains a balanced gender distribution.
Team members are linked through an online exchange platform and meet monthly, face to face or online, to tackle open issues and exchange best practices. Newcomers receive basic training upon entering the group, while more experienced members, called ‘mentors’, provide peer-to-peer advice and support. The Team is organized in three distinct groups: the OCR group, the Analytics group and the Developers group.
The processing of non-digital corpora constitutes the Team’s core competency. Text processing follows a well-defined, streamlined process. The resulting content is transformed into an open and structured format, such as XML (eXtensible Markup Language), which enables the use of a wide range of novel tools and methods from the field of computational linguistics. The opportunities that arise from the study of these corpora are tremendous. They link several formerly distant areas of research, such as history, political science and linguistics, thus opening up new horizons in the understanding of parliamentary data and discourse.
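As an illustration of what transforming recognized text into a structured format can look like, the following minimal Python sketch wraps a few transcribed speeches in XML using only the standard library. The element and attribute names (`session`, `speech`, `speaker`, `date`), the function name and the sample content are hypothetical placeholders chosen for this example; they are not the Team's actual schema or pipeline.

```python
import xml.etree.ElementTree as ET


def transcript_to_xml(session_date, speeches):
    """Wrap (speaker, text) pairs in a simple XML tree.

    The tag names used here are illustrative assumptions,
    not a schema used by the Hellenic OCR Team.
    """
    root = ET.Element("session", attrib={"date": session_date})
    for speaker, text in speeches:
        speech = ET.SubElement(root, "speech", attrib={"speaker": speaker})
        speech.text = text  # special characters such as & are escaped on output
    return ET.tostring(root, encoding="unicode")


# Placeholder data, invented purely for demonstration.
xml_text = transcript_to_xml(
    "2000-01-01",
    [("Speaker A", "First remark."), ("Speaker B", "Reply & follow-up.")],
)
print(xml_text)
```

Once the text is in this form, any XML-aware tool can query it by speaker or date rather than working on an unstructured page of recognized characters.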
The Hellenic OCR Team, originally centered on parliaments, now develops more generic interoperability tools and services, based on open standards and technologies, which are distributed as open source software. These tools are shared under different licenses, such as MIT or GNU GPLv3, depending on the stage of development. Moreover, the Developers group managed to install and locally configure an ISA2 LEOS instance, which was used last semester in an innovative legal informatics lab at the National School of Government.
The Hellenic OCR Team developed a text analytics software called Xtralingua. Now in its second release, it is a tool for extracting quantitative text profiles from large collections of texts, and it was presented at the Digital Humanities 2020 virtual conference. The Team saw the need for this tool because existing solutions were hard to use and required advanced technical knowledge that researchers in Digital Humanities often lack. Hence, an open source working prototype has been developed, which can also be combined with other relevant tools to create an automated pipeline for text analysis.
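To give a rough idea of what a quantitative text profile may contain, the sketch below computes a handful of common surface measures (token count, vocabulary size, type-token ratio, mean word length, mean sentence length) in plain Python. It does not use or reproduce Xtralingua; the function name, the tokenization rule and the choice of measures are assumptions made for this example only.

```python
import re
from collections import Counter


def text_profile(text):
    """Return a few simple quantitative measures for one text.

    Illustrative only: tokens are runs of word characters and
    sentences are split on . ! ? ; (a deliberately naive rule).
    """
    tokens = re.findall(r"\w+", text.lower())  # \w is Unicode-aware, so Greek works too
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]
    counts = Counter(tokens)
    n = len(tokens)
    return {
        "tokens": n,
        "types": len(counts),
        "type_token_ratio": len(counts) / n if n else 0.0,
        "mean_word_length": sum(len(t) for t in tokens) / n if n else 0.0,
        "mean_sentence_length": n / len(sentences) if sentences else 0.0,
    }


# Invented sample sentence, used only to show the shape of the result.
profile = text_profile("The house met. The house voted.")
print(profile)
```

Applying such a function to every document in a collection yields one row of numbers per text, which can then be compared across speakers, periods or genres.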
The Hellenic OCR Team is a versatile, innovative and constantly evolving crowdsourcing platform. It is not a legal entity, receives no external funding and relies solely on the voluntary work of its members. Driven by innovation and scientific excellence, the Team has recently entered the EU transparency register, and it can now influence decision-making processes in its fields of expertise. The Inter-Parliamentary Union considers this approach a good practice (Innovation tracker 3, 2019), and spin-offs are planned in order to implement it in a non-European environment as well.
As the Hellenic OCR Team expands its member base and capabilities, it looks forward to cooperating further with the European Interoperability Framework and the community around it to develop novel concepts, design innovative solutions and implement advanced tools.