The Web today plays a crucial role in our information society: it provides information and services for seemingly all domains, and it reflects all types of events, opinions, and developments within society, science, politics, environment, business, etc. Due to the central role the World Wide Web plays in today's life, its continuous growth, and its rate of change, adequate Web archiving has become a cultural necessity for preserving knowledge. Consequently, a strong and growing interest in Web archiving can be observed among library and archival organizations as well as emerging industrial services.
However, web preservation is a very challenging task. In addition to the “usual” challenges of digital preservation (media decay, technological obsolescence, authenticity and integrity issues, etc.), web preservation has its own unique difficulties:
- distribution and temporal properties of online content, with unpredictable aspects such as transient unavailability,
- rapidly evolving publishing and encoding technologies, which challenge the ability to capture web content in an authentic and meaningful way that guarantees long-term preservation and interpretability,
- the huge number of actors (organizations and individuals) contributing to the web, and the wide variety of needs that web content preservation will have to serve.
The aim of the European-funded project LiWA (IST FP7 216267) is to create innovative methods and services for Web content capture, preservation, analysis and enrichment.
The LiWA project, started in February 2008 and led by the L3S Research Center at the Gottfried Wilhelm Leibniz University Hannover, brings together a consortium of highly qualified researchers (L3S, Max Planck Society, Hungarian Academy of Sciences), industrial users (European Archive, Hanzo Archives), and archiving organizations (Sound and Vision, National Library of the Czech Republic, Moravian Library). The project partners intend to turn Web archives from pure Web page stores into “Living Web Archives”. To create Living Web Archives, the LiWA project addresses R&D challenges in three areas: Archive Fidelity, Archive Coherence and Archive Interpretability:
- Archive Fidelity: development of effective approaches and methods for capturing all types of Web content including the Hidden and Social Web content, for detecting capturing traps as well as for filtering out Web spam and other types of noise in the Web capturing process.
- Archive Coherence: development of methods for dealing with issues of temporal Web archive construction, for identifying, analysing and repairing temporal gaps, as well as methods for enabling consistent Web archive federation to foster synergies between Web archiving stakeholders.
- Archive Interpretability: development of methods for ensuring the accessibility and long-term usability of Web archives, especially taking into account the evolution in terminology and conceptualization of a domain.
Two application scenarios are developed in LiWA to illustrate how these technologies can be used in real-world settings whose scope is wider than what LiWA specifically addresses.
LiWA technology for content and context in Sound and Vision archive
For audio-visual archives, the Web is regarded as a valuable source of contextual information that relates to their audio-visual collections. Context information is relevant both for documentalists and for other users interested in a specific broadcast or a broadcasting-related topic, such as journalists, teachers or researchers. Typically, these users have to search each source through a different interface. Ideally, Sound and Vision would provide these users with a single interface that allows searching both the digital asset management system of the AV archive (iMMix) and related web content. The LiWA application Streaming demonstrates how potential end users (both broadcast professionals and the general public) could access broadcast-related web content. In addition, the archived content will be used as test data for the development of the Sound and Vision context data platform, which specifically addresses the linking of web context to the digital asset management system of Sound and Vision. Starting from a mock-up created in the first phase of the project, a first working prototype of the application allows users to search archived broadcast-related web pages and play back audiovisual content, based on a limited number of crawls.
Social web application
Social web sites typically contain highly inter-linked content and use dynamic linking, widgets and tools, as well as a high degree of personalisation. Capturing social web sites is extremely challenging and cannot be fully achieved using current methods and tools. The social web thus represents one of the greatest challenges in web archiving.
The aim of the application is to show how the LiWA technology fits into the workflow of an active Web archiving institution, by considering a real-life scenario at the National Library of the Czech Republic. The application is designed as a set of independent modules developed in LiWA. The modules can be readily integrated with existing Web archiving workflow management tools. A Web archiving institution can choose to deploy all of the modules or just some of them, depending on its needs and particular workflow. The application is designed to be generic and can be used to enhance the archiving of any type of web content, not just the social web.
Main results, benefits and impacts
The most direct impact of the LiWA project results, and of the capture, preservation, analysis, and enrichment services developed in this project, is an improvement in the quality, comprehensiveness and accessibility of Web archives. These project results help organizations and people to master Web content, including its temporal perspective and dynamics. Furthermore, advanced Web archiving technologies, such as those developed in LiWA, clearly contribute to preserving Web content over time. The developed methods and technologies prepare Web content for long-term preservation by ensuring the fidelity of the archive to the original content, by improving the coherence of the archived content, and by making provisions for its long-term interpretability through methods for dealing with semantic evolution.
In general, Web archiving technologies and their further evolution are important for many memory institutions. For memory institutions such as museums, which themselves create Web content as part of their service portfolios (e.g. virtual exhibitions), the technologies developed in LiWA ease the archiving of such content and open up new ways of using the archived content (e.g. in later exhibitions, research, education, etc.). Other memory institutions, such as National Libraries, have an explicit archiving mandate which also includes Web content. For these institutions, the impact of improved Web archiving technologies is even greater.
As a further impact of the project, adoption of the LiWA services will enable institutions and companies involved in Web archiving to provide better services and richer service portfolios to their clients. This may also provide additional motivation for institutions with an archiving mandate, like National Libraries and Archives, to invest more eagerly and successfully in Web archiving activities, since a clearer added value can be achieved.
Information on the lessons learnt section will be added by the concerned authors in the upcoming weeks.
Scope: International, Pan-European