Stephen Klein, The Graduate Center, CUNY
An analysis of Archive-It and a rudimentary attempt to transcend its limitations using Webrecorder to preserve digital scholarship.
As modes of communication increasingly migrate from print to digital formats, digital preservation is an imperative within the academy to ensure the continual creation of knowledge and scholarly conversation. However, unlike print media, for which time-tested architecture for protecting and accessing content already exists, in the form of appropriate shelving, knowledge of best climatic conditions and online catalogs, sophisticated solutions for preserving and accessing complex digital objects are still very much in their infancy. Websites, for example, pose a particularly acute challenge; most solutions that preserve websites are not sophisticated enough to capture a website’s full (interactive) functionality, and original content might be lost.
As the Digital Services Librarian at the CUNY Graduate Center, the person responsible for preserving born-digital components of theses and dissertations, I became acutely aware of students’ increasing inclusion of born-digital projects, particularly websites, as elements within dissertation and thesis deposits. Capturing and preserving these digital elements is important in order for the scholarly conversation to continue.
After exploring an array of web archiving options (such as Wget, HTTrack or a local installation of Heritrix), the Graduate Center invested in an Archive-It subscription for several reasons. First, subscribing would relieve us of hosting a server and correctly configuring Heritrix, the standard application that crawls sites and creates WARC (Web ARChive) files. Second, Archive-It hosts the preserved copies (WARC files) of these sites, meaning we would not have to be concerned with backing up the WARCs. In addition, Archive-It allows for easy access to the WARC files via the Wayback Machine.
However, Archive-It does not capture every site faithfully. For example, it was not able to capture many important elements in Gregory Donovan’s My Digital Footprint project.
Here is the original with a timeline:
The archived version does not show the timeline:
The original also contains a SoundCloud embed:
But the archived version does not show the SoundCloud embed:
Another example of an unsuccessful crawl is Heather Spence’s ‘Dolphin Bioacoustics Policy Online Database Project.’
The ability to build complex searches is central to her project. For example, the following query (Sort by <Author> and Order by <Desc>):
Returns the following in the live version of the site:
Allowing users to see a full record by clicking on a title:
However, in the archived version, clicking on a title reveals that the search results were not actually captured by Archive-It’s crawler and the user encounters a ‘Not in Archive’ message:
Because the desired web page is dynamically generated from a database query, Archive-It’s crawler cannot capture it, and the user is shown a ‘Not in Archive’ page instead.
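To illustrate why such pages are missed, here is a deliberately simplified sketch of link discovery, assuming a crawler that only follows `<a href>` links. The sample page, the `/search` path and the field names `sort` and `order` are hypothetical and are not taken from the Dolphin Bioacoustics site. Real crawlers such as Heritrix are more sophisticated than this, but the basic limitation is similar: a URL that exists only after a user submits a query is never handed to the crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, as a naive crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page: one static link and one search form.
SAMPLE_PAGE = """
<a href="/about.html">About</a>
<form action="/search" method="get">
  <select name="sort"><option>Author</option></select>
  <select name="order"><option>Desc</option></select>
  <input type="submit" value="Search">
</form>
"""


def discover_links(html):
    """Return the URLs a link-following crawler would queue from this page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    # Only the static link is found; the results page a user would reach by
    # submitting the form (e.g. /search?sort=Author&order=Desc) is not.
    print(discover_links(SAMPLE_PAGE))
```

In this sketch the crawler queues only `/about.html`. The search results, and any record pages linked from them, stay invisible unless someone actually performs the query during capture.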
Many students are working on mapping projects. For example, for the Graduate Center Digital Praxis course in 2015, a group of students put together this interactive mapping site:
All interactivity is lost in the archived version of the map:
The inability to create a perfect archival copy was the predominant reason motivating us to look for other tools that could supplant or supplement the Graduate Center’s Archive-It subscription. The CUNY Graduate Center Library has decided to supplement our Archive-It captures and playback with Webrecorder WARC files generated by students themselves, although students have yet to send me any WARC files generated with the tool. Because students are the experts in their own work, self-archiving seems logical: they know best how to navigate the most relevant links and interact with their site to capture content.
Webrecorder’s interface is simple and easy to use. Although Webrecorder can be installed locally within a Docker container, for our pilot project the Graduate Center has elected to tell students to use the Webrecorder site itself, which currently offers free individual accounts with up to 10 collections and up to 5GB of storage. It is also possible to record sites anonymously, without creating an account.
Note: The Webrecorder interface is under rapid development, and although the following illustrates the ‘essence’ of the tool, it may not completely reflect its current manifestation.
Note: Anonymous users can create WARC files, but registered users have the advantage of 5GB of archiving storage space.
After registering, logged-in users create a collection with the <New Collection> button.
The <Create a New Collection> dialog appears. Enter the collection name in the <Collection Name> field, decide if you want to make your WARC visible to the public and then click on the <Create> button.
Users are then returned to their Collection Page. On the Collection Page, users can simply click on the <New Recording> button.
Click the <New Recording> button to begin the recording process.
Clicking on the <New Recording> button lets users enter the URL of the website they want to record and cue it to the desired location.
After clicking the start button, the desired webpage will load. Click through the desired interactive elements within the page and all pages that you want to capture. The Webrecorder page header contains useful information: for example, it shows that a recording is in progress, the current size of the recording and the number of pages (bookmarks) captured. When done, click on the <Finish> button.
After clicking on the <Finish> button, the user is returned to the Collection Page allowing the user to view the list of all pages captured, download the collection (WARC file), delete the collection or make the collection public.
To capture a website, the user simply enters the URL for the site and clicks <Record>. If you intend to use Webrecorder for more than one project, enter a recording name in the <New Recording Name> field.
The Rhizome folks provide a concise video tutorial on YouTube on basic Webrecorder usage.
Users can test their WARC by downloading the Web Archive Player from GitHub.
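Alongside playback, a quick way to sanity-check a downloaded WARC is to list the URLs it actually contains. The following is a minimal sketch using only the Python standard library, assuming a well-formed WARC file (optionally gzip-compressed) in which each record has a `Content-Length` header. The function names are my own for illustration and are not part of Webrecorder; for real work, a dedicated WARC library would be a safer choice than this hand-rolled parser.

```python
import gzip


def iter_record_headers(stream):
    """Yield the header fields of each WARC record as a dict (lowercase keys)."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected start of a WARC record, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record body so the next read starts at the next record.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def response_urls(stream):
    """Return the target URI of every 'response' record in the stream."""
    return [
        h["warc-target-uri"]
        for h in iter_record_headers(stream)
        if h.get("warc-type") == "response" and "warc-target-uri" in h
    ]


def response_urls_in_file(path):
    """Open a .warc or .warc.gz file and list the URLs captured in it."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as stream:
        return response_urls(stream)
```

Calling `response_urls_in_file` on a downloaded collection (the file name will vary) gives a list that a student can compare against the pages and interactive views they intended to record.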
From both a scholarly and administrative point of view, more complete preservation of content is clearly a big advantage. We are pleased with the increased proportion of dynamic content that Webrecorder captures in comparison with our experience of Archive-It. However, this tool is in development and at present offers limited functionality. For example, it currently does not automate any part of the process, so it may not scale to very large, data-oriented projects; the user must manually load every page they want to capture. It is a big advantage that Webrecorder is free, but the hosting costs for local storage and long-term preservation of the WARC files created with this method require further investigation.
Evaluating the benefits and drawbacks of these two tools for web archiving, we found that there are tradeoffs with either method. At present, requiring students to use Webrecorder is a great supplemental approach to our Archive-It subscription, primarily because Webrecorder is more effective than Archive-It at capturing dynamic content. For now, asking students to self-archive their own work is manageable, given the small scale of student projects, although manually capturing web content from a large, data-driven project can be time-consuming and may prove unsustainable. It has been encouraging to begin making progress toward capturing the websites that are fast becoming an important part of scholarly conversations.
Appendix: Slideshow Presentation
The author thanks the Rhizome folks, especially Morgan McKeehan (a National Digital Steward), who provided insight and some guidance on the best way to understand, teach and present the potential of Webrecorder.
Stephen I. Klein, the Digital Services Librarian at the Mina Rees CUNY Graduate School Library, spends much of his work-life behind the scenes ensuring that the GC’s library systems continue to work seamlessly for library users. He also spends time ‘freaking-out’ about how quickly our cultural heritage is disappearing, as modern ephemera accelerates with the advent of the web as one of the central forums for popular conversation and academic scholarship.