Temporally Extending Existing Web Archive Collections for Longitudinal Analysis
By: Lesley Frew, Michael L. Nelson, Michele C. Weigle
Potential Business Impact:
Shows how government websites changed over time.
The Environmental Data and Governance Initiative (EDGI) regularly crawled US federal environmental websites between 2016 and 2020 to capture changes between two presidential administrations. However, because the collection does not cover the previous administration ending in 2008, it is unsuitable for answering our research question: "Were the website terms deleted by the Trump administration (2017–2021) added by the Obama administration (2009–2017)?" Thus, like many researchers using the Wayback Machine's holdings for historical analysis, we do not have access to a complete collection that suits our needs. To answer our research question, we must extend the EDGI collection back to January 2008. This means not only going further back in time with the existing pages, but also discovering relevant pages that persisted through 2020 yet were not included in the EDGI collection. We pieced together artifacts collected by various organizations for their own purposes and through many means (Save Page Now, Archive-It, and more) to curate a dataset sufficient for our analysis. In this paper, we contribute a methodology for temporally extending existing web archive collections to enable longitudinal analysis, along with a dataset extended using this methodology. We use the new dataset to answer our research question. We find that 81 percent of the pages in the dataset changed between 2008 and 2020, and that for 87 percent of the pages with terms deleted by the Trump administration, the deleted terms had been added during the Obama administration.
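As a rough illustration of what extending a collection back in time can involve, the sketch below builds a query against the Wayback Machine's public CDX endpoint for one page over 2008 to 2020 and then checks which years have no capture. This is not the authors' pipeline: the function names, the example URL `epa.gov/climatechange`, and the one-capture-per-year sampling rule are all assumptions made for illustration only.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, start_year=2008, end_year=2020):
    """Build a CDX query URL listing successful captures of `url`.

    Hypothetical helper; the year range mirrors the abstract.
    """
    params = {
        "url": url,
        "from": str(start_year),
        "to": str(end_year),
        "output": "json",            # JSON array of rows, header row first
        "filter": "statuscode:200",  # keep only HTTP 200 captures
        "fl": "timestamp,original,statuscode",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def first_capture_per_year(rows):
    """Map each year to the earliest capture timestamp in CDX JSON rows.

    `rows` is the parsed JSON response: a header row followed by records.
    Sampling one capture per year is an assumption of this sketch.
    """
    if not rows:
        return {}
    header, *records = rows
    ts_idx = header.index("timestamp")
    per_year = {}
    # 14-digit timestamps sort chronologically as strings.
    for record in sorted(records, key=lambda r: r[ts_idx]):
        year = int(record[ts_idx][:4])
        per_year.setdefault(year, record[ts_idx])
    return per_year


def missing_years(per_year, start_year=2008, end_year=2020):
    """List years in the range with no capture, i.e. gaps to fill elsewhere."""
    return [y for y in range(start_year, end_year + 1) if y not in per_year]


if __name__ == "__main__":
    # Hypothetical example page; fetching the URL is left to the reader.
    print(build_cdx_query("epa.gov/climatechange"))
```

Years reported by `missing_years` are those for which captures would need to be located in other collections, or for which the page would have to be excluded from a longitudinal comparison.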
Similar Papers
Longitudinal Sampling of URLs From the Wayback Machine
Digital Libraries
Shows how long web pages stay online.
Examining persistence of European open repository infrastructure and its diffusion in the scholarly record
Digital Libraries
Finds old research papers that are now lost.