AUT Project

Our goal is to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

About the Project

Archives Unleashed aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past. Supported by a grant from the Andrew W. Mellon Foundation, we will be developing web archive search and data analysis tools to enable scholars and librarians to access, share, and investigate recent history since the early days of the World Wide Web.

Growing out of a series of datathons held at the University of Toronto, Library of Congress, Internet Archive, and the British Library, our team recognized the need for better analytics tools, community infrastructure, and accessible web archival interfaces.

The three-year Archives Unleashed project has three major thrusts. First, the project will build a software toolkit that applies modern big data analytics infrastructure to scholarly analysis of web archives. Second, the toolkit will be deployed in a cloud-based environment that will provide a one-stop portal for scholars to ingest their collections and execute a number of analyses with the click of a mouse. Finally, datathons (or hackathons) will build a cohesive and sustainable user community by bringing the core project team members together with librarians, archivists, and other interested researchers.

Stay tuned for more information!

Software

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives.

Warclight is a Project Blacklight-based Rails engine that supports the discovery of web archives held in the WARC and ARC formats.

Get Involved!

Join our Slack team if you want to see how things are developing, to discuss suggestions or other parts of our project, or to just shoot the breeze about all things web archiving.

Project Team

Principal Investigators

Ian Milligan is associate professor of history at the University of Waterloo. Since 2012, he has been engaged in building tools, infrastructure, and frameworks to facilitate the historical use of web archives. In 2016, he was awarded the Canadian Society for Digital Humanities’ Outstanding Early Career Award.

Nick Ruest is the Digital Assets Librarian at York University. He is Co-PI of both the WALK project and the Social Sciences and Humanities Research Council of Canada Insight Grant with Milligan. Ruest is dedicated to building systems to ensure that valuable historical and cultural materials are preserved and made universally accessible. He has been Release Manager for five releases of Islandora and two releases of Fedora, both open-source digital asset management systems. He is Project Director for Islandora CLAW and leader of the Fedora Import-Export Initiative.

Jimmy Lin is the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo. His research aims to build tools that help users make sense of large amounts of data. He works at the intersection of information retrieval, natural language processing, and databases, with a focus on large-scale distributed algorithms and infrastructure for data analytics.

Project Staff

Ryan Deschamps is a postdoctoral fellow in the University of Waterloo's Department of History, working under Milligan. He completed his dissertation at the Johnson Shoyama Graduate School of Public Policy (University of Regina), studying the role of social media in policy agendas. His postdoctoral research focuses on the influence of digital information on the interpretation of Canadian historical events. Ryan's position is funded by the Social Sciences and Humanities Research Council and the David R. Cheriton Chair at the David R. Cheriton School of Computer Science.

Advisory Board

Jefferson Bailey is Director of Web Archiving at Internet Archive. Jefferson joined Internet Archive in Summer 2014 and manages Internet Archive’s web archiving services, including Archive-It, used by over 450 institutions to preserve the web. He also oversees contract domain-scale web archiving services for national libraries and archives around the world, including Library of Congress, NARA, and foreign national libraries. He works closely with partner institutions on technology development, web data research services, educational partnerships, and other programs. He is PI on multiple grants focused on systems interoperability, data-driven research use of web archives, and digital preservation. Prior to Internet Archive, he worked on strategic initiatives, digital collections, and digital preservation at institutions such as Metropolitan New York Library Council, Library of Congress, Brooklyn Public Library, and Frick Art Reference Library and has worked in the archives at NARA, NASA, and Atlantic Records. He is currently Vice Chair of the International Internet Preservation Consortium. He has an MLIS in Archives from University of Pittsburgh and a BA in English from Oberlin College.

Nathalie Casemajor is an Assistant Professor in the Urbanisation Culture Société Research Centre at INRS (Institut national de la recherche scientifique, Montreal). Her work focuses on culture, territories, and communities, as well as digital culture. She was previously an Assistant Professor at the Université du Québec en Outaouais, a Postdoctoral Fellow at McGill University (Department of Art History and Communication Studies), and a Visiting Scholar at New York University (Department of Media, Culture and Communication).

Robert H. McDonald is Associate Dean, Research and Technology Strategies at Indiana University. McDonald works to provide library information system services and discovery services to the entire Indiana University system and manages projects related to scholarly communications, new model publishing, and technologies that enable the Libraries to support teaching and learning for the IU Bloomington campus. In his role as Deputy Director of the Data to Insight Center, he works on new research related to large data analysis, storage and preservation through grant-funded and collaborative projects such as the HathiTrust Research Center. He also serves as the Data Steward for the IU Libraries. His research interests include technology management and integration of lean and agile frameworks, data preservation, learning eco-systems, data cyberinfrastructure, and big data analytics. Robert frequently presents and writes on a variety of topics, and was editor of the E-Content column for EDUCAUSE Review in 2016 – 2017. He is active professionally with a number of national and international organizations and conferences, serving on the HathiTrust Program Steering committee, as the chair for the Digital Preservation Network Heavy Users committee, and as general co-chair for the ACM/IEEE Joint Conference on Digital Libraries in 2013 and 2017.

Matthew Weber is an Associate Professor at Rutgers University. He is Principal Investigator on a National Science Foundation grant that aims to develop new methods and new collaborations for conducting research using Internet Archive data. The grant-funded project works with more than 50 TB of archived Internet data, testing and publishing scripts for transforming that data into formats that are compatible with existing social science computing packages such as R and SPSS. Weber has related funding from the Democracy Fund, Institute of Library and Information Science and the William T. Grant Foundation.

Michele Weigle is an Associate Professor of Computer Science at Old Dominion University. Her research interests include digital preservation, web science, information visualization, and mobile networking. Since 2012, she has been PI or Co-PI on over $2M in funding for research related to web archiving from NSF, NEH, IMLS, and the Andrew W. Mellon Foundation. Dr. Weigle received her PhD in computer science from the University of North Carolina at Chapel Hill in 2003.

Nicholas Worby is the Government Information and Statistics Librarian as well as the Web Archives Program Coordinator at the University of Toronto. In addition to providing research and instruction support for government information and statistics, he oversees collection development, production crawls and researcher outreach for web archive collections.

Contact Us!

Questions? Comments? Please contact us, either by leaving an issue on one of our GitHub projects or by sending us an e-mail.

This work is primarily supported by the Andrew W. Mellon Foundation. Additional funding for the Toolkit has come from the U.S. National Science Foundation, Columbia University Library's Mellon-funded Web Archiving Incentive Award, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, and the Ontario Ministry of Research and Innovation's Early Researcher Award program. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.