About the Archives Unleashed Project

The Archives Unleashed Project aims to make petabytes of historical internet content accessible to scholars and others interested in researching the recent past.

The project grew out of a series of datathons held at the University of Toronto, Library of Congress, Internet Archive, and the British Library, as our team recognized the need for better analytics tools, community infrastructure, and accessible web archival interfaces. Supported by a grant from The Andrew W. Mellon Foundation, the team developed web archive search and data analysis tools from 2017 to 2020.

We continue our work to enable scholars, librarians, and archivists to access, share, and investigate recent history since the early days of the World Wide Web.

Starting in June 2020, the project entered a second phase with the support of a second grant from The Andrew W. Mellon Foundation. The principal aim of this phase is to extend and sustain Archives Unleashed by integrating the Archives Unleashed Cloud with the Internet Archive’s Archive-It service.

Over three years, the team will pursue two priorities:

We will integrate the Archives Unleashed Cloud with the Internet Archive’s Archive-It Service.
We will launch the Archives Unleashed Cohorts program to facilitate researcher engagement with web archives.

Citing Archives Unleashed

Your citations help further recognize the use of open-source tools for scientific inquiry, assist in growing the web archiving community, and acknowledge the efforts of contributors to this project.

How to cite the Archives Unleashed Toolkit or Cloud in your research:


	Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513

Project Team


	Ian Milligan is Professor of history at the University of Waterloo, where he is also Associate Vice-President, Research Oversight and Analysis. Since 2012, he has been engaged in building tools, infrastructure, and frameworks to facilitate the historical use of web archives. In 2016, he was awarded the Canadian Society for Digital Humanities’ Outstanding Early Career Award.
	Jefferson Bailey is Director of Web Archiving & Data Services at the Internet Archive. Bailey joined the Internet Archive in 2014 and manages a range of the Internet Archive’s services for web archiving, digital preservation, data management, and computational research.
	Nick Ruest is an Associate Librarian at York University. He is dedicated to building systems to ensure that valuable historical and cultural materials are preserved and made universally accessible.
	Jimmy Lin is the David R. Cheriton Chair in the David R. Cheriton School of Computer Science at the University of Waterloo. His research aims to build tools that help users make sense of large amounts of data. He works at the intersection of information retrieval, natural language processing, and databases, with a focus on large-scale distributed algorithms and infrastructure for data analytics.
	Helge Holzmann is Web Data Engineer at the Internet Archive. In this role, he directly supports computational research services, is the primary developer of the ArchiveSpark data processing library (an Apache Spark-based platform for web archive processing and extraction), and is the tech lead for research and development on datasets and analytics for the web archive within Internet Archive and related services in Archive-It. His Ph.D. from Leibniz University Hannover was entitled “Concepts and Tools for the Effective and Efficient Use of Web Archives” and outlined technical approaches to data mining web archives.
	Samantha Fritz is the project manager for the Archives Unleashed Team. She is an information management professional with a passion for open access, information literacy education, and helping people connect with and make sense of data. Samantha is driven to support information and resource dissemination to positively transform the way researchers view, experience, interpret and share information and knowledge. She has worked with organizations such as Ryerson University’s Social Media Lab, the Islandora Foundation, and Dalhousie University Libraries on digitization and data visualization projects.
	Kody Willis leads product operations for the Archiving & Data Services department of the Internet Archive. He and his team support the department’s growing number of web archiving and digital preservation services. In his collaboration with the Archives Unleashed Project, Kody ensures service users have efficient, reliable access to support.
	Alex Dempsey is an engineering manager at the Internet Archive. In his collaboration with the Archives Unleashed Project, Alex supports infrastructural setup and configuration. Additionally, he contributes his experience in backend/frontend development while providing an excellent eye for functional and accessible graphic design elements.

The Archives Unleashed Team would like to recognize former collaborators for their contributions to the project:

Peggy Lee, Web Archiving & Data Services, Internet Archive (2021-2022)
Sarah McTavish, Department of History, University of Waterloo (2018-2021)
Rebecca MacAlpine, Department of History, University of Waterloo (2019-2020)
Tobi Adewoye, David R. Cheriton School of Computer Science, University of Waterloo (2019-2020)
Xiao Han, David R. Cheriton School of Computer Science, University of Waterloo (2019-2020)
Gursimran Singh, David R. Cheriton School of Computer Science, University of Waterloo (2019-2020)
Hsiu-Wei Yang, David R. Cheriton School of Computer Science, University of Waterloo (2018-2019)
Linqing Liu, David R. Cheriton School of Computer Science, University of Waterloo (2018-2019)
Borui Lin, David R. Cheriton School of Computer Science, University of Waterloo (2018)
Jeremy Wiebe, Department of History, University of Waterloo (2018-2019)

Finally, we would like to recognize our former postdoctoral fellow Ryan Deschamps (2017-2019), who provided invaluable support to this project, especially in the development of GraphPass.

Advisory Board


Matthew Weber is an Associate Professor in the Department of Communication, School of Communication and Information at Rutgers University. He is Principal Investigator on a National Science Foundation grant that aims to develop new methods and new collaborations for conducting research utilizing Internet Archive data. Weber’s grant works with more than 50 TB of archived Internet data, testing and publishing scripts for transforming archived Internet data into formats that are compatible with existing social science computing packages such as R and SPSS. Weber has related funding from the Democracy Fund, Institute of Library and Information Science, and the William T. Grant Foundation.
Michele Weigle is a Professor of Computer Science at Old Dominion University. Her research interests include digital preservation, web science, information visualization, and mobile networking. Since 2012, she has been PI or Co-PI on over $2M in funding for research related to web archiving from NSF, NEH, IMLS, and The Andrew W. Mellon Foundation. Dr. Weigle received her Ph.D. in computer science from the University of North Carolina at Chapel Hill in 2003.
Robert H. McDonald is the Dean of Libraries at the University of Colorado Boulder. His research interests include technology management and integration of lean and agile frameworks, data preservation, learning ecosystems, data cyberinfrastructure, and big data analytics. Robert frequently presents and writes on a variety of topics and was editor of the E-Content column for EDUCAUSE Review in 2016 – 2017. He is active professionally with a number of national and international organizations and conferences, serving on the HathiTrust Program Steering committee, as the chair for the Digital Preservation Network Heavy Users committee, and as general co-chair for the ACM/IEEE Joint Conference on Digital Libraries in 2013 and 2017.
Jane Winters is Professor of Digital Humanities at the School of Advanced Study, University of London. Her current and past research projects include the Marie Curie Innovative Training Network CLEOPATRA, Digging into Linked Parliamentary Data, Big UK Domain Data for the Arts and Humanities, Traces through Time: Prosopography in Practice across Big Data, and the Thesaurus of British and Irish History as SKOS. Her research interests include digital history, born-digital archives (particularly the archived web) and open access publishing. Jane is a Fellow and Councillor of the Royal Historical Society.
Sylvain Bélanger has been Director General of the Digital Operations and Preservation Branch for Library and Archives Canada since February 2014. In this role, Sylvain is responsible for leading and supporting LAC’s digital business operations, and all aspects of preservation for digital and analog collections. Sylvain is also lead for LAC’s digital transformation activities. Prior to accepting this role, Sylvain had been Director of the Holdings Management Division since 2010, and previously Corporate Secretary and Chief of Staff for Library and Archives Canada. Sylvain is Treasurer of the International Internet Preservation Consortium, a member of the IFLA Standing Committee on Conservation and Preservation, LAC’s representative at the Government of Canada’s Enterprise Architecture Review Board, and until recently was a member of the Canadian Association of Research Libraries’ Digital Preservation Working Group, among other roles.
Nicholas Taylor is the Deputy Group Leader for Technology Strategy and Services at the Los Alamos National Laboratory Research Library. In this role, he oversees IT research and development efforts focused on digital repository services, scholarly publishing, and applied information science. Prior to Los Alamos National Laboratory, he managed and supported digital library, digital preservation, library technology, and web archiving programs at Stanford University, the Library of Congress, and the U.S. Supreme Court Library. He possesses an M.A. in Communication, Culture, and Technology from Georgetown University and an M.L.S. from the University of Maryland, College Park.

The Archives Unleashed Team would like to recognize the members who sat on the 2017-2020 Advisory Board for their contributions to the project.

Funding

The work of this project is primarily supported by The Andrew W. Mellon Foundation, and the Archives Unleashed Project is grateful for the generous financial and in-kind support of several institutions.

Code of Conduct

Our Pledge

The Archives Unleashed Project believes in supporting an open, inclusive, and diverse community which respects the experience, expertise, and knowledge of all community members.
The Archives Unleashed community is dedicated to providing a harassment-free experience for everyone, and welcomes individuals regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
To foster respectful collaborations, this code of conduct applies to all Archives Unleashed spaces, including, but not limited to, GitHub, Slack, Medium, social media platforms and meeting spaces, both online and off.
Anyone who violates this code of conduct may be sanctioned or expelled from these spaces at the discretion of the Archives Unleashed Project Team.

Our Standards

Examples of behaviour that contributes to creating a positive environment include:

Using welcoming and inclusive language
Being respectful of differing viewpoints and experiences
Gracefully accepting constructive criticism
Focusing on what is best for the community
Showing empathy towards other community members

Examples of unacceptable behaviour by participants include:

The use of sexualized language or imagery and unwelcome sexual attention or advances
Trolling, insulting/derogatory comments, and personal or political attacks
Public or private harassment
Publishing others’ private information, such as a physical or electronic address, without explicit permission
Other conduct which could reasonably be considered inappropriate in a professional setting

Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable behaviour and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behaviour.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviours that they deem inappropriate, threatening, offensive, or harmful.

Scope

This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project email address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behaviour may be reported by contacting the project team. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project’s leadership.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at http://contributor-covenant.org/version/1/4

Privacy Policy

This Privacy Policy is effective January 2019

We recognize the importance of and are committed to protecting the privacy of all users. The following privacy policy guides and outlines the Archives Unleashed Project’s (AU) online information practices.

The Archives Unleashed Privacy Policy describes what users can expect from Archives Unleashed as to how information is collected, used and shared. This policy applies to all information collected from or submitted to the Archives Unleashed Project, including the Archives Unleashed Cloud portal located at cloud.archivesunleashed.org and all related sub-domains.

By accessing and using the services and products of the Archives Unleashed Project, users accept the practices outlined in the policy below.

Information Collection and Use

We request the minimum amount of personal information necessary for the operation of Archives Unleashed software. We collect several types of information for various purposes to provide and improve our service to you.

Personal Data

While using our service, we may ask you to provide us with certain personally identifiable information that can be used to contact or identify you.

Archives Unleashed Service	Information Required	Required	Collected by AU	Explanation
Archives Unleashed Cloud	GitHub/Twitter Credentials	Yes	Yes	Your GitHub/Twitter username and password are used to authenticate. We do not have access to any information, except your username. As of 30 June 2021, the Cloud is no longer available.
	Email	Yes	Yes	We need a point of contact to connect with you.
	Name	Yes	Yes	We like to know who our users are.
	Institution	Yes	Yes	We like to know a bit about our users: where they’re from, what sort of user they are, so we can best focus and refine our services.
	Archive-It Account Information	Yes	Yes	Credentials are necessary to connect Archive-It to the Cloud. This information is encrypted with a salt in our database.
Archives Unleashed Toolkit	None	No	No	The Toolkit is used locally, so we don’t need any information.
Newsletter	Email Address	Yes	Yes	An email address is required so we can send you our quarterly newsletter. This service is opt-in and is maintained by Mailchimp. Users have the option to opt out at any time.
	First name (optional)	No	Yes	It’s always nice when we can personalize a message to our subscribers and get to know them. This information is ONLY collected if provided.
	Last name (optional)	No	Yes
Slack	Slack URL	Yes	No	To join our Slack channel, you’ll need to input this information. All we will see is when you post to any of our channels or direct message our team.
GitHub	Username	Yes	No	When anyone follows, watches, or contributes to any of the Archives Unleashed GitHub Repositories, we are able to see usernames, but participation is voluntary and we do not collect any information.

Usage Data

We may also collect information on how our services are accessed and used. In some cases, reports are generated by the applications we’ve subscribed to, such as Slack, Mailchimp and GitHub. These reports include general usage statistics and descriptive data.

Archives Unleashed Service	Usage Data Collected
Archives Unleashed Cloud	Apache Spark logs showing timestamps to produce derivative files; name and size of your Archive-It collections; username and institution. As of 30 June 2021, the Cloud is no longer available.
Archives Unleashed Toolkit	Not Applicable; application is run locally
Newsletter	General stats that help us understand how people are interacting with our newsletters: number of subscribers, number of opt-outs, audience growth, open/click rates, campaign performance, email clients used, locations (general, not specific).
Slack	Statistical reports are provided to Slack administrators to understand our total number of users, and when users become inactive.
GitHub	GitHub provides insights into the public repositories maintained by the Archives Unleashed Project, which help us understand the work being done and who our contributors are.

Tracking & Cookies Data

We use a session cookie to keep you logged in on our service.

Cookies are files with a small amount of data which may include an anonymous unique identifier. They are sent to your browser from a website and stored on your device. The Archives Unleashed Cloud uses a session cookie, which allows you to stay logged in between visits to the page.

No personal or identifying information is collected while using cookies, and you can always opt out by changing your browser settings or, permanently, by using a browser plugin. If you do not accept cookies, you may encounter some minor issues when using our service.

Use of Information Collected

In accessing Archives Unleashed tools and services, collected information is used for a variety of purposes:

To provide and maintain Archives Unleashed services
To notify users about changes to our policies or immediate news about our services (e.g. systems interruptions, maintenance)
To allow you to participate in interactive features of our services when you choose to do so, such as our newsletter, Slack channel, or within our GitHub repositories
To provide user support by responding to inquiries and requests
To provide analysis or valuable information so that we can improve services
To monitor the usage of the services
To detect, prevent and address technical issues

Security of Information

We do not sell or provide access to any user information to third parties. We will NOT share any of your information without consent.

As mentioned before, we do not store or have access to your authenticating credentials for GitHub or Twitter. Authentication with these applications is used to authenticate to the Archives Unleashed Cloud. Archive-It credentials are supplied over HTTPS and are salted and encrypted.

Links To Other Sites

We are strongly committed to serving and participating in open-source communities, which is why you will find that our services reference and link out to other projects and resources that are not operated by us.

When you click on a third-party link, you will be directed to a site outside of the Archives Unleashed Project. We recommend you review the privacy policy for those sites to fully understand their information practices. We have no control over and assume no responsibility for the content, privacy policies, or practices of any third-party sites or services.

Changes to this Privacy Policy

Any updates to our Privacy Policy will be posted to this page. We will also let users know via email and/or a prominent notice prior to the change becoming effective, and we will update the “effective date” at the top of this Privacy Policy. Changes to this Privacy Policy are effective when they are posted on this page.

Contact Us

General questions or concerns about the Archives Unleashed Privacy Policy should be directed to our Project Manager, Samantha Fritz.

Acknowledgements

We would like to acknowledge that this privacy policy is inspired by work done by:

Free Privacy Policy Template Website.
University of Waterloo, Website privacy statement.
IIPC, Privacy Policy.
GitHub, Privacy Statement.
Wikimedia, Privacy Policy.