The Archives Unleashed Toolkit

The Archives Unleashed Toolkit is an open-source platform for analyzing web archives. It is built on Apache Spark, which provides powerful tools for analytics and data processing.

Check out the code on GitHub along with helpful user documentation to get you started using the Archives Unleashed Toolkit. If you want to hack on the Archives Unleashed Toolkit, check out our Java and Scala API documentation.

The Archives Unleashed Toolkit can also be used in conjunction with Spark Notebooks and Apache Zeppelin.

If you want to learn more about Apache Spark, we highly recommend Spark: The Definitive Guide.

Citing Archives Unleashed

Your citations help recognize the use of open-source tools for scientific inquiry, grow the web archiving community, and acknowledge the efforts of contributors to this project.

How to cite the Archives Unleashed Toolkit or Cloud in your research:


	Nick Ruest, Jimmy Lin, Ian Milligan, and Samantha Fritz. 2020. The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL ’20). Association for Computing Machinery, New York, NY, USA, 157–166. DOI: https://doi.org/10.1145/3383583.3398513