Digging into WARCs - Hands-On with the Archives Unleashed Toolkit

AUK Notebook screenshot

Workshop Description

Welcome to “Digging into WARCs: Hands-On with the Archives Unleashed Toolkit.” The Archives Unleashed Toolkit, or AUT, is an open-source platform, built on Apache Spark, for managing and analyzing web archives.

This is a hands-on introductory workshop on the Archives Unleashed Toolkit and the Archives Unleashed Jupyter Notebooks. No prior technical knowledge is needed; the workshop is aimed at beginner and intermediate users alike.

This workshop is being held on Tuesday, June 18th, from 14:00 to 17:30 in room BG2 0.02 at the “Web that Was: Archives, Traces, Reflections” conference.

Workshop Schedule

Time           Content
14:00 - 14:10  Introductions, Getting Settled
14:10 - 14:30  Introduction to the Archives Unleashed Toolkit (and related project)
14:30 - 15:30  Hands-on with the Archives Unleashed Toolkit, Gephi, etc.
15:30 - 16:00  Coffee Break
16:00 - 16:30  More Advanced Analytics (DataFrames, etc.)
16:30 - 17:15  Digging into WARCs with Jupyter Notebooks
17:15 - 17:30  Wrap Up

Jupyter Notebooks

You can run the Jupyter Notebooks in your browser at https://mybinder.org/v2/gh/archivesunleashed/auk-notebooks/master. Have fun!


What should you bring?

If you want to dig into WARCs yourself, you’ll need a laptop. If not, we will be working through the exercises collectively, so you are more than welcome to participate that way too.

If you are planning to participate in the hands-on components, we would also like you to do the following homework, which will reduce load on the conference WiFi:

While we will provide sample data as part of the workshop, you may want to try the toolkit with your own data. If you have some small WARCs, you can bring those; alternatively, you can crawl a few websites with https://webrecorder.io and export the WARC(s).


If you have any questions, please contact Nick Ruest or Ian Milligan.
