Digging into WARCs - Hands-On with the Archives Unleashed Toolkit
Welcome to “Digging into WARCs: Hands-On with the Archives Unleashed Toolkit.” The Archives Unleashed Toolkit, or AUT, is an open-source platform, built on Apache Spark, for managing and analyzing web archives.
This is a hands-on introductory workshop with the Archives Unleashed Toolkit and Archives Unleashed Jupyter Notebooks. No prior technical knowledge is needed; the workshop is aimed at beginner and intermediate users alike.
This workshop is being held on Tuesday, June 18th from 14:00-17:30 in room BG2 0.02 at the “Web that Was: Archives, Traces, Reflections” conference.
|1400 - 1410|Introductions, Getting Settled|
|1410 - 1430|Introduction to the Archives Unleashed Toolkit (and related project)|
|1430 - 1530|Hands-on with the Archives Unleashed Toolkit, Gephi, etc.|
|1530 - 1600|Coffee Break|
|1600 - 1630|More Advanced Analytics (DataFrames, etc.)|
|1630 - 1715|Digging into WARCs with Jupyter Notebooks|
|1715 - 1730|Wrap Up|
You can run the Jupyter Notebooks in your browser at https://mybinder.org/v2/gh/archivesunleashed/auk-notebooks/master. Have fun!
What should you bring? If you want to dig into WARCs yourself, you’ll need a laptop. If not, we will be working through exercises collectively, so you are more than welcome to participate in that manner too.
If you plan to participate in the hands-on components, please complete the following homework in advance to reduce load on the conference WiFi:
- Install Docker for Windows or Mac: https://archivesunleashed.org/aut/docker-install/
- Install the Anaconda Distribution for your platform: https://www.anaconda.com/distribution/. If you are comfortable with the command line, try installing our Notebooks using the instructions under Local (Anaconda) here: https://github.com/archivesunleashed/auk-notebooks. If not, don’t worry!
- Finally, please download and install Gephi for your platform: https://gephi.org.
While we will provide sample data as part of the workshop, you may want to try it out with your own data. If you have some small WARCs you can bring those, or alternatively you can crawl a few websites with https://webrecorder.io and export the WARC(s).
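If you do bring your own files, a quick sanity check before the workshop can save time. The following is a minimal sketch, not part of the Archives Unleashed Toolkit: it uses only the Python standard library, and the helper names `looks_like_warc` and `size_in_mb` are hypothetical. It assumes that a WARC file, once any gzip compression is removed, begins with a record header line starting with `WARC/`.

```python
import gzip
from pathlib import Path


def looks_like_warc(path):
    """Return True if the file's first bytes look like a WARC record header.

    Hypothetical helper for illustration; it checks only the first record's
    version line and does not validate the rest of the file.
    """
    path = Path(path)
    with open(path, "rb") as raw:
        magic = raw.read(2)
    # Gzip streams start with the bytes 0x1f 0x8b; WARCs are often
    # distributed compressed as .warc.gz files.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as handle:
        return handle.read(5) == b"WARC/"


def size_in_mb(path):
    """Return the on-disk file size in mebibytes."""
    return Path(path).stat().st_size / (1024 * 1024)
```

For example, calling `looks_like_warc("my-crawl.warc.gz")` on an export (the file name here is a placeholder) should return `True`, and `size_in_mb` gives a rough sense of whether the file is small enough to work with comfortably on a laptop.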