Archives Unleashed Jupyter Notebooks
We are excited to introduce our Archives Unleashed Jupyter Notebooks, a prototype method for working with the derivatives generated by the Archives Unleashed Cloud. They allow you to interactively explore and filter the domain count information, extracted full text, and network visualization data generated by the Cloud.
We are currently exploring greater integration between the notebooks and the Archives Unleashed Cloud.
This is still in the prototype stage!
Any and all feedback and suggestions are greatly appreciated and can be sent to Samantha Fritz, our project manager.
You can read more about the thinking behind these Notebooks in our Medium post, “Exploring Web Archival Data through Archives Unleashed Cloud Jupyter Notebooks.”
There are three notebooks: domain analysis, text analysis, and network analysis. Each is discussed below.
Domain analysis is a fairly basic look at the web archive that highlights which domains are included and how often they appear. You can, for example, see how many .com addresses are in the collection, or which domains are over- or underrepresented.
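As a rough illustration of this kind of domain analysis, the sketch below tallies page counts per top-level domain and computes each domain's share of the collection. It uses only the Python standard library. The sample rows and the function names (`count_by_tld`, `share_by_domain`) are hypothetical, and the real derivative's column layout may differ.

```python
from collections import Counter

# Hypothetical (domain, count) rows standing in for a domain-count derivative.
domain_counts = [
    ("example.com", 120),
    ("news.example.org", 45),
    ("archive.example.ca", 30),
    ("shop.example.com", 5),
]


def count_by_tld(rows):
    """Sum page counts per top-level domain (the text after the last dot)."""
    totals = Counter()
    for domain, count in rows:
        tld = domain.rsplit(".", 1)[-1].lower()
        totals[tld] += count
    return totals


def share_by_domain(rows):
    """Return each domain's fraction of all pages in the collection."""
    total = sum(count for _, count in rows)
    return {domain: count / total for domain, count in rows}


tld_totals = count_by_tld(domain_counts)
shares = share_by_domain(domain_counts)

# Top-level domains, most frequent first.
print(tld_totals.most_common())
# Domains sorted by their share of the collection.
print(sorted(shares.items(), key=lambda item: item[1], reverse=True))
```

Comparing the shares against what you expected to be in the collection is one simple way to spot over- or underrepresented domains.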
Text analysis is a popular way to do exploratory analysis of web archive data. The Natural Language Toolkit (nltk) library offers an array of options here. For example, we can see which domains use which words, how words are dispersed across a collection, or the average sentiment of a collection.
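To make the "which domains use which words" idea concrete, here is a minimal sketch that counts words per domain. It deliberately uses only the Python standard library rather than nltk, so it is a simplified stand-in and not the notebook's own approach. The sample rows, the regular-expression tokenizer, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (domain, text) rows standing in for the full-text derivative.
pages = [
    ("example.com", "Web archives preserve the web."),
    ("example.org", "Archives support research on the web."),
    ("example.com", "Research uses archives."),
]


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def words_by_domain(rows):
    """Build a word-frequency Counter for each domain."""
    counts = defaultdict(Counter)
    for domain, text in rows:
        counts[domain].update(tokenize(text))
    return counts


def domains_using(word, counts):
    """List the domains whose pages contain the given word."""
    return sorted(d for d, c in counts.items() if c[word.lower()] > 0)


counts = words_by_domain(pages)

# Most frequent words on one domain.
print(counts["example.com"].most_common(3))
# Domains that mention a given word.
print(domains_using("research", counts))
```

A dedicated library would typically add proper tokenization, stop-word removal, and sentiment scoring on top of counts like these.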
Archives Unleashed Cloud network derivatives already offer solid visualization information out of the box using our GraphPass tool. This includes node sizing (based on degree), positioning (based on the Fruchterman-Reingold algorithm), and coloring (based on Walktrap modularity). That does not mean you cannot use Python libraries like networkx to produce further analyses. For instance, creating an ego network of a particular node in the graph is straightforward.
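An ego network is the subgraph made up of one node (the "ego"), the nodes within a given number of hops of it, and the links among them. The sketch below shows the idea in plain Python so it runs without extra dependencies. The edge list and function names are hypothetical, and links are treated as undirected here, which is a simplifying assumption for a web graph whose links have a direction.

```python
from collections import defaultdict

# Hypothetical (source, target) links standing in for a network derivative.
edges = [
    ("a.com", "b.org"),
    ("a.com", "c.ca"),
    ("b.org", "c.ca"),
    ("c.ca", "d.net"),
    ("d.net", "e.com"),
]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbors (links treated as undirected)."""
    adjacency = defaultdict(set)
    for source, target in edge_list:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def ego_network(edge_list, ego, radius=1):
    """Return (nodes, edges) of the subgraph within `radius` hops of `ego`."""
    adjacency = build_adjacency(edge_list)
    nodes = {ego}
    frontier = {ego}
    for _ in range(radius):
        # Expand one hop outward, keeping only nodes not yet visited.
        frontier = {n for node in frontier for n in adjacency.get(node, set())} - nodes
        nodes |= frontier
    # Keep every original link whose two endpoints are both in the ego network.
    ego_edges = [(s, t) for s, t in edge_list if s in nodes and t in nodes]
    return nodes, ego_edges


nodes, ego_edges = ego_network(edges, "a.com")
print(sorted(nodes))
print(ego_edges)
```

With networkx installed, its `ego_graph` function serves the same purpose on a graph object and is likely the more convenient route in a notebook.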