The Archives Unleashed Toolkit

Where's the documentation?

We’ve reworked our website, and the documentation now lives here.

Introduction

The Archives Unleashed Toolkit is an open-source platform, built on Hadoop, for analyzing web archives. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

Most of this documentation is built on resilient distributed datasets (RDDs). We are working on adding support for DataFrames.

Check out the code on GitHub, along with helpful user documentation, to get started with the Archives Unleashed Toolkit. If you want to hack on the Archives Unleashed Toolkit, check out our API documentation.

The Archives Unleashed Toolkit can also be used in conjunction with Spark Notebooks and Apache Zeppelin.

If you want to learn more about Apache Spark, we highly recommend Spark: The Definitive Guide.