
The primary goal of this project is to provide full-text search for our web archives. To achieve this, the warc-indexer component parses the (W)ARC files and, for each resource, posts a record to one or more Apache Solr servers. Client-facing tools then allow researchers to query the Solr index and explore the collections.
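
As a flavour of how such a client-facing tool might query the index, here is a minimal SolrJ sketch. The Solr URL, core name ('discovery') and schema fields ('content', 'title', 'url', 'crawl_date') are assumptions for illustration; adjust them to match your actual index.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // Point SolrJ at the core the indexer populated
        // (URL and core name are placeholders).
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/discovery");

        // Full-text query against the extracted text; the field names
        // used here are assumptions about the index schema.
        SolrQuery query = new SolrQuery("content:\"olympic games\"");
        query.setFields("title", "url", "crawl_date");
        query.setRows(10);

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("crawl_date") + "  "
                    + doc.getFieldValue("url"));
        }
    }
}
```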

Overview

The simplest way to use webarchive-discovery is as a command-line tool, as described in the Quick Start.

[WARC & ARC files] -> [indexer] -> [Solr cluster] -> [front-end UI]
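
For example, under a Quick Start setup, a single (W)ARC file can be indexed with an invocation along these lines (the jar version, Solr URL, core name and file paths are illustrative and will vary with your build and setup):

```bash
# Post every record in the WARC file to a running Solr instance.
# The -s option names the target Solr endpoint; paths are examples.
java -jar warc-indexer/target/warc-indexer-2.0.0-jar-with-dependencies.jar \
     -s http://localhost:8983/solr/discovery \
     crawl-data/example.warc.gz
```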

Currently, our experimental front-end is based on Drupal Sarnia, but we are starting to build our own, called shine, that is more suited to the needs of our users.

For moderately sized collections, the stand-alone warc-indexer tool can be used to populate a suitable Solr server. However, we need to index very large collections (tens of terabytes of compressed ARCs/WARCs, containing billions of resources), and so much of the rest of the codebase is concerned with running the indexer at scale. We use the warc-hadoop-recordreaders module to process (W)ARC records in a Hadoop Map-Reduce job that posts the content to the Solr servers.

[WARC & ARC files on HDFS] -> [map-reduce indexer] -> [Solr cluster] -> [front-end UI]
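
At its core, the Map-Reduce job is just a mapper whose input records are (W)ARC records rather than lines of text. The sketch below uses the classic mapred API; the WritableArchiveRecord value type is an assumption about what warc-hadoop-recordreaders provides, and the real indexer mapper does far more (parsing, analysis, and posting to Solr).

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical record type: warc-hadoop-recordreaders wraps each (W)ARC
// record so it can be passed through Map-Reduce as a Writable.
import uk.bl.wa.hadoop.WritableArchiveRecord;

public class WarcIndexMapper extends MapReduceBase
        implements Mapper<Text, WritableArchiveRecord, Text, Text> {

    @Override
    public void map(Text key, WritableArchiveRecord record,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // In the real job, this is where each record would be parsed and
        // turned into a Solr document; here we just emit the record's
        // URL (the key) as a placeholder.
        output.collect(key, new Text("seen"));
    }
}
```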

While search is the primary goal, the fact that we are already parsing every byte makes this a good time to perform any other analysis or processing of interest. We have therefore been exploring a range of additional content properties that can be exposed via the Solr index. These include format analysis (via Apache Tika and DROID), some experimental preservation-risk scanning, link extraction, metadata extraction, and so on (see Features for more details).
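
As a flavour of the format-analysis step, the stand-alone sketch below uses the Apache Tika facade to identify a payload and extract its text. This is illustrative only, not the indexer's actual integration code.

```java
import java.io.File;

import org.apache.tika.Tika;

public class FormatCheck {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File payload = new File(args[0]);

        // Detect the MIME type from the bytes (and file name, if any),
        // rather than trusting the server-supplied Content-Type.
        String mimeType = tika.detect(payload);
        System.out.println(mimeType);

        // Extract the plain text that would feed the full-text index.
        String text = tika.parseToString(payload);
        System.out.println(text.substring(0, Math.min(200, text.length())));
    }
}
```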

IIPC Solr Training Event (Jan 2014)

The schedule for this event is here.

Workshop Activities