Package it.unimi.dsi.law.warc.io

Provides classes performing low and high level WARC I/O (for format details, please see the ISO draft).

This code is designed to be very efficient. In particular some choices are worth mentioning:

Object reuse

Whenever possible, a single object must be used to perform multiple operations.

For instance, to read a sequence of WARC records, one must instantiate a single WarcRecord object and repeatedly invoke its read method, as opposed to creating a new WarcRecord object for every record read.

The convenience iterators offered by WarcFilteredIterator and HttpResponseFilteredIterator follow this approach and need a WarcRecord to be given at construction time; the iterator reuses that object to expose the record read at each iteration.
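
The reuse pattern described above can be sketched in isolation. The following is a minimal illustration, not the library's code: ReusableRecord, its length-prefixed format, and the helper methods are hypothetical stand-ins chosen so that the example is self-contained, and they do not reflect the actual signature of WarcRecord's read method.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReusableRecordDemo {
    /** Hypothetical stand-in for a reusable record; NOT the WarcRecord API. */
    static final class ReusableRecord {
        byte[] buffer = new byte[16]; // replaced only when a record does not fit
        int length;                   // number of valid bytes in buffer

        /** Loads the next length-prefixed record into this object; returns false at end of input. */
        boolean read(DataInputStream in) throws IOException {
            final int n;
            try {
                n = in.readInt();
            } catch (EOFException e) {
                return false;
            }
            if (n > buffer.length) buffer = new byte[n];
            in.readFully(buffer, 0, n);
            length = n;
            return true;
        }
    }

    /** Sums the payload sizes using a single record instance for the whole input. */
    static long totalPayloadBytes(byte[] data) {
        final ReusableRecord record = new ReusableRecord(); // instantiated once
        long total = 0;
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (record.read(in)) total += record.length; // same object at every iteration
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Builds a toy input made of length-prefixed payloads (an assumed format, for illustration only). */
    static byte[] sample(String... payloads) {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String p : payloads) {
                final byte[] b = p.getBytes(StandardCharsets.UTF_8);
                out.writeInt(b.length);
                out.write(b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(totalPayloadBytes(sample("alpha", "be", "gamma"))); // prints 12
    }
}
```

The point of the sketch is that the loop allocates no per-record object: the only allocation after the first happens when a record is larger than the current buffer.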

Pull writes

To write a record, one should provide a MeasurableInputStream (by setting the block field of WarcRecord) from which the write method of WarcRecord will pull the data to be written.

This is especially convenient when the data comes from the network (as during a crawl) or from a read operation (as during batch processing of data) so that one can write a record simply using the available (suitably wrapped) input stream.
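
A pull write can be sketched as follows. This is an illustration under stated assumptions: SizedInputStream and PullRecord are hypothetical stand-ins for MeasurableInputStream and WarcRecord, and the single Content-Length line is a simplification that does not reproduce the real WARC header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PullWriteDemo {
    /** Hypothetical stand-in for an input stream that knows its own length. */
    static final class SizedInputStream extends ByteArrayInputStream {
        private final long length;

        SizedInputStream(byte[] data) {
            super(data);
            this.length = data.length;
        }

        long length() {
            return length;
        }
    }

    /** Hypothetical record: the caller sets the block, the record pulls from it. */
    static final class PullRecord {
        public SizedInputStream block; // public field, as described in the text

        void write(OutputStream out) throws IOException {
            // The length must be known before the payload is copied.
            out.write(("Content-Length: " + block.length() + "\r\n\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
            final byte[] buffer = new byte[8192];
            int n;
            while ((n = block.read(buffer)) != -1) out.write(buffer, 0, n); // pull, then write
        }
    }

    /** Writes one record whose block wraps the given payload and returns the bytes as text. */
    static String writeToString(String payload) {
        final PullRecord record = new PullRecord();
        record.block = new SizedInputStream(payload.getBytes(StandardCharsets.UTF_8));
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(writeToString("hello"));
    }
}
```

The caller never copies the payload into the record: it only hands over a stream, which is why a network or file stream of known length can be used directly.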

Public fields

To avoid the overhead of getters and setters, some fields (such as the block and header of WarcRecord and gzheader of GZWarcRecord) are declared public.

Low level I/O

Low level I/O can be performed using WarcRecord for the WARC format or GZWarcRecord for the compressed WARC format. For further detail, see the class documentation.

A very simple example of low level sequential reads can be found in the source code of SequentialWarcRecordRead, while an example of low level sequential writes can be found in the source code of SequentialWarcRecordWrite; both examples show how to use the plain and compressed WARC format.

A (more complex) example of low level random access reads can be found in the source code of the CutWarc tool (see also the IndexWarc tool on how to create the index).

High level I/O

High level I/O is provided at the moment only for (a subset of) the response record type of the WARC specification. More precisely, interfaces and classes are provided to deal with responses with content-type of HTTP (or HTTPS) kind. At the interface level, access is mainly read-only: all the interfaces specify getters for the various properties, but no setters.

Interfaces

In more detail, the Response interface (besides the aforementioned getters) prescribes the method fromWarcRecord, which allows one to obtain a response from a record. The HttpResponse interface specialises it to deal with responses having content-type of HTTP (or HTTPS) kind.

Classes

The AbstractHttpResponse provides an abstract implementation of an HttpResponse with all the getters except for contentAsStream, which concrete subclasses must implement according to the way they represent the response content; the fromWarcRecord method is also left unimplemented, for much the same reason. On the other hand, there is a toWarcRecord method that can be used by concrete subclasses to populate a WARC record with the data in the response (for example, in order to subsequently write such a record).

The concrete class WarcHttpResponse, for instance, implements the contentAsStream and fromWarcRecord methods so that the content is read from a WARC file; it is the class to use to read a WarcRecord (of suitable record-type) as an HttpResponse.

An example of the usage of such class to read a sequence of responses from a WARC file can be found in the source code of SequentialHttpResponseRead.

Finally, MutableHttpResponse is a concrete implementation of AbstractHttpResponse endowed with setters that can be used to populate the response, and hence a WarcRecord (via the toWarcRecord method), in order to write it.
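
The division of labour among the interface, the abstract class and the mutable class can be sketched as follows. All names here (SketchResponse, AbstractSketchResponse, MutableSketchResponse, toRecordHeader) are hypothetical and chosen for illustration; in particular, a plain map stands in for the WARC record, and the properties shown are not the library's actual set of getters.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class ResponseHierarchyDemo {
    /** Read-only view: getters only (hypothetical, not the library's interface). */
    interface SketchResponse {
        String uri();
        int status();
        InputStream contentAsStream();
    }

    /** Implements the property getters; the content representation is left to subclasses. */
    static abstract class AbstractSketchResponse implements SketchResponse {
        protected String uri;
        protected int status;

        public String uri() { return uri; }
        public int status() { return status; }

        /** Copies the response properties into a map standing in for a record header. */
        Map<String, String> toRecordHeader() {
            final Map<String, String> header = new LinkedHashMap<>();
            header.put("uri", uri);
            header.put("status", Integer.toString(status));
            return header;
        }
    }

    /** Concrete subclass that adds setters and keeps the content in memory. */
    static final class MutableSketchResponse extends AbstractSketchResponse {
        private byte[] content = new byte[0];

        void uri(String uri) { this.uri = uri; }
        void status(int status) { this.status = status; }
        void content(byte[] content) { this.content = content; }

        public InputStream contentAsStream() { return new ByteArrayInputStream(content); }
    }

    /** Populates a mutable response through its setters and exports it as a header map. */
    static Map<String, String> demoHeader() {
        final MutableSketchResponse response = new MutableSketchResponse();
        response.uri("http://example.com/");
        response.status(200);
        response.content(new byte[] { 1, 2, 3 });
        return response.toRecordHeader();
    }

    public static void main(String[] args) {
        System.out.println(demoHeader());
    }
}
```

Keeping the setters out of the interface means that code which only consumes responses cannot modify them, while writers can still build a response step by step through the mutable class.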

Digest based duplicate detection

There are many ways to detect duplicates in a crawl. UbiCrawler adopts a technique based on page content digests. The interface DigestBasedDuplicateDetection is used to represent duplicate information present in WARC files produced by UbiCrawler.

More precisely, the class WarcHttpResponse implements this interface, so that after using it to read a WARC record one can invoke its isDuplicate method to know whether the response is a duplicate of a previously crawled one, and its digest method to get the digest of the response content.

Observe that AbstractHttpResponse takes this interface into account: if the concrete class on which toWarcRecord is invoked implements it, that method will also populate the WarcRecord with the duplicate information.
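
The behaviour described in the last paragraph, where the abstract class checks whether the concrete instance carries duplicate information, can be sketched as follows. DuplicateInfo, BaseResponse and the header keys are hypothetical stand-ins; how the library actually stores the duplicate information in a WarcRecord is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DuplicateInfoDemo {
    /** Hypothetical stand-in for a duplicate-information interface. */
    interface DuplicateInfo {
        boolean isDuplicate();
        String digest();
    }

    /** Hypothetical abstract response; a map stands in for the record being populated. */
    static abstract class BaseResponse {
        abstract String uri();

        Map<String, String> toHeader() {
            final Map<String, String> header = new LinkedHashMap<>();
            header.put("uri", uri());
            // Duplicate information is added only if the concrete class provides it.
            if (this instanceof DuplicateInfo) {
                final DuplicateInfo info = (DuplicateInfo) this;
                header.put("digest", info.digest());
                header.put("duplicate", Boolean.toString(info.isDuplicate()));
            }
            return header;
        }
    }

    /** A response without duplicate information. */
    static final class PlainResponse extends BaseResponse {
        String uri() { return "http://example.com/a"; }
    }

    /** A response carrying a (made-up) digest and duplicate flag. */
    static final class DigestedResponse extends BaseResponse implements DuplicateInfo {
        String uri() { return "http://example.com/b"; }
        public boolean isDuplicate() { return true; }
        public String digest() { return "abc123"; }
    }

    public static void main(String[] args) {
        System.out.println(new PlainResponse().toHeader());
        System.out.println(new DigestedResponse().toHeader());
    }
}
```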