Provides classes performing low and high level WARC I/O (for format details, please see the ISO draft).

This code is designed to be very efficient. In particular some choices are worth mentioning:

Object reuse

Whenever possible, a single object must be used to perform multiple operations.

For instance, to read a sequence of WARC records, one must instantiate a single {@link it.unimi.dsi.law.warc.io.WarcRecord} object and invoke repeatedly its {@link it.unimi.dsi.law.warc.io.WarcRecord#read(it.unimi.dsi.fastutil.io.FastBufferedInputStream bin) read} method, as opposed to have a new {@link it.unimi.dsi.law.warc.io.WarcRecord} object for every record read.

The convenience iterators offered by {@link it.unimi.dsi.law.warc.io.WarcFilteredIterator} and {@link it.unimi.dsi.law.warc.io.HttpResponseFilteredIterator} follow this approach and need a {@link it.unimi.dsi.law.warc.io.WarcRecord} to be given at construction time; such object will be reused by the iterator to expose the record read at every iteration.

Pull writes

To write a record, one should provide a {@link it.unimi.dsi.fastutil.io.MeasurableInputStream} (by setting the {@link it.unimi.dsi.law.warc.io.WarcRecord#block block} field of {@link it.unimi.dsi.law.warc.io.WarcRecord}) from which the {@link it.unimi.dsi.law.warc.io.WarcRecord#write(java.io.OutputStream out) write} method of {@link it.unimi.dsi.law.warc.io.WarcRecord} will pull the data to be written.

This is especially convenient when the data comes from the network (as during a crawl) or from a read operation (as during batch processing of data) so that one can write a record simply using the available (suitably wrapped) input stream.

Public fields

To avoid the overhead of getters and setters, some fields (such as the {@link it.unimi.dsi.law.warc.io.WarcRecord#block block} and {@link it.unimi.dsi.law.warc.io.WarcRecord#header header} of {@link it.unimi.dsi.law.warc.io.WarcRecord} and {@link it.unimi.dsi.law.warc.io.GZWarcRecord#gzheader gzheader} of {@link it.unimi.dsi.law.warc.io.GZWarcRecord}) are declared public.

Low level I/O

Low level I/O can be performed using {@link it.unimi.dsi.law.warc.io.WarcRecord} for the WARC format or {@link it.unimi.dsi.law.warc.io.GZWarcRecord} for the compressed WARC format. For further detail, see the class documentation.

A very simple example of low level sequential reads can be found in the source code of {@link it.unimi.dsi.law.warc.io.examples.SequentialWarcRecordRead}, while an example of low level sequential writes can be found in the source code of {@link it.unimi.dsi.law.warc.io.examples.SequentialWarcRecordWrite}; both examples show how to use the plain and compressed WARC format.

A (more complex) example of low level random access reads can be found in the source code of the {@link it.unimi.dsi.law.warc.tool.CutWarc} tool (see also the {@link it.unimi.dsi.law.warc.tool.IndexWarc} tool on how to create the index).

High level I/O

High level I/O is provided at the moment only for (a subset of) the response record type of the WARC specification. More precisely, interfaces and classes are provided to deal with responses with content-type of HTTP (or HTTPS) kind. At the interface level, read-only access is mainly provided, in particular, all the interfaces specify getters for the various properties, but no getters.

Interfaces

More in detail, the {@link it.unimi.dsi.law.warc.util.Response} interface (besides the abovementioned getters), prescribes the method {@link it.unimi.dsi.law.warc.util.Response#fromWarcRecord fromWarcRecord} that allows to obtain a response from record. The {@link it.unimi.dsi.law.warc.util.HttpResponse} specialises such interface to deal with responses having content-type of HTTP (or HTTPS) kind.

Classes

The {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse} provides an abstract implementation of a {@link it.unimi.dsi.law.warc.util.HttpResponse} with all the getters except for {@link it.unimi.dsi.law.warc.util.HttpResponse#contentAsStream() contentAsStream} that concrete subclasses must implement according to the way they represent the response content; the {@link it.unimi.dsi.law.warc.util.Response#fromWarcRecord(WarcRecord record) fromWarcRecord}) method is also left unimplemented much for the same reason. On the other hand, there is a {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse#toWarcRecord toWarcRecord} method that can be used by concrete subclasses to populate a WARC record with the data in the response (for example, in order to subsequently write such record).

The concrete class {@link it.unimi.dsi.law.warc.util.WarcHttpResponse}, for instance, implements {@link it.unimi.dsi.law.warc.util.HttpResponse#contentAsStream() contentAsStream} and {@link it.unimi.dsi.law.warc.util.Response#fromWarcRecord(WarcRecord record) fromWarcRecord} methods so that the content is read from a WARC file and should indeed be used to read a {@link it.unimi.dsi.law.warc.io.WarcRecord} (of suitable record-type) as an {@link it.unimi.dsi.law.warc.util.HttpResponse}.

An example of the usage of such class to read a sequence of responses from a WARC file can be found in the source code of {@link it.unimi.dsi.law.warc.io.examples.SequentialHttpResponseRead}.

Finally, the {@link it.unimi.dsi.law.warc.util.MutableHttpResponse} is a concrete implementation of an {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse} endowed with setters that can be used to populate the response and hence a {@link it.unimi.dsi.law.warc.io.WarcRecord} (via {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse#toWarcRecord(WarcRecord record) toWarcRecord} method) in order to write it.

Digest based duplicate detection

There are many ways to detect duplicates in a crawl. UbiCrawler adopts a technique based on page content digests. The interface {@link it.unimi.dsi.law.warc.util.DigestBasedDuplicateDetection} is used to represent duplicate information present in WARC files produced by UbiCrawler.

More precisely, the class {@link it.unimi.dsi.law.warc.util.WarcHttpResponse} implements such interface so that after using it to read a WARC record, one can use the class {@link it.unimi.dsi.law.warc.util.DigestBasedDuplicateDetection#isDuplicate() isDuplicate} method to know whether the response is a duplicate of a previously crawled one, and can use the {@link it.unimi.dsi.law.warc.util.DigestBasedDuplicateDetection#digest() digest} method to get the digest of the response content.

Observe that {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse} takes this interface into account and if the concrete class used to invoke {@link it.unimi.dsi.law.warc.util.AbstractHttpResponse#toWarcRecord toWarcRecord} implements such interface, such method will populate the {@link it.unimi.dsi.law.warc.io.WarcRecord} with the duplicate information.