Package it.unimi.dsi.law.warc.io
Provides classes performing low and high level WARC I/O (for format details, please see the ISO draft).
This code is designed to be very efficient. In particular, a few design choices are worth mentioning:
- Object reuse
  Whenever possible, a single object must be used to perform multiple operations. For instance, to read a sequence of WARC records, one must instantiate a single WarcRecord object and repeatedly invoke its read method, rather than creating a new WarcRecord object for every record read.
  The convenience iterators offered by WarcFilteredIterator and HttpResponseFilteredIterator follow this approach and require a WarcRecord to be supplied at construction time; the iterator reuses that object to expose the record read at every iteration.
- Pull writes
  To write a record, one should provide a MeasurableInputStream (by setting the block field of WarcRecord) from which the write method of WarcRecord will pull the data to be written.
  This is especially convenient when the data comes from the network (as during a crawl) or from a read operation (as during batch processing of data), so that one can write a record simply by using the available (suitably wrapped) input stream.
- Public fields
  To avoid the overhead of getters and setters, some fields (such as the block and header of WarcRecord and gzheader of GZWarcRecord) are declared public.
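The object-reuse and public-field choices above can be illustrated with a minimal, self-contained sketch. It does not use the classes of this package: ReusableRecord, ObjectReuseSketch and the length-prefixed record layout are hypothetical stand-ins chosen only to show the pattern of allocating one record and refilling it on every read.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a reusable record; NOT the WarcRecord of this package. */
final class ReusableRecord {
    /** Public fields, in the spirit of the "public fields" choice described above. */
    public int length;
    public byte[] payload = new byte[16];

    /**
     * Reads the next length-prefixed record into this object, overwriting its previous
     * contents. Returns false at end of stream. The layout is an assumption of this sketch.
     */
    public boolean read(DataInputStream in) throws IOException {
        int n;
        try {
            n = in.readInt();
        } catch (EOFException e) {
            return false; // no more records
        }
        if (n < 0) throw new IOException("Negative record length: " + n);
        // Grow the buffer only when needed, so it is reused across records.
        if (n > payload.length) payload = new byte[Math.max(n, payload.length * 2)];
        in.readFully(payload, 0, n);
        length = n;
        return true;
    }
}

final class ObjectReuseSketch {
    /** Returns { number of records, total payload bytes } using a single record instance. */
    static long[] scan(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        ReusableRecord record = new ReusableRecord(); // allocated once
        long records = 0, bytes = 0;
        while (record.read(in)) { // the same object is refilled at every iteration
            records++;
            bytes += record.length;
        }
        return new long[] { records, bytes };
    }
}
```

Because the record is overwritten at each read, a caller that needs to keep data beyond the current iteration has to copy it out first.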
Low level I/O
Low level I/O can be performed using WarcRecord for the WARC format or GZWarcRecord for the compressed WARC format. For further detail, see the class documentation.
A very simple example of low level sequential reads can be found in the source code of SequentialWarcRecordRead, while an example of low level sequential writes can be found in the source code of SequentialWarcRecordWrite; both examples show how to use the plain and compressed WARC format.
A (more complex) example of low level random access reads can be found in the source code of the CutWarc tool (see also the IndexWarc tool on how to create the index).
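As a complement to the examples referenced above, the following is a minimal sketch of the pull-write convention for sequential writes: the caller hands the record an input stream, and the write method pulls the bytes from it. PullWriteRecord, its blockLength field and the length-prefixed output layout are hypothetical and are not the API of WarcRecord; a measurable stream would report its own length rather than needing a separate field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical record whose payload is pulled at write time; NOT this package's WarcRecord. */
final class PullWriteRecord {
    /** Source of the payload; set by the caller before each write. */
    public InputStream block;
    /** Number of bytes to pull from block (an assumption of this sketch). */
    public long blockLength;

    private final byte[] buffer = new byte[8192]; // reused across writes

    /** Writes a length prefix followed by exactly blockLength bytes pulled from block. */
    public void write(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeLong(blockLength);
        long remaining = blockLength;
        while (remaining > 0) {
            int n = block.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) throw new IOException("Block ended " + remaining + " bytes early");
            data.write(buffer, 0, n);
            remaining -= n;
        }
        data.flush();
    }
}

final class PullWriteSketch {
    /** Writes each payload as one record, reusing a single record object. */
    static byte[] writeAll(byte[][] payloads) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PullWriteRecord record = new PullWriteRecord();
        for (byte[] p : payloads) {
            record.block = new ByteArrayInputStream(p); // could equally wrap a network stream
            record.blockLength = p.length;
            record.write(out);
        }
        return out.toByteArray();
    }
}
```

The writer never needs the whole payload in memory: it copies through a fixed buffer, which is the main attraction of pulling from a stream.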
High level I/O
High level I/O is provided at the moment only for (a subset of) the response record type of the WARC specification. More precisely, interfaces and classes are provided to deal with responses with content-type of HTTP (or HTTPS) kind. At the interface level, access is mainly read-only: all the interfaces specify getters for the various properties, but no setters.
Interfaces
In more detail, the Response interface (besides the abovementioned getters) prescribes the method fromWarcRecord, which allows one to obtain a response from a record. The HttpResponse interface specialises it to deal with responses having content-type of HTTP (or HTTPS) kind.
Classes
AbstractHttpResponse provides an abstract implementation of HttpResponse with all the getters except for contentAsStream, which concrete subclasses must implement according to the way they represent the response content; the fromWarcRecord method is also left unimplemented, for much the same reason. On the other hand, there is a toWarcRecord method that can be used by concrete subclasses to populate a WARC record with the data in the response (for example, in order to subsequently write such record).
The concrete class WarcHttpResponse, for instance, implements the contentAsStream and fromWarcRecord methods so that the content is read from a WARC file; it should be used to read a WarcRecord (of suitable record-type) as an HttpResponse.
An example of the usage of this class to read a sequence of responses from a WARC file can be found in the source code of SequentialHttpResponseRead.
Finally, MutableHttpResponse is a concrete implementation of AbstractHttpResponse endowed with setters that can be used to populate the response, and hence a WarcRecord (via the toWarcRecord method), in order to write it.
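The layering described in this section (a read-only interface, an abstract class holding the shared getters, and a mutable concrete class with setters) can be sketched as follows. All names here (ResponseView, AbstractResponseView, MutableResponseView, toHeaderLine) are hypothetical and deliberately differ from the classes of this package; the sketch only illustrates the division of responsibilities, not the actual signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical read-only view of a response: getters only, no setters. */
interface ResponseView {
    String uri();
    int status();
    InputStream contentAsStream() throws IOException;
}

/** Shared getters live here; how the content is stored is left to subclasses. */
abstract class AbstractResponseView implements ResponseView {
    protected String uri;
    protected int status;

    @Override public String uri() { return uri; }
    @Override public int status() { return status; }

    /**
     * Builds a one-line summary from the getters, loosely analogous to populating
     * a record from a response. It relies only on the abstract contentAsStream.
     */
    public String toHeaderLine() throws IOException {
        long length = 0;
        try (InputStream in = contentAsStream()) {
            while (in.read() != -1) length++;
        }
        return uri + " " + status + " " + length;
    }
}

/** Concrete in-memory implementation with setters, used to build a response before writing it. */
final class MutableResponseView extends AbstractResponseView {
    private byte[] content = new byte[0];

    public void uri(String uri) { this.uri = uri; }
    public void status(int status) { this.status = status; }
    public void content(byte[] content) { this.content = content.clone(); }

    @Override public InputStream contentAsStream() {
        return new ByteArrayInputStream(content);
    }
}
```

Code that only consumes responses can be written against the read-only interface, while code that produces them uses the mutable class and never needs a second representation.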
Digest based duplicate detection
There are many ways to detect duplicates in a crawl. UbiCrawler adopts a technique based on page content digests. The interface DigestBasedDuplicateDetection is used to represent duplicate information present in WARC files produced by UbiCrawler.
More precisely, the class WarcHttpResponse implements this interface so that, after using it to read a WARC record, one can use its isDuplicate method to know whether the response is a duplicate of a previously crawled one, and its digest method to get the digest of the response content.
Observe that AbstractHttpResponse takes this interface into account: if the concrete class used to invoke toWarcRecord implements this interface, the method will populate the WarcRecord with the duplicate information.
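The general idea of digest-based duplicate detection can be sketched in a few lines: compute a digest of each page content and treat a page as a duplicate when its digest has been seen before. This is a generic illustration, not UbiCrawler's implementation; the class name DigestDuplicateSketch, the choice of SHA-256 and the in-memory set are all assumptions of the sketch.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical in-memory duplicate detector keyed on content digests. */
final class DigestDuplicateSketch {
    private final Set<String> seen = new HashSet<>();
    private final MessageDigest md;

    DigestDuplicateSketch() {
        try {
            // SHA-256 is an assumption; the digest used by the crawler is not stated here.
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the digest of the content, Base64-encoded so it can be used as a set key. */
    String digest(byte[] content) {
        return Base64.getEncoder().encodeToString(md.digest(content));
    }

    /** Returns true if content with the same digest was seen before; otherwise records it. */
    boolean isDuplicate(byte[] content) {
        return !seen.add(digest(content));
    }
}
```

Two pages are considered duplicates only when their contents are byte-for-byte identical (up to digest collisions), so near-duplicates that differ in, say, a timestamp are not detected by this approach.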
Class Summary
- BoundedCountingInputStream: A class that decorates an InputStream to obtain a MeasurableInputStream.
- GZWarcRecord
- GZWarcRecord.GZHeader: A class to contain fields contained in the gzip header.
- HttpResponseFilteredIterator: A class to iterate over WARC files getting only records corresponding to HttpResponse that satisfy a given filter.
- InspectableBufferedInputStream: An input stream that wraps an underlying input stream to make it rewindable and partially inspectable, using a bounded-capacity memory buffer and an overflow file.
- MeasurableSequenceInputStream: A MeasurableInputStream version of a SequenceInputStream.
- WarcFilteredIterator: A class to iterate over WARC files getting only records that satisfy a given filter.
- WarcRecord: A class to read/write WARC/0.9 records (for format details, please see the WARC format specifications).
- WarcRecord.Header: A class to contain fields contained in the warc header.
Enum Summary
- InspectableBufferedInputStream.State: The possible states of this stream, as explained above.
- WarcRecord.ContentType: Content types.
- WarcRecord.RecordType: Record types.
Exception Summary
- WarcRecord.FormatException: An exception to denote parsing errors during reads.