Package it.unimi.dsi.law.warc.io
Provides classes performing low and high level WARC I/O (for format details, please see the ISO draft).
This code is designed to be very efficient. In particular some choices are worth mentioning:
- Object reuse
        Whenever possible, a single object must be used to perform multiple operations. For instance, to read a sequence of WARC records, one must instantiate a single WarcRecord object and repeatedly invoke its read method, as opposed to creating a new WarcRecord object for every record read. The convenience iterators offered by WarcFilteredIterator and HttpResponseFilteredIterator follow this approach and need a WarcRecord to be given at construction time; that object will be reused by the iterator to expose the record read at every iteration.
- Pull writes
        To write a record, one should provide a MeasurableInputStream (by setting the block field of WarcRecord) from which the write method of WarcRecord will pull the data to be written. This is especially convenient when the data comes from the network (as during a crawl) or from a read operation (as during batch processing of data), so that one can write a record simply by using the available (suitably wrapped) input stream.
- Public fields
        To avoid the overhead of getters and setters, some fields (such as the block and header of WarcRecord and gzheader of GZWarcRecord) are declared public.
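
The three choices above can be illustrated with a small, self-contained sketch. It does not use the classes of this package: SketchRecord, its length field, its buffer field and the length-prefixed layout are all hypothetical stand-ins invented for the example, and the signatures of its read and write methods are assumptions, not those of WarcRecord. In particular, a MeasurableInputStream would report its own length, whereas the sketch keeps the length in a separate public field.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RecordReuseSketch {

    /** Hypothetical stand-in for a record type; this is NOT the real WarcRecord API. */
    static final class SketchRecord {
        /** Public field (no setter): the stream a write pulls its payload from. */
        public InputStream block;
        /** Public field: number of payload bytes to write, or just read. */
        public int length;
        /** Public field: buffer reused across reads; it grows only when needed. */
        public byte[] buffer = new byte[16];

        /** Reads the next length-prefixed record into this object; false at end of input. */
        public boolean read(DataInputStream in) throws IOException {
            try {
                length = in.readInt();
            } catch (EOFException endOfInput) {
                return false;
            }
            if (buffer.length < length) buffer = new byte[length];
            in.readFully(buffer, 0, length);
            return true;
        }

        /** Pull write: copies exactly length bytes from block to the output. */
        public void write(DataOutputStream out) throws IOException {
            out.writeInt(length);
            byte[] chunk = new byte[8];
            int remaining = length;
            while (remaining > 0) {
                int n = block.read(chunk, 0, Math.min(chunk.length, remaining));
                if (n < 0) throw new EOFException("block ended before length bytes were read");
                out.write(chunk, 0, n);
                remaining -= n;
            }
        }
    }

    /** Writes and then reads back all payloads using one SketchRecord instance. */
    static List<String> roundTrip(String... payloads) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            SketchRecord record = new SketchRecord(); // the only record ever allocated
            for (String payload : payloads) {
                byte[] data = payload.getBytes(StandardCharsets.UTF_8);
                record.block = new ByteArrayInputStream(data); // any InputStream would do
                record.length = data.length;
                record.write(out);
            }
            out.flush();

            List<String> result = new ArrayList<>();
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            while (record.read(in)) {
                result.add(new String(record.buffer, 0, record.length, StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("first", "second record", "third"));
    }
}
```

The caller never copies the payload into the record: it only hands over a stream, and the write method drains it. The same record instance then serves every read, so the loop allocates nothing per record apart from the strings built for the result.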
Low level I/O
Low level I/O can be performed using WarcRecord for the WARC format or GZWarcRecord for the compressed WARC format. For further details, see the class documentation.
        
        
A very simple example of low level sequential reads can be found in the source code of SequentialWarcRecordRead, while an example of low level sequential writes can be found in the source code of SequentialWarcRecordWrite; both examples show how to use the plain and compressed WARC format.   
        
        
A (more complex) example of low level random access reads can be found in the source code of the CutWarc tool (see also the IndexWarc tool on how to create the index).
        
        
High level I/O
High level I/O is provided at the moment only for (a subset of) the response record type of the WARC specification. More precisely, interfaces and classes are provided to deal with responses with content-type of HTTP (or HTTPS) kind. At the interface level, mainly read-only access is provided: in particular, all the interfaces specify getters for the various properties, but no setters.
        
        
Interfaces
In more detail, the Response interface (besides the abovementioned getters) prescribes the method fromWarcRecord, which allows one to obtain a response from a record. The HttpResponse interface specialises Response to deal with responses having content-type of HTTP (or HTTPS) kind.
        
        
Classes
The AbstractHttpResponse class provides an abstract implementation of HttpResponse with all the getters except for contentAsStream, which concrete subclasses must implement according to the way they represent the response content; the fromWarcRecord method is also left unimplemented for much the same reason. On the other hand, there is a toWarcRecord method that can be used by concrete subclasses to populate a WARC record with the data in the response (for example, in order to subsequently write such a record).
        
        
The concrete class WarcHttpResponse, for instance, implements the contentAsStream and fromWarcRecord methods so that the content is read from a WARC file; it is the class to use to read a WarcRecord (of suitable record-type) as an HttpResponse.
        
        
An example of the usage of such class to read a sequence of responses from a WARC file can be found in the source code of SequentialHttpResponseRead.
                
        
Finally, MutableHttpResponse is a concrete implementation of AbstractHttpResponse endowed with setters that can be used to populate the response, and hence a WarcRecord (via the toWarcRecord method), in order to write it.
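
The layering described in this section (a read-only interface, an abstract class that leaves content handling to subclasses, and a mutable concrete class) can be sketched in isolation as follows. Every type here (SimpleRecord, SimpleResponse, AbstractSimpleResponse, MutableSimpleResponse) is hypothetical, and the method names fromRecord and toRecord and their signatures are assumptions chosen to echo the text, not the actual signatures of this package. For brevity, a single concrete class plays both the reading and the writing role.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class HighLevelSketch {

    /** Hypothetical minimal record: a status line plus a payload. */
    static final class SimpleRecord {
        public String statusLine;
        public byte[] payload;
    }

    /** Read-only view: getters only, plus a way to load the response from a record. */
    interface SimpleResponse {
        int status();
        InputStream contentAsStream();
        boolean fromRecord(SimpleRecord record);
    }

    /** Shared getters and toRecord; how the content is stored is left to subclasses. */
    abstract static class AbstractSimpleResponse implements SimpleResponse {
        protected int status;

        @Override
        public int status() { return status; }

        /** Populates the record from this response, pulling the content as a stream. */
        public void toRecord(SimpleRecord record) {
            record.statusLine = "HTTP/1.1 " + status;
            try (InputStream in = contentAsStream()) {
                record.payload = in.readAllBytes();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Concrete class with setters, used to build a response before writing it. */
    static final class MutableSimpleResponse extends AbstractSimpleResponse {
        private byte[] content = new byte[0];

        public void status(int status) { this.status = status; }
        public void content(byte[] content) { this.content = content; }

        @Override
        public InputStream contentAsStream() { return new ByteArrayInputStream(content); }

        @Override
        public boolean fromRecord(SimpleRecord record) {
            if (record.statusLine == null || !record.statusLine.startsWith("HTTP/")) return false;
            status = Integer.parseInt(record.statusLine.substring(record.statusLine.indexOf(' ') + 1));
            content = record.payload != null ? record.payload : new byte[0];
            return true;
        }
    }

    /** Builds a response, stores it in a record, and loads it back into a fresh response. */
    static String roundTrip(int status, String body) {
        MutableSimpleResponse outgoing = new MutableSimpleResponse();
        outgoing.status(status);
        outgoing.content(body.getBytes(StandardCharsets.UTF_8));
        SimpleRecord record = new SimpleRecord();
        outgoing.toRecord(record);

        MutableSimpleResponse incoming = new MutableSimpleResponse();
        if (!incoming.fromRecord(record)) throw new IllegalStateException("not an HTTP record");
        try (InputStream in = incoming.contentAsStream()) {
            return incoming.status() + ":" + new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(200, "hello"));
    }
}
```

Keeping toRecord in the abstract class means it depends only on the getters and on contentAsStream, so any subclass, however it stores its content, can be turned into a record without extra code.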
                
        
                
        
Digest based duplicate detection
There are many ways to detect duplicates in a crawl. UbiCrawler adopts a technique based on page content digests. The interface DigestBasedDuplicateDetection is used to represent duplicate information present in WARC files produced by UbiCrawler.
        
        
        
        
More precisely, the class WarcHttpResponse implements this interface so that, after using it to read a WARC record, one can use its isDuplicate method to know whether the response is a duplicate of a previously crawled one, and its digest method to get the digest of the response content.
        
        
        
        
Observe that AbstractHttpResponse takes this interface into account: if the concrete class on which toWarcRecord is invoked implements this interface, the method will populate the WarcRecord with the duplicate information.
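
As a rough illustration of the general technique, the following sketch flags content as duplicate when its digest has been seen before. It is not UbiCrawler's implementation: the class DigestDuplicateSketch is hypothetical, the choice of SHA-256 and of Base64-encoded digest strings is an assumption (the digest function actually used is not stated here), and the method names merely echo those mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical digest-based duplicate detector; not part of this package. */
class DigestDuplicateSketch {
    /** Digests of all content seen so far, Base64-encoded. */
    private final Set<String> seen = new HashSet<>();
    private final MessageDigest messageDigest;

    DigestDuplicateSketch() {
        try {
            // SHA-256 is an assumption made for this sketch only.
            messageDigest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the Base64-encoded digest of the given content. */
    String digest(byte[] content) {
        return Base64.getEncoder().encodeToString(messageDigest.digest(content));
    }

    /** Returns true if content with the same digest was seen before; records it otherwise. */
    boolean isDuplicate(byte[] content) {
        return !seen.add(digest(content));
    }

    public static void main(String[] args) {
        DigestDuplicateSketch detector = new DigestDuplicateSketch();
        String[] pages = { "<html>a</html>", "<html>b</html>", "<html>a</html>" };
        for (String page : pages) {
            byte[] content = page.getBytes(StandardCharsets.UTF_8);
            System.out.println(page + " duplicate? " + detector.isDuplicate(content));
        }
    }
}
```

Only digests are kept in memory, not page contents, which is what makes this approach practical for a large crawl; the price is that two different pages with colliding digests would be treated as duplicates.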
Class Summary
- BoundedCountingInputStream: A class that decorates an InputStream to obtain a MeasurableInputStream.
- GZWarcRecord
- GZWarcRecord.GZHeader: A class to contain fields contained in the gzip header.
- HttpResponseFilteredIterator: A class to iterate over WARC files getting only records corresponding to HttpResponse that satisfy a given filter.
- InspectableBufferedInputStream: An input stream that wraps an underlying input stream to make it rewindable and partially inspectable, using a bounded-capacity memory buffer and an overflow file.
- MeasurableSequenceInputStream: A MeasurableInputStream version of a SequenceInputStream.
- WarcFilteredIterator: A class to iterate over WARC files getting only records that satisfy a given filter.
- WarcRecord: A class to read/write WARC/0.9 records (for format details, please see the WARC format specifications).
- WarcRecord.Header: A class to contain fields contained in the warc header.
Enum Summary
- InspectableBufferedInputStream.State: The possible states of this stream, as explained above.
- WarcRecord.ContentType: Content types.
- WarcRecord.RecordType: Record types.
Exception Summary
- WarcRecord.FormatException: An exception to denote parsing errors during reads.