Class GZWarcRecord

java.lang.Object
it.unimi.dsi.law.warc.io.WarcRecord
it.unimi.dsi.law.warc.io.GZWarcRecord

public class GZWarcRecord
extends WarcRecord
A class to read/write WARC/0.9 records in compressed form (for format details, please see the WARC and GZip format specifications).

Records written or read with this class use the skip-lengths detailed in section 10.2 of the WARC specification.

Moreover, the optional NAME gzip header field contains the WarcRecord.Header.recordId. The optional COMMENT gzip header field contains the value of the anvl-field corresponding to the HttpResponse.DIGEST_HEADER key if present, or the WarcRecord.Header.recordId otherwise, followed by a tab and then by the WarcRecord.Header.subjectUri.
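
To make the header layout concrete, the following is a minimal sketch of a gzip member that carries NAME and COMMENT fields, written with java.util.zip only. It is not this class's implementation: the class name GzipNameCommentSketch, its methods, and the sample record id, digest, and URI are all hypothetical, and the skip-length data is omitted. The field order and flag bits follow RFC 1952.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of it.unimi.dsi.law.warc.io.
final class GzipNameCommentSketch {
    // Flag bits defined by RFC 1952.
    private static final int FEXTRA = 0x04, FNAME = 0x08, FCOMMENT = 0x10;

    /** Builds one gzip member whose header carries NAME and COMMENT. */
    static byte[] writeMember(String name, String comment, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x1f); out.write(0x8b);    // gzip magic number
        out.write(8);                        // CM: deflate
        out.write(FNAME | FCOMMENT);         // FLG: both optional fields present
        writeIntLE(out, 0);                  // MTIME: not set
        out.write(0);                        // XFL
        out.write(255);                      // OS: unknown
        writeZeroTerminated(out, name);
        writeZeroTerminated(out, comment);

        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        deflater.setInput(payload);
        deflater.finish();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();

        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        writeIntLE(out, (int) crc.getValue()); // CRC32 of the uncompressed data
        writeIntLE(out, payload.length);       // ISIZE: uncompressed length
        return out.toByteArray();
    }

    /** Returns {NAME, COMMENT}; an entry is null when its flag is not set. */
    static String[] readNameAndComment(byte[] member) {
        int flags = member[3] & 0xff;
        int pos = 10; // the fixed part of the header is 10 bytes long
        if ((flags & FEXTRA) != 0) {
            // Skip XLEN (2 bytes, little endian) plus the extra field itself.
            pos += 2 + ((member[pos] & 0xff) | ((member[pos + 1] & 0xff) << 8));
        }
        String[] result = new String[2];
        int[] bits = { FNAME, FCOMMENT };
        for (int i = 0; i < bits.length; i++) {
            if ((flags & bits[i]) == 0) continue;
            int end = pos;
            while (member[end] != 0) end++; // fields are zero-terminated
            result[i] = new String(member, pos, end - pos, StandardCharsets.ISO_8859_1);
            pos = end + 1;
        }
        return result;
    }

    /** Decompresses the member with the standard GZIPInputStream. */
    static byte[] gunzip(byte[] member) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(member))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) != -1; ) out.write(buffer, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeIntLE(ByteArrayOutputStream out, int v) {
        for (int shift = 0; shift < 32; shift += 8) out.write((v >>> shift) & 0xff);
    }

    private static void writeZeroTerminated(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        out.write(b, 0, b.length);
        out.write(0);
    }

    public static void main(String[] args) {
        // Hypothetical record id, digest and subject URI.
        byte[] member = writeMember("urn:uuid:example-record",
                "sha1:EXAMPLE\thttp://example.com/",
                "hello".getBytes(StandardCharsets.UTF_8));
        String[] fields = readNameAndComment(member);
        System.out.println("NAME: " + fields[0]);
        System.out.println("subject URI: " + fields[1].split("\t")[1]);
    }
}
```

Because both fields sit in the gzip header, a reader can recover the record id and subject URI without inflating the record body.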

As for a WarcRecord, to write a record, set WarcRecord.header and WarcRecord.block appropriately and then call write(OutputStream). After this call, some WarcRecord.header fields will have been modified and the gzheader fields will have been set to reflect the write operation.

Again, as in the case of a WarcRecord, to perform a sequence of consecutive reads/skips, call read(FastBufferedInputStream) or skip(FastBufferedInputStream). After a read, the WarcRecord.block can, but need not, be read to obtain the record data. The WarcRecord.Header.contentType field can be used to determine how to parse the content of WarcRecord.block.

As an implementation note: skipping just populates the gzheader fields and returns the value of the compressed-skip-length field of the skipped record. Reading, on the other hand, parses the gzip header as well as the WARC header, sets all the WarcRecord.header and gzheader fields appropriately, and sets WarcRecord.block so that it refers to the block part of the record. After a full read (a read that leaves fewer than PARTIAL_UNCOMPRESSED_READ_THRESHOLD bytes in uncompressedRecordStream), the CRC found in the gzip file is checked against the CRC computed over the record. After a partial read, the CRC can be checked by calling checkCRC(FastBufferedInputStream), which consumes uncompressedRecordStream to compute the CRC.
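
The reason a partial read needs an explicit check is that a CRC covers every byte of the record, so the unread remainder must be consumed before the comparison is meaningful. The following minimal sketch illustrates that idea with java.util.zip.CheckedInputStream. It is an assumption-laden illustration rather than the code behind checkCRC: the class PartialReadCrcSketch and its methods are hypothetical, and a plain byte array stands in for uncompressedRecordStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical helper, not part of it.unimi.dsi.law.warc.io.
final class PartialReadCrcSketch {

    /** Computes the CRC32 of a whole byte array, as a gzip writer would. */
    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /**
     * Reads only the first prefixLength bytes (the "partial read"), then
     * drains the rest of the stream so the running CRC covers all the data,
     * and finally compares it with the expected value.
     */
    static boolean readPrefixThenCheck(byte[] data, int prefixLength, long expectedCrc) {
        CheckedInputStream in =
                new CheckedInputStream(new ByteArrayInputStream(data), new CRC32());
        try {
            byte[] prefix = new byte[prefixLength];
            int read = 0;
            while (read < prefixLength) {
                int n = in.read(prefix, read, prefixLength - read);
                if (n == -1) break; // stream shorter than the requested prefix
                read += n;
            }
            // A caller would use the prefix here; the remainder is still unread.

            byte[] scratch = new byte[8192];
            while (in.read(scratch, 0, scratch.length) != -1) {
                // Discard the bytes; CheckedInputStream updates the CRC as a side effect.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return in.getChecksum().getValue() == expectedCrc;
    }

    public static void main(String[] args) {
        byte[] data = "example record block".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(readPrefixThenCheck(data, 7, crcOf(data)));
    }
}
```

The trade-off is the one the note above describes: checking the CRC after a partial read costs a pass over the remaining bytes, which is why it is left to the caller.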

This object can be reused for non-consecutive writes on different streams. To reuse it for non-consecutive reads/skips, however, the method resetRead() must be called whenever a read/skip does not follow a read/skip from the same stream.

This class uses internal buffering, hence it is not thread safe.