Class WarcRecord

java.lang.Object
it.unimi.dsi.law.warc.io.WarcRecord
Direct Known Subclasses:
GZWarcRecord

public class WarcRecord
extends Object
A class to read/write WARC/0.9 records (for format details, please see the WARC format specifications).

To write a record, set header and block appropriately and then call write(OutputStream). After such a call, the WarcRecord.Header.dataLength field of the header will have been modified to reflect the write operation.
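The side effect on the header can be pictured with a small, self-contained toy. This is only a sketch under stated assumptions: WriteSketch, its Header class and the one-line header layout are hypothetical, it does not use the LAW library, and its dataLength counts block bytes only (what the real field counts is defined by the WARC specification and is not asserted here).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Toy model only: hypothetical names and a made-up one-line header, not the WARC/0.9 layout.
final class WriteSketch {

    static final class Header {
        String contentType = "text/plain";
        long dataLength = -1; // unknown until a write has measured the block
    }

    final Header header = new Header();
    InputStream block;

    // Buffers the block to measure it, records the size in the header,
    // then emits the header line followed by the block bytes.
    void write(OutputStream out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int b; (b = block.read()) >= 0; ) buffer.write(b);
        header.dataLength = buffer.size(); // the caller's header object is modified here
        String line = header.dataLength + " " + header.contentType + "\n";
        out.write(line.getBytes(StandardCharsets.US_ASCII));
        buffer.writeTo(out);
    }

    // Writes one record and returns "<dataLength after write>|<bytes written>".
    static String demo() {
        try {
            WriteSketch record = new WriteSketch();
            record.block = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            record.write(out);
            return record.header.dataLength + "|"
                    + new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In this toy the header starts with dataLength set to -1 and holds 5 after the write, so a caller that inspects the header afterwards sees the value that was actually written.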

To perform a sequence of consecutive reads/skips on a stream, call read(FastBufferedInputStream) or skip(FastBufferedInputStream) repeatedly. After a read, block may (but need not) be read to obtain the record data. The contentType field of the header can be used to determine how to further process the content of block.

As an implementation note: skipping just returns the value of the data-length field of the skipped record. Reading, on the other hand, parses the header and sets all the header fields appropriately; it then sets block so that it refers to the block part of the record. Observe that since block is just a "view" over the underlying stream, neither its content nor its position is guaranteed to remain the same after a subsequent read/skip on the same stream.
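The "view" caveat can be sketched with plain java.io classes. The following is a toy model under stated assumptions, not the LAW implementation: BlockViewSketch, BoundedView and the 4-byte length-prefixed format are hypothetical stand-ins chosen to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Toy model only: a hypothetical length-prefixed format, not WARC and not the LAW code.
final class BlockViewSketch {

    // A view exposing at most `remaining` bytes of the shared underlying stream.
    static final class BoundedView extends InputStream {
        private final InputStream in;
        private long remaining;

        BoundedView(InputStream in, long length) {
            this.in = in;
            this.remaining = length;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = in.read();
            if (b < 0) { remaining = 0; return -1; }
            remaining--;
            return b;
        }

        // Consumes whatever is left so the underlying stream sits at the next record.
        void drain() throws IOException {
            while (read() >= 0) { /* discard */ }
        }
    }

    private BoundedView block; // view over the most recently read record

    // Reads a 4-byte length and exposes the payload as a view; no payload bytes are copied.
    BoundedView read(DataInputStream in) throws IOException {
        if (block != null) block.drain(); // a consecutive read moves past the old block
        block = new BoundedView(in, in.readInt());
        return block;
    }

    static String text(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b; (b = in.read()) >= 0; ) out.write(b);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Reads two records back to back, then inspects both views.
    static String demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String payload : new String[] { "first", "second" }) {
                byte[] data = payload.getBytes(StandardCharsets.UTF_8);
                out.writeInt(data.length);
                out.write(data);
            }
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            BlockViewSketch record = new BlockViewSketch();

            BoundedView stale = record.read(in); // not consumed before the next read
            BoundedView fresh = record.read(in); // drains "first", then points at "second"
            return "[" + text(stale) + "][" + text(fresh) + "]";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [][second]
    }
}
```

The first view yields nothing once the second read has advanced the shared stream. In this toy, a caller that needs the data of a record has to consume or copy the block before issuing the next read/skip.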

This object can be reused for non-consecutive writes on different streams. To reuse it for non-consecutive reads/skips, on the other hand, the method resetRead() must be called whenever a read/skip does not follow a read/skip from the same stream.
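Why a reset might be needed can be illustrated with a toy reader that remembers how much of the previous record is still unread. This is a sketch of one plausible mechanism only: ResetReadSketch, its pending counter and the 1-byte length format are hypothetical, and whether WarcRecord keeps this particular kind of state is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy model only: a hypothetical reader keeping per-stream state between calls.
final class ResetReadSketch {

    private long pending; // payload bytes of the previous record not yet consumed

    // Skips any leftover payload, then reads the next 1-byte length header.
    int read(DataInputStream in) throws IOException {
        while (pending > 0) { in.readByte(); pending--; }
        int length = in.readUnsignedByte();
        pending = length;
        return length;
    }

    // Forgets the state tied to the previously used stream.
    void resetRead() {
        pending = 0;
    }

    static DataInputStream stream(int... values) {
        byte[] data = new byte[values.length];
        for (int i = 0; i < values.length; i++) data[i] = (byte) values[i];
        return new DataInputStream(new ByteArrayInputStream(data));
    }

    // Switches streams once without and once with a reset; returns both parsed lengths.
    static String demo() {
        try {
            ResetReadSketch record = new ResetReadSketch();
            record.read(stream(2, 'a', 'b', 1, 'c'));          // leaves pending == 2

            // No reset: two bytes of the NEW stream are skipped, so 'y' is parsed as a length.
            int wrong = record.read(stream(3, 'x', 'y', 'z'));

            record.resetRead();                                 // drop stale state first
            int right = record.read(stream(3, 'x', 'y', 'z'));
            return wrong + "," + right;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 121,3
    }
}
```

Without the reset the toy misparses the new stream (121 is the code of 'y'); with it, the length 3 is read correctly.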

This class uses internal buffering and is therefore not thread-safe.