Class GZWarcRecord
public class GZWarcRecord extends WarcRecord
Records written or read with this class use the skip-lengths detailed in
section 10.2 of the WARC specification.
Moreover, the optional NAME gzip header field contains the
WarcRecord.Header.recordId, and the optional COMMENT
gzip header field contains the value of the anvl-field corresponding to
the HttpResponse.DIGEST_HEADER key, if present, or otherwise the
WarcRecord.Header.recordId, followed by a tab and then by the
WarcRecord.Header.subjectUri.
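As an illustration of the NAME and COMMENT layout described above, the following is a minimal sketch that uses only the Java standard library and is not taken from this class. The class name GzipHeaderSketch and its methods are hypothetical. The sketch assumes that the COMMENT value is the digest (or, if absent, the record id), then a tab, then the subject URI, which is one reading of the description. The header byte layout follows RFC 1952, and the extra field carrying the skip-lengths is deliberately omitted.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class GzipHeaderSketch {
    // Flag bits defined by RFC 1952.
    static final int FNAME = 0x08;
    static final int FCOMMENT = 0x10;

    // Hypothetical helper: digest if present, otherwise the record id,
    // then a tab, then the subject URI (an assumed reading of the docs).
    static String commentField(String digest, String recordId, String subjectUri) {
        return (digest != null ? digest : recordId) + "\t" + subjectUri;
    }

    // Appends a zero-terminated ISO-8859-1 string, as RFC 1952 requires
    // for the NAME and COMMENT fields.
    static void writeZeroTerminated(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        out.write(b, 0, b.length);
        out.write(0);
    }

    // Builds a gzip member header carrying NAME and COMMENT.
    // The skip-length extra field is not written in this sketch.
    static byte[] header(String name, String comment) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x1f);                          // ID1
        out.write(0x8b);                          // ID2
        out.write(8);                             // CM = deflate
        out.write(FNAME | FCOMMENT);              // FLG
        for (int i = 0; i < 4; i++) out.write(0); // MTIME (unset)
        out.write(0);                             // XFL
        out.write(255);                           // OS = unknown
        writeZeroTerminated(out, name);
        writeZeroTerminated(out, comment);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder record id and URI, for illustration only.
        String id = "urn:uuid:00000000-0000-0000-0000-000000000000";
        byte[] h = header(id, commentField(null, id, "http://example.com/"));
        System.out.println("header length: " + h.length);
    }
}
```

Because the comment starts with the digest or the record id and ends with the subject URI, a reader could in principle identify a record from the gzip header alone, without decompressing it.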
As for a WarcRecord, to write a record, set WarcRecord.header
and WarcRecord.block appropriately and then call write(OutputStream). After this call,
some WarcRecord.header fields will have been modified and the gzheader fields will have been set to
reflect the write operation.
Again, as in the case of a WarcRecord, to perform a sequence
of consecutive reads/skips, call read(FastBufferedInputStream) or
skip(FastBufferedInputStream). After a read, the WarcRecord.block can (but need not)
be read to obtain the record data. The WarcRecord.Header.contentType field
can be used to determine how to parse the content of WarcRecord.block.
As an implementation note: skipping just populates the gzheader fields and returns
the value of the compressed-skip-length field of the skipped record. Reading, on the other
hand, parses the gzip header as well as the WARC header, sets all the
WarcRecord.header and gzheader fields appropriately, and then sets WarcRecord.block so that
it refers to the block part of the record. After a full read (a read that leaves
fewer than PARTIAL_UNCOMPRESSED_READ_THRESHOLD bytes in uncompressedRecordStream),
the CRC found in the gzip file is checked against the CRC computed over the record. After a partial read,
the CRC can be checked by calling checkCRC(FastBufferedInputStream), which consumes
uncompressedRecordStream to compute the CRC.
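The CRC comparison described above can be illustrated with the Java standard library alone. This is a hedged sketch, not the implementation of checkCRC: the class GzipCrcSketch and its methods are hypothetical names. It relies only on the fact, from RFC 1952, that a gzip member ends with the CRC-32 of the uncompressed data stored as a little-endian 32-bit value, followed by the uncompressed size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;

public class GzipCrcSketch {
    // Compresses data into a single gzip member held in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data, 0, data.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reads the CRC-32 stored in the trailer: the first four of the
    // last eight bytes of the member, in little-endian order.
    static long storedCrc(byte[] member) {
        return ByteBuffer.wrap(member, member.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }

    // Computes the CRC-32 over the uncompressed bytes.
    static long computedCrc(byte[] uncompressed) {
        CRC32 crc = new CRC32();
        crc.update(uncompressed, 0, uncompressed.length);
        return crc.getValue();
    }

    // True when the stored and computed CRCs agree.
    static boolean crcMatches(byte[] member, byte[] uncompressed) {
        return storedCrc(member) == computedCrc(uncompressed);
    }

    public static void main(String[] args) {
        byte[] data = "example record block".getBytes(StandardCharsets.UTF_8);
        byte[] member = gzip(data);
        System.out.println("CRC matches: " + crcMatches(member, data));
    }
}
```

The sketch also shows why a partial read cannot be verified immediately: the CRC covers every uncompressed byte, so the rest of the stream has to be consumed before the comparison is meaningful.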
This object can be reused for non-consecutive writes on different streams. On the other hand,
to reuse this object for non-consecutive reads/skips, the method resetRead() must be
called any time a read/skip does not follow a read/skip from the same stream.
This class uses internal buffering, hence it is not thread-safe.
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                    Description
static class         GZWarcRecord.GZHeader    A class to contain fields contained in the gzip header.
Nested classes/interfaces inherited from class it.unimi.dsi.law.warc.io.WarcRecord
WarcRecord.ContentType, WarcRecord.FormatException, WarcRecord.Header, WarcRecord.RecordType
-
Field Summary
Fields
Modifier and Type        Field                           Description
static boolean           ASSERTS
GZWarcRecord.GZHeader    gzheader                        The GZip headers used by this object.
static boolean           USE_POSITION_INSTEAD_OF_SKIP    Tells what method to use to skip bytes in the input stream.
Fields inherited from class it.unimi.dsi.law.warc.io.WarcRecord
block, BYTE_REPRESENTATION_TO_CONTENT_TYPE, BYTE_REPRESENTATION_TO_RECORD_TYPE, crc, CRLF, DEBUG, DEFAULT_BUFFER_SIZE, header, UUID_FIELD_NAME, WARC_ID
-
Constructor Summary
Constructors
Constructor       Description
GZWarcRecord()
-
Method Summary
Modifier and Type    Method                                  Description
void                 checkCRC(FastBufferedInputStream in)
long                 read(FastBufferedInputStream in)        A method to read a record from an InputStream.
void                 resetRead()                             A method to allow the reuse of the present object for non-consecutive reads.
long                 skip(FastBufferedInputStream in)        A method to skip a record from an InputStream.
String               toString()
void                 write(OutputStream out)                 A method to write this record to an OutputStream.
Methods inherited from class it.unimi.dsi.law.warc.io.WarcRecord
copy
-
Field Details
-
ASSERTS
public static final boolean ASSERTS
See Also:
Constant Field Values
-
USE_POSITION_INSTEAD_OF_SKIP
public static final boolean USE_POSITION_INSTEAD_OF_SKIP
Tells what method to use to skip bytes in the input stream. It's here for profiling purposes.
See Also:
Constant Field Values
-
gzheader
The GZip headers used by this object.
-
-
Constructor Details
-
GZWarcRecord
public GZWarcRecord()
-
-
Method Details
-
resetRead
public void resetRead()
Description copied from class: WarcRecord
A method to allow the reuse of the present object for non-consecutive reads.
Overrides:
resetRead in class WarcRecord
-
skip
Description copied from class: WarcRecord
A method to skip a record from an InputStream.
Overrides:
skip in class WarcRecord
Parameters:
in - the FastBufferedInputStream to read from.
Returns:
the value of the data-length, or -1 if eof has been reached.
Throws:
IOException
WarcRecord.FormatException
-
read
Description copied from class: WarcRecord
A method to read a record from an InputStream.
Overrides:
read in class WarcRecord
Parameters:
in - the FastBufferedInputStream to read from.
Returns:
the value of the data-length, or -1 if eof has been reached.
Throws:
IOException
WarcRecord.FormatException
-
write
Description copied from class: WarcRecord
A method to write this record to an OutputStream.
Overrides:
write in class WarcRecord
Parameters:
out - where to write.
Throws:
IOException
-
checkCRC
-
toString
Overrides:
toString in class WarcRecord
-