Class GZWarcRecord
public class GZWarcRecord extends WarcRecord
Records written or read with this class use the skip-lengths detailed in section 10.2 of the WARC specification. Moreover, the optional NAME gzip header field contains the WarcRecord.Header.recordId, and the optional COMMENT gzip header field contains the value of the anvl-field corresponding to the HttpResponse.DIGEST_HEADER key, if present, or the WarcRecord.Header.recordId otherwise, followed by a tab and then by the WarcRecord.Header.subjectUri.
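To make the header layout more concrete, here is a minimal, self-contained sketch that uses only the JDK and does not touch this library. It hand-writes a gzip member (RFC 1952) whose header carries an EXTRA subfield, a NAME and a COMMENT, then uses the stored length to jump to the next member without inflating anything. The class and method names (GzipHeaderSketch, writeMember, parseHeader) are hypothetical, and the "sl" subfield ID with two 4-byte little-endian lengths is an assumption made for illustration; section 10.2 of the WARC specification is the authority on the actual skip-length layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.GZIPInputStream;

class GzipHeaderSketch {
    static final int FEXTRA = 4, FNAME = 8, FCOMMENT = 16; // FLG bits from RFC 1952

    /** Optional header fields recovered by parseHeader. */
    static final class Parsed {
        long compressedSkipLength = -1; // stays -1 if no "sl" subfield is found
        String name, comment;
    }

    /** Builds one gzip member with EXTRA, NAME and COMMENT header fields. */
    static byte[] writeMember(String name, String comment, byte[] payload) {
        byte[] n = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] c = comment.getBytes(StandardCharsets.ISO_8859_1);
        byte[] body = deflateRaw(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        // fixed header + XLEN + subfield + NAME\0 + COMMENT\0 + body + trailer
        int total = 10 + 2 + 12 + n.length + 1 + c.length + 1 + body.length + 8;
        ByteBuffer b = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        b.put((byte) 0x1f).put((byte) 0x8b).put((byte) 8);      // magic, CM = deflate
        b.put((byte) (FEXTRA | FNAME | FCOMMENT));              // FLG
        b.putInt(0).put((byte) 0).put((byte) 255);              // MTIME, XFL, OS = unknown
        b.putShort((short) 12);                                 // XLEN
        b.put((byte) 's').put((byte) 'l').putShort((short) 8);  // assumed subfield ID and size
        b.putInt(total).putInt(payload.length);                 // compressed, uncompressed length
        b.put(n).put((byte) 0).put(c).put((byte) 0);
        b.put(body);
        b.putInt((int) crc.getValue()).putInt(payload.length);  // trailer: CRC32, ISIZE
        return b.array();
    }

    /** Compresses payload as a raw deflate stream (no zlib wrapper). */
    static byte[] deflateRaw(byte[] payload) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        d.setInput(payload);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) out.write(buf, 0, d.deflate(buf));
        d.end();
        return out.toByteArray();
    }

    /** Parses only the header of the member starting at offset. */
    static Parsed parseHeader(byte[] data, int offset) {
        ByteBuffer b = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        b.position(offset);
        if ((b.get() & 0xff) != 0x1f || (b.get() & 0xff) != 0x8b)
            throw new IllegalArgumentException("not a gzip member");
        b.get();                        // CM
        int flg = b.get() & 0xff;
        b.position(b.position() + 6);   // skip MTIME, XFL, OS
        Parsed p = new Parsed();
        if ((flg & FEXTRA) != 0) {
            int xlen = b.getShort() & 0xffff;
            int end = b.position() + xlen;
            while (b.position() < end) {
                byte si1 = b.get(), si2 = b.get();
                int len = b.getShort() & 0xffff;
                if (si1 == 's' && si2 == 'l' && len == 8) {
                    p.compressedSkipLength = b.getInt() & 0xffffffffL;
                    b.getInt();         // uncompressed length, unused here
                } else {
                    b.position(b.position() + len);
                }
            }
        }
        if ((flg & FNAME) != 0) p.name = readZeroTerminated(b);
        if ((flg & FCOMMENT) != 0) p.comment = readZeroTerminated(b);
        return p;
    }

    static String readZeroTerminated(ByteBuffer b) {
        int start = b.position();
        while (b.get() != 0) { /* advance to the terminator */ }
        return new String(b.array(), start, b.position() - 1 - start, StandardCharsets.ISO_8859_1);
    }

    /** Inflates all concatenated members; shows the hand-written bytes are valid gzip. */
    static byte[] gunzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With two members concatenated in one array, parseHeader(file, 0) yields the first NAME and a skip length, and parseHeader(file, (int) skipLength) lands directly on the second header. This mirrors, in spirit only, the idea of populating header fields on a skip without decompressing the block.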
As for a WarcRecord, to write a record, set WarcRecord.header and WarcRecord.block appropriately and then call write(OutputStream). After that call, some WarcRecord.header fields will have been modified, and the gzheader fields will be set to reflect the write operation.
Again, as in the case of a WarcRecord, to perform a sequence of consecutive reads and skips, call read(FastBufferedInputStream) or skip(FastBufferedInputStream). After a read, WarcRecord.block can (but need not) be read to obtain the record data. The WarcRecord.Header.contentType field can be used to determine how to parse the content of WarcRecord.block.
As an implementation note: skipping just populates the gzheader fields and returns the value of the compressed-skip-length field of the skipped record. Reading, on the other hand, parses the gzip header as well as the WARC header, sets all the WarcRecord.header and gzheader fields appropriately, and then sets WarcRecord.block so that it refers to the block part of the record. After a full read (a read that leaves fewer than PARTIAL_UNCOMPRESSED_READ_THRESHOLD bytes in uncompressedRecordStream), the CRC found in the gzip file is checked against the CRC computed over the record; after a partial read, the CRC can be checked by calling checkCRC(FastBufferedInputStream), which consumes uncompressedRecordStream to compute the CRC.
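The CRC check described above can be illustrated with the JDK alone. The following sketch is not this class's implementation; the names GzipCrcSketch, storedCrc and computedCrc are hypothetical. It assumes a member produced by java.util.zip.GZIPOutputStream, which writes the fixed 10-byte header with no optional fields, so the deflate data starts at offset 10. The caller reads a first chunk (the "partial read"), and the remainder of the uncompressed stream is then drained so that the running CRC-32 can be compared with the value stored in the gzip trailer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class GzipCrcSketch {
    /** Compresses data into a single gzip member with the default 10-byte header. */
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** CRC-32 stored in the trailer: the first 4 of the last 8 bytes, little-endian. */
    static long storedCrc(byte[] member) {
        return ByteBuffer.wrap(member, member.length - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xffffffffL;
    }

    /**
     * Reads up to firstChunk uncompressed bytes (the partial read), then drains
     * the rest of the stream, feeding everything into one running CRC-32.
     */
    static long computedCrc(byte[] member, int firstChunk) {
        CRC32 crc = new CRC32();
        Inflater inflater = new Inflater(true); // raw deflate: the gzip header is skipped by hand
        try (InputStream in = new InflaterInputStream(
                new ByteArrayInputStream(member, 10, member.length - 10), inflater)) {
            byte[] head = new byte[Math.max(1, firstChunk)];
            int n = in.read(head, 0, head.length);       // partial read by the caller
            if (n > 0) crc.update(head, 0, n);
            byte[] rest = new byte[4096];
            while ((n = in.read(rest, 0, rest.length)) != -1) {
                crc.update(rest, 0, n);                  // drain the remainder
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            inflater.end();
        }
        return crc.getValue();
    }

    static boolean crcMatches(byte[] member, int firstChunk) {
        return computedCrc(member, firstChunk) == storedCrc(member);
    }
}
```

A raw Inflater is used here because GZIPInputStream would verify the trailer on its own, which would hide the step being illustrated. Flipping a bit in the stored CRC makes crcMatches return false while the data still inflates correctly.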
This object can be reused for non-consecutive writes on different streams. To reuse it for non-consecutive reads and skips, however, resetRead() must be called whenever a read or skip does not follow a read or skip from the same stream.
This class uses internal buffering, hence it is not thread safe.
-
Nested Class Summary
Nested Classes
Columns: Modifier and Type, Class, Description
static class GZWarcRecord.GZHeader: A class to contain fields contained in the gzip header.
Nested classes/interfaces inherited from class it.unimi.dsi.law.warc.io.WarcRecord
WarcRecord.ContentType, WarcRecord.FormatException, WarcRecord.Header, WarcRecord.RecordType
-
Field Summary
Fields
Columns: Modifier and Type, Field, Description
static boolean ASSERTS
GZWarcRecord.GZHeader gzheader: The GZip headers used by this object.
static boolean USE_POSITION_INSTEAD_OF_SKIP: Tells what method to use to skip bytes in the input stream.
Fields inherited from class it.unimi.dsi.law.warc.io.WarcRecord
block, BYTE_REPRESENTATION_TO_CONTENT_TYPE, BYTE_REPRESENTATION_TO_RECORD_TYPE, crc, CRLF, DEBUG, DEFAULT_BUFFER_SIZE, header, UUID_FIELD_NAME, WARC_ID
-
Constructor Summary
Constructors
GZWarcRecord()
-
Method Summary
Columns: Modifier and Type, Method, Description
void checkCRC(FastBufferedInputStream in)
long read(FastBufferedInputStream in): A method to read a record from an InputStream.
void resetRead(): A method to allow the reuse of the present object for non-consecutive reads.
long skip(FastBufferedInputStream in): A method to skip a record from an InputStream.
String toString()
void write(OutputStream out): A method to write this record to an OutputStream.
Methods inherited from class it.unimi.dsi.law.warc.io.WarcRecord
copy
-
Field Details
-
ASSERTS
public static final boolean ASSERTS
- See Also:
- Constant Field Values
-
USE_POSITION_INSTEAD_OF_SKIP
public static final boolean USE_POSITION_INSTEAD_OF_SKIP
Tells what method to use to skip bytes in the input stream. It's here for profiling purposes.
- See Also:
- Constant Field Values
-
gzheader
The GZip headers used by this object.
-
-
Constructor Details
-
GZWarcRecord
public GZWarcRecord()
-
-
Method Details
-
resetRead
public void resetRead()
Description copied from class: WarcRecord
A method to allow the reuse of the present object for non-consecutive reads.
- Overrides:
resetRead in class WarcRecord
-
skip
Description copied from class: WarcRecord
A method to skip a record from an InputStream.
- Overrides:
skip in class WarcRecord
- Parameters:
in - the FastBufferedInputStream to read from.
- Returns:
the value of the data-length, or -1 if EOF has been reached.
- Throws:
IOException
WarcRecord.FormatException
-
read
Description copied from class: WarcRecord
A method to read a record from an InputStream.
- Overrides:
read in class WarcRecord
- Parameters:
in - the FastBufferedInputStream to read from.
- Returns:
the value of the data-length, or -1 if EOF has been reached.
- Throws:
IOException
WarcRecord.FormatException
-
write
Description copied from class: WarcRecord
A method to write this record to an OutputStream.
- Overrides:
write in class WarcRecord
- Parameters:
out - where to write.
- Throws:
IOException
-
checkCRC
-
toString
- Overrides:
toString in class WarcRecord
-