java.lang.Object
  it.unimi.dsi.law.warc.io.WarcRecord
public class WarcRecord
A class to read/write WARC/0.9 records (for format details, please see the WARC format specifications).
To write a record, set header and block appropriately and then call
write(OutputStream). After such a call, the WarcRecord.Header.dataLength field
of the header will be modified to reflect the write operation.
To perform a sequence of consecutive reads/skips, call read(FastBufferedInputStream)
or skip(FastBufferedInputStream). After a read, block can (but need not) be
read to obtain the record data. The contentType
field of the header can be used to determine how to further process the content of
block.
As an implementation note: skipping just returns the value of the data-length
field of the skipped record. On the other hand, reading parses the header and sets
all the header fields appropriately; hence it sets block so that it refers to
the block part of the record. Observe that since block is just a "view"
over the underlying stream, neither its content nor its position is guaranteed to remain the same after
a subsequent read/skip on the same stream.
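The "view" caveat can be illustrated without the library. What follows is a minimal sketch in plain Java, assuming a hypothetical View class (not part of it.unimi.dsi.law.warc.io) that exposes a bounded window of a shared stream, roughly as block is described above. It only shows why a view should be consumed before the underlying stream moves on; it does not reproduce the actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names belong to the WarcRecord API.
final class BlockViewSketch {

    // A bounded window over a shared parent stream; no bytes are copied.
    static final class View extends InputStream {
        private final InputStream parent;
        private long remaining;

        View(InputStream parent, long length) {
            this.parent = parent;
            this.remaining = length;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = parent.read();
            if (b >= 0) remaining--;
            return b;
        }
    }

    // Reads a stream to exhaustion, treating each byte as an ASCII character.
    static String drain(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            for (int b; (b = in.read()) != -1; ) sb.append((char) b);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Consumes the view before the shared stream moves.
    static String freshViewDemo() {
        InputStream stream = new ByteArrayInputStream("AAAABBBB".getBytes(StandardCharsets.US_ASCII));
        return drain(new View(stream, 4));
    }

    // Creates a view over the first four bytes, then advances the stream before using the view.
    static String staleViewDemo() {
        InputStream stream = new ByteArrayInputStream("AAAABBBB".getBytes(StandardCharsets.US_ASCII));
        View first = new View(stream, 4); // intended to cover "AAAA"
        try {
            stream.skip(4);               // a later read/skip moves the shared stream
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return drain(first);              // the view now yields the next four bytes
    }

    public static void main(String[] args) {
        System.out.println(freshViewDemo()); // prints AAAA
        System.out.println(staleViewDemo()); // prints BBBB
    }
}
```

In practice this suggests copying or fully consuming block before the next read/skip on the same stream if the data is needed later.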
This object can be reused for non-consecutive writes on different streams. On the other hand,
to reuse this object for non-consecutive read/skip, the method resetRead() must be
called any time a read/skip does not follow a read/skip from the same stream.
This class uses internal buffering, hence it is not thread safe.
| Nested Class Summary | |
|---|---|
| static class | WarcRecord.ContentType - Content types. |
| class | WarcRecord.FormatException - An exception to denote parsing errors during reads. |
| static class | WarcRecord.Header - A class holding the fields of the warc header. |
| static class | WarcRecord.RecordType - Record types. |
| Field Summary | |
|---|---|
| static boolean | ASSERTS |
| MeasurableInputStream | block - The warc block. |
| static Object2ObjectMap<byte[],WarcRecord.ContentType> | BYTE_REPRESENTATION_TO_CONTENT_TYPE |
| static Object2ObjectMap<byte[],WarcRecord.RecordType> | BYTE_REPRESENTATION_TO_RECORD_TYPE |
| protected CRC32 | crc - The class used in write(OutputStream) to compute CRC32 of the content for GZWarcRecord. |
| static byte[] | CRLF - Some constant strings in their byte equivalent. |
| static boolean | DEBUG |
| static int | DEFAULT_BUFFER_SIZE - The default size of the internal buffer used for headers read/write. |
| WarcRecord.Header | header - The warc header. |
| static boolean | USE_POSITION_INSTEAD_OF_SKIP - Tells what method to use to skip bytes in the input stream. |
| static byte[] | UUID_FIELD_NAME - Some constant strings in their byte equivalent. |
| static byte[] | WARC_ID - Some constant strings in their byte equivalent. |
| Constructor Summary | |
|---|---|
| WarcRecord() | Builds a warc record. |
| WarcRecord(byte[] buffer) | Builds a warc record. |
| Method Summary | |
|---|---|
| void | copy(WarcRecord record) - Copies the fields of another warc record into this one. |
| long | read(FastBufferedInputStream bin) - A method to read a record from an InputStream. |
| void | resetRead() - A method to allow the reuse of the present object for non-consecutive reads. |
| long | skip(FastBufferedInputStream bin) - A method to skip a record from an InputStream. |
| String | toString() |
| void | write(OutputStream out) - A method to write this record to an OutputStream. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|
public static final boolean DEBUG
public static final boolean ASSERTS
public static final boolean USE_POSITION_INSTEAD_OF_SKIP
public static final Object2ObjectMap<byte[],WarcRecord.ContentType> BYTE_REPRESENTATION_TO_CONTENT_TYPE
public static final Object2ObjectMap<byte[],WarcRecord.RecordType> BYTE_REPRESENTATION_TO_RECORD_TYPE
public static final int DEFAULT_BUFFER_SIZE
public static final byte[] WARC_ID
public static final byte[] UUID_FIELD_NAME
public static final byte[] CRLF
protected CRC32 crc
The class used in write(OutputStream) to compute the CRC32 of the content for GZWarcRecord. If null, the CRC will not be updated.
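The null-tolerant checksum update described for crc can be sketched with the standard library. The example below assumes the field's CRC32 type is java.util.zip.CRC32 (the page does not say so explicitly), and the feed method is a hypothetical stand-in for the content-writing step, not the library's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not the WarcRecord implementation.
final class CrcSketch {

    // Updates the checksum with the content bytes only when a checksum object is set.
    static void feed(byte[] content, CRC32 crc) {
        if (crc != null) crc.update(content, 0, content.length);
    }

    // Computes the CRC32 of an ASCII string through feed().
    static long checksumOf(String ascii) {
        CRC32 crc = new CRC32();
        feed(ascii.getBytes(StandardCharsets.US_ASCII), crc);
        return crc.getValue();
    }

    public static void main(String[] args) {
        feed(new byte[] {1, 2, 3}, null); // null checksum: nothing is updated
        System.out.println(Long.toHexString(checksumOf("123456789"))); // prints cbf43926
    }
}
```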
public final WarcRecord.Header header
The warc header.
public MeasurableInputStream block
The warc block.
| Constructor Detail |
|---|
public WarcRecord(byte[] buffer)
Builds a warc record.
Parameters: buffer - the buffer used for header read/write buffering.
public WarcRecord()
Builds a warc record, using DEFAULT_BUFFER_SIZE bytes to buffer
header read/writes.
| Method Detail |
|---|
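public void copy(WarcRecord record)
Copies the fields of another warc record into this one.
Parameters: record - the record to copy from.
public void resetRead()
A method to allow the reuse of the present object for non-consecutive reads.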
public long skip(FastBufferedInputStream bin)
throws IOException,
WarcRecord.FormatException
A method to skip a record from an InputStream.
Parameters: bin - the FastBufferedInputStream to read from.
Returns: the value of the data-length field, or -1 if EOF has been reached.
Throws: IOException, WarcRecord.FormatException
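The return contract of skip (the data length of the skipped record, or -1 at end of stream) lends itself to a simple scanning loop. The sketch below shows that loop shape on a toy length-prefixed format invented for this example; it is not the WARC format and does not use the WarcRecord API. All class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a toy format where each record is [int length][payload].
final class SkipLoopSketch {

    // Skips one toy record; returns its payload length, or -1 at end of stream.
    static long skip(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException e) {
            return -1;
        }
        if (in.skipBytes(length) != length) throw new EOFException("truncated record");
        return length;
    }

    // Scans a whole buffer, collecting the length returned for each skipped record.
    static List<Long> skipAll(byte[] data) {
        List<Long> lengths = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (long n; (n = skip(in)) != -1; ) lengths.add(n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lengths;
    }

    // Encodes payloads in the toy format so the loop has something to scan.
    static byte[] encode(byte[]... payloads) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (byte[] p : payloads) {
                out.writeInt(p.length);
                out.write(p);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = encode(new byte[3], new byte[0], new byte[5]);
        System.out.println(skipAll(data)); // prints [3, 0, 5]
    }
}
```

With the real class, the same loop shape would presumably call skip(bin) or read(bin) on one WarcRecord instance until -1 is returned, calling resetRead() before switching to a different stream.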
public long read(FastBufferedInputStream bin)
throws IOException,
WarcRecord.FormatException
A method to read a record from an InputStream.
Parameters: bin - the FastBufferedInputStream to read from.
Returns: the data-length, or -1 if EOF has been reached.
Throws: IOException, WarcRecord.FormatException
public void write(OutputStream out)
throws IOException
A method to write this record to an OutputStream.
Parameters: out - where to write.
Throws: IOException
public String toString()
Overrides: toString in class Object