it.unimi.dsi.law.warc.io
Class WarcRecord

java.lang.Object
  extended by it.unimi.dsi.law.warc.io.WarcRecord
Direct Known Subclasses:
GZWarcRecord

public class WarcRecord
extends Object

A class to read and write WARC/0.9 records (for format details, see the WARC format specification).

To write a record, set header and block appropriately and then call write(OutputStream). After the call, the WarcRecord.Header.dataLength field of the header is updated to reflect the write operation.

To perform a sequence of consecutive reads and skips, call read(FastBufferedInputStream) or skip(FastBufferedInputStream) repeatedly on the same stream. After a read, block can (but need not) be consumed to obtain the record data. The contentType field of the header can be used to determine how to process the content of block further.

As an implementation note: skipping merely returns the value of the data-length field of the skipped record. Reading, on the other hand, parses the header and sets all the header fields accordingly; it then sets block so that it refers to the block part of the record. Observe that since block is just a "view" over the underlying stream, neither its content nor its position is guaranteed to remain the same after a subsequent read/skip on the same stream.
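The caveat about block being a "view" can be illustrated with a simplified model. The sketch below is not the library's code: BoundedView and ViewSketch are hypothetical names, and the view is assumed to be a length-bounded stream that shares the position of its parent rather than copying the data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ViewSketch {
    /** Hypothetical stand-in for a record block: at most `remaining` bytes, read straight from the parent. */
    static final class BoundedView extends InputStream {
        private final InputStream parent; // shared with the caller, not copied
        private long remaining;

        BoundedView(InputStream parent, long length) {
            this.parent = parent;
            this.remaining = length;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = parent.read();
            if (b >= 0) remaining--;
            return b;
        }
    }

    private static final byte[] DATA = "AAAABBBB".getBytes(StandardCharsets.US_ASCII);

    /** Reads everything a stream still yields, as an ASCII string. */
    static String drain(InputStream in) {
        try {
            StringBuilder sb = new StringBuilder();
            for (int b = in.read(); b != -1; b = in.read()) sb.append((char) b);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The view is consumed before the underlying stream moves on. */
    static String readImmediately() {
        ByteArrayInputStream stream = new ByteArrayInputStream(DATA);
        return drain(new BoundedView(stream, 4));
    }

    /** The underlying stream is advanced first, as a later read/skip would do. */
    static String readAfterAdvancing() {
        ByteArrayInputStream stream = new ByteArrayInputStream(DATA);
        InputStream first = new BoundedView(stream, 4); // meant to cover "AAAA"
        stream.skip(4);                                 // simulates moving to the next record
        return drain(first);
    }

    public static void main(String[] args) {
        System.out.println(readImmediately());    // AAAA
        System.out.println(readAfterAdvancing()); // BBBB
    }
}
```

In this model the safe pattern is to consume (or copy) the block before issuing the next read or skip on the same stream.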

This object can be reused for non-consecutive writes on different streams. To reuse it for non-consecutive reads/skips, however, the method resetRead() must be called whenever a read/skip does not follow a read/skip on the same stream.

This class uses internal buffering, hence it is not thread-safe.


Nested Class Summary
static class WarcRecord.ContentType
          Content types.
 class WarcRecord.FormatException
          An exception to denote parsing errors during reads.
static class WarcRecord.Header
          A class holding the fields of the warc header.
static class WarcRecord.RecordType
          Record types.
 
Field Summary
static boolean ASSERTS
           
 MeasurableInputStream block
          The warc block.
static Object2ObjectMap<byte[],WarcRecord.ContentType> BYTE_REPRESENTATION_TO_CONTENT_TYPE
           
static Object2ObjectMap<byte[],WarcRecord.RecordType> BYTE_REPRESENTATION_TO_RECORD_TYPE
           
protected  CRC32 crc
          The checksum object used in write(OutputStream) to compute the CRC32 of the content for GZWarcRecord.
static byte[] CRLF
          Some constant strings in their byte equivalent.
static boolean DEBUG
           
static int DEFAULT_BUFFER_SIZE
          The default size of the internal buffer used for headers read/write.
 WarcRecord.Header header
          The warc header.
static boolean USE_POSITION_INSTEAD_OF_SKIP
          Tells what method to use to skip bytes in the input stream.
static byte[] UUID_FIELD_NAME
          Some constant strings in their byte equivalent.
static byte[] WARC_ID
          Some constant strings in their byte equivalent.
 
Constructor Summary
WarcRecord()
          Builds a warc record.
WarcRecord(byte[] buffer)
          Builds a warc record.
 
Method Summary
 void copy(WarcRecord record)
          Copies the fields of another warc record into this warc record.
 long read(FastBufferedInputStream bin)
          A method to read a record from an InputStream.
 void resetRead()
          A method to allow the reuse of the present object for non-consecutive reads.
 long skip(FastBufferedInputStream bin)
          A method to skip a record from an InputStream.
 String toString()
           
 void write(OutputStream out)
          A method to write this record to an OutputStream.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEBUG

public static final boolean DEBUG
See Also:
Constant Field Values

ASSERTS

public static final boolean ASSERTS
See Also:
Constant Field Values

USE_POSITION_INSTEAD_OF_SKIP

public static final boolean USE_POSITION_INSTEAD_OF_SKIP
Tells which method is used to skip bytes in the input stream: positioning instead of skip, as the name suggests. It is here for profiling purposes.

See Also:
Constant Field Values

BYTE_REPRESENTATION_TO_CONTENT_TYPE

public static final Object2ObjectMap<byte[],WarcRecord.ContentType> BYTE_REPRESENTATION_TO_CONTENT_TYPE

BYTE_REPRESENTATION_TO_RECORD_TYPE

public static final Object2ObjectMap<byte[],WarcRecord.RecordType> BYTE_REPRESENTATION_TO_RECORD_TYPE
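Maps keyed by byte[] need content-based key comparison, because Java arrays inherit identity-based equals and hashCode from Object. The sketch below shows the difference using only the standard library; it makes no claim about how these two maps are actually built, and the Kind enum and the "response" key are purely illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class ByteKeySketch {
    enum Kind { RESPONSE } // hypothetical value, for illustration only

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Looks up a fresh array with the same bytes in a plain HashMap (identity-based keys). */
    static Kind lookupByIdentity() {
        Map<byte[], Kind> map = new HashMap<>();
        map.put(ascii("response"), Kind.RESPONSE);
        return map.get(ascii("response")); // different array instance: no match
    }

    /** Same lookup in a map that compares keys by content. */
    static Kind lookupByContent() {
        Comparator<byte[]> byContent = (x, y) -> Arrays.compare(x, y);
        Map<byte[], Kind> map = new TreeMap<>(byContent);
        map.put(ascii("response"), Kind.RESPONSE);
        return map.get(ascii("response"));
    }

    public static void main(String[] args) {
        System.out.println(lookupByIdentity()); // null
        System.out.println(lookupByContent());  // RESPONSE
    }
}
```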

DEFAULT_BUFFER_SIZE

public static final int DEFAULT_BUFFER_SIZE
The default size of the internal buffer used for headers read/write.

See Also:
Constant Field Values

WARC_ID

public static final byte[] WARC_ID
Some constant strings in their byte equivalent.


UUID_FIELD_NAME

public static final byte[] UUID_FIELD_NAME
Some constant strings in their byte equivalent.


CRLF

public static final byte[] CRLF
Some constant strings in their byte equivalent.


crc

protected CRC32 crc
The checksum object used in write(OutputStream) to compute the CRC32 of the content for GZWarcRecord. If null, the CRC is not updated.
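The "update only if non-null" convention can be sketched with java.util.zip.CRC32 alone. This is an assumed shape, not the body of write(OutputStream): CrcSketch and writeChunks are hypothetical names, and the actual writing of bytes is left as a comment.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class CrcSketch {
    /**
     * Feeds each chunk to the checksum when one is supplied.
     * Returns the checksum value, or -1 when crc is null (nothing was computed).
     */
    static long writeChunks(CRC32 crc, byte[]... chunks) {
        for (byte[] chunk : chunks) {
            // ... the chunk would be written to the output stream here ...
            if (crc != null) crc.update(chunk, 0, chunk.length);
        }
        return crc == null ? -1 : crc.getValue();
    }

    public static void main(String[] args) {
        byte[] a = "1234".getBytes(StandardCharsets.US_ASCII);
        byte[] b = "56789".getBytes(StandardCharsets.US_ASCII);
        // Updating in pieces gives the same value as one pass over "123456789".
        System.out.printf("%08x%n", writeChunks(new CRC32(), a, b)); // cbf43926
        System.out.println(writeChunks(null, a, b));                 // -1
    }
}
```

Because the checksum accumulates across update calls, a subclass can read the final value after the write completes without buffering the content a second time.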


header

public final WarcRecord.Header header
The warc header.


block

public MeasurableInputStream block
The warc block.

Constructor Detail

WarcRecord

public WarcRecord(byte[] buffer)
Builds a warc record.

Parameters:
buffer - the buffer used to buffer header reads and writes.

WarcRecord

public WarcRecord()
Builds a warc record. It allocates an internal buffer of DEFAULT_BUFFER_SIZE bytes to buffer header reads and writes.

Method Detail

copy

public void copy(WarcRecord record)
Copies the fields of another warc record into this warc record.

Parameters:
record - the record to copy from.

resetRead

public void resetRead()
A method to allow the reuse of the present object for non-consecutive reads.


skip

public long skip(FastBufferedInputStream bin)
          throws IOException,
                 WarcRecord.FormatException
A method to skip a record from an InputStream.

Parameters:
bin - the FastBufferedInputStream to read from.
Returns:
the value of the data-length field of the skipped record, or -1 if the end of the stream has been reached.
Throws:
IOException
WarcRecord.FormatException
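The return convention (data length on success, -1 at end of stream) lends itself to a simple scanning loop. The sketch below shows that loop on a toy format, not on WARC: each toy record is a 4-byte big-endian length followed by that many payload bytes, and SkipLoopSketch, encode, skipRecord and scan are all hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

final class SkipLoopSketch {
    /** Builds a stream of toy records with zero-filled payloads of the given lengths. */
    static byte[] encode(int... payloadLengths) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int length : payloadLengths) {
                out.writeInt(length);        // big-endian length prefix
                out.write(new byte[length]); // payload
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Skips one toy record; returns its payload length, or -1 if the stream has ended. */
    static long skipRecord(DataInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) return -1; // clean end of stream before a new record
        int length = (first << 24) | (in.readUnsignedByte() << 16)
                | (in.readUnsignedByte() << 8) | in.readUnsignedByte();
        if (in.skipBytes(length) != length) throw new EOFException("truncated record");
        return length;
    }

    /** Skips through all records; returns {record count, total payload length}. */
    static long[] scan(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long count = 0, total = 0;
            for (long length; (length = skipRecord(in)) != -1; ) {
                count++;
                total += length;
            }
            return new long[] { count, total };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = scan(encode(3, 0, 5));
        System.out.println(result[0] + " records, " + result[1] + " payload bytes"); // 3 records, 8 payload bytes
    }
}
```

Note that a zero-length record returns 0, which is why the loop tests against -1 rather than against a non-positive value.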

read

public long read(FastBufferedInputStream bin)
          throws IOException,
                 WarcRecord.FormatException
A method to read a record from an InputStream.

Parameters:
bin - the FastBufferedInputStream to read from.
Returns:
the value of the data-length field of the record read, or -1 if the end of the stream has been reached.
Throws:
IOException
WarcRecord.FormatException

write

public void write(OutputStream out)
           throws IOException
A method to write this record to an OutputStream.

Parameters:
out - where to write.
Throws:
IOException

toString

public String toString()
Overrides:
toString in class Object