Class Digester

java.lang.Object
it.unimi.dsi.law.warc.parser.Digester
All Implemented Interfaces:
Callback

public class Digester
extends Object
implements Callback
A callback computing the digest of a page.

The page is somewhat simplified before being passed (as a sequence of bytes obtained by breaking each character into its upper and lower byte) to MessageDigest.update(byte[]). All start/end tags are case-normalized, and all their content (except for the element-type name) is removed. An exception is made for the SRC attribute of FRAME and IFRAME elements, which is necessary to correctly distinguish framed pages without alternative text. These attributes are resolved with respect to the URL associated with the page.
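
The byte-splitting step described above can be illustrated with a minimal sketch. It is not this class's actual code: the class name CharBytesDigestSketch, the helper names, the upper-byte-first order and the choice of MD5 are all assumptions made for the example. Only MessageDigest comes from the JDK.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of it.unimi.dsi.law.warc.parser.
class CharBytesDigestSketch {

    /** Breaks each 16-bit char into two bytes, upper byte first (the order is an assumption). */
    static byte[] toBytes(CharSequence text) {
        byte[] bytes = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            bytes[2 * i] = (byte) (c >>> 8); // upper byte
            bytes[2 * i + 1] = (byte) c;     // lower byte
        }
        return bytes;
    }

    /** Feeds each already-simplified chunk of the page to MessageDigest.update(byte[]). */
    static byte[] digest(CharSequence... chunks) {
        try {
            // MD5 is an assumed algorithm, chosen only because every JDK provides it.
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (CharSequence chunk : chunks) {
                md.update(toBytes(chunk));
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling CharBytesDigestSketch.digest("<P>", "Hello", "</P>") would digest a case-normalized, attribute-free paragraph in three updates. Because the digest is incremental, the result is the same as digesting the concatenated text in one call.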

To avoid clashes between digests coming from different sites, you can optionally set a URL whose authority will be used to update the digest before adding the actual page text. You can set the URL with url(URI). A good idea is to use the host name (or even the authority).
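
The effect of updating the digest with the authority first can be sketched using only JDK classes. This is an illustration under stated assumptions, not the implementation behind url(URI): the class name AuthorityPrefixSketch, the UTF-8 encoding and the MD5 algorithm are chosen for the example only.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not part of it.unimi.dsi.law.warc.parser.
class AuthorityPrefixSketch {

    /** Digests the page text, first updating the digest with the URL's authority if there is one. */
    static byte[] digest(URI url, String pageText) {
        try {
            // MD5 and UTF-8 are assumptions made for this sketch.
            MessageDigest md = MessageDigest.getInstance("MD5");
            String authority = url.getAuthority(); // e.g. "example.com" or "example.com:8080"
            if (authority != null) {
                md.update(authority.getBytes(StandardCharsets.UTF_8));
            }
            md.update(pageText.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this scheme, identical text served by two different authorities yields different digests, while identical text at two paths under the same authority still yields the same digest.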