public static final class HTMLParser.DigestAppendable
extends java.lang.Object
implements java.lang.Appendable
The page is somewhat simplified before being passed (as a sequence of bytes obtained by breaking each character into the upper and lower byte) to a Hasher.
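The character-to-byte step can be illustrated with a minimal sketch. It uses the JDK's `MessageDigest` as a stand-in for the Guava `Hasher`, and it assumes the upper byte is fed before the lower byte; the class name `CharBytesSketch`, the byte order, and the choice of SHA-256 are assumptions of this example, not details confirmed by the documentation above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CharBytesSketch {
    /** Splits every char into two bytes. Upper byte first is an assumption of this sketch. */
    static byte[] toBytes(CharSequence text) {
        byte[] out = new byte[text.length() * 2];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            out[2 * i] = (byte) (c >>> 8); // upper byte
            out[2 * i + 1] = (byte) c;     // lower byte
        }
        return out;
    }

    /** Digests the two-bytes-per-char form; SHA-256 is a stand-in for the real hasher. */
    static byte[] digest(CharSequence text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(toBytes(text));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes fed for \"abc\": " + toBytes("abc").length);
        System.out.println("digest length: " + digest("abc").length);
    }
}
```

Feeding two bytes per character keeps the digest independent of any charset encoding of the page.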
All start/end tags are case-normalized, and their whole content (except for the element-type name) is removed. An exception is made for the SRC attribute of FRAME and IFRAME elements, as they are necessary to correctly distinguish framed pages without alternative text. These attributes are resolved w.r.t. the URL associated with the page. Moreover, non-HTML tags are substituted with a special tag unknown.
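The tag rules above can be sketched roughly as follows. This is only an illustration: the class `TagNormalizationSketch`, the tiny `KNOWN` element set, the choice of lower case, and the exact textual form of the normalized tags are all assumptions of the example, and the real class works on Jericho `StartTag`/`EndTag` objects rather than on strings.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TagNormalizationSketch {
    // Hypothetical, deliberately tiny set of element names treated as "HTML".
    static final Set<String> KNOWN = new HashSet<>(
            Arrays.asList("html", "body", "p", "a", "div", "frame", "iframe"));

    /** Normalized form of a start tag; src may be null when the attribute is absent. */
    static String startTag(String name, String src, URI pageUrl) {
        String n = name.toLowerCase(Locale.ROOT); // case normalization (lower case assumed)
        if (!KNOWN.contains(n)) return "<unknown>"; // non-HTML tags collapse to one tag
        if ((n.equals("frame") || n.equals("iframe")) && src != null) {
            // Only SRC survives, resolved against the page URL.
            return "<" + n + " src=\"" + pageUrl.resolve(src) + "\">";
        }
        return "<" + n + ">"; // every other attribute is dropped
    }

    /** Normalized form of an end tag. */
    static String endTag(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        return KNOWN.contains(n) ? "</" + n + ">" : "</unknown>";
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/a/index.html");
        System.out.println(startTag("DIV", null, page));
        System.out.println(startTag("IFrame", "frame.html", page));
        System.out.println(startTag("my-widget", null, page));
        System.out.println(endTag("P"));
    }
}
```

Two frameset pages that differ only in the documents they frame would otherwise reduce to the same tag sequence, which is why the resolved SRC is kept.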
As for the text, every digit is replaced by whitespace, and each maximal nonempty sequence of whitespace is coalesced into a single space. A tag counts as a non-whitespace character.
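The text rule can be sketched as a small state machine. The class `TextNormalizationSketch` and its `normalize` method are hypothetical names for this example; the `lastWasSpace` flag plays a role similar to what the `lastAppendedWasSpace` field suggests, but the sketch returns a string instead of updating a hasher.

```java
class TextNormalizationSketch {
    /** Replaces digits with whitespace and coalesces whitespace runs into one space. */
    static String normalize(CharSequence text) {
        StringBuilder out = new StringBuilder();
        boolean lastWasSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isDigit(c)) c = ' '; // digits behave like whitespace
            if (Character.isWhitespace(c)) {
                if (!lastWasSpace) out.append(' '); // emit one space per run
                lastWasSpace = true;
            } else {
                out.append(c);
                lastWasSpace = false; // a tag would reset the flag in the same way
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("Page 12 of  34\n") + "]");
    }
}
```

Under this rule, pages that differ only in counters, dates, or spacing produce the same text stream, so they can be detected as near-duplicates.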
To avoid clashes between digests coming from different sites, you can optionally set a URL (passed to the init(URI) method) whose scheme+authority will be used to update the digest before adding the actual text page.
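A minimal sketch of the site-seeding idea follows, again using the JDK's `MessageDigest` in place of the Guava hasher. The class `SiteSeededDigestSketch`, the zero-byte separator, the UTF-8 encoding, and SHA-256 are assumptions made for the example; only the idea of mixing in the scheme and authority before the page text comes from the description above.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SiteSeededDigestSketch {
    /** Digests text, optionally prefixed by the scheme and authority of url (may be null). */
    static byte[] digest(URI url, String text) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        if (url != null) {
            // Seed with the site identity only; path and query are ignored.
            String site = url.getScheme() + "://" + url.getRawAuthority();
            md.update(site.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between seed and text (assumed)
        }
        md.update(text.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    public static void main(String[] args) {
        byte[] a = digest(URI.create("http://a.example/x"), "Not found");
        byte[] b = digest(URI.create("http://b.example/x"), "Not found");
        System.out.println("same digest across sites: " + java.util.Arrays.equals(a, b));
    }
}
```

With the seed in place, identical boilerplate pages served by different sites no longer collide, while duplicates within one site still do.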
Additionally, since BUbiNG 0.9.10, location redirect URLs (both from headers and from META elements), if present, are mixed into the digest to avoid collapsing 3xx pages with boilerplate text.
Modifier and Type | Field | Description |
---|---|---|
protected byte[] | digest | |
protected static Reference2ObjectOpenHashMap<java.lang.String,byte[]> | endTags | Cached byte representations of all closing tags. |
protected com.google.common.hash.Hasher | hasher | The hasher currently used to compute the digest. |
protected com.google.common.hash.HashFunction | hashFunction | The message digest used to compute the digest. |
protected boolean | lastAppendedWasSpace | True iff the last character appended was a space. |
protected static Reference2ObjectOpenHashMap<java.lang.String,byte[]> | startTags | Cached byte representations of all opening tags. |
Constructor | Description |
---|---|
DigestAppendable(com.google.common.hash.HashFunction hashFunction) | Create a digest appendable using a given hash function. |
Modifier and Type | Method | Description |
---|---|---|
java.lang.Appendable | append(char c) | |
java.lang.Appendable | append(java.lang.CharSequence csq) | |
java.lang.Appendable | append(java.lang.CharSequence csq, int start, int end) | |
byte[] | digest() | |
void | endTag(net.htmlparser.jericho.EndTag endTag) | |
void | init(java.net.URI url) | Initializes the digest computation. |
void | startTag(net.htmlparser.jericho.StartTag startTag) | |
protected static final Reference2ObjectOpenHashMap<java.lang.String,byte[]> startTags
protected static final Reference2ObjectOpenHashMap<java.lang.String,byte[]> endTags
protected final com.google.common.hash.HashFunction hashFunction
protected com.google.common.hash.Hasher hasher
protected boolean lastAppendedWasSpace
protected byte[] digest
public DigestAppendable(com.google.common.hash.HashFunction hashFunction)
hashFunction - the hash function used to digest.
public void init(java.net.URI url)
url - a URL, or null for no URL. In the former case, the host name will be used to initialize the digest.
public java.lang.Appendable append(java.lang.CharSequence csq, int start, int end)
append in interface java.lang.Appendable
public java.lang.Appendable append(char c)
append in interface java.lang.Appendable
public java.lang.Appendable append(java.lang.CharSequence csq)
append in interface java.lang.Appendable
public byte[] digest()
public void startTag(net.htmlparser.jericho.StartTag startTag)
public void endTag(net.htmlparser.jericho.EndTag endTag)