All implemented interfaces: com.google.common.base.Predicate<URIResponse>, Parser<T>, Filter<URIResponse>, FlyweightPrototype<Filter<URIResponse>>, java.util.function.Predicate<URIResponse>

public class HTMLParser<T> extends java.lang.Object implements Parser<T>
HttpResponse. Instances are heavyweight—they
should be pooled and shared, since their usage is transitory and CPU-intensive.
Warning: starting with version 0.9.15, this parser will not
return links marked with NoFollow unless otherwise specified in the
constructor.
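
Because instances are meant to be pooled and shared, a small fixed-size pool is one way to hand them out to worker threads. The following is a minimal sketch and not part of this class: `ParserPoolSketch` and its `with` method are hypothetical names, and in `main` a `StringBuilder` merely stands in for a heavyweight parser instance.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical helper: a fixed-size pool that lends out heavyweight objects.
final class ParserPoolSketch<P> {
    private final BlockingQueue<P> idle;

    ParserPoolSketch(int size, Supplier<P> factory) {
        idle = new ArrayBlockingQueue<>(size);
        // Build all instances up front; none are created afterwards.
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    // Borrows an instance, runs the task, and always returns the instance to the pool.
    <R> R with(Function<P, R> task) {
        final P parser;
        try {
            parser = idle.take(); // blocks while all instances are in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a parser", e);
        }
        try {
            return task.apply(parser);
        } finally {
            idle.add(parser);
        }
    }

    public static void main(String[] args) {
        // A StringBuilder stands in for a parser here.
        ParserPoolSketch<StringBuilder> pool = new ParserPoolSketch<>(2, StringBuilder::new);
        int length = pool.with(sb -> {
            sb.setLength(0); // reset any state left by the previous borrower
            sb.append("page");
            return sb.length();
        });
        System.out.println(length);
    }
}
```

Resetting the borrowed instance before use matters because a pooled object keeps whatever state the previous borrower left behind.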
| Modifier and Type | Class | Description |
|---|---|---|
| static class | HTMLParser.DigestAppendable | A class computing the digest of a page. |
| static class | HTMLParser.SetLinkReceiver | An implementation of a Parser.LinkReceiver that accumulates the URLs in a public set. |
Nested classes inherited from Parser: Parser.LinkReceiver, Parser.TextProcessor<T>

| Modifier and Type | Field | Description |
|---|---|---|
| protected char[] | buffer | The character buffer. |
| static int | CHAR_BUFFER_SIZE | The size of the internal Jericho buffer. |
| protected static java.util.regex.Pattern | CHARSET_PATTERN | Used by getCharsetName(byte[], int). |
| protected static java.util.regex.Pattern | CONTENT_PATTERN | Used by getCharsetName(byte[], int). |
| protected boolean | crossAuthorityDuplicates | If true, pages with the same content but with different authorities are considered duplicates. |
| protected HTMLParser.DigestAppendable | digestAppendable | An object embodying the digest logic, or null for no digest computation. |
| protected java.lang.String | guessedCharset | The charset we guessed for the last response. |
| protected static java.util.regex.Pattern | HTTP_EQUIV_PATTERN | Used by getCharsetName(byte[], int). |
| protected java.net.URI | location | The location URL from headers of the last response, if any, or null. |
| protected static TextPattern | META_PATTERN | Used by getCharsetName(byte[], int). |
| protected java.net.URI | metaLocation | The location URL from META elements of the last response, if any, or null. |
| protected Parser.TextProcessor<T> | textProcessor | A text processor, or null. |
| protected static TextPattern | URLEQUAL_PATTERN | The pattern prefixing the URL in a META HTTP-EQUIV element of refresh type. |
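
To make the effect of crossAuthorityDuplicates concrete, the sketch below shows one plausible way a digest can distinguish or merge pages with equal content on different authorities: when the flag is false, the scheme and authority are mixed into the digest before the content. This is an assumption for illustration, not the logic of HTMLParser.DigestAppendable; `CrossAuthorityDigestSketch`, its `digest` method, and the choice of SHA-256 are all hypothetical.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration of content digests with and without cross-authority duplicates.
final class CrossAuthorityDigestSketch {
    static byte[] digest(URI uri, String content, boolean crossAuthorityDuplicates) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        if (!crossAuthorityDuplicates) {
            // Assumed approach: prefix the digest input with scheme+authority,
            // so equal content on different authorities yields different digests.
            md.update((uri.getScheme() + "://" + uri.getAuthority()).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between prefix and content
        }
        md.update(content.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    public static void main(String[] args) {
        URI a = URI.create("http://a.example/x");
        URI b = URI.create("http://b.example/y");
        // Same content, different authorities.
        System.out.println(Arrays.equals(digest(a, "same page", true), digest(b, "same page", true)));   // true
        System.out.println(Arrays.equals(digest(a, "same page", false), digest(b, "same page", false))); // false
    }
}
```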
Inherited fields: FILTER_PACKAGE_NAME, NULL_LINK_RECEIVER

| Constructor | Description |
|---|---|
| HTMLParser() | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction only (no digesting). |
| HTMLParser(com.google.common.hash.HashFunction hashFunction) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(com.google.common.hash.HashFunction hashFunction, boolean crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow, int bufferSize) | Builds a parser for link extraction and, possibly, digesting a page. |
| HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, int bufferSize) | Builds a parser for link extraction and, possibly, digesting a page. |
| HTMLParser(java.lang.String messageDigest) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(java.lang.String messageDigest, java.lang.String crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates, java.lang.String returnNoFollow) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
| Modifier and Type | Method | Description |
|---|---|---|
| boolean | apply(URIResponse uriResponse) | |
| HTMLParser<T> | clone() | |
| HTMLParser<T> | copy() | This method strengthens the return type of the method inherited from Filter. |
| static java.lang.String | getCharsetName(byte[] buffer, int length) | Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. |
| static java.lang.String | getCharsetNameFromHeader(java.lang.String headerValue) | Extracts the charset name from the header value of a content-type header using a regular expression. |
| java.lang.String | guessedCharset() | Returns a guessed charset for the document, or null if the charset could not be guessed. |
| java.net.URI | location() | Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. |
| static void | main(java.lang.String[] arg) | |
| byte[] | parse(java.net.URI uri, org.apache.http.HttpResponse httpResponse, Parser.LinkReceiver linkReceiver) | Parses a response. |
| protected void | process(Parser.LinkReceiver linkReceiver, java.net.URI base, java.lang.String s) | Pre-processes a string that represents a raw link found in the page, trying to derelativize it. |
| T | result() | Returns the result of the processing. |
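
The summary of getCharsetNameFromHeader mentions regex-based extraction from a content-type header value. A minimal sketch of that idea follows; the pattern and the names `HeaderCharsetSketch` and `charsetFromHeader` are hypothetical and are not the library's own CHARSET_PATTERN or method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of extracting a charset from a content-type header value.
final class HeaderCharsetSketch {
    // Assumed pattern: "charset=" followed by an optionally quoted token.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset name, or null if the header value specifies none.
    static String charsetFromHeader(String headerValue) {
        if (headerValue == null) return null;
        Matcher m = CHARSET.matcher(headerValue);
        // The name is returned as written; it may still be invalid or unavailable.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromHeader("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetFromHeader("text/html"));                // null
    }
}
```

A purely textual match like this one is fooled by `charset=` appearing inside another parameter's quoted value, which is the kind of input the method's warning describes.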
protected static final TextPattern URLEQUAL_PATTERN
The pattern prefixing the URL in a META HTTP-EQUIV element of refresh type.

public static final int CHAR_BUFFER_SIZE
The size of the internal Jericho buffer.

protected final char[] buffer
The character buffer.

protected java.lang.String guessedCharset
The charset we guessed for the last response.

protected final HTMLParser.DigestAppendable digestAppendable
An object embodying the digest logic, or null for no digest computation.

protected final Parser.TextProcessor<T> textProcessor
A text processor, or null.

protected java.net.URI location
The location URL from headers of the last response, if any, or null.

protected java.net.URI metaLocation
The location URL from META elements of the last response, if any, or null.

protected boolean crossAuthorityDuplicates
If true, pages with the same content but with different authorities are considered duplicates.

protected static final TextPattern META_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern HTTP_EQUIV_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern CONTENT_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern CHARSET_PATTERN
Used by getCharsetName(byte[], int).

public HTMLParser(com.google.common.hash.HashFunction hashFunction)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. By default, only pages from within the same scheme+authority may be considered to be duplicates.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.

public HTMLParser(com.google.common.hash.HashFunction hashFunction,
Parser.TextProcessor<T> textProcessor,
boolean crossAuthorityDuplicates,
boolean returnNoFollow,
int bufferSize)
Builds a parser for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
returnNoFollow - forces links marked as NoFollow to be returned as well.
bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.

public HTMLParser(com.google.common.hash.HashFunction hashFunction,
Parser.TextProcessor<T> textProcessor,
boolean crossAuthorityDuplicates,
int bufferSize)
Builds a parser for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.

public HTMLParser(com.google.common.hash.HashFunction hashFunction,
Parser.TextProcessor<T> textProcessor,
boolean crossAuthorityDuplicates,
boolean returnNoFollow)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
returnNoFollow - forces links marked as NoFollow to be returned as well.

public HTMLParser(com.google.common.hash.HashFunction hashFunction,
boolean crossAuthorityDuplicates)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.

public HTMLParser(com.google.common.hash.HashFunction hashFunction,
Parser.TextProcessor<T> textProcessor,
boolean crossAuthorityDuplicates)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.

public HTMLParser(java.lang.String messageDigest)
throws java.security.NoSuchAlgorithmException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. (No cross-authority duplicates are considered.)
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
Throws:
java.security.NoSuchAlgorithmException

public HTMLParser(java.lang.String messageDigest,
java.lang.String crossAuthorityDuplicates)
throws java.security.NoSuchAlgorithmException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
crossAuthorityDuplicates - either "true" or "false"; determines whether cross-authority duplicates are checked.
Throws:
java.security.NoSuchAlgorithmException

public HTMLParser(java.lang.String messageDigest,
java.lang.String textProcessorSpec,
java.lang.String crossAuthorityDuplicates)
throws java.security.NoSuchAlgorithmException,
java.lang.IllegalArgumentException,
java.lang.ClassNotFoundException,
java.lang.IllegalAccessException,
java.lang.reflect.InvocationTargetException,
java.lang.InstantiationException,
java.lang.NoSuchMethodException,
java.io.IOException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
crossAuthorityDuplicates - either "true" or "false"; determines whether cross-authority duplicates are checked.
Throws:
java.security.NoSuchAlgorithmException
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
java.lang.IllegalAccessException
java.lang.reflect.InvocationTargetException
java.lang.InstantiationException
java.lang.NoSuchMethodException
java.io.IOException

public HTMLParser(java.lang.String messageDigest,
java.lang.String textProcessorSpec,
java.lang.String crossAuthorityDuplicates,
java.lang.String returnNoFollow)
throws java.security.NoSuchAlgorithmException,
java.lang.IllegalArgumentException,
java.lang.ClassNotFoundException,
java.lang.IllegalAccessException,
java.lang.reflect.InvocationTargetException,
java.lang.InstantiationException,
java.lang.NoSuchMethodException,
java.io.IOException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
crossAuthorityDuplicates - either "true" or "false"; determines whether cross-authority duplicates are checked.
returnNoFollow - forces links marked as NoFollow to be returned as well.
Throws:
java.security.NoSuchAlgorithmException
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
java.lang.IllegalAccessException
java.lang.reflect.InvocationTargetException
java.lang.InstantiationException
java.lang.NoSuchMethodException
java.io.IOException

public HTMLParser()
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction only (no digesting).

protected void process(Parser.LinkReceiver linkReceiver, java.net.URI base, java.lang.String s)
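Pre-processes a string that represents a raw link found in the page, trying to derelativize it.
Parameters:
linkReceiver - the link receiver that will receive the resulting URL.
base - the base URL to be used to derelativize the link.
s - the raw link to be derelativized.

public byte[] parse(java.net.URI uri,
org.apache.http.HttpResponse httpResponse,
Parser.LinkReceiver linkReceiver)
throws java.io.IOException
Description copied from interface: Parser
Parses a response.

public java.lang.String guessedCharset()
Description copied from interface: Parser
Returns a guessed charset for the document, or null if the charset could not be guessed.
Specified by:
guessedCharset in interface Parser<T>
Returns:
a charset, or null.

public java.net.URI location()
Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
Returns:
the location (or metalocation), if present; null otherwise.

public static java.lang.String getCharsetName(byte[] buffer,
int length)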
Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).

Beware: it might not work if the value of some attribute in a meta tag contains a string matching (case insensitively) the regular expression http-equiv\s*=\s*('|")content-type('|"), or content\s*=\s*('|")[^"']*('|").
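
The approach described above can be sketched with plain java.util.regex. The two attribute patterns below follow the regular expressions quoted in the warning, with a capturing group added to the second; the tag pattern, the charset pattern, and the names `MetaCharsetSketch` and `charsetFromMeta` are assumptions made for illustration and do not reproduce the actual implementation, which relies on its own META_PATTERN.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of reading a charset from a META HTTP-EQUIV element.
final class MetaCharsetSketch {
    private static final Pattern META =
        Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*('|\")content-type('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*('|\")([^\"']*)('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([^\\s;]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset name found, or null; the name is not validated.
    static String charsetFromMeta(byte[] buffer, int length) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(text);
        while (meta.find()) {
            String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue; // not a content-type element
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) continue;
            Matcher charset = CHARSET.matcher(content.group(2));
            if (charset.find()) return charset.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] page = "<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"></head>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(charsetFromMeta(page, page.length)); // utf-8
    }
}
```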
Parameters:
buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
length - the number of significant bytes in the buffer.
Returns:
the charset name, or null if no charset is specified; note that the charset might be not valid or not available.

public static java.lang.String getCharsetNameFromHeader(java.lang.String headerValue)
Extracts the charset name from the header value of a content-type header using a regular expression.

Warning: it might not work if someone puts the string charset= in a string inside some attribute/value pair.
Parameters:
headerValue - the value of a content-type header.
Returns:
the charset name, or null if no charset is specified; note that the charset might be not valid or not available.

public boolean apply(URIResponse uriResponse)
Specified by:
apply in interface com.google.common.base.Predicate<T>

public HTMLParser<T> clone()
Overrides:
clone in class java.lang.Object

public HTMLParser<T> copy()
Description copied from interface: Parser
This method strengthens the return type of the method inherited from Filter.

public T result()
Description copied from interface: Parser
Returns the result of the processing. Note that this method must be idempotent.
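
The process method is described as trying to derelativize a raw link against a base URL. A minimal sketch of that step using only java.net.URI is shown below; `DerelativizeSketch` and `derelativize` are hypothetical names, and the real method may parse and normalize links differently (the documentation of location() mentions BURL).

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration of resolving a raw link against a base URI.
final class DerelativizeSketch {
    // Returns the absolute form of rawLink relative to base, or null if rawLink is not a valid URI.
    static URI derelativize(URI base, String rawLink) {
        try {
            // Links in pages often carry stray whitespace, so trim before parsing.
            return base.resolve(new URI(rawLink.trim())).normalize();
        } catch (URISyntaxException e) {
            return null; // a malformed link is simply dropped in this sketch
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/a/b.html");
        System.out.println(derelativize(base, "../c.html"));           // http://example.com/c.html
        System.out.println(derelativize(base, "c.html"));              // http://example.com/a/c.html
        System.out.println(derelativize(base, "https://other.org/x")); // https://other.org/x
    }
}
```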
public static void main(java.lang.String[] arg)
throws java.lang.IllegalArgumentException,
java.io.IOException,
java.net.URISyntaxException,
JSAPException,
java.security.NoSuchAlgorithmException
Throws:
java.lang.IllegalArgumentException
java.io.IOException
java.net.URISyntaxException
JSAPException
java.security.NoSuchAlgorithmException