All Implemented Interfaces: com.google.common.base.Predicate<URIResponse>, Parser<T>, Filter<URIResponse>, FlyweightPrototype<Filter<URIResponse>>, java.util.function.Predicate<URIResponse>
public class HTMLParser<T> extends java.lang.Object implements Parser<T>
Parses an HttpResponse. Instances are heavyweight: they should be pooled and shared, since their usage is transitory and CPU-intensive.
Warning: starting with version 0.9.15, this parser will not return links marked with NoFollow unless otherwise specified in the constructor.
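Because instances are heavyweight, a crawler would typically keep a few of them and borrow one per response. The following is a minimal sketch of that pooling pattern using only the JDK. The class ParserPool and its methods are hypothetical names introduced for illustration and are not part of this API; the factory stands in for whatever code builds a parser instance.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical pool for heavyweight objects such as parsers (illustration only). */
final class ParserPool<P> {
    private final ConcurrentLinkedQueue<P> idle = new ConcurrentLinkedQueue<>();
    private final Supplier<P> factory;

    ParserPool(Supplier<P> factory) {
        this.factory = factory;
    }

    /** Borrows an instance, runs the task with it, and always gives it back to the pool. */
    <R> R with(Function<P, R> task) {
        P parser = idle.poll();                      // reuse an idle instance if there is one
        if (parser == null) parser = factory.get();  // otherwise build a new one
        try {
            return task.apply(parser);
        } finally {
            idle.add(parser);                        // make the instance available again
        }
    }

    /** Number of instances currently idle in the pool. */
    int idleCount() {
        return idle.size();
    }
}
```

Each thread holds an instance only for the duration of one task, so the number of live instances is bounded by the peak number of concurrent tasks.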
Modifier and Type | Class | Description |
---|---|---|
static class | HTMLParser.DigestAppendable | A class computing the digest of a page. |
static class | HTMLParser.SetLinkReceiver | An implementation of a Parser.LinkReceiver that accumulates the URLs in a public set. |
Parser.LinkReceiver, Parser.TextProcessor<T>
Modifier and Type | Field | Description |
---|---|---|
protected char[] | buffer | The character buffer. |
static int | CHAR_BUFFER_SIZE | The size of the internal Jericho buffer. |
protected static java.util.regex.Pattern | CHARSET_PATTERN | Used by getCharsetName(byte[], int). |
protected static java.util.regex.Pattern | CONTENT_PATTERN | Used by getCharsetName(byte[], int). |
protected boolean | crossAuthorityDuplicates | If true, pages with the same content but with different authorities are considered duplicates. |
protected HTMLParser.DigestAppendable | digestAppendable | An object embodying the digest logic, or null for no digest computation. |
protected java.lang.String | guessedCharset | The charset we guessed for the last response. |
protected static java.util.regex.Pattern | HTTP_EQUIV_PATTERN | Used by getCharsetName(byte[], int). |
protected java.net.URI | location | The location URL from headers of the last response, if any, or null. |
protected static TextPattern | META_PATTERN | Used by getCharsetName(byte[], int). |
protected java.net.URI | metaLocation | The location URL from META elements of the last response, if any, or null. |
protected Parser.TextProcessor<T> | textProcessor | A text processor, or null. |
protected static TextPattern | URLEQUAL_PATTERN | The pattern prefixing the URL in a META HTTP-EQUIV element of refresh type. |
FILTER_PACKAGE_NAME
NULL_LINK_RECEIVER
Constructor | Description |
---|---|
HTMLParser() | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction only (no digesting). |
HTMLParser(com.google.common.hash.HashFunction hashFunction) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(com.google.common.hash.HashFunction hashFunction, boolean crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow, int bufferSize) | Builds a parser for link extraction and, possibly, digesting a page. |
HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, int bufferSize) | Builds a parser for link extraction and, possibly, digesting a page. |
HTMLParser(java.lang.String messageDigest) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(java.lang.String messageDigest, java.lang.String crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates, java.lang.String returnNoFollow) | Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. |
Modifier and Type | Method | Description |
---|---|---|
boolean | apply(URIResponse uriResponse) | |
HTMLParser<T> | clone() | |
HTMLParser<T> | copy() | This method strengthens the return type of the method inherited from Filter. |
static java.lang.String | getCharsetName(byte[] buffer, int length) | Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. |
static java.lang.String | getCharsetNameFromHeader(java.lang.String headerValue) | Extracts the charset name from the header value of a content-type header using a regular expression. |
java.lang.String | guessedCharset() | Returns a guessed charset for the document, or null if the charset could not be guessed. |
java.net.URI | location() | Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. |
static void | main(java.lang.String[] arg) | |
byte[] | parse(java.net.URI uri, org.apache.http.HttpResponse httpResponse, Parser.LinkReceiver linkReceiver) | Parses a response. |
protected void | process(Parser.LinkReceiver linkReceiver, java.net.URI base, java.lang.String s) | Pre-processes a string that represents a raw link found in the page, trying to derelativize it. |
T | result() | Returns the result of the processing. |
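To illustrate what derelativizing a raw link involves, here is a minimal sketch built on java.net.URI.resolve. It is not the implementation of process: the class name LinkResolver, the trimming of the raw link, and the choice to return null for malformed links are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper showing how a raw link can be resolved against a base URI. */
final class LinkResolver {
    /** Resolves a raw link against a base URI; returns null if the link is not a valid URI. */
    static URI derelativize(URI base, String rawLink) {
        try {
            URI link = new URI(rawLink.trim());    // may throw on illegal characters
            return base.resolve(link).normalize(); // absolute links are returned unchanged
        } catch (URISyntaxException e) {
            return null; // assumption: malformed links are silently skipped
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/a/b/index.html");
        System.out.println(derelativize(base, "../c/page.html"));
        System.out.println(derelativize(base, "http://other.example/x"));
    }
}
```

A real crawler would usually apply further normalization (for example, dropping fragments) before handing the URL to a link receiver.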
protected static final TextPattern URLEQUAL_PATTERN
The pattern prefixing the URL in a META HTTP-EQUIV element of refresh type.

public static final int CHAR_BUFFER_SIZE
The size of the internal Jericho buffer.

protected final char[] buffer
The character buffer.

protected java.lang.String guessedCharset
The charset we guessed for the last response.

protected final HTMLParser.DigestAppendable digestAppendable
An object embodying the digest logic, or null for no digest computation.

protected final Parser.TextProcessor<T> textProcessor
A text processor, or null.

protected java.net.URI location
The location URL from headers of the last response, if any, or null.

protected java.net.URI metaLocation
The location URL from META elements of the last response, if any, or null.

protected boolean crossAuthorityDuplicates
If true, pages with the same content but with different authorities are considered duplicates.

protected static final TextPattern META_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern HTTP_EQUIV_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern CONTENT_PATTERN
Used by getCharsetName(byte[], int).

protected static final java.util.regex.Pattern CHARSET_PATTERN
Used by getCharsetName(byte[], int).

public HTMLParser(com.google.common.hash.HashFunction hashFunction)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. By default, only pages from within the same scheme+authority may be considered to be duplicates.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.

public HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow, int bufferSize)
Builds a parser for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
returnNoFollow - forces the parser to also return links marked as NoFollow.
bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.

public HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, int bufferSize)
Builds a parser for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.

public HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
returnNoFollow - forces the parser to also return links marked as NoFollow.

public HTMLParser(com.google.common.hash.HashFunction hashFunction, boolean crossAuthorityDuplicates)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.

public HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates)
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
hashFunction - the hash function used to digest, null if no digesting will be performed.
textProcessor - a text processor, or null if no text processing is required.
crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.

public HTMLParser(java.lang.String messageDigest) throws java.security.NoSuchAlgorithmException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. (No cross-authority duplicates are considered.)
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
Throws:
java.security.NoSuchAlgorithmException
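The convention "algorithm name, or the empty string for no digest" can be illustrated with the JDK class java.security.MessageDigest. This is only a sketch of the convention: how the constructor actually maps the name to a hash function is not stated here, and DigestSpec.forName is a hypothetical helper.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper mirroring the "name or empty string" convention. */
final class DigestSpec {
    /** Returns a digest for the given algorithm name, or null if the name is empty. */
    static MessageDigest forName(String messageDigest) throws NoSuchAlgorithmException {
        if (messageDigest.isEmpty()) return null;        // empty string: no digest is computed
        return MessageDigest.getInstance(messageDigest); // throws for unknown algorithm names
    }
}
```

For instance, DigestSpec.forName("SHA-256") yields a 32-byte digest, while an unknown name surfaces as a NoSuchAlgorithmException, matching the exception declared by the constructor.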
public HTMLParser(java.lang.String messageDigest, java.lang.String crossAuthorityDuplicates) throws java.security.NoSuchAlgorithmException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
crossAuthorityDuplicates - a string, either "true" or "false", that determines whether cross-authority duplicates are checked.
Throws:
java.security.NoSuchAlgorithmException
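Since the string may only be "true" or "false", callers may want to validate it before constructing a parser: Boolean.parseBoolean silently maps every other value to false. A minimal sketch with a hypothetical helper, StrictBoolean.parse, follows; whether the constructor itself rejects other values, and with which exception, is an assumption not confirmed by this page.

```java
/** Hypothetical strict parser for boolean constructor arguments passed as strings. */
final class StrictBoolean {
    /** Parses exactly "true" or "false"; any other value is rejected. */
    static boolean parse(String value) {
        if ("true".equals(value)) return true;
        if ("false".equals(value)) return false;
        throw new IllegalArgumentException("Expected \"true\" or \"false\", got: " + value);
    }
}
```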
public HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates) throws java.security.NoSuchAlgorithmException, java.lang.IllegalArgumentException, java.lang.ClassNotFoundException, java.lang.IllegalAccessException, java.lang.reflect.InvocationTargetException, java.lang.InstantiationException, java.lang.NoSuchMethodException, java.io.IOException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
crossAuthorityDuplicates - a string, either "true" or "false", that determines whether cross-authority duplicates are checked.
Throws:
java.security.NoSuchAlgorithmException
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
java.lang.IllegalAccessException
java.lang.reflect.InvocationTargetException
java.lang.InstantiationException
java.lang.NoSuchMethodException
java.io.IOException
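One way to obtain the duplicate semantics controlled by crossAuthorityDuplicates is to mix the scheme and authority of the page URL into the digest when the flag is false, so that identical content served by different sites yields different digests. The sketch below illustrates that idea with java.security.MessageDigest. It is an assumption about the technique, not the code of this class, and PageDigester is a hypothetical name.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical digester showing how scheme+authority can scope duplicate detection. */
final class PageDigester {
    static byte[] digest(URI uri, String content, boolean crossAuthorityDuplicates)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        if (!crossAuthorityDuplicates) {
            // Prefix the content with scheme+authority: equal pages on different sites now differ.
            String site = uri.getScheme() + "://" + uri.getAuthority();
            md.update(site.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between the site prefix and the content
        }
        md.update(content.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }
}
```

With the flag set to true the URL plays no role, so two pages with equal content always collide regardless of where they were fetched from.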
public HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates, java.lang.String returnNoFollow) throws java.security.NoSuchAlgorithmException, java.lang.IllegalArgumentException, java.lang.ClassNotFoundException, java.lang.IllegalAccessException, java.lang.reflect.InvocationTargetException, java.lang.InstantiationException, java.lang.NoSuchMethodException, java.io.IOException
Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
Parameters:
messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
crossAuthorityDuplicates - a string, either "true" or "false", that determines whether cross-authority duplicates are checked.
returnNoFollow - forces the parser to also return links marked as NoFollow.
Throws:
java.security.NoSuchAlgorithmException
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
java.lang.IllegalAccessException
java.lang.reflect.InvocationTargetException
java.lang.InstantiationException
java.lang.NoSuchMethodException
java.io.IOException
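The effect of returnNoFollow can be pictured as a filter on anchors whose rel attribute contains the nofollow token. The following sketch uses deliberately naive regular expressions over anchor tags, whereas a real parser works on a proper HTML token stream. NoFollowFilter and its patterns are hypothetical and only illustrate the flag's semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical link extractor that honors a returnNoFollow flag (illustration only). */
final class NoFollowFilter {
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL =
            Pattern.compile("rel\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns the href values found in the page, skipping nofollow links unless requested. */
    static List<String> links(String html, boolean returnNoFollow) {
        List<String> result = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attributes = anchor.group(1);
            Matcher href = HREF.matcher(attributes);
            if (!href.find()) continue; // anchor without a link target
            Matcher rel = REL.matcher(attributes);
            boolean noFollow = rel.find() && containsToken(rel.group(2), "nofollow");
            if (!noFollow || returnNoFollow) result.add(href.group(2));
        }
        return result;
    }

    /** Checks whether a space-separated attribute value contains the token, ignoring case. */
    private static boolean containsToken(String value, String token) {
        for (String t : value.trim().split("\\s+")) {
            if (t.equalsIgnoreCase(token)) return true;
        }
        return false;
    }
}
```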
public HTMLParser()
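Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction only (no digesting).

protected void process(Parser.LinkReceiver linkReceiver, java.net.URI base, java.lang.String s)
Pre-processes a string that represents a raw link found in the page, trying to derelativize it.
Parameters:
linkReceiver - the link receiver that will receive the resulting URL.
base - the base URL to be used to derelativize the link.
s - the raw link to be derelativized.

public byte[] parse(java.net.URI uri, org.apache.http.HttpResponse httpResponse, Parser.LinkReceiver linkReceiver) throws java.io.IOException
Description copied from interface: Parser
Parses a response.

public java.lang.String guessedCharset()
Description copied from interface: Parser
Returns a guessed charset for the document, or null if the charset could not be guessed.
Specified by:
guessedCharset in interface Parser<T>
Returns:
the guessed charset, or null.

public java.net.URI location()
Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
Returns:
the location (or metalocation), if present; null otherwise.

public static java.lang.String getCharsetName(byte[] buffer, int length)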
Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).
Beware: it might not work if the value of some attribute in a meta tag contains a string matching (case insensitively) the regular expression http-equiv\s*=\s*('|")content-type('|"), or content\s*=\s*('|")[^"']*('|").
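The approach described above can be sketched with java.util.regex alone. The two attribute patterns are taken from the text; everything else (the class name MetaCharset, the pattern that locates meta tags, and the pattern that pulls the charset out of the content value) is an assumption made for the example and may differ from what getCharsetName actually does.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical re-creation of META HTTP-EQUIV charset sniffing (illustration only). */
final class MetaCharset {
    private static final Pattern META =
            Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTTP_EQUIV =
            Pattern.compile("http-equiv\\s*=\\s*('|\")content-type('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
            Pattern.compile("content\\s*=\\s*('|\")([^\"']*)('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named by the first content-type meta element, or null. */
    static String charsetName(byte[] buffer, int length) {
        // ISO-8859-1 maps every byte to one character, so decoding never fails.
        String text = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(text);
        while (meta.find()) {
            String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue; // not a content-type declaration
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) return null;              // only the first occurrence counts
            Matcher charset = CHARSET.matcher(content.group(2));
            return charset.find() ? charset.group(1) : null;
        }
        return null;
    }
}
```

The returned name is not validated, so a caller would still need something like java.nio.charset.Charset.isSupported before using it.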
Parameters:
buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
length - the number of significant bytes in the buffer.
Returns:
the charset name, or null if no charset is specified; note that the charset might not be valid or available.

public static java.lang.String getCharsetNameFromHeader(java.lang.String headerValue)
Extracts the charset name from the header value of a content-type header using a regular expression.
Warning: it might not work if someone puts the string charset= in a string inside some attribute/value pair.
Parameters:
headerValue - the value of a content-type header.
Returns:
the charset name, or null if no charset is specified; note that the charset might not be valid or available.

public boolean apply(URIResponse uriResponse)
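Specified by:
apply in interface com.google.common.base.Predicate<T>

public HTMLParser<T> clone()
Overrides:
clone in class java.lang.Object

public HTMLParser<T> copy()
Description copied from interface: Parser
This method strengthens the return type of the method inherited from Filter.

public T result()
Description copied from interface: Parser
Returns the result of the processing.
Note that this method must be idempotent.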
public static void main(java.lang.String[] arg) throws java.lang.IllegalArgumentException, java.io.IOException, java.net.URISyntaxException, JSAPException, java.security.NoSuchAlgorithmException
Throws:
java.lang.IllegalArgumentException
java.io.IOException
java.net.URISyntaxException
JSAPException
java.security.NoSuchAlgorithmException