HTMLParser (BUbiNG 0.9.15)

java.lang.Object
- it.unimi.di.law.bubing.parser.HTMLParser<T>

All Implemented Interfaces:

com.google.common.base.Predicate<URIResponse>, Parser<T>, Filter<URIResponse>, FlyweightPrototype<Filter<URIResponse>>, java.util.function.Predicate<URIResponse>
```
public class HTMLParser<T>
extends java.lang.Object
implements Parser<T>
```
An HTML parser with additional responsibilities. An instance of this class performs some buffering that makes it possible to parse an HttpResponse quickly. Instances are heavyweight: they should be pooled and shared, since their usage is transitory and CPU-intensive.
Warning: starting with version 0.9.15, this parser will not return links marked with NoFollow unless otherwise specified in the constructor.
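
The advice to pool and share instances can be illustrated with a minimal sketch. `ParserPoolSketch` is a hypothetical helper, not part of BUbiNG; it is generic so that it compiles without the library, and the factory passed to it is assumed to build (or `copy()`) parser instances.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool for heavyweight objects such as parsers. */
final class ParserPoolSketch<P> {
    private final BlockingQueue<P> idle;

    /** Pre-builds {@code size} instances (size must be positive) using the given factory. */
    ParserPoolSketch(final int size, final Supplier<P> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    /** Returns an idle instance, or null if all instances are currently in use. */
    P tryAcquire() {
        return idle.poll();
    }

    /** Hands an instance back to the pool; instances beyond capacity are dropped. */
    void release(final P instance) {
        idle.offer(instance);
    }
}
```

A caller would acquire an instance, parse one response with it, and release it in a `finally` block so that the instance is reused rather than rebuilt.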

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`static class`	`HTMLParser.DigestAppendable`	A class computing the digest of a page.
`static class`	`HTMLParser.SetLinkReceiver`	An implementation of a `Parser.LinkReceiver` that accumulates the URLs in a public set.

Nested classes/interfaces inherited from interface it.unimi.di.law.bubing.parser.Parser
Parser.LinkReceiver, Parser.TextProcessor<T>

Field Summary

Fields
Modifier and Type	Field	Description
`protected char[]`	`buffer`	The character buffer.
`static int`	`CHAR_BUFFER_SIZE`	The size of the internal Jericho buffer.
`protected static java.util.regex.Pattern`	`CHARSET_PATTERN`	Used by `getCharsetName(byte[], int)`.
`protected static java.util.regex.Pattern`	`CONTENT_PATTERN`	Used by `getCharsetName(byte[], int)`.
`protected boolean`	`crossAuthorityDuplicates`	If `true`, pages with the same content but with different authorities are considered duplicates.
`protected HTMLParser.DigestAppendable`	`digestAppendable`	An object embodying the digest logic, or `null` for no digest computation.
`protected java.lang.String`	`guessedCharset`	The charset we guessed for the last response.
`protected static java.util.regex.Pattern`	`HTTP_EQUIV_PATTERN`	Used by `getCharsetName(byte[], int)`.
`protected java.net.URI`	`location`	The location URL from headers of the last response, if any, or `null`.
`protected static TextPattern`	`META_PATTERN`	Used by `getCharsetName(byte[], int)`.
`protected java.net.URI`	`metaLocation`	The location URL from `META` elements of the last response, if any, or `null`.
`protected Parser.TextProcessor<T>`	`textProcessor`	A text processor, or `null`.
`protected static TextPattern`	`URLEQUAL_PATTERN`	The pattern prefixing the URL in a `META` `HTTP-EQUIV` element of refresh type.

Fields inherited from interface it.unimi.di.law.warc.filters.Filter
FILTER_PACKAGE_NAME

Fields inherited from interface it.unimi.di.law.bubing.parser.Parser
NULL_LINK_RECEIVER

Constructor Summary

Constructors
Constructor	Description
`HTMLParser()`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction only (no digesting).
`HTMLParser(com.google.common.hash.HashFunction hashFunction)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(com.google.common.hash.HashFunction hashFunction, boolean crossAuthorityDuplicates)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, boolean returnNoFollow, int bufferSize)`	Builds a parser for link extraction and, possibly, digesting a page.
`HTMLParser(com.google.common.hash.HashFunction hashFunction, Parser.TextProcessor<T> textProcessor, boolean crossAuthorityDuplicates, int bufferSize)`	Builds a parser for link extraction and, possibly, digesting a page.
`HTMLParser(java.lang.String messageDigest)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(java.lang.String messageDigest, java.lang.String crossAuthorityDuplicates)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.
`HTMLParser(java.lang.String messageDigest, java.lang.String textProcessorSpec, java.lang.String crossAuthorityDuplicates, java.lang.String returnNoFollow)`	Builds a parser with a fixed buffer of `CHAR_BUFFER_SIZE` characters for link extraction and, possibly, digesting a page.

Method Summary

Modifier and Type	Method	Description
`boolean`	`apply(URIResponse uriResponse)`
`HTMLParser<T>`	`clone()`
`HTMLParser<T>`	`copy()`	This method strengthens the return type of the method inherited from `Filter`.
`static java.lang.String`	`getCharsetName(byte[] buffer, int length)`	Returns the charset name as indicated by a `META` `HTTP-EQUIV` element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters.
`static java.lang.String`	`getCharsetNameFromHeader(java.lang.String headerValue)`	Extracts the charset name from the header value of a `content-type` header using a regular expression.
`java.lang.String`	`guessedCharset()`	Returns a guessed charset for the document, or `null` if the charset could not be guessed.
`java.net.URI`	`location()`	Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned.
`static void`	`main(java.lang.String[] arg)`
`byte[]`	`parse(java.net.URI uri, org.apache.http.HttpResponse httpResponse, Parser.LinkReceiver linkReceiver)`	Parses a response.
`protected void`	`process(Parser.LinkReceiver linkReceiver, java.net.URI base, java.lang.String s)`	Pre-process a string that represents a raw link found in the page, trying to derelativize it.
`T`	`result()`	Returns the result of the processing.

Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface com.google.common.base.Predicate
equals, test

Methods inherited from interface java.util.function.Predicate
and, negate, or

- Field Detail
  - URLEQUAL_PATTERN
```
protected static final TextPattern URLEQUAL_PATTERN
```
    The pattern prefixing the URL in a META HTTP-EQUIV element of refresh type.
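    To show what "the pattern prefixing the URL" refers to, here is a minimal sketch that pulls the target out of the `content` attribute of a refresh-type META element, such as `5; URL=http://example.com/new`. `MetaRefreshSketch` is a hypothetical helper; it uses plain case-insensitive string matching as a stand-in and does not reproduce how `URLEQUAL_PATTERN` is actually applied.

```java
/** Hypothetical helper: extracts the target URL from a META refresh content value. */
final class MetaRefreshSketch {
    private static final String PREFIX = "url=";

    /** Returns the text following the first case-insensitive "url=", or null if absent. */
    static String refreshTarget(final String content) {
        for (int i = 0; i + PREFIX.length() <= content.length(); i++) {
            if (!content.regionMatches(true, i, PREFIX, 0, PREFIX.length())) continue;
            String url = content.substring(i + PREFIX.length()).trim();
            // Strip one pair of matching quotes around the URL, if present.
            if (url.length() >= 2) {
                final char q = url.charAt(0);
                if ((q == '\'' || q == '"') && url.charAt(url.length() - 1) == q)
                    url = url.substring(1, url.length() - 1);
            }
            return url.isEmpty() ? null : url;
        }
        return null;
    }
}
```

    The returned string is still a raw link; it would need to be resolved against the page URL before being used as a location.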
  - CHAR_BUFFER_SIZE
```
public static final int CHAR_BUFFER_SIZE
```
    The size of the internal Jericho buffer.
    
    See Also:
    
    Constant Field Values
  - buffer
```
protected final char[] buffer
```
    The character buffer. It is set up at construction time, but it can be changed later.
  - guessedCharset
```
protected java.lang.String guessedCharset
```
    The charset we guessed for the last response.
  - digestAppendable
```
protected final HTMLParser.DigestAppendable digestAppendable
```
    An object embodying the digest logic, or null for no digest computation.
  - textProcessor
```
protected final Parser.TextProcessor<T> textProcessor
```
    A text processor, or null.
  - location
```
protected java.net.URI location
```
    The location URL from headers of the last response, if any, or null.
  - metaLocation
```
protected java.net.URI metaLocation
```
    The location URL from META elements of the last response, if any, or null.
  - crossAuthorityDuplicates
```
protected boolean crossAuthorityDuplicates
```
    If true, pages with the same content but with different authorities are considered duplicates.
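    One way such a flag could influence duplicate detection is to mix the scheme and authority into the digest when cross-authority duplicates are not wanted. The sketch below illustrates that idea only; `AuthorityDigestSketch` is hypothetical, it uses `java.security.MessageDigest` with SHA-256 rather than a Guava `HashFunction`, and it makes no claim about how `HTMLParser.DigestAppendable` is implemented.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical illustration of authority-sensitive content digests. */
final class AuthorityDigestSketch {
    static byte[] digest(final URI uri, final String content, final boolean crossAuthorityDuplicates) {
        final MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        if (!crossAuthorityDuplicates) {
            // Equal content served by different scheme+authority pairs now digests differently.
            md.update((uri.getScheme() + "://" + uri.getAuthority()).getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between the prefix and the content
        }
        md.update(content.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }
}
```

    With the flag set to `true` the URI does not contribute at all, so identical content digests identically regardless of where it was fetched from.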
  - META_PATTERN
```
protected static final TextPattern META_PATTERN
```
    Used by getCharsetName(byte[], int).
  - HTTP_EQUIV_PATTERN
```
protected static final java.util.regex.Pattern HTTP_EQUIV_PATTERN
```
    Used by getCharsetName(byte[], int).
  - CONTENT_PATTERN
```
protected static final java.util.regex.Pattern CONTENT_PATTERN
```
    Used by getCharsetName(byte[], int).
  - CHARSET_PATTERN
```
protected static final java.util.regex.Pattern CHARSET_PATTERN
```
    Used by getCharsetName(byte[], int).
- Constructor Detail
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction)
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. By default, only pages from within the same scheme+authority may be considered to be duplicates.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction,
                  Parser.TextProcessor<T> textProcessor,
                  boolean crossAuthorityDuplicates,
                  boolean returnNoFollow,
                  int bufferSize)
```
    Builds a parser for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
    
    textProcessor - a text processor, or null if no text processing is required.
    
    crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
    
    returnNoFollow - forces returning also links marked as NoFollow.
    
    bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction,
                  Parser.TextProcessor<T> textProcessor,
                  boolean crossAuthorityDuplicates,
                  int bufferSize)
```
    Builds a parser for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
    
    textProcessor - a text processor, or null if no text processing is required.
    
    crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
    
    bufferSize - the fixed size of the internal buffer; if zero, the buffer will be dynamic.
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction,
                  Parser.TextProcessor<T> textProcessor,
                  boolean crossAuthorityDuplicates,
                  boolean returnNoFollow)
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
    
    textProcessor - a text processor, or null if no text processing is required.
    
    crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
    
    returnNoFollow - forces returning also links marked as NoFollow.
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction,
                  boolean crossAuthorityDuplicates)
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
    
    crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
  - HTMLParser
```
public HTMLParser(com.google.common.hash.HashFunction hashFunction,
                  Parser.TextProcessor<T> textProcessor,
                  boolean crossAuthorityDuplicates)
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    hashFunction - the hash function used to digest, null if no digesting will be performed.
    
    textProcessor - a text processor, or null if no text processing is required.
    
    crossAuthorityDuplicates - if true, pages with different scheme+authority but with the same content will be considered to be duplicates, as long as they are assigned to the same Agent.
  - HTMLParser
```
public HTMLParser(java.lang.String messageDigest)
           throws java.security.NoSuchAlgorithmException
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page. (No cross-authority duplicates are considered.)
    
    Parameters:
    
    messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
    
    Throws:
    
    java.security.NoSuchAlgorithmException
  - HTMLParser
```
public HTMLParser(java.lang.String messageDigest,
                  java.lang.String crossAuthorityDuplicates)
           throws java.security.NoSuchAlgorithmException
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
    
    crossAuthorityDuplicates - a string whose value can only be "true" or "false" that is used to determine if you want to check for cross-authority duplicates.
    
    Throws:
    
    java.security.NoSuchAlgorithmException
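    The string-based constructors take every argument as text. The sketch below shows how arguments of this shape might be validated: a strict "true"/"false" check and an algorithm name where the empty string means no digest. `StringArgsSketch` is hypothetical, and `java.security.MessageDigest` is used only as a stand-in that throws the same exception type; how BUbiNG resolves the name is not stated here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical validation of string-typed constructor arguments. */
final class StringArgsSketch {
    /** Accepts exactly "true" or "false"; anything else is rejected. */
    static boolean strictBoolean(final String value) {
        if ("true".equals(value)) return true;
        if ("false".equals(value)) return false;
        throw new IllegalArgumentException("Expected \"true\" or \"false\", got: " + value);
    }

    /** Returns null for the empty string (no digesting), otherwise looks the algorithm up. */
    static MessageDigest digestOrNull(final String name) throws NoSuchAlgorithmException {
        return name.isEmpty() ? null : MessageDigest.getInstance(name);
    }
}
```

    Unlike `Boolean.parseBoolean`, which maps any unrecognized string to `false`, the strict check fails loudly on a typo.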
  - HTMLParser
```
public HTMLParser(java.lang.String messageDigest,
                  java.lang.String textProcessorSpec,
                  java.lang.String crossAuthorityDuplicates)
           throws java.security.NoSuchAlgorithmException,
                  java.lang.IllegalArgumentException,
                  java.lang.ClassNotFoundException,
                  java.lang.IllegalAccessException,
                  java.lang.reflect.InvocationTargetException,
                  java.lang.InstantiationException,
                  java.lang.NoSuchMethodException,
                  java.io.IOException
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
    
    textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
    
    crossAuthorityDuplicates - a string whose value can only be "true" or "false" that is used to determine if you want to check for cross-authority duplicates.
    
    Throws:
    
    java.security.NoSuchAlgorithmException
    
    java.lang.IllegalArgumentException
    
    java.lang.ClassNotFoundException
    
    java.lang.IllegalAccessException
    
    java.lang.reflect.InvocationTargetException
    
    java.lang.InstantiationException
    
    java.lang.NoSuchMethodException
    
    java.io.IOException
  - HTMLParser
```
public HTMLParser(java.lang.String messageDigest,
                  java.lang.String textProcessorSpec,
                  java.lang.String crossAuthorityDuplicates,
                  java.lang.String returnNoFollow)
           throws java.security.NoSuchAlgorithmException,
                  java.lang.IllegalArgumentException,
                  java.lang.ClassNotFoundException,
                  java.lang.IllegalAccessException,
                  java.lang.reflect.InvocationTargetException,
                  java.lang.InstantiationException,
                  java.lang.NoSuchMethodException,
                  java.io.IOException
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction and, possibly, digesting a page.
    
    Parameters:
    
    messageDigest - the name of a message-digest algorithm, or the empty string if no digest will be computed.
    
    textProcessorSpec - the specification of a text processor that will be passed to an ObjectParser.
    
    crossAuthorityDuplicates - a string whose value can only be "true" or "false" that is used to determine if you want to check for cross-authority duplicates.
    
    returnNoFollow - forces returning also links marked as NoFollow.
    
    Throws:
    
    java.security.NoSuchAlgorithmException
    
    java.lang.IllegalArgumentException
    
    java.lang.ClassNotFoundException
    
    java.lang.IllegalAccessException
    
    java.lang.reflect.InvocationTargetException
    
    java.lang.InstantiationException
    
    java.lang.NoSuchMethodException
    
    java.io.IOException
  - HTMLParser
```
public HTMLParser()
```
    Builds a parser with a fixed buffer of CHAR_BUFFER_SIZE characters for link extraction only (no digesting).
- Method Detail
  - process
```
protected void process(Parser.LinkReceiver linkReceiver,
                       java.net.URI base,
                       java.lang.String s)
```
    Pre-process a string that represents a raw link found in the page, trying to derelativize it. If it succeeds, the resulting URL is passed to the link receiver.
    
    Parameters:
    
    linkReceiver - the link receiver that will receive the resulting URL.
    
    base - the base URL to be used to derelativize the link.
    
    s - the raw link to be derelativized.
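    The derelativization step described above can be sketched with the standard library alone. `DerelativizeSketch` is hypothetical: it uses `java.net.URI.resolve` and a plain `Set<URI>` as a stand-in for a `Parser.LinkReceiver`, and it does not reproduce BUbiNG's own URL handling.

```java
import java.net.URI;
import java.util.Set;

/** Hypothetical stand-in for link pre-processing: resolve, then hand over on success. */
final class DerelativizeSketch {
    static void process(final Set<URI> receiver, final URI base, final String rawLink) {
        try {
            // Resolve the raw link against the base; absolute links pass through unchanged.
            final URI resolved = base.resolve(rawLink.trim());
            if (resolved.isAbsolute()) receiver.add(resolved.normalize());
        } catch (IllegalArgumentException e) {
            // The raw link is not a syntactically valid URI reference: drop it.
        }
    }
}
```

    Malformed links are silently skipped in this sketch, which matches the "if it succeeds" wording: only links that resolve reach the receiver.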
  - parse
```
public byte[] parse(java.net.URI uri,
                    org.apache.http.HttpResponse httpResponse,
                    Parser.LinkReceiver linkReceiver)
             throws java.io.IOException
```
    Description copied from interface: Parser
    
    Parses a response.
    
    Specified by:
    
    parse in interface Parser<T>
    
    Parameters:
    
    httpResponse - a response to parse.
    
    linkReceiver - a link receiver.
    
    Returns:
    
    a digest of the page content, or null if no digest has been computed.
    
    Throws:
    
    java.io.IOException
  - guessedCharset
```
public java.lang.String guessedCharset()
```
    Description copied from interface: Parser
    
    Returns a guessed charset for the document, or null if the charset could not be guessed.
    
    Specified by:
    
    guessedCharset in interface Parser<T>
    
    Returns:
    
    a charset or null.
  - location
```
public java.net.URI location()
```
    Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
    
    Returns:
    
    the location (or metalocation), if present; null otherwise.
  - getCharsetName
```
public static java.lang.String getCharsetName(byte[] buffer,
                                              int length)
```
    Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).
    Beware: it might not work if the value of some attribute in a meta tag contains a string matching (case-insensitively) either of the regular expressions http-equiv\s*=\s*('|")content-type('|") or content\s*=\s*('|")[^"']*('|").
    
    Parameters:
    
    buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
    
    length - the number of significant bytes in the buffer.
    
    Returns:
    
    the charset name, or null if no charset is specified; note that the charset might be not valid or not available.
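    A rough sketch of this kind of detection follows. `MetaCharsetSketch` and its regular expressions are assumptions written for illustration; they are not the patterns stored in `META_PATTERN`, `HTTP_EQUIV_PATTERN`, `CONTENT_PATTERN` or `CHARSET_PATTERN`.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical META HTTP-EQUIV charset sniffer over raw bytes. */
final class MetaCharsetSketch {
    private static final Pattern META = Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*['\"]?content-type['\"]?", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*['\"]([^\"']*)['\"]", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    static String charsetFromMeta(final byte[] buffer, final int length) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail on any input.
        final String html = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        final Matcher meta = META.matcher(html);
        while (meta.find()) {
            final String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue;
            final Matcher content = CONTENT.matcher(tag);
            if (!content.find()) continue;
            // Only the first matching element is considered, even if it names no charset.
            final Matcher charset = CHARSET.matcher(content.group(1));
            return charset.find() ? charset.group(1) : null;
        }
        return null;
    }
}
```

    The returned name is not checked against `java.nio.charset.Charset`, so a caller should still verify that it is supported before decoding with it.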
  - getCharsetNameFromHeader
```
public static java.lang.String getCharsetNameFromHeader(java.lang.String headerValue)
```
    Extracts the charset name from the header value of a content-type header using a regular expression. Warning: it might not work if someone puts the string charset= in a string inside some attribute/value pair.
    
    Parameters:
    
    headerValue - The value of a content-type header.
    
    Returns:
    
    the charset name, or null if no charset is specified; note that the charset might be not valid or not available.
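    The regular-expression approach, including the caveat in the warning, can be sketched as follows. `HeaderCharsetSketch` and its pattern are assumptions for illustration, not the expression BUbiNG uses.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a charset name out of a content-type header value. */
final class HeaderCharsetSketch {
    // Assumed pattern: "charset=" followed by an optionally quoted token.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name, or null if the header value does not specify one. */
    static String charsetFromHeader(final String headerValue) {
        if (headerValue == null) return null;
        final Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }
}
```

    Because the pattern is searched for anywhere in the value, a header such as `text/html; foo="charset=bogus"` yields `bogus` in this sketch, which is the kind of false match the warning describes.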
  - apply
```
public boolean apply(URIResponse uriResponse)
```
    Specified by:
    
    apply in interface com.google.common.base.Predicate<T>
  - clone
```
public HTMLParser<T> clone()
```
    Overrides:
    
    clone in class java.lang.Object
  - copy
```
public HTMLParser<T> copy()
```
    Description copied from interface: Parser
    
    This method strengthens the return type of the method inherited from Filter.
    
    Specified by:
    
    copy in interface FlyweightPrototype<T>
    
    Specified by:
    
    copy in interface Parser<T>
    
    Returns:
    
    a copy of this object, sharing state with this object as much as possible.
  - result
```
public T result()
```
    Description copied from interface: Parser
    
    Returns the result of the processing.
    Note that this method must be idempotent.
    
    Specified by:
    
    result in interface Parser<T>
    
    Returns:
    
    the result of the processing.
  - main
```
public static void main(java.lang.String[] arg)
                 throws java.lang.IllegalArgumentException,
                        java.io.IOException,
                        java.net.URISyntaxException,
                        JSAPException,
                        java.security.NoSuchAlgorithmException
```
    Throws:
    
    java.lang.IllegalArgumentException
    
    java.io.IOException
    
    java.net.URISyntaxException
    
    JSAPException
    
    java.security.NoSuchAlgorithmException
