Package it.unimi.dsi.law.warc.parser
Class HTMLParser
java.lang.Object
it.unimi.dsi.law.warc.parser.HTMLParser
- All Implemented Interfaces:
com.google.common.base.Predicate<Response>, Filter<Response>, Parser, Cloneable, Predicate<Response>
public class HTMLParser extends Object implements Parser
An HTML parser with additional responsibilities (such as guessing the character encoding
and resolving relative URLs).
An instance of this class contains buffers and classes that make it possible to
quickly parse an HttpResponse. Instances are heavyweight: they
should be pooled and shared, since their usage is transitory and CPU-intensive.
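The pooling advice above can be sketched with a small fixed-size pool. This is only an illustration under stated assumptions: the class name ParserPoolSketch and its use method are hypothetical and are not part of this package, and the pooled type is left generic so that the sketch compiles without the library on the classpath.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; P stands in for a heavyweight parser type. */
final class ParserPoolSketch<P> {
    private final BlockingQueue<P> idle;

    /** Builds all instances up front, so construction cost is paid once. */
    ParserPoolSketch(int size, Supplier<P> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    /** Borrows an instance, runs the task, and always returns the instance to the pool. */
    <R> R use(Function<P, R> task) {
        final P parser;
        try {
            parser = idle.take(); // blocks while every instance is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a parser", e);
        }
        try {
            return task.apply(parser);
        } finally {
            idle.add(parser); // cannot fail: one slot was freed by take()
        }
    }
}
```

With the library available, such a pool could presumably be built as `new ParserPoolSketch<>(4, HTMLParser::new)`; because parse declares IOException, the task passed to use would have to catch or wrap that exception.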
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | HTMLParser.SetLinkReceiver |
Nested classes/interfaces inherited from interface it.unimi.dsi.law.warc.parser.Parser
Parser.LinkReceiver
-
Field Summary
Fields
Modifier and Type | Field | Description
char[] | buffer | The character buffer.
static int | CHAR_BUFFER_SIZE | The size of the internal Jericho buffer.
Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME
Fields inherited from interface it.unimi.dsi.law.warc.parser.Parser
NULL_LINK_RECEIVER
-
Constructor Summary
Constructors
Constructor | Description
HTMLParser() | Builds a parser for link extraction.
HTMLParser(String messageDigest) | Builds a parser for link extraction and, possibly, digesting a page.
HTMLParser(MessageDigest messageDigest) | Builds a parser for link extraction and, possibly, digesting a page.
-
Method Summary
Modifier and Type | Method | Description
boolean | apply(Response response) |
Object | clone() |
static String | getCharsetName(byte[] buffer, int length) | Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters.
static String | getCharsetNameFromHeader(String headerValue) | Extracts the charset name from the header value of a content-type header.
String | guessedCharset() | Returns a guessed charset for the document, or null if the charset could not be guessed.
URI | location() | Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned.
byte[] | parse(Response response, Parser.LinkReceiver linkReceiver) | Parses a response.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.google.common.base.Predicate
equals, test
-
Field Details
-
CHAR_BUFFER_SIZE
public static final int CHAR_BUFFER_SIZE
The size of the internal Jericho buffer.
- See Also:
- Constant Field Values
-
buffer
public final char[] buffer
The character buffer. It is set up at construction time, but it can be changed later.
-
-
Constructor Details
-
HTMLParser
Builds a parser for link extraction and, possibly, digesting a page.
- Parameters:
messageDigest - the digesting algorithm, or null if no digesting will be performed.
-
HTMLParser
Builds a parser for link extraction and, possibly, digesting a page.
- Parameters:
messageDigest - the digesting algorithm (as a string).
- Throws:
NoSuchAlgorithmException
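As a rough illustration of when NoSuchAlgorithmException may arise, the sketch below assumes that the algorithm name is resolved through the standard java.security.MessageDigest.getInstance lookup. This page does not state that, so treat it as an assumption; the class DigestNameSketch is hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: checks an algorithm name before handing it to a parser. */
final class DigestNameSketch {
    /** Returns the digest length in bytes, or -1 if no installed provider knows the name. */
    static int digestLength(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm).getDigestLength();
        } catch (NoSuchAlgorithmException e) {
            // Assumed to be the same failure the String constructor reports.
            return -1;
        }
    }
}
```

Names such as "MD5" and "SHA-256" are required on every Java platform, so they are safe choices; a misspelled name is reported as -1 here.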
-
HTMLParser
public HTMLParser()Builds a parser for link extraction.
-
-
Method Details
-
parse
Description copied from interface: Parser
Parses a response.
- Specified by:
parse in interface Parser
- Parameters:
response - a response to parse.
linkReceiver - a link receiver.
- Returns:
- a byte digest for the page, or null if no digest has been computed.
- Throws:
IOException
-
guessedCharset
Description copied from interface: Parser
Returns a guessed charset for the document, or null if the charset could not be guessed.
- Specified by:
guessedCharset in interface Parser
- Returns:
- a charset or null.
-
location
Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
- Returns:
- the location (or metalocation), if present; null otherwise.
-
getCharsetName
Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).
Beware: it might not work if the value of some attribute in a meta tag contains a string matching (case insensitively) the regular expression http-equiv\s*=\s*('|")content-type('|"), or content\s*=\s*('|")[^"']*('|").
- Parameters:
buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
length - the number of significant bytes in the buffer.
- Returns:
- the charset name, or null if no charset is specified; note that the charset might not be valid or available.
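The behaviour described above can be approximated with the two regular expressions quoted in the warning. The following is a minimal sketch, not the actual implementation of getCharsetName: the class MetaCharsetSketch, the tag-matching pattern, and the charset-extraction pattern are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; the real method may differ in every detail. */
final class MetaCharsetSketch {
    // Assumed pattern for locating <meta ...> tags.
    private static final Pattern META =
        Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);
    // The two expressions quoted in the documentation (a group is added around the content value).
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*('|\")content-type('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*('|\")([^\"']*)('|\")", Pattern.CASE_INSENSITIVE);
    // Assumed pattern for pulling the charset out of the content value.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    static String charsetName(byte[] buffer, int length) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(text);
        while (meta.find()) {
            String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue;
            // Only the first http-equiv="content-type" tag is considered.
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) return null;
            Matcher charset = CHARSET.matcher(content.group(2));
            return charset.find() ? charset.group(1) : null; // the name is not validated
        }
        return null;
    }
}
```

The returned name is passed through unchecked, which matches the note that the charset might not be valid or available; a caller would typically test it with Charset.isSupported before use.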
-
getCharsetNameFromHeader
Extracts the charset name from the header value of a content-type header.
TODO: explain better
Warning: it might not work if someone puts the string charset= in a string inside some attribute/value pair.
- Parameters:
headerValue - The value of a content-type header.
- Returns:
- the charset name, or null if no charset is specified; note that the charset might not be valid or available.
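A minimal sketch of this kind of extraction follows, assuming a simple search for charset= in the header value. The class HeaderCharsetSketch and its pattern are hypothetical and are not taken from the library; the sketch also reproduces the caveat in the warning.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only; not the implementation of getCharsetNameFromHeader. */
final class HeaderCharsetSketch {
    // Assumed pattern: "charset=", an optional opening quote, then the name.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name found in the header value, or null. */
    static String charsetNameFromHeader(String headerValue) {
        if (headerValue == null) return null;
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null; // the name is not validated
    }
}
```

For `text/html; charset=UTF-8` the sketch yields `UTF-8`. For a value such as `text/html; foo="charset=bar"; charset=utf-8` it yields `bar`, because the first textual match wins even when it sits inside another parameter's quoted value.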
-
apply
- Specified by:
apply in interface com.google.common.base.Predicate<Response>
-
clone
-