Package it.unimi.dsi.law.warc.parser
Class HTMLParser
java.lang.Object
it.unimi.dsi.law.warc.parser.HTMLParser
- All Implemented Interfaces:
com.google.common.base.Predicate<Response>, Filter<Response>, Parser, Cloneable, Predicate<Response>
public class HTMLParser extends Object implements Parser
An HTML parser with additional responsibilities (such as guessing the character encoding
and resolving relative URLs).
An instance of this class contains buffers and classes that make it possible to
quickly parse an HttpResponse. Instances are heavyweight: they
should be pooled and shared, since their usage is transitory and CPU-intensive.
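The pooling advice above can be sketched with the JDK alone. This is a minimal illustration, not part of this package: SimplePool is a hypothetical helper, and in real use the factory would be something like HTMLParser::new.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool; not part of it.unimi.dsi.law.warc.parser. */
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    /** Eagerly creates {@code size} instances using the given factory. */
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    /** Waits until an instance is idle and hands it to the caller. */
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled instance", e);
        }
    }

    /** Returns a previously borrowed instance to the pool. */
    void release(T instance) {
        idle.add(instance);
    }

    /** Number of instances currently idle. */
    int available() {
        return idle.size();
    }
}
```

A caller would borrow an instance, use it for one response, and release it in a finally block so that the expensive buffers are reused rather than reallocated.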
-
Nested Class Summary
Nested Classes
Modifier and Type, Class - Description
static class HTMLParser.SetLinkReceiver
Nested classes/interfaces inherited from interface it.unimi.dsi.law.warc.parser.Parser
Parser.LinkReceiver
-
Field Summary
Fields
Modifier and Type, Field - Description
char[] buffer - The character buffer.
static int CHAR_BUFFER_SIZE - The size of the internal Jericho buffer.
Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME
Fields inherited from interface it.unimi.dsi.law.warc.parser.Parser
NULL_LINK_RECEIVER
-
Constructor Summary
Constructors
Constructor - Description
HTMLParser() - Builds a parser for link extraction.
HTMLParser(String messageDigest) - Builds a parser for link extraction and, possibly, digesting a page.
HTMLParser(MessageDigest messageDigest) - Builds a parser for link extraction and, possibly, digesting a page.
-
Method Summary
Modifier and Type, Method - Description
boolean apply(Response response)
Object clone()
static String getCharsetName(byte[] buffer, int length) - Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters.
static String getCharsetNameFromHeader(String headerValue) - Extracts the charset name from the header value of a content-type header.
String guessedCharset() - Returns a guessed charset for the document, or null if the charset could not be guessed.
URI location() - Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned.
byte[] parse(Response response, Parser.LinkReceiver linkReceiver) - Parses a response.
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.google.common.base.Predicate
equals, test
-
Field Details
-
CHAR_BUFFER_SIZE
public static final int CHAR_BUFFER_SIZE
The size of the internal Jericho buffer.
- See Also:
- Constant Field Values
-
buffer
public final char[] buffer
The character buffer. It is set up at construction time, but it can be changed later.
-
-
Constructor Details
-
HTMLParser
Builds a parser for link extraction and, possibly, digesting a page.
- Parameters:
messageDigest - the digesting algorithm, or null if no digesting will be performed.
-
HTMLParser
Builds a parser for link extraction and, possibly, digesting a page.
- Parameters:
messageDigest - the digesting algorithm (as a string).
- Throws:
NoSuchAlgorithmException
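NoSuchAlgorithmException is what java.security.MessageDigest.getInstance(String) throws for an unknown algorithm name, so it is plausible (an assumption, not confirmed by this page) that the string is resolved that way. A small JDK-only sketch of checking a name before building a parser; DigestLookup is a hypothetical helper, not part of this package.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; not part of it.unimi.dsi.law.warc.parser. */
class DigestLookup {
    /** Returns true if the running JVM has a provider for the named algorithm. */
    static boolean isSupported(String algorithm) {
        try {
            MessageDigest.getInstance(algorithm);
            return true;
        } catch (NoSuchAlgorithmException e) {
            return false;
        }
    }

    /** Digests a page with the named algorithm; fails on unknown names. */
    static byte[] digest(String algorithm, byte[] page) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance(algorithm).digest(page);
    }
}
```

Names such as "MD5" and "SHA-256" are required to be available on every Java platform implementation, so they are safe choices for the string argument.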
-
HTMLParser
public HTMLParser()
Builds a parser for link extraction.
-
-
Method Details
-
parse
Description copied from interface: Parser
Parses a response.
- Specified by:
parse in interface Parser
- Parameters:
response - a response to parse.
linkReceiver - a link receiver.
- Returns:
- a byte digest for the page, or null if no digest has been computed.
- Throws:
IOException
-
guessedCharset
Description copied from interface: Parser
Returns a guessed charset for the document, or null if the charset could not be guessed.
- Specified by:
guessedCharset in interface Parser
- Returns:
- a charset or null.
-
location
Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
- Returns:
- the location (or metalocation), if present; null otherwise.
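A location or metalocation value may be relative to the page it came from. How this class resolves it is not stated here; the following is only a sketch of relative-reference resolution using java.net.URI from the JDK, with LocationResolveSketch as a hypothetical name.

```java
import java.net.URI;

/** Hypothetical helper; illustrates resolution only, not this class's code. */
class LocationResolveSketch {
    /** Resolves a (possibly relative) location against the URI of the page. */
    static URI resolve(String pageUri, String location) {
        return URI.create(pageUri).resolve(location.trim());
    }
}
```

An absolute location is returned unchanged, while a relative one is interpreted against the page URI.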
-
getCharsetName
Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).
Beware: it might not work if the value of some attribute in a meta tag contains a string matching (case insensitively) the regular expression http-equiv\s*=\s*('|")content-type('|"), or content\s*=\s*('|")[^"']*('|").
- Parameters:
buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
length - the number of significant bytes in the buffer.
- Returns:
- the charset name, or null if no charset is specified; note that the charset might not be valid or available.
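The description above can be illustrated with a regex-based sketch. This is not the library's implementation: MetaCharsetSketch and its patterns are assumptions modelled on the regular expressions quoted in the warning, and it shares the same weakness when an attribute value happens to contain a matching string.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the code used by HTMLParser.getCharsetName. */
class MetaCharsetSketch {
    private static final int CI = Pattern.CASE_INSENSITIVE;
    private static final Pattern META = Pattern.compile("<meta\\s[^>]*>", CI);
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*['\"]content-type['\"]", CI);
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*['\"]([^\"']*)['\"]", CI);
    private static final Pattern CHARSET = Pattern.compile("charset\\s*=\\s*([^\\s;]+)", CI);

    /** Returns the charset named by the first content-type meta tag, or null. */
    static String charsetFromMeta(byte[] buffer, int length) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(text);
        while (meta.find()) {
            String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue;
            // Only the first content-type meta tag is considered.
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) return null;
            Matcher charset = CHARSET.matcher(content.group(1));
            return charset.find() ? charset.group(1) : null;
        }
        return null;
    }
}
```

The returned name is not validated; a caller would still need to check it, for example with java.nio.charset.Charset.isSupported, before decoding the page.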
-
getCharsetNameFromHeader
Extracts the charset name from the header value of a content-type header.
TODO: explain better
Warning: it might not work if someone puts the string charset= in a string inside some attribute/value pair.
- Parameters:
headerValue - The value of a content-type header.
- Returns:
- the charset name, or null if no charset is specified; note that the charset might not be valid or available.
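As a rough illustration of the extraction described above, the sketch below searches a content-type header value for a charset parameter. It is an assumption about the general technique, not the library's code; HeaderCharsetSketch and its pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the code used by HTMLParser.getCharsetNameFromHeader. */
class HeaderCharsetSketch {
    // Accepts an optional opening quote, then reads up to whitespace, ';' or '"'.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset parameter of a content-type value, or null if absent. */
    static String charsetFromHeader(String headerValue) {
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }
}
```

Because the pattern simply looks for the first charset= substring, a value such as text/html; title="charset=x" would be misread, which is the kind of failure the warning describes.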
-
apply
- Specified by:
apply in interface com.google.common.base.Predicate<Response>
-
clone
-