Package it.unimi.dsi.law.warc.parser
Interface Parser
- All Superinterfaces:
Cloneable, Filter<Response>, com.google.common.base.Predicate<Response>, Predicate<Response>
- All Known Implementing Classes:
BinaryParser, HTMLParser
public interface Parser extends Filter<Response>, Cloneable
A generic parser for responses. It provides link extraction through a Parser.LinkReceiver callback and optional digesting.
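The callback style described above can be sketched without the library on the classpath. Everything in this block is a hypothetical stand-in: `LinkSink` plays the role of Parser.LinkReceiver (whose methods are not listed on this page), `ToyLinkParser` is not one of the real implementing classes, and MD5 is an arbitrary choice of digest, not necessarily what the library computes.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for Parser.LinkReceiver: one callback per discovered URL. */
interface LinkSink {
    void link(URI uri);
}

/** Hypothetical toy parser: reports href targets to the sink and returns a digest of the body. */
final class ToyLinkParser {
    // Deliberately naive: matches only double-quoted href attributes.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /** Pushes every link found in the body to the sink, then returns an MD5 digest of the body. */
    static byte[] parse(String body, LinkSink sink) {
        Matcher m = HREF.matcher(body);
        while (m.find()) {
            // URI.create throws IllegalArgumentException on malformed input;
            // a real parser would need to handle that case.
            sink.link(URI.create(m.group(1)));
        }
        try {
            return MessageDigest.getInstance("MD5").digest(body.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            return null; // mirrors the "null if no digest has been computed" contract
        }
    }

    /** Convenience wrapper: collects the reported links into a list. */
    static List<URI> collectLinks(String body) {
        List<URI> links = new ArrayList<>();
        parse(body, links::add);
        return links;
    }
}
```

In this sketch, a caller that only wants the digest can pass a sink that ignores its argument, which is the role NULL_LINK_RECEIVER plays for the real interface.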
Nested Class Summary
Modifier and Type    Interface              Description
static interface     Parser.LinkReceiver    A class that can receive URLs discovered during parsing.
Field Summary
Modifier and Type             Field                 Description
static Parser.LinkReceiver    NULL_LINK_RECEIVER    A no-op implementation of Parser.LinkReceiver.
Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME
Method Summary
Modifier and Type    Method                                                        Description
String               guessedCharset()                                              Returns a guessed charset for the document, or null if the charset could not be guessed.
byte[]               parse(Response response, Parser.LinkReceiver linkReceiver)    Parses a response.
Methods inherited from interface com.google.common.base.Predicate
apply, equals, test
Field Details
NULL_LINK_RECEIVER
A no-op implementation of Parser.LinkReceiver.
Method Details
parse
byte[] parse(Response response, Parser.LinkReceiver linkReceiver) throws IOException
Parses a response.
- Parameters:
response - a response to parse.
linkReceiver - a link receiver.
- Returns:
a byte digest for the page, or null if no digest has been computed.
- Throws:
IOException
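Because the returned digest may be null, callers need a null check before using it. The following is a minimal sketch of one way to handle that; `DigestSupport` and `toHex` are hypothetical helper names, not part of this package.

```java
import java.util.Optional;

/** Hypothetical helper for a digest that may be null, as parse(...) is documented to allow. */
final class DigestSupport {
    /** Returns the lowercase hex form of the digest, or an empty Optional if none was computed. */
    static Optional<String> toHex(byte[] digest) {
        if (digest == null) {
            return Optional.empty();
        }
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            // Mask to an unsigned value so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return Optional.of(sb.toString());
    }
}
```

Wrapping the result in an Optional keeps the "no digest" case explicit at the call site instead of propagating a null array.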
guessedCharset
String guessedCharset()
Returns a guessed charset for the document, or null if the charset could not be guessed.
- Returns:
a charset or null.
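Since guessedCharset() may return null, a caller will typically want a fallback. The sketch below assumes the returned string is a charset name that `java.nio.charset.Charset.forName` can interpret (for example `UTF-8`); this page does not specify the exact format. `CharsetSupport` and `resolve` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper that turns a possibly-null guessed charset name into a usable Charset. */
final class CharsetSupport {
    /** Returns the named charset, or the fallback if the guess is null, malformed, or unsupported. */
    static Charset resolve(String guessed, Charset fallback) {
        if (guessed == null) {
            return fallback; // the parser could not guess a charset
        }
        try {
            return Charset.forName(guessed);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // the guess is not a charset this JVM knows
        }
    }
}
```

A call such as `CharsetSupport.resolve(parser.guessedCharset(), StandardCharsets.UTF_8)` would then always yield a usable charset; the choice of UTF-8 as the default is only an example.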