A parser is an object that may be able to parse the content of an HTTP response (a.k.a. page) in order to extract links from it (provided that the format allows for links), to compute a suitable digest of the response (for duplicate detection), and so on. Typically, a ParsingThread tries a sequence of parsers in turn: each parser acts as a filter that decides whether it is able to parse a given page (e.g., based on the content-type header). The first parser that declares itself suitable for the given response, if any, is used. If no parser declares itself suitable, or if the chosen parser fails, a catch-all BinaryParser is used instead.

Parsers should be written so that they can be easily reused, should be lightweight, and should be very robust to errors in the parsed responses.
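
The selection logic described above can be illustrated with a minimal sketch. It is not the actual `Parser<T>` or `ParsingThread` code: the names `SketchParser`, `SketchHtmlParser`, `SketchBinaryParser` and `ParserChain`, as well as the `accepts`/`parse` signatures, are hypothetical simplifications chosen only to show the "first suitable parser, else catch-all" behaviour.

```java
import java.util.List;

/** Hypothetical, simplified stand-in for a parser; NOT the real Parser<T> interface. */
interface SketchParser {
    /** The filter step: can this parser handle a response with this content type? */
    boolean accepts(String contentType);

    /** Parses the content and returns a short label describing the result. */
    String parse(byte[] content) throws Exception;
}

/** Accepts only HTML; fails on empty content so the fallback path can be exercised. */
final class SketchHtmlParser implements SketchParser {
    public boolean accepts(String contentType) {
        return contentType != null && contentType.startsWith("text/html");
    }

    public String parse(byte[] content) throws Exception {
        if (content.length == 0) throw new Exception("empty HTML document");
        return "html:" + content.length;
    }
}

/** Catch-all that accepts everything, playing the role the text assigns to BinaryParser. */
final class SketchBinaryParser implements SketchParser {
    public boolean accepts(String contentType) {
        return true;
    }

    public String parse(byte[] content) {
        return "binary:" + content.length;
    }
}

/** Tries each parser in turn; falls back when none is suitable or the chosen one fails. */
final class ParserChain {
    private final List<SketchParser> parsers;
    private final SketchBinaryParser fallback = new SketchBinaryParser();

    ParserChain(List<SketchParser> parsers) {
        this.parsers = parsers;
    }

    String parse(String contentType, byte[] content) {
        for (SketchParser parser : parsers) {
            if (!parser.accepts(contentType)) continue; // not suitable, try the next one
            try {
                return parser.parse(content); // first suitable parser wins
            } catch (Exception e) {
                break; // the chosen parser failed: use the catch-all
            }
        }
        return fallback.parse(content);
    }
}
```

In this sketch a failing parser is never followed by a second attempt with another specialised parser; the chain goes straight to the catch-all, which matches the behaviour described above.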
Interface | Description |
---|---|
Parser<T> | A generic parser for responses. |
Parser.LinkReceiver | A class that can receive URLs discovered during parsing. |
Parser.TextProcessor<T> | A class that can receive pieces of text discovered during parsing. |
Class | Description |
---|---|
BinaryParser | A universal binary parser that just computes digests. |
HTMLParser<T> | An HTML parser with additional responsibilities. |
HTMLParser.DigestAppendable | A class computing the digest of a page. |
HTMLParser.SetLinkReceiver | An implementation of a Parser.LinkReceiver that accumulates the URLs in a public set. |
SpamTextProcessor | An implementation of a Parser.TextProcessor that accumulates the counts of terms from a given set specified via a StringMap. |
SpamTextProcessor.TermCount |
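
To make the idea of a text processor that accumulates term counts more concrete, here is a small sketch. It is an assumption-laden illustration, not the SpamTextProcessor implementation: the class name `SketchTermCounter` and its methods are hypothetical, the term set is a plain `java.util.Set` rather than a StringMap, and the tokenisation rule (lower-casing and splitting on non-alphanumeric characters) is invented for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: counts occurrences of terms from a fixed set in the text it receives. */
final class SketchTermCounter {
    private final Set<String> terms;
    private final Map<String, Integer> counts = new HashMap<>();

    SketchTermCounter(Set<String> terms) {
        this.terms = terms;
    }

    /** Receives a piece of text discovered during parsing and updates the counts. */
    void process(CharSequence text) {
        // Assumed tokenisation: lower-case, then split on runs of non-letter, non-digit characters.
        String[] tokens = text.toString().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (terms.contains(token)) counts.merge(token, 1, Integer::sum);
        }
    }

    /** Returns how many times the term has been seen so far (0 if never). */
    int count(String term) {
        return counts.getOrDefault(term, 0);
    }
}
```

A parser would call `process` once for each piece of text it discovers, and the caller would read the accumulated counts after parsing completes.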