it.unimi.di.law.bubing.parser (BUbiNG 0.9.15)

The system of parsers used for analyzing HTTP responses.

A parser is an object that may be able to parse the content of an HTTP response (a.k.a. page) in order to extract links from it (provided that the format allows for links), to compute a suitable digest of the response (for duplicate detection), and so on. Typically, a ParsingThread tries a sequence of parsers in turn: each parser acts as a filter that decides whether it is able to parse a given page (e.g., based on the content-type header).

The first parser that declares itself suitable for the given response is used, if any. If no parser declares itself suitable, or if the chosen parser fails, a catch-all BinaryParser is used instead.
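
The selection rule described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical and written only for illustration: the types Page, SketchParser, NaiveHtmlParser, CatchAllParser and Result, and the method parseWithFallback, are not part of the BUbiNG API, and the choice of SHA-256 as the digest is an assumption. The real Parser, HTMLParser and BinaryParser have their own signatures, documented on their respective pages.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "first suitable parser, else catch-all"; NOT the BUbiNG API.
final class ParserChainSketch {
    /** Hypothetical stand-in for an HTTP response: a content type and a body. */
    static final class Page {
        final String contentType;
        final byte[] body;
        Page(String contentType, String body) {
            this.contentType = contentType;
            this.body = body.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical parser contract: a filter plus a parse step that returns a digest. */
    interface SketchParser {
        boolean accepts(Page page);
        byte[] parse(Page page, List<String> links) throws Exception;
    }

    /** Outcome of a parse: which parser was used and the digest it computed. */
    static final class Result {
        final SketchParser usedParser;
        final byte[] digest;
        Result(SketchParser usedParser, byte[] digest) {
            this.usedParser = usedParser;
            this.digest = digest;
        }
    }

    /** Digest helper; SHA-256 is an assumption made for this sketch. */
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Catch-all: accepts every page, extracts no links, digests the raw bytes. */
    static final class CatchAllParser implements SketchParser {
        public boolean accepts(Page page) { return true; }
        public byte[] parse(Page page, List<String> links) { return sha256(page.body); }
    }

    /** Accepts only text/html and collects href values with a deliberately naive regex. */
    static final class NaiveHtmlParser implements SketchParser {
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
        public boolean accepts(Page page) {
            return page.contentType != null && page.contentType.startsWith("text/html");
        }
        public byte[] parse(Page page, List<String> links) {
            Matcher m = HREF.matcher(new String(page.body, StandardCharsets.UTF_8));
            while (m.find()) links.add(m.group(1));
            return sha256(page.body);
        }
    }

    /** Uses the first parser that accepts the page; falls back to the catch-all otherwise. */
    static Result parseWithFallback(List<SketchParser> parsers, Page page, List<String> links) {
        for (SketchParser p : parsers) {
            if (!p.accepts(page)) continue;
            try {
                return new Result(p, p.parse(page, links));
            } catch (Exception e) {
                break; // the chosen parser failed: later parsers are not tried
            }
        }
        CatchAllParser fallback = new CatchAllParser();
        return new Result(fallback, fallback.parse(page, links));
    }

    public static void main(String[] args) {
        List<String> links = new java.util.ArrayList<>();
        Page page = new Page("text/html", "<a href=\"http://example.com/\">x</a>");
        Result r = parseWithFallback(List.of(new NaiveHtmlParser()), page, links);
        System.out.println(r.usedParser.getClass().getSimpleName() + " " + links);
    }
}
```

In this sketch a page whose content type is not text/html, or a page on which the chosen parser throws, is handled by the catch-all parser, which yields a digest but no links.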

Parsers should be written so that they can be easily reused, should be lightweight, and should be very robust to errors in the parsed responses.

Interface Summary
Interface	Description
Parser<T>	A generic parser for `responses`.
Parser.LinkReceiver	A class that can receive URLs discovered during parsing.
Parser.TextProcessor<T>	A class that can receive pieces of text discovered during parsing.

Class Summary
Class	Description
BinaryParser	A universal binary parser that just computes digests.
HTMLParser<T>	An HTML parser with additional responsibilities.
HTMLParser.DigestAppendable	A class computing the digest of a page.
HTMLParser.SetLinkReceiver	An implementation of a `Parser.LinkReceiver` that accumulates the URLs in a public set.
SpamTextProcessor	An implementation of a `Parser.TextProcessor` that accumulates the counts of terms from a given set specified via a `StringMap`.
SpamTextProcessor.TermCount
