Package it.unimi.di.law.warc.filters

A comprehensive filtering system.

A filter is a strategy for deciding whether a given object should be accepted; the type of objects a filter considers is called the base type of the filter. In most cases, the base type is a URL or a fetched page. More precisely, a prefetch filter is one that has BURL as its base type (typically, it decides whether a URL should be scheduled for a later visit, or whether it should be fetched); a postfetch filter is one that has FetchedResponse as its base type and decides whether to do something with that response (typically, to parse it or to store it).
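
The distinction between base types can be illustrated with a minimal sketch. Everything in it is a stand-in introduced for illustration only: SimpleFilter, Page and the two sample filters are hypothetical names, not the package's actual Filter, BURL or FetchedResponse types, and java.net.URI is assumed here as a convenient URL type.

```java
import java.net.URI;

public class FilterSketch {
    /** Hypothetical stand-in for a filter: accepts or rejects objects of base type T. */
    interface SimpleFilter<T> {
        boolean accept(T x);
    }

    /** Hypothetical stand-in for a fetched page; only the fields used below. */
    static final class Page {
        final URI uri;
        final int status;
        final String contentType;

        Page(URI uri, int status, String contentType) {
            this.uri = uri;
            this.status = status;
            this.contentType = contentType;
        }
    }

    /** A prefetch-style filter: its base type is a URL. */
    static final SimpleFilter<URI> HTTP_ONLY =
        u -> "http".equals(u.getScheme()) || "https".equals(u.getScheme());

    /** A postfetch-style filter: its base type is a fetched page. */
    static final SimpleFilter<Page> IS_HTML =
        p -> p.status == 200 && p.contentType != null && p.contentType.startsWith("text/html");

    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/");
        // Decide before fetching, looking only at the URL.
        System.out.println(HTTP_ONLY.accept(uri));
        // Decide after fetching, looking at the response.
        System.out.println(IS_HTML.accept(new Page(uri, 200, "text/html; charset=utf-8")));
    }
}
```

The two filters cannot be swapped: the compiler rejects applying HTTP_ONLY to a Page, which is the sense in which each filter is tied to its base type.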

Various kinds of filters are available, and they can be composed with boolean operators using the static methods specified in the Filters class. Additionally, a filter parser is provided in the it.unimi.dsi.law.ubi.filters.parser package; since the parser itself is written using JavaCC, we provide a description of it here.

Two filters are called homogeneous if they have the same base type, and heterogeneous otherwise.
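
Boolean composition of homogeneous filters can be sketched as follows. This is only an illustration under stated assumptions: java.util.function.Predicate stands in for the filter type, host names are represented as plain strings, and the and, or and not helpers are hypothetical, not the actual static methods of the Filters class.

```java
import java.util.function.Predicate;

public class ComposeSketch {
    /** Conjunction of two filters with the same base type T. */
    static <T> Predicate<T> and(Predicate<T> a, Predicate<T> b) {
        return x -> a.test(x) && b.test(x);
    }

    /** Disjunction of two filters with the same base type T. */
    static <T> Predicate<T> or(Predicate<T> a, Predicate<T> b) {
        return x -> a.test(x) || b.test(x);
    }

    /** Negation of a filter. */
    static <T> Predicate<T> not(Predicate<T> a) {
        return x -> !a.test(x);
    }

    // Three homogeneous sample filters: all have String (a host name) as base type.
    static final Predicate<String> ENDS_IT = h -> h.endsWith(".it");
    static final Predicate<String> ENDS_UK = h -> h.endsWith(".uk");
    static final Predicate<String> STARTS_WWW = h -> h.startsWith("www.");

    /** Accepts hosts under .it or .uk that do not start with "www.". */
    static final Predicate<String> COMPOSED = and(or(ENDS_IT, ENDS_UK), not(STARTS_WWW));

    public static void main(String[] args) {
        System.out.println(COMPOSED.test("law.di.unimi.it"));
        System.out.println(COMPOSED.test("www.example.co.uk"));
    }
}
```

Because each helper has a single type parameter T, passing two heterogeneous filters (say, one on strings and one on integers) does not compile; one of them has to be adapted first.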

A filter parser is instantiated on the basis of the kind of filters it will return: a FilterParser<T> is a filter parser that returns a Filter<T>. For technical reasons, the class T must be provided as the only parameter when the parser is constructed. A parser can be used many times: every time a filter is needed, the parse(String x) method of the parser is called, which either returns a filter of the correct kind or throws a parse exception.

The syntax used by the filter parser is documented separately. Basically, it is a propositional calculus with and (denoted by infix and or &), or (denoted by infix or or |) and not (denoted by prefix not or !), whose ground terms have the same form as the strings returned by the toString() method of the Filter class.
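
To make the shape of such expressions concrete, here is a toy recursive-descent parser for a propositional syntax of this kind. It is a sketch, not the JavaCC-generated parser: the ground terms StartsWith, EndsWith and Contains are hypothetical filters over strings, the Name(argument) form of a ground term is an assumption, and so is the precedence (not binds tighter than and, which binds tighter than or).

```java
import java.util.function.Predicate;

public class ToyFilterParser {
    private final String s;
    private int pos;

    private ToyFilterParser(String s) { this.s = s; }

    /** Parses an expression into a predicate on strings; throws IllegalArgumentException on bad input. */
    public static Predicate<String> parse(String expr) {
        ToyFilterParser p = new ToyFilterParser(expr);
        Predicate<String> f = p.or();
        p.skipSpaces();
        if (p.pos != p.s.length()) throw p.error("unexpected trailing input");
        return f;
    }

    // or := and (("or" | "|") and)*
    private Predicate<String> or() {
        Predicate<String> left = and();
        while (eat("or") || eat("|")) left = left.or(and());
        return left;
    }

    // and := factor (("and" | "&") factor)*
    private Predicate<String> and() {
        Predicate<String> left = factor();
        while (eat("and") || eat("&")) left = left.and(factor());
        return left;
    }

    // factor := ("not" | "!") factor | "(" or ")" | ground
    private Predicate<String> factor() {
        if (eat("not") || eat("!")) return factor().negate();
        if (eat("(")) {
            Predicate<String> inner = or();
            if (!eat(")")) throw error("expected ')'");
            return inner;
        }
        return ground();
    }

    // ground := Name "(" argument ")", where the argument contains no ')'
    private Predicate<String> ground() {
        skipSpaces();
        int start = pos;
        while (pos < s.length() && Character.isLetterOrDigit(s.charAt(pos))) pos++;
        String name = s.substring(start, pos);
        if (name.isEmpty() || !eat("(")) throw error("expected Name(argument)");
        int close = s.indexOf(')', pos);
        if (close < 0) throw error("unterminated argument");
        String arg = s.substring(pos, close).trim();
        pos = close + 1;
        switch (name) {
            case "StartsWith": return x -> x.startsWith(arg);
            case "EndsWith": return x -> x.endsWith(arg);
            case "Contains": return x -> x.contains(arg);
            default: throw error("unknown filter " + name);
        }
    }

    /** Consumes tok if it comes next; word operators must not be a prefix of a longer name. */
    private boolean eat(String tok) {
        skipSpaces();
        if (!s.startsWith(tok, pos)) return false;
        int end = pos + tok.length();
        boolean word = Character.isLetter(tok.charAt(0));
        if (word && end < s.length() && Character.isLetterOrDigit(s.charAt(end))) return false;
        pos = end;
        return true;
    }

    private void skipSpaces() {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at position " + pos);
    }

    public static void main(String[] args) {
        Predicate<String> f = parse("(EndsWith(.it) | EndsWith(.uk)) and not StartsWith(www.)");
        System.out.println(f.test("law.di.unimi.it"));
    }
}
```

As with the parser described above, the static parse method either returns a filter or throws; here the exception is a plain IllegalArgumentException rather than a dedicated parse exception.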

Here are some examples:

Usually, an expression should only contain references to homogeneous filters of type T, where T is the type used to instantiate the parser. Nonetheless, if some ground term refers to a filter of some other type D, the parser will try to find a static method in the Filters class having the following signature:

        public static Filter<T> adaptFilterD2T(Filter<D> f)
        

that adapts the given filter f to a filter of the correct type. If this method is missing, the parser itself throws an exception.
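
The adapter lookup can be sketched with Java reflection. This is an illustration under assumptions, not the parser's actual code: Predicate stands in for Filter, the nested Adapters class is a hypothetical holder playing the role of the Filters class, and the method name is assumed to be built from the simple class names of D and T.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.function.Predicate;

public class AdapterLookupSketch {
    /** Hypothetical holder of adapter methods, playing the role of the Filters class. */
    public static class Adapters {
        /** Adapts a filter on integers to a filter on strings by applying it to the string length. */
        public static Predicate<String> adaptFilterInteger2String(Predicate<Integer> f) {
            return s -> f.test(s.length());
        }
    }

    /** Looks up a public static method named adaptFilter<D>2<T> taking a single Predicate. */
    static Method findAdapter(Class<?> holder, Class<?> from, Class<?> to) throws NoSuchMethodException {
        String name = "adaptFilter" + from.getSimpleName() + "2" + to.getSimpleName();
        Method m = holder.getMethod(name, Predicate.class);
        if (!Modifier.isStatic(m.getModifiers())) throw new NoSuchMethodException(name + " is not static");
        return m;
    }

    /** Adapts f from base type D to base type T, or throws if no suitable adapter exists. */
    @SuppressWarnings("unchecked")
    static <D, T> Predicate<T> adapt(Predicate<D> f, Class<D> from, Class<T> to) {
        try {
            return (Predicate<T>) findAdapter(Adapters.class, from, to).invoke(null, f);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "No adapter from " + from.getSimpleName() + " to " + to.getSimpleName(), e);
        }
    }

    public static void main(String[] args) {
        Predicate<Integer> longerThan3 = n -> n > 3;
        Predicate<String> onStrings = adapt(longerThan3, Integer.class, String.class);
        System.out.println(onStrings.test("hello"));
    }
}
```

Resolving the adapter by name keeps the lookup independent of any fixed list of type pairs: supporting a new pair only requires adding one static method that follows the naming scheme.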