A filter is a strategy to decide whether a given object should be accepted or not; the type of objects a filter considers is called the base type of the filter. The base type is usually a URI, a URIResponse or a law.bubing.util.Link. The type depends on the phase in which the filter will be applied.
Various kinds of filters are available, and they can be composed with boolean operators using the static methods specified in the Filters class. Additionally, a filter parser is provided in the it.unimi.dsi.law.ubi.filters.parser package; since the parser itself is written using JavaCC, we provide a description of it here.
Two filters are called homogeneous if they filter the same kind of objects, heterogeneous otherwise.
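
As a rough illustration of boolean composition of homogeneous filters, here is a minimal, self-contained sketch. It does not use the classes of this package: SketchFilter, SketchFilters and their methods are hypothetical stand-ins, and the actual signatures in the Filters class may differ.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical stand-in for a filter with base type T (not the real Filter<T>).
interface SketchFilter<T> {
    boolean apply(T x);
}

// Hypothetical combinators illustrating boolean composition of homogeneous filters.
final class SketchFilters {
    private SketchFilters() {}

    // Accepts x only if every given filter accepts it.
    @SafeVarargs
    static <T> SketchFilter<T> and(SketchFilter<T>... filters) {
        return x -> {
            for (SketchFilter<T> f : filters) if (!f.apply(x)) return false;
            return true;
        };
    }

    // Accepts x if at least one of the given filters accepts it.
    @SafeVarargs
    static <T> SketchFilter<T> or(SketchFilter<T>... filters) {
        return x -> {
            for (SketchFilter<T> f : filters) if (f.apply(x)) return true;
            return false;
        };
    }

    // Accepts x exactly when the given filter rejects it.
    static <T> SketchFilter<T> not(SketchFilter<T> f) {
        return x -> !f.apply(x);
    }

    // Illustrative URI filter: host ends (case-insensitively) with a suffix.
    static SketchFilter<URI> hostEndsWith(String suffix) {
        String s = suffix.toLowerCase(Locale.ROOT);
        return u -> u.getHost() != null && u.getHost().toLowerCase(Locale.ROOT).endsWith(s);
    }

    // Illustrative URI filter: scheme equals a given string.
    static SketchFilter<URI> schemeEquals(String scheme) {
        return u -> scheme.equals(u.getScheme());
    }
}
```

With these stand-ins, and(hostEndsWith("foo.bar"), not(schemeEquals("ftp"))) yields a single filter over URIs; all operands share the base type URI, so the composition is homogeneous.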
A filter parser is instantiated on the basis of the kind of filters it will actually return; more precisely, a FilterParser<T> is a filter parser that will return a Filter<T>; for technical reasons, the class T must be provided as the only parameter when the parser is constructed. A parser can be used many times. Whenever a filter is needed, the parse(String x) method of the parser is called, which returns a filter of the correct kind or throws a parse exception.
The syntax used by the filter parser is basically a propositional calculus, with and (denoted by infix and or &), or (denoted by infix or or |) and not (denoted by prefix not or !), whose ground terms have the same form as the strings returned by the toString() method of the Filter class.
Here are some examples:
HostEquals(www.foo.bar)
(HostEndsWith(foo.bar) and not ForbiddenHost(http://xxx.yyy.zzz/list-of-forbidden-hosts)) or NoMoreSlashThan(10)
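
To make the shape of such expressions concrete, here is a small recursive-descent sketch that turns an expression of this form into a java.util.function.Predicate over URIs. It is not the JavaCC parser of this package. The class FilterExprSketch, the demoTerms() registry and the operator precedence (not binds tighter than and, which binds tighter than or) are assumptions made for illustration, and ground-term arguments containing parentheses are not handled.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of a parser for expressions such as
// "(HostEndsWith(foo.bar) and not HostEquals(bad.foo.bar)) or SchemeEquals(ftp)".
final class FilterExprSketch {
    private final String s;
    private final Map<String, Function<String, Predicate<URI>>> terms;
    private int pos = 0;

    private FilterExprSketch(String s, Map<String, Function<String, Predicate<URI>>> terms) {
        this.s = s;
        this.terms = terms;
    }

    static Predicate<URI> parse(String expr, Map<String, Function<String, Predicate<URI>>> terms) {
        FilterExprSketch p = new FilterExprSketch(expr, terms);
        Predicate<URI> result = p.or();
        p.skipSpaces();
        if (p.pos != p.s.length()) throw new IllegalArgumentException("Unexpected input at " + p.pos);
        return result;
    }

    // Illustrative registry mapping ground-term names to filter factories.
    static Map<String, Function<String, Predicate<URI>>> demoTerms() {
        Map<String, Function<String, Predicate<URI>>> m = new HashMap<>();
        m.put("HostEndsWith", suffix -> u -> u.getHost() != null
                && u.getHost().toLowerCase(Locale.ROOT).endsWith(suffix.toLowerCase(Locale.ROOT)));
        m.put("HostEquals", host -> u -> host.equalsIgnoreCase(u.getHost()));
        m.put("SchemeEquals", scheme -> u -> scheme.equals(u.getScheme()));
        return m;
    }

    // or := and (("or" | "|") and)*
    private Predicate<URI> or() {
        Predicate<URI> left = and();
        while (eat("or") || eat("|")) left = left.or(and());
        return left;
    }

    // and := unary (("and" | "&") unary)*
    private Predicate<URI> and() {
        Predicate<URI> left = unary();
        while (eat("and") || eat("&")) left = left.and(unary());
        return left;
    }

    // unary := ("not" | "!") unary | "(" or ")" | groundTerm
    private Predicate<URI> unary() {
        if (eat("not") || eat("!")) return unary().negate();
        if (eat("(")) {
            Predicate<URI> inner = or();
            if (!eat(")")) throw new IllegalArgumentException("Missing ')' at " + pos);
            return inner;
        }
        return groundTerm();
    }

    // groundTerm := NAME "(" ARG ")", where ARG contains no ')'.
    private Predicate<URI> groundTerm() {
        skipSpaces();
        int start = pos;
        while (pos < s.length() && Character.isJavaIdentifierPart(s.charAt(pos))) pos++;
        String name = s.substring(start, pos);
        if (name.isEmpty() || pos >= s.length() || s.charAt(pos) != '(')
            throw new IllegalArgumentException("Expected a ground term at " + start);
        int close = s.indexOf(')', pos);
        if (close < 0) throw new IllegalArgumentException("Missing ')' after " + name);
        String arg = s.substring(pos + 1, close);
        pos = close + 1;
        Function<String, Predicate<URI>> factory = terms.get(name);
        if (factory == null) throw new IllegalArgumentException("Unknown filter: " + name);
        return factory.apply(arg);
    }

    // Consumes the token if present; word tokens must not run into an identifier.
    private boolean eat(String token) {
        skipSpaces();
        if (!s.startsWith(token, pos)) return false;
        int end = pos + token.length();
        boolean word = Character.isLetter(token.charAt(0));
        if (word && end < s.length() && Character.isJavaIdentifierPart(s.charAt(end))) return false;
        pos = end;
        return true;
    }

    private void skipSpaces() {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
    }
}
```

The registry in demoTerms() replaces whatever mechanism the real parser uses to resolve ground terms into filter instances; only the overall grammar shape is meant to carry over.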
Usually, an expression should only contain references to homogeneous filters of type T, where T is the type used to instantiate the parser. Nonetheless, if some ground term refers to a filter of some other type D, the parser will try to find a static method in the Filters class having the following signature:
public static Filter<T> adaptFilterD2T(Filter<D> f)
that adapts the given filter f to a filter of the correct type. If this method is missing, the parser throws an exception. This is how a URI filter can be applied to a URIResponse instance (by extracting the associated URI), or to a Link instance (by applying the filter to its target).
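
The adaptation idea can be sketched as follows. This is a self-contained illustration, not the code of the Filters class: AdaptSketchFilter and SketchLink are hypothetical stand-ins for Filter<T> and Link, and the method name adaptFilterURI2Link is simply the adaptFilterD2T pattern instantiated with D = URI and T = Link.

```java
import java.net.URI;

// Hypothetical stand-in for a filter with base type T (not the real Filter<T>).
interface AdaptSketchFilter<T> {
    boolean apply(T x);
}

// Hypothetical stand-in for a link: a source URI pointing to a target URI.
final class SketchLink {
    final URI source;
    final URI target;

    SketchLink(URI source, URI target) {
        this.source = source;
        this.target = target;
    }
}

final class AdaptSketch {
    private AdaptSketch() {}

    // Adapts a URI filter into a link filter by applying it to the link's target.
    static AdaptSketchFilter<SketchLink> adaptFilterURI2Link(AdaptSketchFilter<URI> f) {
        return link -> f.apply(link.target);
    }
}
```

For example, a URI filter accepting only the scheme http, once adapted this way, accepts exactly those links whose target uses that scheme, regardless of the source.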
Interface | Description |
---|---|
Filter<T> | A filter is a strategy to decide whether to accept a given object or not. |
URIResponse | An interface implemented by all classes able to expose a HttpResponse and URI, e.g. |
Class | Description |
---|---|
AbstractFilter<T> | |
ContentTypeStartsWith | A filter accepting only fetched responses whose content type starts with a given string. |
DigestEquals | A filter accepting only records of given digest, specified as a hexadecimal string. |
DuplicateSegmentsLessThan | A filter accepting only URIs whose path does not contain too many duplicate segments. |
Filters | A collection of static methods to deal with filters. |
HostEndsWith | A filter accepting only URIs whose host ends with (case-insensitively) a certain suffix. |
HostEndsWithOneOf | A filter accepting only URIs whose host part ends (case-insensitively) with one of a given set of suffixes. |
HostEquals | A filter accepting only URIs whose host equals (case-insensitively) a certain string. |
IsHttpResponse | A filter accepting only records that are HTTP/HTTPS responses. |
IsProbablyBinary | A filter accepting only HTTP responses whose content stream appears to be binary. |
PathEndsWithOneOf | A filter accepting only URIs whose path ends (case-insensitively) with one of a given set of suffixes. |
ResponseMatches | A filter accepting only HTTP responses whose content stream (in ISO-8859-1 encoding) matches a regular expression. |
SameHost | A filter accepting only inter-host links. |
SchemeEquals | A filter accepting only URIs whose scheme equals a certain string (typically, http). |
StatusCategory | A filter accepting only fetched responses whose status category (status/100) has a certain value. |
URLEquals | A filter accepting only a given URI. |
URLMatchesRegex | A filter accepting only URIs that match a certain regular expression. |
URLShorterThan | A filter accepting only URIs whose overall length is below a given threshold. |