public class DuplicateSegmentsLessThan extends AbstractFilter<URI>
It is not uncommon to find URIs generated by badly configured 404
pages that look like
This filter will not accept such URIs if some sequence of consecutive segments
appears more times than a given threshold.
This implementation uses ideas from Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, by Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park, in Proc. of the 12th Annual Symposium on Combinatorial Pattern Matching, number 2089 of Lecture Notes In Computer Science, pages 181−192, Springer-Verlag, 2001, to simulate a suffix-tree visit on a suffix array, and ideas from Simple and flexible detection of contiguous repeats using a suffix tree, by Jens Stoye and Dan Gusfield, Theoret. Comput. Sci. 270:843−856, 2002, for the linear-time detection of tandem arrays using suffix trees.
The resulting code is one order of magnitude faster than regular expressions.
Constructors Constructor Description
(int threshold)Creates a filter that only accepts URIs whose path does contains less duplicate consecutive segments than the given threshold.
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.google.common.base.Predicate
DuplicateSegmentsLessThanpublic DuplicateSegmentsLessThan(int threshold)Creates a filter that only accepts URIs whose path does contains less duplicate consecutive segments than the given threshold.
threshold- the duplicate-segment threshold (at least 2); if a URI contains less than this number of duplicate consecutive segments it will be accepted.
applypublic boolean apply(URI url)
toStringpublic String toString()
equalspublic boolean equals(Object x)
mainpublic static void main(String arg)