All Implemented Interfaces:
com.google.common.base.Predicate<java.net.URI>, Filter<java.net.URI>, FlyweightPrototype<Filter<java.net.URI>>, java.util.function.Predicate<java.net.URI>
public class DuplicateSegmentsLessThan extends AbstractFilter<java.net.URI>
It is not uncommon to find URIs that look like http://example.com/foo/bar/foo/bar/…
generated by badly configured 404 pages.
This filter rejects such URIs when some sequence of consecutive segments
is repeated a number of times equal to or above a given threshold.
This implementation uses ideas from “Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications”, by Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park, in Proc. of the 12th Annual Symposium on Combinatorial Pattern Matching, volume 2089 of Lecture Notes in Computer Science, pages 181–192, Springer-Verlag, 2001, to simulate a suffix-tree visit on a suffix array. It also uses ideas from “Simple and flexible detection of contiguous repeats using a suffix tree”, by Jens Stoye and Dan Gusfield, Theoret. Comput. Sci. 270:843–856, 2002, for the linear-time detection of tandem arrays using suffix trees.
The resulting code is one order of magnitude faster than the same check implemented with regular expressions.
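The acceptance rule described above can be illustrated with a deliberately naive sketch. It does not use suffix arrays and is not the implementation of this class: it simply tries every block of consecutive segments and counts how many times that block is repeated back to back. The class name `NaiveDuplicateSegments`, its method names, and the choice to ignore empty segments are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical illustration only: a brute-force stand-in, not the suffix-array code of this class.
final class NaiveDuplicateSegments {

    /** Returns the largest number of back-to-back repetitions of any block of consecutive segments. */
    static int maxConsecutiveRepeats(String[] seg) {
        int best = seg.length == 0 ? 0 : 1;
        for (int start = 0; start < seg.length; start++) {
            // Only block lengths that leave room for at least one repetition are worth trying.
            for (int len = 1; start + 2 * len <= seg.length; len++) {
                int count = 1;
                int next = start + len;
                while (next + len <= seg.length && blockEquals(seg, start, next, len)) {
                    count++;
                    next += len;
                }
                best = Math.max(best, count);
            }
        }
        return best;
    }

    /** Compares the blocks of length len starting at positions a and b. */
    static boolean blockEquals(String[] seg, int a, int b, int len) {
        for (int i = 0; i < len; i++) {
            if (!seg[a + i].equals(seg[b + i])) return false;
        }
        return true;
    }

    /** Splits the raw path on '/' and drops empty segments (an assumption of this sketch). */
    static String[] segments(URI uri) {
        String path = uri.getRawPath();
        if (path == null) return new String[0];
        return Arrays.stream(path.split("/")).filter(s -> !s.isEmpty()).toArray(String[]::new);
    }

    /** Accepts the URI if no block of segments is repeated threshold or more times in a row. */
    static boolean accept(URI uri, int threshold) {
        return maxConsecutiveRepeats(segments(uri)) < threshold;
    }
}
```

Under these assumptions, `accept(URI.create("http://example.com/foo/bar/foo/bar/foo/bar"), 3)` returns `false`, because the block `foo/bar` occurs three times in a row, while the same URI is accepted with a threshold of 4. The sketch takes cubic time in the number of segments in the worst case, which is the cost the suffix-array approach cited above is meant to avoid.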
FILTER_PACKAGE_NAME
Constructor | Description |
---|---|
DuplicateSegmentsLessThan(int threshold) | Creates a filter that only accepts URIs whose path contains fewer duplicate consecutive segments than the given threshold. |
Modifier and Type | Method | Description |
---|---|---|
boolean | apply(java.net.URI url) | Apply the filter to a given URI |
Filter<java.net.URI> | copy() | |
boolean | equals(java.lang.Object x) | Compare this object with a given generic one |
int | hashCode() | |
static void | main(java.lang.String[] arg) | |
java.lang.String | toString() | A string representation of the state of this object, that is just the threshold used. |
static DuplicateSegmentsLessThan | valueOf(java.lang.String spec) | Get a new DuplicateSegmentsLessThan that will accept only URIs whose path does not contain too many duplicate segments. |
public DuplicateSegmentsLessThan(int threshold)
Parameters:
threshold - the duplicate-segment threshold (at least 2); a URI containing fewer than this number of duplicate consecutive segments will be accepted.
public boolean apply(java.net.URI url)
Parameters:
url - the URI to be filtered
Returns:
true if the path contains a number of duplicate segments less than the threshold
public static DuplicateSegmentsLessThan valueOf(java.lang.String spec)
Get a new DuplicateSegmentsLessThan that will accept only URIs whose path does not contain too many duplicate segments.
Parameters:
spec - a string representing an integer, which is the threshold used by the filter
Returns:
a DuplicateSegmentsLessThan that will accept only URIs whose path contains a number of duplicate segments less than spec
public java.lang.String toString()
Overrides:
toString in class java.lang.Object
public boolean equals(java.lang.Object x)
Specified by:
equals in interface com.google.common.base.Predicate<java.net.URI>
Overrides:
equals in class java.lang.Object
Returns:
true if x is an instance of DuplicateSegmentsLessThan and the URIs allowed by x are allowed by this, and vice versa
public int hashCode()
Overrides:
hashCode in class java.lang.Object
public static void main(java.lang.String[] arg)
public Filter<java.net.URI> copy()
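Since the documented state of this filter is just its threshold, the contracts of valueOf, toString, equals, hashCode and copy can be pictured with a small stand-in class. This is only a sketch under assumptions: `ThresholdFilterSketch` is a hypothetical name, rejecting thresholds below 2 with an `IllegalArgumentException` is a guess based on the documented minimum, and having copy() return a fresh instance is likewise an assumption rather than confirmed behavior.

```java
// Hypothetical stand-in mirroring the documented value semantics; not the actual class.
final class ThresholdFilterSketch {
    private final int threshold;

    ThresholdFilterSketch(int threshold) {
        // Assumption: thresholds below the documented minimum of 2 are rejected eagerly.
        if (threshold < 2) {
            throw new IllegalArgumentException("threshold must be at least 2: " + threshold);
        }
        this.threshold = threshold;
    }

    /** Parses the textual specification as an integer threshold. */
    static ThresholdFilterSketch valueOf(String spec) {
        return new ThresholdFilterSketch(Integer.parseInt(spec.trim()));
    }

    /** Assumption: a copy is a new instance with the same threshold. */
    ThresholdFilterSketch copy() {
        return new ThresholdFilterSketch(threshold);
    }

    /** The state is just the threshold, so that is all the string contains. */
    @Override
    public String toString() {
        return Integer.toString(threshold);
    }

    /** Two filters accept the same URIs exactly when their thresholds are equal. */
    @Override
    public boolean equals(Object x) {
        return x instanceof ThresholdFilterSketch
                && ((ThresholdFilterSketch) x).threshold == threshold;
    }

    /** Consistent with equals: equal thresholds give equal hash codes. */
    @Override
    public int hashCode() {
        return Integer.hashCode(threshold);
    }
}
```

With this stand-in, `ThresholdFilterSketch.valueOf("3")` equals `new ThresholdFilterSketch(3)`, and the two have the same hash code, which is the property that lets such filters be used as keys in hash-based collections.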