it.unimi.dsi.law.warc.filters.AbstractFilter<URI>

it.unimi.dsi.law.warc.filters.DuplicateSegmentsLessThan

All Implemented Interfaces:: com.google.common.base.Predicate<URI>, Filter<URI>, Predicate<URI>

public class DuplicateSegmentsLessThan
extends AbstractFilter<URI>

Accepts only URIs whose path does not contain too many duplicate segments.

It is not uncommon to find URIs generated by badly configured 404 pages that look like http://example.com/foo/bar/foo/bar/…. This filter will not accept such URIs if some sequence of consecutive segments appears more times than a given threshold.

This implementation uses ideas from Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, by Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park, in Proc. of the 12th Annual Symposium on Combinatorial Pattern Matching, number 2089 of Lecture Notes In Computer Science, pages 181−192, Springer-Verlag, 2001, to simulate a suffix-tree visit on a suffix array, and ideas from Simple and flexible detection of contiguous repeats using a suffix tree, by Jens Stoye and Dan Gusfield, Theoret. Comput. Sci. 270:843−856, 2002, for the linear-time detection of tandem arrays using suffix trees.

The resulting code is one order of magnitude faster than regular expressions.

Field Summary

Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME

Constructor Summary

Constructors
Constructor	Description
`DuplicateSegmentsLessThan(int threshold)`	Creates a filter that only accepts URIs whose path does contains less duplicate consecutive segments than the given threshold.

Method Summary

Modifier and Type	Method	Description
`boolean`	`apply(URI url)`
`boolean`	`equals(Object x)`
`static void`	`main(String[] arg)`
`String`	`toString()`
`static DuplicateSegmentsLessThan`	`valueOf(String spec)`

Methods inherited from class it.unimi.dsi.law.warc.filters.AbstractFilter

toString

Methods inherited from class java.lang.Object

clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.google.common.base.Predicate

test

Methods inherited from interface java.util.function.Predicate

and, negate, or

Constructor Details
- DuplicateSegmentsLessThan
  
  public DuplicateSegmentsLessThan(int threshold)
  
  Creates a filter that only accepts URIs whose path does contains less duplicate consecutive segments than the given threshold.
  
  Parameters:
  
  threshold - the duplicate-segment threshold (at least 2); if a URI contains less than this number of duplicate consecutive segments it will be accepted.
Method Details
- apply
  
  public boolean apply(URI url)
- valueOf
  
  public static DuplicateSegmentsLessThan valueOf(String spec)
- toString
  
  public String toString()
  
  Overrides:
  
  toString in class Object
- equals
  
  public boolean equals(Object x)
  
  Specified by:
  
  equals in interface com.google.common.base.Predicate<URI>
  
  Overrides:
  
  equals in class Object
- main
  
  public static void main(String[] arg)

Class DuplicateSegmentsLessThan

Field Summary

Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter

Constructor Summary

Method Summary

Methods inherited from class it.unimi.dsi.law.warc.filters.AbstractFilter

Methods inherited from class java.lang.Object

Methods inherited from interface com.google.common.base.Predicate

Methods inherited from interface java.util.function.Predicate

Constructor Details

DuplicateSegmentsLessThan

Method Details

apply

valueOf

toString

equals

main