Package it.unimi.dsi.law.warc.filters
Class IsProbablyBinary
java.lang.Object
it.unimi.dsi.law.warc.filters.AbstractFilter<HttpResponse>
it.unimi.dsi.law.warc.filters.IsProbablyBinary
- All Implemented Interfaces:
com.google.common.base.Predicate<HttpResponse>, Filter<HttpResponse>, Predicate<HttpResponse>
public class IsProbablyBinary extends AbstractFilter<HttpResponse>
Accepts only HTTP responses whose content stream appears to be binary.
-
Field Summary
Fields
Modifier and Type          Field                       Description
static int                 BINARY_CHECK_SCAN_LENGTH
static IsProbablyBinary    INSTANCE
static int                 THRESHOLD                   The number of zeroes that must appear to cause the page to be considered probably binary.
Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME
-
Method Summary
Modifier and Type          Method                              Description
boolean                    apply(HttpResponse httpResponse)    This method implements a simple heuristic for guessing whether a page is binary.
String                     toString()
static IsProbablyBinary    valueOf(String emptySpec)
Methods inherited from class it.unimi.dsi.law.warc.filters.AbstractFilter
toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface com.google.common.base.Predicate
equals, test
-
Field Details
-
INSTANCE
-
BINARY_CHECK_SCAN_LENGTH
public static final int BINARY_CHECK_SCAN_LENGTH
- See Also:
- Constant Field Values
-
THRESHOLD
public static final int THRESHOLD
The number of zeroes that must appear to cause the page to be considered probably binary. Some misconfigured servers emit one or two ASCII NULs at the start of their pages, so we use a relatively safe value.
- See Also:
- Constant Field Values
-
Method Details
-
apply
This method implements a simple heuristic for guessing whether a page is binary.
The first BINARY_CHECK_SCAN_LENGTH bytes are scanned: if we find more than THRESHOLD zeroes, we deduce that this page is binary. Note that this also works with UTF-8, since no legal UTF-8 encoding of a character contains a zero byte (unless the character being encoded is 0 itself, which is not our case).
- Returns:
true if and only if this page most probably has binary content.
- Throws:
NullPointerException - if the page has no byte content.
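The heuristic described above can be sketched on a plain byte array. This is a minimal illustration under stated assumptions, not the class's actual implementation: the class name BinaryHeuristicSketch, the method looksBinary, and the values of SCAN_LENGTH and ZERO_THRESHOLD are hypothetical stand-ins (the real constants are given under Constant Field Values and are not reproduced on this page), and the sketch reads from a byte[] rather than from an HttpResponse.

```java
import java.nio.charset.StandardCharsets;

class BinaryHeuristicSketch {
    // Hypothetical values chosen for illustration only; they are NOT the
    // documented values of BINARY_CHECK_SCAN_LENGTH and THRESHOLD.
    static final int SCAN_LENGTH = 1000;
    static final int ZERO_THRESHOLD = 3;

    /**
     * Returns true if more than ZERO_THRESHOLD zero bytes occur among the
     * first SCAN_LENGTH bytes of content. Throws NullPointerException if
     * content is null.
     */
    static boolean looksBinary(byte[] content) {
        int limit = Math.min(content.length, SCAN_LENGTH);
        int zeroes = 0;
        for (int i = 0; i < limit; i++) {
            // Stop as soon as the count exceeds the threshold.
            if (content[i] == 0 && ++zeroes > ZERO_THRESHOLD) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // UTF-8 text with a non-ASCII character: contains no zero bytes.
        byte[] text = "<html>caf\u00e9</html>".getBytes(StandardCharsets.UTF_8);

        // The same text preceded by two stray NULs, as a misconfigured
        // server might emit; still at or below the threshold.
        byte[] withStrayNuls = new byte[text.length + 2];
        System.arraycopy(text, 0, withStrayNuls, 2, text.length);

        // A buffer consisting only of zero bytes.
        byte[] zeroFilled = new byte[64];

        System.out.println(looksBinary(text));          // false
        System.out.println(looksBinary(withStrayNuls)); // false
        System.out.println(looksBinary(zeroFilled));    // true
    }
}
```

Returning as soon as the count exceeds the threshold keeps the scan short for content that is clearly binary, while pages with only one or two leading NULs are still treated as text.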
-
valueOf
-
toString
-