Package it.unimi.dsi.law.warc.filters
Class IsProbablyBinary
java.lang.Object
it.unimi.dsi.law.warc.filters.AbstractFilter<HttpResponse>
it.unimi.dsi.law.warc.filters.IsProbablyBinary
- All Implemented Interfaces:
com.google.common.base.Predicate<HttpResponse>, Filter<HttpResponse>, Predicate<HttpResponse>
public class IsProbablyBinary extends AbstractFilter<HttpResponse>
Accepts only HTTP responses whose content stream appears to be binary.
-
Field Summary
Fields
Modifier and Type          Field                      Description
static int                 BINARY_CHECK_SCAN_LENGTH
static IsProbablyBinary    INSTANCE
static int                 THRESHOLD                  The number of zeroes that must appear to cause the page to be considered probably binary.

Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter
FILTER_PACKAGE_NAME
-
Method Summary
Modifier and Type          Method                             Description
boolean                    apply(HttpResponse httpResponse)   This method implements a simple heuristic for guessing whether a page is binary.
String                     toString()
static IsProbablyBinary    valueOf(String emptySpec)

Methods inherited from class it.unimi.dsi.law.warc.filters.AbstractFilter
toString

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface com.google.common.base.Predicate
equals, test
-
Field Details
-
INSTANCE
-
BINARY_CHECK_SCAN_LENGTH
public static final int BINARY_CHECK_SCAN_LENGTH
- See Also:
- Constant Field Values
-
THRESHOLD
public static final int THRESHOLD
The number of zeroes that must appear to cause the page to be considered probably binary. Some misconfigured servers emit one or two ASCII NULs at the start of their pages, so we use a relatively safe value.
- See Also:
- Constant Field Values
-
-
Method Details
-
apply
This method implements a simple heuristic for guessing whether a page is binary.
The first BINARY_CHECK_SCAN_LENGTH bytes are scanned: if we find more than THRESHOLD zeroes, we deduce that this page is binary. Note that this also works with UTF-8, as no legal UTF-8 encoding of a character contains a zero byte (unless you're encoding the character 0, but this is not our case).
- Returns:
true iff this page most probably has binary content.
- Throws:
NullPointerException - if the page has no byte content.
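The heuristic described for apply can be illustrated on a plain byte array. The following is only a minimal sketch under stated assumptions, not the library's implementation: the class name BinaryHeuristicSketch, the method looksBinary, and the values chosen for SCAN_LENGTH and ZERO_THRESHOLD are hypothetical (the real values of BINARY_CHECK_SCAN_LENGTH and THRESHOLD are listed under Constant Field Values), and it works on a byte[] rather than on an HttpResponse.

```java
import java.nio.charset.StandardCharsets;

public class BinaryHeuristicSketch {
    // Hypothetical stand-ins for BINARY_CHECK_SCAN_LENGTH and THRESHOLD;
    // the actual constant values are not stated on this page.
    static final int SCAN_LENGTH = 1000;
    static final int ZERO_THRESHOLD = 3;

    /**
     * Returns true if more than ZERO_THRESHOLD zero bytes occur among
     * the first SCAN_LENGTH bytes of the content.
     * Throws NullPointerException if content is null.
     */
    static boolean looksBinary(byte[] content) {
        int limit = Math.min(content.length, SCAN_LENGTH);
        int zeroes = 0;
        for (int i = 0; i < limit; i++) {
            // Stop as soon as the count exceeds the threshold.
            if (content[i] == 0 && ++zeroes > ZERO_THRESHOLD) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Accented UTF-8 text: multi-byte sequences never contain a zero byte.
        byte[] utf8Text = "Citt\u00e0 \u00e8 gi\u00e0".getBytes(StandardCharsets.UTF_8);
        // A page preceded by two stray NULs, as some misconfigured servers emit.
        byte[] leadingNuls = "\u0000\u0000<html></html>".getBytes(StandardCharsets.UTF_8);
        // A buffer full of zero bytes, standing in for binary content.
        byte[] zeroFilled = new byte[64];

        System.out.println("UTF-8 text:   " + looksBinary(utf8Text));
        System.out.println("Leading NULs: " + looksBinary(leadingNuls));
        System.out.println("Zero-filled:  " + looksBinary(zeroFilled));
    }
}
```

With these assumed values, only the zero-filled buffer is classified as binary; the two leading NULs stay below the threshold, which is the situation the THRESHOLD description is meant to tolerate.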
-
valueOf
-
toString
-