Class IsProbablyBinary

java.lang.Object
it.unimi.dsi.law.warc.filters.AbstractFilter<HttpResponse>
it.unimi.dsi.law.warc.filters.IsProbablyBinary
All Implemented Interfaces:
com.google.common.base.Predicate<HttpResponse>, Filter<HttpResponse>, Predicate<HttpResponse>

public class IsProbablyBinary
extends AbstractFilter<HttpResponse>
Accepts only HTTP responses whose content stream appears to be binary.
  • Field Details

    • INSTANCE

      public static final IsProbablyBinary INSTANCE
    • BINARY_CHECK_SCAN_LENGTH

      public static final int BINARY_CHECK_SCAN_LENGTH
      See Also:
      Constant Field Values
    • THRESHOLD

      public static final int THRESHOLD
      The number of zeroes that must appear to cause the page to be considered probably binary. Some misconfigured servers emit one or two ASCII NULs at the start of their pages, so we use a relatively safe value.
      See Also:
      Constant Field Values
  • Method Details

    • apply

      public boolean apply(HttpResponse httpResponse)
      This method implements a simple heuristic for guessing whether a page is binary.

      The first BINARY_CHECK_SCAN_LENGTH bytes are scanned: if more than THRESHOLD zero bytes are found, the page is deemed binary. This also works for UTF-8 text, because no valid UTF-8 encoding of a character contains a zero byte (except for the encoding of U+0000 itself, which is not expected to occur in pages).

      Returns:
      true iff the page most probably has binary content.
      Throws:
      NullPointerException - if the page has no byte content.
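
      The heuristic described above can be sketched in isolation as follows. This is a minimal illustration, not the library's actual code: the class BinaryHeuristicSketch, the method looksBinary and the two constant values are hypothetical (the real values of BINARY_CHECK_SCAN_LENGTH and THRESHOLD are not shown on this page), and the sketch works on a plain byte array instead of an HttpResponse.

```java
import java.nio.charset.StandardCharsets;

public class BinaryHeuristicSketch {
    // Hypothetical values chosen for illustration only; they are NOT
    // taken from IsProbablyBinary's constant field values.
    static final int SCAN_LENGTH = 1000;
    static final int THRESHOLD = 3;

    /**
     * Returns true if more than {@code threshold} zero bytes occur among
     * the first {@code scanLength} bytes of {@code content}.
     * A null array causes a NullPointerException.
     */
    static boolean looksBinary(byte[] content, int scanLength, int threshold) {
        int limit = Math.min(content.length, scanLength);
        int zeroes = 0;
        for (int i = 0; i < limit; i++) {
            // Stop as soon as the count exceeds the threshold.
            if (content[i] == 0 && ++zeroes > threshold) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Non-ASCII text encoded in UTF-8 contains no zero bytes.
        byte[] text = "Universit\u00e0 \u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        // A freshly allocated array is filled with zero bytes.
        byte[] blob = new byte[64];

        System.out.println(looksBinary(text, SCAN_LENGTH, THRESHOLD)); // prints "false"
        System.out.println(looksBinary(blob, SCAN_LENGTH, THRESHOLD)); // prints "true"
    }
}
```

      With these assumed values, a page that starts with one or two stray NUL bytes is still treated as text, since the count must exceed the threshold before the method returns true.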
    • valueOf

      public static IsProbablyBinary valueOf(String emptySpec)
    • toString

      public String toString()
      Overrides:
      toString in class Object