Interface Parser

All Superinterfaces:
Cloneable, Filter<Response>, com.google.common.base.Predicate<Response>, Predicate<Response>
All Known Implementing Classes:
BinaryParser, HTMLParser

public interface Parser
extends Filter<Response>, Cloneable
A generic parser for responses. It provides link extraction through a Parser.LinkReceiver callback and optional digesting.
  • Nested Class Summary

    Nested Classes
    Modifier and Type Interface Description
    static interface  Parser.LinkReceiver
    A receiver for URLs discovered during parsing.
  • Field Summary

    Fields
    Modifier and Type Field Description
    static Parser.LinkReceiver NULL_LINK_RECEIVER
    A no-op implementation of Parser.LinkReceiver.

    Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter

    FILTER_PACKAGE_NAME
  • Method Summary

    Modifier and Type Method Description
    String guessedCharset()
    Returns a guessed charset for the document, or null if the charset could not be guessed.
    byte[] parse(Response response, Parser.LinkReceiver linkReceiver)
    Parses a response.

    Methods inherited from interface com.google.common.base.Predicate

    apply, equals, test

    Methods inherited from interface java.util.function.Predicate

    and, negate, or
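  • Field Details

    • NULL_LINK_RECEIVER

      static final Parser.LinkReceiver NULL_LINK_RECEIVER
      A no-op implementation of Parser.LinkReceiver.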

  • Method Details

    • parse

      byte[] parse(Response response, Parser.LinkReceiver linkReceiver) throws IOException
      Parses a response.
      Parameters:
      response - the response to parse.
      linkReceiver - the link receiver that is passed the URLs discovered during parsing.
      Returns:
      a byte digest for the page, or null if no digest has been computed.
      Throws:
      IOException
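
      The callback contract of parse can be illustrated with a minimal, self-contained sketch. It does not use the library's types: LinkReceiver, SimpleParser, HrefParser and NULL_RECEIVER below are hypothetical stand-ins (the methods of the real Parser.LinkReceiver are not listed on this page), the input is a raw byte array rather than a Response, and MD5 is an arbitrary digest chosen for the example. The sketch only shows the shape of the interaction: the parser pushes each discovered URL to the receiver, and the caller must be prepared for a null digest.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParseSketch {
    /** Hypothetical stand-in for Parser.LinkReceiver (assumed single callback method). */
    interface LinkReceiver {
        void link(URI uri);
    }

    /** Hypothetical stand-in for Parser.parse, taking raw bytes instead of a Response. */
    interface SimpleParser {
        byte[] parse(byte[] content, LinkReceiver linkReceiver) throws IOException;
    }

    /** A receiver that ignores every link, similar in spirit to NULL_LINK_RECEIVER. */
    static final LinkReceiver NULL_RECEIVER = uri -> { };

    /** Toy parser: reports each href="..." value and returns an MD5 digest of the content. */
    static final class HrefParser implements SimpleParser {
        private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

        @Override
        public byte[] parse(byte[] content, LinkReceiver linkReceiver) throws IOException {
            String text = new String(content, StandardCharsets.ISO_8859_1);
            Matcher matcher = HREF.matcher(text);
            while (matcher.find()) {
                try {
                    linkReceiver.link(URI.create(matcher.group(1)));
                } catch (IllegalArgumentException e) {
                    // Malformed URL: skip it and keep parsing.
                }
            }
            try {
                return MessageDigest.getInstance("MD5").digest(content);
            } catch (NoSuchAlgorithmException e) {
                throw new IOException(e);
            }
        }
    }

    /** Collects every link the parser reports; wraps IOException for convenience. */
    static List<URI> extractLinks(SimpleParser parser, byte[] content) {
        List<URI> links = new ArrayList<>();
        try {
            parser.parse(content, links::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<a href=\"http://example.com/a\">a</a>".getBytes(StandardCharsets.ISO_8859_1);
        SimpleParser parser = new HrefParser();

        System.out.println("Links: " + extractLinks(parser, page));

        // When only the digest matters, pass the no-op receiver.
        byte[] digest = parser.parse(page, NULL_RECEIVER);
        if (digest == null) {
            System.out.println("No digest was computed.");
        } else {
            System.out.println("Digest length in bytes: " + digest.length);
        }
    }
}
```

      With the real interface, the same pattern would presumably apply: pass Parser.NULL_LINK_RECEIVER when links are not needed, and check the returned digest for null before using it.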
    • guessedCharset

      String guessedCharset()
      Returns a guessed charset for the document, or null if the charset could not be guessed.
      Returns:
      a charset or null.
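
      Because guessedCharset may return null, callers need a fallback before decoding. The sketch below is one possible way to handle this, assuming the returned string is a charset name that java.nio.charset.Charset can look up; the class CharsetFallback and its resolve method are hypothetical helpers, not part of the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetFallback {
    /**
     * Turns a possibly-null guessed charset name into a Charset,
     * returning the fallback when the guess is missing or unusable.
     */
    static Charset resolve(String guessedCharset, Charset fallback) {
        if (guessedCharset == null) {
            return fallback; // The parser could not guess a charset.
        }
        try {
            return Charset.forName(guessedCharset);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // The name is malformed or unknown to this JVM.
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, StandardCharsets.UTF_8));
        System.out.println(resolve("ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

      A caller would pass the result of guessedCharset() as the first argument, for example resolve(parser.guessedCharset(), StandardCharsets.UTF_8), where the choice of UTF-8 as the default is itself an assumption.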