Class HTMLParser

java.lang.Object
it.unimi.dsi.law.warc.parser.HTMLParser
All Implemented Interfaces:
com.google.common.base.Predicate<Response>, Filter<Response>, Parser, Cloneable, Predicate<Response>

public class HTMLParser
extends Object
implements Parser
An HTML parser with additional responsibilities (such as guessing the character encoding and resolving relative URLs).

An instance of this class contains buffers and classes that make it possible to parse an HttpResponse quickly. Instances are heavyweight: they should be pooled and shared, since their usage is transitory and CPU-intensive.
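
The pooling advice above can be sketched as follows. This is a minimal illustration, not part of the library: `ParserPool`, `ParserTask`, `withParser` and `idleCount` are hypothetical names, and the pool is generic so that the sketch compiles without the library on the classpath.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Hypothetical callback that may throw, since parsing can raise IOException. */
interface ParserTask<P, R> {
    R run(P parser) throws Exception;
}

/** Hypothetical fixed-size pool of heavyweight parser instances. */
final class ParserPool<P> {
    private final BlockingQueue<P> idle;

    ParserPool(int size, Supplier<P> factory) {
        idle = new ArrayBlockingQueue<>(size);
        // Pay the construction cost (buffers, helper objects) once, up front.
        for (int i = 0; i < size; i++) idle.add(factory.get());
    }

    /** Borrows a parser, runs the task, and always hands the parser back. */
    <R> R withParser(ParserTask<P, R> task) throws Exception {
        P parser = idle.take(); // blocks until an instance is free
        try {
            return task.run(parser);
        } finally {
            idle.add(parser); // never fails: the pool had room for this instance
        }
    }

    /** Number of instances currently not borrowed. */
    int idleCount() {
        return idle.size();
    }
}
```

Assuming the library is available, such a pool could be created with `new ParserPool<>(4, HTMLParser::new)`, and each worker thread would call `withParser(p -> p.parse(response, receiver))`. The pool size of 4 is arbitrary.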

  • Nested Class Summary

    Nested Classes
    Modifier and Type Class Description
    static class  HTMLParser.SetLinkReceiver  

    Nested classes/interfaces inherited from interface it.unimi.dsi.law.warc.parser.Parser

    Parser.LinkReceiver
  • Field Summary

    Fields
    Modifier and Type Field Description
    char[] buffer
    The character buffer.
    static int CHAR_BUFFER_SIZE
    The size of the internal Jericho buffer.

    Fields inherited from interface it.unimi.dsi.law.warc.filters.Filter

    FILTER_PACKAGE_NAME

    Fields inherited from interface it.unimi.dsi.law.warc.parser.Parser

    NULL_LINK_RECEIVER
  • Constructor Summary

    Constructors
    Constructor Description
    HTMLParser()
    Builds a parser for link extraction.
    HTMLParser​(String messageDigest)
    Builds a parser for link extraction and, possibly, digesting a page.
    HTMLParser​(MessageDigest messageDigest)
    Builds a parser for link extraction and, possibly, digesting a page.
  • Method Summary

    Modifier and Type Method Description
    boolean apply​(Response response)  
    Object clone()  
    static String getCharsetName​(byte[] buffer, int length)
    Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters.
    static String getCharsetNameFromHeader​(String headerValue)
    Extracts the charset name from the header value of a content-type header.
    String guessedCharset()
    Returns a guessed charset for the document, or null if the charset could not be guessed.
    URI location()
    Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned.
    byte[] parse​(Response response, Parser.LinkReceiver linkReceiver)
    Parses a response.

    Methods inherited from class java.lang.Object

    equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface com.google.common.base.Predicate

    equals, test

    Methods inherited from interface java.util.function.Predicate

    and, negate, or
  • Field Details

    • CHAR_BUFFER_SIZE

      public static final int CHAR_BUFFER_SIZE
      The size of the internal Jericho buffer.
      See Also:
      Constant Field Values
    • buffer

      public final char[] buffer
      The character buffer. It is set up at construction time, but it can be changed later.
  • Constructor Details

    • HTMLParser

      public HTMLParser​(MessageDigest messageDigest)
      Builds a parser for link extraction and, possibly, digesting a page.
      Parameters:
      messageDigest - the digesting algorithm, or null if no digesting will be performed.
    • HTMLParser

      public HTMLParser​(String messageDigest) throws NoSuchAlgorithmException
      Builds a parser for link extraction and, possibly, digesting a page.
      Parameters:
      messageDigest - the digesting algorithm (as a string).
      Throws:
      NoSuchAlgorithmException
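
As a rough illustration of where a NoSuchAlgorithmException can come from, the sketch below resolves an algorithm name with the standard `java.security.MessageDigest.getInstance` call. Whether this constructor resolves the name in exactly this way is an assumption; `DigestNameSketch` and `digestOrNull` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestNameSketch {
    /** Returns a digest for the given algorithm name, or null if no provider knows it. */
    static MessageDigest digestOrNull(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            return null; // unknown name: the caller decides how to proceed
        }
    }

    public static void main(String[] args) {
        // MD5 is one of the algorithms every Java platform must provide.
        MessageDigest md5 = digestOrNull("MD5");
        byte[] digest = md5.digest("<html></html>".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(digest.length);                // prints 16
        System.out.println(digestOrNull("NOT-A-DIGEST")); // prints null
    }
}
```

Validating the name before constructing parsers keeps the checked exception out of the code that builds a pool of instances.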
    • HTMLParser

      public HTMLParser()
      Builds a parser for link extraction.
  • Method Details

    • parse

      public byte[] parse​(Response response, Parser.LinkReceiver linkReceiver) throws IOException
      Description copied from interface: Parser
      Parses a response.
      Specified by:
      parse in interface Parser
      Parameters:
      response - a response to parse.
      linkReceiver - a link receiver.
      Returns:
      a byte digest for the page, or null if no digest has been computed.
      Throws:
      IOException
    • guessedCharset

      public String guessedCharset()
      Description copied from interface: Parser
      Returns a guessed charset for the document, or null if the charset could not be guessed.
      Specified by:
      guessedCharset in interface Parser
      Returns:
      a charset or null.
    • location

      public URI location()
      Returns the BURL location header, if present; if it is not present, but the page contains a valid metalocation, the latter is returned. Otherwise, null is returned.
      Returns:
      the location (or metalocation), if present; null otherwise.
    • getCharsetName

      public static String getCharsetName​(byte[] buffer, int length)
      Returns the charset name as indicated by a META HTTP-EQUIV element, if present, interpreting the provided byte array as a sequence of ISO-8859-1-encoded characters. Only the first such occurrence is considered (even if it might not correspond to a valid or available charset).

      Beware: this method might not work if the value of some attribute in a meta tag contains a string matching (case-insensitively) the regular expression http-equiv\s*=\s*('|")content-type('|") or content\s*=\s*('|")[^"']*('|").

      Parameters:
      buffer - a buffer containing raw bytes that will be interpreted as ISO-8859-1 characters.
      length - the number of significant bytes in the buffer.
      Returns:
      the charset name, or null if no charset is specified; note that the charset might not be valid or available.
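
The behaviour described for getCharsetName can be approximated with the standard library alone. The sketch below is not the library's implementation: it decodes the bytes as ISO-8859-1, looks for the first META element whose attributes match the two regular expressions quoted above, and pulls the charset out of its content value. `MetaCharsetSketch`, `metaCharset` and the `META` and `CHARSET` patterns are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {
    private static final Pattern META =
        Pattern.compile("<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HTTP_EQUIV =
        Pattern.compile("http-equiv\\s*=\\s*('|\")content-type('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT =
        Pattern.compile("content\\s*=\\s*('|\")([^\"']*)('|\")", Pattern.CASE_INSENSITIVE);
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named by the first content-type META element, or null. */
    static String metaCharset(byte[] buffer, int length) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String html = new String(buffer, 0, length, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String tag = meta.group();
            if (!HTTP_EQUIV.matcher(tag).find()) continue; // some other META element
            // Only the first content-type META counts, valid charset or not.
            Matcher content = CONTENT.matcher(tag);
            if (!content.find()) return null;
            Matcher charset = CHARSET.matcher(content.group(2));
            return charset.find() ? charset.group(1) : null;
        }
        return null;
    }
}
```

Like the documented method, this sketch returns the name verbatim; it does not check that the charset exists on the running JVM.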
    • getCharsetNameFromHeader

      public static String getCharsetNameFromHeader​(String headerValue)
      Extracts the charset name from the header value of a content-type header. Warning: this method might not work if the string charset= appears inside a string within some attribute/value pair.
      Parameters:
      headerValue - The value of a content-type header.
      Returns:
      the charset name, or null if no charset is specified; note that the charset might not be valid or available.
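
Because the returned name is not guaranteed to be valid or available, callers will usually want to check it before decoding. The sketch below shows one plausible way to extract a charset from a content-type value and resolve it with a fallback. It is not the library's implementation, and `HeaderCharsetSketch`, `charsetFromHeader` and `resolveOrDefault` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HeaderCharsetSketch {
    // Accepts an optional opening quote, then reads up to whitespace, ';' or '"'.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset parameter of a content-type value, or null if absent. */
    static String charsetFromHeader(String headerValue) {
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    /** Resolves a possibly invalid or unavailable name, falling back when needed. */
    static Charset resolveOrDefault(String name, Charset fallback) {
        if (name == null) return fallback;
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback; // syntactically illegal name
        }
    }
}
```

A call such as `resolveOrDefault(charsetFromHeader("text/html; charset=utf-8"), StandardCharsets.ISO_8859_1)` would then yield a usable `Charset` in every case. The same resolution step applies to names returned by getCharsetName.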
    • apply

      public boolean apply​(Response response)
      Specified by:
      apply in interface com.google.common.base.Predicate<Response>
    • clone

      public Object clone()
      Overrides:
      clone in class Object