public class URLRespectsRobots
extends java.lang.Object

Static methods that parse robots.txt into arrays of char arrays and handle robot filtering.

Field Summary

| Modifier and Type | Field | Description |
|---|---|---|
| static char[][] | EMPTY_ROBOTS_FILTER | A singleton empty robots filter. |
| static int | MAX_TO_STRING_ROBOTS | The maximum number of robots entries returned by toString(char[][]). |
Method Summary

| Modifier and Type | Method | Description |
|---|---|---|
| static boolean | apply(char[][] robotsFilter, java.net.URI url) | Checks whether a specified URL passes a specified robots filter. |
| static void | main(java.lang.String[] arg) |  |
| static char[][] | parseRobotsReader(java.io.Reader content, java.lang.String userAgent) | Parses the argument as if it were the content of a robots.txt file, and returns a sorted array of prefixes of URLs that the agent should not follow. |
| static char[][] | parseRobotsResponse(URIResponse robotsResponse, java.lang.String userAgent) | Parses a robots.txt file contained in a FetchData and returns the corresponding filter as an array of sorted prefixes. |
| static char[][] | toSortedPrefixFreeCharArrays(java.util.Set<java.lang.String> set) |  |
| static java.lang.String | toString(char[][] robotsFilter) | Prints a robots filter gracefully, using at most 30 prefixes (MAX_TO_STRING_ROBOTS). |
Field Detail

MAX_TO_STRING_ROBOTS

public static final int MAX_TO_STRING_ROBOTS

The maximum number of robots entries returned by toString(char[][]).

EMPTY_ROBOTS_FILTER

public static final char[][] EMPTY_ROBOTS_FILTER

A singleton empty robots filter.
Method Detail

toSortedPrefixFreeCharArrays

public static char[][] toSortedPrefixFreeCharArrays(java.util.Set<java.lang.String> set)
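This method carries no description above. Judging only from its name and from the char[][] filter form used throughout the class, a call might look like the sketch below; the prefix-collapsing behavior noted in the comments is an assumption, not something the documentation states.

```java
import java.util.HashSet;
import java.util.Set;

public class PrefixFreeDemo {
    public static void main(String[] args) {
        Set<String> disallowed = new HashSet<>();
        disallowed.add("/private/");
        disallowed.add("/private/logs/"); // presumably collapsed: covered by the shorter prefix above
        disallowed.add("/tmp/");
        // Hypothetical reading of the name: the set becomes a sorted,
        // prefix-free array of char arrays, the form the other methods consume.
        char[][] filter = URLRespectsRobots.toSortedPrefixFreeCharArrays(disallowed);
        System.out.println(URLRespectsRobots.toString(filter));
    }
}
```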
parseRobotsReader

public static char[][] parseRobotsReader(java.io.Reader content,
                                         java.lang.String userAgent)
                                  throws java.io.IOException

Parses the argument as if it were the content of a robots.txt file, and returns a sorted array of prefixes of URLs that the agent should not follow.

Parameters:
content - the content of the robots.txt file.
userAgent - the string representing the user agent of interest.

Throws:
java.io.IOException
parseRobotsResponse

public static char[][] parseRobotsResponse(URIResponse robotsResponse,
                                           java.lang.String userAgent)
                                    throws java.io.IOException

Parses a robots.txt file contained in a FetchData and returns the corresponding filter as an array of sorted prefixes. HTTP statuses different from 2xx are logged. HTTP statuses of class 4xx generate an empty filter. HTTP statuses 2xx/3xx cause the tentative parsing of the request content. In the remaining cases we return null.

Parameters:
robotsResponse - the response containing robots.txt.
userAgent - the string representing the user agent of interest.

Returns:
the corresponding filter, or null.

Throws:
java.io.IOException
apply

public static boolean apply(char[][] robotsFilter,
                            java.net.URI url)

Checks whether a specified URL passes a specified robots filter.

Parameters:
robotsFilter - a robots filter.
url - a URL to check against robotsFilter.

Returns:
true if url passes robotsFilter.
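A sketch of filtering URLs with apply, building the filter from a made-up robots.txt as above; the expected true/false outcomes assume the filter matches on path prefixes, which the documentation suggests but does not spell out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;

public class ApplyDemo {
    public static void main(String[] args) throws IOException {
        final char[][] filter = URLRespectsRobots.parseRobotsReader(
            new StringReader("User-agent: *\nDisallow: /private/\n"), "MyCrawler");
        // Expected to pass: no disallowed prefix applies.
        System.out.println(URLRespectsRobots.apply(filter,
            URI.create("http://example.com/index.html")));
        // Expected to be rejected: falls under the /private/ prefix.
        System.out.println(URLRespectsRobots.apply(filter,
            URI.create("http://example.com/private/data.html")));
    }
}
```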
toString

public static java.lang.String toString(char[][] robotsFilter)

Prints a robots filter gracefully, using at most 30 prefixes (MAX_TO_STRING_ROBOTS).

Parameters:
robotsFilter - a robots filter.

main

public static void main(java.lang.String[] arg)
                 throws java.io.IOException

Throws:
java.io.IOException