public class URLRespectsRobots
extends java.lang.Object

Static methods that parse robots.txt into arrays of char arrays and handle robot filtering.

Modifier and Type | Field | Description |
---|---|---|
static char[][] | EMPTY_ROBOTS_FILTER | A singleton empty robots filter. |
static int | MAX_TO_STRING_ROBOTS | The maximum number of robots entries returned by toString(char[][]). |
Modifier and Type | Method | Description |
---|---|---|
static boolean | apply(char[][] robotsFilter, java.net.URI url) | Checks whether a specified URL passes a specified robots filter. |
static void | main(java.lang.String[] arg) | |
static char[][] | parseRobotsReader(java.io.Reader content, java.lang.String userAgent) | Parses the argument as if it were the content of a robots.txt file, and returns a sorted array of prefixes of URLs that the agent should not follow. |
static char[][] | parseRobotsResponse(URIResponse robotsResponse, java.lang.String userAgent) | Parses a robots.txt file contained in a FetchData and returns the corresponding filter as an array of sorted prefixes. |
static char[][] | toSortedPrefixFreeCharArrays(java.util.Set<java.lang.String> set) | |
static java.lang.String | toString(char[][] robotsFilter) | Prints a robots filter gracefully, using at most 30 prefixes. |
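The typical flow is to parse a robots.txt body once and then check candidate URLs against the resulting filter. The following is a minimal sketch built only from the signatures documented above; the library import (omitted, as its package depends on your build) and the "MyCrawler" agent string are placeholders:

```java
import java.io.IOException;
import java.io.StringReader;
import java.net.URI;

// URLRespectsRobots ships with the library; its import is omitted here
// because the package depends on your build.

public class RobotsFilterExample {
	public static void main(String[] args) throws IOException {
		// A minimal robots.txt disallowing two path prefixes for every agent.
		String robotsTxt =
			"User-agent: *\n" +
			"Disallow: /private/\n" +
			"Disallow: /tmp/\n";

		// Parse the content into a sorted array of forbidden prefixes.
		char[][] filter = URLRespectsRobots.parseRobotsReader(new StringReader(robotsTxt), "MyCrawler");

		// Check candidate URLs against the filter.
		System.out.println(URLRespectsRobots.apply(filter, URI.create("http://example.com/index.html")));  // expected: true (allowed)
		System.out.println(URLRespectsRobots.apply(filter, URI.create("http://example.com/private/data"))); // expected: false (disallowed)
	}
}
```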
public static final int MAX_TO_STRING_ROBOTS

The maximum number of robots entries returned by toString(char[][]).

public static final char[][] EMPTY_ROBOTS_FILTER

A singleton empty robots filter.
public static char[][] toSortedPrefixFreeCharArrays(java.util.Set<java.lang.String> set)
public static char[][] parseRobotsReader(java.io.Reader content, java.lang.String userAgent) throws java.io.IOException

Parses the argument as if it were the content of a robots.txt file, and returns a sorted array of prefixes of URLs that the agent should not follow.

Parameters:
content - the content of the robots.txt file.
userAgent - the string representing the user agent of interest.
Throws:
java.io.IOException
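Because the filter is computed for a single user agent, each agent of interest requires its own parse. A small sketch, assuming the usual robots.txt convention that a `User-agent: *` record applies to agents not matched by a more specific record (the exact matching rules belong to the parser):

```java
import java.io.IOException;
import java.io.StringReader;

// URLRespectsRobots import omitted; the package depends on your build.

public class AgentSpecificRobots {
	public static void main(String[] args) throws IOException {
		// A robots.txt with a catch-all record and an agent-specific record.
		String robotsTxt =
			"User-agent: *\n" +
			"Disallow: /search\n" +
			"\n" +
			"User-agent: GoodBot\n" +
			"Disallow: /search\n" +
			"Disallow: /drafts/\n";

		// A Reader can be consumed only once, so each parse gets a fresh StringReader.
		char[][] forGoodBot = URLRespectsRobots.parseRobotsReader(new StringReader(robotsTxt), "GoodBot");
		char[][] forOthers = URLRespectsRobots.parseRobotsReader(new StringReader(robotsTxt), "SomeOtherBot");

		// toString(char[][]) prints at most 30 prefixes per filter.
		System.out.println(URLRespectsRobots.toString(forGoodBot));
		System.out.println(URLRespectsRobots.toString(forOthers));
	}
}
```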
public static char[][] parseRobotsResponse(URIResponse robotsResponse, java.lang.String userAgent) throws java.io.IOException

Parses a robots.txt file contained in a FetchData and returns the corresponding filter as an array of sorted prefixes. HTTP statuses different from 2xx are logged. HTTP statuses of class 4xx generate an empty filter. HTTP statuses 2xx/3xx cause the tentative parsing of the request content. In the remaining cases we return null.

Parameters:
robotsResponse - the response containing robots.txt.
userAgent - the string representing the user agent of interest.
Returns:
the corresponding filter as an array of sorted prefixes, or null in the remaining cases described above.
Throws:
java.io.IOException
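A caller can thus see three outcomes: a parsed filter (2xx/3xx), an empty filter (4xx), or null (everything else). The sketch below shows one way to fold them into a single filter; URIResponse construction is omitted, the library imports depend on your build, and falling back to EMPTY_ROBOTS_FILTER on null is a policy choice of this sketch, not part of the API contract:

```java
import java.io.IOException;

// URLRespectsRobots and URIResponse both ship with the library; their
// imports are omitted here because the packages depend on your build.

public class RobotsResponseHandling {
	// Folds the three documented outcomes of parseRobotsResponse into a
	// single filter usable with apply(char[][], java.net.URI).
	static char[][] robotsFilterFor(URIResponse robotsResponse, String userAgent) throws IOException {
		char[][] filter = URLRespectsRobots.parseRobotsResponse(robotsResponse, userAgent);
		if (filter == null) {
			// 5xx and other unhandled statuses: a cautious crawler might
			// schedule a retry instead of treating everything as allowed.
			return URLRespectsRobots.EMPTY_ROBOTS_FILTER;
		}
		// A 4xx response yielded an empty filter; 2xx/3xx a tentatively parsed one.
		return filter;
	}
}
```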
public static boolean apply(char[][] robotsFilter, java.net.URI url)

Checks whether a specified URL passes a specified robots filter.

Parameters:
robotsFilter - a robots filter.
url - a URL to check against robotsFilter.
Returns:
true if url passes robotsFilter.
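A filter can also be built by hand from a set of string prefixes via toSortedPrefixFreeCharArrays(java.util.Set). The sketch below assumes the filter stores path prefixes, as Disallow lines would produce; the expected outputs are indicative only:

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// URLRespectsRobots import omitted; the package depends on your build.

public class HandBuiltFilter {
	public static void main(String[] args) {
		// "/private/subdir/" is redundant: "/private/" already covers it,
		// so the prefix-free conversion should drop it.
		Set<String> disallowed = new HashSet<>();
		disallowed.add("/private/");
		disallowed.add("/private/subdir/");
		disallowed.add("/tmp/");

		char[][] filter = URLRespectsRobots.toSortedPrefixFreeCharArrays(disallowed);

		System.out.println(URLRespectsRobots.apply(filter, URI.create("http://example.com/public/page.html"))); // expected: true
		System.out.println(URLRespectsRobots.apply(filter, URI.create("http://example.com/tmp/file")));         // expected: false
	}
}
```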
public static java.lang.String toString(char[][] robotsFilter)

Prints a robots filter gracefully, using at most 30 prefixes.

Parameters:
robotsFilter - a robots filter.

public static void main(java.lang.String[] arg) throws java.io.IOException

Throws:
java.io.IOException