public final class BURL
extends java.lang.Object
URLs in BUbiNG (and in the crawl data) are represented by instances of URI
that have passed through a number of normalizations and fixes intended
to avoid useless page duplication (for the record, java.net.URL
is known to be quite broken).
In particular, URLs that contain a scheme but not an authority (e.g., mailto:foo@example.com)
will not be considered valid.
You should always use the provided factory method to create a URI
that represents a BUbiNG URL. The method is designed to tolerate errors:
it returns null on malformed URLs rather than throwing an exception.
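The null-on-error contract and the rejection of authority-less URLs can be illustrated with plain java.net.URI. This is only a minimal sketch under stated assumptions, not BUbiNG's implementation: the class NullOnErrorSketch and the helper parseOrNull are hypothetical names, and none of the character substitution or normalization performed by this class is reproduced.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class NullOnErrorSketch {
    /** Hypothetical helper: returns null instead of throwing on unusable input. */
    static URI parseOrNull(String spec) {
        final URI uri;
        try {
            uri = new URI(spec.trim());
        } catch (URISyntaxException e) {
            return null; // malformed: signalled by null, not by an exception
        }
        // Opaque URIs such as mailto:foo@example.com have a scheme but no authority.
        if (uri.isOpaque()) return null;
        // Absolute hierarchical URIs must still carry an authority (file:/tmp/x does not).
        if (uri.isAbsolute() && uri.getRawAuthority() == null) return null;
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull("http://example.com/a?b=c"));
        System.out.println(parseOrNull("mailto:foo@example.com"));
        System.out.println(parseOrNull("http://exa mple.com/"));
    }
}
```

Callers of such a method check the result against null instead of wrapping every call in a try/catch block, which is convenient when most malformed input is simply discarded.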
URI instances generated by the factory methods of this class are guaranteed to be entirely formed by ASCII strings,
provided that you always access the raw version of the getters (e.g., URI.getRawPath()).
In particular, the Object.toString() method always returns an ASCII string.
A number of methods such as toByteArray(URI), schemeAndAuthority(byte[]),
fromNormalizedSchemeAuthorityAndPathQuery(String, byte[]), and so on make it possible to
manipulate URLs represented as byte arrays of ASCII characters, greatly reducing the memory
usage in applications (such as crawlers) storing millions of URLs.
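The idea of keeping a URL as one byte per ASCII character, and rebuilding a URI from those bytes on demand, can be sketched with the standard library alone. The class AsciiBytesSketch and its two helpers are hypothetical names chosen for illustration; they are not the methods of this class, and unlike them the conversion below checks its input.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class AsciiBytesSketch {
    /** Hypothetical helper: one byte per character, rejecting non-ASCII input. */
    static byte[] toAsciiBytes(URI url) {
        final String s = url.toString();
        final byte[] result = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            final char c = s.charAt(i);
            if (c >= 0x80) throw new IllegalArgumentException("Non-ASCII character at index " + i);
            result[i] = (byte) c;
        }
        return result;
    }

    /** Hypothetical helper: rebuilds a URI from its ASCII byte representation. */
    static URI fromAsciiBytes(byte[] bytes) {
        return URI.create(new String(bytes, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        final URI url = URI.create("http://example.com/caf%C3%A9?x=1");
        final byte[] bytes = toAsciiBytes(url);
        // The array holds exactly one byte per character of the URL.
        System.out.println(bytes.length == url.toString().length());
        System.out.println(fromAsciiBytes(bytes).equals(url));
    }
}
```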
WARNING: methods returning byte arrays from URI instances
assume that the raw-form getters of the argument all return ASCII strings, as
happens for URI instances parsed by this class. The methods blindly compute
the representation by taking the lower 7 bits of each character. Enable assertions for
this class if you suspect that the methods are being applied to non-ASCII URI instances.
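The consequence of taking the lower 7 bits without any check can be seen on a URI that was not produced by this class. The sketch below is an illustration under stated assumptions: SevenBitSketch, lower7Bits and isAscii are hypothetical names, and the masking only mirrors the behaviour described in the warning rather than the actual code.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class SevenBitSketch {
    /** Keeps only the lower 7 bits of each character, with no validation. */
    static byte[] lower7Bits(String s) {
        final byte[] result = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) result[i] = (byte) (s.charAt(i) & 0x7F);
        return result;
    }

    /** Returns true if every character of s is in the ASCII range. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) if (s.charAt(i) >= 0x80) return false;
        return true;
    }

    public static void main(String[] args) {
        // java.net.URI itself accepts non-ASCII characters, so its raw getters
        // are not guaranteed to return ASCII for arbitrary instances.
        final String rawPath = URI.create("http://example.com/caf\u00e9").getRawPath();
        System.out.println(isAscii(rawPath));
        // U+00E9 is 0xE9; masking with 0x7F gives 0x69, the letter 'i'.
        System.out.println(new String(lower7Bits(rawPath), StandardCharsets.US_ASCII));
    }
}
```

The corruption is silent: no exception is raised and the result is still a plausible-looking URL, which is why a cheap check such as isAscii (or enabled assertions) is worth having while debugging.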
Modifier and Type | Field | Description |
---|---|---|
static char[] | BAD_CHAR | A list of bad characters. |
static java.lang.String[] | BAD_CHAR_SUBSTITUTE | Substitutes for bad characters. |
static char[] | FORBIDDEN_CHARS | Characters that will cause a URI spec to be rejected. |
Modifier and Type | Method | Description |
---|---|---|
static java.net.URI | fromNormalizedByteArray(byte[] normalized) | Creates a new BUbiNG URL from a normalized ASCII string represented by a byte array. |
static java.net.URI | fromNormalizedSchemeAuthorityAndPathQuery(byte[] schemeAuthority, byte[] normalizedPathQuery) | Creates a new BUbiNG URL from a byte-array representation of a normalized scheme and authority and a byte-array representation of a normalized ASCII path and query. |
static java.net.URI | fromNormalizedSchemeAuthorityAndPathQuery(java.lang.String schemeAuthority, byte[] normalizedPathQuery) | Creates a new BUbiNG URL from a normalized ASCII string representing scheme and authority and a byte-array representation of a normalized ASCII path and query. |
static java.lang.String | hostFromSchemeAndAuthority(byte[] schemeAuthority) | Extracts the host part from a scheme+authority by removing the scheme, the user info and the port number. |
static int | lengthOfHost(byte[] url, int startOfHost) | Finds the length of the host part in a scheme+authority or URL. |
static int | memoryUsageOf(byte[] array) | Returns the memory usage associated with a byte array. |
static java.net.URI | parse(MutableString spec) | Creates a new BUbiNG URL from a mutable string specification if possible, or returns null otherwise. |
static java.net.URI | parse(java.lang.String spec) | Creates a new BUbiNG URL from a string specification if possible, or returns null otherwise. |
static java.lang.String | pathAndQuery(java.net.URI url) | Returns the concatenated raw path and raw query of a BUbiNG URL. |
static byte[] | pathAndQueryAsByteArray(byte[] url) | Extracts the path and query of an absolute BUbiNG URL in its byte-array representation. |
static byte[] | pathAndQueryAsByteArray(ByteArrayList url) | Extracts the path and query of an absolute BUbiNG URL in its byte-array representation. |
static byte[] | pathAndQueryAsByteArray(java.net.URI url) | Returns an ASCII byte-array representation of the raw path and raw query of a BUbiNG URL. |
static java.lang.String | schemeAndAuthority(byte[] url) | Extracts the scheme+authority of an absolute BUbiNG URL in its byte-array representation. |
static java.lang.String | schemeAndAuthority(java.net.URI url) | Returns the concatenated URI.getScheme() and raw authority of a BUbiNG URL. |
static byte[] | schemeAndAuthorityAsByteArray(byte[] url) | Extracts the scheme+authority of an absolute BUbiNG URL in its byte-array representation. |
static int | startOfHost(byte[] url) | Finds the start of the host part in a URL. |
static int | startOfpathAndQuery(byte[] url) | Returns the starting position (i.e., the slash) of the path+query in the given BUbiNG URL. |
static byte[] | toByteArray(java.net.URI url) | Returns an ASCII byte-array representation of a BUbiNG URL. |
static ByteArrayList | toByteArrayList(java.net.URI url, ByteArrayList list) | Writes an ASCII representation of a BUbiNG URL in a ByteArrayList. |
public static final char[] FORBIDDEN_CHARS
public static final char[] BAD_CHAR
public static final java.lang.String[] BAD_CHAR_SUBSTITUTE
public static java.net.URI parse(java.lang.String spec)
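Creates a new BUbiNG URL from a string specification if possible, or returns null otherwise.
Parameters:
spec - the string specification for a URL.
Returns:
a BUbiNG URL corresponding to spec, possibly without the fragment, or null if spec is malformed.
See Also:
parse(MutableString)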
public static java.net.URI parse(MutableString spec)
Creates a new BUbiNG URL from a mutable string specification if possible, or returns null otherwise.
The conditions for this method not returning null are as follows:
- spec, once trimmed, must not contain characters in FORBIDDEN_CHARS;
- once characters in BAD_CHAR have been substituted with the corresponding strings in BAD_CHAR_SUBSTITUTE, and percent signs not followed by two hexadecimal digits have been substituted by %25, spec must not throw an exception when made into a URI;
- the URI instance so obtained must not be opaque;
- the URI instance so obtained, if absolute, must have a non-null authority.
For efficiency, this method modifies the provided specification, and in particular it makes it loose (in the MutableString sense). Caveat emptor.
Fragments are removed (for a web crawler, fragments are just noise). Normalization
is applied for you. Scheme and host name are downcased. If the URL has no host name, it is guaranteed
that the path is non-null and non-empty (by adding a slash, if necessary). If
the host name ends with a dot, it is removed. Percent-specified special characters always have their
hexadecimal letters in upper case.
Parameters:
spec - the string specification for a URL; it can be modified by this method, and
in particular it will always be made loose.
Returns:
a BUbiNG URL corresponding to spec, possibly without the fragment, or null if spec is malformed.
See Also:
parse(String)
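Two of the rules described for parse(MutableString), the substitution of stray percent signs by %25 and the upper-casing of hexadecimal digits in valid escapes, can be sketched in isolation. This is a minimal illustration under stated assumptions: PercentFixSketch and fixPercents are hypothetical names, the code works on an immutable String rather than on a MutableString, and it is not the implementation used by this class.

```java
final class PercentFixSketch {
    /** Returns true if c is an ASCII hexadecimal digit. */
    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    /**
     * Hypothetical helper: replaces every percent sign not followed by two
     * hexadecimal digits with %25, and upper-cases the digits of valid escapes.
     */
    static String fixPercents(String spec) {
        final StringBuilder sb = new StringBuilder(spec.length() + 8);
        for (int i = 0; i < spec.length(); i++) {
            final char c = spec.charAt(i);
            if (c != '%') {
                sb.append(c);
            } else if (i + 2 < spec.length() && isHex(spec.charAt(i + 1)) && isHex(spec.charAt(i + 2))) {
                // Valid escape: keep it, with upper-case hexadecimal letters.
                sb.append('%')
                  .append(Character.toUpperCase(spec.charAt(i + 1)))
                  .append(Character.toUpperCase(spec.charAt(i + 2)));
                i += 2;
            } else {
                // Stray percent sign: escape the percent sign itself.
                sb.append("%25");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixPercents("/a%2fb%zz%")); // prints /a%2Fb%25zz%25
    }
}
```

Normalizing the case of escapes matters for a crawler because /a%2fb and /a%2Fb denote the same resource but would otherwise be stored as two distinct URLs.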
public static java.net.URI fromNormalizedByteArray(byte[] normalized)
Creates a new BUbiNG URL from a normalized ASCII string represented by a byte array.
The string represented by the argument will not go through parse(MutableString).
URI.create(String) will be used instead.
Parameters:
normalized - a normalized URI string represented by a byte array.
Throws:
java.lang.IllegalArgumentException - if normalized does not parse correctly.
public static java.net.URI fromNormalizedSchemeAuthorityAndPathQuery(java.lang.String schemeAuthority, byte[] normalizedPathQuery)
Creates a new BUbiNG URL from a normalized ASCII string representing scheme and authority and a byte-array representation of a normalized ASCII path and query.
This method is intended to combine the results of schemeAndAuthority(URI)/schemeAndAuthority(byte[])
and pathAndQueryAsByteArray(byte[])/pathAndQueryAsByteArray(URI).
Parameters:
schemeAuthority - an ASCII string representing scheme+authority.
normalizedPathQuery - the byte-array representation of a normalized ASCII path and query.
Throws:
java.lang.IllegalArgumentException - if the two parts, concatenated, do not parse correctly.
public static java.net.URI fromNormalizedSchemeAuthorityAndPathQuery(byte[] schemeAuthority, byte[] normalizedPathQuery)
Creates a new BUbiNG URL from a byte-array representation of a normalized scheme and authority and a byte-array representation of a normalized ASCII path and query.
This method is intended to combine the results of schemeAndAuthority(URI)/schemeAndAuthority(byte[])
and pathAndQueryAsByteArray(byte[])/pathAndQueryAsByteArray(URI).
Parameters:
schemeAuthority - the byte-array representation of a normalized scheme+authority.
normalizedPathQuery - the byte-array representation of a normalized ASCII path and query.
Throws:
java.lang.IllegalArgumentException - if the two parts, concatenated, do not parse correctly.
public static byte[] toByteArray(java.net.URI url)
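Returns an ASCII byte-array representation of a BUbiNG URL.
Parameters:
url - a BUbiNG URL.
Returns:
an ASCII byte-array representation of url.
public static ByteArrayList toByteArrayList(java.net.URI url, ByteArrayList list)
Writes an ASCII representation of a BUbiNG URL in a ByteArrayList.
Parameters:
url - a BUbiNG URL.
list - a byte list that will contain the byte representation.
Returns:
list.
public static byte[] pathAndQueryAsByteArray(java.net.URI url)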
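Returns an ASCII byte-array representation of the raw path and raw query of a BUbiNG URL.
Parameters:
url - a BUbiNG URL.
public static java.lang.String pathAndQuery(java.net.URI url)
Returns the concatenated raw path and raw query of a BUbiNG URL.
Parameters:
url - a BUbiNG URL.
Returns:
the concatenated raw path and raw query of url.
public static java.lang.String schemeAndAuthority(java.net.URI url)
Returns the concatenated URI.getScheme() and raw authority of a BUbiNG URL.
Parameters:
url - a BUbiNG URL.
Returns:
the concatenated URI.getScheme() and raw authority of url.
public static byte[] pathAndQueryAsByteArray(byte[] url)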
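Extracts the path and query of an absolute BUbiNG URL in its byte-array representation.
Parameters:
url - a byte-array representation of a BUbiNG URL.
public static byte[] pathAndQueryAsByteArray(ByteArrayList url)
Extracts the path and query of an absolute BUbiNG URL in its byte-array representation.
Parameters:
url - a BUbiNG URL.
public static int startOfpathAndQuery(byte[] url)
Returns the starting position (i.e., the slash) of the path+query in the given BUbiNG URL.
Parameters:
url - a byte-array representation of a BUbiNG URL; extra bytes can be present after the URL.
public static byte[] schemeAndAuthorityAsByteArray(byte[] url)
Extracts the scheme+authority of an absolute BUbiNG URL in its byte-array representation.
Parameters:
url - an absolute BUbiNG URL.
Returns:
the scheme+authority of url in byte-array representation.
public static java.lang.String hostFromSchemeAndAuthority(byte[] schemeAuthority)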
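Extracts the host part from a scheme+authority by removing the scheme, the user info and the port number.
Parameters:
schemeAuthority - a scheme+authority.
public static final int startOfHost(byte[] url)
Finds the start of the host part in a URL.
Parameters:
url - a byte-array representation of a BUbiNG URL.
public static final int lengthOfHost(byte[] url, int startOfHost)
Finds the length of the host part in a scheme+authority or URL.
Parameters:
url - a byte-array representation of a BUbiNG URL or scheme+authority; only the first length bytes are considered to be valid.
startOfHost - the starting position of the host part.
Returns:
the length of the host part of url.
public static java.lang.String schemeAndAuthority(byte[] url)
Extracts the scheme+authority of an absolute BUbiNG URL in its byte-array representation.
Parameters:
url - an absolute BUbiNG URL.
Returns:
the scheme+authority of url.
public static int memoryUsageOf(byte[] array)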
Returns the memory usage associated with a byte array.
This method is useful in establishing the memory footprint of URLs in byte-array representation.
Parameters:
array - a byte array.
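The position-finding helpers documented above (startOfHost, lengthOfHost, startOfpathAndQuery) all scan a byte-array URL for a few delimiter characters. The following sketch shows one way such scans could be written. It is not the code of this class: ByteUrlSketch and its methods are hypothetical, and the sketch assumes an absolute URL of the form scheme://authority/path with no IPv6 literal in the host.

```java
import java.nio.charset.StandardCharsets;

final class ByteUrlSketch {
    /** Index of the colon that ends the scheme (assumed to be followed by "//"). */
    private static int endOfScheme(byte[] url) {
        int i = 0;
        while (i < url.length && url[i] != ':') i++;
        return i;
    }

    /** Position of the first slash after the authority, or url.length if absent. */
    static int startOfPathAndQuery(byte[] url) {
        int i = endOfScheme(url) + 3; // skip "://"
        while (i < url.length && url[i] != '/') i++;
        return Math.min(i, url.length);
    }

    /** Position of the first byte of the host, after any user info. */
    static int startOfHost(byte[] url) {
        final int start = endOfScheme(url) + 3;
        final int end = startOfPathAndQuery(url);
        for (int j = start; j < end; j++) if (url[j] == '@') return j + 1;
        return start;
    }

    /** Number of host bytes, stopping at the port separator or the path. */
    static int lengthOfHost(byte[] url, int startOfHost) {
        int i = startOfHost;
        while (i < url.length && url[i] != ':' && url[i] != '/') i++;
        return i - startOfHost;
    }

    public static void main(String[] args) {
        final byte[] url = "http://user@example.com:8080/a/b?c=d".getBytes(StandardCharsets.US_ASCII);
        final int pq = startOfPathAndQuery(url);
        final int host = startOfHost(url);
        System.out.println(new String(url, 0, pq, StandardCharsets.US_ASCII));
        System.out.println(new String(url, host, lengthOfHost(url, host), StandardCharsets.US_ASCII));
        System.out.println(new String(url, pq, url.length - pq, StandardCharsets.US_ASCII));
    }
}
```

Working directly on the bytes avoids allocating a URI or a String just to compare hosts, which is the kind of saving that matters when millions of URLs are held in memory.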