public class StartupConfiguration
extends java.lang.Object
All public fields of this class represent a configuration property: String; for integer or long fields, the configuration values are passed to an IntSizeStringParser or a LongSizeStringParser, respectively. Thus, you can use any size specification allowed by those parsers, such as K (10³), Ki (2¹⁰), M (10⁶), Mi (2²⁰), etc.
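To make the suffix convention concrete, here is a minimal sketch of how such a size specification could be interpreted. It is not the code of IntSizeStringParser or LongSizeStringParser: the class SizeSpecSketch and its parseSize method are hypothetical, and only the four suffixes named above are handled.

```java
import java.util.Map;

/** Hypothetical stand-in for a size-string parser; NOT the real IntSizeStringParser/LongSizeStringParser. */
final class SizeSpecSketch {
    // Only the suffixes named in the text; the real parsers may accept more.
    private static final Map<String, Long> MULTIPLIERS = Map.of(
        "K", 1_000L,        // 10^3
        "Ki", 1L << 10,     // 2^10
        "M", 1_000_000L,    // 10^6
        "Mi", 1L << 20);    // 2^20

    /** Parses strings such as "64Ki" or "2M"; a bare number is returned unchanged. */
    static long parseSize(String spec) {
        String s = spec.trim();
        int split = 0;
        while (split < s.length() && Character.isDigit(s.charAt(split))) split++;
        if (split == 0) throw new NumberFormatException("No digits in size specification: " + spec);
        long value = Long.parseLong(s.substring(0, split));
        String suffix = s.substring(split);
        if (suffix.isEmpty()) return value;
        Long multiplier = MULTIPLIERS.get(suffix);
        if (multiplier == null) throw new NumberFormatException("Unknown size suffix: " + suffix);
        // Fails loudly on overflow instead of wrapping around silently.
        return Math.multiplyExact(value, multiplier);
    }
}
```

Under this reading, a property value such as 64Ki for a byte-size field would denote 65536 bytes, whereas 64K would denote 64000.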
Note that all fields are mandatory unless marked with a StartupConfiguration.OptionalSpecification annotation: if a non-annotated field is missing, or if it is of the wrong type, or if it fails to satisfy the extra condition required for that field, a ConfigurationException will be thrown. The same happens if one specifies an unknown field.
Fields marked with a StartupConfiguration.ManyValuesSpecification annotation can be specified multiple times. As usual, you can also enumerate several values separated by a comma.
If a field named xxYyyZzzz requires some additional check (e.g., that its value lies in a certain range), a method named checkXxYyyZzzz with signature
public void checkXxYyyZzzz() throws ConfigurationException
should be added to this class: this method should throw an exception if the field fails to satisfy the required condition; it may also perform some logging.
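The naming convention for check methods can be illustrated with a small sketch. Everything here is hypothetical: the field maxRetries and its range are invented for illustration, and SketchConfigurationException merely stands in for the library's ConfigurationException so that the example compiles on its own.

```java
/** Hypothetical stand-in for the library's ConfigurationException (checked, as the signature suggests). */
class SketchConfigurationException extends Exception {
    SketchConfigurationException(String message) { super(message); }
}

/** A cut-down, hypothetical configuration class with a single checked field. */
class CheckedConfigSketch {
    /** Invented field; it does not exist in StartupConfiguration. */
    public int maxRetries;

    /** Named check + capitalized field name, so it can be found by reflection. */
    public void checkMaxRetries() throws SketchConfigurationException {
        // The range 0..10 is an assumption chosen only for this example.
        if (maxRetries < 0 || maxRetries > 10)
            throw new SketchConfigurationException(
                "maxRetries must be between 0 and 10, got " + maxRetries);
    }
}
```

A constructor that populates fields by reflection could then look up a method named check followed by the capitalized field name and invoke it once the field has been set.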
A configuration is constructed by reflection using one of the class constructors. The StartupConfiguration(Configuration) constructor takes a Configuration and sets the fields by reflection (field names are used to look up properties in the given configuration, and the values found are used to initialize the corresponding fields).
A number of convenience constructors are provided that read the configuration from a file (provided as a File or as a filename) in the classic properties format.
Moreover, some additional constructors are provided that read the configuration from a file and possibly override some of its properties; see for example StartupConfiguration(String, Configuration).
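The reflective population described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: ReflectiveConfigSketch is a hypothetical two-field class, java.util.Properties replaces the Configuration object, unchecked exceptions replace ConfigurationException, and annotations, size parsing and check methods are left out.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical two-field configuration populated by reflection from a Properties object. */
final class ReflectiveConfigSketch {
    public String name;
    public int weight;

    ReflectiveConfigSketch(Properties props) {
        // Index the public fields by name: each one is a configuration property.
        Map<String, Field> fields = new HashMap<>();
        for (Field f : getClass().getFields()) fields.put(f.getName(), f);

        // An unknown property is an error, mirroring the documented behaviour.
        for (String key : props.stringPropertyNames())
            if (!fields.containsKey(key))
                throw new IllegalArgumentException("Unknown property: " + key);

        // Every field is treated as mandatory in this sketch.
        for (Field f : fields.values()) {
            String value = props.getProperty(f.getName());
            if (value == null)
                throw new IllegalArgumentException("Missing mandatory property: " + f.getName());
            try {
                // Only int and String fields exist here; a wrong type surfaces as NumberFormatException.
                if (f.getType() == int.class) f.setInt(this, Integer.parseInt(value.trim()));
                else f.set(this, value);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}
```

With this sketch, a properties file containing the lines name=agent-1 and weight=3 would yield an object whose name and weight fields hold those values, while any additional, unrecognized key would be rejected.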
Modifier and Type | Class | Description |
---|---|---|
static interface | StartupConfiguration.DnsResolverSpecification | A marker for the DnsResolver class specification. |
static interface | StartupConfiguration.FilterSpecification | A marker for filter specifications. |
static interface | StartupConfiguration.ManyValuesSpecification | A marker for specifications that may have multiple values. |
static interface | StartupConfiguration.OptionalSpecification | A marker for optional specifications with a default parameter. |
static interface | StartupConfiguration.StoreSpecification | A marker for the Store class specification. |
static interface | StartupConfiguration.TimeSpecification | A marker for time specifications; such specifications are by default in milliseconds, but it is possible to use suffixes ms (milliseconds), s (seconds), m (minutes), h (hours) and d (days). |
Modifier and Type | Field | Description |
---|---|---|
boolean | acceptAllCertificates | Whether to accept all SSL certificates, or self-signed only. |
java.lang.String[] | blackListedHosts | A host that should be blacklisted (i.e., not crawled). |
java.lang.String[] | blackListedIPv4Addresses | An IPv4 address that should be blacklisted (i.e., not crawled). |
double | bloomFilterPrecision | The precision of the Bloom filter used for duplicate detection (usually, at least 1/maxUrls). |
int | connectionTimeout | The socket connection timeout in milliseconds. |
int | cookieMaxByteSize | The maximum overall size of (the external form of) the cookies accepted from a single host. |
java.lang.String | cookiePolicy | The cookie policy to be used. |
boolean | crawlIsNew | Whether this is a new crawl. |
java.lang.String | digestAlgorithm | The algorithm used for digesting pages (for duplicate filtering). |
int | dnsCacheMaxSize | Maximum number of entries cached by the DNS resolver when using DnsJavaResolver. |
long | dnsNegativeTtl | Expiration time for negative DNS answers when using DnsJavaResolver. |
long | dnsPositiveTtl | Expiration time for positive DNS answers when using DnsJavaResolver. |
java.lang.Class<? extends org.apache.http.conn.DnsResolver> | dnsResolverClass | A DnsResolver. |
int | dnsThreads | The number of DNS threads (usually a few dozen, depending on the server). |
int | fetchDataBufferByteSize | Size of the buffer for InspectableFileCachedInputStream instances, in bytes. |
Filter<java.net.URI> | fetchFilter | A filter that will be applied to all ready URLs to decide whether to fetch them. |
int | fetchingThreads | The number of fetching threads (hundreds or even thousands). |
Filter<URIResponse> | followFilter | A filter that will be applied to all parsed resources to decide whether to follow their links. |
java.lang.String | frontierDir | A directory for storing files (mainly queues managed by ByteArrayDiskQueue) related to the frontier. |
java.lang.String | group | The group of this agent; all agents belonging to the same group will coordinate their crawling activity. |
long | ipDelay | The minimum delay between two consecutive fetches from the same IP address. |
double | ipDelayFactor | An attenuation factor for the multiple-agent IP delay mechanism. |
long | keepAliveTime | If zero, connections are closed at each downloaded resource. |
long | maxUrls | The maximum number of URLs to crawl. |
int | maxUrlsPerSchemeAuthority | The maximum number of URLs we shall download from each scheme+authority. |
java.lang.String | name | The name of this agent; it must be unique within its group. |
Filter<URIResponse> | parseFilter | A filter that will be applied to all fetched resources to decide whether to parse them. |
java.lang.String[] | parserSpec | A Parser specification that will be parsed using an ObjectParser. |
int | parsingThreads | The number of parsing threads (usually, the number of available cores). |
java.lang.String | proxyHost | The proxy host, if a proxy should be used; an empty value means that the proxy should not be set. |
int | proxyPort | The proxy port, meaningful only if proxyHost is not empty. |
int | responseBodyMaxByteSize | The maximum size (in bytes) of a response body. |
java.lang.String | responseCacheDir | A directory where the content overflowing the in-memory buffers of FetchData instances (of fetchDataBufferByteSize bytes) will be stored using an InspectableFileCachedInputStream. |
long | robotsExpiration | The delay after which the robots.txt file is no longer considered valid. |
java.lang.String | rootDir | A root directory from which the remaining ones will be stemmed, if they are relative. |
Filter<Link> | scheduleFilter | A filter that will be applied to all links obtained by parsing a page before scheduling them. |
long | schemeAuthorityDelay | The minimum delay between two consecutive fetches from the same scheme+authority. |
java.lang.String[] | seed | A URL from which BUbiNG will start crawling. |
int | sieveAuxFileIOBufferByteSize | The I/O buffer used to write the auxiliary file (containing URLs) and to read it back during flushes. |
java.lang.String | sieveDir | A directory for storing files related to the sieve. |
int | sieveSize | The number of slots in the sieve. |
int | sieveStoreIOBufferByteSize | The size of the two buffers used to read the 64-bit hashes stored by the sieve during flushes. |
int | socketTimeout | The socket timeout. |
int | spamDetectionPeriodicity | The number of pages per scheme+authority after which spam detection is performed again periodically. |
int | spamDetectionThreshold | The number of pages per scheme+authority after which spam detection is performed. |
java.lang.String | spamDetectorUri | An optional SpamDetector; this URI should point to a serialized instance. |
boolean | startPaused | Whether we should start in paused state. |
java.lang.Class<? extends Store> | storeClass | The class used to Store the resources. |
java.lang.String | storeDir | A directory where the retrieved content will be written. |
Filter<URIResponse> | storeFilter | A filter that will be applied to all fetched resources to decide whether to store them. |
long | urlCacheMaxByteSize | The maximum size of the URL cache in bytes. |
java.lang.String | userAgent | The User Agent header used for HTTP requests. |
java.lang.String | userAgentFrom | The From header used for HTTP requests. |
long | virtualizerMaxByteSize | The maximum size of the virtualizer in bytes; this field is ignored if the virtualizer does not need to be sized. |
int | weight | The weight of this agent; agents are assigned a part of the crawl that is proportional to their weight. |
long | workbenchMaxByteSize | The maximum size of the workbench in bytes. |
Constructor | Description |
---|---|
StartupConfiguration(java.io.File file) | Creates a configuration starting from a given file. |
StartupConfiguration(java.io.File file, Configuration additionalProp) | Creates a configuration starting from a given file and possibly adding and/or overriding some properties with new values. |
StartupConfiguration(java.lang.String fileName) | Creates a configuration starting from a given file. |
StartupConfiguration(java.lang.String fileName, Configuration additionalProp) | Creates a configuration starting from a given file and possibly adding and/or overriding some properties with new values. |
StartupConfiguration(Configuration configuration) | Populates the object fields starting from the given configuration. |
Modifier and Type | Method | Description |
---|---|---|
static long | parseTime(java.lang.String timeSpec) | |
static java.io.File | subDir(java.lang.String parent, java.lang.String child) | Returns a File object representing a child relative to a parent, or just the child, if absolute. |
java.lang.String | toString() | |
public java.lang.String name
public java.lang.String group
public int weight
public int maxUrlsPerSchemeAuthority
public int fetchingThreads
public int parsingThreads
public int dnsThreads
public Filter<java.net.URI> fetchFilter
public Filter<Link> scheduleFilter
public Filter<URIResponse> parseFilter
public Filter<URIResponse> followFilter
public Filter<URIResponse> storeFilter
public long keepAliveTime
public long schemeAuthorityDelay
public long ipDelay
public double ipDelayFactor
BUbiNG uses a simple model to predict how many agents are accessing the same IP: if an IP address has k > 1 associated hosts, BUbiNG predicts that a fraction k/(k + 1) of the known agents are accessing that IP address. When ipDelayFactor = 0, the ipDelay is used as such. When ipDelayFactor = 1, the IP delay is multiplied by k/(k + 1) (i.e., if there are two hosts with the same IP we predict that half of the agents are crawling the same IP). In general, the IP delay is multiplied by ipDelayFactor · k/(k + 1) · Agent.getKnownCount(), so this parameter can be used to tune this behaviour. Note that in any case BUbiNG will wait at least ipDelay.
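As a worked illustration, the general formula can be taken at face value and combined with the "at least ipDelay" guarantee. This is a sketch, not BUbiNG's code: the class and method names are hypothetical, knownAgents stands in for Agent.getKnownCount(), the use of a maximum to enforce the lower bound is an assumption, and so is the behaviour for k ≤ 1, which the text does not specify.

```java
/** Hypothetical helper evaluating the documented IP-delay formula; not BUbiNG's implementation. */
final class IpDelaySketch {
    /**
     * @param ipDelay       the base delay between fetches from the same IP
     * @param ipDelayFactor the attenuation factor
     * @param hostsOnIp     k, the number of hosts associated with the IP address
     * @param knownAgents   stand-in for Agent.getKnownCount()
     */
    static long effectiveIpDelay(long ipDelay, double ipDelayFactor, int hostsOnIp, int knownAgents) {
        // Assumption: the model applies only when k > 1; otherwise the base delay is used.
        if (hostsOnIp <= 1) return ipDelay;
        double k = hostsOnIp;
        // General multiplier from the text: ipDelayFactor * k/(k + 1) * known agents.
        double multiplier = ipDelayFactor * k / (k + 1) * knownAgents;
        long scaled = Math.round(ipDelay * multiplier);
        // Assumption: "will wait at least ipDelay" is enforced by taking the maximum.
        return Math.max(ipDelay, scaled);
    }
}
```

For example, with ipDelay = 1000, ipDelayFactor = 1, k = 3 and four known agents, the multiplier is 1 · 3/4 · 4 = 3, so the sketch returns 3000; with ipDelayFactor = 0 it returns the base delay of 1000.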
public long maxUrls
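public double bloomFilterPrecision
The precision of the Bloom filter used for duplicate detection (usually, at least 1/maxUrls).
public java.lang.String[] seed
A URL from which BUbiNG will start crawling. If it starts with file:, it is assumed to point to an ASCII file containing on each line a seed URL.
public java.lang.String[] blackListedIPv4Addresses
An IPv4 address that should be blacklisted (i.e., not crawled). If it starts with file:, it is assumed to point to an ASCII file containing on each line a blacklisted IPv4 address.
public java.lang.String[] blackListedHosts
A host that should be blacklisted (i.e., not crawled). If it starts with file:, it is assumed to point to an ASCII file containing on each line a blacklisted host.
public int socketTimeout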
public int connectionTimeout
public int fetchDataBufferByteSize
Size of the buffer for InspectableFileCachedInputStream instances, in bytes. Each fetching thread holds such a buffer.
public java.lang.String proxyHost
public int proxyPort
The proxy port, meaningful only if proxyHost is not empty.
public java.lang.String cookiePolicy
The cookie policy to be used; see CookieSpecs.
public int cookieMaxByteSize
public java.lang.String userAgent
public java.lang.String userAgentFrom
The From header used for HTTP requests. [...] From header will be emitted.
public long robotsExpiration
The delay after which the robots.txt file is no longer considered valid.
public boolean acceptAllCertificates
public java.lang.String rootDir
public java.lang.String storeDir
public java.lang.String responseCacheDir
A directory where the content overflowing the in-memory buffers of FetchData instances (of fetchDataBufferByteSize bytes) will be stored using an InspectableFileCachedInputStream. It must not exist.
public java.lang.String sieveDir
public java.lang.String frontierDir
A directory for storing files (mainly queues managed by ByteArrayDiskQueue) related to the frontier. It must not exist.
public int responseBodyMaxByteSize
public java.lang.String digestAlgorithm
public java.lang.String[] parserSpec
A Parser specification that will be parsed using an ObjectParser.
public boolean startPaused
public java.lang.Class<? extends Store> storeClass
The class used to Store the resources.
public long workbenchMaxByteSize
public long virtualizerMaxByteSize
public long urlCacheMaxByteSize
public int sieveSize
public int sieveStoreIOBufferByteSize
The size of the two buffers used to read the 64-bit hashes stored by the sieve during flushes; see ByteBuffer.allocateDirect(int).
public int sieveAuxFileIOBufferByteSize
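public java.lang.Class<? extends org.apache.http.conn.DnsResolver> dnsResolverClass
A DnsResolver.
See Also:
it.unimi.di.law.bubing.frontier.dns
public int dnsCacheMaxSize
Maximum number of entries cached by the DNS resolver when using DnsJavaResolver.
public long dnsPositiveTtl
Expiration time for positive DNS answers when using DnsJavaResolver.
public long dnsNegativeTtl
Expiration time for negative DNS answers when using DnsJavaResolver.
public boolean crawlIsNew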
public java.lang.String spamDetectorUri
An optional SpamDetector; this URI should point to a serialized instance.
public int spamDetectionThreshold
public int spamDetectionPeriodicity
The number of pages per scheme+authority after which spam detection is performed again periodically. If set to Integer.MAX_VALUE, spam detection is performed only after spamDetectionThreshold pages.
public StartupConfiguration(Configuration configuration) throws ConfigurationException, java.lang.ClassNotFoundException
Fields are parsed in the standard way (i.e., we invoke Configuration's methods such as Configuration.getDouble(String)), with the exception of integer and long fields, which are passed to an IntSizeStringParser and a LongSizeStringParser, respectively. Thus, you can use any size specification allowed by those parsers.
Parameters:
configuration - the configuration on the basis of which this object should be constructed.
Throws:
ConfigurationException - if some field is not set, or if it has the wrong type, or if it fails to satisfy the corresponding check... method.
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
public StartupConfiguration(java.io.File file, Configuration additionalProp) throws ConfigurationException, java.lang.IllegalArgumentException, java.lang.ClassNotFoundException
Parameters:
file - the file whence the base configuration should be read.
additionalProp - the set of additional properties, some of which may override the values found in the file.
Throws:
ConfigurationException - if some field is not set, or if it has the wrong type, or if it fails to satisfy the corresponding check... method.
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
public StartupConfiguration(java.lang.String fileName, Configuration additionalProp) throws ConfigurationException, java.lang.IllegalArgumentException, java.lang.ClassNotFoundException
Parameters:
fileName - the name of the file whence the base configuration should be read.
additionalProp - the set of additional properties, some of which may override the values found in the file.
Throws:
ConfigurationException - if some field is not set, or if it has the wrong type, or if it fails to satisfy the corresponding check... method.
java.lang.IllegalArgumentException
java.lang.ClassNotFoundException
public StartupConfiguration(java.io.File file) throws ConfigurationException, java.lang.ClassNotFoundException
Parameters:
file - the file whence the configuration should be read.
Throws:
ConfigurationException - if some field is not set, or if it has the wrong type, or if it fails to satisfy the corresponding check... method.
java.lang.ClassNotFoundException
public StartupConfiguration(java.lang.String fileName) throws ConfigurationException, java.lang.ClassNotFoundException
Parameters:
fileName - the name of the file whence the configuration should be read.
Throws:
ConfigurationException - if some field is not set, or if it has the wrong type, or if it fails to satisfy the corresponding check... method.
java.lang.ClassNotFoundException
public static java.io.File subDir(java.lang.String parent, java.lang.String child)
Returns a File object representing a child relative to a parent, or just the child, if absolute.
Parameters:
parent - a parent directory.
child - a (possibly absolute) child file or directory.
Returns:
child, if it is absolute; child relative to parent, if it is relative.
public java.lang.String toString()
Overrides:
toString in class java.lang.Object
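The contract documented for subDir (an absolute child is returned as is, a relative child is resolved against the parent) can be sketched with java.io.File alone. The class SubDirSketch is hypothetical and is not the library's implementation.

```java
import java.io.File;

/** Hypothetical re-statement of the documented subDir contract. */
final class SubDirSketch {
    static File subDir(String parent, String child) {
        File childFile = new File(child);
        // An absolute child ignores the parent; a relative one is resolved against it.
        return childFile.isAbsolute() ? childFile : new File(parent, child);
    }
}
```

This matches the role described for rootDir: relative directories such as storeDir or sieveDir would be resolved against the root, while absolute ones would be used unchanged.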
public static long parseTime(java.lang.String timeSpec)
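No description of parseTime survives here. Assuming it implements the convention stated for StartupConfiguration.TimeSpecification (milliseconds by default, with the suffixes ms, s, m, h and d), its behaviour could resemble the following sketch. The class TimeSpecSketch is hypothetical, and details such as whitespace handling and error reporting are guesses.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical time-specification parser following the TimeSpecification suffix convention. */
final class TimeSpecSketch {
    /** Returns the duration in milliseconds denoted by specs such as "500", "2s" or "1d". */
    static long parseTime(String timeSpec) {
        String s = timeSpec.trim();
        int split = 0;
        while (split < s.length() && Character.isDigit(s.charAt(split))) split++;
        if (split == 0) throw new IllegalArgumentException("No digits in time specification: " + timeSpec);
        long value = Long.parseLong(s.substring(0, split));
        switch (s.substring(split)) {
            case "":      // no suffix: milliseconds by default
            case "ms": return value;
            case "s":  return TimeUnit.SECONDS.toMillis(value);
            case "m":  return TimeUnit.MINUTES.toMillis(value);
            case "h":  return TimeUnit.HOURS.toMillis(value);
            case "d":  return TimeUnit.DAYS.toMillis(value);
            default:
                throw new IllegalArgumentException("Unknown time suffix in: " + timeSpec);
        }
    }
}
```

Under this reading, a field marked as a time specification and set to 2s would be stored as 2000 milliseconds.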