BUbiNG is a high-speed distributed crawler. BUbiNG agents coordinate autonomously to download data from the web.
BUbiNG requires a number of features from your system.
You can specify an alternate JGroups configuration file using the JVM property "it.unimi.di.law.bubing.jgroups.configurationFile".
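For instance (the path is just a placeholder), the property can be added to the java command line used to start the agent:

    java -Dit.unimi.di.law.bubing.jgroups.configurationFile=/etc/bubing/jgroups.xml ... it.unimi.di.law.bubing.Agent ...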
We suggest to raise the limits on the number of open files and processes, for instance by placing

    * - nofile 65536
    * - nproc 32768

as a setup in limits.d (or equivalent configuration directory).
We also suggest a number of kernel (sysctl) settings:

    # increase TCP max buffer size settable using setsockopt()
    net.core.rmem_max = 8388608
    net.core.wmem_max = 8388608
    # increase conntrack size (or remove iptables modules)
    net.netfilter.nf_conntrack_max = 1048576
    net.netfilter.nf_conntrack_buckets = 16386
    # Wait a maximum of 5 * 2 = 10 seconds in the TIME_WAIT state after a FIN, to handle
    # any remaining packets in the network.
    net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 5
    net.netfilter.nf_conntrack_tcp_timeout_time_wait = 5
    # No protection from SYN flood attack (to run proxy tests)
    net.ipv4.tcp_syncookies = 0
    # Disable packet forwarding.
    net.ipv4.ip_forward = 0
    net.ipv6.conf.all.forwarding = 0
    # Increase system file descriptor limit. Generally, set this to 64 * R, where
    # R is the amount of RAM in MB your box has (minus a buffer?)
    fs.file-max = 262144
    # Send the first keepalive after 60 seconds
    net.ipv4.tcp_keepalive_time = 60
    # Timeout broken connections faster (amount of time to wait for FIN)
    net.ipv4.tcp_fin_timeout = 30
    # Let the networking stack reuse TIME_WAIT connections when it thinks it's safe to do so
    net.ipv4.tcp_tw_reuse = 1
    # Don't log spoofed packets, source-routed packets, redirect packets
    net.ipv4.conf.all.log_martians = 0
    # Make more local ports available
    net.ipv4.ip_local_port_range = 1024 65000
The above file can be loaded with sysctl -p (see below).
We also run a local BIND name server, configured with the following options:

    allow-query { localhost; };
    recursion yes;
    /* These must be coordinated with internal DNSJava settings. */
    max-cache-size 10000;
    max-cache-ttl 3600;
    max-ncache-ttl 60;
Moreover, we run BIND with the -4 option, to avoid IPv6 resolution of each host, and with the -U 8 option, to avoid the constant logging about the maximum number of FD events.
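How these flags are passed to named depends on your distribution; as a sketch, on a Debian-like system (an assumption; the file location varies) they might be set in /etc/default/named:

    # Run named resolving over IPv4 only (-4) and with 8 UDP listeners per interface (-U 8).
    OPTIONS="-4 -U 8"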
For the partitions holding the crawl data, we suggest to use the tune2fs utility to set the journal_data_writeback option, and to mount the partition using the options noatime and data=writeback.
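A minimal sketch, assuming the crawl data lives on an ext4 partition /dev/sdb1 mounted on /crawl (both hypothetical names):

    # Allow the writeback journaling mode (takes effect at the next mount).
    tune2fs -o journal_data_writeback /dev/sdb1
    # Corresponding /etc/fstab entry using noatime and data=writeback.
    /dev/sdb1  /crawl  ext4  noatime,data=writeback  0  2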
To check that your system is properly configured, it is essential that a number of trial runs are performed using the
NamedGraphServerHttpProxy
in random mode. For instance,
    java -Xmx4G -server it.unimi.di.law.bubing.test.NamedGraphServerHttpProxy -s 100000000 -d 50 -m 3 -t 1000 -D .0001 -A1000 - &

will start the proxy using 100 million hosts, average page degree 50, average depth 3, 1000 threads and 0.01% of the pages being broken (connect or read timeouts). By starting BUbiNG with the configuration option proxyHost=localhost (the port, 8080, is understood) you can test that your memory and system settings are satisfactory without using any network bandwidth.
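For instance, assuming the proxy runs on the same machine as the agent, it is enough to add the following line to the BUbiNG configuration file described below:

    proxyHost=localhost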
BUbiNG uses a configuration file to set up its internal values. Such values are fully documented in StartupConfiguration; the names of the keys are derived by reflection from the field names (for instance, the key fetchingThreads corresponds to the field of the same name). Default values are given by the StartupConfiguration.OptionalSpecification annotation. This is an example configuration file:
    rootDir=crawl-eu
    maxUrlsPerSchemeAuthority=1000
    parsingThreads=64
    dnsThreads=50
    fetchingThreads=512
    fetchFilter=true
    scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.ad\,.al\,.at\,.be\,.bg\,.ch\,.cz\,.de\,.dk\,.ee\,.es\,.eu\,.fi\,.fo\,.fr\,.gb\,.gr\,.hr\,.hu\,.ie\,.im\,.is\,.it\,.li\,.lt\,.lu\,.lv\,.mc\,.md\,.me\,.nl\,.no\,.pl\,.pt\,.ro\,.se\,.si\,.sk\,.sm\,.uk\,.va) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
    followFilter=true
    parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
    storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
    schemeAuthorityDelay=10s
    ipDelay=2s
    maxUrls=500M
    bloomFilterPrecision=1E-8
    seed=file:/some/path/eu.seed
    socketTimeout=60s
    connectionTimeout=60s
    fetchDataBufferByteSize=200K
    cookiePolicy=compatibility
    cookieMaxByteSize=2000
    userAgent=BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
    userAgentFrom=youremail@yourdomain.com
    robotsExpiration=1h
    responseBodyMaxByteSize=2M
    digestAlgorithm=MurmurHash3
    startPaused=false
    workbenchMaxByteSize=512Mi
    urlCacheMaxByteSize=1Gi
    sieveSize=64Mi
    parserSpec=HTMLParser(MurmurHash3)
    keepAliveTime=1s
Note that almost all these properties are modifiable at runtime using the JMX interface.
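For instance, with the JMX options used in the startup command shown below (port 9999, no SSL and no authentication), any standard JMX client can connect to the agent; a quick way is:

    jconsole 192.168.0.20:9999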
There are two main activities related to the preparation of configuration files: writing filters (documented in it.unimi.di.law.warc.filters) and sizing the various components (discussed below).
First of all, we suggest to run as root the command sysctl -p /etc/sysctl.d/bubing (where the file bubing contains the settings discussed above), as in our experience some of the parameters turn out not to be correctly set after boot, even if included in /etc/sysctl.d.
A BUbiNG agent is started using the Agent
class, as in:
    java -server -Xss256K -Xms20G -XX:+UseNUMA -Djavax.net.ssl.sessionCacheSize=8192 \
        -XX:+UseTLAB -XX:+ResizeTLAB -XX:NewRatio=4 -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled \
        -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails \
        -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
        -Djava.rmi.server.hostname=192.168.0.20 \
        -Djava.net.preferIPv4Stack=true \
        -Djgroups.bind_addr=192.168.0.20 \
        -Dlogback.configurationFile=bubing-logback.xml \
        -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
        it.unimi.di.law.bubing.Agent -h 192.168.0.20 -P eu.properties -g eu agent -n 2>err >out
The JVM options are of course just an example for a 64-core machine with 64 GiB of RAM, and should be customized. It is important, however, that the thread stack size (-Xss) is set to a small value, as BUbiNG uses thousands of threads.
The IP address specified will be used for JGroups and JMX communication: if you have an internal high-speed interface, this is the place to use it; otherwise, just use the standard IP address of the machine. This crawl will use a property file named eu.properties and start an agent named agent belonging to a crawl group named eu. Logback will be configured by the file bubing-logback.xml (see below).
In the standard BUbiNG setup, agents in the same crawl group coordinate autonomously using consistent hashing, so if you want to perform a multi-agent crawl you just have to make sure that your hardware and JGroups are properly configured so the agents can work together, and give each agent a different name. A simple way to check that this is the case is to start the crawl in paused mode, check from the logs that all agents are visible to one another, and then start the crawl.
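As a sketch (the IP address 192.168.0.21 and the name agent2 are just placeholders), a second agent joining the eu group would be started on another machine with the same property file but a different agent name:

    java [same JVM options as above] it.unimi.di.law.bubing.Agent -h 192.168.0.21 -P eu.properties -g eu agent2 -n

Setting startPaused=true in the property file starts the agents in paused mode, so you can verify from the logs that they see one another before starting the actual crawl.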
Note that if you use multiple agents you have no guarantee that the politeness IP delay will be respected, as multiple agents might crawl the same IP in parallel. To reduce pressure on IP addresses with a large number of hosts, if you have many agents you can try to tune the IP delay factor.
We use Logback for logging. We suggest a separate log for messages at level WARN or above, to be able to find errors quickly. The file used in the command above could be as follows (assuming you want to write your logs in /tmp):
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <appender name="all" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/tmp/bubing.log</file>
        <encoder>
          <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
        </encoder>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
          <maxHistory>100</maxHistory>
          <FileNamePattern>/tmp/bubing.%d{yyyy-MM-dd_HH}.log</FileNamePattern>
        </rollingPolicy>
      </appender>
      <appender name="warn" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/tmp/bubing.warn</file>
        <encoder>
          <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
        </encoder>
        <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
          <level>WARN</level>
        </filter>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
          <maxHistory>100</maxHistory>
          <FileNamePattern>/tmp/bubing.%d.warn</FileNamePattern>
        </rollingPolicy>
      </appender>
      <root level="INFO">
        <appender-ref ref="all"/>
        <appender-ref ref="warn"/>
      </root>
      <logger name="net.htmlparser.jericho" level="OFF"/>
      <logger name="it.unimi.di.law.bubing" level="INFO"/>
      <logger name="it.unimi.di.law.bubing.frontier.FetchingThread$DNSJavaClientConnectionManager" level="INFO"/>
      <logger name="org.apache.http.impl.client" level="WARN"/>
      <logger name="org.apache.http.wire" level="WARN"/>
    </configuration>
BUbiNG's object handling deeply violates the weak generational hypothesis, that is, the idea that most allocated objects die young. During a crawl, dozens (or even hundreds) of millions of small objects have a lifetime of minutes or hours, and then disappear. Using the default settings of the garbage collector leads to the constant promotion of these objects to the old generation, causing frequent full collections that progressively decrease the crawl speed.
The settings suggested above radically modify the garbage collector's behavior. With -XX:NewRatio=4, one fifth of the heap is devoted to new objects, and we try to delay tenuring as much as possible by swapping objects between the survivor spaces the maximum possible number of times (15, as the counter has only four bits). Note that frequent minor collections have little impact on the crawler, as even during collections the operating system is filling socket buffers, flushing file buffers, handling memory mapping, and so on. We also require dynamic thread-local allocation buffers, as the allocation needs of, say, fetching threads and parsing threads are very different.
It is always a good idea to log garbage-collection information (as suggested above) and to examine the resulting log using a tool like chewiebug's GCViewer. Ideally, you should see frequent minor collections, and only small sets of objects promoted to the old generation.
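For instance, assuming you have downloaded chewiebug's GCViewer jar (the file name depends on the version), the log produced by -Xloggc:gc.log can be opened with:

    java -jar gcviewer.jar gc.log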
The actual values of these parameters should be chosen with care; the final judge of a correct garbage-collector configuration is the stability of the crawling speed.
Once you have decided the amount of memory that you want to use for the crawler, you must decide how to divide this memory among BUbiNG's components. Most of the memory used by BUbiNG can be divided as follows:

- The sieve: StartupConfiguration.sieveSize should be multiplied by 12 to obtain approximately the amount of memory allocated (see the worked example after this list).
- The response buffers: up to StartupConfiguration.fetchDataBufferByteSize bytes of each response are kept in main memory. The cache is transparently extended to StartupConfiguration.responseBodyMaxByteSize using a file on disk, but too much disk access will slow down the crawler.
- The Bloom filter, whose size depends on StartupConfiguration.maxUrls and StartupConfiguration.bloomFilterPrecision.
- The URL cache, whose size is set by StartupConfiguration.urlCacheMaxByteSize. Look at the hit/miss ratio in the logs to understand whether you have a large enough cache (the bigger, the better).
- Cookies (see StartupConfiguration.cookieMaxByteSize): this is the only part of the data accumulated by BUbiNG that might, over time, grow without bound.
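As a rough worked example, with the sieveSize=64Mi value of the example configuration above, this rule of thumb assigns about 64Mi × 12 = 768Mi of memory to the sieve alone.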
BUbiNG's activity requires a lot of disk I/O. Some of this I/O activity, like the flushes of the MercatorSieve, happens in stop-the-world events and requires no particular care. However, there are three activities that happen concurrently during the crawl:
- The frontier continuously writes and reads data in StartupConfiguration.frontierDir.
- Up to StartupConfiguration.fetchDataBufferByteSize bytes of a response are kept in main memory: the remaining part, up to StartupConfiguration.responseBodyMaxByteSize bytes, is dumped on a disk cache. There is one such cache per parsing thread, so the disk activity can be quite intensive. The caches are stored in StartupConfiguration.responseCacheDir.
- If you use a WarcStore, data will be continuously written in StartupConfiguration.storeDir.

Ideally, these four activities should use different disks, and, if possible, StartupConfiguration.frontierDir should live on solid-state storage.
There are two sources of information about the inner workings of BUbiNG: JMX and the logs. Most internals of BUbiNG can be modified at runtime using JMX, and looking at the logs reveals a lot of information about what's going on. We suggest regularly grepping for ERROR and checking standard error to catch problems at an early stage.
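For instance, assuming the example Logback configuration above (which writes the logs under /tmp), a periodic check could be as simple as:

    grep ERROR /tmp/bubing.log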
If you just want to pause a crawl, there is a JMX Agent.pause() operation that will suspend crawling activity until Agent.resume() is invoked. Alternatively, you can stop the crawl, which will cause the crawl state to be snapped to disk. At that point, the crawl can be restarted by launching the agent with the same configuration but without the -n option.
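For instance, the crawl started by the command above would be restarted with the same options, minus -n (the JVM options are elided here):

    java [same JVM options as above] it.unimi.di.law.bubing.Agent -h 192.168.0.20 -P eu.properties -g eu agent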
Package | Description |
---|---|
it.unimi.di.law.bubing | The main BUbiNG class, Agent, and the companion classes StartupConfiguration/RuntimeConfiguration, which describe BUbiNG's internals that can be configured or modified at runtime. |
it.unimi.di.law.bubing.frontier | A set of classes orchestrating the movement of URLs to be fetched next by a BUbiNG agent. |
it.unimi.di.law.bubing.frontier.dns | |
it.unimi.di.law.bubing.parser | The system of parsers used for analyzing HTTP responses. |
it.unimi.di.law.bubing.sieve | Several implementations of the idea of an AbstractSieve. |
it.unimi.di.law.bubing.spam | |
it.unimi.di.law.bubing.store | Implementations of the Store interface. |
it.unimi.di.law.bubing.test | |
it.unimi.di.law.bubing.tool | |
it.unimi.di.law.bubing.util | Miscellaneous utility classes, including the BUbiNG queuing system and its fast cache for URLs. |
it.unimi.di.law.warc | An implementation of the Web ARChive file format (WARC) specification. |
it.unimi.di.law.warc.filters | A comprehensive filtering system. |
it.unimi.di.law.warc.filters.parser | |
it.unimi.di.law.warc.io | I/O of WARC formatted files. |
it.unimi.di.law.warc.io.gzarc | An implementation of a (skippable) GZIP archive. |
it.unimi.di.law.warc.processors | Processors to manipulate WARC files. |
it.unimi.di.law.warc.records | WARC records. |
it.unimi.di.law.warc.tool | |
it.unimi.di.law.warc.util | Utility classes used by the it.unimi.di.law.warc package. |