Overview (BUbiNG 0.9.15)

BUbiNG is a high-speed distributed crawler. BUbiNG agents coordinate autonomously to download data from the web.

Configuring your hardware

BUbiNG requires a number of features from your system.

If you want to run multiple agents, your network must be configured so that communication is possible. The default BUbiNG JGroups configuration file (which can be found in BUbiNG's jar under the name jgroups.xml) uses multicast over UDP, so you must make sure that a multicast route is set up. Please read the JGroups documentation to find out how to set up the route and check that it works.
You can specify an alternate JGroups configuration file using the JVM property "it.unimi.di.law.bubing.jgroups.configurationFile".
Your system must be configured so that a large number of processes and files are allowed. Under Linux, we use
```
* - nofile 65536
* - nproc  32768
```
as a setup in limits.d (or equivalent configuration directory).

A number of system configurations (typically set in /etc/sysctl.d, say by creating a file /etc/sysctl.d/bubing) must be altered:

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

# increase conntrack size (or remove iptables modules)
net.netfilter.nf_conntrack_max = 1048576
net.netfilter.nf_conntrack_buckets = 16386

# Wait a maximum of 5 * 2 = 10 seconds in the TIME_WAIT state after a FIN, to handle
# any remaining packets in the network. 
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 5
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 5

# No protection from SYN flood attack (to run proxy tests)
net.ipv4.tcp_syncookies = 0

# Disable packet forwarding.
net.ipv4.ip_forward = 0
net.ipv6.conf.all.forwarding = 0

# Increase system file descriptor limit. Generally, set this to 64 * R, where
# R is the amount of RAM in MB your box has (minus a buffer?)
fs.file-max = 262144

# Send the first keepalive time after 60 seconds
net.ipv4.tcp_keepalive_time = 60

# Timeout broken connections faster (amount of time to wait for FIN)
net.ipv4.tcp_fin_timeout = 30

# Let the networking stack reuse TIME_WAIT connections when it thinks it's safe to do so
net.ipv4.tcp_tw_reuse = 1

# Don't Log Spoofed Packets, Source Routed Packets, Redirect Packets 
net.ipv4.conf.all.log_martians = 0

# Make more local ports available 
net.ipv4.ip_local_port_range = 1024 65000

The above file can be loaded with sysctl -p.
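
The comment on fs.file-max above quotes a "64 * R" rule of thumb. The following minimal sketch just spells out that arithmetic; the class and method names are hypothetical, and the rule itself is only the heuristic quoted in the comment. Under this rule, the value 262144 used in the listing corresponds to R = 4096 MB.

```java
// Sketch of the "64 * R" rule of thumb from the sysctl comment above.
// FileMaxRule and suggestedFileMax are illustrative names, not part of BUbiNG.
class FileMaxRule {
    /** Suggested fs.file-max for a machine with ramMb megabytes of RAM. */
    static long suggestedFileMax(long ramMb) {
        return 64L * ramMb;
    }

    public static void main(String[] args) {
        // Hypothetical default of 4096 MB; pass your own RAM size in MB as the first argument.
        long ramMb = args.length > 0 ? Long.parseLong(args[0]) : 4096;
        System.out.println("fs.file-max = " + suggestedFileMax(ramMb));
    }
}
```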

You must make sure that either you are using a resilient DNS server that can sustain your request rate, or you run a recursive name server locally. The latter is the best practice, because BUbiNG is going to require millions of resolutions and will cache them internally; thus, your DNS server can have a fairly small cache. Our settings for bind9 are as follows:
```
allow-query     { localhost; };
recursion yes;

/* These must be coordinated with internal DNSJava settings. */
max-cache-size 10000;
max-cache-ttl 3600;
max-ncache-ttl 60;
```
Moreover, we run bind with the -4 option to avoid IPv6 resolution of each host and with the -U 8 option to avoid the constant logging about the maximum number of FD events.
The file system should be set up for performance. If you have an ext4 partition, we suggest using the tune2fs utility to set the journal_data_writeback option, and mounting the partition with the options noatime and data=writeback.
If you intend to access SSL sites with strong security, you must download and install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files.

To check that your system is properly configured, it is essential that a number of trial runs are performed using the NamedGraphServerHttpProxy in random mode. For instance,

   java -Xmx4G -server it.unimi.di.law.bubing.test.NamedGraphServerHttpProxy -s 100000000 -d 50 -m 3 -t 1000 -D .0001 -A1000 - &

will start the proxy using 100 million hosts, average page degree 50, average depth 3, 1000 threads, and 0.01% of pages being broken (connect or read timeouts). By starting BUbiNG with the configuration option proxyHost=localhost (the port, 8080, is understood) you can test that your memory and system settings are satisfactory without using any network bandwidth.

Configuring BUbiNG

BUbiNG uses a configuration file to set up its internal values. Such values are fully documented in StartupConfiguration; the names of the keys are derived by reflection from the field names. Default values are annotated using the StartupConfiguration.OptionalSpecification annotation in StartupConfiguration. This is an example configuration file:

rootDir=crawl-eu
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=512
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.ad\,.al\,.at\,.be\,.bg\,.ch\,.cz\,.de\,.dk\,.ee\,.es\,.eu\,.fi\,.fo\,.fr\,.gb\,.gr\,.hr\,.hu\,.ie\,.im\,.is\,.it\,.li\,.lt\,.lu\,.lv\,.mc\,.md\,.me\,.nl\,.no\,.pl\,.pt\,.ro\,.se\,.si\,.sk\,.sm\,.uk\,.va) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=500M
bloomFilterPrecision=1E-8
seed=file:/some/path/eu.seed
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
userAgentFrom=youremail@yourdomain.com
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
keepAliveTime=1s

Note that almost all these properties are modifiable at runtime using the JMX interface.

There are two main activities related to the preparation of configuration files:

Choosing your filters: Every phase of BUbiNG's activity is controlled by a filter that you can specify using boolean operations and atoms from it.unimi.di.law.warc.filters.
Sizing the various components: BUbiNG will use a constant amount of memory for its activities, except for storing host information, for which it assumes that enough core memory is available. The values shown are a good choice for a 64 GiB machine.
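
To make the boolean structure of a filter such as scheduleFilter more concrete, here is a minimal sketch that mimics it with java.util.function.Predicate. This is not BUbiNG's filter API: the class FilterSketch and its helper methods are hypothetical stand-ins that only borrow the names of the atoms used in the example configuration, and the suffix lists are shortened for readability.

```java
import java.net.URI;
import java.util.function.Predicate;

// Illustrative stand-in for filter composition; NOT the it.unimi.di.law.warc.filters API.
class FilterSketch {
    static Predicate<URI> schemeEquals(String scheme) {
        return u -> scheme.equals(u.getScheme());
    }

    static Predicate<URI> hostEndsWithOneOf(String... suffixes) {
        return u -> endsWithOneOf(u.getHost(), suffixes);
    }

    static Predicate<URI> pathEndsWithOneOf(String... suffixes) {
        return u -> endsWithOneOf(u.getPath(), suffixes);
    }

    private static boolean endsWithOneOf(String s, String... suffixes) {
        if (s == null) return false;
        for (String suffix : suffixes) if (s.endsWith(suffix)) return true;
        return false;
    }

    // ( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.it,.eu)
    //   and not PathEndsWithOneOf(.pdf,.zip)
    static final Predicate<URI> SCHEDULE =
        schemeEquals("http").or(schemeEquals("https"))
            .and(hostEndsWithOneOf(".it", ".eu"))
            .and(pathEndsWithOneOf(".pdf", ".zip").negate());

    public static void main(String[] args) {
        for (String s : new String[] { "http://example.it/index.html", "http://example.it/paper.pdf", "https://example.com/" })
            System.out.println(s + " -> " + SCHEDULE.test(URI.create(s)));
    }
}
```

The point of the sketch is only the shape of the expression: a disjunction of scheme tests, conjoined with a host test and a negated path test, evaluated once per URL.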

Starting a crawl

First of all, we suggest running as root the command sysctl -p /etc/sysctl.d/bubing (where the file bubing contains the settings discussed before), as in our experience some of the parameters turn out not to be set correctly after boot, even if included in /etc/sysctl.d.
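
Since settings may silently differ from the file after boot, it can help to compare the file with the live kernel values. The following is a minimal sketch of such a check, assuming a Linux system where each sysctl key maps to a file under /proc/sys (dots replaced by slashes); the class SysctlCheck is hypothetical and not part of BUbiNG, and keys whose module is not loaded (e.g., conntrack) are simply reported as unreadable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: compares a sysctl-style file with the values under /proc/sys.
class SysctlCheck {
    /** Parses "key = value" lines, skipping blank lines and # comments. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> settings = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            int eq = trimmed.indexOf('=');
            if (eq < 0) continue;
            settings.put(trimmed.substring(0, eq).trim(), normalize(trimmed.substring(eq + 1)));
        }
        return settings;
    }

    /** Collapses whitespace runs, so that "1024 65000" matches the tab-separated kernel output. */
    static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ");
    }

    /** Maps a key such as net.core.rmem_max to /proc/sys/net/core/rmem_max. */
    static Path procPath(String key) {
        return Paths.get("/proc/sys", key.split("\\."));
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "/etc/sysctl.d/bubing");
        for (Map.Entry<String, String> e : parse(Files.readAllLines(file)).entrySet()) {
            Path p = procPath(e.getKey());
            String actual = Files.isReadable(p) ? normalize(new String(Files.readAllBytes(p))) : "<unreadable>";
            if (!actual.equals(e.getValue()))
                System.out.println(e.getKey() + ": expected " + e.getValue() + ", found " + actual);
        }
    }
}
```

Running it with no output means that every key in the file matches the running kernel.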

A BUbiNG agent is started using the Agent class, as in:

java -server -Xss256K -Xms20G -XX:+UseNUMA -Djavax.net.ssl.sessionCacheSize=8192 \
        -XX:+UseTLAB -XX:+ResizeTLAB -XX:NewRatio=4 -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled \
        -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails \
        -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
        -Djava.rmi.server.hostname=192.168.0.20 \
        -Djava.net.preferIPv4Stack=true \
        -Djgroups.bind_addr=192.168.0.20 \
        -Dlogback.configurationFile=bubing-logback.xml \
        -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
        it.unimi.di.law.bubing.Agent -h 192.168.0.20 -P eu.properties -g eu agent -n 2>err >out

The JVM options are of course just an example for a 64-core, 64 GiB machine, and should be customized. It is important, however, that the thread stack size (-Xss) is set to a small value (as BUbiNG uses thousands of threads).
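
To see why the stack size matters, the sketch below computes an upper bound on the address space reserved for thread stacks, using the thread counts from the example configuration (parsingThreads, dnsThreads, fetchingThreads). The 1 MiB figure used for comparison is an assumption about the JVM default on 64-bit Linux, not something stated here; check your JVM. The class name StackBudget is hypothetical, and other JVM and BUbiNG threads are ignored.

```java
// Rough upper bound on thread-stack address space; StackBudget is an illustrative name.
class StackBudget {
    static long stackBytes(int threads, long stackSizeBytes) {
        return threads * stackSizeBytes;
    }

    public static void main(String[] args) {
        int threads = 64 + 50 + 512;        // parsingThreads + dnsThreads + fetchingThreads
        long small = 256L * 1024;           // -Xss256K
        long assumedDefault = 1024L * 1024; // assumption: 1 MiB default stack size
        double mib = 1024.0 * 1024.0;
        System.out.printf("%d threads at 256 KiB: %.1f MiB%n", threads, stackBytes(threads, small) / mib);
        System.out.printf("%d threads at 1 MiB:   %.1f MiB%n", threads, stackBytes(threads, assumedDefault) / mib);
    }
}
```

Stacks are reserved rather than fully committed, so these figures are a ceiling, but the fourfold difference grows linearly with the number of threads.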

The IP address specified will be used for JGroups and JMX communication: if you have an internal high-speed interface, this is the place to use it. Otherwise, just use the standard IP address of the machine. This crawl will use a property file named eu.properties and start an agent named agent as part of a crawl group named eu. Logback will be configured by the file bubing-logback.xml (see below).

In the standard BUbiNG setup, agents in the same crawl group coordinate autonomously using consistent hashing, so if you want to perform a multi-agent crawl you just need to make sure that your hardware and JGroups are properly configured to work together, and give each agent a different name. A simple way to check that this is true is to start the crawl in pause mode, check from the logs that all agents are visible to each other, and then start the crawl.
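
For readers unfamiliar with consistent hashing, the following is a minimal textbook sketch of the idea: agents are placed at several points on a ring, and each host is assigned to the agent owning the first point at or after the host's hash. It is not BUbiNG's implementation; the class ConsistentHashRing, the use of MD5, and the number of replicas are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Generic consistent-hashing ring; illustrative only, not BUbiNG's code.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    ConsistentHashRing(int replicas) {
        this.replicas = replicas;
    }

    /** Places the agent at `replicas` pseudo-random points on the ring. */
    void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.put(hash(agent + "#" + i), agent);
    }

    void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) ring.remove(hash(agent + "#" + i));
    }

    /** Returns the agent owning the first ring point at or after the host's hash, wrapping around. */
    String agentFor(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** First 64 bits of the MD5 digest of the string. */
    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addAgent("agent-0");
        ring.addAgent("agent-1");
        System.out.println("example.eu -> " + ring.agentFor("example.eu"));
    }
}
```

The property that makes this attractive for a crawl group is that when an agent leaves, only the hosts it owned are reassigned; every other host stays with the same agent.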

Note that if you use multiple agents you have no guarantee that the politeness IP delay will be respected, as multiple agents might crawl the same IP in parallel. To reduce pressure on IP addresses with a large number of hosts, if you have many agents you can try to tune the IP delay factor.

We use Logback for logging. We suggest a separate log for messages at or above the WARN level, to be able to find errors quickly. The file used in the command above could be as follows (assuming you want to write your logs in /tmp):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="all" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/tmp/bubing.log</file>
    <encoder>
      <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <maxHistory>100</maxHistory>
      <FileNamePattern>/tmp/bubing.%d{yyyy-MM-dd_HH}.log</FileNamePattern>
    </rollingPolicy>
  </appender>
  <appender name="warn" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/tmp/bubing.warn</file>
    <encoder>
      <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
    </encoder>
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
      <level>WARN</level>
    </filter>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <maxHistory>100</maxHistory>
      <FileNamePattern>/tmp/bubing.%d.warn</FileNamePattern>
    </rollingPolicy>
  </appender>
  <root level="INFO">
    <appender-ref ref="all"/>
    <appender-ref ref="warn"/>
  </root>
  <logger name="net.htmlparser.jericho" level="OFF"/>
  <logger name="it.unimi.di.law.bubing" level="INFO"/>
  <logger name="it.unimi.di.law.bubing.frontier.FetchingThread$DNSJavaClientConnectionManager" level="INFO"/>
  <logger name="org.apache.http.impl.client" level="WARN"/>
  <logger name="org.apache.http.wire" level="WARN"/>
</configuration>

Tuning the garbage collector

BUbiNG's object handling deeply violates the weak generational hypothesis, that is, the idea that most allocated objects die young. During a crawl, dozens (or even hundreds) of millions of small objects have a lifetime of minutes or hours, and then disappear. Using the default settings of the garbage collector leads to a constant promotion of these objects to the old generation, causing frequent full collections that progressively decrease the crawl speed.

The settings suggested above radically modify the garbage collector's behavior. With -XX:NewRatio=4, one fifth of the heap is allocated for new objects, and we try to delay tenuring as much as possible by swapping objects between the survivor spaces the maximum possible number of times (15, as the counter has only four bits). Note that frequent minor collections have little impact on the crawler, as even during collections the operating system is filling socket buffers, flushing file buffers, handling memory mapping, etc. We also require dynamic thread-local allocation buffers, as the allocation needs of, say, fetching threads and parsing threads are very different.
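
As a quick check on how -XX:NewRatio translates into generation sizes: HotSpot defines it as the ratio between the old and the new generation, so the new generation receives 1/(NewRatio + 1) of the heap. The sketch below tabulates this for a 20 GiB heap (the -Xms20G of the command above); the class name is hypothetical, and actual sizes chosen by the JVM may be rounded or adjusted.

```java
// Arithmetic behind -XX:NewRatio and -XX:MaxTenuringThreshold; GenerationSizing is an illustrative name.
class GenerationSizing {
    /** Largest value representable by the four-bit object age counter. */
    static final int MAX_TENURING_THRESHOLD = (1 << 4) - 1;

    /** Fraction of the heap given to the new generation when old/new = newRatio. */
    static double newGenFraction(int newRatio) {
        return 1.0 / (newRatio + 1);
    }

    public static void main(String[] args) {
        double heapGiB = 20; // from -Xms20G in the example command
        for (int ratio = 1; ratio <= 4; ratio++)
            System.out.printf("-XX:NewRatio=%d -> new generation %.2f GiB, old generation %.2f GiB%n",
                ratio, heapGiB * newGenFraction(ratio), heapGiB * (1 - newGenFraction(ratio)));
        System.out.println("Maximum tenuring threshold: " + MAX_TENURING_THRESHOLD);
    }
}
```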

It is always a good idea to log garbage collection information (as suggested above) and to examine the resulting log using a tool like chewiebug's GCViewer. Ideally, you should see frequent minor collections, and small sets of objects promoted to the old generation.

The actual values of the parameters should be chosen in the following order:

you must decide the amount of heap, which should be as large as possible, considering that you will need to leave some memory for caching, memory-mapped files, etc. Running a simulated crawl for a few hundred million pages is a good idea, as it is difficult to estimate the actual space occupied by the JVM (e.g., there is a large amount of system memory used);
you must decide the ratio between the new and the old generation; in principle, the largest possible new generation (-XX:NewRatio=1) will give the highest throughput, but if you are low on memory you might end up with out-of-memory errors because the old generation is too small.

The final judge of a correct configuration of the garbage collector is the stability of the crawling speed.

How BUbiNG uses memory

Once you have decided the amount of memory that you want to use for the crawler, you must decide how to divide this memory among BUbiNG's components. Most of the memory used by BUbiNG can be divided as follows:

The workbench: The workbench keeps around the URLs (actually, the path+queries) that will be fetched in a short time. Depending on the number of threads, your network feed, etc., the workbench must be large enough to keep a few URLs for a number of hosts that is sufficient to saturate your bandwidth. We suggest starting with a large value (e.g., a gigabyte), checking in the logs which percentage is actually used, and then lowering the amount.
The sieve: The sieve keeps track of which URLs have ever been seen. The default implementation, MercatorSieve, accumulates URLs on disk and the associated 64-bit fingerprints in RAM until it is full. Then, it flushes the data and generates a stream of ready URLs. The larger the sieve, the fewer the flushes and the less I/O. The value of StartupConfiguration.sieveSize should be multiplied by 12 to obtain approximately the amount of memory allocated.
The fetching thread cache: Each fetching thread must download data to a cache, whose size is determined by StartupConfiguration.fetchDataBufferByteSize. The cache is transparently extended to StartupConfiguration.responseBodyMaxByteSize using a file on disk, but too much disk access will slow down the crawler.
The Bloom filter: If you are using near-duplicate detection, a large amount of memory is used by the Bloom filter that keeps track of digests. The amount of memory used depends on StartupConfiguration.maxUrls and StartupConfiguration.bloomFilterPrecision.
The URL cache: The URL cache keeps track of 128-bit MurmurHash3 fingerprints of recently seen URLs; it significantly reduces the number of URLs broadcast in a distributed crawl, besides lightening the load on the sieve. Its size must be fixed a priori using StartupConfiguration.urlCacheMaxByteSize. Look at the hit/miss ratio in the logs to understand whether you have a large enough cache (the bigger, the better).
Data associated with visit states: This is a relatively small amount of data (less than one kilobyte) associated with each host currently visited. It consists of some metadata associated with the host, its cookies, and its robots.txt exclusion information. The size of the cookies can be controlled using StartupConfiguration.cookieMaxByteSize. This is the only part of the data accumulated by BUbiNG that might, over time, grow without bound.
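
The components above can be added up into a rough budget. The sketch below does so for the values of the example configuration, under several assumptions that are ours and not statements about BUbiNG's internals: the K/M suffixes are taken as decimal and Ki/Mi/Gi as binary, the Bloom filter is estimated with the textbook formula of -n ln p / (ln 2)^2 bits for n elements at false-positive rate p, and per-host visit-state data is left out. The class name MemoryBudget is hypothetical.

```java
// Back-of-the-envelope memory budget for the example configuration; illustrative only.
class MemoryBudget {
    static final long MI = 1L << 20, GI = 1L << 30;

    /** The sieve uses approximately 12 times sieveSize, as stated in the text. */
    static long sieveBytes(long sieveSize) {
        return 12 * sieveSize;
    }

    /** One in-memory buffer per fetching thread. */
    static long fetchBufferBytes(int fetchingThreads, long fetchDataBufferByteSize) {
        return fetchingThreads * fetchDataBufferByteSize;
    }

    /** Textbook Bloom filter size; an assumption, not necessarily what BUbiNG allocates. */
    static long bloomBytes(long maxUrls, double precision) {
        double bits = -maxUrls * Math.log(precision) / (Math.log(2) * Math.log(2));
        return (long) Math.ceil(bits / 8);
    }

    public static void main(String[] args) {
        long workbench = 512 * MI;                         // workbenchMaxByteSize=512Mi
        long urlCache = GI;                                // urlCacheMaxByteSize=1Gi
        long sieve = sieveBytes(64 * MI);                  // sieveSize=64Mi
        long buffers = fetchBufferBytes(512, 200_000);     // fetchingThreads=512, fetchDataBufferByteSize=200K
        long bloom = bloomBytes(500_000_000L, 1E-8);       // maxUrls=500M, bloomFilterPrecision=1E-8
        long total = workbench + urlCache + sieve + buffers + bloom;
        System.out.printf("workbench %d MiB, URL cache %d MiB, sieve %d MiB, buffers %d MiB, Bloom filter %d MiB%n",
            workbench / MI, urlCache / MI, sieve / MI, buffers / MI, bloom / MI);
        System.out.printf("total (excluding visit states) %d MiB%n", total / MI);
    }
}
```

Whatever remains of the heap after this fixed budget is what is available for visit states and for the garbage collector to work comfortably.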

How BUbiNG uses disks

BUbiNG's activity requires a lot of disk I/O. Some of the I/O activity, like the flushes of the MercatorSieve, consists of stop-the-world events and requires no particular care. However, there are four activities that happen concurrently during the crawl:

Accessing the disk queues: Part of the URLs waiting to be crawled are virtualized—stored on disk. These URLs are accessed frequently by memory mapping in a semi-random fashion; they are stored in StartupConfiguration.frontierDir.
Dumping large responses: Only the first StartupConfiguration.fetchDataBufferByteSize bytes of a response are kept in main memory: the remaining part, up to StartupConfiguration.responseBodyMaxByteSize bytes, is dumped on a disk cache. There is one such cache per parsing thread, so the disk activity can be quite intensive. The caches are stored in StartupConfiguration.responseCacheDir.
Storing data: If you use a disk-based store, such as the default WarcStore, data will be continuously written in StartupConfiguration.storeDir.
Logging: Depending on your Logback configuration, logging can be quite I/O intensive.

Ideally, these four activities should use different disks, and, if possible, StartupConfiguration.frontierDir should live on solid-state storage.

Following a crawl

There are two sources of information about the inner workings of BUbiNG: JMX and the logs. Most internals of BUbiNG are modifiable at runtime using JMX, and looking at the logs reveals a lot of information about what is going on. We suggest regularly grepping the logs for ERROR and checking standard error to catch problems at an early stage.

Pausing, suspending or stopping a crawl

If you just want to pause a crawl, there is a JMX Agent.pause() operation that will suspend crawling activity until Agent.resume() is invoked. Alternatively, you can stop the crawl, which will cause the crawl state to be saved to disk. At that point, the crawl can be restarted by launching the agent with the same configuration but without the -n option.
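
The pause and resume operations can be invoked from any JMX console, or programmatically. The following is a minimal sketch of a programmatic client, assuming the agent was started with the JMX options shown earlier (port 9999, no SSL, no authentication). The object name under which the agent registers its MBean is not given here, so the sketch searches for MBeans whose domain starts with it.unimi.di.law.bubing; that pattern is an assumption, as is the class name JmxPause. Only MBeans that actually expose the requested operation are touched.

```java
import javax.management.MBeanOperationInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical JMX client; the MBean domain pattern below is an assumption.
class JmxPause {
    /** Standard RMI-based JMX service URL for a remote JVM. */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    static boolean hasOperation(MBeanOperationInfo[] operations, String name) {
        for (MBeanOperationInfo op : operations) if (op.getName().equals(name)) return true;
        return false;
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "192.168.0.20";
        String operation = args.length > 1 ? args[1] : "pause"; // or "resume"
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl(host, 9999)))) {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            // Assumption: the agent's MBean lives in a domain starting with the BUbiNG package name.
            for (ObjectName name : server.queryNames(new ObjectName("it.unimi.di.law.bubing*:*"), null)) {
                if (!hasOperation(server.getMBeanInfo(name).getOperations(), operation)) continue;
                System.out.println("Invoking " + operation + "() on " + name);
                server.invoke(name, operation, new Object[0], new String[0]);
            }
        }
    }
}
```

For instance, java JmxPause 192.168.0.20 pause followed later by java JmxPause 192.168.0.20 resume would bracket a maintenance window, provided the assumed domain pattern matches your agent.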

Package	Description
it.unimi.di.law.bubing	The main BUbiNG class, `Agent`, and the companion classes `StartupConfiguration`/`RuntimeConfiguration`, which describe BUbiNG's internals that can be configured or modified at runtime.
it.unimi.di.law.bubing.frontier	A set of classes orchestrating the movement of URLs to be fetched next by a BUbiNG agent.
it.unimi.di.law.bubing.frontier.dns
it.unimi.di.law.bubing.parser	The system of parsers used for analyzing HTTP responses.
it.unimi.di.law.bubing.sieve	Several implementations of the idea of an `AbstractSieve`.
it.unimi.di.law.bubing.spam
it.unimi.di.law.bubing.store	Implementations of the `Store` interface.
it.unimi.di.law.bubing.test
it.unimi.di.law.bubing.tool
it.unimi.di.law.bubing.util	Miscellaneous utility classes, including the BUbiNG queuing system and its fast cache for URLs.
it.unimi.di.law.warc	An implementation of the Web ARChive file format (WARC) specification.
it.unimi.di.law.warc.filters	A comprehensive filtering system.
it.unimi.di.law.warc.filters.parser
it.unimi.di.law.warc.io	I/O of WARC formatted files.
it.unimi.di.law.warc.io.gzarc	An implementation of a (skippable) GZIP archive.
it.unimi.di.law.warc.processors	Processors to manipulate WARC files.
it.unimi.di.law.warc.records	WARC records.
it.unimi.di.law.warc.tool
it.unimi.di.law.warc.util	Utility classes used by the `it.unimi.di.law.warc` package.