BUbiNG is a high-speed distributed crawler.

Packages 
Package Description
it.unimi.di.law.bubing
The main BUbiNG class, Agent, and the companion classes StartupConfiguration/RuntimeConfiguration, which describe BUbiNG's internals that can be configured or modified at runtime.
it.unimi.di.law.bubing.frontier
A set of classes orchestrating the movement of URLs to be fetched next by a BUbiNG agent.
it.unimi.di.law.bubing.frontier.dns  
it.unimi.di.law.bubing.parser
The system of parsers used for analyzing HTTP responses.
it.unimi.di.law.bubing.sieve
Several implementations of the idea of an AbstractSieve.
it.unimi.di.law.bubing.spam  
it.unimi.di.law.bubing.store
Implementations of the Store interface.
it.unimi.di.law.bubing.test  
it.unimi.di.law.bubing.tool  
it.unimi.di.law.bubing.util
Miscellaneous utility classes, including the BUbiNG queuing system and its fast cache for URLs.
it.unimi.di.law.warc
An implementation of the Web ARChive file format (WARC) specification.
it.unimi.di.law.warc.filters
A comprehensive filtering system.
it.unimi.di.law.warc.filters.parser  
it.unimi.di.law.warc.io
I/O of WARC formatted files.
it.unimi.di.law.warc.io.gzarc
An implementation of a (skippable) GZIP archive.
it.unimi.di.law.warc.processors
Processors to manipulate WARC files.
it.unimi.di.law.warc.records
WARC records.
it.unimi.di.law.warc.tool  
it.unimi.di.law.warc.util
Utility classes used by the it.unimi.di.law.warc package.

BUbiNG is a high-speed distributed crawler. BUbiNG agents coordinate autonomously to download data from the web.

Configuring your hardware

BUbiNG requires a number of features from your system.

To check that your system is properly configured, it is essential that a number of trial runs are performed using the NamedGraphServerHttpProxy in random mode. For instance,

   java -Xmx4G -server it.unimi.di.law.bubing.test.NamedGraphServerHttpProxy -s 100000000 -d 50 -m 3 -t 1000 -D .0001 -A1000 - &
   
will start the proxy using 100 million hosts, average page degree 50, average depth 3, 1000 threads and 0.01% of pages being broken (connect or read timeouts). By starting BUbiNG with the configuration option proxyHost=localhost (the port, 8080, is understood) you can test that your memory and system settings are satisfactory without using any network bandwidth.

Configuring BUbiNG

BUbiNG uses a configuration file to set up its internal values. Such values are fully documented in StartupConfiguration; the names of the keys are derived by reflection from the field names. Default values are annotated using the StartupConfiguration.OptionalSpecification annotation in StartupConfiguration. This is an example configuration file:

rootDir=crawl-eu
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=512
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.ad\,.al\,.at\,.be\,.bg\,.ch\,.cz\,.de\,.dk\,.ee\,.es\,.eu\,.fi\,.fo\,.fr\,.gb\,.gr\,.hr\,.hu\,.ie\,.im\,.is\,.it\,.li\,.lt\,.lu\,.lv\,.mc\,.md\,.me\,.nl\,.no\,.pl\,.pt\,.ro\,.se\,.si\,.sk\,.sm\,.uk\,.va) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=500M
bloomFilterPrecision=1E-8
seed=file:/some/path/eu.seed
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+http://law.di.unimi.it/BUbiNG.html)
userAgentFrom=youremail@yourdomain.com
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
keepAliveTime=1s
    

Note that almost all these properties are modifiable at runtime using the JMX interface.

There are two main activities related to the preparation of configuration files:

Choosing your filters
Every phase of BUbiNG's activity is controlled by a filter that you can specify using boolean operations and atoms from it.unimi.di.law.warc.filters.
Sizing the various components
BUbiNG will use a constant amount of memory for its activities, except for storing host information, for which it assumes that enough core memory is available. The shown values are a good choice for a 64 GiB machine.
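
To make the idea of combining atoms with boolean operations concrete, here is a minimal sketch in plain Java. It does not use the classes in it.unimi.di.law.warc.filters: the atoms below (schemeEquals, hostEndsWithOneOf, pathEndsWithOneOf, urlShorterThan) are hypothetical stand-ins built on java.util.function.Predicate, named after the atoms in the example configuration, and their exact semantics are an assumption.

```java
import java.net.URI;
import java.util.function.Predicate;

public class FilterSketch {
    // Hypothetical atoms, modelled on the names used in the example configuration.
    static Predicate<URI> schemeEquals(String scheme) {
        return u -> scheme.equals(u.getScheme());
    }

    static Predicate<URI> hostEndsWithOneOf(String... suffixes) {
        return u -> {
            String host = u.getHost();
            if (host == null) return false;
            for (String s : suffixes) if (host.endsWith(s)) return true;
            return false;
        };
    }

    static Predicate<URI> pathEndsWithOneOf(String... suffixes) {
        return u -> {
            String path = u.getPath();
            if (path == null) return false;
            for (String s : suffixes) if (path.endsWith(s)) return true;
            return false;
        };
    }

    static Predicate<URI> urlShorterThan(int n) {
        // Assumption: "shorter than" means strictly fewer than n characters.
        return u -> u.toString().length() < n;
    }

    // (http or https) and host suffix and not unwanted extension and length bound.
    public static Predicate<URI> scheduleFilter() {
        return schemeEquals("http").or(schemeEquals("https"))
            .and(hostEndsWithOneOf(".it", ".de", ".fr"))
            .and(pathEndsWithOneOf(".pdf", ".png", ".zip").negate())
            .and(urlShorterThan(2048));
    }

    public static void main(String[] args) {
        Predicate<URI> f = scheduleFilter();
        System.out.println(f.test(URI.create("http://example.it/index.html"))); // true
        System.out.println(f.test(URI.create("http://example.it/paper.pdf")));  // false
        System.out.println(f.test(URI.create("ftp://example.it/index.html")));  // false
    }
}
```

The structure mirrors the scheduleFilter line of the example configuration: a disjunction of scheme atoms, conjoined with a host atom, a negated path atom and a length bound.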

Starting a crawl

First of all, we suggest running as root the command sysctl -p /etc/sysctl.d/bubing (where the file bubing contains the settings discussed before), as in our experience some of the parameters turn out not to be set correctly after boot, even if included in /etc/sysctl.d.

A BUbiNG agent is started using the Agent class, as in:

java -server -Xss256K -Xms20G -XX:+UseNUMA -XX:+UseConcMarkSweepGC  \
        -XX:+UseTLAB -XX:+ResizeTLAB -XX:NewRatio=4 -XX:MaxTenuringThreshold=15 -XX:+CMSParallelRemarkEnabled \
        -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution \
        -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime \
        -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
        -Djava.rmi.server.hostname=192.168.0.20 \
        -Djava.net.preferIPv4Stack=true \
        -Djgroups.bind_addr=192.168.0.20 \
        -Dlogback.configurationFile=bubing-logback.xml \
        -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false \
        it.unimi.di.law.bubing.Agent -h 192.168.0.20 -P eu.properties -g eu agent -n 2>err >out
        

The JVM options are of course just an example for a 64-core, 64 GiB hardware, and should be customized. It is important, however, that the thread stack size (-Xss) is set to a small value (as BUbiNG uses thousands of threads).
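
The effect of the stack size can be estimated with simple arithmetic. The sketch below uses the thread counts from the example configuration (fetching, parsing and DNS threads) and compares -Xss256K with a 1 MiB stack; the 1 MiB figure is an assumption about a typical JVM default, not a value taken from BUbiNG, and the class and method names are illustrative.

```java
public class StackBudget {
    /** Address space reserved for thread stacks, in bytes. */
    public static long stackBytes(int threads, long stackSizeBytes) {
        return threads * stackSizeBytes;
    }

    public static void main(String[] args) {
        // fetchingThreads + parsingThreads + dnsThreads from the example configuration.
        int threads = 512 + 64 + 50;
        long small = stackBytes(threads, 256L * 1024);          // -Xss256K
        long assumedDefault = stackBytes(threads, 1024L * 1024); // assumed 1 MiB default
        System.out.printf("%d threads: %d MiB with -Xss256K, %d MiB with 1 MiB stacks%n",
            threads, small >> 20, assumedDefault >> 20);
    }
}
```

With larger thread counts the difference grows proportionally, which is why a small -Xss matters for a crawler with thousands of threads.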

The IP address specified will be used for JGroups and JMX communication: if you have an internal high-speed interface, this is the place to use it. Otherwise, just use the standard IP address of the machine. This crawl will use a property file named eu.properties and start an agent named agent as part of a crawl group named eu. Logback will be configured by the file bubing-logback.xml (see below).

In the standard BUbiNG setup, agents in the same crawl group coordinate autonomously using consistent hashing, so if you want to perform a multi-agent crawl you just need to make sure that your hardware and JGroups are properly configured to work together. A simple way to check that this is true is to start the crawl in pause mode, check from the logs that all agents are visible to each other, and then start the crawl.
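
For readers unfamiliar with consistent hashing, the following is a generic, minimal sketch of the technique and not BUbiNG's actual implementation. It assumes, for illustration only, that the keys being assigned are host names, that each agent is placed on the ring at 100 virtual points, and that MD5 is used as the hash; all class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHashSketch {
    private static final int REPLICAS = 100; // virtual nodes per agent (arbitrary choice)
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // First 8 bytes of the MD5 digest, used as a position on the ring.
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFF);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void addAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++) ring.put(hash(agent + "#" + i), agent);
    }

    public void removeAgent(String agent) {
        for (int i = 0; i < REPLICAS; i++) ring.remove(hash(agent + "#" + i));
    }

    /** Returns the agent owning the first ring position at or after the key's hash. */
    public String agentFor(String host) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue(); // wrap around the ring
    }
}
```

The property that makes the scheme attractive for a crawl group is that when an agent leaves, only the keys it owned are reassigned; every other key keeps its agent, so no central coordinator is needed to rebalance.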

We use Logback for logging. We suggest a separate log for messages at or above the WARN level, to be able to find errors quickly. The file used in the command above could be as follows (assuming you want to write your logs in /tmp):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="all" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/tmp/bubing.log</file>
    <encoder>
      <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
    </encoder>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <maxHistory>100</maxHistory>
      <FileNamePattern>/tmp/bubing.%d{yyyy-MM-dd_HH}.log</FileNamePattern>
    </rollingPolicy>
  </appender>
  <appender name="warn" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/tmp/bubing.warn</file>
    <encoder>
      <pattern>%d %r %p [%t] %logger{1} - %m%n</pattern>
    </encoder>
    <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
      <level>WARN</level>
    </filter>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <maxHistory>100</maxHistory>
      <FileNamePattern>/tmp/bubing.%d.warn</FileNamePattern>
    </rollingPolicy>
  </appender>
  <root level="INFO">
    <appender-ref ref="all"/>
    <appender-ref ref="warn"/>
  </root>
  <logger name="net.htmlparser.jericho" level="OFF"/>
  <logger name="it.unimi.di.law.bubing" level="INFO"/>
  <logger name="it.unimi.di.law.bubing.frontier.FetchingThread$DNSJavaClientConnectionManager" level="INFO"/>
  <logger name="org.apache.http.impl.client" level="WARN"/>
  <logger name="org.apache.http.wire" level="WARN"/>
</configuration>

Tuning the garbage collector

BUbiNG's object handling deeply violates the weak generational hypothesis, that is, the idea that most allocated objects die young. During a crawl, dozens (or even hundreds) of millions of small objects have a lifetime of minutes or hours, and then disappear. Using the default settings of the garbage collector leads to a constant promotion of these objects to the old generation, causing frequent full collections that progressively decrease the crawl speed.

The settings suggested above radically modify the garbage collector's behavior. With -XX:NewRatio=4, one fifth of the heap is allocated for new objects, and we try to delay tenuring as much as possible by swapping objects between the survivor spaces the maximum possible number of times (15, as the counter has only four bits). Note that frequent minor collections have little impact on the crawler, as even during collections the operating system is filling socket buffers, flushing file buffers, handling memory mapping, etc. We also require dynamic thread-local allocation buffers, as the allocation needs of, say, fetching threads and parsing threads are very different.

It is always a good idea to log garbage collection information (as suggested above) and to examine the resulting log using a tool like Chewiebug. Ideally, you should see frequent minor collections, and small sets of objects promoted to the old generation.
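
As a complement to the GC log, collection counts and cumulative times can also be read from the standard java.lang.management API, either in process or remotely through JMX. The sketch below is a generic illustration and not part of BUbiNG; the class name GcSnapshot is hypothetical. Collector names depend on the JVM and on the collector selected on the command line.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {
    /** Returns one line per collector with its collection count and total time. */
    public static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Sampling these counters periodically and comparing the growth of the young-generation and old-generation collectors gives a quick indication of whether minor collections dominate, as desired.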

The actual values of the parameters should be chosen in the following order:

The final judge of a correct configuration of the garbage collector is the stability of the crawling speed.

How BUbiNG uses memory

Once you have decided the amount of memory that you want to use for the crawler, you must decide how to divide this memory among BUbiNG's components. Most of the memory used by BUbiNG can be divided as follows:

The workbench
The workbench keeps around the URLs (actually, the path+queries) that will be fetched in a short time. Depending on the number of threads, your network feed, etc., the workbench must be large enough to keep a few URLs for a number of hosts that is sufficient to saturate your bandwidth. We suggest starting with a large value (e.g., a gigabyte), checking in the logs which percentage is actually used, and then lowering the amount.
The sieve
The sieve keeps track of which URLs have ever been seen. The default implementation, MercatorSieve, accumulates URLs on disk and associated 64-bit fingerprints in RAM until it is full. Then, it flushes the data and generates a stream of ready URLs. The larger the sieve, the fewer the flushes and the less the I/O. The StartupConfiguration.sieveSize should be multiplied by 12 to obtain approximately the amount of memory allocated.
The fetching thread cache
Each fetching thread must download data to a cache, whose size is determined by StartupConfiguration.fetchDataBufferByteSize. The cache is transparently extended to StartupConfiguration.responseBodyMaxByteSize using a file on disk, but too much disk access will slow down the crawler.
The Bloom filter
If you are using near-duplicate detection there is a large amount of memory used by the Bloom filter which keeps track of digests. The amount of memory used depends on StartupConfiguration.maxUrls and StartupConfiguration.bloomFilterPrecision, and it is explained here.
The URL cache
The URL cache keeps track of 128-bit MurmurHash3 fingerprints of recently seen URLs; it significantly reduces the number of URLs broadcast in a distributed crawl, besides lightening the load on the sieve. Its size must be fixed a priori using StartupConfiguration.urlCacheMaxByteSize. Look at the hit/miss ratio in the logs to understand whether you have a large enough cache (the bigger, the better).
Data associated to visit states
This is a relatively small amount of data (less than one kilobyte) associated with each host currently visited. It consists of some metadata associated with the host, its cookies, and its robots.txt exclusion information. The size of the cookies can be controlled using StartupConfiguration.cookieMaxByteSize. This is the only part of the data accumulated by BUbiNG that might, over time, grow without bound.
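
The sizing rules above can be turned into a rough budget. The sketch below plugs in the values from the example configuration and rests on several assumptions: that the suffix K in fetchDataBufferByteSize means 10^3 bytes, that there is one download cache per fetching thread, and that the Bloom filter is sized by the textbook formula m = -n ln p / (ln 2)^2 bits, which may differ from what BUbiNG actually allocates. Visit-state data is left out because it depends on the number of hosts. All names are illustrative.

```java
public class MemoryBudget {
    /** Approximate sieve memory: sieveSize multiplied by 12, as stated above. */
    public static long sieveBytes(long sieveSize) {
        return sieveSize * 12;
    }

    /** One in-memory download buffer per fetching thread. */
    public static long fetchCacheBytes(int fetchingThreads, long bufferBytes) {
        return fetchingThreads * bufferBytes;
    }

    /** Textbook optimal Bloom filter size for n keys and false-positive rate p (an assumption). */
    public static long bloomBytes(long n, double p) {
        double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
        return (long) Math.ceil(bits / 8);
    }

    public static void main(String[] args) {
        long workbench = 512L << 20;                    // workbenchMaxByteSize=512Mi
        long urlCache = 1L << 30;                       // urlCacheMaxByteSize=1Gi
        long sieve = sieveBytes(64L << 20);             // sieveSize=64Mi
        long fetch = fetchCacheBytes(512, 200_000L);    // fetchingThreads=512, 200K assumed 10^3-based
        long bloom = bloomBytes(500_000_000L, 1e-8);    // maxUrls=500M, bloomFilterPrecision=1E-8
        long total = workbench + urlCache + sieve + fetch + bloom;
        System.out.printf("workbench=%d MiB, urlCache=%d MiB, sieve=%d MiB, fetch=%d MiB, bloom=%d MiB, total=%d MiB%n",
            workbench >> 20, urlCache >> 20, sieve >> 20, fetch >> 20, bloom >> 20, total >> 20);
    }
}
```

Whatever is left of the heap after these fixed components has to accommodate the per-host visit states and the working set of the fetching and parsing threads.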

How BUbiNG uses disks

BUbiNG's activity requires a lot of disk I/O. Some of the I/O activity, like the flushes of the MercatorSieve, consists of stop-the-world events and requires no particular care. However, there are four activities that happen concurrently during the crawling:

Accessing the disk queues
Part of the URLs waiting to be crawled are virtualized—stored on disk. These URLs are accessed frequently by memory mapping in a semi-random fashion; they are stored in StartupConfiguration.frontierDir.
Dumping large responses
Only the first StartupConfiguration.fetchDataBufferByteSize bytes of a response are kept in main memory: the remaining part, up to StartupConfiguration.responseBodyMaxByteSize bytes, is dumped on a disk cache. There is one such cache per parsing thread, so the disk activity can be quite intensive. The caches are stored in StartupConfiguration.responseCacheDir.
Storing data
If you use a disk-based store, such as the default WarcStore, data will be continuously written in StartupConfiguration.storeDir.
Logging
Depending on your Logback configuration, logging can be quite I/O intensive.

Ideally, these four activities should use different disks, and, if possible, StartupConfiguration.frontierDir should live on solid-state storage.

Following a crawl

There are two sources of information about the inner workings of BUbiNG: JMX and the logs. Most internals of BUbiNG are modifiable at runtime using JMX, and looking at the logs reveals a lot of information about what's going on. We suggest regularly grepping for ERROR and checking standard error to catch problems at an early stage.

Pausing, suspending or stopping a crawl

If you just want to pause a crawl, there is a JMX Agent.pause() operation that will suspend crawling activity until Agent.resume() is invoked. Alternatively, you can stop the crawl, which will cause the crawl state to be snapped to disk. At that point, the crawl can restart by launching the agent with the same configuration but without the -n option.
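
Operations such as pause and resume can be invoked from any JMX client (for instance, a graphical console) or programmatically. The sketch below shows the programmatic mechanism using only the standard javax.management API. To stay self-contained it registers a hypothetical stand-in MBean (DemoAgent) under a made-up object name in the local platform MBean server; the object name under which a real BUbiNG agent is registered is not stated here and must be looked up in your JMX console.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

public class JmxPauseSketch {
    /** Hypothetical management interface standing in for the agent's MBean. */
    public interface DemoAgentMBean {
        void pause();
        void resume();
        boolean isPaused();
    }

    public static class DemoAgent implements DemoAgentMBean {
        private volatile boolean paused;
        @Override public void pause() { paused = true; }
        @Override public void resume() { paused = false; }
        @Override public boolean isPaused() { return paused; }
    }

    /** Invokes a no-argument operation by name, as a JMX client would. */
    public static void invoke(MBeanServerConnection conn, ObjectName name, String op) throws Exception {
        conn.invoke(name, op, new Object[0], new String[0]);
    }

    /** Registers the stand-in MBean, pauses and resumes it, and reports whether both took effect. */
    public static boolean demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("sketch:type=DemoAgent"); // made-up name
            if (server.isRegistered(name)) server.unregisterMBean(name);
            server.registerMBean(new DemoAgent(), name);
            try {
                invoke(server, name, "pause");
                boolean afterPause = (Boolean) server.getAttribute(name, "Paused");
                invoke(server, name, "resume");
                boolean afterResume = (Boolean) server.getAttribute(name, "Paused");
                return afterPause && !afterResume;
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pause/resume round trip ok: " + demo());
    }
}
```

Against a running agent, the MBeanServerConnection would instead be obtained from a remote connector (javax.management.remote.JMXConnectorFactory) pointed at the host and port given by the com.sun.management.jmxremote options on the command line.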
