All Implemented Interfaces: java.io.Closeable, java.lang.AutoCloseable

public class MercatorSieve<K,V> extends AbstractSieve<K,V>

An AbstractSieve that stores hash
data on disk, much in the same way as suggested by Allan Heydon and Marc Najork in
“Mercator: A Scalable, Extensible Web Crawler”,
World Wide Web, 2(4):219−229, 1999, Springer.

Each key known to the sieve is stored as a 64-bit hash in a disk file. At each enqueue(Object, Object) the hash of the
enqueued key is added to a bucket, and the key is saved in an auxiliary file. When the bucket is full, it is sorted and compared with
the set of keys known to the sieve. Note that the output order is guaranteed to be the same as the input order (i.e., keys
are appended in the same order in which they first appeared).
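
The bucket-and-compare cycle described above can be illustrated with a toy, in-memory model. This is only a sketch under stated assumptions, not the MercatorSieve implementation: the class name ToySieve, the FNV-1a placeholder hash, and the use of in-memory arrays in place of the disk store and auxiliary file are all hypothetical, and values and update strategies are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy in-memory model of a Mercator-style sieve (hypothetical; keys only, no values). */
final class ToySieve {
    private long[] store = new long[0];                     // sorted hashes of known keys (stands in for the disk file)
    private final long[] bucket;                            // hashes of enqueued keys, in arrival order
    private final List<String> aux = new ArrayList<>();     // enqueued keys, in arrival order (stands in for the auxiliary file)
    private final List<String> output = new ArrayList<>();  // keys found to be new, in first-appearance order
    private int size = 0;

    ToySieve(int bucketSize) { bucket = new long[bucketSize]; }

    /** Placeholder 64-bit hash (assumption): FNV-1a over the key's chars. */
    private static long hash(String key) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < key.length(); i++) { h ^= key.charAt(i); h *= 0x100000001b3L; }
        return h;
    }

    void enqueue(String key) {
        bucket[size++] = hash(key);
        aux.add(key);
        if (size == bucket.length) flush();                 // a full bucket triggers the check+update
    }

    void flush() {
        // Sort bucket positions by hash; the sort is stable, so equal hashes keep arrival order.
        Integer[] pos = new Integer[size];
        for (int i = 0; i < size; i++) pos[i] = i;
        Arrays.sort(pos, Comparator.comparingLong((Integer i) -> bucket[i]));

        boolean[] isNew = new boolean[size];
        long[] merged = Arrays.copyOf(store, store.length + size);
        int m = store.length;
        for (int k = 0; k < size; k++) {
            long h = bucket[pos[k]];
            boolean repeatedInBucket = k > 0 && bucket[pos[k - 1]] == h;
            // A disk-based sieve would merge sequentially; binary search keeps the sketch short.
            if (!repeatedInBucket && Arrays.binarySearch(store, h) < 0) {
                isNew[pos[k]] = true;
                merged[m++] = h;
            }
        }
        store = Arrays.copyOf(merged, m);
        Arrays.sort(store);

        // Replay the auxiliary list in arrival order, so output order matches input order.
        for (int i = 0; i < size; i++) if (isNew[i]) output.add(aux.get(i));
        aux.clear();
        size = 0;
    }

    List<String> output() { return output; }

    public static void main(String[] args) {
        ToySieve sieve = new ToySieve(4);
        for (String k : new String[] {"b", "a", "b", "c", "a", "d"}) sieve.enqueue(k);
        sieve.flush();
        System.out.println(sieve.output()); // prints [b, a, c, d]
    }
}
```

Sorting positions rather than the hashes themselves lets the sketch remember where each hash came from, which is what makes it possible to emit new keys in their original order after the comparison.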
Nested classes/interfaces inherited from class AbstractSieve: AbstractSieve.DefaultUpdateStrategy<K,V>, AbstractSieve.DiskNewFlow<T>, AbstractSieve.NewFlowReceiver<K>, AbstractSieve.SieveEntry<K,V>, AbstractSieve.UpdateStrategy<K,V>

Fields inherited from class AbstractSieve: CHAR_SEQUENCE_HASHING_STRATEGY, hashingStrategy, keySerDeser, newFlowReceiver, updateStrategy, valueSerDeser

| Constructor | Description |
|---|---|
| MercatorSieve(boolean sieveIsNew, java.io.File sieveDir, int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize, AbstractSieve.NewFlowReceiver<K> newFlowReceiver, ByteSerializerDeserializer<K> keySerDeser, ByteSerializerDeserializer<V> valueSerDeser, it.unimi.dsi.sux4j.mph.AbstractHashFunction<K> hashingStrategy, AbstractSieve.UpdateStrategy<K,V> updateStrategy) | Creates a new Mercator-like sieve. |
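
The constructor's documented sizing parameters (sieveSize in longs, a store buffer allocated twice during flushes, an auxiliary-file buffer that is always allocated plus one more during flushes) suggest a back-of-the-envelope memory estimate. The helper below is a rough sketch only: the class and method names are hypothetical, it assumes the bucket occupies exactly sieveSize longs in memory, and it ignores any other structures the real implementation may allocate.

```java
/** Hypothetical helper: rough buffer footprint implied by the documented constructor parameters. */
final class SieveMemoryEstimate {

    /** Estimated bytes held outside of flushes: the bucket plus one auxiliary-file buffer. */
    static long steadyStateBytes(int sieveSize, int auxFileIOBufferSize) {
        long bucket = (long) sieveSize * Long.BYTES;   // sieveSize is documented as a size in longs
        return bucket + auxFileIOBufferSize;           // auxiliary buffer is "always allocated"
    }

    /** Estimated peak bytes during a flush. */
    static long peakBytesDuringFlush(int sieveSize, int storeIOBufferSize, int auxFileIOBufferSize) {
        long bucket = (long) sieveSize * Long.BYTES;
        long store = 2L * storeIOBufferSize;           // store buffer is allocated twice during flushes
        long aux = 2L * auxFileIOBufferSize;           // the permanent buffer plus one more during flushes
        return bucket + store + aux;
    }

    public static void main(String[] args) {
        // Example figures (assumptions): 2^20 longs in the bucket, 64 KiB for each I/O buffer.
        System.out.println(steadyStateBytes(1 << 20, 64 * 1024));                 // prints 8454144
        System.out.println(peakBytesDuringFlush(1 << 20, 64 * 1024, 64 * 1024));  // prints 8650752
    }
}
```

With these example figures the bucket dominates, so sieveSize is the parameter to adjust first when memory is tight.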
| Modifier and Type | Method | Description |
|---|---|---|
| void | close() | Closes (forever) this sieve. |
| boolean | enqueue(K key, V value) | Add the given (key,value) pair to the store. |
| void | flush() | Forces the check+update of all pairs that have been enqueued. |
| int | numberOfItems() | |

Methods inherited from class AbstractSieve: setNewFlowRecevier

public MercatorSieve(boolean sieveIsNew,
java.io.File sieveDir,
int sieveSize,
int storeIOBufferSize,
int auxFileIOBufferSize,
AbstractSieve.NewFlowReceiver<K> newFlowReceiver,
ByteSerializerDeserializer<K> keySerDeser,
ByteSerializerDeserializer<V> valueSerDeser,
it.unimi.dsi.sux4j.mph.AbstractHashFunction<K> hashingStrategy,
AbstractSieve.UpdateStrategy<K,V> updateStrategy)
throws java.io.IOException
Parameters:
sieveIsNew - whether we are creating a new sieve or opening an old one.
sieveDir - a directory for storing the sieve files.
sieveSize - the size of the sieve in longs.
storeIOBufferSize - the size in bytes of the buffer used to read and write the hash store during flushes (allocated twice during flushes).
auxFileIOBufferSize - the size in bytes of the buffer used to read and write the auxiliary file (always allocated; another allocation happens during flushes).
newFlowReceiver - a receiver for the flow of new keys.
keySerDeser - a serializer/deserializer for keys.
valueSerDeser - a serializer/deserializer for values.
hashingStrategy - a hashing strategy for keys.
updateStrategy - the strategy used to update the values associated with duplicate keys.
Throws:
java.io.IOException

public void close()
throws java.io.IOException
Description copied from class: AbstractSieve
Closes (forever) this sieve.
close in interface java.lang.AutoCloseable
close in interface java.io.Closeable
close in class AbstractSieve<K,V>
Throws:
java.io.IOException

public boolean enqueue(K key, V value) throws java.io.IOException, java.lang.InterruptedException
Description copied from class: AbstractSieve
Add the given (key,value) pair to the store.
enqueue in class AbstractSieve<K,V>
Throws:
java.io.IOException
java.lang.InterruptedException

public int numberOfItems()

public void flush()
throws java.io.IOException
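Description copied from class: AbstractSieve
Forces the check+update of all pairs that have been enqueued.
flush in class AbstractSieve<K,V>
Throws:
java.io.IOException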