---
viewer: false
license: 
- apache-2.0
language:
- en
---

**Model Summary**

To make GneissWeb reproducible, we provide a [Bloom filter](https://dl.acm.org/doi/10.1145/362686.362692) that represents the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and belongs to the [rbloom](https://github.com/KenanHanke/rbloom) family of Bloom filters. Probe it with the values of the id column of FineWeb 1.1.0 or of Common Crawl.
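As a rough illustration of probing the filter, the sketch below loads an rbloom filter and keeps only the ids it reports as present. Several pieces are assumptions rather than details from this card: the file name `gneissweb.bloom` is hypothetical, and `hash_func` is simply the example hash from the rbloom README. A Bloom filter only answers correctly when probed with the same hash function it was built with, so check the GneissWeb page for the one actually used. The helper names `load_filter` and `keep_gneissweb_ids` are made up for this example.

```python
from hashlib import sha256
from pickle import dumps
from typing import Container, Iterable, List


def hash_func(obj) -> int:
    # ASSUMPTION: this is the example hash from the rbloom README, not
    # necessarily the hash used to build the GneissWeb filter.
    digest = sha256(dumps(obj)).digest()
    # rbloom expects a signed 128-bit integer.
    return int.from_bytes(digest[:16], "big", signed=True)


def load_filter(path: str):
    # Imported lazily so the helper below can be used without rbloom installed.
    from rbloom import Bloom

    # Loading a saved rbloom filter requires passing the hash function again.
    return Bloom.load(path, hash_func)


def keep_gneissweb_ids(ids: Iterable[str], bloom: Container) -> List[str]:
    # Keep the ids the filter reports as members, preserving input order.
    # Bloom filters can return false positives but never false negatives.
    return [doc_id for doc_id in ids if doc_id in bloom]
```

With the filter downloaded, usage would look like `bloom = load_filter("gneissweb.bloom")` followed by `keep_gneissweb_ids(fineweb_ids, bloom)`, where `fineweb_ids` is the id column read from a FineWeb 1.1.0 parquet file. Because `keep_gneissweb_ids` only relies on the `in` operator, it can be tried out with a plain Python `set` before loading the 28 GB filter.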

Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details.

**Developers**: IBM Research

**Release Date**: Feb 21st, 2025

**License**: Apache 2.0.

**Testing**

The Bloom filter was tested with:

   Positive examples: ~10M UUIDs from 192 parquet files in GneissWeb, spanning all 96 snapshots.

   Negative examples: 10,000 UUIDs from CC-MAIN-2024-51 (present in neither FineWeb 1.1.0 nor GneissWeb).

The Bloom filter returned the correct answer for every one of them.
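A check of this kind can be sketched as follows. This is not the script used for the test above, only a minimal illustration; `count_errors` is a hypothetical helper, and `bloom` can be any object that supports the `in` operator, such as a loaded rbloom filter.

```python
from typing import Container, Iterable, Tuple


def count_errors(
    bloom: Container,
    positives: Iterable[str],
    negatives: Iterable[str],
) -> Tuple[int, int]:
    # Ids known to be in GneissWeb that the filter fails to report.
    false_negatives = sum(1 for doc_id in positives if doc_id not in bloom)
    # Ids known to be outside GneissWeb that the filter reports anyway.
    false_positives = sum(1 for doc_id in negatives if doc_id in bloom)
    return false_negatives, false_positives
```

A correctly probed Bloom filter never produces false negatives, so a nonzero first count usually points to a mismatched hash function or id format rather than to the filter itself. A small number of false positives is possible by design.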