---
viewer: false
license:
- apache-2.0
language:
- en
---
**Model Summary**
To make GneissWeb reproducible, we provide a [Bloom filter](https://dl.acm.org/doi/10.1145/362686.362692) that encodes the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and is an [rbloom](https://github.com/KenanHanke/rbloom) Bloom filter. Probe it with values from the id column of FineWeb 1.1.0 or of Common Crawl.
Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details.
     **Developers**: IBM Research
     **Release Date**: Feb 21st, 2025
     **License**: Apache 2.0.
**Testing**
The Bloom filter was tested with:
   Positive examples: ~10M UUIDs from 192 parquet files in GneissWeb, spanning all 96 snapshots.
   Negative examples: 10,000 UUIDs from CC-MAIN-2024-51 (present in neither FineWeb 1.1.0 nor GneissWeb).
The Bloom filter returned the correct answer for every example.
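
The kind of check described above can be sketched with a toy, standard-library Bloom filter. This is only an illustration of the positive/negative probing idea: `ToyBloom`, `make_ids`, the filter parameters, and the synthetic ids are all hypothetical, and the code does not use rbloom or reproduce the hashing of the released 28 GB filter.

```python
import hashlib
import uuid


class ToyBloom:
    """Tiny Bloom filter over strings. Illustrative only; this is not rbloom."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive the bit positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def make_ids(prefix: str, n: int) -> list[str]:
    # Hypothetical stand-ins for document ids (deterministic UUID strings).
    return [str(uuid.uuid5(uuid.NAMESPACE_URL, f"{prefix}/{i}")) for i in range(n)]


positives = make_ids("kept", 1000)      # ids that were added to the filter
negatives = make_ids("dropped", 1000)   # ids that were never added

bloom = ToyBloom(num_bits=1 << 16, num_hashes=7)
for doc_id in positives:
    bloom.add(doc_id)

# A Bloom filter never reports an added id as absent.
false_negatives = sum(doc_id not in bloom for doc_id in positives)
# It may, with small probability, report an absent id as present.
false_positives = sum(doc_id in bloom for doc_id in negatives)

print(f"false negatives: {false_negatives}")
print(f"false positives: {false_positives}")
```

Positive examples must always be found, because a Bloom filter has no false negatives. Negative examples can occasionally be reported as present, so a test on negatives measures the false-positive rate and does not guarantee exact membership.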