---
viewer: false
license:
- apache-2.0
language:
- en
---
**Model Summary**
To make GneissWeb reproducible, we provide a [Bloom filter](https://dl.acm.org/doi/10.1145/362686.362692) representing the ids of all FineWeb 1.1.0 documents that are part of GneissWeb. The filter is 28 GB in size and is an [rbloom](https://github.com/KenanHanke/rbloom) Bloom filter. Probe it with the id column of FineWeb 1.1.0 or of Common Crawl.
Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details.
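
The probing step described above can be sketched as follows. This is a minimal illustration, not the GneissWeb pipeline: it relies only on the `in` membership test, which rbloom's `Bloom` objects support, and uses a plain Python `set` as a stand-in so the sketch runs without the 28 GB file. The function name `keep_gneissweb_ids` and the sample ids are hypothetical. Loading the real filter with rbloom also requires the hash function it was built with, which this page does not specify.

```python
from typing import Container, Iterable, Iterator


def keep_gneissweb_ids(ids: Iterable[str], bloom: Container[str]) -> Iterator[str]:
    """Yield only the ids that the filter reports as present.

    `bloom` can be any object supporting `in`; the real filter would be
    an rbloom Bloom object loaded from the released file (assumption).
    """
    for doc_id in ids:
        if doc_id in bloom:
            yield doc_id


# Stand-in for the real filter: a set, so the sketch is runnable anywhere.
# These ids are made up for illustration and are not real FineWeb ids.
stand_in_filter = {"id-a", "id-c"}
candidate_ids = ["id-a", "id-b", "id-c", "id-d"]

kept = list(keep_gneissweb_ids(candidate_ids, stand_in_filter))
print(kept)
```

Because a Bloom filter can return false positives but never false negatives, every GneissWeb id passes the probe, while a small fraction of non-GneissWeb ids may pass as well.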
**Developers**: IBM Research
**Release Date**: Feb 21st, 2025
**License**: Apache 2.0.
**Testing**
The Bloom filter was tested with:
Positive examples: ~10M UUIDs from 192 parquet files in GneissWeb, spanning all 96 snapshots.
Negative examples: 10,000 UUIDs from CC-MAIN-2024-51 (present in neither FineWeb 1.1.0 nor GneissWeb).
The Bloom filter returned the correct answer for every example.
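
A check of this kind can be sketched as below. This is an assumed reconstruction of the test's shape, not the script that was actually used: the helper `count_errors` is hypothetical, and a `set` again stands in for the real filter, which would be probed the same way through `in`.

```python
from typing import Container, Iterable, Tuple


def count_errors(
    bloom: Container[str],
    positives: Iterable[str],
    negatives: Iterable[str],
) -> Tuple[int, int]:
    """Return (false_negatives, false_positives) for a membership filter."""
    # A positive example that the filter rejects is a false negative.
    false_negatives = sum(1 for x in positives if x not in bloom)
    # A negative example that the filter accepts is a false positive.
    false_positives = sum(1 for x in negatives if x in bloom)
    return false_negatives, false_positives


# Made-up ids for illustration only.
stand_in = {"pos-1", "pos-2", "pos-3"}
errors = count_errors(stand_in, ["pos-1", "pos-2", "pos-3"], ["neg-1", "neg-2"])
print(errors)
```

A result of `(0, 0)` corresponds to the outcome reported above: no positive example was rejected and no negative example was accepted.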