---
viewer: false
license:
- apache-2.0
language:
- en
---

**Model Summary**

To make GneissWeb reproducible, we provide a [Bloom filter](https://dl.acm.org/doi/10.1145/362686.362692) that represents the ids of all FineWeb 1.1.0 documents included in GneissWeb. The filter is 28 GB in size and was built with the [rbloom](https://github.com/KenanHanke/rbloom) library. Probe it with the `id` column of FineWeb 1.1.0 or of Common Crawl.
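
Probing the filter could look roughly like the sketch below. It is a minimal illustration, not the authors' pipeline: the file name `gneissweb.bloom`, the helper names `load_filter` and `keep_gneissweb_rows`, and in particular `hash_func` are assumptions. rbloom requires `Bloom.load` to be given the same hash function that was used when the filter was built; the one shown here follows the example in the rbloom README and should be replaced by the function documented on the GneissWeb page if it differs.

```python
from hashlib import sha256
from pickle import dumps


def hash_func(obj):
    """ASSUMED hash function; it must match the one used to build the filter."""
    digest = sha256(dumps(obj)).digest()
    # rbloom expects a signed 128-bit integer from a custom hash function.
    return int.from_bytes(digest[:16], "big", signed=True)


def load_filter(path):
    """Load a saved rbloom filter from disk (needs enough RAM for 28 GB)."""
    # Imported lazily so the rest of this sketch runs without rbloom installed.
    from rbloom import Bloom

    return Bloom.load(path, hash_func)


def keep_gneissweb_rows(rows, bloom):
    """Yield only the rows whose 'id' the filter reports as present."""
    for row in rows:
        if row["id"] in bloom:
            yield row
```

With a hypothetical local copy of the filter, usage would be along the lines of `bloom = load_filter("gneissweb.bloom")` followed by `kept = list(keep_gneissweb_rows(fineweb_rows, bloom))`, where `fineweb_rows` is any iterable of dicts that carry the FineWeb `id` field.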

Please refer to the [GneissWeb](https://huggingface.co/datasets/ibm-granite/GneissWeb) page for more details.

**Developers**: IBM Research

**Release Date**: Feb 21st, 2025

**License**: Apache 2.0.

**Testing**

The Bloom filter was tested with:

- Positive examples: ~10M UUIDs from 192 parquet files in GneissWeb, spanning all 96 snapshots.
- Negative examples: 10,000 UUIDs from CC-MAIN-2024-51 (present in neither FineWeb 1.1.0 nor GneissWeb).

The Bloom filter returned the correct answer for every example.
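
A check of this kind can be sketched as follows. This is not the authors' test harness; `evaluate_filter` and `rule_of_three_upper_bound` are hypothetical helper names, and `bloom` can be any object that supports the `in` operator. A Bloom filter never produces false negatives by construction, so the positive examples mainly confirm that the ids and the hash function match, while the negative examples estimate the false positive rate. When zero false positives are observed in n independent probes, the statistical "rule of three" gives roughly 3/n as a 95% upper bound on the true rate.

```python
def evaluate_filter(bloom, positive_ids, negative_ids):
    """Return (false_negatives, false_positives) for the given id samples."""
    # Ids that should be in the filter but are reported absent.
    false_negatives = sum(1 for i in positive_ids if i not in bloom)
    # Ids that should be absent but are reported present.
    false_positives = sum(1 for i in negative_ids if i in bloom)
    return false_negatives, false_positives


def rule_of_three_upper_bound(n_probes):
    """Approximate 95% upper bound on an event rate after 0 events in n probes."""
    return 3 / n_probes
```

For the 10,000 negative examples reported above, `rule_of_three_upper_bound(10_000)` suggests the false positive rate is unlikely to exceed about 0.03%, assuming the probes are independent.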