MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 13, 2024 • 22
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach Paper • 2405.15613 • Published May 24, 2024 • 18
A Touch, Vision, and Language Dataset for Multimodal Alignment Paper • 2402.13232 • Published Feb 20, 2024 • 16
How Do Large Language Models Acquire Factual Knowledge During Pretraining? Paper • 2406.11813 • Published Jun 17, 2024 • 32
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published Jun 17, 2024 • 55
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs Paper • 2406.11833 • Published Jun 17, 2024 • 64
From Pixels to Prose: A Large Dataset of Dense Image Captions Paper • 2406.10328 • Published Jun 14, 2024 • 18
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published Jun 17, 2024 • 21
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images Paper • 2406.13735 • Published Jun 19, 2024 • 5
Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models Paper • 2406.14599 • Published Jun 20, 2024 • 17
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28, 2024 • 105
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published Jun 25, 2024 • 8
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation Paper • 2407.02371 • Published Jul 2, 2024 • 55
TabReD: A Benchmark of Tabular Machine Learning in-the-Wild Paper • 2406.19380 • Published Jun 27, 2024 • 51
Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge Paper • 2407.03958 • Published Jul 4, 2024 • 22
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions Paper • 2407.06358 • Published Jul 8, 2024 • 19
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes Paper • 2407.10957 • Published Jul 15, 2024 • 25
YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus Paper • 2407.11144 • Published Jul 15, 2024 • 10
VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks Paper • 2407.19795 • Published Jul 29, 2024 • 11
Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Paper • 2408.00205 • Published Aug 1, 2024 • 6
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation Paper • 2408.02629 • Published Aug 5, 2024 • 15
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine Paper • 2408.02900 • Published Aug 6, 2024 • 31
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond Paper • 2408.03900 • Published Aug 7, 2024 • 10
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models Paper • 2408.04594 • Published Aug 8, 2024 • 15
VGGHeads: A Large-Scale Synthetic Dataset for 3D Human Heads Paper • 2407.18245 • Published Jul 25, 2024 • 11
MovieSum: An Abstractive Summarization Dataset for Movie Screenplays Paper • 2408.06281 • Published Aug 12, 2024 • 9
InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning Paper • 2408.07089 • Published Aug 9, 2024 • 14
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning Paper • 2408.08441 • Published Aug 15, 2024 • 8
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images Paper • 2408.16176 • Published Aug 28, 2024 • 8
ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution Paper • 2408.15993 • Published Aug 28, 2024 • 8
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts Paper • 2409.00447 • Published Aug 31, 2024 • 2
HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation Paper • 2407.17438 • Published Jul 24, 2024 • 27
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation Paper • 2411.04709 • Published Nov 5, 2024 • 27
Improving the detection of technical debt in Java source code with an enriched dataset Paper • 2411.05457 • Published Nov 8, 2024 • 2
GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models Paper • 2411.05830 • Published Nov 5, 2024 • 22
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Paper • 2411.07461 • Published Nov 12, 2024 • 24
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation Paper • 2411.08380 • Published Nov 13, 2024 • 27
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 57
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper • 2411.14794 • Published Nov 22, 2024 • 13
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation Paper • 2412.00927 • Published Dec 1, 2024 • 29
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information Paper • 2412.00947 • Published Dec 1, 2024 • 8
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation Paper • 2412.03304 • Published Dec 4, 2024 • 21
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing Paper • 2412.04280 • Published Dec 5, 2024 • 14
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale Paper • 2412.05237 • Published Dec 6, 2024 • 48
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 14
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations Paper • 2412.08580 • Published Dec 11, 2024 • 46
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation Paper • 2412.07147 • Published Dec 10, 2024 • 5
VisionArena: 230K Real World User-VLM Conversations with Preference Labels Paper • 2412.08687 • Published Dec 11, 2024 • 13
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training Paper • 2501.08197 • Published Jan 14 • 9
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models Paper • 2501.09653 • Published Jan 16 • 12
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Paper • 2501.15907 • Published Jan 27 • 17
OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas Paper • 2501.15427 • Published Jan 26 • 6
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training Paper • 2501.18511 • Published Jan 30 • 20
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation Paper • 2502.02589 • Published Feb 4 • 10
Generating Multi-Image Synthetic Data for Text-to-Image Customization Paper • 2502.01720 • Published Feb 3 • 8
Expect the Unexpected: FailSafe Long Context QA for Finance Paper • 2502.06329 • Published Feb 10 • 132
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Paper • 2502.07870 • Published Feb 11 • 45
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper • 2502.10391 • Published Feb 14 • 35
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation Paper • 2502.13270 • Published Feb 18 • 6
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation Paper • 2503.01739 • Published Mar 3 • 8
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions Paper • 2503.00501 • Published Mar 1 • 12
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding Paper • 2503.02951 • Published Mar 4 • 32
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published Mar 10 • 100
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing Paper • 2503.06940 • Published Mar 10 • 11
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation Paper • 2503.06053 • Published Mar 8 • 138
ELTEX: A Framework for Domain-Driven Synthetic Data Generation Paper • 2503.15055 • Published Mar 19 • 6
TeleAntiFraud-28k: A Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection Paper • 2503.24115 • Published Mar 31 • 12
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Paper • 2504.05535 • Published Apr 7 • 45
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning Paper • 2504.11456 • Published Apr 15 • 13
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning Paper • 2504.09081 • Published Apr 12 • 17
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published Apr 17 • 93
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space Paper • 2504.13835 • Published Apr 18 • 38
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs Paper • 2504.14655 • Published Apr 20 • 20
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining Paper • 2504.16511 • Published Apr 23 • 20
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training Paper • 2505.00358 • Published May 1 • 26
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models Paper • 2505.22523 • Published May 28 • 7
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis Paper • 2506.02096 • Published Jun 2 • 51
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL Paper • 2506.02338 • Published Jun 3 • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5 • 46
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery Paper • 2506.05673 • Published Jun 6 • 10
CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models Paper • 2506.07463 • Published Jun 9 • 10
Phantom-Data : Towards a General Subject-Consistent Video Generation Dataset Paper • 2506.18851 • Published Jun 23 • 29
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation Paper • 2506.18095 • Published Jun 22 • 65
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset Paper • 2507.03483 • Published Jul 4 • 23
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5 • 41
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data Paper • 2507.07095 • Published Jul 9 • 54
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation Paper • 2507.09862 • Published Jul 14 • 49
A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models Paper • 2507.13563 • Published Jul 17 • 51
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning Paper • 2507.16812 • Published Jul 22 • 62
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper • 2507.16746 • Published Jul 22 • 34
FACTORY: A Challenging Human-Verified Prompt Set for Long-Form Factuality Paper • 2508.00109 • Published Jul 31 • 4
Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences Paper • 2508.03542 • Published Aug 5 • 4
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation Paper • 2508.09987 • Published 26 days ago • 25
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers Paper • 2508.21148 • Published 11 days ago • 135
TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis Paper • 2508.13618 • Published 21 days ago • 17
Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions? Paper • 2509.04292 • Published 4 days ago • 49