# LMMS-Eval v0.5: Multimodal Expansion Release

## Introduction

LMMS-Eval v0.5 represents a significant expansion in multimodal evaluation capabilities, introducing comprehensive audio understanding support alongside continued vision and reasoning enhancements.

## Table of Contents

- [Introduction](#introduction)
- [Major Features](#major-features)
  - [1. Response Caching System](#1-response-caching-system)
  - [2. Audio Evaluation Suite](#2-audio-evaluation-suite)
  - [3. New Model Support](#3-new-model-support)
  - [4. New Benchmarks](#4-new-benchmarks)
  - [5. Model Context Protocol (MCP) Integration](#5-model-context-protocol-mcp-integration)
  - [6. Async OpenAI Improvements](#6-async-openai-improvements)
- [Usage Examples](#usage-examples)
- [Technical Details](#technical-details)
- [Migration Guide](#migration-guide)
- [Bug Fixes and Improvements](#bug-fixes-and-improvements)
- [Deprecated Features](#deprecated-features)
- [Contributing](#contributing)
- [Acknowledgments](#acknowledgments)
- [Getting Help](#getting-help)

## Major Features

### 1. Response Caching System

A production-ready JSONL-based caching system that dramatically speeds up re-evaluation and reduces API costs.

**Key Features:**

- **Per-document caching**: Cached at the `(task_name, doc_id)` level
- **Distributed-safe**: Separate cache files per rank/world size
- **Zero-overhead**: Automatic cache hits with no code changes
- **Multi-backend**: Works with async OpenAI, vLLM, and custom models

**Enable Caching:**

```bash
export LMMS_EVAL_USE_CACHE=True
export LMMS_EVAL_HOME="/path/to/cache_root"  # optional

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-2024-11-20,base_url=$OPENAI_API_BASE \
    --tasks mmmu_val \
    --batch_size 1 \
    --output_path ./logs/
```

**Cache Location:**

- Default: `~/.cache/lmms-eval/eval_cache/<model_hash>/{task_name}_rank{rank}_world_size{world_size}.jsonl`
- Each line: `{"doc_id": <doc_id>, "response": <response>}`

**API Integration:**

```python
def generate_until(self, requests):
    # Load any existing cache for this task/rank
    self.load_cache()
    # Split requests into cache hits and cache misses
    cached, pending = self.get_response_from_cache(requests)
    results = [c["response"] for c in cached]
    for req in pending:
        out = call_backend(req)  # backend-specific inference call (placeholder)
        self.add_request_response_to_cache(req, out)
        results.append(out)
    return results
```

See full documentation in `docs/caching.md`.
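Because the cache is plain JSONL, it can be inspected with a few lines of Python. The snippet below is a minimal sketch (not part of lmms-eval) that scans the default cache root and reports how many documents each per-rank file has cached; adjust the path if `LMMS_EVAL_HOME` points elsewhere.

```python
import json
from pathlib import Path

# Default cache root; override if LMMS_EVAL_HOME was set to a different location.
cache_root = Path.home() / ".cache" / "lmms-eval" / "eval_cache"

# Files follow {task_name}_rank{rank}_world_size{world_size}.jsonl,
# with one {"doc_id": ..., "response": ...} object per line.
for cache_file in sorted(cache_root.glob("*/*_rank*_world_size*.jsonl")):
    with cache_file.open() as f:
        doc_ids = {json.loads(line)["doc_id"] for line in f if line.strip()}
    print(f"{cache_file.relative_to(cache_root)}: {len(doc_ids)} cached documents")
```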
### 2. Audio Evaluation Suite

Comprehensive audio understanding capabilities with three major benchmark families.

#### Step2 Audio Paralinguistic (11 tasks)

Fine-grained paralinguistic feature evaluation:

- **Acoustic Features**: pitch, rhythm, speed, voice_tone, voice_styles
- **Speaker Attributes**: age, gender, emotions
- **Environmental**: scene, event, vocalsound
- Semantic match metrics

```bash
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks step2_audio_paralinguistic \
    --batch_size 1
```

#### VoiceBench (9 main categories, 30+ subtasks)

Comprehensive voice and speech evaluation:

- **Instruction Following**: ifeval, alpacaeval, advbench
- **Reasoning**: bbh (Big Bench Hard), commoneval
- **Knowledge**: mmsu (13 subject areas: biology, chemistry, physics, etc.)
- **Q&A**: openbookqa
- **Accent Diversity**: sd-qa (11 regional variants: USA, UK, India, Australia, etc.)
- **Expressiveness**: wildvoice
- Metrics vary by task type, including accuracy, 1-5 ratings, failure rate, and LLM-based evaluation

```bash
# Full VoiceBench
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks voicebench \
    --batch_size 1

# Specific accent evaluation
python -m lmms_eval \
    --tasks voicebench_sd-qa_ind_n,voicebench_sd-qa_ind_s \
    --batch_size 1
```

#### WenetSpeech (2 splits)

Large-scale ASR and speech evaluation:

- **dev**: Development set for validation
- **test_meeting**: Meeting domain evaluation
- MER (Mixed Error Rate) metrics

```bash
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks wenet_speech_dev,wenet_speech_test_meeting \
    --batch_size 1
```

**Audio Pipeline Features:**

- HuggingFace audio dataset integration
- Unified audio message format
- Multiple metric support (Accuracy, WER, GPT-4 Judge); a minimal WER sketch follows this list
- Task grouping for multi-subset benchmarks
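To make the ASR-style metrics concrete, the sketch below computes a plain word error rate (substitutions, insertions, and deletions over reference length) via edit distance. This is an illustration only; the `word_error_rate` name is ours, and the actual MER metric for WenetSpeech mixes character- and word-level units and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("turn the lights off", "turn lights of"))  # 0.5
```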
### 3. New Model Support

Five new model integrations expanding audio and vision capabilities:

| Model | Type | Key Features | Usage Example |
|-------|------|--------------|---------------|
| **GPT-4o Audio Preview** | Audio+Text | Paralinguistic understanding, multi-turn audio | `--model async_openai --model_args model_version=gpt-4o-audio-preview-2024-12-17` |
| **Gemma-3** | Vision+Text | Enhanced video handling, efficient architecture | `--model gemma3 --model_args pretrained=google/gemma-3-2b-vision-it` |
| **LLaVA-OneVision 1.5** | Vision+Text | Improved vision understanding, latest LLaVA | `--model llava_onevision1_5 --model_args pretrained=lmms-lab/llava-onevision-1.5-7b` |
| **LongViLA-R1** | Video+Text | Long-context video, efficient video processing | `--model longvila --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B` |
| **Thyme** | Vision+Text | Reasoning-focused, enhanced image handling | `--model thyme --model_args pretrained=thyme-ai/thyme-7b` |

**Example Usage:**

```bash
# GPT-4o Audio Preview for audio tasks
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks step2_audio_paralinguistic,voicebench \
    --batch_size 1

# LongViLA for video understanding
python -m lmms_eval \
    --model longvila \
    --model_args pretrained=Efficient-Large-Model/LongViLA-R1-7B \
    --tasks videomme,egoschema \
    --batch_size 1
```

### 4. New Benchmarks

Beyond audio, v0.5 adds diverse vision and reasoning benchmarks, significantly expanding LMMS-Eval's coverage into specialized domains.

#### Vision & Reasoning Benchmarks

| Benchmark | Variants | Focus | Metrics |
|-----------|----------|-------|---------|
| **CSBench** | 3 (MCQ, Assertion, Combined) | Code understanding, debugging | Accuracy |
| **SciBench** | 4 (Math, Physics, Chemistry, Combined) | College-level STEM | GPT-4 Judge, Accuracy |
| **MedQA** | 1 | Medical question answering | Accuracy |
| **SuperGPQA** | 1 | Graduate-level science Q&A | Accuracy |
| **Lemonade** | 1 | Video action recognition | Accuracy |
| **CharXiv** | 3 (Descriptive, Reasoning, Combined) | Scientific chart interpretation | Accuracy, GPT-4 Judge |

**Example Usage:**

```bash
# Code understanding
python -m lmms_eval --tasks csbench --batch_size 1

# STEM reasoning
python -m lmms_eval --tasks scibench --batch_size 1

# Chart reasoning
python -m lmms_eval --tasks charxiv --batch_size 1
```

#### Reproducibility Validation

We validated our benchmark implementations against official results using two popular language models. The table below compares lmms-eval scores with officially reported results to demonstrate reproducibility:

| Model | Task | lmms-eval | Reported | Δ | Status |
|-------|------|-----------|----------|-------|--------|
| **Qwen-2.5-7B-Instruct** | MedQA | 53.89 | 54.28 | -0.39 | ✓ |
| | SciBench | 43.86 | 42.97 | +0.89 | ✓ |
| | CSBench | 69.01 | 69.51 | -0.50 | ✓ |
| | SuperGPQA | 29.24 | 28.78 | +0.46 | ✓ |
| **Llama-3.1-8B** | MedQA | 64.49 | 67.01 | -2.52 | ✓ |
| | SciBench | 15.35 | 10.78 | +4.57 | ± |
| | CSBench | 62.49 | 57.87 | +4.62 | ± |
| | SuperGPQA | 21.94 | 19.72 | +2.22 | ✓ |

**Status Legend**: ✓ = Strong agreement (Δ ≤ 2.5%) | ± = Acceptable variance (2.5% < Δ ≤ 5%)

### 5. Model Context Protocol (MCP) Integration

Support for MCP-enabled models with tool calling:

```bash
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-2024-11-20,mcp_server_path=/path/to/mcp_server.py \
    --tasks mmmu_val \
    --batch_size 1
```

**Features:**

- Tool call parsing and execution
- Multi-step reasoning with tools
- Custom MCP server integration
- See `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for templates

### 6. Async OpenAI Improvements

Enhanced async API integration:

- Better rate limit handling
- Configurable retry logic with delays (illustrated in the sketch after this section)
- Improved error handling
- Batch size optimization for OpenAI-compatible endpoints

**Common Args Support:**

```bash
# Additional sampling parameters are now supported via --model_args
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o,temperature=0.7,top_p=0.95,max_tokens=2048 \
    --tasks mmstar
```
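The retry behavior can be pictured with a generic async wrapper. This is a hedged sketch of the pattern only, not the `async_openai` implementation; the function and parameter names (`call_with_retries`, `max_retries`, `base_delay`) are illustrative, and the real backend exposes its own retry settings.

```python
import asyncio
import random


async def call_with_retries(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry an async API call with exponential backoff and jitter.

    `call` is any zero-argument coroutine function (e.g. a wrapped
    chat-completion request). Illustrative sketch only.
    """
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:  # e.g. rate limits or transient network errors
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0.0, 0.5)
            await asyncio.sleep(delay)
```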
## Usage Examples

### Audio Evaluation with Caching

```bash
# Enable caching for expensive audio API calls
export LMMS_EVAL_USE_CACHE=True
export OPENAI_API_KEY="your-key"

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks step2_audio_paralinguistic,voicebench \
    --batch_size 8 \
    --output_path ./audio_results/ \
    --log_samples

# Second run will use the cache - much faster!
```

### Multi-Benchmark Evaluation

```bash
# Evaluate across audio, vision, and reasoning tasks
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-2024-11-20 \
    --tasks voicebench_mmsu,csbench,scibench_math,charxiv \
    --batch_size 4 \
    --output_path ./multimodal_results/
```

### Distributed Evaluation with Caching

```bash
export LMMS_EVAL_USE_CACHE=True

torchrun --nproc_per_node=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args pretrained=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks step2_audio_paralinguistic,csbench,scibench \
    --batch_size 16 \
    --output_path ./distributed_results/
```

### Programmatic API with Caching

```python
import os

from lmms_eval.evaluator import simple_evaluate
from lmms_eval.models.chat.async_openai import AsyncOpenAICompatibleChat

# Enable caching
os.environ["LMMS_EVAL_USE_CACHE"] = "True"

model = AsyncOpenAICompatibleChat(
    model_version="gpt-4o-audio-preview-2024-12-17",
    base_url="https://api.openai.com/v1",
)

results = simple_evaluate(
    model=model,
    tasks=["voicebench", "step2_audio_paralinguistic"],
    batch_size=8,
    device="cuda",
)

print(f"Results: {results['results']}")
```

## Technical Details

### Caching Architecture

**Design Philosophy:**

- **Simplicity**: JSONL format for easy inspection and debugging
- **Distributed-safe**: Per-rank files avoid write contention
- **Transparent**: No code changes needed for models using the API

**Cache Key:** `(task_name, doc_id)`

- Stable across runs if task and document IDs don't change
- Model hash derived from `model_version` and the task list

**File Structure:**

```
~/.cache/lmms-eval/eval_cache/
└── <model_hash>/
    ├── task1_rank0_world_size2.jsonl
    ├── task1_rank1_world_size2.jsonl
    └── task2_rank0_world_size1.jsonl
```

**Performance:**

- Initial run: full model inference
- Cached run: ~100x faster (I/O bound only)
- Distributed: linear scaling with cache hits

### Audio Processing Pipeline

**Data Flow:**

1. Load HuggingFace audio datasets
2. Convert to a unified message format with audio URLs
3. Process through audio-capable models
4. Apply task-specific metrics (WER, accuracy, GPT-4 judge)
5. Aggregate across task groups

**Message Format:**

```python
{
    "role": "user",
    "content": [
        {"type": "audio", "url": "path/to/audio.wav"},
        {"type": "text", "text": "Question about the audio"},
    ],
}
```

### Model Context Protocol

MCP enables models to call external tools during evaluation:

- Custom server implementation
- Tool definition and parsing
- Multi-step reasoning with tool results
- Compatible with OpenAI-style function calling (see the sketch below)
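As an illustration of the OpenAI-style format that tools are surfaced through, the snippet below defines a single hypothetical tool. The tool name (`crop_image`) and its parameters are placeholders, not part of lmms-eval or any MCP server; see `examples/chat_templates/tool_call_qwen2_5_vl.jinja` for how tool calls are rendered in chat templates.

```python
# Hedged sketch: an OpenAI-style function/tool definition. "crop_image" and its
# parameters are hypothetical placeholders used only for illustration.
tools = [
    {
        "type": "function",
        "function": {
            "name": "crop_image",
            "description": "Crop a rectangular region from the input image and return it.",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "integer", "description": "Left edge in pixels"},
                    "y": {"type": "integer", "description": "Top edge in pixels"},
                    "width": {"type": "integer"},
                    "height": {"type": "integer"},
                },
                "required": ["x", "y", "width", "height"],
            },
        },
    }
]
```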
## Migration Guide

### From v0.4 to v0.5

**No Breaking Changes**: v0.5 is fully backward compatible with v0.4.

**New Features to Adopt:**

1. **Enable Caching for API Models:**

   ```bash
   # Add this environment variable
   export LMMS_EVAL_USE_CACHE=True
   ```

2. **Use New Audio Models:**

   ```bash
   # GPT-4o Audio Preview
   --model async_openai \
   --model_args model_version=gpt-4o-audio-preview-2024-12-17
   ```

3. **Leverage New Benchmarks:**

   ```bash
   # Add audio, code, and STEM benchmarks
   --tasks step2_audio_paralinguistic,voicebench,csbench,scibench
   ```

4. **Optimize Async OpenAI Calls:**

   ```python
   # Use additional parameters for better control
   model_args = "model_version=gpt-4o,temperature=0.7,max_tokens=2048"
   ```

### Updating Existing Workflows

**Before (v0.4):**

```bash
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-2024-08-06 \
    --tasks mmmu_val \
    --batch_size 1
```

**After (v0.5 with caching):**

```bash
export LMMS_EVAL_USE_CACHE=True

python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-2024-11-20 \
    --tasks mmmu_val,voicebench,csbench \
    --batch_size 8  # Higher batch size with caching
```

## Bug Fixes and Improvements

### Fixed Issues

1. **`write_out` Flag Deprecated**: The `--write_out` flag is now deprecated in favor of `--log_samples`

   ```bash
   # Old (deprecated)
   --write_out

   # New
   --log_samples
   ```

2. **TypeError in `write_out` with `log_samples`**: Fixed crash when using both flags together
3. **Batch Size in OpenAI Endpoint**: Corrected batch size handling for OpenAI-compatible servers
4. **Gemma-3 Loading**: Fixed model loading to use `Gemma3ForConditionalGeneration` correctly
5. **SRT API Bugfix**: Resolved issues in subtitle/caption processing
6. **CharXiv Improvements**: Fixed chart understanding task configurations
7. **Async OpenAI Caching Order**: Corrected cache lookup order to avoid unnecessary API calls

### Performance Improvements

- **10-100x speedup** on cached evaluations
- **Better async handling** for API-based models
- **Reduced memory usage** in distributed settings
- **Faster audio dataset loading** from HuggingFace

## Deprecated Features

### Deprecated Flags

- **`--write_out`**: Use `--log_samples` instead

  ```bash
  # Deprecated
  python -m lmms_eval --write_out

  # Use instead
  python -m lmms_eval --log_samples
  ```

### Model Notes

- Models should implement the caching API for best performance
- Legacy simple models continue to work but miss out on caching benefits
- See `lmms_eval.api.model.lmms` for caching integration

## Contributing

We welcome contributions to LMMS-Eval! The v0.5 release demonstrates the value of community contributions across models, benchmarks, and infrastructure.

### High-Priority Areas for v0.5.x

1. **Audio Model Integrations**: Help add support for more audio-capable models
2. **Audio Benchmark Implementations**: Expand audio evaluation coverage
3. **Caching Optimizations**: Improve cache hit rates and performance
4. **Documentation**: Enhance guides and examples for audio evaluation
5. **MCP Server Examples**: Create reference implementations for tool calling

### How to Contribute

1. **Fork the repository** and create a feature branch from `dev/v0d5`
2. **Follow the development guidelines** in `CLAUDE.md`:
   - Use `uv` for package management (never pip)
   - Add type hints and docstrings
   - Run `uv run ruff format .` and `uv run ruff check . --fix`
   - Run `uv run pyright` for type checking
3. **Test thoroughly**:
   - Add tests for new features
   - Verify caching works if implementing a model
   - Test with realistic datasets
4. **Submit a pull request** with a clear description

### Adding New Audio Benchmarks

Follow the pattern in existing audio tasks:

```python
# In tasks/your_audio_task/utils.py
def doc_to_messages(doc):
    return [{
        "role": "user",
        "content": [
            {"type": "audio", "url": doc["audio_path"]},
            {"type": "text", "text": doc["question"]},
        ],
    }]
```

See `lmms_eval/tasks/step2_audio_paralinguistic/` and `lmms_eval/tasks/voicebench/` for examples.
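A task also needs a scoring hook. The sketch below is a hypothetical exact-match scorer following the `process_results(doc, results)` convention used by lmms-eval tasks; the `doc["answer"]` field and the `exact_match` metric key are placeholders, and real audio tasks typically use WER, semantic match, or an LLM judge instead, wired up through their YAML configs.

```python
# In tasks/your_audio_task/utils.py (hypothetical companion to doc_to_messages)
def process_results(doc, results):
    # results[0] is the model's generated answer for this document
    prediction = results[0].strip().lower()
    target = str(doc["answer"]).strip().lower()  # "answer" field is a placeholder
    return {"exact_match": 1.0 if prediction == target else 0.0}
```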
### Adding Caching to Custom Models

Implement the caching API in your model's `generate_until`:

```python
class MyModel(lmms):
    def generate_until(self, requests):
        # Load any existing cache
        self.load_cache()

        # Separate cached vs pending requests
        cached, pending = self.get_response_from_cache(requests)

        # Process pending requests and cache their responses
        pending_responses = []
        for req in pending:
            response = self.my_inference_logic(req)
            self.add_request_response_to_cache(req, response)
            pending_responses.append(response)

        return [c["response"] for c in cached] + pending_responses
```

See `lmms_eval/models/chat/async_openai.py` for a complete example.

## Acknowledgments

The v0.5 release was made possible by contributions from the LMMS-Eval community.

### Core Contributors

- **Audio Evaluation Suite**: Implementation of the Step2 Audio Paralinguistic, VoiceBench, and WenetSpeech benchmarks
- **Caching Infrastructure**: Design and implementation of the JSONL caching system
- **Model Integrations**: Support for GPT-4o Audio Preview, Gemma-3, LLaVA-OneVision 1.5, LongViLA-R1, and Thyme
- **Benchmark Additions**: CSBench, SciBench, Lemonade, and CharXiv implementations
- **MCP Integration**: Model Context Protocol client and tool calling support
- **Bug Fixes**: Numerous fixes to async OpenAI, batch handling, and model loading

### Special Thanks

- Community members who reported issues and provided feedback
- Contributors who improved documentation and examples
- Researchers who shared benchmark datasets and evaluation protocols

## Getting Help

### Documentation

- **Main README**: `README.md` - Quick start and overview
- **Model Guide**: `docs/model_guide.md` - Adding new models
- **Task Guide**: `docs/task_guide.md` - Implementing new benchmarks
- **Caching Guide**: `docs/caching.md` - Detailed caching documentation
- **Commands**: `docs/commands.md` - CLI reference

### Support Channels

- **GitHub Issues**: Report bugs or request features at [lmms-eval/issues](https://github.com/EvolvingLMMs-Lab/lmms-eval/issues)
- **GitHub Discussions**: Ask questions and share ideas at [lmms-eval/discussions](https://github.com/EvolvingLMMs-Lab/lmms-eval/discussions)
- **Documentation**: Check the `docs/` directory for implementation guides

### FAQs

**Q: How do I enable caching?**

```bash
export LMMS_EVAL_USE_CACHE=True
```

**Q: Where are cache files stored?**

```bash
~/.cache/lmms-eval/eval_cache/<model_hash>/
```

**Q: How do I evaluate audio models?**

```bash
python -m lmms_eval \
    --model async_openai \
    --model_args model_version=gpt-4o-audio-preview-2024-12-17 \
    --tasks step2_audio_paralinguistic,voicebench
```

**Q: Can I use caching with distributed evaluation?**

Yes! Caching works seamlessly with multi-GPU/multi-node evaluation. Each rank maintains its own cache file.

**Q: What's the difference between `--write_out` and `--log_samples`?**

`--write_out` is deprecated. Use `--log_samples` to save individual sample results.

---

**Version**: 0.5.0
**Release Date**: October 2025
**Previous Version**: [v0.4 Release Notes](lmms-eval-0.4.md)