This repository contains the DataMind-Qwen2.5-7B model, which was presented in the paper Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study.

Paper Abstract: Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities.

For more details, visit the official DataMind GitHub repository.

✨ DataMind

🔧 Installation

🔩 Manual Environment Configuration

Conda virtual environments offer a lightweight and flexible setup.

Prerequisites

  • Anaconda Installation
  • GPU support (recommended CUDA version: 12.4)

Configuration Steps

  1. Clone the repository:
git clone https://github.com/zjunlp/DataMind.git
  2. Enter the working directory. Run all subsequent commands from this directory.
cd DataMind/eval
  3. Create and activate a virtual environment using Anaconda:
conda create -n DataMind python=3.10
conda activate DataMind
  4. Install all required Python packages:
pip install -r requirements.txt
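
After the steps above, a quick sanity check of the environment can save time before loading a 7B model. The following is a minimal sketch, not part of the DataMind repository: the helper name `check_environment` and the minimum version `(3, 10)` (taken from the `conda create` command above) are assumptions made for illustration.

```python
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report on the interpreter and, if installed, PyTorch/CUDA."""
    report = {
        "python": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= min_python,
        "torch": None,           # filled in only when PyTorch can be imported
        "cuda_available": None,  # filled in only when PyTorch can be imported
    }
    try:
        import torch  # optional here; only inspected if it is installed
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` or `None`, the GPU prerequisite listed above is probably not satisfied in the active environment.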

Usage (Text Generation for Data Analysis)

You can use this model with the Hugging Face transformers library for text generation, particularly for data analysis and code generation tasks.

First, ensure the required libraries are installed (accelerate is needed for device_map="auto"):

pip install transformers torch accelerate

Then, you can load and use the model as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "zjunlp/DataMind-Qwen2.5-7B" # Or zjunlp/DataMind-Qwen2.5-14B, if available

# Load the model and tokenizer
# Use torch_dtype=torch.bfloat16 for better performance on compatible GPUs
# Use device_map="auto" to automatically distribute the model across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Example: Generate Python code for data analysis
messages = [
    {"role": "user", "content": "I have a CSV file named 'sales_data.csv' with columns 'Date', 'Product', 'Quantity', 'Price'. Write Python code using pandas to calculate the total revenue for each product and save it to a new CSV file named 'product_revenue.csv'."}
]

# Apply chat template for Qwen models
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,  # Pass input_ids together with the attention mask
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id, # Ensure generation stops at EOS token
)

# Decode only the newly generated tokens (the prompt is sliced off) and print the text
response = tokenizer.decode(generated_ids[0][len(model_inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
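
For a code-generation prompt like the one above, the response usually mixes explanation with code. The sketch below pulls out the Python snippets, assuming the model wraps code in Markdown-style fences; that output format is an assumption, not something this model card specifies, and `extract_python_blocks` is a hypothetical helper name.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

# Matches: opening fence, optional language tag, newline, body, closing fence.
CODE_BLOCK = re.compile(FENCE + r"([\w+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_blocks(response):
    """Return the bodies of fenced blocks that are untagged or tagged as Python."""
    blocks = []
    for lang, body in CODE_BLOCK.findall(response):
        if lang.lower() in ("", "python", "py"):
            blocks.append(body.strip())
    return blocks
```

Calling `extract_python_blocks(response)` on the decoded text returns a list of code strings. Review generated code before running it, since the model's output is not guaranteed to be correct or safe.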

🧐 Evaluation

Note:

  • Ensure that the virtual environment is activated and your working directory is set to the eval folder.
  • If you have further questions, feel free to open an issue.
  • To use a local model, deploy it first as described in local_model.sh.

Step 1: Prepare the parameter configuration

The evaluation datasets we used are in QRData and DiscoveryBench. The script expects data to be at data/QRData/benchmark/data/*.csv and data/DiscoveryBench/*.csv.
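
Before running the evaluation, it may help to confirm that the CSV files are where the script expects them. This is a minimal sketch based only on the two path patterns stated above; `count_dataset_files` and `EXPECTED_PATTERNS` are hypothetical names, and the relative `data` root assumes the script is run from the eval folder.

```python
from pathlib import Path

# Glob patterns relative to the data root, copied from the paths stated above.
EXPECTED_PATTERNS = {
    "QRData": "QRData/benchmark/data/*.csv",
    "DiscoveryBench": "DiscoveryBench/*.csv",
}


def count_dataset_files(data_root):
    """Count the CSV files found for each dataset under data_root."""
    root = Path(data_root)
    return {
        name: len(list(root.glob(pattern)))
        for name, pattern in EXPECTED_PATTERNS.items()
    }


if __name__ == "__main__":
    for name, count in count_dataset_files("data").items():
        status = "ok" if count else "MISSING"
        print(f"{name}: {count} CSV file(s) [{status}]")
```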

You can also download our SFT models directly from Hugging Face: DataMind-Qwen2.5-7B, DataMind-Qwen2.5-14B.

Here is an example configuration:

config.yaml

api_key: your_api_key # API key for models accessed through an API service. Not needed for open-source models.
data_root: /path/to/your/project/DataMind/eval/data # Root directory for data (absolute path).
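
Since data_root must be an absolute path, a small check can catch a relative path early. The sketch below is illustrative only: `validate_config` and `load_config` are hypothetical helpers, reading the file with PyYAML is an assumption about tooling, and the repository's own loading code may differ.

```python
import os


def validate_config(cfg):
    """Return a list of problems found in a parsed config.yaml mapping."""
    problems = []
    data_root = cfg.get("data_root")
    if data_root is None:
        problems.append("missing key: data_root")
    elif not os.path.isabs(str(data_root)):
        problems.append(f"data_root must be an absolute path, got: {data_root}")
    # api_key is not checked: the config comment says open-source models do not need it.
    return problems


def load_config(path="config.yaml"):
    """Parse the YAML file into a dict (requires PyYAML to be installed)."""
    import yaml  # imported lazily so validate_config works without PyYAML

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}
```

For example, `validate_config(load_config())` returns an empty list when the configuration passes these checks.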

run_eval.sh

# --model_name    Model name to use.
# --check_model   Check model to use.
# --output        Output directory path.
# --dataset_name  Dataset name to use, chosen from QRData, DiscoveryBench.
# --max_round     Maximum number of steps.
# --api_port      API port number; required if a local model is used.
# --bidx          Begin index (inclusive); `None` means no restriction.
# --eidx          End index (exclusive); `None` means no restriction.
# --temperature   Temperature for sampling.
# --top_p         Top p for sampling.
# --add_random    Whether to add random files.
python do_generate.py \
  --model_name DataMind-Qwen2.5-7B \
  --check_model gpt-4o-mini \
  --output results \
  --dataset_name QRData \
  --max_round 25 \
  --api_port 8000 \
  --bidx 0 \
  --eidx None \
  --temperature 0.0 \
  --top_p 1 \
  --add_random False
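
The begin and end indices follow the usual half-open convention: the begin index is inclusive, the end index is exclusive, and `None` leaves that side unbounded. The sketch below illustrates these semantics with Python slicing; `parse_index` and `select_range` are hypothetical names, and how do_generate.py actually parses these flags is not shown in this README.

```python
def parse_index(value):
    """Convert a CLI string such as "0" or "None" into an int or None."""
    if value is None or str(value).strip().lower() == "none":
        return None
    return int(value)


def select_range(items, bidx="None", eidx="None"):
    """Return items[bidx:eidx]: begin inclusive, end exclusive, None = unbounded."""
    return items[parse_index(bidx):parse_index(eidx)]


if __name__ == "__main__":
    tasks = [f"task-{i}" for i in range(10)]
    # Tasks 2, 3 and 4 are selected; task 5 is excluded.
    print(select_range(tasks, "2", "5"))
```

Splitting a benchmark into chunks this way (for example 0 to 50, then 50 to 100) covers every task exactly once, because each end index is the next chunk's begin index.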

(Optional) local_model.sh

# --model                 Local model path.
# --served-model-name     The model name specified by you.
# --tensor-parallel-size  Size of tensor parallel processing.
# --port                  API port number, consistent with `api_port` above.
CUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \
  --model $MODEL_PATH \
  --served-model-name $MODEL_NAME \
  --tensor-parallel-size $i \
  --port $port

Step 2: Run the shell script

(Optional) Deploy the local model if needed.

bash local_model.sh
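
Once the server is up, a single request can confirm that it responds before starting a full evaluation run. The following sketch targets the OpenAI-compatible chat completions endpoint that vLLM serves at `/v1/chat/completions`. The helper names, the `localhost` host, and the default port 8000 are assumptions; the model name must match the value passed to --served-model-name.

```python
import json
import urllib.request


def build_chat_request(model_name, prompt, port=8000, temperature=0.0, top_p=1.0):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    url = f"http://localhost:{port}/v1/chat/completions"  # host is an assumption
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, `send_chat_request(build_chat_request("DataMind-Qwen2.5-7B", "Say hello."))` should return a short reply if the deployment succeeded.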

Run the shell script to start the process.

bash run_eval.sh

✍️ Citation

If you find our work helpful, please cite it as follows.

@article{zhu2025open,
  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},
  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},
  journal={arXiv preprint arXiv:2506.19794},
  year={2025}
}
