BigDong commited on
Commit
f37d55b
·
1 Parent(s): 84eb0ec

update readme and modeling model

Browse files
Files changed (2) hide show
  1. README.md +326 -0
  2. modeling_minicpm.py +277 -565
README.md CHANGED
@@ -18,3 +18,329 @@ library_name: transformers
18
  <p align="center">
19
  👋 Contact us in <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
20
  </p>
21
+
22
+ ## What's New
23
+ - [2025.09.05] The **MiniCPM4.1** series is released! It is a hybrid reasoning model that can be used in
24
+ both deep reasoning mode and non-reasoning mode. 🔥🔥🔥
25
+ - [2025.06.06] The **MiniCPM4** series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report [here](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf). 🔥🔥🔥
26
+
27
+ ## MiniCPM4 and MiniCPM4.1 Series
28
+ The MiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs) designed explicitly for end-side devices. They achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
29
+ - [MiniCPM4.1-8B](https://huggingface.co/openbmb/MiniCPM4.1-8B): The latest version of MiniCPM4, with 8B parameters, supporting hybrid (fusion) reasoning. (**<-- you are here**)
30
+ - [MiniCPM4.1-8B-GPTQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-GPTQ): MiniCPM4.1-8B in GPTQ format.
31
+ - [MiniCPM4.1-8B-AutoAWQ](https://huggingface.co/openbmb/MiniCPM4.1-8B-AutoAWQ): MiniCPM4.1-8B in AutoAWQ format.
32
+ - [MiniCPM-4.1-8B-Marlin](https://huggingface.co/openbmb/MiniCPM-4.1-8B-Marlin): MiniCPM4.1-8B in Marlin format.
33
+ - [MiniCPM4.1-8B-GGUF](https://huggingface.co/openbmb/MiniCPM4.1-8B-GGUF): MiniCPM4.1-8B in GGUF format.
34
+ - [MiniCPM4.1-8B-MLX](https://huggingface.co/openbmb/MiniCPM4.1-8B-MLX): MiniCPM4.1-8B in MLX format.
35
+ - [MiniCPM4.1-8B-Eagle3](https://huggingface.co/openbmb/MiniCPM4.1-8B-Eagle3): Eagle3 model for MiniCPM4.1-8B.
36
+ - **MiniCPM4 Series**
37
+ <details>
38
+ <summary>Click to expand all MiniCPM4 series models</summary>
39
+
40
+ - [**MiniCPM4-8B**](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship model with 8B parameters, trained on 8T tokens
41
+ - [**MiniCPM4-0.5B**](https://huggingface.co/openbmb/MiniCPM4-0.5B): Lightweight version with 0.5B parameters, trained on 1T tokens
42
+ - [**MiniCPM4-8B-Eagle-FRSpec**](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference
43
+ - [**MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu**](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head with QAT for FRSpec, integrating speculation and quantization for ultra acceleration
44
+ - [**MiniCPM4-8B-Eagle-vLLM**](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format for speculative inference
45
+ - [**MiniCPM4-8B-marlin-Eagle-vLLM**](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head for vLLM format
46
+ - [**BitCPM4-0.5B**](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization of MiniCPM4-0.5B, achieving 90% bit width reduction
47
+ - [**BitCPM4-1B**](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization of MiniCPM3-1B, achieving 90% bit width reduction
48
+ - [**MiniCPM4-Survey**](https://huggingface.co/openbmb/MiniCPM4-Survey): Generates trustworthy, long-form survey papers from user queries
49
+ - [**MiniCPM4-MCP**](https://huggingface.co/openbmb/MiniCPM4-MCP): Integrates MCP tools to autonomously satisfy user requirements
50
+ </details>
51
+
52
+ ## Introduction
53
+ MiniCPM4 and MiniCPM4.1 are extremely efficient edge-side large models that have undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
54
+
55
+ - 🏗️ **Efficient Model Architecture:**
56
+ - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts
57
+
58
+ - 🧠 **Efficient Learning Algorithms:**
59
+ - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search
60
+ - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to three (ternary) values, achieving a 90% reduction in model bit width
61
+ - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy
62
+
63
+ - 📚 **High-Quality Training Data:**
64
+ - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [UltraFineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb)
65
+ - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data
66
+
67
+ - ⚡ **Efficient Inference System:**
68
+ - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding
69
+ - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities
70
+
71
+ ## Usage
72
+
73
+ ### Inference with [CPM.cu](https://github.com/OpenBMB/cpm.cu)
74
+
75
+ We recommend using [CPM.cu](https://github.com/OpenBMB/cpm.cu) for the inference of MiniCPM4 and MiniCPM4.1. CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse attention, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.
76
+
77
+ You can install CPM.cu by running the following command:
78
+
79
+ ```bash
80
+ git clone https://github.com/OpenBMB/cpm.cu.git --recursive
81
+ cd cpm.cu
82
+ python3 setup.py install
83
+ ```
84
+
85
+ MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. To reproduce the long-text acceleration effect reported in the paper, we recommend using the validated LongRoPE factors. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
86
+ ```json
87
+ {
88
+ ...,
89
+ "rope_scaling": {
90
+ "rope_type": "longrope",
91
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
92
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
93
+ "original_max_position_embeddings": 32768
94
+ }
95
+ }
96
+ ```
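+
+ If you prefer to apply this change programmatically, the snippet below is a minimal sketch that patches `config.json` in place. The local checkpoint path is an assumption; point it at wherever the MiniCPM4.1-8B weights were downloaded, and paste in the full factor lists shown above.
+ ```python
+ import json
+ from pathlib import Path
+
+ # Hypothetical local path to the downloaded MiniCPM4.1-8B checkpoint.
+ config_path = Path("MiniCPM4.1-8B/config.json")
+
+ # Copy the complete long_factor / short_factor lists from the JSON snippet above;
+ # only the first two values are shown here to keep the sketch short.
+ long_factor = [0.9977997200264581, 1.014658295992452]
+ short_factor = list(long_factor)
+
+ config = json.loads(config_path.read_text())
+ config["rope_scaling"] = {
+     "rope_type": "longrope",
+     "long_factor": long_factor,
+     "short_factor": short_factor,
+     "original_max_position_embeddings": 32768,
+ }
+ config_path.write_text(json.dumps(config, indent=2))
+ print("rope_scaling written to", config_path)
+ ```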
97
+
98
+ After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from HuggingFace):
99
+ ```bash
100
+ python3 tests/test_generate.py
101
+ ```
102
+
103
+ For more details about CPM.cu, please refer to [the repo CPM.cu](https://github.com/OpenBMB/cpm.cu).
104
+
105
+ ### Hybrid Reasoning Mode
106
+
107
+ MiniCPM4.1 supports a hybrid reasoning mode and can run in both deep reasoning mode and non-reasoning mode. To choose between them, set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable reasoning mode, or `enable_thinking=False` to enable non-reasoning mode. Alternatively, you can append `\no_think` to the end of the query to enable non-reasoning mode; if no special suffix is added, or `\think` is appended, the model runs in reasoning mode.
108
+
109
+ ```python
110
+ # Enable reasoning mode
111
+ prompt_text = tokenizer.apply_chat_template(
112
+ messages,
113
+ tokenize=False,
114
+ add_generation_prompt=True,
115
+ enable_thinking=True
116
+ )
117
+ # Enable non-reasoning mode
118
+ prompt_text = tokenizer.apply_chat_template(
119
+ messages,
120
+ tokenize=False,
121
+ add_generation_prompt=True,
122
+ enable_thinking=False
123
+ )
124
+ ```
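+
+ The query-suffix switch described above can be used instead of the template flag. The sketch below appends the non-reasoning suffix exactly as written in this README (`\no_think`); note that the backslash must be escaped in Python source.
+ ```python
+ # Non-reasoning mode via the query suffix instead of enable_thinking=False.
+ suffix = "\\no_think"  # literal backslash + "no_think", as described above
+ messages = [
+     {"role": "user", "content": "Write an article about Artificial Intelligence. " + suffix},
+ ]
+ prompt_text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ ```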
125
+
126
+ ### Inference with Transformers
127
+ ```python
128
+ from transformers import AutoModelForCausalLM, AutoTokenizer
129
+ import torch
130
+ torch.manual_seed(0)
131
+
132
+ path = 'openbmb/MiniCPM4.1-8B'
133
+ device = "cuda"
134
+ tokenizer = AutoTokenizer.from_pretrained(path)
135
+ model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
136
+
137
+ # User can directly use the chat interface
138
+ # responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
139
+ # print(responds)
140
+
141
+ # User can also use the generate interface
142
+ messages = [
143
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
144
+ ]
145
+ prompt_text = tokenizer.apply_chat_template(
146
+ messages,
147
+ tokenize=False,
148
+ add_generation_prompt=True,
149
+ )
150
+ model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
151
+
152
+ model_outputs = model.generate(
153
+ **model_inputs,
154
+ max_new_tokens=8192,
155
+ top_p=0.7,
156
+ temperature=0.7
157
+ )
158
+ output_token_ids = [
159
+ model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
160
+ ]
161
+
162
+ responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
163
+ print(responses)
164
+ ```
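+
+ When reasoning mode is enabled, the decoded text may start with the model's reasoning segment before the final answer. The helper below is a minimal post-processing sketch; it assumes the reasoning is wrapped in `<think> ... </think>` tags, so check the tokenizer's chat template to confirm the exact markers before relying on it.
+ ```python
+ # Split the decoded output into a reasoning segment and the final answer.
+ # Assumption: reasoning is delimited by <think> ... </think>; adjust the
+ # markers if the chat template uses something different.
+ def split_reasoning(text: str):
+     if "</think>" in text:
+         reasoning, answer = text.split("</think>", 1)
+         return reasoning.replace("<think>", "").strip(), answer.strip()
+     return "", text.strip()
+
+ reasoning, answer = split_reasoning(responses)
+ print("Reasoning (truncated):", reasoning[:200])
+ print("Answer:", answer)
+ ```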
165
+
166
+ MiniCPM4.1-8B supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.
167
+
168
+ You can install it by running the following command:
169
+ ```bash
170
+ git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
171
+ cd infllmv2_cuda_impl
172
+ git submodule update --init --recursive
173
+ pip install -e . # or python setup.py install
174
+ ```
175
+
176
+ To enable InfLLM v2, you need to add the `sparse_config` field in `config.json`:
177
+ ```json
178
+ {
179
+ ...,
180
+ "sparse_config": {
181
+ "kernel_size": 32,
182
+ "kernel_stride": 16,
183
+ "init_blocks": 1,
184
+ "block_size": 64,
185
+ "window_size": 2048,
186
+ "topk": 64,
187
+ "use_nope": false,
188
+ "dense_len": 8192
189
+ }
190
+ }
191
+ ```
192
+
193
+ These parameters control the behavior of InfLLM v2 (a back-of-envelope sketch of their combined effect follows this list):
194
+ * `kernel_size` (default: 32): The size of semantic kernels.
195
+ * `kernel_stride` (default: 16): The stride between adjacent kernels.
196
+ * `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
197
+ * `block_size` (default: 64): The block size for key-value blocks.
198
+ * `window_size` (default: 2048): The size of the local sliding window.
199
+ * `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
200
+ * `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
201
+ * `dense_len` (default: 8192): Since Sparse Attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. Set this to `-1` to always use sparse attention regardless of sequence length.
202
+
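+ As a rough illustration of what these defaults imply, the sketch below estimates how many key-value tokens a single query attends to under sparse attention (local window plus initial blocks plus top-k selected blocks). This is an approximation for intuition only, not the exact kernel behavior; overlapping blocks are counted once in the real implementation.
+ ```python
+ # Back-of-envelope estimate of attended tokens under the InfLLM v2 defaults.
+ def approx_attended_tokens(seq_len, window_size=2048, init_blocks=1, topk=64, block_size=64):
+     attended = window_size + (init_blocks + topk) * block_size
+     return min(attended, seq_len)
+
+ for seq_len in (8192, 65536, 131072):
+     attended = approx_attended_tokens(seq_len)
+     print(f"{seq_len:>7} tokens -> ~{attended} attended ({attended / seq_len:.1%})")
+ # At 131072 tokens this is roughly 6208 attended tokens (~4.7%), consistent with
+ # the "<5% of tokens at 128K" figure quoted earlier; below dense_len the model
+ # falls back to dense attention anyway.
+ ```
+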
203
+ MiniCPM4.1 natively supports context lengths of up to 65,536 (64K) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.
204
+
205
+ You can apply the LongRoPE factor modification by editing the model files. Specifically, adjust the `rope_scaling` field in the `config.json` file.
206
+ ```json
207
+ {
208
+ ...,
209
+ "rope_scaling": {
210
+ "rope_type": "longrope",
211
+ "long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
212
+ "short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
213
+ "original_max_position_embeddings": 32768
214
+ }
215
+ }
216
+ ```
217
+
218
+ ### Inference with [SGLang](https://github.com/sgl-project/sglang)
219
+
220
+ For now, you need to install our forked version of SGLang.
221
+ ```bash
222
+ git clone -b openbmb https://github.com/OpenBMB/sglang.git
223
+ cd sglang
224
+
225
+ pip install --upgrade pip
226
+ pip install -e "python[all]"
227
+ ```
228
+
229
+ You can start the inference server by running the following command:
230
+ ```bash
231
+ python -m sglang.launch_server --model openbmb/MiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml
232
+ ```
233
+
234
+ Then you can use the chat interface by running the following command:
235
+ ```python
236
+ import openai
237
+
238
+ client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
239
+
240
+ response = client.chat.completions.create(
241
+ model="openbmb/MiniCPM4.1-8B",
242
+ messages=[
243
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
244
+ ],
245
+ temperature=0.7,
246
+ max_tokens=8192,
247
+ )
248
+
249
+ print(response.choices[0].message.content)
250
+ ```
251
+
252
+ ### Inference with [vLLM](https://github.com/vllm-project/vllm)
253
+ For now, you need to install the latest version of vLLM.
254
+ ```bash
255
+ pip install -U vllm \
256
+ --pre \
257
+ --extra-index-url https://wheels.vllm.ai/nightly
258
+ ```
259
+
260
+ Then you can run inference on MiniCPM4.1-8B with vLLM:
261
+ ```python
262
+ from transformers import AutoTokenizer
263
+ from vllm import LLM, SamplingParams
264
+
265
+ model_name = "openbmb/MiniCPM4.1-8B"
266
+ prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
267
+
268
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
269
+ input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
270
+
271
+ llm = LLM(
272
+ model=model_name,
273
+ trust_remote_code=True,
274
+ max_num_batched_tokens=32768,
275
+ dtype="bfloat16",
276
+ gpu_memory_utilization=0.8,
277
+ )
278
+ sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
279
+
280
+ outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
281
+
282
+ print(outputs[0].outputs[0].text)
283
+ ```
284
+
285
+ Also, you can start the inference server by running the following command:
286
+ > **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={"add_special_tokens": True}`.
287
+
288
+ ```bash
289
+ vllm serve openbmb/MiniCPM4.1-8B
290
+ ```
291
+
292
+ Then you can use the chat interface by running the following code:
293
+
294
+ ```python
295
+ import openai
296
+
297
+ client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
298
+
299
+ response = client.chat.completions.create(
300
+ model="openbmb/MiniCPM4.1-8B",
301
+ messages=[
302
+ {"role": "user", "content": "Write an article about Artificial Intelligence."},
303
+ ],
304
+ temperature=0.7,
305
+ max_tokens=1024,
306
+ extra_body=dict(add_special_tokens=True), # Ensures special tokens are added for chat template
307
+
308
+ )
309
+
310
+ print(response.choices[0].message.content)
311
+ ```
312
+
313
+ ## Evaluation Results
314
+ On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 achieves approximately 7x decoding speed improvement.
315
+
316
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/efficiency.png?raw=true)
317
+
318
+ #### Comprehensive Evaluation
319
+ MiniCPM4.1 launches an end-side version at the 8B parameter scale, achieving best-in-class performance in its category.
320
+
321
+ ![benchmark](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/benchmark4.1.png?raw=true)
322
+
323
+ #### Long Text Evaluation
324
+ MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
325
+
326
+ ![long-niah](https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm4/128k-niah.png?raw=true)
327
+
328
+ ## Statement
329
+ - As a language model, MiniCPM generates content by learning from a vast amount of text.
330
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
331
+ - Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers.
332
+ - Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.
333
+
334
+ ## LICENSE
335
+ - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
336
+
337
+ ## Citation
338
+ - Please cite our [paper](https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf) if you find our work valuable.
339
+
340
+ ```bibtex
341
+ @article{minicpm4,
342
+ title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
343
+ author={MiniCPM Team},
344
+ year={2025}
345
+ }
346
+ ```
modeling_minicpm.py CHANGED
@@ -24,7 +24,7 @@ import torch.utils.checkpoint
24
  from torch import nn
25
  from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
26
  from transformers.activations import ACT2FN
27
- from transformers.cache_utils import Cache, DynamicCache
28
  from transformers.modeling_attn_mask_utils import (
29
  AttentionMaskConverter,
30
  _prepare_4d_attention_mask,
@@ -57,6 +57,7 @@ try:
57
  infllmv2_attn_varlen_func,
58
  infllmv2_attn_with_kvcache,
59
  max_pooling_1d,
 
60
  )
61
  except:
62
  pass
@@ -79,8 +80,7 @@ def compressed_attention(
79
  sm_scale: float = None,
80
  init_blocks: int = 1,
81
  local_blocks: int = 2,
82
- parallel_topk_compute: Union[str, bool] = 'auto',
83
- total_seq_lens=-1,
84
  ) -> Tuple[torch.Tensor, torch.Tensor]:
85
  """Attention between query and compressed key and value. Compute attention output and topk block idx used in topk_sparse_attention.
86
 
@@ -99,31 +99,32 @@ def compressed_attention(
99
  sm_scale (float, optional): softmax scale. Defaults to None, means 1/sqrt(head_dim).
100
  init_blocks (int, optional): Number of init blocks for each query. Defaults to 1.
101
  local_blocks (int, optional): Number of local blocks for each query. Defaults to 2.
102
- parallel_topk_compute (str, optional): Only set it to False when the sequence length is too long. This can avoid a current bug.
103
- We'll fix this issue later. Defaults to auto, it will be set to False when the sequence length is greater than 32k and True otherwise.
104
 
105
  Returns:
106
  Tuple[torch.Tensor, torch.Tensor]: attention output and topk_idx used in topk_sparse_attention
107
  """
108
  with torch.no_grad():
109
- cache_len = 0
110
  batch_size = cu_seqlens_q.shape[0] - 1
111
- if total_seq_lens == -1:
112
- total_seq_lens = max_seqlen_q
113
- q_idx = torch.cat(
114
- [
115
- torch.arange(cu_seqlens_q[i + 1] - cu_seqlens_q[i], device=q.device) + total_seq_lens - (cu_seqlens_q[i + 1] - cu_seqlens_q[i])
116
- for i in range(batch_size)
117
- ],
118
- dim=0,
119
- )
120
- q_idx = q_idx // block_size
121
-
 
 
 
122
  else:
123
- cache_len = total_seq_lens - max_seqlen_q
124
- assert batch_size == 1, 'batch_size must be 1 when total_seq_lens is set'
125
- q_idx = torch.tensor([total_seq_lens - 1], device=q.device, dtype=torch.int32) // block_size
126
 
 
127
  score = infllmv2_attn_stage1(
128
  q.contiguous(),
129
  k.contiguous(),
@@ -132,22 +133,27 @@ def compressed_attention(
132
  cu_seqlens_k=cu_seqlens_k,
133
  max_seqlen_q=max_seqlen_q,
134
  max_seqlen_k=max_seqlen_k,
135
- causal=q_idx.shape[0] > 1)
 
136
  score = score[:, :q_idx.shape[0], :]
137
 
138
- # Replace transform_score with max_pooling_1d
139
- block_score = max_pooling_1d(
140
  score.contiguous(),
141
- cache_len=cache_len,
 
 
 
 
142
  local_blocks=local_blocks,
143
  init_blocks=init_blocks,
144
  block_size=block_size,
145
- stride=kernel_stride,
146
- )
147
  # get topk
148
  topk = min(topk, block_score.shape[-1])
149
  topk_idx = block_score.topk(topk, dim=-1).indices.sort(-1).values
150
- topk_idx[topk_idx >= q_idx[None, :, None]] = -1
151
  topk_idx = topk_idx.to(torch.int32)
152
 
153
  return topk_idx
@@ -246,299 +252,89 @@ class CompressK(torch.nn.Module):
246
  return compressed_k, cu_seqlens_compressed
247
 
248
 
249
- class DynamicCacheQKV(DynamicCache):
250
- """
251
- A cache that grows dynamically as more tokens are generated. This is the default for generative models.
252
 
253
- It stores the Key and Value states as a list of tensors, one for each layer. The expected shape for each tensor is
254
- `[batch_size, num_heads, seq_len, head_dim]`.
255
-
256
- Example:
257
- ```python
258
- >>> from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
259
-
260
- >>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
261
- >>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
262
-
263
- >>> inputs = tokenizer(text="My name is Qwen2", return_tensors="pt")
264
-
265
- >>> # Prepare a cache class and pass it to model's forward
266
- >>> past_key_values = DynamicCache()
267
- >>> outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
268
- >>> outputs.past_key_values # access cache filled with key/values from generation
269
- DynamicCache()
270
- ```
271
- """
272
- def __init__(self, num_hidden_layers: Optional[int] = None) -> None:
273
  super().__init__()
274
- if num_hidden_layers is None:
275
- self.key_cache: List[torch.Tensor] = []
276
- self.value_cache: List[torch.Tensor] = []
277
- self.compress_k_cache: List[torch.Tensor] = []
278
- self.no_compress_k_cache: List[torch.Tensor] = []
279
- self.cached_compressed_cu_seqlens: List[torch.Tensor] = []
280
- self.no_rope_key_cache: List[torch.Tensor] = []
 
 
 
281
  else:
282
- self.key_cache: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
283
- self.value_cache: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
284
- self.compress_k_cache: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
285
- self.no_compress_k_cache: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
286
- self.cached_compressed_cu_seqlens: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
287
- self.no_rope_key_cache: List[torch.Tensor] = [[] for _ in range(num_hidden_layers)]
288
- self._seen_tokens = 0 # Used in `generate` to keep tally of how many tokens the cache has seen
289
-
290
- def __getitem__(self, layer_idx: int) -> List[Tuple[torch.Tensor]]:
291
- """
292
- Support for backwards-compatible `past_key_value` indexing, e.g. `past_key_value[0][0].shape[2]` to get the
293
- sequence length.
294
- """
295
- if layer_idx < len(self):
296
- return (self.key_cache[layer_idx], self.value_cache[layer_idx])
297
  else:
298
- raise KeyError(f'Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}')
299
-
300
- def __iter__(self):
301
- """
302
- Support for backwards-compatible `past_key_value` iteration, e.g. `for x in past_key_value:` to iterate over
303
- keys and values
304
- """
305
- for layer_idx in range(len(self)):
306
- yield (self.key_cache[layer_idx], self.value_cache[layer_idx])
307
-
308
- def __len__(self):
309
- """
310
- Support for backwards-compatible `past_key_value` length, e.g. `len(past_key_value)`. This value corresponds
311
- to the number of layers in the model.
312
- """
313
- return len(self.key_cache)
314
 
315
- def update(
316
- self,
317
- key_states: torch.Tensor,
318
- value_states: torch.Tensor,
319
- layer_idx: int,
320
- cache_kwargs: Optional[Dict[str, Any]] = None
321
- ) -> Tuple[torch.Tensor, torch.Tensor]:
322
- """
323
- Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.
324
 
325
- Parameters:
326
- key_states (`torch.Tensor`):
327
- The new key states to cache.
328
- value_states (`torch.Tensor`):
329
- The new value states to cache.
330
- layer_idx (`int`):
331
- The index of the layer to cache the states for.
332
- cache_kwargs (`Dict[str, Any]`, `optional`):
333
- Additional arguments for the cache subclass. No additional arguments are used in `DynamicCache`.
334
-
335
- Return:
336
- A tuple containing the updated key and value states.
337
- """
338
- # Update the number of seen tokens
339
  if layer_idx == 0:
340
  self._seen_tokens += key_states.shape[-2]
 
341
 
342
- # Update the cache
343
- if len(self.key_cache) <= layer_idx:
344
- self.key_cache.append(key_states)
345
- self.value_cache.append(value_states)
346
-
347
- # content on layer cache can be a tensor and checking not tensor causes errors
348
- # so we explicitly check for the empty list
349
- elif self.key_cache[layer_idx] == []:
350
- self.key_cache[layer_idx] = key_states
351
- self.value_cache[layer_idx] = value_states
352
-
353
- else:
354
- self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
355
- self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)
356
- return self.key_cache[layer_idx], self.value_cache[layer_idx]
357
-
358
- def update_no_rope_key(
359
- self,
360
- key_states: torch.Tensor,
361
- layer_idx: int,
362
- cache_kwargs: Optional[Dict[str, Any]] = None):
363
-
364
- # Update the cache
365
- if len(self.no_rope_key_cache) <= layer_idx:
366
- self.no_rope_key_cache.append(key_states)
367
-
368
- # content on layer cache can be a tensor and checking not tensor causes errors
369
- # so we explicitly check for the empty list
370
- elif self.no_rope_key_cache[layer_idx] == []:
371
- self.no_rope_key_cache[layer_idx] = key_states
372
- else:
373
- self.no_rope_key_cache[layer_idx] = torch.cat([self.no_rope_key_cache[layer_idx], key_states], dim=1)
374
- return self.no_rope_key_cache[layer_idx]
375
-
376
- def update_compress_k(
377
- self,
378
- key_states: torch.Tensor,
379
- layer_idx: int,
380
- cache_kwargs: Optional[Dict[str, Any]] = None
381
- ) -> Tuple[torch.Tensor, torch.Tensor]:
382
- """
383
- Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.
384
-
385
- Parameters:
386
- key_states (`torch.Tensor`):
387
- The new key states to cache.
388
- value_states (`torch.Tensor`):
389
- The new value states to cache.
390
- layer_idx (`int`):
391
- The index of the layer to cache the states for.
392
- cache_kwargs (`Dict[str, Any]`, `optional`):
393
- Additional arguments for the cache subclass. No additional arguments are used in `DynamicCache`.
394
-
395
- Return:
396
- A tuple containing the updated key and value states.
397
- """
398
 
399
- # Update the cache
400
- if len(self.compress_k_cache) <= layer_idx:
401
- self.compress_k_cache.append(key_states)
402
 
403
- # content on layer cache can be a tensor and checking not tensor causes errors
404
- # so we explicitly check for the empty list
405
- elif self.compress_k_cache[layer_idx] == []:
406
- self.compress_k_cache[layer_idx] = key_states
407
- else:
408
- self.compress_k_cache[layer_idx] = torch.cat([self.compress_k_cache[layer_idx], key_states], dim=0)
409
- return self.compress_k_cache[layer_idx]
410
 
411
- def update_no_compress_k(
412
- self,
413
- key_states: torch.Tensor,
414
- layer_idx: int,
415
- kernel_size: int = 32,
416
- kernel_stride: int = 16,
417
- cache_kwargs: Optional[Dict[str, Any]] = None
418
- ) -> Tuple[torch.Tensor, torch.Tensor]:
419
- """
420
- Updates the cache with the new `key_states` and `value_states` for the layer `layer_idx`.
421
 
422
- Parameters:
423
- key_states (`torch.Tensor`):
424
- The new key states to cache.
425
- value_states (`torch.Tensor`):
426
- The new value states to cache.
427
- layer_idx (`int`):
428
- The index of the layer to cache the states for.
429
- cache_kwargs (`Dict[str, Any]`, `optional`):
430
- Additional arguments for the cache subclass. No additional arguments are used in `DynamicCache`.
431
-
432
- Return:
433
- A tuple containing the updated key and value states.
434
- """
435
- # Update the cache
436
- if len(self.no_compress_k_cache) <= layer_idx:
437
- self.no_compress_k_cache.append(key_states)
438
-
439
- # content on layer cache can be a tensor and checking not tensor causes errors
440
- # so we explicitly check for the empty list
441
- elif self.no_compress_k_cache[layer_idx] == []:
442
- self.no_compress_k_cache[layer_idx] = key_states
443
- else:
444
- self.no_compress_k_cache[layer_idx] = torch.cat([self.no_compress_k_cache[layer_idx], key_states], dim=0)
445
 
446
- current_len = self.no_compress_k_cache[layer_idx].shape[0]
447
-
448
- if current_len >= kernel_size:
449
- k_chunk = self.no_compress_k_cache[layer_idx][:kernel_size]
450
- self.no_compress_k_cache[layer_idx] = self.no_compress_k_cache[layer_idx][kernel_stride:]
451
- return k_chunk
452
- else:
453
- return None
454
-
455
- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
456
- """Returns the sequence length of the cached states. A layer index can be optionally passed."""
457
- # TODO: deprecate this function in favor of `cache_position`
458
- if len(self.key_cache) <= layer_idx or (len(self.key_cache) > layer_idx and self.key_cache[layer_idx] == []):
459
- return 0
460
- return self.key_cache[layer_idx].shape[-2]
461
-
462
- def get_max_length(self) -> Optional[int]:
463
- """Returns the maximum sequence length of the cached states. DynamicCache does not have a maximum length."""
464
- return None
465
-
466
- def to_legacy_cache(self) -> Tuple[Tuple[torch.Tensor], Tuple[torch.Tensor]]:
467
- """Converts the `DynamicCache` instance into the its equivalent in the legacy cache format. Used for
468
- backward compatibility."""
469
- legacy_cache = ()
470
- for layer_idx in range(len(self)):
471
- legacy_cache += ((self.key_cache[layer_idx], self.value_cache[layer_idx]),)
472
- return legacy_cache
473
-
474
- # @classmethod
475
- # def from_legacy_cache(
476
- # cls, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, num_hidden_layers: int = None
477
- # ) -> "DynamicCacheQKV":
478
- # """Converts a cache in the legacy cache format into an equivalent `DynamicCache`. Used for
479
- # backward compatibility."""
480
- # cache = cls(num_hidden_layers)
481
- # if past_key_values is not None:
482
- # for layer_idx in range(len(past_key_values)):
483
- # key_states, value_states, query_status = past_key_values[layer_idx]
484
- # cache.update(key_states, value_states, query_status,layer_idx)
485
- # return cache
486
-
487
- def crop(self, max_length: int):
488
- """Crop the past key values up to a new `max_length` in terms of tokens. `max_length` can also be
489
- negative to remove `max_length` tokens. This is used in assisted decoding and contrastive search."""
490
- # In case it is negative
491
- if max_length < 0:
492
- max_length = self.get_seq_length() - abs(max_length)
493
-
494
- if self.get_seq_length() <= max_length:
495
- return
496
-
497
- self._seen_tokens = max_length
498
- for idx in range(len(self.key_cache)):
499
- if self.key_cache[idx] != []:
500
- self.key_cache[idx] = self.key_cache[idx][..., :max_length, :]
501
- self.value_cache[idx] = self.value_cache[idx][..., :max_length, :]
502
-
503
- def batch_split(self, full_batch_size: int, split_size: int, num_hidden_layers: int) -> List['DynamicCacheQKV']:
504
- """Split the current instance into a list of `DynamicCache` by the batch size. This will be used by
505
- `_split_model_inputs()` in `generation.utils`"""
506
- out = []
507
- for i in range(0, full_batch_size, split_size):
508
- current_split = DynamicCacheQKV(num_hidden_layers)
509
- current_split._seen_tokens = self._seen_tokens
510
- current_split.key_cache = [tensor[i: i + split_size] for tensor in self.key_cache]
511
- current_split.value_cache = [tensor[i: i + split_size] for tensor in self.value_cache]
512
- out.append(current_split)
513
- return out
514
-
515
- @classmethod
516
- def from_batch_splits(cls, splits: List['DynamicCacheQKV'], num_hidden_layers: int) -> 'DynamicCacheQKV':
517
- """This is the opposite of the above `batch_split()` method. This will be used by `stack_model_outputs` in
518
- `generation.utils`"""
519
- cache = cls(num_hidden_layers)
520
- for idx in range(len(splits[0])):
521
- key_cache = [current.key_cache[idx] for current in splits if current.key_cache[idx] != []]
522
- value_cache = [current.key_cache[idx] for current in splits if current.key_cache[idx] != []]
523
- query_cache = [current.key_cache[idx] for current in splits if current.key_cache[idx] != []]
524
- if key_cache != []:
525
- layer_keys = torch.cat(key_cache, dim=0)
526
- layer_values = torch.cat(value_cache, dim=0)
527
- layer_query = torch.cat(query_cache, dim=0)
528
- cache.update(layer_keys, layer_values, idx, query_states=layer_query)
529
- return cache
530
-
531
- def batch_repeat_interleave(self, repeats: int):
532
- """Repeat the cache `repeats` times in the batch dimension. Used in contrastive search."""
533
- for layer_idx in range(len(self)):
534
- self.key_cache[layer_idx] = self.key_cache[layer_idx].repeat_interleave(repeats, dim=0)
535
- self.value_cache[layer_idx] = self.value_cache[layer_idx].repeat_interleave(repeats, dim=0)
536
-
537
- def batch_select_indices(self, indices: torch.Tensor):
538
- """Only keep the `indices` in the batch dimension of the cache. Used in contrastive search."""
539
- for layer_idx in range(len(self)):
540
- self.key_cache[layer_idx] = self.key_cache[layer_idx][indices, ...]
541
- self.value_cache[layer_idx] = self.value_cache[layer_idx][indices, ...]
542
 
543
 
544
  # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
@@ -567,22 +363,6 @@ def _get_unpad_data(attention_mask):
567
  )
568
 
569
 
570
- def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
571
- warnings.warn(
572
- 'Calling `transformers.models.minicpm.modeling_minicpm._prepare_4d_attention_mask` is deprecated and will be removed in v4.37. Use `transformers.modeling_attn_mask_utils._prepare_4d_attention_mask'
573
- )
574
- return _prepare_4d_attention_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
575
-
576
-
577
- def _make_causal_mask(
578
- input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
579
- ):
580
- warnings.warn(
581
- 'Calling `transformers.models.minicpm.modeling_minicpm._make_causal_mask` is deprecated and will be removed in v4.37. Use `transformers.models.minicpm.modeling_minicpm.AttentionMaskConverter._make_causal_mask'
582
- )
583
- return AttentionMaskConverter._make_causal_mask(
584
- input_ids_shape=input_ids_shape, dtype=dtype, device=device, past_key_values_length=past_key_values_length
585
- )
586
 
587
 
588
  # @torch.jit.script # type: ignore
@@ -796,6 +576,21 @@ class MiniCPMMLP(nn.Module):
796
 
797
  return down_proj
798
 
799
 
800
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
801
  """
@@ -927,15 +722,7 @@ class MiniCPMAttention(nn.Module):
927
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
928
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
929
 
930
- kv_seq_len = key_states.shape[-2]
931
- if past_key_value is not None:
932
- if self.layer_idx is None:
933
- raise ValueError(
934
- f'The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} '
935
- 'for auto-regressive decoding with k/v caching, please make sure to initialize the attention class '
936
- 'with a layer index.'
937
- )
938
- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
939
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
940
 
941
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
@@ -1037,9 +824,7 @@ class MiniCPMFlashAttention2(MiniCPMAttention):
1037
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1038
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1039
 
1040
- kv_seq_len = key_states.shape[-2]
1041
- if past_key_value is not None:
1042
- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
1043
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
1044
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
1045
 
@@ -1211,7 +996,7 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1211
  self.dense_len = self.config.sparse_config.get('dense_len', 8192)
1212
 
1213
  self.local_blocks = self.window_size // self.block_size # local_blocks
1214
- self.topk = self.config.sparse_config.get('topk', 64)
1215
  self.use_nope = self.config.sparse_config.get('use_nope', False)
1216
  self.compress_k = CompressK(self.num_key_value_heads, self.head_dim, kernel_size=self.kernel_size, kernel_stride=self.kernel_stride)
1217
 
@@ -1237,7 +1022,6 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1237
  output_attentions = False
1238
 
1239
  bsz, q_len, _ = hidden_states.size()
1240
- assert bsz == 1, 'Only batch_size=1 is supported at the moment.'
1241
 
1242
  query_states = self.q_proj(hidden_states)
1243
  key_states = self.k_proj(hidden_states)
@@ -1255,9 +1039,7 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1255
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1256
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1257
 
1258
- kv_seq_len = key_states.shape[-2]
1259
- if past_key_value is not None:
1260
- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
1261
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
1262
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
1263
 
@@ -1271,12 +1053,11 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1271
  key_states = key_states.transpose(1, 2)
1272
  value_states = value_states.transpose(1, 2)
1273
  if self.use_nope:
 
1274
  no_rope_param = {
1275
  'key_states_no_rope': key_states_no_rope,
1276
  'query_states_no_rope': query_states_no_rope,
1277
  }
1278
- if kv_seq_len <= self.dense_len:
1279
- past_key_value.update_no_rope_key(key_states_no_rope, self.layer_idx)
1280
  else:
1281
  no_rope_param = None
1282
 
@@ -1308,15 +1089,11 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1308
  if kv_seq_len < self.dense_len:
1309
  attn_output = self._flash_attention_forward_dense(
1310
  query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate)
1311
- elif past_key_value is None or q_len != 1: # prefilling
1312
- attn_output = self._flash_attention_forward(
1313
  query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate,
1314
  no_rope_param=no_rope_param, # if past_key_value is not None else None,
1315
  past_key_value=past_key_value)
1316
- else:
1317
- attn_output = self._flash_attention_forward_with_kv_cache(
1318
- query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate, no_rope_param=no_rope_param, past_key_value=past_key_value)
1319
-
1320
  attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
1321
  attn_output = self.o_proj(attn_output)
1322
 
@@ -1325,122 +1102,146 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1325
 
1326
  return attn_output, attn_weights, past_key_value
1327
 
1328
- def _flash_attention_forward(
1329
- self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None, no_rope_param=None, past_key_value=None
1330
- ):
1331
- """
1332
- Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
1333
- first unpad the input, then computes the attention scores and pad the final attention scores.
1334
-
1335
- Args:
1336
- query_states (`torch.Tensor`):
1337
- Input query states to be passed to Flash Attention API
1338
- key_states (`torch.Tensor`):
1339
- Input key states to be passed to Flash Attention API
1340
- value_states (`torch.Tensor`):
1341
- Input value states to be passed to Flash Attention API
1342
- attention_mask (`torch.Tensor`):
1343
- The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
1344
- position of padding tokens and 1 for the position of non-padding tokens.
1345
- dropout (`int`, *optional*):
1346
- Attention dropout
1347
- softmax_scale (`float`, *optional*):
1348
- The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
1349
- """
1350
- if not self._flash_attn_uses_top_left_mask:
1351
- causal = self.is_causal
1352
- else:
1353
- # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in MiniCPMFlashAttention2 __init__.
1354
- causal = self.is_causal and query_length != 1
1355
- # Contains at least one padding token in the sequence
1356
- if attention_mask is not None:
1357
- batch_size = query_states.shape[0]
1358
- query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
1359
- query_states, key_states, value_states, attention_mask, query_length
1360
- )
1361
- if no_rope_param is not None:
1362
- # nope unpad
1363
- no_rope_param['query_states_no_rope'] = no_rope_param['query_states_no_rope'].squeeze(0)
1364
- no_rope_param['key_states_no_rope'] = no_rope_param['key_states_no_rope'].squeeze(0)
1365
-
1366
- cu_seqlens_q, cu_seqlens_k = cu_seq_lens
1367
- max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
1368
- attn_output_unpad = self.sparse_forward(
1369
- query_states,
1370
- key_states,
1371
- value_states,
1372
- cu_seqlens_q,
1373
- cu_seqlens_k,
1374
- max_seqlen_in_batch_q,
1375
- max_seqlen_in_batch_k,
1376
- no_rope_param=no_rope_param,
1377
- past_key_value=past_key_value,
1378
- )
1379
 
1380
- attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
1381
- else:
1382
- raise ValueError('Need attention mask')
1383
 
1384
- return attn_output
1385
 
1386
- def _flash_attention_forward_with_kv_cache(
1387
- self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None, no_rope_param=None, past_key_value=None
1388
- ):
1389
  """
1390
- Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
1391
- first unpad the input, then computes the attention scores and pad the final attention scores.
1392
-
1393
  Args:
1394
- query_states (`torch.Tensor`):
1395
- Input query states to be passed to Flash Attention API
1396
- key_states (`torch.Tensor`):
1397
- Input key states to be passed to Flash Attention API
1398
- value_states (`torch.Tensor`):
1399
- Input value states to be passed to Flash Attention API
1400
- attention_mask (`torch.Tensor`):
1401
- The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
1402
- position of padding tokens and 1 for the position of non-padding tokens.
1403
- dropout (`int`, *optional*):
1404
- Attention dropout
1405
- softmax_scale (`float`, *optional*):
1406
- The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
1407
  """
1408
- if not self._flash_attn_uses_top_left_mask:
1409
- causal = self.is_causal
1410
- else:
1411
- # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in MiniCPMFlashAttention2 __init__.
1412
- causal = self.is_causal and query_length != 1
1413
- # Contains at least one padding token in the sequence
1414
- if attention_mask is not None:
1415
-
1416
- batch_size = query_states.shape[0]
1417
-
1418
- # query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
1419
- # query_states, key_states, value_states, attention_mask, query_length=query_length
1420
- # )
1421
 
1422
- assert batch_size == 1, 'Only batch_size=1 is supported at the moment.'
1423
- # prepare past kv ,new kv
1424
- new_q = query_states
 
1425
 
1426
- new_k = key_states[:, -1:, :, :].contiguous()
1427
- new_v = value_states[:, -1:, :, :].contiguous()
1428
 
1429
- past_k = key_states[:, :-1, :, :].contiguous()
1430
- past_v = value_states[:, :-1, :, :].contiguous()
1431
- if no_rope_param is not None:
1432
- # nope unpad
1433
- no_rope_param['query_states_no_rope'] = no_rope_param['query_states_no_rope'].squeeze(0)
1434
- no_rope_param['key_states_no_rope'] = no_rope_param['key_states_no_rope'].squeeze(0)
1435
 
1436
- attn_output = self.sparse_forward_with_kv_cache(
1437
- past_k=past_k, past_v=past_v, new_k=new_k, new_v=new_v, new_q=new_q, batch_size=batch_size, no_rope_param=no_rope_param, past_key_value=past_key_value)
1438
 
1439
- # attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
 
 
1440
  else:
1441
- raise ValueError('need attention mask')
1442
 
1443
- return attn_output
1444
 
1445
  def sparse_forward(self,
1446
  query_layer,
@@ -1451,24 +1252,18 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1451
  max_seqlen_in_batch_q,
1452
  max_seqlen_in_batch_k,
1453
  no_rope_param=None,
1454
- past_key_value=None):
1455
- stage1_k = key_layer if no_rope_param is None else no_rope_param['key_states_no_rope']
1456
- compressed_k, compressed_cu_seqlens = self.compress_k(stage1_k, cu_seqlens_k)
1457
- compressed_v = compressed_k.clone()
1458
- if past_key_value is not None:
1459
- # Compute the start indices of keys (k) that were not compressed, Only batch_size=1 is supported at the moment.
1460
- no_compress_k_start = compressed_k.shape[0] * self.kernel_stride
1461
- past_key_value.update_compress_k(
1462
- compressed_k, self.layer_idx
1463
- )
1464
- past_key_value.update_no_compress_k(
1465
- key_layer[no_compress_k_start:], self.layer_idx, no_compress_k_start)
1466
- past_key_value.cached_compressed_cu_seqlens.append(compressed_cu_seqlens)
1467
  compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
 
 
 
 
 
1468
  topk_idx = compressed_attention(
1469
  query_layer if no_rope_param is None else no_rope_param['query_states_no_rope'],
1470
  compressed_k,
1471
- compressed_v,
1472
  self.kernel_size,
1473
  self.kernel_stride,
1474
  self.block_size,
@@ -1480,8 +1275,8 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1480
  None,
1481
  init_blocks=self.init_blocks,
1482
  local_blocks=self.local_blocks,
 
1483
  )
1484
-
1485
  topk_attn_output = infllmv2_attn_varlen_func(
1486
  query_layer,
1487
  key_layer,
@@ -1493,102 +1288,14 @@ class MiniCPMInfLLMv2Attention(MiniCPMAttention):
1493
  dropout_p=0.0,
1494
  deterministic=False,
1495
  softmax_scale=None,
1496
- causal=True,
1497
  return_attn_probs=False,
1498
- block_window_size=self.window_size // self.block_size,
1499
  topk_idx=topk_idx
1500
  )
1501
 
1502
  return topk_attn_output
1503
 
1504
- def sparse_forward_with_kv_cache(self, past_k=None, past_v=None, new_k=None, new_v=None, new_q=None, batch_size=None, no_rope_param=None, past_key_value=None):
1505
-
1506
- # stage1_k = new_k.squeeze(0) if no_rope_param is None else no_rope_param['key_states_no_rope']
1507
- if past_k.shape[1] + new_k.shape[1] == self.dense_len and (past_key_value.compress_k_cache == [] or len(past_key_value.compress_k_cache) < self.layer_idx + 1 or past_key_value.compress_k_cache[self.layer_idx] == []):
1508
- if no_rope_param is not None:
1509
- stage1_k = past_key_value.no_rope_key_cache[self.layer_idx].squeeze(0).contiguous() # just batch_size ==1
1510
- else:
1511
- stage1_k = torch.cat([past_k, new_k], dim=1).contiguous().squeeze(0).contiguous() # just batch_size ==1
1512
- compressed_k, compressed_cu_seqlens = self.compress_k(stage1_k, torch.tensor([0, stage1_k.shape[0]], device=stage1_k.device, dtype=torch.int32)) # just batch_size ==1
1513
-
1514
- # Compute the start indices of keys (k) that were not compressed, Only batch_size=1 is supported at the moment.
1515
- no_compress_k_start = compressed_k.shape[0] * self.kernel_stride
1516
- past_key_value.update_compress_k(
1517
- compressed_k, self.layer_idx
1518
- )
1519
- past_key_value.update_no_compress_k(
1520
- stage1_k[no_compress_k_start:], self.layer_idx, no_compress_k_start)
1521
- past_key_value.cached_compressed_cu_seqlens.append(compressed_cu_seqlens)
1522
-
1523
- else:
1524
- stage1_k = new_k.squeeze(0) if no_rope_param is None else no_rope_param['key_states_no_rope']
1525
- no_compress_k = past_key_value.update_no_compress_k(
1526
- stage1_k, self.layer_idx, kernel_stride=self.kernel_stride, kernel_size=self.kernel_size)
1527
- if no_compress_k is not None:
1528
- compressed_k = no_compress_k.mean(dim=0, keepdim=True) # [1, n_heads_k, head_dim]
1529
-
1530
- compressed_k = past_key_value.update_compress_k(
1531
- compressed_k, self.layer_idx) # [seqlen, nheads_k, head_dim]
1532
-
1533
- past_key_value.cached_compressed_cu_seqlens[self.layer_idx][-1] += 1 # !Increment the last entry in sequence lengths by 1; currently supports only batch_size = 1
1534
- compressed_cu_seqlens = past_key_value.cached_compressed_cu_seqlens[self.layer_idx]
1535
- else:
1536
- compressed_k = past_key_value.compress_k_cache[self.layer_idx] # [seqlen, nheads_k, head_dim]
1537
- compressed_cu_seqlens = past_key_value.cached_compressed_cu_seqlens[self.layer_idx]
1538
-
1539
- compressed_v = compressed_k.clone()
1540
-
1541
- compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
1542
- torch.cuda.synchronize()
1543
- # Manually verify that the lengths match
1544
- assert compressed_k.shape[0] == compressed_seqlens.sum().item(), 'The length of compressed_k does not match the sum of compressed_seqlens'
1545
- topk_idx = compressed_attention(
1546
- new_q.squeeze(0).contiguous() if no_rope_param is None else no_rope_param['query_states_no_rope'],
1547
- compressed_k,
1548
- compressed_v,
1549
- self.kernel_size,
1550
- self.kernel_stride,
1551
- self.block_size,
1552
- self.topk,
1553
- torch.tensor([0, 1], device=compressed_k.device, dtype=torch.int32),
1554
- compressed_cu_seqlens,
1555
- 1,
1556
- compressed_seqlens.max().item(),
1557
- None,
1558
- init_blocks=self.init_blocks,
1559
- local_blocks=self.local_blocks,
1560
- total_seq_lens=past_k.shape[1] + 1, # !Only batch_size=1 is supported at the moment.
1561
- )
1562
-
1563
- repeat_times = 1
1564
- if repeat_times > 1:
1565
- new_q = new_q.repeat_interleave(repeat_times, dim=-2)
1566
- else:
1567
- new_q = new_q
1568
-
1569
- cache_batch_idx = torch.arange(batch_size, device=new_q.device, dtype=torch.int32)
1570
-
1571
- seqlen_k = past_k.shape[1] + new_k.shape[1] # !Only batch_size=1 is supported at the moment.
1572
- seqlens_k = torch.full((batch_size,), seqlen_k - 1, dtype=torch.int32, device=new_q.device)
1573
-
1574
- past_k = torch.cat([past_k, torch.zeros_like(new_k, dtype=new_k.dtype)], dim=1).contiguous() # Append one zero vector to avoid potential out-of-bounds access
1575
- past_v = torch.cat([past_v, torch.zeros_like(new_v, dtype=new_v.dtype)], dim=1).contiguous() # Append one zero vector to avoid potential out-of-bounds access
1576
- topk_attn_output = infllmv2_attn_with_kvcache(
1577
- q=new_q,
1578
- k_cache=past_k,
1579
- v_cache=past_v,
1580
- topk_idx=topk_idx,
1581
- block_window_size=self.window_size // self.block_size,
1582
- k=new_k, # [batch_size, 1, nheads_k, d]
1583
- v=new_v, # [batch_size, 1, nheads_k, d]
1584
- cache_seqlens=seqlens_k, # current_seqlens_k-1
1585
- rotary_cos=None, # No rotary embeddings
1586
- rotary_sin=None, # No rotary embeddings
1587
- cache_batch_idx=cache_batch_idx,
1588
- causal=False, # Renaming to match function signature
1589
- )
1590
- return topk_attn_output
1591
-
1592
  def _flash_attention_forward_dense(
1593
  self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
1594
  ):
@@ -1727,9 +1434,7 @@ class MiniCPMSdpaAttention(MiniCPMAttention):
1727
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1728
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1729
 
1730
- kv_seq_len = key_states.shape[-2]
1731
- if past_key_value is not None:
1732
- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
1733
  cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
1734
 
1735
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
@@ -2052,11 +1757,13 @@ class MiniCPMModel(MiniCPMPreTrainedModel):
2052
  raise ValueError(
2053
  'You must use the new past_key_values format, such as the Cache class, instead of the old tuple format.'
2054
  )
2055
- past_key_values = DynamicCache.from_legacy_cache(past_key_values)
2056
 
2057
- past_key_values_length = past_key_values.get_usable_length(seq_length)
2058
  if self.config.sparse_config is not None and torch.cuda.is_available() and past_key_values_length == 0:
2059
- past_key_values = DynamicCacheQKV()
2060
 
2061
  if position_ids is None:
2062
  device = input_ids.device if input_ids is not None else inputs_embeds.device
@@ -2282,12 +1989,17 @@ class MiniCPMForCausalLM(MiniCPMPreTrainedModel):
2282
  ):
2283
  if past_key_values is not None:
2284
  if isinstance(past_key_values, Cache):
 
2285
  cache_length = past_key_values.get_seq_length()
2286
- past_length = past_key_values.seen_tokens
2287
- max_cache_length = None # past_key_values.get_max_length()
2288
- else:
2289
- cache_length = past_length = past_key_values[0][0].shape[2]
2290
  max_cache_length = None
2291
 
2292
  # Keep only the unprocessed tokens:
2293
  # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
 
24
  from torch import nn
25
  from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
26
  from transformers.activations import ACT2FN
27
+ from transformers.cache_utils import Cache, DynamicCache, CacheLayerMixin, DynamicLayer
28
  from transformers.modeling_attn_mask_utils import (
29
  AttentionMaskConverter,
30
  _prepare_4d_attention_mask,
 
57
  infllmv2_attn_varlen_func,
58
  infllmv2_attn_with_kvcache,
59
  max_pooling_1d,
60
+ max_pooling_1d_varlen
61
  )
62
  except:
63
  pass
 
80
  sm_scale: float = None,
81
  init_blocks: int = 1,
82
  local_blocks: int = 2,
83
+ cache_lens: torch.Tensor = None,
 
84
  ) -> Tuple[torch.Tensor, torch.Tensor]:
85
  """Attention between query and compressed key and value. Compute attention output and topk block idx used in topk_sparse_attention.
86
 
 
99
  sm_scale (float, optional): softmax scale. Defaults to None, means 1/sqrt(head_dim).
100
  init_blocks (int, optional): Number of init blocks for each query. Defaults to 1.
101
  local_blocks (int, optional): Number of local blocks for each query. Defaults to 2.
102
+ cache_lens (torch.Tensor, optional): shape [batch_size], used to record the cache length of each query. Defaults to None.
 
103
 
104
  Returns:
105
  Tuple[torch.Tensor, torch.Tensor]: attention output and topk_idx used in topk_sparse_attention
106
  """
107
  with torch.no_grad():
 
108
  batch_size = cu_seqlens_q.shape[0] - 1
109
+
110
+ # Check if it's prefilling stage
111
+ is_prefilling = cache_lens is None or (cache_lens == 0).all().item()
112
+
113
+ # prefilling stage
114
+ if is_prefilling:
115
+ # Calculate q_idx for each query position in each batch
116
+ cache_lens = torch.zeros(batch_size, dtype=torch.int32, device=q.device)
117
+ q_idx = torch.cat([
118
+ (torch.arange(cu_seqlens_q[i + 1] - cu_seqlens_q[i], device=q.device) +
119
+ max_seqlen_q - (cu_seqlens_q[i + 1] - cu_seqlens_q[i])) // block_size
120
+ for i in range(batch_size)
121
+ ], dim=0) # shape: [total_q_len]
122
+ # decoding stage
123
  else:
124
+ # During decoding each sequence contributes a single query (its last position), so [batch_size] == [total_q_len]
125
+ q_idx = cache_lens // block_size
 
126
 
127
+ # compute attention score
128
  score = infllmv2_attn_stage1(
129
  q.contiguous(),
130
  k.contiguous(),
 
133
  cu_seqlens_k=cu_seqlens_k,
134
  max_seqlen_q=max_seqlen_q,
135
  max_seqlen_k=max_seqlen_k,
136
+ causal=is_prefilling)
137
+ # Shape: [num_heads, total_q_len, num_blocks]
138
  score = score[:, :q_idx.shape[0], :]
139
 
140
+ # Shape: [num_heads, total_q_len, num_blocks]
141
+ block_score = max_pooling_1d_varlen(
142
  score.contiguous(),
143
+ cu_seqlens_q,
144
+ cu_seqlens_k,
145
+ cache_lens,
146
+ max_seqlen_q,
147
+ max_seqlen_k,
148
  local_blocks=local_blocks,
149
  init_blocks=init_blocks,
150
  block_size=block_size,
151
+ stride=kernel_stride)
152
+
153
  # get topk
154
  topk = min(topk, block_score.shape[-1])
155
  topk_idx = block_score.topk(topk, dim=-1).indices.sort(-1).values
156
+ topk_idx[topk_idx > q_idx[None, :, None]] = -1
157
  topk_idx = topk_idx.to(torch.int32)
158
 
159
  return topk_idx
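To make the two branches above concrete, here is a minimal sketch in plain PyTorch of the block-index bookkeeping: per-token block indices during prefilling, per-sequence indices during decoding, and masking of selected blocks beyond the causal frontier. The random `block_score` stands in for the pooled output of `infllmv2_attn_stage1` / `max_pooling_1d_varlen`; all shapes are illustrative.

```python
import torch

block_size, topk = 64, 4
cu_seqlens_q = torch.tensor([0, 100, 250], dtype=torch.int32)  # two packed sequences of 100 and 150 tokens
max_seqlen_q = 150
cache_lens = None                                              # None -> prefilling

if cache_lens is None:                                         # prefilling: one block index per token
    idx_chunks = []
    for i in range(cu_seqlens_q.numel() - 1):
        seq_len = int(cu_seqlens_q[i + 1] - cu_seqlens_q[i])
        pos = torch.arange(seq_len) + max_seqlen_q - seq_len   # right-align shorter sequences
        idx_chunks.append(pos // block_size)
    q_idx = torch.cat(idx_chunks)                              # shape [total_q_len]
else:                                                          # decoding: one block index per sequence
    q_idx = cache_lens // block_size

num_heads, num_blocks = 2, 8
block_score = torch.rand(num_heads, q_idx.shape[0], num_blocks)  # stand-in for the pooled block scores

topk_idx = block_score.topk(min(topk, num_blocks), dim=-1).indices.sort(-1).values
topk_idx[topk_idx > q_idx[None, :, None]] = -1                 # drop key blocks beyond the causal frontier
topk_idx = topk_idx.to(torch.int32)
print(topk_idx.shape)                                          # torch.Size([2, 250, 4])
```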
 
252
  return compressed_k, cu_seqlens_compressed
253
 
254
 
 
 
 
255
 
256
+ class InfLLMv2CacheLayer(DynamicLayer):
257
+ def __init__(self):
258
  super().__init__()
259
+ # InfLLMv2-specific buffers: no-RoPE keys, per-sequence compressed keys, and the uncompressed tail waiting to be compressed
260
+ self.no_rope_keys = torch.tensor([], dtype=torch.float32)
261
+ self.compress_k_cache = []
262
+ self.no_compress_k_cache = []
263
+ self.cached_compressed_cu_seqlens = torch.tensor([], dtype=torch.int32)
264
+ self.compress_k_cache_varlen = torch.tensor([], dtype=torch.float32)
265
+
266
+ def update_no_rope_key(self, key_states):
267
+ if self.no_rope_keys.numel() == 0:
268
+ self.no_rope_keys = key_states
269
  else:
270
+ self.no_rope_keys = torch.cat([self.no_rope_keys, key_states], dim=1)
271
+ return self.no_rope_keys
272
+
273
+ def update_compress_k(self, key_states, cu_seqlens=None):
274
+ if len(self.compress_k_cache) == 0:
275
+ if cu_seqlens is not None:
276
+ self.cached_compressed_cu_seqlens = cu_seqlens.clone()
277
+ self.compress_k_cache_varlen = key_states
278
+ split_sizes = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
279
+ self.compress_k_cache = list(torch.split(key_states, split_sizes))
280
  else:
281
+ for index, k in enumerate(key_states):
282
+ if k is not None:
283
+ self.compress_k_cache[index] = torch.cat([self.compress_k_cache[index], k], dim=0)
284
+ new_seq_lens = torch.tensor([tensor.shape[0] for tensor in self.compress_k_cache], dtype=torch.int32)
285
+ new_cumsum = torch.cumsum(new_seq_lens, dim=0, dtype=torch.int32)
286
+
287
+ self.compress_k_cache_varlen = torch.cat(self.compress_k_cache, dim=0)
288
+ self.cached_compressed_cu_seqlens = torch.cat([torch.tensor([0], dtype=torch.int32), new_cumsum]).to(self.compress_k_cache_varlen.device)
289
+ return self.compress_k_cache_varlen, self.cached_compressed_cu_seqlens
290
+
291
+ def update_no_compress_k(self, key_states, kernel_size=32, kernel_stride=16):
292
+ k_chunk_list = []
293
+ for index, k in enumerate(key_states):
294
+ if len(self.no_compress_k_cache) <= index:
295
+ self.no_compress_k_cache.append(k)
296
+ else:
297
+ self.no_compress_k_cache[index] = torch.cat([self.no_compress_k_cache[index], k], dim=0)
298
+ current_len = self.no_compress_k_cache[index].shape[0]
299
+ if current_len >= kernel_size:
300
+ k_chunk_list.append(self.no_compress_k_cache[index][:kernel_size])
301
+ self.no_compress_k_cache[index] = self.no_compress_k_cache[index][kernel_stride:]
302
+ else:
303
+ k_chunk_list.append(None)
304
+ return k_chunk_list
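The pair of update methods above implements a per-sequence sliding compression buffer. The stand-alone sketch below (plain PyTorch, illustrative shapes, not part of the modeling file) shows the same bookkeeping in isolation: keys accumulate until `kernel_size` of them are buffered, the oldest `kernel_size` are mean-pooled into one compressed key, and the buffer then slides forward by `kernel_stride`.

```python
import torch

kernel_size, kernel_stride = 32, 16
n_heads_k, head_dim = 2, 64

buffer = torch.empty(0, n_heads_k, head_dim)       # uncompressed tail of the key stream
compressed = torch.empty(0, n_heads_k, head_dim)   # one row per compressed block

for step in range(100):                            # pretend decode loop, one new key per step
    new_key = torch.randn(1, n_heads_k, head_dim)
    buffer = torch.cat([buffer, new_key], dim=0)
    if buffer.shape[0] >= kernel_size:
        block = buffer[:kernel_size].mean(dim=0, keepdim=True)  # [1, n_heads_k, head_dim]
        compressed = torch.cat([compressed, block], dim=0)
        buffer = buffer[kernel_stride:]            # keep an overlap of kernel_size - kernel_stride tokens

print(compressed.shape[0], buffer.shape[0])        # 5 20
```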
305
 
306
+ class InfLLMv2Cache(DynamicCache):
307
+ def __init__(self,
308
+ config, num_hidden_layers: Optional[int] = None) -> None:
309
+ super().__init__(config=config)
310
+ self.layers = [InfLLMv2CacheLayer() for _ in range(num_hidden_layers)] if num_hidden_layers else []
311
+ self._seen_tokens = 0
312
 
313
+ def update(self, key_states, value_states, layer_idx, cache_kwargs=None):
314
  if layer_idx == 0:
315
  self._seen_tokens += key_states.shape[-2]
316
+ return self.layers[layer_idx].update(key_states, value_states, cache_kwargs)
317
 
318
+ def update_no_rope_key(self, key_states, layer_idx, cache_kwargs=None):
319
+ return self.layers[layer_idx].update_no_rope_key(key_states)
320
 
321
+ def update_compress_k(self, key_states, layer_idx, cu_seqlens=None, cache_kwargs=None):
322
+ return self.layers[layer_idx].update_compress_k(key_states, cu_seqlens)
 
323
 
324
+ def update_no_compress_k(self, key_states, layer_idx, kernel_size=32, kernel_stride=16, cache_kwargs=None):
325
+ return self.layers[layer_idx].update_no_compress_k(key_states, kernel_size, kernel_stride)
326
 
327
+ def crop(self, max_length):
328
+ for layer in self.layers:
329
+ layer.crop(max_length)
330
 
331
+ def batch_repeat_interleave(self, repeats):
332
+ for layer in self.layers:
333
+ layer.batch_repeat_interleave(repeats)
334
 
335
+ def batch_select_indices(self, indices):
336
+ for layer in self.layers:
337
+ layer.batch_select_indices(indices)
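A rough usage sketch of the cache class above, with `InfLLMv2Cache` taken from this modeling file and assuming the transformers version it targets (whose `DynamicCache`/`DynamicLayer` accept the arguments used here); the checkpoint id and tensor shapes are illustrative only.

```python
import torch
from transformers import AutoConfig

config = AutoConfig.from_pretrained('openbmb/MiniCPM4.1-8B', trust_remote_code=True)
cache = InfLLMv2Cache(config=config, num_hidden_layers=config.num_hidden_layers)

bsz, seq = 1, 4
n_kv_heads = config.num_key_value_heads
head_dim = config.hidden_size // config.num_attention_heads

key = torch.randn(bsz, n_kv_heads, seq, head_dim)    # layout expected by Cache.update
value = torch.randn(bsz, n_kv_heads, seq, head_dim)

k_out, v_out = cache.update(key, value, layer_idx=0)                                  # RoPE'd K/V path
cache.update_no_rope_key(torch.randn(bsz, seq, n_kv_heads, head_dim), layer_idx=0)    # NoPE keys kept separately
print(cache.get_seq_length())                                                         # tokens cached for layer 0
```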
 
338
 
339
 
340
  # This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
 
363
  )
364
 
365
 
366
 
367
 
368
  # @torch.jit.script # type: ignore
 
576
 
577
  return down_proj
578
 
579
+ def _unpad_one_tensor(hidden_states, attention_mask):
580
+ # Unpad the hidden states using the indices
581
+ indices, cu_seqlens, max_seqlen_in_batch = _get_unpad_data(attention_mask)
582
+ batch_size, seq_len = hidden_states.shape[:2]
583
+
584
+ # Get the remaining dimensions
585
+ remaining_dims = hidden_states.shape[2:]
586
+
587
+ # Reshape to (batch_size * seq_len, *remaining_dims)
588
+ reshaped_states = hidden_states.reshape(batch_size * seq_len, *remaining_dims)
589
+
590
+ # Apply unpadding using indices
591
+ unpadded_states = index_first_axis(reshaped_states, indices)
592
+
593
+ return unpadded_states, indices, cu_seqlens, max_seqlen_in_batch
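For reference, the same unpadding can be written without the `flash_attn` helpers; the sketch below is a plain-PyTorch equivalent of what `_get_unpad_data` plus `index_first_axis` compute for the forward pass (the mask and shapes are made up).

```python
import torch
import torch.nn.functional as F

def unpad_one_tensor_reference(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)              # real tokens per sequence
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    flat = hidden_states.reshape(-1, *hidden_states.shape[2:])           # [batch*seq, ...]
    return flat[indices], indices, cu_seqlens, max_seqlen

mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])                        # 3 and 2 real tokens
states = torch.randn(2, 4, 8, 16)                                        # [batch, seq, heads, dim]
unpadded, indices, cu_seqlens, max_seqlen = unpad_one_tensor_reference(states, mask)
print(unpadded.shape, cu_seqlens.tolist(), max_seqlen)                   # torch.Size([5, 8, 16]) [0, 3, 5] 3
```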
594
 
595
  def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
596
  """
 
722
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
723
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
724
 
725
+ kv_seq_len = position_ids.max().item() + 1
726
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
727
 
728
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
 
824
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
825
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
826
 
827
+ kv_seq_len = position_ids.max().item() + 1
 
 
828
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
829
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
830
 
 
996
  self.dense_len = self.config.sparse_config.get('dense_len', 8192)
997
 
998
  self.local_blocks = self.window_size // self.block_size # local_blocks
999
+ self.topk = self.config.sparse_config.get('topk', 64) + (self.window_size//self.block_size)
1000
  self.use_nope = self.config.sparse_config.get('use_nope', False)
1001
  self.compress_k = CompressK(self.num_key_value_heads, self.head_dim, kernel_size=self.kernel_size, kernel_stride=self.kernel_stride)
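For orientation, the fields this class reads from `config.sparse_config` look like the following; the values not visible in this file (marked as assumed) are illustrative, not necessarily the shipped configuration.

```python
# Illustrative sparse_config; entries marked "assumed" are not confirmed by this file.
sparse_config = {
    'kernel_size': 32,    # tokens mean-pooled into one compressed key
    'kernel_stride': 16,  # stride between pooled windows (50% overlap)
    'init_blocks': 1,     # always-attended leading blocks (assumed default)
    'block_size': 64,     # granularity of top-k key blocks (assumed default)
    'window_size': 2048,  # local window; local_blocks = window_size // block_size (assumed default)
    'topk': 64,           # selected key blocks per query; local blocks are added on top
    'use_nope': False,    # score blocks with no-RoPE queries/keys when True
    'dense_len': 8192,    # below this KV length, fall back to dense FlashAttention
}
```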
1002
 
 
1022
  output_attentions = False
1023
 
1024
  bsz, q_len, _ = hidden_states.size()
 
1025
 
1026
  query_states = self.q_proj(hidden_states)
1027
  key_states = self.k_proj(hidden_states)
 
1039
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1040
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1041
 
1042
+ kv_seq_len = position_ids.max().item() + 1
 
 
1043
  cos, sin = self.rotary_emb(value_states.to(torch.float32), seq_len=kv_seq_len)
1044
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
1045
 
 
1053
  key_states = key_states.transpose(1, 2)
1054
  value_states = value_states.transpose(1, 2)
1055
  if self.use_nope:
1056
+ key_states_no_rope = past_key_value.update_no_rope_key(key_states_no_rope, self.layer_idx)
1057
  no_rope_param = {
1058
  'key_states_no_rope': key_states_no_rope,
1059
  'query_states_no_rope': query_states_no_rope,
1060
  }
 
 
1061
  else:
1062
  no_rope_param = None
1063
 
 
1089
  if kv_seq_len < self.dense_len:
1090
  attn_output = self._flash_attention_forward_dense(
1091
  query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate)
1092
+ else:
1093
+ attn_output = self._sparse_attention_forward(
1094
  query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate,
1095
  no_rope_param=no_rope_param, # if past_key_value is not None else None,
1096
+ past_key_value=past_key_value)
1097
  attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
1098
  attn_output = self.o_proj(attn_output)
1099
 
 
1102
 
1103
  return attn_output, attn_weights, past_key_value
1104
 
1105
+ def _sparse_attention_forward(
1106
+ self,
1107
+ query_states,
1108
+ key_states,
1109
+ value_states,
1110
+ attention_mask,
1111
+ query_length,
1112
+ dropout=0.0,
1113
+ softmax_scale=None,
1114
+ no_rope_param=None,
1115
+ past_key_value=None):
1116
+ """
1117
+ Routes attention through the InfLLMv2 sparse path. If the input hidden states contain at least one padding token,
1118
+ the input is first unpadded, attention is computed over the unpadded sequences, and the final attention output is padded back.
1119
+
1120
+ Args:
1121
+ query_states (`torch.Tensor`):
1122
+ Input query states to be passed to Flash Attention API
1123
+ key_states (`torch.Tensor`):
1124
+ Input key states to be passed to Flash Attention API
1125
+ value_states (`torch.Tensor`):
1126
+ Input value states to be passed to Flash Attention API
1127
+ attention_mask (`torch.Tensor`):
1128
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
1129
+ position of padding tokens and 1 for the position of non-padding tokens.
1130
+ dropout (`int`, *optional*):
1131
+ Attention dropout
1132
+ softmax_scale (`float`, *optional*):
1133
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
1134
+ """
1135
+ if not self._flash_attn_uses_top_left_mask:
1136
+ causal = self.is_causal
1137
+ else:
1138
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in MiniCPMFlashAttention2 __init__.
1139
+ causal = self.is_causal and query_length != 1
1140
+ # Contains at least one padding token in the sequence
1141
+ if attention_mask is not None:
1142
+ batch_size = query_states.shape[0]
1143
+ # assert batch_size == 1, 'Only batch_size=1 is supported at the moment.'
1144
+ if past_key_value is not None:
1145
+ compressed_k, compressed_cu_seqlens = self.get_compress_k(
1146
+ key_states=key_states if not self.use_nope else no_rope_param['key_states_no_rope'],  # this could be optimized a bit
1147
+ attention_mask=attention_mask,
1148
+ past_key_value=past_key_value)
1149
+
1150
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
1151
+ query_states, key_states, value_states, attention_mask, query_length
1152
+ )
1153
 
1154
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
1155
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
1156
+ if no_rope_param is not None:
1157
+ if max_seqlen_in_batch_q == 1:
1158
+ no_rope_param['query_states_no_rope'] = no_rope_param['query_states_no_rope'].squeeze(1)
1159
+ else:
1160
+ no_rope_param['query_states_no_rope'], _, _, _ = _unpad_one_tensor(no_rope_param['query_states_no_rope'], attention_mask=attention_mask)
1161
+ if past_key_value is None:
1162
+ # compress_k uses the varlen form
1163
+ compressed_k, compressed_cu_seqlens = self.compress_k(key_states, cu_seqlens_k)
1164
+
1165
+ attn_output_unpad = self.sparse_forward(
1166
+ query_states,
1167
+ key_states,
1168
+ value_states,
1169
+ cu_seqlens_q,
1170
+ cu_seqlens_k,
1171
+ max_seqlen_in_batch_q,
1172
+ max_seqlen_in_batch_k,
1173
+ no_rope_param=no_rope_param,
1174
+ compressed_k=compressed_k,
1175
+ compressed_cu_seqlens=compressed_cu_seqlens)
1176
+
1177
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
1178
+ else:
1179
+ raise ValueError('Need attention mask')
1180
 
1181
+ return attn_output
1182
 
1183
+ def get_compress_k(self, key_states, attention_mask, past_key_value):
 
 
1184
  """
1185
+ Get compressed key states and corresponding cumulative sequence lengths.
1186
+
 
1187
  Args:
1188
+ key_states: Key states tensor of shape [batch_size, seq_len, num_kv_heads, head_dim]
1189
+ attention_mask: Padding mask of shape [batch_size, seq_len]; 1 marks real tokens, 0 marks padding
1190
+ past_key_value: InfLLMv2Cache holding the compressed keys and the
1191
+ uncompressed tail buffer reused across prefill and decoding
1192
+
1193
+ Returns:
1194
+ Tuple of (compressed_k, compressed_cu_seqlens)
1195
  """
1196
+ # Check if this is prefilling or initial compression condition
1197
+ is_prefilling = (
1198
+ key_states.shape[1] >= self.dense_len and
1199
+ (
1200
+ not past_key_value.layers[self.layer_idx].compress_k_cache
1201
+ )
1202
+ )
 
 
 
 
 
 
1203
 
1204
+ if is_prefilling:
1205
+ unpadded_key_states, indices, cu_seqlens, max_seqlen_in_batch = _unpad_one_tensor(key_states,attention_mask=attention_mask)
1206
+ # Compress the keys
1207
+ compressed_k, compressed_cu_seqlens = self.compress_k(unpadded_key_states, cu_seqlens)
1208
 
1209
+ past_key_value.update_compress_k(
1210
+ compressed_k, self.layer_idx, compressed_cu_seqlens)
1211
 
1212
+ no_compress_k_list = []
1213
+ # Compute and update no_compress_k
1214
+ for i in range(len(compressed_cu_seqlens)-1):
1215
+ no_compress_k_start = (compressed_cu_seqlens[i+1]- compressed_cu_seqlens[i]) * self.kernel_stride
 
 
1216
 
1217
+ no_compress_k_list.append(unpadded_key_states[cu_seqlens[i]+no_compress_k_start:cu_seqlens[i+1]].clone())
 
1218
 
1219
+ past_key_value.update_no_compress_k(
1220
+ no_compress_k_list, self.layer_idx,kernel_stride=self.kernel_stride,
1221
+ kernel_size=self.kernel_size)
1222
  else:
1223
+ # Decode case: incremental update
1224
+ batch_size = key_states.shape[0] # key_states.shape = [batch_size, seq, k_head_num, head_dim]
1225
+ key_states_split = list(torch.split(
1226
+ key_states[:, -1:].squeeze(1),  # [batch_size, seq, k_head_num, head_dim] -> [batch_size, 1, k_head_num, head_dim] -> [batch_size, k_head_num, head_dim]
1227
+ [1] * batch_size, dim=0,
1228
+ ))
1229
+ # Try to update no_compress_k buffer
1230
+ no_compress_k_list = past_key_value.update_no_compress_k(
1231
+ key_states_split, self.layer_idx,
1232
+ kernel_stride=self.kernel_stride,
1233
+ kernel_size=self.kernel_size)
1234
+ new_compressed_k_list = []
1235
+ for no_compress_k in no_compress_k_list:
1236
+ if no_compress_k is not None:
1237
+ # We have enough tokens to compress
1238
+ new_compressed_k = no_compress_k.mean(dim=0, keepdim=True) # [1, n_heads_k, head_dim]
1239
+ new_compressed_k_list.append(new_compressed_k)
1240
+ else:
1241
+ new_compressed_k_list.append(None)
1242
+ compressed_k, compressed_cu_seqlens = past_key_value.update_compress_k(new_compressed_k_list, self.layer_idx,)
1243
 
1244
+ return compressed_k, compressed_cu_seqlens
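A small worked example of the prefill bookkeeping above, assuming `kernel_size=32`, `kernel_stride=16`, and that `CompressK` emits the usual `(len - kernel_size) // kernel_stride + 1` overlapping windows:

```python
kernel_size, kernel_stride = 32, 16
seq_len = 100                                                # prefill length of one sequence

num_blocks = (seq_len - kernel_size) // kernel_stride + 1    # 5 compressed keys
no_compress_k_start = num_blocks * kernel_stride             # 80: first position not yet compressed
tail = seq_len - no_compress_k_start                         # 20 tokens kept in the uncompressed buffer

print(num_blocks, no_compress_k_start, tail)                 # 5 80 20
```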
1245
 
1246
  def sparse_forward(self,
1247
  query_layer,
 
1252
  max_seqlen_in_batch_q,
1253
  max_seqlen_in_batch_k,
1254
  no_rope_param=None,
1255
+ compressed_k=None,
1256
+ compressed_cu_seqlens=None):
1257
  compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
1258
+ cache_lens = None
1259
+ if max_seqlen_in_batch_q == 1 and max_seqlen_in_batch_k > 1:  # decoding
1260
+ seq_lens_k = cu_seqlens_k[1:] - cu_seqlens_k[:-1]
1261
+ cache_lens = seq_lens_k - 1
1262
+
1263
  topk_idx = compressed_attention(
1264
  query_layer if no_rope_param is None else no_rope_param['query_states_no_rope'],
1265
  compressed_k,
1266
+ compressed_k.clone(),
1267
  self.kernel_size,
1268
  self.kernel_stride,
1269
  self.block_size,
 
1275
  None,
1276
  init_blocks=self.init_blocks,
1277
  local_blocks=self.local_blocks,
1278
+ cache_lens=cache_lens
1279
  )
 
1280
  topk_attn_output = infllmv2_attn_varlen_func(
1281
  query_layer,
1282
  key_layer,
 
1288
  dropout_p=0.0,
1289
  deterministic=False,
1290
  softmax_scale=None,
1291
+ causal=max_seqlen_in_batch_q != 1,
1292
  return_attn_probs=False,
1293
+ # block_window_size=self.window_size // self.block_size,
1294
  topk_idx=topk_idx
1295
  )
1296
 
1297
  return topk_attn_output
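The cumulative-length arithmetic used here, in isolation (the offsets are made up but consistent with `kernel_size=32`, `kernel_stride=16`):

```python
import torch

cu_seqlens_k = torch.tensor([0, 9000, 21000], dtype=torch.int32)         # two packed sequences
seq_lens_k = cu_seqlens_k[1:] - cu_seqlens_k[:-1]                        # tensor([ 9000, 12000])

max_seqlen_in_batch_q = 1                                                # decoding: one new query per sequence
cache_lens = seq_lens_k - 1 if max_seqlen_in_batch_q == 1 else None      # tokens already cached
print(cache_lens.tolist())                                               # [8999, 11999]

compressed_cu_seqlens = torch.tensor([0, 561, 1310], dtype=torch.int32)  # (9000-32)//16+1 = 561, (12000-32)//16+1 = 749
compressed_seqlens = compressed_cu_seqlens[1:] - compressed_cu_seqlens[:-1]
print(compressed_seqlens.tolist())                                       # [561, 749]
```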
1298
 
1299
  def _flash_attention_forward_dense(
1300
  self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
1301
  ):
 
1434
  key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1435
  value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
1436
 
1437
+ kv_seq_len = position_ids.max().item() + 1
 
 
1438
  cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
1439
 
1440
  query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
 
1757
  raise ValueError(
1758
  'You must use the new past_key_values format, such as the Cache class, instead of the old tuple format.'
1759
  )
 
1760
 
1761
+ # Calculate the usable length of past key values
1762
+ past_key_values_length = past_key_values.get_seq_length() if isinstance(past_key_values, InfLLMv2Cache) else 0
1763
+
1764
+ # Initialize InfLLMv2Cache if needed
1765
  if self.config.sparse_config is not None and torch.cuda.is_available() and past_key_values_length == 0:
1766
+ past_key_values = InfLLMv2Cache(config=self.config, num_hidden_layers=self.config.num_hidden_layers)
1767
 
1768
  if position_ids is None:
1769
  device = input_ids.device if input_ids is not None else inputs_embeds.device
 
1989
  ):
1990
  if past_key_values is not None:
1991
  if isinstance(past_key_values, Cache):
1992
+ # Use the new Cache class methods
1993
  cache_length = past_key_values.get_seq_length()
1994
+
1995
+ if self.config.sparse_config is not None and torch.cuda.is_available() and cache_length == 0:
1996
+ past_key_values = InfLLMv2Cache(config=self.config, num_hidden_layers=self.config.num_hidden_layers)
1997
+ past_length = cache_length
1998
  max_cache_length = None
1999
+ else:
2000
+ raise ValueError(
2001
+ 'You must use the new past_key_values format, such as the Cache class, instead of the old tuple format.'
2002
+ )
2003
 
2004
  # Keep only the unprocessed tokens:
2005
  # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where