---
language:
- en
- ja
license: llama3.1
pipeline_tag: text-generation
model_type: llama
datasets:
- bigcode/the-stack-v2
- bigcode/jupyter-code-text-pairs
- bigcode/the-stack-github-issues
tags:
- llama-3
- code
---
# Llama 3.1 Future Code Ja
Llama 3.1 Future Code Ja is a large language model with 8B parameters built on top of the [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B) model.
The model was first continually pre-trained on a mixture of code and mostly-Japanese natural language data.
The training data comes mainly from [The Stack V2 dataset](https://huggingface.co/datasets/bigcode/the-stack-v2) and a subset of the [LLM-jp Corpus v3](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3), comprising 204.9B code tokens and 85.7B natural language tokens after carefully designed data cleaning.
The model was then merged with the instruct variant of the Meta Llama 3.1 model to acquire the ability to follow general task instructions, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO) on our own Magpie-generated code instruction data.
The model officially supports Japanese and English as natural languages and more than 40 programming languages, ranging from popular ones such as Python and Java to legacy languages such as COBOL.
In addition to causal (left-to-right) inference, the model supports Fill-in-the-Middle (FIM) capability, where the model fills in a blank by attending to bidirectional context, a common use case in IDEs.
The model outperforms the original Llama 3.1 model in both Japanese- and English-instructed code completion tasks across various programming languages, and outperforms the Qwen model family in Japanese generation tasks, attaining a good balance between its specialty in code-related tasks and its general ability in Japanese.
## Usage
Here are sample inference scripts using the `transformers` library.
We recommend using [vLLM](https://github.com/vllm-project/vllm) for faster inference; a minimal vLLM sketch is shown after the chat example below.
```bash
pip install torch transformers accelerate
```
### Chat
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# we recommend using the following system prompt:
# for Japanese completion : "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
# for English completion : "You are an AI assistant who supports various software development tasks."
message = [
{
"role": "system",
"content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
},
{
"role": "user",
"content": "PythonでFizzBuzzを書いてください。",
},
]
input_ids = tokenizer.apply_chat_template(
message, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
```
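As noted above, vLLM can serve the model for faster inference. Below is a minimal sketch of the same chat request with vLLM, assuming a recent vLLM version that provides `LLM.chat`; the decoding values are illustrative, not official recommendations.
```python
from vllm import LLM, SamplingParams

model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"

# vLLM loads the model once and batches requests for high-throughput generation
llm = LLM(model=model_name, dtype="bfloat16")
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=1024)

messages = [
    {
        "role": "system",
        "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。",
    },
    {
        "role": "user",
        "content": "PythonでFizzBuzzを書いてください。",
    },
]

# LLM.chat applies the model's chat template internally
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```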
### Fill-in-the-Middle
**Because users typically do not want a line break inserted immediately after their cursor position, we did not create any middle splits that start with a newline symbol (`\n`); instead, such newlines were included at the end of the prefix.**
**The same applies to the boundary between the suffix and middle splits, which makes the model quite sensitive to which split the newline symbols belong to.**
**For better performance, please remove a single newline symbol (if present) from the beginning of the suffix.**
You may also set a larger repetition penalty to avoid nonsensical generations that repeat symbols.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"
model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# prepend <|begin_of_text|> to mark the beginning of the content itself (not of the whole sequence including the FIM special tokens)
prefix = "<|begin_of_text|>def fizzbuzz(n"
suffix = "return n"
# PSM mode (infilling)
input_txt = FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE
# SPM mode (reverse infilling)
# input_txt = FIM_PREFIX + FIM_SUFFIX + suffix + FIM_MIDDLE + prefix
# set add_special_tokens to False, so that the tokenizer does NOT add <|begin_of_text|> before special tokens
input_ids = tokenizer(input_txt, add_special_tokens=False, return_tensors="pt").to(model.device)
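# as noted above, you may also pass repetition_penalty (e.g. 1.1; value is illustrative) to generate() to suppress runs of repeated symbols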
output = model.generate(**input_ids, do_sample=True, max_new_tokens=1024, temperature=0.2, top_p=0.95)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
```
## Model Performance
### Code completion (Japanese)
- [JHumanEval](https://huggingface.co/datasets/kogi-jwu/jhumaneval) (Sato et al., 2024)
- [JMultiPL-E](https://huggingface.co/datasets/tohoku-nlp/JMultiPL-E) (Taneguchi et al., 2025)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
| model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
|--------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|---------|--------|---------|--------|
| Llama 3.1 Future Code Ja | 8B | 0.6335 | 0.5267 | 0.3633 | 0.1564 | 0.6286 | 0.4696 | 0.5528 | 0.4814 | 0.2919 | 0.2969 | 0.1870 | 0.4487 | 0.4425 | 0.3285 | 0.3861 | 0.5623 |
| Llama 3.1 | 8B | 0.5061 | 0.4391 | 0.2835 | 0.2147 | 0.5519 | 0.3753 | 0.4640 | 0.4248 | 0.2584 | 0.2360 | 0.3112 | 0.3269 | 0.3175 | 0.2665 | 0.3323 | 0.4799 |
| Llama 3.1 Swallow | 8B | 0.4213 | 0.3329 | 0.2456 | 0.1026 | 0.6370 | 0.3468 | 0.3112 | 0.3273 | 0.1758 | 0.1807 | 0.0503 | 0.2090 | 0.2487 | 0.1525 | 0.2354 | 0.3258 |
| Qwen2.5 | 7B | 0.6018 | 0.5106 | 0.3601 | 0.2353 | 0.7500 | 0.5044 | 0.5416 | 0.5267 | 0.3075 | 0.3466 | 0.3683 | 0.5071 | 0.3969 | 0.3380 | 0.4576 | 0.6025 |
| Qwen2.5-Coder | 7B | 0.6695 | 0.6379 | 0.4601 | 0.1660 | 0.7110 | 0.5468 | 0.6696 | 0.5894 | 0.3497 | 0.4174 | 0.3565 | 0.6032 | 0.4950 | 0.3544 | 0.5285 | 0.6358 |
| Qwen3 | 8B | 0.6256 | 0.5683 | 0.3709 | 0.1583 | 0.5156 | 0.4778 | 0.5814 | 0.5547 | 0.3969 | 0.2466 | 0.3217 | 0.4763 | 0.4075 | 0.3418 | 0.3715 | 0.5239 |
| Gemma 2 | 9B | 0.5549 | 0.4590 | 0.3608 | 0.0897 | 0.7052 | 0.4601 | 0.2863 | 0.4733 | 0.1099 | 0.1615 | 0.1205 | 0.3417 | 0.3850 | 0.1209 | 0.3272 | 0.2346 |
### Code completion (English)
- [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) (Chen et al., 2021)
- [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) (Cassano et al., 2022)
Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.
| model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
|--------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|---------|--------|---------|--------|
| Llama 3.1 Future Code Ja | 8B | 0.6835 | 0.5795 | 0.3829 | 0.1692 | 0.6279 | 0.4987 | 0.6149 | 0.5565 | 0.3652 | 0.3317 | 0.1752 | 0.4846 | 0.4662 | 0.3595 | 0.4525 | 0.6390 |
| Llama 3.1 | 8B | 0.6311 | 0.4795 | 0.3184 | 0.2083 | 0.5909 | 0.4715 | 0.5571 | 0.4658 | 0.3236 | 0.2696 | 0.4267 | 0.3744 | 0.3856 | 0.2994 | 0.3741 | 0.5717 |
| Llama 3.1 Swallow | 8B | 0.4701 | 0.3720 | 0.2646 | 0.1224 | 0.6519 | 0.3759 | 0.3006 | 0.3733 | 0.1752 | 0.1447 | 0.0590 | 0.2103 | 0.2744 | 0.1614 | 0.2190 | 0.3786 |
| Qwen2.5 | 7B | 0.6732 | 0.5491 | 0.4253 | 0.2455 | 0.7000 | 0.6013 | 0.6137 | 0.5913 | 0.3373 | 0.3832 | 0.4429 | 0.5923 | 0.4263 | 0.3715 | 0.5095 | 0.6535 |
| Qwen2.5-Coder | 7B | 0.7890 | 0.7373 | 0.5152 | 0.1936 | 0.3935 | 0.6184 | 0.7385 | 0.6528 | 0.3969 | 0.4224 | 0.4230 | 0.6545 | 0.5725 | 0.4158 | 0.5797 | 0.7434 |
| Qwen3 | 8B | 0.7134 | 0.6702 | 0.4285 | 0.2295 | 0.4721 | 0.5747 | 0.6602 | 0.6236 | 0.4441 | 0.3627 | 0.4261 | 0.6154 | 0.5363 | 0.4089 | 0.4304 | 0.6082 |
| Gemma 2 | 9B | 0.6128 | 0.5118 | 0.3728 | 0.1045 | 0.6552 | 0.4791 | 0.3758 | 0.4863 | 0.0783 | 0.1186 | 0.0795 | 0.3853 | 0.4162 | 0.1437 | 0.3506 | 0.3723 |
### Fill-in-the-Middle
- [SantaCoder-FIM](https://huggingface.co/datasets/bigcode/santacoder-fim-task) (Allal et al., 2023)
Note: The models marked with an asterisk (*) do not support FIM. We used the SPM prompt from [Gong et al., 2024](https://arxiv.org/pdf/2403.04814) and truncated the generated output just before the point where it matched the beginning of the provided suffix. The PSM-mode scores of the Llama models are not reported here because they were close to 0 in all of those settings. All the scores below are exact match (EM) with 1 trial.
| model | size | PSM (py) | SPM (py) | PSM (js) | SPM (js) | PSM (java) | SPM (java) |
|--------------------------|--------|------------|------------|------------|------------|--------------|--------------|
| Llama 3.1 Future Code Ja | 8B | 0.5216 | 0.5139 | 0.6018 | 0.6049 | 0.5517 | 0.5478 |
| Qwen2.5-Coder | 7B | 0.5829 | 0.4084 | 0.6612 | 0.5597 | 0.6433 | 0.6180 |
| Llama 3.1 8B * | 8B | - | 0.4468 | - | 0.3951 | - | 0.3506 |
| Llama 3.1 70B * | 70B | - | 0.5964 | - | 0.5084 | - | 0.2910 |
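For reference, here is a minimal sketch of the truncation heuristic described in the note above; the function name and the probe length are our own illustrative choices, not part of the evaluation toolkit.
```python
def truncate_at_suffix(generated: str, suffix: str, probe_len: int = 10) -> str:
    """Cut the generated middle just before the first match of the beginning of the suffix."""
    probe = suffix[:probe_len]
    if not probe:
        return generated
    idx = generated.find(probe)
    return generated[:idx] if idx != -1 else generated
```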
### Japanese tasks
- JCommonsenseQA (Kurihara et al., 2022, Exact Match)
- JEMHopQA (Ishii et al., 2024, chr-F1)
- NIILC (Sekine, 2003, chr-F1)
- JSQuAD (Kurihara et al., 2022, chr-F1)
- XL-Sum (Hasan et al., 2021, ROUGE-2)
- MGSM (Shi et al., 2023, Exact Match)
- WMT20 en-ja (Barrault et al., 2020, BLEU)
- WMT20 ja-en (Barrault et al., 2020, BLEU)
| model | size | JCommonsenseQA | JEMHopQA | NIILC | JSQuAD | XL-Sum | MGSM | WMT20 en-ja | WMT20 ja-en |
|--------------------------|--------|------------------|------------|---------|----------|----------|--------|---------------|---------------|
| Llama 3.1 Future Code Ja | 8B | 0.9124 | 0.4983 | 0.5118 | 0.8758 | 0.1779 | 0.5480 | 0.2624 | 0.2028 |
| Llama 3.1 | 8B | 0.8829 | 0.4537 | 0.4050 | 0.8868 | 0.1486 | 0.5080 | 0.2195 | 0.2008 |
| Llama 3.1 Swallow | 8B | 0.9240 | 0.5228 | 0.5805 | 0.8957 | 0.1920 | 0.5480 | 0.2818 | 0.2263 |
| Qwen2.5 | 7B | 0.9142 | 0.4394 | 0.3998 | 0.8908 | 0.1690 | 0.6240 | 0.2091 | 0.1909 |
| Qwen2.5-Coder | 7B | 0.8472 | 0.3014 | 0.3045 | 0.8906 | 0.1533 | 0.5360 | 0.1816 | 0.1598 |
| Qwen3 | 8B | 0.9169 | 0.4265 | 0.4197 | 0.8943 | 0.1882 | 0.7720 | 0.2450 | 0.2133 |
| Gemma 2 | 9B | 0.9312 | 0.5288 | 0.5306 | 0.8774 | 0.0873 | 0.4680 | 0.2305 | 0.2017 |
### English tasks
- TriviaQA (Joshi et al., 2017, Exact Match)
- SQuAD2 (Rajpurkar et al., 2018, Exact Match)
- GSM8K (Cobbe et al., 2021, Exact Match)
| model | size | TriviaQA | SQuAD2 | GSM8K |
|--------------------------|--------|------------|----------|---------|
| Llama 3.1 Future Code Ja | 8B | 0.6233 | 0.3754 | 0.7111 |
| Llama 3.1 | 8B | 0.6991 | 0.3784 | 0.7475 |
| Llama 3.1 Swallow | 8B | 0.6296 | 0.3628 | 0.6126 |
| Qwen2.5 | 7B | 0.5176 | 0.2624 | 0.7430 |
| Qwen2.5-Coder | 7B | 0.4517 | 0.3388 | 0.7020 |
| Qwen3 | 8B | 0.5631 | 0.3922 | 0.8749 |
| Gemma 2 | 9B | 0.6573 | 0.3944 | 0.7908 |
### Evaluation Details
We used the [Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) toolkit to evaluate code completion and FIM capabilities.
We adopted the settings below for decoding.
We mostly followed the recommended settings; however, we set `max_new_tokens` instead of `max_tokens` to avoid output truncation when handling long input sequences.
- Temperature: 0.2
- Top-p: 0.95
- Number of completions to generate: 10 (for completion tasks), 1 (for FIM tasks)
- Maximum number of new tokens: 512
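For illustration, these settings roughly correspond to the following `generate` arguments in `transformers` (a sketch only; the actual evaluation was run through the harness toolkits named in this section, not through this snippet):
```python
generation_kwargs = dict(
    do_sample=True,           # sampling, with the temperature / top-p values below
    temperature=0.2,
    top_p=0.95,
    num_return_sequences=10,  # 10 completions for completion tasks, 1 for FIM tasks
    max_new_tokens=512,       # counts only generated tokens, so long inputs do not eat into the budget
)
# e.g. output = model.generate(**inputs, **generation_kwargs)
```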
We followed the evaluation strategy adopted in the Swallow project for Japanese and English tasks.
More specifically, we used the [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) toolkit for Japanese tasks and the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) toolkit for English (and some Japanese) tasks.
We adopted the default decoding strategy for all the tasks.
## Risks and Limitations
The model is trained on general software development tasks, not on organization-specific and/or non-standardized tasks.
We recommend further fine-tuning the model to make it work better on such tasks.
The model may produce incorrect output, and all suggestions from the model must be carefully examined before being adopted in real-world applications.
## Acknowledgements
The model is developed as part of the Generative AI Accelerator Challenge (GENIAC) project.
We are grateful to the New Energy and Industrial Technology Development Organization (NEDO) and the Ministry of Economy, Trade and Industry (METI) for their financial support.
## Contact
- pj-geniac at future.co.jp
## License
[META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/)
Copyright © 2025 by Future Corporation