---
language:
- en
- ja
license: llama3.1
pipeline_tag: text-generation
model_type: llama
datasets:
- bigcode/the-stack-v2
- bigcode/jupyter-code-text-pairs
- bigcode/the-stack-github-issues
tags:
- llama-3
- code
---

# Llama 3.1 Future Code Ja

Llama 3.1 Future Code Ja is a large language model with 8B parameters built on top of the [Meta Llama 3.1](https://huggingface.co/meta-llama/Llama-3.1-8B) model.
The model first underwent continual pre-training on a mixture of code and mostly-Japanese natural language data.
The training data mainly comes from [The Stack V2 dataset](https://huggingface.co/datasets/bigcode/the-stack-v2) and a subset of the [LLM-jp Corpus v3](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v3), comprising 204.9B code tokens and 85.7B natural language tokens after carefully designed data cleaning.
The model was then merged with the instruct variant of the Meta Llama 3.1 model to acquire the ability to follow general task instructions, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO) on our own Magpie-generated code instruction data.

The model officially supports Japanese and English as natural languages, as well as more than 40 programming languages, ranging from popular ones such as Python and Java to legacy languages such as COBOL.
In addition to causal (left-to-right) inference, the model supports Fill-in-the-Middle (FIM) generation, where it fills in a blank while attending to context on both sides, a common use case in IDEs.

The model outperforms the original Llama 3.1 model on both Japanese- and English-instructed code completion tasks across various programming languages, and outperforms the Qwen families on Japanese generation tasks, attaining a good balance between code-related specialization and general Japanese ability.

## Usage

Below are sample inference scripts using transformers. We recommend using [vLLM](https://github.com/vllm-project/vllm) for faster inference.

```bash
pip install torch transformers accelerate
```

### Chat

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# we recommend using the following system prompts:
# for Japanese completion: "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"
# for English completion:  "You are an AI assistant who support various software development tasks."
message = [
    {
        "role": "system",
        "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。",
    },
    {
        "role": "user",
        "content": "PythonでFizzBuzzを書いてください。",
    },
]

input_ids = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

output = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
```
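If you prefer to serve the same chat example with vLLM, the following is a minimal sketch rather than an official script: it assumes a recent vLLM release that provides `LLM.chat` (installed with `pip install vllm`), and the sampling values are illustrative.

```python
# Minimal vLLM sketch for the chat example above (not an official script).
# Assumes a recent vLLM release that provides LLM.chat; install with `pip install vllm`.
from vllm import LLM, SamplingParams

model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
llm = LLM(model=model_name, dtype="bfloat16")

messages = [
    {"role": "system", "content": "あなたは様々なソフトウェア開発タスクをサポートするAIアシスタントです。"},
    {"role": "user", "content": "PythonでFizzBuzzを書いてください。"},
]

# LLM.chat applies the model's chat template internally before generation
sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=1024)
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```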
### Fill-in-the-Middle

**Because users typically do not want a line break inserted right after their cursor position, we did not create any middle splits that start with a newline symbol (`\n`); such newlines were appended to the end of the prefix instead.**
**The same holds for the boundary between the suffix and middle splits, which makes the model quite sensitive to which split the newline symbols are placed in.**
**For better performance, please remove one newline symbol (if present) from the beginning of the suffix, as illustrated in the sketch after the example below.**
You may also set a larger repetition penalty to avoid nonsensical generations consisting of repeated symbols.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

FIM_PREFIX = "<|fim_prefix|>"
FIM_MIDDLE = "<|fim_middle|>"
FIM_SUFFIX = "<|fim_suffix|>"

model_name = "future-architect/Llama-3.1-Future-Code-Ja-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# prepend <|begin_of_text|> to mark the beginning of the content
# (not of the whole sequence including special tokens)
prefix = "<|begin_of_text|>def fizzbuzz(n"
suffix = "return n"

# PSM mode (infilling)
input_txt = FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE

# SPM mode (reverse infilling)
# input_txt = FIM_PREFIX + FIM_SUFFIX + suffix + FIM_MIDDLE + prefix

# set add_special_tokens to False so that the tokenizer does NOT add <|begin_of_text|> before the special tokens
input_ids = tokenizer(input_txt, add_special_tokens=False, return_tensors="pt").to(model.device)

# enable sampling so that temperature and top_p take effect
output = model.generate(**input_ids, max_new_tokens=1024, do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
```
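To make the newline handling above concrete, here is a small sketch (not part of the official example) that reuses `tokenizer`, `model`, and the FIM tokens defined in the block above; the suffix string and the `repetition_penalty` value are illustrative assumptions.

```python
# Continues the FIM example above: tokenizer, model, FIM_PREFIX, FIM_SUFFIX and
# FIM_MIDDLE are assumed to be defined as in the previous block.
prefix = "<|begin_of_text|>def fizzbuzz(n):\n    "
suffix = "\n    return result"  # a suffix taken from an editor often starts with a newline

# as recommended above, drop a single leading newline (if present) from the suffix
if suffix.startswith("\n"):
    suffix = suffix[1:]

input_txt = FIM_PREFIX + prefix + FIM_SUFFIX + suffix + FIM_MIDDLE
input_ids = tokenizer(input_txt, add_special_tokens=False, return_tensors="pt").to(model.device)

# a repetition penalty slightly above 1.0 discourages degenerate outputs made of repeated symbols;
# 1.05 is an illustrative value, not an officially recommended one
output = model.generate(
    **input_ids,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.05,
)
print(tokenizer.decode(output[0, input_ids["input_ids"].shape[1]:]))
```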
## Model Performance

### Code completion (Japanese)

- [JHumanEval](https://huggingface.co/datasets/kogi-jwu/jhumaneval) (Sato et al., 2024)
- [JMultiPL-E](https://huggingface.co/datasets/tohoku-nlp/JMultiPL-E) (Taneguchi et al., 2025)

Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.

| model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 Future Code Ja | 8B | 0.6335 | 0.5267 | 0.3633 | 0.1564 | 0.6286 | 0.4696 | 0.5528 | 0.4814 | 0.2919 | 0.2969 | 0.1870 | 0.4487 | 0.4425 | 0.3285 | 0.3861 | 0.5623 |
| Llama 3.1 | 8B | 0.5061 | 0.4391 | 0.2835 | 0.2147 | 0.5519 | 0.3753 | 0.4640 | 0.4248 | 0.2584 | 0.2360 | 0.3112 | 0.3269 | 0.3175 | 0.2665 | 0.3323 | 0.4799 |
| Llama 3.1 Swallow | 8B | 0.4213 | 0.3329 | 0.2456 | 0.1026 | 0.6370 | 0.3468 | 0.3112 | 0.3273 | 0.1758 | 0.1807 | 0.0503 | 0.2090 | 0.2487 | 0.1525 | 0.2354 | 0.3258 |
| Qwen2.5 | 7B | 0.6018 | 0.5106 | 0.3601 | 0.2353 | 0.7500 | 0.5044 | 0.5416 | 0.5267 | 0.3075 | 0.3466 | 0.3683 | 0.5071 | 0.3969 | 0.3380 | 0.4576 | 0.6025 |
| Qwen2.5-Coder | 7B | 0.6695 | 0.6379 | 0.4601 | 0.1660 | 0.7110 | 0.5468 | 0.6696 | 0.5894 | 0.3497 | 0.4174 | 0.3565 | 0.6032 | 0.4950 | 0.3544 | 0.5285 | 0.6358 |
| Qwen3 | 8B | 0.6256 | 0.5683 | 0.3709 | 0.1583 | 0.5156 | 0.4778 | 0.5814 | 0.5547 | 0.3969 | 0.2466 | 0.3217 | 0.4763 | 0.4075 | 0.3418 | 0.3715 | 0.5239 |
| Gemma 2 | 9B | 0.5549 | 0.4590 | 0.3608 | 0.0897 | 0.7052 | 0.4601 | 0.2863 | 0.4733 | 0.1099 | 0.1615 | 0.1205 | 0.3417 | 0.3850 | 0.1209 | 0.3272 | 0.2346 |

### Code completion (English)

- [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) (Chen et al., 2021)
- [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E) (Cassano et al., 2022)

Note: We do not report scores for two programming languages (Julia and Racket), which we did not include in the training data. All the scores below are pass@1 with 10 trials.

| model | size | py | cpp | cs | d | go | java | js | php | pl | r | rb | rs | scala | sh | swift | ts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 Future Code Ja | 8B | 0.6835 | 0.5795 | 0.3829 | 0.1692 | 0.6279 | 0.4987 | 0.6149 | 0.5565 | 0.3652 | 0.3317 | 0.1752 | 0.4846 | 0.4662 | 0.3595 | 0.4525 | 0.6390 |
| Llama 3.1 | 8B | 0.6311 | 0.4795 | 0.3184 | 0.2083 | 0.5909 | 0.4715 | 0.5571 | 0.4658 | 0.3236 | 0.2696 | 0.4267 | 0.3744 | 0.3856 | 0.2994 | 0.3741 | 0.5717 |
| Llama 3.1 Swallow | 8B | 0.4701 | 0.3720 | 0.2646 | 0.1224 | 0.6519 | 0.3759 | 0.3006 | 0.3733 | 0.1752 | 0.1447 | 0.0590 | 0.2103 | 0.2744 | 0.1614 | 0.2190 | 0.3786 |
| Qwen2.5 | 7B | 0.6732 | 0.5491 | 0.4253 | 0.2455 | 0.7000 | 0.6013 | 0.6137 | 0.5913 | 0.3373 | 0.3832 | 0.4429 | 0.5923 | 0.4263 | 0.3715 | 0.5095 | 0.6535 |
| Qwen2.5-Coder | 7B | 0.7890 | 0.7373 | 0.5152 | 0.1936 | 0.3935 | 0.6184 | 0.7385 | 0.6528 | 0.3969 | 0.4224 | 0.4230 | 0.6545 | 0.5725 | 0.4158 | 0.5797 | 0.7434 |
| Qwen3 | 8B | 0.7134 | 0.6702 | 0.4285 | 0.2295 | 0.4721 | 0.5747 | 0.6602 | 0.6236 | 0.4441 | 0.3627 | 0.4261 | 0.6154 | 0.5363 | 0.4089 | 0.4304 | 0.6082 |
| Gemma 2 | 9B | 0.6128 | 0.5118 | 0.3728 | 0.1045 | 0.6552 | 0.4791 | 0.3758 | 0.4863 | 0.0783 | 0.1186 | 0.0795 | 0.3853 | 0.4162 | 0.1437 | 0.3506 | 0.3723 |

### Fill-in-the-Middle

- [SantaCoder-FIM](https://huggingface.co/datasets/bigcode/santacoder-fim-task) (Allal et al., 2023)

Note: The models marked with an asterisk (*) do not support FIM. We used the SPM prompt from [Gong et al., 2024](https://arxiv.org/pdf/2403.04814) and truncated the generated output just before the point that matched the beginning of the provided suffix. The PSM-mode scores of the Llama models are not reported here since they were close to 0 in all those settings. All the scores below are exact match (EM) with 1 trial.
| model | size | PSM (py) | SPM (py) | PSM (js) | SPM (js) | PSM (java) | SPM (java) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 Future Code Ja | 8B | 0.5216 | 0.5139 | 0.6018 | 0.6049 | 0.5517 | 0.5478 |
| Qwen2.5-Coder | 7B | 0.5829 | 0.4084 | 0.6612 | 0.5597 | 0.6433 | 0.6180 |
| Llama 3.1 8B * | 8B | - | 0.4468 | - | 0.3951 | - | 0.3506 |
| Llama 3.1 70B * | 70B | - | 0.5964 | - | 0.5084 | - | 0.2910 |

### Japanese tasks

- JCommonsenseQA (Kurihara et al., 2022, Exact Match)
- JEMHopQA (Ishii et al., 2024, chr-F1)
- NIILC (Sekine, 2003, chr-F1)
- JSQuAD (Kurihara et al., 2022, chr-F1)
- XL-Sum (Hasan et al., 2021, ROUGE-2)
- MGSM (Shi et al., 2023, Exact Match)
- WMT20 en-ja (Barrault et al., 2020, BLEU)
- WMT20 ja-en (Barrault et al., 2020, BLEU)

| model | size | JCommonsenseQA | JEMHopQA | NIILC | JSQuAD | XL-Sum | MGSM | WMT20 en-ja | WMT20 ja-en |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 Future Code Ja | 8B | 0.9124 | 0.4983 | 0.5118 | 0.8758 | 0.1779 | 0.5480 | 0.2624 | 0.2028 |
| Llama 3.1 | 8B | 0.8829 | 0.4537 | 0.4050 | 0.8868 | 0.1486 | 0.5080 | 0.2195 | 0.2008 |
| Llama 3.1 Swallow | 8B | 0.9240 | 0.5228 | 0.5805 | 0.8957 | 0.1920 | 0.5480 | 0.2818 | 0.2263 |
| Qwen2.5 | 7B | 0.9142 | 0.4394 | 0.3998 | 0.8908 | 0.1690 | 0.6240 | 0.2091 | 0.1909 |
| Qwen2.5-Coder | 7B | 0.8472 | 0.3014 | 0.3045 | 0.8906 | 0.1533 | 0.5360 | 0.1816 | 0.1598 |
| Qwen3 | 8B | 0.9169 | 0.4265 | 0.4197 | 0.8943 | 0.1882 | 0.7720 | 0.2450 | 0.2133 |
| Gemma 2 | 9B | 0.9312 | 0.5288 | 0.5306 | 0.8774 | 0.0873 | 0.4680 | 0.2305 | 0.2017 |

### English tasks

- TriviaQA (Joshi et al., 2017, Exact Match)
- SQuAD2 (Rajpurkar et al., 2018, Exact Match)
- GSM8K (Cobbe et al., 2021, Exact Match)

| model | size | TriviaQA | SQuAD2 | GSM8K |
|---|---|---|---|---|
| Llama 3.1 Future Code Ja | 8B | 0.6233 | 0.3754 | 0.7111 |
| Llama 3.1 | 8B | 0.6991 | 0.3784 | 0.7475 |
| Llama 3.1 Swallow | 8B | 0.6296 | 0.3628 | 0.6126 |
| Qwen2.5 | 7B | 0.5176 | 0.2624 | 0.7430 |
| Qwen2.5-Coder | 7B | 0.4517 | 0.3388 | 0.7020 |
| Qwen3 | 8B | 0.5631 | 0.3922 | 0.8749 |
| Gemma 2 | 9B | 0.6573 | 0.3944 | 0.7908 |

### Evaluation Details

We used the [Code Generation LM Evaluation Harness](https://github.com/bigcode-project/bigcode-evaluation-harness) toolkit to evaluate code completion and FIM capabilities, with the decoding settings below. We mostly followed the recommended settings; however, we set `max_new_tokens` instead of `max_tokens` to avoid truncation when handling long input sequences.

- Temperature: 0.2
- Top-p: 0.95
- Number of completions to generate: 10 (for completion tasks), 1 (for FIM tasks)
- Maximum number of new tokens: 512

We followed the evaluation strategy adopted in the Swallow project for Japanese and English tasks. More specifically, we used the [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) toolkit for Japanese tasks and the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) toolkit for English (and some Japanese) tasks. We adopted the default decoding strategy for all these tasks.

## Risks and Limitations

The model is trained on general software development tasks, not on organization-specific and/or non-standardized tasks.
We recommend further fine-tuning the model to improve its performance on such tasks.
The model may produce incorrect output, and all suggestions from the model must be carefully examined before being adopted in real-world applications.

## Acknowledgements

The model was developed as part of the Generative AI Accelerator Challenge (GENIAC) project. We are grateful to the New Energy and Industrial Technology Development Organization (NEDO) and the Ministry of Economy, Trade and Industry (METI) for their financial support.

## Contact

- pj-geniac at future.co.jp

## License

[META LLAMA 3.1 COMMUNITY LICENSE](https://www.llama.com/llama3_1/license/)

Copyright © 2025 by Future Corporation