White paper: Effects of beam search on translation quality and resource consumption.
A Study of Quality/Resource Tradeoffs in AI Models
Executive Summary
This study evaluates the impact of beam search width on translation quality and resource consumption using Straker's Tiri J financial translation model (7B parameters, int8 quantization). Testing 2000 English→Japanese translation tasks across beam sizes 1-5, we measured quality using industry-standard metrics (BLEU, CHRF, BLEURT, TER) and tracked VRAM consumption on NVIDIA RTX 4090 hardware.
Key Findings
Quality Improvements:
- Beam search delivers 6-9% improvement in BLEU/CHRF scores over greedy decoding
- Peak quality is achieved at beam size 5, but beam size 2 already captures 93% of the maximum quality gain
- Diminishing returns are evident: beams beyond #2 add at most ~0.5% net quality over beam size 2
Resource Costs:
- VRAM consumption scales approximately linearly: +7-10% per additional beam
- Beam size 2 requires only 10% additional VRAM for 8% quality improvement
- Beam size 5 demands 33%+ more VRAM than greedy decoding
Optimal Configuration:
- Production environments: Beam size 2 provides best cost/quality balance
- Research applications: Beam size 5 maximizes metric scores
- Resource-constrained deployments: Greedy decoding (beam size 1) for <16GB VRAM
Business Impact
The analysis reveals that beam size 2 is the sweet spot for production deployment, delivering a substantial quality improvement (8% BLEU gain) for minimal resource overhead (10% more VRAM). Organizations can capture 93% of the maximum quality gain while keeping resource utilization efficient, making this configuration ideal for scalable production environments.
1. Introduction
1.1 Hardware Configuration
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
1.2 Model Architecture
- Base Model: tiri-j-fin-7b-v1.1 (7B-parameter transformer) - an early Tiri model, no longer in use.
- Specialization: English→Japanese translation in the financial domain.
- Optimization: int8 quantization, reducing memory footprint by ~4x vs FP32
- Batch Size: Fixed at 8 sequences
1.3 Beam Search Fundamentals
Beam search is a search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is widely used for decoding in sequence-to-sequence tasks such as machine translation.
Breadth-First Nature with Limitations: Beam search resembles breadth-first search, which expands every possible node at each step, but introduces a parameter called "beam width" or "beam size" that caps how many nodes are explored. This lets the algorithm balance efficiency against result quality.
Beam Size: Beam size is a hyperparameter that determines how many of the most probable partial solutions (nodes) are kept at each step. For instance, a beam size of 3 means that at most three partial solutions are kept and expanded at each step.
Step-by-Step Expansion:
- Start with an initial node or partial solution.
- Expand this partial solution by creating all possible next steps.
- Evaluate each new partial solution using a scoring function, often a probability for sequence tasks.
- Keep only the top "beam size" number of partial solutions with the highest scores.
- Repeat the process with the surviving partial solutions until a complete solution is found or a certain condition is met.
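The steps above can be sketched in a few lines of Python. The toy next-token distributions below are invented for illustration; they are chosen so that greedy decoding (beam size 1) commits to the locally best first token and finishes with a lower-probability sentence than beam size 2, which keeps the alternative branch alive.

```python
import math

# Toy next-token distributions, invented for illustration. Greedy decoding
# commits to the locally best first token ("A") and ends with a
# lower-probability sentence than beam size 2, which also keeps "B" alive.
NEXT = {
    (): {"A": 0.55, "B": 0.45},
    ("A",): {"C": 0.5, "<eos>": 0.5},
    ("A", "C"): {"<eos>": 1.0},
    ("B",): {"<eos>": 1.0},
}

def beam_search(beam_size, max_steps=5):
    # Each beam entry is (token sequence, cumulative log-probability).
    beams = [((), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":        # finished: carry over as-is
                candidates.append((seq, logp))
                continue
            for tok, p in NEXT[seq].items():      # expand every continuation
                candidates.append((seq + (tok,), logp + math.log(p)))
        # Prune: keep only the `beam_size` highest-scoring partial solutions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_lp = beam_search(beam_size=1)   # follows the "A" branch
beam2_seq, beam2_lp = beam_search(beam_size=2)     # recovers ("B", "<eos>")
```

With these numbers, greedy decoding ends with a sentence of probability 0.275, while beam size 2 finds ("B", "\<eos\>") with probability 0.45, mirroring why wider beams can score better in the results below.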
The tests below outline the effect that beam size has on translation quality. We ran our Tiri J model on 2000 previously unseen translation tasks, scored the output with industry-standard translation quality metrics, and correlated quality against resource consumption.
1.4 Metrics Explained
1. BLEU (Bilingual Evaluation Understudy)
- Measures n-gram precision against reference translations
- Range: 0-100 (higher = better)
- Correlates well with human judgment at corpus level
2. CHRF (Character n-gram F-score)
- Evaluates character-level similarity using F-score
- Range: 0-100 (higher = better)
- Particularly effective for languages with complex morphology (e.g., Japanese)
3. BLEURT
- Learned metric using pre-trained BERT models
- Range: 0-1 (higher = better)
- Captures semantic similarity beyond surface forms
4. TER (Translation Edit Rate)
- Measures edit distance (insertions/deletions/substitutions) needed to match reference
- Range: 0-100 (lower = better)
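To make TER's edit-counting concrete, here is a minimal word-level sketch. Note the assumption: full TER (e.g. as implemented in sacrebleu) additionally allows block shifts of whole phrases, which this simplified version omits.

```python
def simple_ter(hypothesis, reference):
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) divided by reference length, as a percentage.
    Full TER additionally allows block shifts, which this sketch omits."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))           # edit-distance DP, row by row
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,       # delete hypothesis word
                            curr[j - 1] + 1,   # insert reference word
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return 100 * prev[-1] / len(ref)
```

A perfect match scores 0; one extra word against a six-word reference scores 100/6 ≈ 16.7.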
2. Results
2.1 Performance Metrics Table With % Change Between Beams
Table 1: Absolute Scores
Beams | BLEU | CHRF | BLEURT | TER |
---|---|---|---|---|
1 | 54.24 | 58.92 | 0.834 | 68.15 |
2 | 58.58 | 62.48 | 0.843 | 70.51 |
3 | 58.80 | 62.64 | 0.843 | 70.07 |
4 | 58.20 | 62.17 | 0.841 | 70.26 |
5 | 58.90 | 62.74 | 0.844 | 70.84 |
Table 2: VRAM Requirements
Beams | VRAM (GB) | % Increase vs Greedy |
---|---|---|
1 | 15.0 | 0% |
2 | 16.5 | +10% |
3 | 17.5 | +16.67% |
4 | 18.5 | +23.33% |
5 | 20+ | ≥33.33% |
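The percentage column is simply each configuration's overhead relative to the greedy baseline and can be reproduced directly (the Beam 5 reading is a lower bound):

```python
# VRAM readings from Table 2, in GB (the Beam 5 figure is a lower bound).
vram_gb = {1: 15.0, 2: 16.5, 3: 17.5, 4: 18.5, 5: 20.0}

baseline = vram_gb[1]  # greedy decoding
increase = {beams: 100 * (gb - baseline) / baseline
            for beams, gb in vram_gb.items()}
# increase[2] -> 10.0, increase[5] -> 33.33...
```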
Table 3: Calculated Performance
Beams | BLEU | Δ% vs Prev | CHRF | Δ% vs Prev | BLEURT | Δ% vs Prev | VRAM (GB) | Δ% vs Prev |
---|---|---|---|---|---|---|---|---|
1 | 54.24 | - | 58.92 | - | 0.834 | - | 15.0 | - |
2 | 58.58 | +8.00% | 62.48 | +6.04% | 0.843 | +1.08% | 16.5 | +10.00% |
3 | 58.80 | +0.38% | 62.64 | +0.26% | 0.843 | 0.00% | 17.5 | +6.06% |
4 | 58.20 | -1.02% | 62.17 | -0.75% | 0.841 | -0.24% | 18.5 | +5.71% |
5 | 58.90 | +1.20% | 62.74 | +0.92% | 0.844 | +0.36% | 20.0+ | ≥+8.11% |
2.2 Key Observations From Metrics
- BLEU/CHRF show 6-9% gains from Beam 1→5
- TER worsens with beam search (~4% degradation), even as BLEU/CHRF improve
- At this batch size, VRAM scales roughly linearly (+1-1.5GB per beam), reaching ≥+33% over greedy at Beam 5
2.3 Beam Transition Analysis
- 1→2: 8% BLEU gain for 10% VRAM increase
- 2→3: <0.5% quality gains despite 6% VRAM growth
- 4→5: 1.2% BLEU recovery requires 8%+ VRAM
2.4 Critical Thresholds
- Quality Ceiling: Beam 2 captures 93% of the maximum BLEU gain for only 10% extra VRAM (vs ≥33% at Beam 5)
- Negative ROI Zone: Beam 4 reduces quality while increasing resource use
- Diminishing Returns: beams beyond #2 add at most ~0.5% net BLEU over Beam 2
2.5 TER Paradox Analysis
While beam search improves n-gram matching metrics (BLEU/CHRF), its tendency to produce longer translations increases edit distance through two mechanisms:
- Insertion Penalties: Additional words require deletions to match reference length
- Word-Order Differences: Longer outputs create more opportunities for mismatches in how words are arranged
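The first mechanism puts a hard floor under TER: length mismatch alone guarantees a minimum edit count, regardless of word choice. As a one-line sketch:

```python
def ter_length_floor(hyp_len, ref_len):
    """Lower bound on TER (%) from length mismatch alone: every surplus
    hypothesis word forces at least one deletion (and every missing word
    at least one insertion), whatever the actual word choices are."""
    return 100 * abs(hyp_len - ref_len) / ref_len

# A hypothesis 3 words longer than a 10-word reference starts at TER >= 30%
# before a single substitution or word-order error is counted.
```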
3. Visual Analysis
3.1 Quality-VRAM Tradeoff Curve
Peak BLEU requires 33% more VRAM than baseline, while CHRF plateaus after Beam 3
3.2 Key Observations from Visualization
- Non-Linear Scaling: Quality metrics plateau after Beam 2 while VRAM continues linear growth
- Metric Consensus: High inter-metric correlation (BLEU/CHRF/BLEURT) validates their reliability in these tests
- Beam 4 Anomaly: Visible as dip in Figure 1's quality curves despite VRAM increase
4. Practical Implications
Deployment Recommendations
Use Case | Optimal Beams | Rationale |
---|---|---|
Basic Production | 2 | Best cost/quality balance |
Research | 5 | Maximizes metric scores |
Mobile Deployment | 1 | Only configuration <16GB VRAM |
Hardware Planning Guidelines
- VRAM Budgeting: Each additional beam requires roughly 1-1.5GB (~7-10%) of VRAM headroom on this architecture
- Batch Size Warning: Doubling batch size from 8→16 would require ~30GB VRAM at Beam 5
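Both guidelines follow from a simple linear memory model: the weights are a fixed cost, and everything above them scales with batch size. The 7GB weight figure and the `projected_vram` helper below are illustrative assumptions (7B parameters at int8), not measurements from this study:

```python
# Assumption (not measured in this study): the 7B int8 weights occupy ~7GB,
# a fixed cost that does not grow with batch size.
WEIGHTS_GB = 7.0

def projected_vram(measured_gb, batch_scale):
    """Project VRAM for a scaled batch size, assuming everything above the
    fixed weights (activations, beam hypotheses, KV cache) grows linearly
    with batch size."""
    activations = measured_gb - WEIGHTS_GB
    return WEIGHTS_GB + activations * batch_scale

# Beam 5 measured ~20GB at batch 8; doubling the batch projects to
# 7 + 13 * 2 = 33GB, in the same ballpark as the ~30GB estimate above.
```

This is a deliberately crude model; memory fragmentation and allocator overhead push real usage higher, so budget extra headroom beyond the projection.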
5. Conclusion
This study reveals three fundamental tradeoffs in beam search optimization:
- Quality Gains: Beam search can improve BLEU/CHRF by 6-9%
- Resource Costs: Each beam increases VRAM consumption by 7-10%
- Metric Conflicts: TER behavior in this study suggests beam search produces "different but valid" translations
While beam search enhances translation quality, metric improvements must be weighed against resource costs. Beam 2 emerges as the optimal choice for Tiri Translation applications.
[Link to Tiri: https://www.straker.ai/products/tiri]