White paper: Effects of beam search on translation quality and resource consumption.
A Study of Quality/Resource Tradeoffs in AI Models
Executive Summary
This study evaluates the impact of beam search width on translation quality and resource consumption using Straker's Tiri J financial translation model (7B parameters, int8 quantization). Testing 2000 English→Japanese translation tasks across beam sizes 1-5, we measured quality using industry-standard metrics (BLEU, CHRF, BLEURT, TER) and tracked VRAM consumption on NVIDIA RTX 4090 hardware.
Key Findings
Quality Improvements:
- Beam search delivers 6-9% improvement in BLEU/CHRF scores over greedy decoding
- Peak quality is achieved at beam size 5, but beam size 2 already captures 93% of the maximum quality gain
- Diminishing returns are evident: beams beyond #2 add at most ~0.5% net quality over beam size 2
Resource Costs:
- VRAM consumption scales approximately linearly: +7-10% per additional beam
- Beam size 2 requires only 10% additional VRAM for 8% quality improvement
- Beam size 5 demands 33%+ more VRAM than greedy decoding
Optimal Configuration:
- Production environments: Beam size 2 provides best cost/quality balance
- Research applications: Beam size 5 maximizes metric scores
- Resource-constrained deployments: Greedy decoding (beam size 1) for <16GB VRAM
Business Impact
The analysis reveals that beam size 2 is the sweet spot for production deployment, delivering a substantial quality improvement (8% BLEU gain) for minimal resource overhead (10% more VRAM). Organizations can capture 93% of the maximum quality gain while keeping resource utilization efficient, making this configuration ideal for scalable production environments.
1. Introduction
1.1 Hardware Configuration
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
1.2 Model Architecture
- Base Model: tiri-j-fin-7b-v1.1 (7B-parameter transformer) - an early Tiri model, no longer in use.
- Specialization: English→Japanese translation in the financial domain.
- Optimization: int8 quantization, reducing memory footprint by ~4x vs FP32
- Batch Size: Fixed at 8 sequences
1.3 Beam Search Fundamentals
Beam search is a search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is widely used for decoding in sequence-to-sequence tasks such as machine translation.
Breadth-First Nature with Limitations: Beam search resembles breadth-first search, which expands every possible node at each step, but introduces a parameter called "beam width" or "beam size" that caps how many nodes are explored. This lets the algorithm balance efficiency against result quality.
Beam Size: Beam size is a hyperparameter that determines how many of the most probable partial solutions (nodes) are kept at each step. For instance, a beam size of 3 means that at most three partial solutions are kept and expanded at each step.
Step-by-Step Expansion:
- Start with an initial node or partial solution.
- Expand this partial solution by creating all possible next steps.
- Evaluate each new partial solution using a scoring function, often a probability for sequence tasks.
- Keep only the top "beam size" number of partial solutions with the highest scores.
- Repeat the process with the surviving partial solutions until a complete solution is found or a certain condition is met.
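The steps above can be sketched in a few lines of Python. The toy next-token distributions below are invented for illustration; they are chosen so that greedy decoding (beam size 1) commits to the locally best first token and finishes with a lower-probability sentence than beam size 2, which keeps the alternative branch alive.

```python
import math

# Toy next-token distributions, invented for illustration. Greedy decoding
# commits to the locally best first token ("A") and ends with a
# lower-probability sentence than beam size 2, which also keeps "B" alive.
NEXT = {
    (): {"A": 0.55, "B": 0.45},
    ("A",): {"C": 0.5, "<eos>": 0.5},
    ("A", "C"): {"<eos>": 1.0},
    ("B",): {"<eos>": 1.0},
}

def beam_search(beam_size, max_steps=5):
    # Each beam entry is (token sequence, cumulative log-probability).
    beams = [((), 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":        # finished: carry over as-is
                candidates.append((seq, logp))
                continue
            for tok, p in NEXT[seq].items():      # expand every continuation
                candidates.append((seq + (tok,), logp + math.log(p)))
        # Prune: keep only the `beam_size` highest-scoring partial solutions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == "<eos>" for seq, _ in beams):
            break
    return beams[0]

greedy_seq, greedy_lp = beam_search(beam_size=1)   # follows the "A" branch
beam2_seq, beam2_lp = beam_search(beam_size=2)     # recovers ("B", "<eos>")
```

With these numbers, greedy decoding ends with a sentence of probability 0.275, while beam size 2 finds ("B", "\<eos\>") with probability 0.45, mirroring why wider beams can score better in the results below.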
The tests below outline the effect that beam size has on translation quality. We ran our Tiri J model on 2000 previously unseen translation tasks, scored the output with industry-standard translation quality metrics, and correlated quality against resource consumption.
1.4 Metrics Explained
1. BLEU (Bilingual Evaluation Understudy)
- Measures n-gram precision against reference translations
- Range: 0-100 (higher = better)
- Correlates well with human judgment at corpus level
2. CHRF (Character n-gram F-score)
- Evaluates character-level similarity using F-score
- Range: 0-100 (higher = better)
- Particularly effective for languages with complex morphology (e.g., Japanese)
3. BLEURT
- Learned metric using pre-trained BERT models
- Range: 0-1 (higher = better)
- Captures semantic similarity beyond surface forms
4. TER (Translation Edit Rate)
- Measures edit distance (insertions/deletions/substitutions) needed to match reference
- Range: 0-100 (lower = better)
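To make TER's edit-counting concrete, here is a minimal word-level sketch. Note the assumption: full TER (e.g. as implemented in sacrebleu) additionally allows block shifts of whole phrases, which this simplified version omits.

```python
def simple_ter(hypothesis, reference):
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) divided by reference length, as a percentage.
    Full TER additionally allows block shifts, which this sketch omits."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))           # edit-distance DP, row by row
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,       # delete hypothesis word
                            curr[j - 1] + 1,   # insert reference word
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return 100 * prev[-1] / len(ref)
```

A perfect match scores 0; one extra word against a six-word reference scores 100/6 ≈ 16.7.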
2. Results
2.1 Performance Metrics Table With % Change Between Beams
Table 1: Absolute Scores
Beams | BLEU | CHRF | BLEURT | TER |
---|---|---|---|---|
1 | 54.24 | 58.92 | 0.834 | 68.15 |
2 | 58.58 | 62.48 | 0.843 | 70.51 |
3 | 58.80 | 62.64 | 0.843 | 70.07 |
4 | 58.20 | 62.17 | 0.841 | 70.26 |
5 | 58.90 | 62.74 | 0.844 | 70.84 |
Table 2: VRAM Requirements
Beams | VRAM (GB) | % Increase vs Greedy |
---|---|---|
1 | 15.0 | 0% |
2 | 16.5 | +10% |
3 | 17.5 | +16.67% |
4 | 18.5 | +23.33% |
5 | 20+ | ≥33.33% |
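The percentage column is simply each configuration's overhead relative to the greedy baseline and can be reproduced directly (the Beam 5 reading is a lower bound):

```python
# VRAM readings from Table 2, in GB (the Beam 5 figure is a lower bound).
vram_gb = {1: 15.0, 2: 16.5, 3: 17.5, 4: 18.5, 5: 20.0}

baseline = vram_gb[1]  # greedy decoding
increase = {beams: 100 * (gb - baseline) / baseline
            for beams, gb in vram_gb.items()}
# increase[2] -> 10.0, increase[5] -> 33.33...
```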
Table 3: Calculated Performance
Beams | BLEU | Δ% vs Prev | CHRF | Δ% vs Prev | BLEURT | Δ% vs Prev | VRAM (GB) | Δ% vs Prev |
---|---|---|---|---|---|---|---|---|
1 | 54.24 | - | 58.92 | - | 0.834 | - | 15.0 | - |
2 | 58.58 | +8.00% | 62.48 | +6.04% | 0.843 | +1.08% | 16.5 | +10.00% |
3 | 58.80 | +0.38% | 62.64 | +0.26% | 0.843 | 0.00% | 17.5 | +6.06% |
4 | 58.20 | -1.02% | 62.17 | -0.75% | 0.841 | -0.24% | 18.5 | +5.71% |
5 | 58.90 | +1.20% | 62.74 | +0.92% | 0.844 | +0.36% | 20.0+ | ≥+8.11% |
2.2 Key Observations From Metrics
- BLEU/CHRF show 6-9% gains from Beam 1→5
- TER worsens with beam search (~4% degradation), even as BLEU/CHRF improve
- At this batch size, VRAM scales roughly linearly (+1-1.5GB per beam), reaching ≥+33% over greedy at Beam 5
2.3 Beam Transition Analysis
- 1→2: 8% BLEU gain for 10% VRAM increase
- 2→3: <0.5% quality gains despite 6% VRAM growth
- 4→5: 1.2% BLEU recovery requires 8%+ VRAM
2.4 Critical Thresholds
- Quality Ceiling: Beam 2 captures 93% of the maximum BLEU gain for only 10% extra VRAM (vs ≥33% at Beam 5)
- Negative ROI Zone: Beam 4 reduces quality while increasing resource use
- Diminishing Returns: beams beyond #2 add at most ~0.5% net BLEU over Beam 2
2.5 TER Paradox Analysis
While beam search improves n-gram matching metrics (BLEU/CHRF), its tendency to produce longer translations increases edit distance through two mechanisms:
- Insertion Penalties: Additional words require deletions to match reference length
- Word-Order Differences: Longer outputs create more opportunities for mismatches in how words are arranged
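The first mechanism puts a hard floor under TER: length mismatch alone guarantees a minimum edit count, regardless of word choice. As a one-line sketch:

```python
def ter_length_floor(hyp_len, ref_len):
    """Lower bound on TER (%) from length mismatch alone: every surplus
    hypothesis word forces at least one deletion (and every missing word
    at least one insertion), whatever the actual word choices are."""
    return 100 * abs(hyp_len - ref_len) / ref_len

# A hypothesis 3 words longer than a 10-word reference starts at TER >= 30%
# before a single substitution or word-order error is counted.
```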
3. Visual Analysis
3.1 Quality-VRAM Tradeoff Curve
Peak BLEU requires 33% more VRAM than baseline, while CHRF plateaus after Beam 3
3.2 Key Observations from Visualization
- Non-Linear Scaling: Quality metrics plateau after Beam 2 while VRAM continues linear growth
- Metric Consensus: High inter-metric correlation (BLEU/CHRF/BLEURT) validates their reliability in these tests
- Beam 4 Anomaly: Visible as dip in Figure 1's quality curves despite VRAM increase
4. Practical Implications
Deployment Recommendations
Use Case | Optimal Beams | Rationale |
---|---|---|
Basic Production | 2 | Best cost/quality balance |
Research | 5 | Maximizes metric scores |
Mobile Deployment | 1 | Only configuration <16GB VRAM |
Hardware Planning Guidelines
- VRAM Budgeting: Each additional beam requires roughly 1-1.5GB (~7-10%) of VRAM headroom on this architecture
- Batch Size Warning: Doubling batch size from 8→16 would require ~30GB VRAM at Beam 5
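Both guidelines follow from a simple linear memory model: the weights are a fixed cost, and everything above them scales with batch size. The 7GB weight figure and the `projected_vram` helper below are illustrative assumptions (7B parameters at int8), not measurements from this study:

```python
# Assumption (not measured in this study): the 7B int8 weights occupy ~7GB,
# a fixed cost that does not grow with batch size.
WEIGHTS_GB = 7.0

def projected_vram(measured_gb, batch_scale):
    """Project VRAM for a scaled batch size, assuming everything above the
    fixed weights (activations, beam hypotheses, KV cache) grows linearly
    with batch size."""
    activations = measured_gb - WEIGHTS_GB
    return WEIGHTS_GB + activations * batch_scale

# Beam 5 measured ~20GB at batch 8; doubling the batch projects to
# 7 + 13 * 2 = 33GB, in the same ballpark as the ~30GB estimate above.
```

This is a deliberately crude model; memory fragmentation and allocator overhead push real usage higher, so budget extra headroom beyond the projection.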
5. Conclusion
This study reveals three fundamental tradeoffs in beam search optimization:
- Quality Gains: Beam search can improve BLEU/CHRF by 6-9%
- Resource Costs: Each beam increases VRAM consumption by 7-10%
- Metric Conflicts: TER behavior in this study suggests beam search produces "different but valid" translations
While beam search enhances translation quality, metric improvements must be weighed against resource costs. Beam 2 emerges as the optimal choice for Tiri Translation applications.
[Link to Tiri: https://www.straker.ai/products/tiri]