White paper: Effects of beam search on translation quality and resource consumption.

Community Article · Published June 12, 2025

A Study of Quality/Resource Tradeoffs in AI Models


Table of Contents

  Executive Summary
  1. Introduction
  2. Results
  3. Visual Analysis
  4. Practical Implications
  5. Conclusion

Executive Summary

This study evaluates the impact of beam search width on translation quality and resource consumption using Straker's Tiri J financial translation model (7B parameters, int8 quantization). Testing 2000 English→Japanese translation tasks across beam sizes 1-5, we measured quality using industry-standard metrics (BLEU, CHRF, BLEURT, TER) and tracked VRAM consumption on NVIDIA RTX 4090 hardware.

Key Findings

Quality Improvements:

  • Beam search delivers 6-9% improvement in BLEU/CHRF scores over greedy decoding
  • Peak quality is achieved at beam size 5, but beam size 2 already captures 93% of the maximum quality gain
  • Diminishing returns beyond beam size 2: moving from beam 2 to beam 5 adds only ~0.5% BLEU in total

Resource Costs:

  • VRAM consumption scales approximately linearly: +7-10% per additional beam
  • Beam size 2 requires only 10% additional VRAM for 8% quality improvement
  • Beam size 5 demands 33%+ more VRAM than greedy decoding

Optimal Configuration:

  • Production environments: Beam size 2 provides best cost/quality balance
  • Research applications: Beam size 5 maximizes metric scores
  • Resource-constrained deployments: Greedy decoding (beam size 1) for <16GB VRAM

Business Impact

The analysis reveals that beam size 2 represents the optimal sweet spot for production deployment, delivering a substantial quality improvement (8% BLEU gain) with minimal resource overhead (10% VRAM increase). Organizations can capture 93% of the maximum quality gain while maintaining efficient resource utilization, making this configuration ideal for scalable production environments.


1. Introduction

1.1 Hardware Configuration

  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)

1.2 Model Architecture

  • Base Model: tiri-j-fin-7b-v1.1 (7B-parameter transformer); an early Tiri model that is no longer in production use
  • Specialization: English→Japanese translation in the financial domain
  • Optimization: int8 quantization, which reduces the memory footprint by ~4x vs FP32
  • Batch Size: Fixed at 8 sequences
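
For orientation, a minimal sketch of this setup using the Hugging Face transformers API is shown below. The model ID is a placeholder (the Tiri J checkpoint is not publicly available), and the seq2seq interface and example sentence are assumptions for illustration; this is not the actual evaluation harness.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "tiri-j-fin-7b-v1.1"  # placeholder; the actual checkpoint is internal

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 quantization
    device_map="auto",
)

# Fixed batch of 8 sequences, as in the study; the source sentence is invented.
batch = tokenizer(
    ["Net revenue increased 12% year over year."] * 8,
    return_tensors="pt",
    padding=True,
).to(model.device)

# num_beams is the beam size under test (1 = greedy decoding).
outputs = model.generate(**batch, num_beams=2, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```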

1.3 Beam Search Fundamentals

Beam search is a heuristic search algorithm that explores a graph by expanding only the most promising nodes at each step. It is often used in optimization problems, particularly in sequence-to-sequence tasks such as translation.

  1. Breadth-First Nature with Limitations: While similar to a breadth-first search, where all possible nodes are expanded at each step, beam search introduces a parameter called "beam width" or "beam size" that limits the number of nodes to explore. This allows the algorithm to strike a balance between efficiency and result quality.

  2. Beam Size: Beam size is a hyperparameter that determines how many of the most probable partial solutions (nodes) are kept at each step. For instance, a beam size of 3 means at most three partial solutions are expanded and searched at each step.

  3. Step-by-Step Expansion:

    • Start with an initial node or partial solution.
    • Expand this partial solution by creating all possible next steps.
    • Evaluate each new partial solution using a scoring function, often a probability for sequence tasks.
    • Keep only the top beam-size number of partial solutions with the highest scores.
    • Repeat the process with the surviving partial solutions until a complete solution is found or a stopping condition is met (a minimal sketch of this loop follows below).
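
The following is a minimal, framework-free sketch of the loop above. The `next_token_scores` function stands in for the model's per-step log-probabilities and is purely illustrative; it is not part of the Tiri implementation.

```python
from typing import Callable

def beam_search(
    next_token_scores: Callable[[list[str]], dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "</s>",
) -> list[str]:
    """Keep the `beam_size` highest-scoring partial sequences at each step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability), starting from an empty prefix
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:  # finished hypotheses carry over unchanged
                candidates.append((tokens, score))
                continue
            for token, logp in next_token_scores(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the top `beam_size` partial solutions by score.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t and t[-1] == eos for t, _ in beams):  # stop once every beam has ended
            break
    return beams[0][0]
```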

Our tests below outline the effect that beam size has on translation quality. We tested our Tiri J model on 2000 previously unseen translation tasks, scored the outputs with industry-standard translation quality metrics, and correlated the scores against resource consumption.

1.4 Metrics Explained

1. BLEU (Bilingual Evaluation Understudy)

  • Measures n-gram precision against reference translations
  • Range: 0-100 (higher = better)
  • Correlates well with human judgment at corpus level

2. CHRF (Character n-gram F-score)

  • Evaluates character-level similarity using F-score
  • Range: 0-100 (higher = better)
  • Particularly effective for languages with complex morphology (e.g., Japanese)

3. BLEURT

  • Learned metric using pre-trained BERT models
  • Range: 0-1 (higher = better)
  • Captures semantic similarity beyond surface forms

4. TER (Translation Edit Rate)

  • Measures edit distance (insertions/deletions/substitutions) needed to match reference
  • Range: 0-100 (lower = better)
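
As a point of reference, BLEU, CHRF, and TER can be computed at corpus level with the sacrebleu library; BLEURT needs its own learned checkpoint and is omitted here. The sentences below are toy English examples, not data from this study, and Japanese output would normally be scored with a Japanese tokenizer (e.g. sacrebleu's ja-mecab option for BLEU).

```python
import sacrebleu

# Toy hypothesis/reference pair; the actual evaluation used 2000 held-out EN→JA tasks.
hypotheses = ["Net profit rose 12 percent year over year."]
references = [["Net profit increased 12 percent year over year."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # n-gram precision, 0-100
chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score, 0-100
ter = sacrebleu.corpus_ter(hypotheses, references)    # edit rate, 0-100, lower is better

print(f"BLEU {bleu.score:.2f}  CHRF {chrf.score:.2f}  TER {ter.score:.2f}")
```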

2. Results

2.1 Performance Metrics Table With % Change Between Beams

Table 1: Absolute Scores

| Beams | BLEU  | CHRF  | BLEURT | TER   |
|-------|-------|-------|--------|-------|
| 1     | 54.24 | 58.92 | 0.834  | 68.15 |
| 2     | 58.58 | 62.48 | 0.843  | 70.51 |
| 3     | 58.80 | 62.64 | 0.843  | 70.07 |
| 4     | 58.20 | 62.17 | 0.841  | 70.26 |
| 5     | 58.90 | 62.74 | 0.844  | 70.84 |

Table 2: VRAM Requirements

| Beams | VRAM (GB) | % Increase vs Greedy |
|-------|-----------|----------------------|
| 1     | 15.0      | 0%                   |
| 2     | 16.5      | +10%                 |
| 3     | 17.5      | +16.67%              |
| 4     | 18.5      | +23.33%              |
| 5     | 20+       | ≥33.33%              |

Table 3: Calculated Performance

| Beams | BLEU  | Δ% vs Prev | CHRF  | Δ% vs Prev | BLEURT | Δ% vs Prev | VRAM (GB) | Δ% vs Prev |
|-------|-------|------------|-------|------------|--------|------------|-----------|------------|
| 1     | 54.24 | -          | 58.92 | -          | 0.834  | -          | 15.0      | -          |
| 2     | 58.58 | +8.00%     | 62.48 | +6.04%     | 0.843  | +1.08%     | 16.5      | +10.00%    |
| 3     | 58.80 | +0.38%     | 62.64 | +0.26%     | 0.843  | 0.00%      | 17.5      | +6.06%     |
| 4     | 58.20 | -1.02%     | 62.17 | -0.75%     | 0.841  | -0.24%     | 18.5      | +5.71%     |
| 5     | 58.90 | +1.20%     | 62.74 | +0.92%     | 0.844  | +0.36%     | 20.0+     | ≥+8.11%    |
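
For transparency, the Δ% columns in Table 3 are plain step-over-step percentage changes computed from the values in Tables 1 and 2; the short sketch below reproduces them (the rounding is mine).

```python
# Reproduce the "Δ% vs Prev" columns in Table 3 from the values in Tables 1 and 2.
bleu = [54.24, 58.58, 58.80, 58.20, 58.90]
chrf = [58.92, 62.48, 62.64, 62.17, 62.74]
vram = [15.0, 16.5, 17.5, 18.5, 20.0]  # beam 5 was measured as 20+ GB

def pct_change(series):
    """Percentage change of each value relative to the previous beam setting."""
    return [None] + [
        round(100 * (curr - prev) / prev, 2)
        for prev, curr in zip(series, series[1:])
    ]

for name, series in [("BLEU", bleu), ("CHRF", chrf), ("VRAM", vram)]:
    print(name, pct_change(series))
# BLEU -> [None, 8.0, 0.38, -1.02, 1.2], matching Table 3
```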

Key Trends:

  • BLEU improves ~8.6% and CHRF ~6.5% from Beam 1→5
  • TER paradoxically worsens with beam search (~4% degradation)
  • At this batch size, VRAM scales roughly linearly (+1.0-1.5 GB per additional beam), reaching at least a 33% total increase at Beam 5

2.2 Key Observations From Metrics

The subsections that follow break these results down into beam-to-beam transitions (2.3), critical thresholds (2.4), and the TER paradox (2.5).

2.3 Beam Transition Analysis

  • 1→2: 8% BLEU gain for 10% VRAM increase
  • 2→3: <0.5% quality gains despite 6% VRAM growth
  • 4→5: 1.2% BLEU recovery requires 8%+ VRAM

2.4 Critical Thresholds

  • Quality Ceiling: Beam 2 captures 93% of the maximum BLEU gain for only 10% extra VRAM, versus ≥33% extra VRAM at Beam 5
  • Negative ROI Zone: Beam 4 reduces quality while increasing resource use
  • Diminishing Returns: moving from Beam 2 to Beam 5 adds only ~0.5% BLEU in total

2.5 TER Paradox Analysis

While beam search improves n-gram matching metrics (BLEU/CHRF), its tendency to produce longer translations increases edit distance through two mechanisms (a toy illustration follows this list):

  1. Insertion Penalties: extra words in the hypothesis require additional deletion edits to match the reference
  2. Word Choice Differences: longer outputs create more opportunities for word-choice and word-order mismatches against the reference
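
A toy illustration of the first mechanism, using sacrebleu's TER on invented English sentences: the verbose hypothesis contains the reference content but needs extra edits (mostly deletions) to match it, so its TER is higher.

```python
import sacrebleu

reference = [["Net profit rose 12 percent year over year."]]
concise = ["Net profit rose 12 percent year over year."]
verbose = ["Net profit rose by a total of 12 percent compared with the prior year period."]

print(sacrebleu.corpus_ter(concise, reference).score)  # 0.0: identical to the reference
print(sacrebleu.corpus_ter(verbose, reference).score)  # higher: the extra words require edits
```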

3. Visual Analysis

3.1 Quality-VRAM Tradeoff Curve

Figure 1: BLEU/CHRF vs VRAM. Peak BLEU requires 33% more VRAM than baseline, while CHRF plateaus after Beam 3.

3.2 Key Observations from Visualization

  1. Non-Linear Scaling: Quality metrics plateau after Beam 2 while VRAM continues linear growth
  2. Metric Consensus: High inter-metric correlation (BLEU/CHRF/BLEURT) validates their reliability in these tests
  3. Beam 4 Anomaly: visible as a dip in Figure 1's quality curves despite the increase in VRAM use

4. Practical Implications

Deployment Recommendations

| Use Case          | Optimal Beams | Rationale                           |
|-------------------|---------------|-------------------------------------|
| Basic Production  | 2             | Best cost/quality balance           |
| Research          | 5             | Maximizes metric scores             |
| Mobile Deployment | 1             | Only configuration under 16GB VRAM  |
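
One way these recommendations could be encoded is sketched below with the transformers GenerationConfig; the preset names and max_new_tokens value are illustrative assumptions, not part of the study.

```python
from transformers import GenerationConfig

# Illustrative presets mirroring the table above; max_new_tokens is an assumption.
PRESETS = {
    "basic_production": GenerationConfig(num_beams=2, max_new_tokens=256),
    "research": GenerationConfig(num_beams=5, max_new_tokens=256),
    "mobile": GenerationConfig(num_beams=1, max_new_tokens=256),  # greedy decoding
}

# Usage (model and batch as in the section 1.2 sketch):
# outputs = model.generate(**batch, generation_config=PRESETS["basic_production"])
```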

Hardware Planning Guidelines

  • VRAM Budgeting: each additional beam requires roughly 7-10% additional VRAM headroom on this architecture (a rough estimator sketch follows this list)
  • Batch Size Warning: Doubling batch size from 8→16 would require ~30GB VRAM at Beam 5
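
For rough capacity planning, the sketch below fits the Table 2 measurements to a simple two-part model: a fixed weight footprint plus a per-batch, per-beam component. The ~7 GB weight estimate for a 7B int8 model and the linear batch scaling are assumptions, not measurements.

```python
def estimate_vram_gb(num_beams: int, batch_size: int = 8) -> float:
    """Back-of-the-envelope VRAM estimate for this setup (RTX 4090, int8 7B model)."""
    weights_gb = 7.0  # assumed int8 weight footprint, independent of batch/beams
    # Activations + KV cache at batch 8, fitted to Table 2 (~8 GB base, ~1.25 GB per extra beam).
    dynamic_gb_at_batch8 = 8.0 + 1.25 * (num_beams - 1)
    return weights_gb + dynamic_gb_at_batch8 * (batch_size / 8)

print(round(estimate_vram_gb(5), 1))      # ~20 GB, in line with Table 2
print(round(estimate_vram_gb(5, 16), 1))  # ~33 GB, in line with the ~30 GB warning above
```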

5. Conclusion

This study reveals three fundamental tradeoffs in beam search optimization:

  1. Quality Gains: Beam search can improve BLEU/CHRF by 6-9%
  2. Resource Costs: Each beam increases VRAM consumption by 7-10%
  3. Metric Conflicts: TER behavior in this study suggests beam search produces "different but valid" translations

While beam search enhances translation quality, metric improvements must be balanced against resource costs. Beam 2 emerges as the optimal choice for Tiri Translation applications.


[Link to Tiri: https://www.straker.ai/products/tiri]
