AWS Trainium & Inferentia documentation

NeuronTrainer

Training classes for AWS Trainium accelerators.

NeuronTrainingArguments

class optimum.neuron.NeuronTrainingArguments

( output_dir: str | None = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False eval_strategy: transformers.trainer_utils.IntervalStrategy | str = 'no' per_device_train_batch_size: int = 1 per_device_eval_batch_size: int = 1 gradient_accumulation_steps: int = 1 learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: transformers.trainer_utils.SchedulerType | str = 'linear' lr_scheduler_kwargs: dict[str, typing.Any] | str | None = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: str = 'info' log_level_replica: str = 'silent' logging_dir: str | None = None logging_strategy: transformers.trainer_utils.IntervalStrategy | str = 'steps' logging_first_step: bool = False logging_steps: float = 500 save_strategy: transformers.trainer_utils.SaveStrategy | str = 'steps' save_steps: float = 500 save_total_limit: int | None = None save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False seed: int = 42 bf16: bool = False dataloader_drop_last: bool = False eval_steps: float | None = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: int | None = None run_name: str | None = None disable_tqdm: bool | None = None remove_unused_columns: bool | None = True label_names: list[str] | None = None accelerator_config: dict | str | None = None label_smoothing_factor: float = 0.0 optim: transformers.training_args.OptimizerNames | str = 'adamw_torch' optim_args: str | None = None report_to: None | str | list[str] = None resume_from_checkpoint: str | None = None gradient_checkpointing: bool = False gradient_checkpointing_kwargs: dict[str, typing.Any] | str | None = None use_liger_kernel: bool | None = False average_tokens_across_devices: bool | None = False dataloader_prefetch_size: int = None skip_cache_push: bool = False use_autocast: bool = False zero_1: bool = False tensor_parallel_size: int = 1 disable_sequence_parallel: bool = False pipeline_parallel_size: int = 1 pipeline_parallel_num_microbatches: int = -1 kv_size_multiplier: int | None = None num_local_ranks_per_step: int = 8 use_xser: bool = True async_save: bool = False fuse_qkv: bool = False recompute_causal_mask: bool = True )
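
A minimal sketch of a typical configuration; the argument names below are taken from the signature above, but exact defaults and availability may vary across optimum-neuron versions:

```python
from optimum.neuron import NeuronTrainingArguments

training_args = NeuronTrainingArguments(
    output_dir="my-trainium-run",        # where checkpoints and logs are written
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    num_train_epochs=3,
    bf16=True,                           # train in bfloat16 on Trainium
    tensor_parallel_size=8,              # shard the model across 8 Neuron cores
    pipeline_parallel_size=1,
    logging_steps=10,
    save_strategy="steps",
    save_steps=500,
)
```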

get_process_log_level

( )

Returns the log level to be used depending on whether this process is the main process of node 0, the main process of a non-0 node, or a non-main process.

For the main process, the log level defaults to the logging level set (logging.WARNING if you didn’t do anything) unless overridden by the log_level argument.

For the replica processes, the log level defaults to logging.WARNING unless overridden by the log_level_replica argument.

The choice between the main and replica process settings is made according to the return value of should_log.
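
The returned level can be fed directly into Python logging so that replicas stay quiet while the main process reports progress (a small sketch; training_args is assumed to be a NeuronTrainingArguments instance):

```python
import logging

# The main process follows log_level, replicas follow log_level_replica.
log_level = training_args.get_process_log_level()
logging.basicConfig(level=log_level)
```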

get_warmup_steps

( num_training_steps: int )

Get the number of steps used for a linear warmup.
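
A small usage sketch (reusing the hypothetical training_args from above); an explicit warmup_steps value normally takes precedence, otherwise the count is derived from warmup_ratio:

```python
# With warmup_ratio=0.1 and no explicit warmup_steps, roughly the first 10%
# of training steps would be used for warmup.
num_training_steps = 10_000
num_warmup_steps = training_args.get_warmup_steps(num_training_steps)
```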

to_dict

( )

Serializes this instance while replacing Enum members with their values (for JSON serialization support). Token values are obfuscated by removing them.

to_json_string

( )

Serializes this instance to a JSON string.

to_sanitized_dict

( )

Sanitized serialization to use with TensorBoard’s hparams.
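
Together these helpers cover the common serialization needs, for example (a sketch assuming the training_args instance from above):

```python
args_dict = training_args.to_dict()          # enums replaced by values, token values obfuscated
hparams = training_args.to_sanitized_dict()  # flattened values suitable for TensorBoard hparams

# Persist the full configuration next to the checkpoints.
with open("training_args.json", "w") as f:
    f.write(training_args.to_json_string())
```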

NeuronTrainer

class optimum.neuron.NeuronTrainer

( model: transformers.modeling_utils.PreTrainedModel | torch.nn.modules.module.Module args: NeuronTrainingArguments data_collator: typing.Optional[transformers.data.data_collator.DataCollator] = None train_dataset: Dataset | IterableDataset | datasets.Dataset | None = None eval_dataset: Dataset | dict[str, Dataset] | datasets.Dataset | None = None processing_class: transformers.tokenization_utils_base.PreTrainedTokenizerBase | transformers.image_processing_utils.BaseImageProcessor | transformers.feature_extraction_utils.FeatureExtractionMixin | transformers.processing_utils.ProcessorMixin | None = None callbacks: list[transformers.trainer_callback.TrainerCallback] | None = None optimizers: tuple[torch.optim.optimizer.Optimizer | None, torch.optim.lr_scheduler.LambdaLR | None] = (None, None) optimizer_cls_and_kwargs: tuple[type[torch.optim.optimizer.Optimizer], dict[str, typing.Any]] | None = None tokenizer: transformers.tokenization_utils_base.PreTrainedTokenizerBase | None = None )
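
A minimal end-to-end sketch of wiring the pieces together; the checkpoint name and datasets are placeholders, training_args is a NeuronTrainingArguments instance, and train() is assumed to behave like the standard transformers Trainer entry point:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.neuron import NeuronTrainer

model_id = "Qwen/Qwen2.5-0.5B"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = NeuronTrainer(
    model=model,
    args=training_args,           # NeuronTrainingArguments, e.g. as built above
    train_dataset=train_dataset,  # tokenized datasets prepared beforehand
    eval_dataset=eval_dataset,
    processing_class=tokenizer,
)
trainer.train()
```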

add_callback

( callback: typing.Union[typing.Type[transformers.trainer_callback.TrainerCallback], transformers.trainer_callback.TrainerCallback] )

Parameters

  • callback (Type[TrainerCallback] | TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, will instantiate a member of that class.

Add a callback to the current list of TrainerCallback.
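
For example, a simple callback can be registered either as a class or as an instance (a sketch; PrintLossCallback is a hypothetical callback and trainer a NeuronTrainer instance):

```python
from transformers import TrainerCallback

class PrintLossCallback(TrainerCallback):
    # Called whenever the trainer logs metrics.
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            print(f"step {state.global_step}: {logs}")

trainer.add_callback(PrintLossCallback)      # pass the class: the trainer instantiates it
# trainer.add_callback(PrintLossCallback())  # or pass a ready-made instance
```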

autocast_smart_context_manager

( cache_enabled: bool | None = True )

A helper wrapper that creates an appropriate context manager for autocast while feeding it the desired arguments, depending on the situation.

create_accelerator_and_postprocess

( )

Creates a NeuronAccelerator instance and prepares the model for distributed training.

create_optimizer

( )

Set up the optimizer.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the NeuronTrainer’s init through optimizers, or subclass and override this method.

create_optimizer_and_scheduler

( num_training_steps: int )

Set up the optimizer and the learning rate scheduler.

We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the NeuronTrainer’s init through optimizers, or subclass and override this method (or create_optimizer and/or create_scheduler).
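
Both customization paths look roughly like the following sketch (reusing model, training_args, and train_dataset from earlier; passing None for the scheduler is assumed to fall back to the default one):

```python
import torch
from optimum.neuron import NeuronTrainer

optimizer = torch.optim.AdamW(model.parameters(), lr=training_args.learning_rate)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(optimizer, None),  # (optimizer, lr_scheduler)
)
```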

create_scheduler

( num_training_steps: int optimizer: torch.optim.optimizer.Optimizer | None = None )

Parameters

  • num_training_steps (int) — The number of training steps to do.

Set up the scheduler. The optimizer of the trainer must have been set up either before this method is called or passed as an argument.

get_decay_parameter_names

( model )

Get all parameter names that weight decay will be applied to.

This function filters out parameters in two ways:

  1. By layer type (instances of layers specified in ALL_LAYERNORM_LAYERS)
  2. By parameter name patterns (containing ‘bias’, ‘layernorm’, or ‘rmsnorm’)
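
This is typically used to build the two optimizer parameter groups, roughly as in the following sketch (assuming trainer, model, and training_args from earlier):

```python
decay_names = set(trainer.get_decay_parameter_names(model))

param_groups = [
    {   # parameters that receive weight decay
        "params": [p for n, p in model.named_parameters() if n in decay_names and p.requires_grad],
        "weight_decay": training_args.weight_decay,
    },
    {   # biases and normalization weights: no decay
        "params": [p for n, p in model.named_parameters() if n not in decay_names and p.requires_grad],
        "weight_decay": 0.0,
    },
]
```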

get_learning_rates

( )

Returns the learning rate of each parameter from self.optimizer.

get_num_trainable_parameters

( )

Get the number of trainable parameters.

get_optimizer_cls_and_kwargs

( args: TrainingArguments model: transformers.modeling_utils.PreTrainedModel | None = None )

Parameters

  • args (transformers.training_args.TrainingArguments) — The training arguments for the training session.

Returns the optimizer class and optimizer parameters based on the training arguments.

get_optimizer_group

( param: str | torch.nn.parameter.Parameter | None = None )

Parameters

  • param (str | torch.nn.parameter.Parameter | None, defaults to None) — The parameter for which the optimizer group needs to be returned.

Returns the optimizer group for a parameter if given, otherwise returns all optimizer groups.

get_train_dataloader

( )

Returns the training DataLoader with appropriate sampler and batch size.

is_local_process_zero

( )

Whether or not this process is the local main process (e.g., the main process of one machine when training in a distributed fashion on several machines).

is_world_process_zero

( )

Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).
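
A typical guard so that only one process performs side effects such as printing a summary or pushing artifacts (a small sketch assuming the trainer instance from above):

```python
if trainer.is_world_process_zero():
    print(f"trainable parameters: {trainer.get_num_trainable_parameters():,}")
```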

log

( logs: dict[str, float] )

Log training metrics to the state history and callbacks.

maybe_log_train_step_metrics

( )

Log training step metrics if logging is due.

maybe_save_checkpoint

( )

Save checkpoint if saving is due.

num_examples

( dataloader: DataLoader )

Helper to get the number of samples in a ~torch.utils.data.DataLoader by accessing its dataset. When dataloader.dataset does not exist or has no length, it estimates as best it can.

num_tokens

( train_dl: DataLoader max_steps: int | None = None )

Helper to get the number of tokens in a ~torch.utils.data.DataLoader by enumerating the dataloader.

pop_callback

( callback: typing.Union[typing.Type[transformers.trainer_callback.TrainerCallback], transformers.trainer_callback.TrainerCallback] ) TrainerCallback | None

Parameters

  • callback (Type[TrainerCallback] | TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, will pop the first member of that class found in the list of callbacks.

Returns

TrainerCallback | None

The callback removed, if found.

Removes a callback from the current list of TrainerCallback and returns it.

If the callback is not found, returns None (and no error is raised).
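
For example, to detach a callback temporarily and re-attach it later (a sketch reusing the hypothetical PrintLossCallback from the add_callback example):

```python
cb = trainer.pop_callback(PrintLossCallback)  # returns the removed instance, or None
if cb is not None:
    trainer.add_callback(cb)                  # put it back when needed again
```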

remove_callback

( callback: typing.Union[typing.Type[transformers.trainer_callback.TrainerCallback], transformers.trainer_callback.TrainerCallback] )

Parameters

  • callback (Type[TrainerCallback] | TrainerCallback) — A TrainerCallback class or an instance of a TrainerCallback. In the first case, will remove the first member of that class found in the list of callbacks.

Remove a callback from the current list of TrainerCallback.

set_initial_training_values

( args: NeuronTrainingArguments dataloader: DataLoader total_train_batch_size: int )

Calculates and returns the following values:

  • num_train_epochs
  • num_update_steps_per_epoch
  • num_examples
  • num_train_samples
  • epoch_based
  • len_dataloader
  • max_steps

setup_training

( train_dataloader: DataLoader max_steps: int num_train_epochs: int num_examples: int total_train_batch_size: int )

Set up everything needed for the training loop. This method does not return anything but initializes many attributes of the class used during training.