super_gradients.training.sg_model package

Submodules

super_gradients.training.sg_model.sg_model module

class super_gradients.training.sg_model.sg_model.StrictLoad(value)[source]

Bases: enum.Enum

Wrapper for adding more functionality to torch's strict_load parameter in load_state_dict().

Attributes:

OFF - Native torch "strict_load = off" behaviour. See nn.Module.load_state_dict() documentation for more details.
ON - Native torch "strict_load = on" behaviour. See nn.Module.load_state_dict() documentation for more details.
NO_KEY_MATCHING - Allows the usage of SuperGradient's adapt_checkpoint function, which loads a checkpoint by matching each layer's shapes, bypassing the strict matching of layer names (i.e. it disregards state_dict key matching).

OFF = False
ON = True
NO_KEY_MATCHING = 'no_key_matching'
class super_gradients.training.sg_model.sg_model.MultiGPUMode(value)[source]

Bases: str, enum.Enum

OFF                       - Single GPU Mode / CPU Mode
DATA_PARALLEL             - Multiple GPUs, Synchronous
DISTRIBUTED_DATA_PARALLEL - Multiple GPUs, Asynchronous
OFF = 'Off'
DATA_PARALLEL = 'DP'
DISTRIBUTED_DATA_PARALLEL = 'DDP'
AUTO = 'AUTO'
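
The mode is typically passed when constructing SgModel. A minimal sketch (plain strings such as 'DP' are also accepted, since the signature takes Union[MultiGPUMode, str]):

    from super_gradients.training.sg_model import SgModel, MultiGPUMode

    # Data-parallel training over all visible GPUs in a single process.
    sg_model = SgModel(experiment_name="dp_run", multi_gpu=MultiGPUMode.DATA_PARALLEL)
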
class super_gradients.training.sg_model.sg_model.EvaluationType(value)[source]

Bases: str, enum.Enum

Passed to SgModel.evaluate(..), and controls which phase callbacks should be triggered (if at all).

Attributes:

TEST
VALIDATION

TEST = 'TEST'
VALIDATION = 'VALIDATION'
class super_gradients.training.sg_model.sg_model.SgModel(experiment_name: str, device: Optional[str] = None, multi_gpu: Union[super_gradients.training.sg_model.sg_model.MultiGPUMode, str] = <MultiGPUMode.OFF: 'Off'>, model_checkpoints_location: str = 'local', overwrite_local_checkpoint: bool = True, ckpt_name: str = 'ckpt_latest.pth', post_prediction_callback: Optional[super_gradients.training.utils.detection_utils.DetectionPostPredictionCallback] = None, ckpt_root_dir=None)[source]

Bases: object

SuperGradient Model - Base Class for Sg Models

train(max_epochs: int, initial_epoch: int, save_model: bool)[source]

The main function used for training, hyper-parameter updating, logging, etc.

predict(idx: int)[source]

Returns the predictions and labels of the current inputs.

test(epoch : int, idx : int, save : bool):

Returns the test loss, accuracy and runtime.
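
Taken together, a typical workflow chains these calls. A minimal sketch, assuming a classification setting; the ClassificationTestDatasetInterface, the "resnet18" architecture name and the torchmetrics Accuracy() call (which may need a task argument in newer torchmetrics versions) are illustrative and may differ between SuperGradients versions:

    from torchmetrics import Accuracy
    from super_gradients.training import SgModel
    from super_gradients.training.datasets.dataset_interfaces import ClassificationTestDatasetInterface

    # 1. Create the model wrapper; checkpoints are stored under this experiment name.
    sg_model = SgModel(experiment_name="my_resnet18_run")

    # 2. Connect a dataset interface, which provides the train/valid data loaders.
    sg_model.connect_dataset_interface(ClassificationTestDatasetInterface(), data_loader_num_workers=4)

    # 3. Build the network from a registered architecture name (see models/ALL_ARCHITECTURES).
    sg_model.build_model(architecture="resnet18", arch_params={"num_classes": 5})

    # 4. Train with a training_params dict (see train() below for the full list of keys).
    sg_model.train(training_params={
        "max_epochs": 2,
        "initial_lr": 0.1,
        "lr_mode": "step",
        "lr_updates": [1],
        "lr_decay_factor": 0.1,
        "loss": "cross_entropy",
        "optimizer": "SGD",
        "optimizer_params": {"momentum": 0.9},
        "train_metrics_list": [Accuracy()],
        "valid_metrics_list": [Accuracy()],
        "loss_logging_items_names": ["Loss"],
        "metric_to_watch": "Accuracy",
        "greater_metric_to_watch_is_better": True,
    })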

connect_dataset_interface(dataset_interface: super_gradients.training.datasets.dataset_interfaces.dataset_interface.DatasetInterface, data_loader_num_workers: int = 8)[source]
Parameters
  • dataset_interface – DatasetInterface object - the dataset to be connected

  • data_loader_num_workers – The number of worker threads to initialize the data loaders with

build_model(architecture: Union[str, torch.nn.modules.module.Module], arch_params={}, checkpoint_params={}, *args, **kwargs)[source]
Parameters
  • architecture – Defines the network’s architecture from models/ALL_ARCHITECTURES

  • arch_params – Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

  • checkpoint_params

    Dictionary-like object with the following key:values:

    load_checkpoint: Load a pre-trained checkpoint.
    strict_load: See the StrictLoad class documentation for details.
    source_ckpt_folder_name: Folder name to load the checkpoint from (self.experiment_name if none is given).
    load_weights_only: Loads only the weights from the checkpoint and zeroizes the training params.
    load_backbone: Loads the provided checkpoint to self.net.backbone instead of self.net.
    external_checkpoint_path: The path to the external checkpoint to be loaded. Can be absolute or relative (ie: path/to/checkpoint.pth). If provided, the checkpoint will automatically be loaded even if the load_checkpoint flag is not provided.
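
A sketch of loading weights from an existing checkpoint into a freshly built network; the architecture name and checkpoint path are placeholders, and sg_model is an SgModel instance as in the sketch above:

    from super_gradients.training.sg_model.sg_model import StrictLoad

    sg_model.build_model(
        architecture="resnet18",                        # any key from models/ALL_ARCHITECTURES
        arch_params={"num_classes": 10},
        checkpoint_params={
            "load_checkpoint": True,                    # load a pre-trained checkpoint
            "strict_load": StrictLoad.NO_KEY_MATCHING,  # match layers by shape rather than by state_dict key
            "external_checkpoint_path": "path/to/checkpoint.pth",  # placeholder path
            "load_weights_only": True,                  # load only the weights, not the training state
        },
    )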

backward_step(loss: torch.Tensor, epoch: int, batch_idx: int, context: super_gradients.training.utils.callbacks.PhaseContext, *args, **kwargs)[source]

Run backprop on the loss and perform a step.

Parameters
  • loss – The value computed by the loss function

  • optimizer – An object that can perform a gradient step and zeroize model gradients

  • epoch – number of the epoch the training is on

  • batch_idx – number of the iteration inside the current epoch

  • context – current phase context

save_checkpoint(optimizer=None, epoch: Optional[int] = None, validation_results_tuple: Optional[tuple] = None, context: Optional[super_gradients.training.utils.callbacks.PhaseContext] = None)[source]

Save the current state dict as latest (always), best (if metric was improved), epoch# (if determined in training params)

train(training_params: dict = {})[source]

train - Trains the Model

IMPORTANT NOTE: Additional batch parameters can be added as an optional third item in the tuple returned by the data loaders, in the form of a dictionary. The phase context will hold the additional items under attributes named after the keys of this dictionary, so they can be accessed through phase callbacks. A minimal training_params example follows the parameter list below.

param training_params
  • max_epochs : int

    Number of epochs to run training.

  • lr_updates : list(int)

    List of fixed epoch numbers to perform learning rate updates when lr_mode=’step’.

  • lr_decay_factor : float

    Decay factor to apply to the learning rate at each update when lr_mode=’step’.

  • lr_mode : str

    Learning rate scheduling policy, one of ['step', 'poly', 'cosine', 'function']. 'step' refers to constant updates at epoch numbers passed through lr_updates. 'cosine' refers to the Cosine Annealing policy described in https://arxiv.org/abs/1608.03983. 'poly' refers to a polynomial decrease, i.e. in each epoch iteration self.lr = self.initial_lr * pow((1.0 - (current_iter / max_iter)), 0.9). 'function' refers to a user-defined learning rate scheduling function that is passed through lr_schedule_function.

  • lr_schedule_function : Union[callable,None]

    Learning rate scheduling function to be used when lr_mode is ‘function’.

  • lr_warmup_epochs : int (default=0)

    Number of epochs for learning rate warm up - see https://arxiv.org/pdf/1706.02677.pdf (Section 2.2).

  • cosine_final_lr_ratio : float (default=0.01)

    Final learning rate ratio (only relevant when `lr_mode`='cosine'). The cosine schedule starts from initial_lr and reaches initial_lr * cosine_final_lr_ratio in the last epoch.

  • initial_lr : float

    Initial learning rate.

  • loss : Union[nn.Module, str]

    Loss function for training. One of SuperGradient's built-in options:

    "cross_entropy": LabelSmoothingCrossEntropyLoss,
    "mse": MSELoss,
    "r_squared_loss": RSquaredLoss,
    "detection_loss": YoLoV3DetectionLoss,
    "shelfnet_ohem_loss": ShelfNetOHEMLoss,
    "shelfnet_se_loss": ShelfNetSemanticEncodingLoss,
    "yolo_v5_loss": YoLoV5DetectionLoss,
    "ssd_loss": SSDLoss,

    or a user-defined nn.Module loss function.

    IMPORTANT: forward(...) should return a (loss, loss_items) tuple where loss is the tensor used for backprop (i.e. what your original loss function returns), and loss_items should be a tensor of shape (n_items) holding values computed during the forward pass that we want to log over the entire epoch. For example, the loss itself should always be logged. Another example is a scenario where the computed loss is the sum of a few components we would like to log; these components go into loss_items.

    When training, set the loss_logging_items_names parameter in train_params to be a list of strings of length n_items whose ith element is the name of the ith entry in loss_items. Each item will then be logged, rendered on tensorboard and "watched" (i.e. model checkpoints are saved according to it).

    Since running logs will save the loss_items in some internal state, it is recommended to detach loss_items from their computational graph for memory efficiency.

  • optimizer : Union[str, torch.optim.Optimizer]

    Optimization algorithm. One of ['Adam', 'SGD', 'RMSProp'], corresponding to the torch.optim optimizer implementations, or any object that implements torch.optim.Optimizer.

  • criterion_params : dict

    Loss function parameters.

  • optimizer_params : dict

    When optimizer is one of [‘Adam’,’SGD’,’RMSProp’], it will be initialized with optimizer_params.

    (see https://pytorch.org/docs/stable/optim.html for the full list of parameters for each optimizer).

  • train_metrics_list : list(torchmetrics.Metric)

    Metrics to log during training. For more information on torchmetrics see https://torchmetrics.rtfd.io/en/latest/.

  • valid_metrics_list : list(torchmetrics.Metric)

    Metrics to log during validation/testing. For more information on torchmetrics see https://torchmetrics.rtfd.io/en/latest/.

  • loss_logging_items_names : list(str)

    The list of names/titles for the outputs returned from the loss function's forward pass (reminder: the loss function should return the tuple (loss, loss_items)). These names will be used for logging their values.

  • metric_to_watch : str (default=”Accuracy”)

    The metric according to which the model checkpoint will be saved. It can be set to any of the following:

    a metric name (str) of one of the metric objects from the valid_metrics_list

    a component name (str), if some metric in valid_metrics_list has an attribute component_names, which is a list naming each entry of the output metric (a torch tensor of size n)

    one of the "loss_logging_items_names", i.e. a name corresponding to an item returned during the loss function's forward pass.

    At the end of each epoch, if a new best metric_to_watch value is achieved, the models checkpoint is saved in YOUR_PYTHON_PATH/checkpoints/ckpt_best.pth

  • greater_metric_to_watch_is_better : bool

    When choosing a model's checkpoint to be saved, the best achieved model is the one that maximizes the metric_to_watch when this parameter is set to True, and the one that minimizes it otherwise.

  • ema : bool (default=False)

    Whether to use Model Exponential Moving Average (see https://github.com/rwightman/pytorch-image-models ema implementation)

  • batch_accumulate : int (default=1)

    Number of batches to accumulate before every backward pass.

  • ema_params : dict

    Parameters for the ema model.

  • zero_weight_decay_on_bias_and_bn : bool (default=False)

    Whether to set weight decay to zero for bias and batch-normalization parameters (ignored when the passed optimizer has already been initialized).

  • load_opt_params : bool (default=True)

    Whether to load the optimizer's parameters as well when loading a model's checkpoint.

  • run_validation_freq : int (default=1)

    The frequency at which validation is performed during training (i.e. validation is run every run_validation_freq epochs).

  • save_model : bool (default=True)

    Whether to save the model checkpoints.

  • silent_mode : bool

    Silences the printouts.

  • mixed_precision : bool

    Whether to use mixed precision or not.

  • save_ckpt_epoch_list : list(int) (default=[])

    List of fixed epoch indices the user wishes to save checkpoints in.

  • average_best_models : bool (default=False)

    If set, a snapshot dictionary file and the average model will be saved / updated at every epoch and evaluated only when training is completed. The snapshot file will only be deleted upon completing the training. The snapshot dict will be managed on cpu.

  • precise_bn : bool (default=False)

    Whether to use precise_bn calculation during the training.

  • precise_bn_batch_size : int (default=None)

    The effective batch size we want to calculate the batchnorm on. For example, if we are training a model on 8 gpus, with a batch of 128 on each gpu, a good rule of thumb would be to give it 8192 (ie: effective_batch_size * num_gpus = batch_per_gpu * num_gpus * num_gpus). If precise_bn_batch_size is not provided in the training_params, the latter heuristic will be taken.

  • seed : int (default=42)

    Random seed to be set for torch, numpy, and random. When using DDP each process will have its seed set to seed + rank.

  • log_installed_packages : bool (default=False)

    When set, the list of all installed packages (and their versions) will be written to the tensorboard and logfile (useful when trying to reproduce results).

  • dataset_statistics : bool (default=False)

    Enable a statistic analysis of the dataset. If set to True the dataset will be analyzed and a report will be added to the tensorboard along with some sample images from the dataset. Currently only detection datasets are supported for analysis.

  • save_full_train_log : bool (default=False)

    When set, a full log (of all super_gradients modules, including uncaught exceptions from any other module) of the training will be saved in the checkpoint directory under full_train_log.log.

  • sg_logger : Union[AbstractSGLogger, str] (default=base_sg_logger)

    Define the SGLogger object for this training process. The SGLogger handles all disk writes, logs, TensorBoard, remote logging and remote storage. By overriding the default base_sg_logger, you can change the storage location, support external monitoring and logging or support remote storage.

  • sg_logger_params : dict

    SGLogger parameters

  • clip_grad_norm : float

    Defines a maximal L2 norm of the gradients. Values which exceed the given value will be clipped

Returns
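
A fuller training_params sketch showing a cosine schedule with warm-up, EMA and mixed precision; values are illustrative, unset keys fall back to SuperGradients' defaults, and Accuracy() may need a task argument in newer torchmetrics versions:

    from torchmetrics import Accuracy

    training_params = {
        "max_epochs": 100,
        "initial_lr": 0.1,
        "lr_mode": "cosine",
        "cosine_final_lr_ratio": 0.01,    # lr decays to initial_lr * 0.01 by the last epoch
        "lr_warmup_epochs": 3,
        "loss": "cross_entropy",
        "optimizer": "SGD",
        "optimizer_params": {"momentum": 0.9, "weight_decay": 1e-4},
        "mixed_precision": True,
        "ema": True,
        "batch_accumulate": 2,            # gradients are accumulated over 2 batches before each step
        "train_metrics_list": [Accuracy()],
        "valid_metrics_list": [Accuracy()],
        "loss_logging_items_names": ["Loss"],
        "metric_to_watch": "Accuracy",
        "greater_metric_to_watch_is_better": True,
        "save_ckpt_epoch_list": [50, 75],
    }

    sg_model.train(training_params=training_params)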

predict(inputs, targets=None, half=False, normalize=False, verbose=False, move_outputs_to_cpu=True)[source]

A fast predictor for a batch of inputs.

Parameters
  • inputs – torch.tensor or numpy.array - a batch of inputs

  • targets – torch.tensor() corresponding labels - if none are given, accuracy will not be computed

  • verbose – bool print the results to screen

  • normalize – bool If true, normalizes the tensor according to the dataloader’s normalization values

  • half – Performs half precision evaluation

  • move_outputs_to_cpu – Moves the results from the GPU to the CPU

Returns

outputs, acc, net_time, gross_time - the network's predictions, the accuracy calculation, the forward-pass net time and the function's gross time
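
For example, a sketch using one batch from the validation loader (this assumes a valid_loader attribute populated by connect_dataset_interface; the attribute name is an assumption):

    images, labels = next(iter(sg_model.valid_loader))
    outputs, acc, net_time, gross_time = sg_model.predict(images, targets=labels, verbose=True)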

compute_model_runtime(input_dims: Optional[tuple] = None, batch_sizes: Union[tuple, list, int] = (1, 8, 16, 32, 64), verbose: bool = True)[source]

Compute the “atomic” inference time and throughput. Atomic refers to calculating the forward pass independently, discarding effects such as data augmentation, data upload to device, multi-gpu distribution etc.

Parameters
  • input_dims – tuple Shape of a basic input to the network (without the batch dimension), e.g. (3, 224, 224); if None, an input from the test loader is used

  • batch_sizes – int or list Batch sizes for latency calculation

  • verbose – bool Prints results to screen

Returns

log: dict Latency and throughput for each tested batch size
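
For example (a sketch):

    runtime_log = sg_model.compute_model_runtime(input_dims=(3, 224, 224), batch_sizes=[1, 8, 32])
    print(runtime_log)   # latency and throughput per tested batch size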

get_arch_params()[source]
get_structure()[source]
get_architecture()[source]
set_experiment_name(experiment_name)[source]
re_build_model(arch_params={})[source]
arch_params : dict

Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

Returns

update_architecture(structure)[source]
architecture : str

Defines the network’s architecture according to the options in models/all_architectures

load_checkpoint : bool

Loads a checkpoint according to experiment_name

arch_params : dict

Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

Returns

get_module()[source]
set_module(module)[source]
test(test_loader: Optional[torch.utils.data.dataloader.DataLoader] = None, loss: Optional[torch.nn.modules.loss._Loss] = None, silent_mode: bool = False, test_metrics_list=None, loss_logging_items_names=None, metrics_progress_verbose=False, test_phase_callbacks=None, use_ema_net=True)tuple[source]

Evaluates the model on given dataloader and metrics.

Parameters
  • test_loader – dataloader to perform test on.

  • test_metrics_list – (list(torchmetrics.Metric)) metrics list for evaluation.

  • silent_mode – (bool) controls verbosity

  • metrics_progress_verbose – (bool) controls the verbosity of metrics progress (default=False). Slows down the program.

  • use_ema_net – (bool) whether to perform the test on self.ema_model.ema when it exists (otherwise self.net will be tested) (default=True)

Returns

results tuple (tuple) containing the loss items and metric values.

All of the above args will override SgModel's corresponding attributes when not equal to None. Evaluation is then run on self.test_loader with self.test_metrics.
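
A sketch of running test() on an external dataloader; my_test_loader is a placeholder torch DataLoader, and Accuracy() may need a task argument in newer torchmetrics versions:

    from torchmetrics import Accuracy

    results = sg_model.test(
        test_loader=my_test_loader,         # placeholder DataLoader over the test set
        test_metrics_list=[Accuracy()],
        metrics_progress_verbose=False,
    )
    # results is a tuple containing the loss items and metric values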

evaluate(data_loader: torch.utils.data.dataloader.DataLoader, metrics: torchmetrics.collections.MetricCollection, evaluation_type: super_gradients.training.sg_model.sg_model.EvaluationType, epoch: Optional[int] = None, silent_mode: bool = False, metrics_progress_verbose: bool = False)[source]

Evaluates the model on given dataloader and metrics.

Parameters
  • data_loader – dataloader to perform evaluation on

  • metrics – (MetricCollection) metrics for evaluation

  • evaluation_type – (EvaluationType) controls which phase callbacks will be used (for example, on batch end, when evaluation_type=EvaluationType.VALIDATION the Phase.VALIDATION_BATCH_END callbacks will be triggered)

  • epoch – (int) epoch idx

  • silent_mode – (bool) controls verbosity

  • metrics_progress_verbose – (bool) controls the verbosity of metrics progress (default=False). Slows down the program significantly.

Returns

results tuple (tuple) containing the loss items and metric values.

instantiate_net(architecture: Union[torch.nn.modules.module.Module, type, str], arch_params: dict, checkpoint_params: dict, *args, **kwargs)tuple[source]
Instantiates an nn.Module according to architecture and arch_params, and handles pretrained weights and the required module manipulation (i.e. head replacement).

Parameters
  • architecture – String, torch.nn.Module or uninstantiated SgModule class describing the network's architecture.

  • arch_params – Architecture's parameters passed to the network's constructor.

  • checkpoint_params – checkpoint loading related parameters dictionary with a 'pretrained_weights' key, such that its value is a string describing the dataset of the pretrained weights (for example "imagenet").

Returns

instantiated network, i.e. a torch.nn.Module, and architecture_class (which will be None when architecture is not a str)

Module contents

class super_gradients.training.sg_model.SgModel(experiment_name: str, device: Optional[str] = None, multi_gpu: Union[super_gradients.training.sg_model.sg_model.MultiGPUMode, str] = <MultiGPUMode.OFF: 'Off'>, model_checkpoints_location: str = 'local', overwrite_local_checkpoint: bool = True, ckpt_name: str = 'ckpt_latest.pth', post_prediction_callback: Optional[super_gradients.training.utils.detection_utils.DetectionPostPredictionCallback] = None, ckpt_root_dir=None)[source]

Bases: object

SuperGradient Model - Base Class for Sg Models

train(max_epochs: int, initial_epoch: int, save_model: bool)[source]

The main function used for training, hyper-parameter updating, logging, etc.

predict(idx: int)[source]

Returns the predictions and labels of the current inputs.

test(epoch : int, idx : int, save : bool):

Returns the test loss, accuracy and runtime.

connect_dataset_interface(dataset_interface: super_gradients.training.datasets.dataset_interfaces.dataset_interface.DatasetInterface, data_loader_num_workers: int = 8)[source]
Parameters
  • dataset_interface – DatasetInterface object - the dataset to be connected

  • data_loader_num_workers – The number of worker threads to initialize the data loaders with

build_model(architecture: Union[str, torch.nn.modules.module.Module], arch_params={}, checkpoint_params={}, *args, **kwargs)[source]
Parameters
  • architecture – Defines the network’s architecture from models/ALL_ARCHITECTURES

  • arch_params – Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

  • checkpoint_params

    Dictionary-like object with the following key:values:

    load_checkpoint: Load a pre-trained checkpoint.
    strict_load: See the StrictLoad class documentation for details.
    source_ckpt_folder_name: Folder name to load the checkpoint from (self.experiment_name if none is given).
    load_weights_only: Loads only the weights from the checkpoint and zeroizes the training params.
    load_backbone: Loads the provided checkpoint to self.net.backbone instead of self.net.
    external_checkpoint_path: The path to the external checkpoint to be loaded. Can be absolute or relative (ie: path/to/checkpoint.pth). If provided, the checkpoint will automatically be loaded even if the load_checkpoint flag is not provided.

backward_step(loss: torch.Tensor, epoch: int, batch_idx: int, context: super_gradients.training.utils.callbacks.PhaseContext, *args, **kwargs)[source]

Run backprop on the loss and perform a step.

Parameters
  • loss – The value computed by the loss function

  • optimizer – An object that can perform a gradient step and zeroize model gradients

  • epoch – number of the epoch the training is on

  • batch_idx – number of the iteration inside the current epoch

  • context – current phase context

save_checkpoint(optimizer=None, epoch: Optional[int] = None, validation_results_tuple: Optional[tuple] = None, context: Optional[super_gradients.training.utils.callbacks.PhaseContext] = None)[source]

Save the current state dict as latest (always), best (if metric was improved), epoch# (if determined in training params)

train(training_params: dict = {})[source]

train - Trains the Model

IMPORTANT NOTE: Additional batch parameters can be added as an optional third item in the tuple returned by the data loaders, in the form of a dictionary. The phase context will hold the additional items under attributes named after the keys of this dictionary, so they can be accessed through phase callbacks.

param training_params
  • max_epochs : int

    Number of epochs to run training.

  • lr_updates : list(int)

    List of fixed epoch numbers to perform learning rate updates when lr_mode=’step’.

  • lr_decay_factor : float

    Decay factor to apply to the learning rate at each update when lr_mode=’step’.

  • lr_mode : str

    Learning rate scheduling policy, one of ['step', 'poly', 'cosine', 'function']. 'step' refers to constant updates at epoch numbers passed through lr_updates. 'cosine' refers to the Cosine Annealing policy described in https://arxiv.org/abs/1608.03983. 'poly' refers to a polynomial decrease, i.e. in each epoch iteration self.lr = self.initial_lr * pow((1.0 - (current_iter / max_iter)), 0.9). 'function' refers to a user-defined learning rate scheduling function that is passed through lr_schedule_function.

  • lr_schedule_function : Union[callable,None]

    Learning rate scheduling function to be used when lr_mode is ‘function’.

  • lr_warmup_epochs : int (default=0)

    Number of epochs for learning rate warm up - see https://arxiv.org/pdf/1706.02677.pdf (Section 2.2).

  • cosine_final_lr_ratio : float (default=0.01)

    Final learning rate ratio (only relevant when `lr_mode`='cosine'). The cosine schedule starts from initial_lr and reaches initial_lr * cosine_final_lr_ratio in the last epoch.

  • initial_lr : float

    Initial learning rate.

  • loss : Union[nn.Module, str]

    Loss function for training. One of SuperGradient's built-in options:

    "cross_entropy": LabelSmoothingCrossEntropyLoss,
    "mse": MSELoss,
    "r_squared_loss": RSquaredLoss,
    "detection_loss": YoLoV3DetectionLoss,
    "shelfnet_ohem_loss": ShelfNetOHEMLoss,
    "shelfnet_se_loss": ShelfNetSemanticEncodingLoss,
    "yolo_v5_loss": YoLoV5DetectionLoss,
    "ssd_loss": SSDLoss,

    or a user-defined nn.Module loss function.

    IMPORTANT: forward(...) should return a (loss, loss_items) tuple where loss is the tensor used for backprop (i.e. what your original loss function returns), and loss_items should be a tensor of shape (n_items) holding values computed during the forward pass that we want to log over the entire epoch. For example, the loss itself should always be logged. Another example is a scenario where the computed loss is the sum of a few components we would like to log; these components go into loss_items.

    When training, set the loss_logging_items_names parameter in train_params to be a list of strings of length n_items whose ith element is the name of the ith entry in loss_items. Each item will then be logged, rendered on tensorboard and "watched" (i.e. model checkpoints are saved according to it).

    Since running logs will save the loss_items in some internal state, it is recommended to detach loss_items from their computational graph for memory efficiency.

  • optimizer : Union[str, torch.optim.Optimizer]

    Optimization algorithm. One of ['Adam', 'SGD', 'RMSProp'], corresponding to the torch.optim optimizer implementations, or any object that implements torch.optim.Optimizer.

  • criterion_params : dict

    Loss function parameters.

  • optimizer_params : dict

    When optimizer is one of [‘Adam’,’SGD’,’RMSProp’], it will be initialized with optimizer_params.

    (see https://pytorch.org/docs/stable/optim.html for the full list of parameters for each optimizer).

  • train_metrics_list : list(torchmetrics.Metric)

    Metrics to log during training. For more information on torchmetrics see https://torchmetrics.rtfd.io/en/latest/.

  • valid_metrics_list : list(torchmetrics.Metric)

    Metrics to log during validation/testing. For more information on torchmetrics see https://torchmetrics.rtfd.io/en/latest/.

  • loss_logging_items_names : list(str)

    The list of names/titles for the outputs returned from the loss function's forward pass (reminder: the loss function should return the tuple (loss, loss_items)). These names will be used for logging their values.

  • metric_to_watch : str (default=”Accuracy”)

    The metric according to which the model checkpoint will be saved. It can be set to any of the following:

    a metric name (str) of one of the metric objects from the valid_metrics_list

    a component name (str), if some metric in valid_metrics_list has an attribute component_names, which is a list naming each entry of the output metric (a torch tensor of size n)

    one of the "loss_logging_items_names", i.e. a name corresponding to an item returned during the loss function's forward pass.

    At the end of each epoch, if a new best metric_to_watch value is achieved, the models checkpoint is saved in YOUR_PYTHON_PATH/checkpoints/ckpt_best.pth

  • greater_metric_to_watch_is_better : bool

    When choosing a model's checkpoint to be saved, the best achieved model is the one that maximizes the metric_to_watch when this parameter is set to True, and the one that minimizes it otherwise.

  • ema : bool (default=False)

    Whether to use Model Exponential Moving Average (see https://github.com/rwightman/pytorch-image-models ema implementation)

  • batch_accumulate : int (default=1)

    Number of batches to accumulate before every backward pass.

  • ema_params : dict

    Parameters for the ema model.

  • zero_weight_decay_on_bias_and_bn : bool (default=False)

    Whether to set weight decay to zero for bias and batch-normalization parameters (ignored when the passed optimizer has already been initialized).

  • load_opt_params : bool (default=True)

    Whether to load the optimizer's parameters as well when loading a model's checkpoint.

  • run_validation_freq : int (default=1)

    The frequency at which validation is performed during training (i.e. validation is run every run_validation_freq epochs).

  • save_model : bool (default=True)

    Whether to save the model checkpoints.

  • silent_mode : bool

    Silences the printouts.

  • mixed_precision : bool

    Whether to use mixed precision or not.

  • save_ckpt_epoch_list : list(int) (default=[])

    List of fixed epoch indices the user wishes to save checkpoints in.

  • average_best_models : bool (default=False)

    If set, a snapshot dictionary file and the average model will be saved / updated at every epoch and evaluated only when training is completed. The snapshot file will only be deleted upon completing the training. The snapshot dict will be managed on cpu.

  • precise_bn : bool (default=False)

    Whether to use precise_bn calculation during the training.

  • precise_bn_batch_size : int (default=None)

    The effective batch size we want to calculate the batchnorm on. For example, if we are training a model on 8 gpus, with a batch of 128 on each gpu, a good rule of thumb would be to give it 8192 (ie: effective_batch_size * num_gpus = batch_per_gpu * num_gpus * num_gpus). If precise_bn_batch_size is not provided in the training_params, the latter heuristic will be taken.

  • seed : int (default=42)

    Random seed to be set for torch, numpy, and random. When using DDP each process will have its seed set to seed + rank.

  • log_installed_packages : bool (default=False)

    When set, the list of all installed packages (and their versions) will be written to the tensorboard and logfile (useful when trying to reproduce results).

  • dataset_statistics : bool (default=False)

    Enable a statistic analysis of the dataset. If set to True the dataset will be analyzed and a report will be added to the tensorboard along with some sample images from the dataset. Currently only detection datasets are supported for analysis.

  • save_full_train_log : bool (default=False)

    When set, a full log (of all super_gradients modules, including uncaught exceptions from any other module) of the training will be saved in the checkpoint directory under full_train_log.log.

  • sg_logger : Union[AbstractSGLogger, str] (default=base_sg_logger)

    Define the SGLogger object for this training process. The SGLogger handles all disk writes, logs, TensorBoard, remote logging and remote storage. By overriding the default base_sg_logger, you can change the storage location, support external monitoring and logging or support remote storage.

  • sg_logger_params : dict

    SGLogger parameters

  • clip_grad_norm : float

    Defines a maximal L2 norm of the gradients. Values which exceed the given value will be clipped

Returns

predict(inputs, targets=None, half=False, normalize=False, verbose=False, move_outputs_to_cpu=True)[source]

A fast predictor for a batch of inputs.

Parameters
  • inputs – torch.tensor or numpy.array - a batch of inputs

  • targets – torch.tensor() corresponding labels - if none are given, accuracy will not be computed

  • verbose – bool print the results to screen

  • normalize – bool If true, normalizes the tensor according to the dataloader’s normalization values

  • half – Performs half precision evaluation

  • move_outputs_to_cpu – Moves the results from the GPU to the CPU

Returns

outputs, acc, net_time, gross_time - the network's predictions, the accuracy calculation, the forward-pass net time and the function's gross time

compute_model_runtime(input_dims: Optional[tuple] = None, batch_sizes: Union[tuple, list, int] = (1, 8, 16, 32, 64), verbose: bool = True)[source]

Compute the “atomic” inference time and throughput. Atomic refers to calculating the forward pass independently, discarding effects such as data augmentation, data upload to device, multi-gpu distribution etc.

Parameters
  • input_dims – tuple Shape of a basic input to the network (without the batch dimension), e.g. (3, 224, 224); if None, an input from the test loader is used

  • batch_sizes – int or list Batch sizes for latency calculation

  • verbose – bool Prints results to screen

Returns

log: dict Latency and throughput for each tested batch size

get_arch_params()[source]
get_structure()[source]
get_architecture()[source]
set_experiment_name(experiment_name)[source]
re_build_model(arch_params={})[source]
arch_params : dict

Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

Returns

update_architecture(structure)[source]
architecture : str

Defines the network’s architecture according to the options in models/all_architectures

load_checkpoint : bool

Loads a checkpoint according to experiment_name

arch_params : dict

Architecture H.P. e.g.: block, num_blocks, num_classes, etc.

Returns

get_module()[source]
set_module(module)[source]
test(test_loader: Optional[torch.utils.data.dataloader.DataLoader] = None, loss: Optional[torch.nn.modules.loss._Loss] = None, silent_mode: bool = False, test_metrics_list=None, loss_logging_items_names=None, metrics_progress_verbose=False, test_phase_callbacks=None, use_ema_net=True)tuple[source]

Evaluates the model on given dataloader and metrics.

Parameters
  • test_loader – dataloader to perform test on.

  • test_metrics_list – (list(torchmetrics.Metric)) metrics list for evaluation.

  • silent_mode – (bool) controls verbosity

  • metrics_progress_verbose – (bool) controls the verbosity of metrics progress (default=False). Slows down the program.

  • use_ema_net – (bool) whether to perform the test on self.ema_model.ema when it exists (otherwise self.net will be tested) (default=True)

Returns

results tuple (tuple) containing the loss items and metric values.

All of the above args will override SgModel's corresponding attributes when not equal to None. Evaluation is then run on self.test_loader with self.test_metrics.

evaluate(data_loader: torch.utils.data.dataloader.DataLoader, metrics: torchmetrics.collections.MetricCollection, evaluation_type: super_gradients.training.sg_model.sg_model.EvaluationType, epoch: Optional[int] = None, silent_mode: bool = False, metrics_progress_verbose: bool = False)[source]

Evaluates the model on given dataloader and metrics.

Parameters
  • data_loader – dataloader to perform evaluation on

  • metrics – (MetricCollection) metrics for evaluation

  • evaluation_type – (EvaluationType) controls which phase callbacks will be used (for example, on batch end, when evaluation_type=EvaluationType.VALIDATION the Phase.VALIDATION_BATCH_END callbacks will be triggered)

  • epoch – (int) epoch idx

  • silent_mode – (bool) controls verbosity

  • metrics_progress_verbose – (bool) controls the verbosity of metrics progress (default=False). Slows down the program significantly.

Returns

results tuple (tuple) containing the loss items and metric values.

instantiate_net(architecture: Union[torch.nn.modules.module.Module, type, str], arch_params: dict, checkpoint_params: dict, *args, **kwargs)tuple[source]
Instantiates an nn.Module according to architecture and arch_params, and handles pretrained weights and the required module manipulation (i.e. head replacement).

Parameters
  • architecture – String, torch.nn.Module or uninstantiated SgModule class describing the network's architecture.

  • arch_params – Architecture's parameters passed to the network's constructor.

  • checkpoint_params – checkpoint loading related parameters dictionary with a 'pretrained_weights' key, such that its value is a string describing the dataset of the pretrained weights (for example "imagenet").

Returns

instantiated network, i.e. a torch.nn.Module, and architecture_class (which will be None when architecture is not a str)

class super_gradients.training.sg_model.MultiGPUMode(value)[source]

Bases: str, enum.Enum

OFF                       - Single GPU Mode / CPU Mode
DATA_PARALLEL             - Multiple GPUs, Synchronous
DISTRIBUTED_DATA_PARALLEL - Multiple GPUs, Asynchronous
OFF = 'Off'
DATA_PARALLEL = 'DP'
DISTRIBUTED_DATA_PARALLEL = 'DDP'
AUTO = 'AUTO'
class super_gradients.training.sg_model.StrictLoad(value)[source]

Bases: enum.Enum

Wrapper for adding more functionality to torch's strict_load parameter in load_state_dict().

Attributes:

OFF - Native torch "strict_load = off" behaviour. See nn.Module.load_state_dict() documentation for more details.
ON - Native torch "strict_load = on" behaviour. See nn.Module.load_state_dict() documentation for more details.
NO_KEY_MATCHING - Allows the usage of SuperGradient's adapt_checkpoint function, which loads a checkpoint by matching each layer's shapes, bypassing the strict matching of layer names (i.e. it disregards state_dict key matching).

OFF = False
ON = True
NO_KEY_MATCHING = 'no_key_matching'