Training¶
Training is launched via train.py and is fully config-driven.
Recommended CLI Usage¶
Use explicit config paths to avoid ambiguity:
/work/envs/depth/bin/python train.py \
--data-config configs/px_space/data_ostia_argo_disk.yaml \
--train-config configs/px_space/training_config.yaml \
--model-config configs/px_space/model_config.yaml
CLI aliases:
- --train-config and --training-config are equivalent
- --model-config also accepts the typo alias --mdoel-config
- --set <root.path=value> is repeatable for strict nested overrides (root is one of data, training, or model)
- because configs/px_space/model_config.yaml itself nests everything under a top-level model: key, model overrides must use model.model.* (see the sketch and example below)
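For context, a minimal sketch of the model config's shape (illustrative keys only; the actual file may contain more):

model:                      # top-level key inside the file itself
  model_type: cond_px_dif
  ambient_occlusion:
    enabled: false
# the --set root contributes the first "model.", hence model.model.ambient_occlusion.enabled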
Override example:
/work/envs/depth/bin/python train.py \
--data-config configs/px_space/data_ostia_argo_disk.yaml \
--train-config configs/px_space/training_config.yaml \
--model-config configs/px_space/model_config.yaml \
--set data.dataset.output.return_info=true \
--set training.trainer.max_epochs=100 \
--set training.wandb.run_name=null
Ambient-occlusion objective example (self-supervised on x):
/work/envs/depth/bin/python train.py \
--data-config configs/px_space/data_ostia_argo_disk.yaml \
--train-config configs/px_space/training_config.yaml \
--model-config configs/px_space/model_config_ambient.yaml \
--set training.wandb.run_name=ambient_ostia_argo_disk_v1
When enabled, training logs:
- train/ambient_further_drop_fraction
- train/ambient_observed_fraction_original
- train/ambient_observed_fraction_further
- same metrics under val/* on validation epochs
See Ambient Occlusion Objective for the full derivation, figure walkthrough, and paper citation.
Note: setting model.ambient_occlusion.enabled back to false switches training to direct y reconstruction over y_valid_mask. With model.mask_loss_with_valid_pixels=true, the standard task uses y_valid_mask, while the ambient objective uses x_valid_mask ∩ y_valid_mask.
For CLI overrides, the corresponding path is model.model.ambient_occlusion.enabled=false.
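A hedged sketch of how these two keys sit in the model config (values shown are illustrative, not defaults):

model:
  mask_loss_with_valid_pixels: true   # standard task masks the loss with y_valid_mask
  ambient_occlusion:
    enabled: true                     # ambient objective masks with x_valid_mask ∩ y_valid_mask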
Important Config Notes¶
- dataset.core.dataloader_type is expected to be "light" in the training runner
- model.model_type="cond_px_dif" runs pixel-space diffusion; model.model_type="latent_cond_dif" runs latent diffusion with the autoencoder bridge
- dataset variant is selected by dataset.core.dataset_variant (or inferred from the data config filename); dataset_variant selects the dataset implementation in code ("eo_4band", "ostia", or "ostia_argo_disk")
- SurfaceTempPatchOstiaLightDataset still does not apply EO degradation (no EO dropout/random-scale/speckle) when you train the legacy OSTIA overlap dataset through an older config
- EO dropout from the data config is injected into the dataset object for both train and val
- parser defaults in train.py now point to configs/px_space/*.yaml; explicit CLI paths are still recommended for reproducibility
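A hedged excerpt showing where these keys live (values are illustrative):

# data config
dataset:
  core:
    dataloader_type: light             # expected by the training runner
    dataset_variant: ostia_argo_disk   # or "eo_4band" / "ostia"
  conditioning:
    eo_dropout_prob: 0.1               # injected into both train and val datasets

# model config
model:
  model_type: cond_px_dif              # or latent_cond_dif for the latent path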
Autoencoder and Latent Workflow¶
For latent diffusion training, use the latent config domain under configs/lat_space/.
Autoencoder pretraining command:
/work/envs/depth/bin/python train_autoencoder.py \
--ae-config configs/lat_space/ae_config.yaml \
--data-config configs/lat_space/data_config.yaml \
--train-config configs/lat_space/training_config.yaml
Latent diffusion command:
/work/envs/depth/bin/python train.py \
--data-config configs/lat_space/data_config.yaml \
--train-config configs/lat_space/training_config.yaml \
--model-config configs/lat_space/model_config.yaml
Repo launcher scripts:
- ./scripts/train_autoencoder.sh
- ./scripts/train_latent_diffusion.sh
Design details, model goals, and limitations are documented in Autoencoder + Latent Diffusion.
What train.py Does During Startup¶
- Resolves the distributed rank and creates a run directory under logs/<timestamp> on global rank 0.
- Copies the exact config files into the run directory for reproducibility.
- Loads configs and validates model.resume_checkpoint / model.load_checkpoint early.
- Builds the dataset and datamodule.
- Instantiates PixelDiffusionConditional.from_config(...).
- Sets up the W&B logger and callbacks.
Checkpointing and Resume¶
ModelCheckpoint behavior:
- best checkpoint: best-epoch{epoch:03d}.ckpt (monitor from trainer.ckpt_monitor)
- always saved: last.ckpt
- location: current run folder under logs/
Resume and warm-start behavior:
- set model.resume_checkpoint to a valid .ckpt path to resume full Lightning state (model + optimizer/scheduler/trainer state)
- set model.load_checkpoint to a valid .ckpt path to load only model state_dict before training starts (optimizer/scheduler state is re-initialized)
- model.resume_checkpoint and model.load_checkpoint are mutually exclusive
- invalid path fails early before trainer start
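A hedged sketch of the two warm-start modes (paths are hypothetical; set one, never both):

# resume full Lightning state (model + optimizer/scheduler/trainer)
model:
  resume_checkpoint: logs/<timestamp>/last.ckpt
  load_checkpoint: null

# or: load model weights only; optimizer/scheduler state is re-initialized
model:
  resume_checkpoint: null
  load_checkpoint: logs/<timestamp>/best-epoch042.ckpt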
Device, Precision, and Validation Controls¶
From training_config trainer section:
- accelerator/device selection (accelerator, devices, optional legacy num_gpus)
- mixed precision (precision)
- optional validation cap via val_batches_per_epoch or limit_val_batches
- gradient clipping (gradient_clip_val)
- epoch-end full-reconstruction validation diagnostics run on global rank 0 only; regular validation_step loss metrics still use distributed reduction
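An illustrative trainer block assembling the controls above (values are examples, not defaults; the precision string and monitor name are assumptions):

trainer:
  accelerator: gpu
  devices: 2
  precision: 16-mixed       # mixed precision (Lightning-style string assumed)
  gradient_clip_val: 1.0
  limit_val_batches: 50     # or val_batches_per_epoch
  ckpt_monitor: val/loss    # metric monitored by ModelCheckpoint (name assumed)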
Learning Rate Behavior¶
PixelDiffusionConditional supports:
- step-based linear warmup in optimizer_step
- ReduceLROnPlateau scheduler when enabled
Warmup and scheduler are configured via:
- scheduler.warmup.*
- scheduler.reduce_on_plateau.*
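An illustrative scheduler block (only the two namespaces are documented here; the sub-keys shown are assumptions):

scheduler:
  warmup:
    enabled: true
    steps: 1000        # step-based linear warmup applied in optimizer_step
  reduce_on_plateau:
    enabled: true
    factor: 0.5
    patience: 5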
Logging¶
W&B logging is configured in training_config.wandb.
Notable behavior:
- gradient/parameter watching is opt-in via watch_gradients / watch_parameters
- periodic scalar/image logging intervals are configurable
- config files are uploaded to W&B run files (when experiment handle is available)
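An illustrative wandb block (the interval key name is an assumption):

wandb:
  run_name: null           # null = auto-generated run name
  watch_gradients: false   # opt-in
  watch_parameters: false  # opt-in
  image_log_interval: 500  # periodic image logging (key name assumed)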
W&B Occlusion Sweep (EO Always Available)¶
Sweep config:
- configs/px_space/sweeps/eo_occlusion_grid_no_eodrop.yaml
The sweep runs a grid over:
- mask_fraction: 0.95, 0.96, 0.97, 0.98, 0.99, 0.995
- fixed overrides:
- data.dataset.conditioning.eo_dropout_prob=0.0
- training.trainer.max_epochs=100
- training.wandb.run_name=null (auto-generated run names)
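For orientation, a hedged sketch of the sweep file's likely shape in standard W&B sweep syntax (the exact parameter path for mask_fraction and the program wiring are assumptions):

program: train.py
method: grid
parameters:
  mask_fraction:
    values: [0.95, 0.96, 0.97, 0.98, 0.99, 0.995]
  data.dataset.conditioning.eo_dropout_prob:
    value: 0.0
  training.trainer.max_epochs:
    value: 100
  training.wandb.run_name:
    value: null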
Launch:
./scripts/start_occlusion_sweep.sh
Equivalent manual steps:
/work/envs/depth/bin/wandb sweep configs/px_space/sweeps/eo_occlusion_grid_no_eodrop.yaml
/work/envs/depth/bin/wandb agent <entity/project/sweep_id>