Training¶
Training is launched via train.py and is fully config-driven.
Recommended CLI Usage¶
Use explicit config paths to avoid ambiguity:
/work/envs/depth/bin/python train.py \
--data-config configs/data_config_eo_4band.yaml \
--train-config configs/training_config_eo_4band.yaml \
--model-config configs/model_config_eo_4band.yaml
CLI aliases:
- --train-config and --training-config are equivalent
- --model-config also accepts the typo alias --mdoel-config
- --set <root.path=value> is repeatable for strict nested overrides, where root is one of data, training, or model
Override example:
/work/envs/depth/bin/python train.py \
--data-config configs/data_config_eo_4band.yaml \
--train-config configs/training_config_eo_4band.yaml \
--model-config configs/model_config_eo_4band.yaml \
--set data.dataset.mask_fraction=0.99 \
--set data.dataset.eo_dropout_prob=0.0 \
--set training.trainer.max_epochs=100 \
--set training.wandb.run_name=null
Single-Band (Legacy Config Set)¶
For the single-band setup in this repo, use configs/older_configs/*:
/work/envs/depth/bin/python train.py \
--data-config configs/older_configs/data_config.yaml \
--train-config configs/older_configs/training_config.yaml \
--model-config configs/older_configs/model_config.yaml
Before launching a fresh run with this legacy set, set
model.resume_checkpoint: false (or null) in configs/older_configs/model_config.yaml.
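A minimal excerpt of what that should look like (whether the key nests under a top-level model block inside the file is an assumption inferred from the dotted path):

# configs/older_configs/model_config.yaml (excerpt)
model:
  resume_checkpoint: false   # or null; must not point at an old .ckpt when starting a fresh run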
Important Config Notes¶
train.py currently supports only:
- dataset.dataloader_type: "light"
- model.model_type: "cond_px_dif"
Additional notes (see the config excerpt after this list):
- the dataset variant is selected by dataset.dataset_variant (or inferred from the data config filename)
- EO dropout from the data config is injected into the dataset object for both train and val
- parser defaults in train.py still point to legacy configs/*_config.yaml names, so explicit CLI paths are recommended
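As a concrete reference, the pinned values in config form; the nesting mirrors the dotted paths above, and the variant name is a placeholder:

# data config (excerpt)
dataset:
  dataloader_type: "light"     # only supported value
  dataset_variant: eo_4band    # placeholder; may also be inferred from the filename

# model config (excerpt)
model:
  model_type: "cond_px_dif"    # only supported value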
What train.py Does During Startup¶
- Resolves distributed rank and creates a run directory under logs/<timestamp> on global rank 0.
- Copies the exact config files into the run directory for reproducibility.
- Loads configs and validates model.resume_checkpoint early.
- Builds the dataset and datamodule.
- Instantiates PixelDiffusionConditional.from_config(...).
- Sets up the W&B logger and callbacks.
Checkpointing and Resume¶
ModelCheckpoint behavior:
- best checkpoint: best-epoch{epoch:03d}.ckpt (monitored metric set by trainer.ckpt_monitor)
- always saved: last.ckpt
- location: current run folder under logs/
Resume behavior:
- set model.resume_checkpoint to a valid .ckpt path
- invalid path fails early before trainer start
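A minimal resume setup, with a hypothetical run folder for illustration:

# model config (excerpt)
model:
  resume_checkpoint: logs/2025-01-15_10-30-00/last.ckpt   # hypothetical path; must exist, or train.py fails before the trainer starts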
Device, Precision, and Validation Controls¶
From training_config trainer section:
- accelerator and device selection (accelerator, devices, optional legacy num_gpus)
- mixed precision (precision)
- optional validation cap via val_batches_per_epoch or limit_val_batches
- gradient clipping (gradient_clip_val)
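For orientation, a trainer section exercising these controls; only the key names come from this section, while the values and the Lightning-style precision string are assumptions:

# training config (excerpt; values are illustrative)
trainer:
  accelerator: gpu
  devices: 2                # or the legacy num_gpus key
  precision: 16-mixed       # mixed precision; assumes Lightning-style precision values
  limit_val_batches: 50     # or val_batches_per_epoch, whichever this config set uses
  gradient_clip_val: 1.0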
Learning Rate Behavior¶
PixelDiffusionConditional supports:
- step-based linear warmup in optimizer_step
- ReduceLROnPlateau scheduler when enabled
Warmup and scheduler are configured via:
- scheduler.warmup.*
- scheduler.reduce_on_plateau.*
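A sketch of the two scheduler blocks; only the scheduler.warmup.* and scheduler.reduce_on_plateau.* namespaces come from the section above, and the leaf keys and values are assumptions:

# training config (excerpt; leaf key names are assumptions)
scheduler:
  warmup:
    enabled: true
    steps: 1000            # linear warmup applied per optimizer_step
  reduce_on_plateau:
    enabled: true
    factor: 0.5            # LR multiplier applied on plateau
    patience: 5            # evaluations without improvement before reducing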
Logging¶
W&B logging is configured in training_config.wandb.
Notable behavior:
- gradients/parameters watching is opt-in via watch_gradients / watch_parameters
- periodic scalar/image logging intervals are configurable
- config files are uploaded to W&B run files (when the W&B experiment handle is available)
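As an illustration, a wandb block covering the behaviors above; watch_gradients and watch_parameters come from this section, while the interval key names are assumptions:

# training config (excerpt)
wandb:
  run_name: null           # null -> auto-generated name
  watch_gradients: false   # opt-in
  watch_parameters: false  # opt-in
  scalar_log_interval: 50  # hypothetical key: steps between scalar logs
  image_log_interval: 500  # hypothetical key: steps between image logs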
W&B Occlusion Sweep (EO Always Available)¶
Sweep config:
- configs/sweeps/eo_occlusion_grid_no_eodrop.yaml
This sweep runs the following grid (sketched as a sweep file after this list):
- mask_fraction: 0.95, 0.96, 0.97, 0.98, 0.99, 0.995
- fixed overrides:
- data.dataset.eo_dropout_prob=0.0
- training.trainer.max_epochs=100
- training.wandb.run_name=null (auto-generated run names)
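A sketch of that grid in standard W&B sweep YAML (the real file, including its command template, may differ):

# configs/sweeps/eo_occlusion_grid_no_eodrop.yaml (sketch)
method: grid
parameters:
  data.dataset.mask_fraction:
    values: [0.95, 0.96, 0.97, 0.98, 0.99, 0.995]
  data.dataset.eo_dropout_prob:
    value: 0.0
  training.trainer.max_epochs:
    value: 100
  training.wandb.run_name:
    value: null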
Launch:
./scripts/start_occlusion_sweep.sh
Equivalent manual steps:
/work/envs/depth/bin/wandb sweep configs/sweeps/eo_occlusion_grid_no_eodrop.yaml
/work/envs/depth/bin/wandb agent <entity/project/sweep_id>