Data Contract
This page defines the model-facing sample produced from the GeoTIFF workflow.
The files on disk are byte-encoded rasters and a preprocessed ARGO profile
store; the loader decodes them into physical units, normalizes temperature,
and returns PyTorch tensors with the shapes below. Salinity tensors are
returned when the selected scenario needs salinity (salinity or joint).
Axes
Symbols used in this contract:
| Symbol | Meaning | Default |
|---|---|---|
| `B` | batch size after DataLoader collation | configured by training |
| `D` | GLORYS depth levels | 50 |
| `H` | patch height in raster rows | 128 |
| `W` | patch width in raster columns | 128 |
The default horizontal resolution is 0.1 degrees. With H = W = 128, one
training patch covers 12.8 x 12.8 degrees. The default GeoTIFF patch stride is
32 pixels, so neighboring patches overlap by 75% of a tile.
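The coverage and overlap figures follow directly from the defaults. A minimal sketch of the arithmetic, where `patch_geometry` is a hypothetical helper name used only for illustration:

```python
def patch_geometry(patch_px=128, stride_px=32, resolution_deg=0.1):
    """Return (extent in degrees, overlap in pixels, overlap fraction) along one axis."""
    extent_deg = patch_px * resolution_deg
    # Adjacent patches share every pixel that the stride does not skip.
    overlap_px = max(patch_px - stride_px, 0)
    overlap_frac = overlap_px / patch_px
    return extent_deg, overlap_px, overlap_frac


extent_deg, overlap_px, overlap_frac = patch_geometry()
print(extent_deg, overlap_px, overlap_frac)
```

With the defaults, adjacent patches share 96 of their 128 pixels along the stride axis, which is the 75% figure quoted above.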
Sample Keys
Each dataset item is a dictionary. After collation, tensor keys gain a leading
batch dimension. data.dataset.output.fields controls which physical fields are loaded: temperature uses x/y, salinity-only uses only the salinity keys, and joint uses both groups.
| Key | Item shape | Batch shape | Dtype | Meaning |
|---|---|---|---|---|
| `eo` | `(1, H, W)` | `(B, 1, H, W)` | float32 | Dense OSTIA surface temperature context, normalized. |
| `x` | `(D, H, W)` | `(B, D, H, W)` | float32 | Sparse ARGO temperature observations, normalized and zero-filled where missing. |
| `y` | `(D, H, W)` | `(B, D, H, W)` | float32 | Dense GLORYS `thetao` target, normalized and zero-filled where invalid. |
| `x_salinity` | `(D, H, W)` | `(B, D, H, W)` | float32 | Opt-in sparse ARGO salinity observations, normalized and zero-filled where missing. |
| `y_salinity` | `(D, H, W)` | `(B, D, H, W)` | float32 | Opt-in dense GLORYS `so` salinity target, normalized and zero-filled where invalid. |
| `x_valid_mask` | `(D, H, W)` | `(B, D, H, W)` | bool | True where `x` contains an observed ARGO temperature value. |
| `y_valid_mask` | `(D, H, W)` | `(B, D, H, W)` | bool | True where the GLORYS target is valid ocean data. |
| `x_salinity_valid_mask` | `(D, H, W)` | `(B, D, H, W)` | bool | Opt-in mask where `x_salinity` contains an observed ARGO salinity value. |
| `y_salinity_valid_mask` | `(D, H, W)` | `(B, D, H, W)` | bool | Opt-in mask where the GLORYS salinity target is valid ocean data. |
| `x_valid_mask_1d` | `(1, H, W)` | `(B, 1, H, W)` | bool | True where any ARGO temperature depth is present in that horizontal pixel. |
| `x_salinity_valid_mask_1d` | `(1, H, W)` | `(B, 1, H, W)` | bool | Opt-in mask where any ARGO salinity depth is present in that horizontal pixel. |
| `land_mask` | `(1, H, W)` | `(B, 1, H, W)` | float32 | GLORYS spatial ocean/domain support; 1 where any active target depth is finite and 0 elsewhere. |
| `date` | scalar | `(B,)` | integer | GLORYS target date as YYYYMMDD. |
| `coords` | `(2,)` | `(B, 2)` | float32 | Optional patch-center latitude and longitude. |
| `info` | dictionary | list-like | metadata | Optional debugging metadata, not part of the training model input. |
`x_valid_mask` is ARGO observation support; it is collapsed to one channel (`x_valid_mask_1d`) only when it is used as conditioning. `land_mask` is GLORYS-derived spatial ocean/domain support and gates the diffusion loss together with the task-valid mask. If GLORYS support is unavailable for mask construction, the loader falls back to finite EO support and then to the configured on-disk mask.

Train/validation dataloaders do not return the common on-disk mask; callers may pass an optional `output_land_mask` directly to `predict_step` for final cleanup overlays.

Training code should not infer missing values from zeros in `x`, `y`, `eo`, or the optional `x_salinity` and `y_salinity`. Zero is the fill value, but a valid observation at the normalization mean also maps to zero, so only the validity masks distinguish the two cases.
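The relationship between the depth-resolved masks and the single-channel ones can be sketched with NumPy. This is an illustrative reduction under the definitions in the table above, not the loader's actual code; the toy shapes and the `y_physical` array are assumptions made for the example.

```python
import numpy as np

D, H, W = 4, 3, 3  # small stand-ins for the defaults (50, 128, 128)

# Depth-resolved ARGO support: one profile at pixel (1, 2), top two levels only.
x_valid_mask = np.zeros((D, H, W), dtype=bool)
x_valid_mask[:2, 1, 2] = True

# Collapse depth: a pixel counts as observed if any depth level is present.
x_valid_mask_1d = x_valid_mask.any(axis=0, keepdims=True)  # (1, H, W)

# Hypothetical decoded GLORYS target with NaN where invalid (column 0 is land).
y_physical = np.full((D, H, W), 290.0, dtype=np.float32)
y_physical[:, :, 0] = np.nan

# land_mask: 1.0 where any target depth is finite, 0.0 elsewhere.
land_mask = np.isfinite(y_physical).any(axis=0, keepdims=True).astype(np.float32)

print(x_valid_mask_1d.shape, land_mask.shape)
```

Both reductions keep the leading channel axis so the results match the `(1, H, W)` item shape in the table.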
Salinity Scenarios
The active super-configs do not require users to maintain salinity flags by hand. Select the scenario instead:
/work/envs/depth/bin/python train.py --scenario temperature
/work/envs/depth/bin/python train.py --scenario salinity
/work/envs/depth/bin/python train.py --scenario joint
The resolver writes dataset.output.fields for the selected scenario and sets dataset.output.include_salinity=false for temperature and true for salinity and joint. It also derives the EO raster source (ostia/analysed_sst for temperature and joint, sss/sos for salinity), model.output_fields, model.generated_channels, and model.condition_channels, so the GeoTIFF loader and model agree before batches are built.
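The scenario rules stated on this page can be summarized as a lookup table. The sketch below is not the project's resolver: `SCENARIOS`, `expected_keys`, and the assumption that `eo`, `land_mask`, and `date` are present in every scenario are illustrative, and the `model.*` settings the resolver derives are omitted because their values are not given here.

```python
TEMPERATURE_KEYS = ("x", "y", "x_valid_mask", "y_valid_mask", "x_valid_mask_1d")
SALINITY_KEYS = (
    "x_salinity",
    "y_salinity",
    "x_salinity_valid_mask",
    "y_salinity_valid_mask",
    "x_salinity_valid_mask_1d",
)
COMMON_KEYS = ("eo", "land_mask", "date")  # assumed present in every scenario

# Hypothetical summary of the rules described in this section.
SCENARIOS = {
    "temperature": {
        "include_salinity": False,
        "eo_source": "ostia/analysed_sst",
        "key_groups": (TEMPERATURE_KEYS,),
    },
    "salinity": {
        "include_salinity": True,
        "eo_source": "sss/sos",
        "key_groups": (SALINITY_KEYS,),
    },
    "joint": {
        "include_salinity": True,
        "eo_source": "ostia/analysed_sst",
        "key_groups": (TEMPERATURE_KEYS, SALINITY_KEYS),
    },
}


def expected_keys(scenario):
    """List the tensor keys a batch should carry for the given scenario."""
    spec = SCENARIOS[scenario]
    keys = list(COMMON_KEYS)
    for group in spec["key_groups"]:
        keys.extend(group)
    return keys


joint_keys = expected_keys("joint")
salinity_keys = expected_keys("salinity")
print(salinity_keys)
```

A table like this could back a unit test that checks a collated batch against the selected scenario before training starts.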
Loading Steps
For each selected (patch, date) row, the GeoTIFF loader should:
- Build a rasterio window from the shared land-mask grid.
- Read `rasters/glorys/thetao/thetao_YYYYMMDD.tif` as `(D, H, W)`.
- Read the resolved EO raster as `(H, W)` and add the leading channel dimension: `rasters/ostia/analysed_sst/...` for temperature/joint or `rasters/sss/sos/...` for salinity.
- If the resolved scenario enables salinity, read `rasters/glorys/so/so_YYYYMMDD.tif` as `(D, H, W)`.
- Query preprocessed ARGO profiles assigned to the same target date and patch window.
- Rasterize ARGO temperature, plus salinity when enabled, onto `(D, H, W)` using precomputed `grid_row` and `grid_col`; average duplicate observations in the same depth/pixel cell.
- Build validity masks from GeoTIFF nodata codes and ARGO valid flags.
- Derive `land_mask` from finite GLORYS target support, with fallback to finite EO support and then the configured on-disk mask.
- Normalize temperature, plus salinity when enabled, and replace NaN or infinite normalized values with `0.0`.
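The rasterization step, including the averaging of duplicate observations, might look like the following NumPy sketch. It assumes `grid_row` and `grid_col` are already relative to the patch window and that each observation carries a depth index; `rasterize_mean` and `depth_idx` are hypothetical names, not identifiers from the project.

```python
import numpy as np


def rasterize_mean(values, depth_idx, grid_row, grid_col, shape):
    """Scatter point observations onto a (D, H, W) grid, averaging duplicates."""
    total = np.zeros(shape, dtype=np.float64)
    count = np.zeros(shape, dtype=np.int64)
    index = (depth_idx, grid_row, grid_col)
    # np.add.at is unbuffered, so repeated cells accumulate instead of overwriting.
    np.add.at(total, index, values)
    np.add.at(count, index, 1)
    valid = count > 0
    grid = np.full(shape, np.nan, dtype=np.float32)
    grid[valid] = (total[valid] / count[valid]).astype(np.float32)
    return grid, valid


# Two observations fall in the same cell (0, 1, 2); a third lands in (1, 2, 0).
values = np.array([290.0, 292.0, 285.0])
depth_idx = np.array([0, 0, 1])
grid_row = np.array([1, 1, 2])
grid_col = np.array([2, 2, 0])
grid, valid = rasterize_mean(values, depth_idx, grid_row, grid_col, (2, 3, 3))
print(grid[0, 1, 2], int(valid.sum()))
```

The returned `valid` array plays the role of `x_valid_mask`; cells with no observation stay NaN until the normalization step zero-fills them.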
Sea-level adt and SSS dos are exported on the same grid for auxiliary experiments. SSS sos is the default salinity-scenario EO channel.
Decoding
All exported rasters use uint8 with 255 reserved for nodata:
0..254 = valid stretched values
255 = nodata
decoded = minimum + code / 254 * (maximum - minimum)
Temperature rasters decode to Kelvin:
| Product | Variable | Decoded units | Stretch |
|---|---|---|---|
| GLORYS | `thetao` | Kelvin | [270.15, 308.15] |
| OSTIA | `analysed_sst` | Kelvin | [270.15, 308.15] |
| ARGO | temperature | Kelvin | [270.15, 308.15] |
Other exported variables decode to their physical units:
| Product | Variable | Decoded units | Stretch |
|---|---|---|---|
| GLORYS | `so` | PSU | [30, 40] |
| ARGO | salinity | PSU | [30, 40] |
| Sea Level L4 | `adt` | meters | [-2, 2] |
| SSS | `sos` | PSU | [30, 40] |
| SSS | `dos` | kg/m3 | [1000, 1035] |
The loader must treat code 255 as missing before normalization. Valid code
0 is a real clipped value, not missing.
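The decoding rule can be written as a short NumPy function. This is a sketch of the formula above rather than the loader's implementation; `decode_uint8` is a hypothetical name, and returning NaN for nodata is one possible convention.

```python
import numpy as np

NODATA = 255


def decode_uint8(codes, minimum, maximum):
    """Map codes 0..254 linearly onto [minimum, maximum]; code 255 becomes NaN."""
    codes = np.asarray(codes, dtype=np.uint8)
    valid = codes != NODATA
    decoded = minimum + codes.astype(np.float32) / 254.0 * (maximum - minimum)
    decoded = np.where(valid, decoded, np.nan).astype(np.float32)
    return decoded, valid


# Temperature stretch: code 0 is a real value at the lower bound, not missing.
codes = np.array([0, 127, 254, 255], dtype=np.uint8)
kelvin, valid = decode_uint8(codes, 270.15, 308.15)
print(kelvin, valid)
```

Code 127 sits at the midpoint of the stretch (289.15 K), and only code 255 is flagged invalid.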
Temperature Normalization
Training temperature tensors use the existing project normalization:
normalized = (temperature_kelvin - 289.74267177946783) / 10.933397487585731
This is equivalent to converting decoded Kelvin to Celsius and calling
temperature_normalize(mode="norm", ...), because that helper adds 273.15
internally. A GeoTIFF loader can therefore either normalize directly from
Kelvin with the formula above or convert to Celsius first and reuse the helper.
After normalization:
- Missing `x`, `y`, and `eo` values are filled with `0.0`.
- `x_valid_mask`, `y_valid_mask`, and `x_valid_mask_1d` preserve which values were physically observed or supervised.
- Losses and metrics should use `y_valid_mask` so zero-filled invalid target pixels do not contribute.
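A minimal sketch of normalizing from Kelvin, zero-filling, and computing a masked error is shown below. The constants come from the formula above; `normalize_temperature` and `masked_mse` are hypothetical helpers, and the plain mean-squared error stands in for whatever loss the training code actually uses.

```python
import numpy as np

TEMP_MEAN_K = 289.74267177946783
TEMP_STD_K = 10.933397487585731


def normalize_temperature(kelvin):
    """Normalize Kelvin values, zero-fill non-finite entries, return the mask too."""
    kelvin = np.asarray(kelvin, dtype=np.float32)
    valid = np.isfinite(kelvin)
    normalized = (kelvin - TEMP_MEAN_K) / TEMP_STD_K
    normalized = np.where(valid, normalized, 0.0).astype(np.float32)
    return normalized, valid


def masked_mse(prediction, target, valid_mask):
    """Mean squared error over valid pixels only."""
    mask = valid_mask.astype(np.float32)
    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0
    return float((((prediction - target) ** 2) * mask).sum() / n_valid)


# Third target value is missing; after filling it is indistinguishable from the mean.
y_kelvin = np.array([TEMP_MEAN_K, TEMP_MEAN_K + TEMP_STD_K, np.nan])
y, y_valid_mask = normalize_temperature(y_kelvin)
prediction = np.array([0.0, 0.0, 5.0], dtype=np.float32)
loss = masked_mse(prediction, y, y_valid_mask)
print(y, y_valid_mask, loss)
```

The large error at the missing pixel does not affect `loss` because the mask removes it; an unmasked mean would instead penalize the model against the fill value.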
Salinity Normalization
Training salinity tensors use the project salinity target statistics:
normalized = (salinity_psu - 34.54260282159372) / 1.158266487751096
This is equivalent to calling salinity_normalize(mode="norm", ...); use
salinity_normalize(mode="denorm", ...) to recover physical PSU values.
After normalization, missing x_salinity and y_salinity values are filled
with 0.0, while x_salinity_valid_mask, y_salinity_valid_mask, and
x_salinity_valid_mask_1d preserve the physical support.
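The forward and inverse transforms are simple enough to sketch directly. The two functions below are hypothetical stand-ins for the `norm` and `denorm` modes of `salinity_normalize`, whose real signature is not shown on this page.

```python
SAL_MEAN_PSU = 34.54260282159372
SAL_STD_PSU = 1.158266487751096


def salinity_to_normalized(psu):
    """Forward transform, matching the formula above."""
    return (psu - SAL_MEAN_PSU) / SAL_STD_PSU


def salinity_to_psu(normalized):
    """Inverse transform back to physical PSU."""
    return normalized * SAL_STD_PSU + SAL_MEAN_PSU


round_trip = salinity_to_psu(salinity_to_normalized(35.0))
filled_as_psu = salinity_to_psu(0.0)
print(round_trip, filled_as_psu)
```

Denormalizing a zero-filled pixel yields the mean salinity, about 34.54 PSU, which looks physically plausible. That is why the validity masks, not the values, must be used to identify missing data.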
Temporal Contract
The GLORYS weekly date is the sample date. Dense rasters are expected to exist for every exported date:
- `thetao` is the GLORYS weekly target for that date.
- `analysed_sst` is the centered 7-day OSTIA mean around that date.
- `adt` is the centered 7-day sea-level mean around that date.
- SSS `sos` and `dos` are centered 7-day means around that date.
- ARGO profiles are assigned to the nearest GLORYS weekly date inside the same temporal window.
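The ARGO assignment rule can be illustrated with the standard library. The sketch assumes the temporal window is the same centered 7-day span used for the dense means (three days either side of the weekly date); that width, the example dates, and the name `assign_to_weekly_date` are assumptions for illustration.

```python
from datetime import date


def assign_to_weekly_date(profile_date, weekly_dates, half_window_days=3):
    """Return the nearest weekly date within the window, or None if none qualifies."""
    nearest = min(weekly_dates, key=lambda d: abs((d - profile_date).days))
    if abs((nearest - profile_date).days) <= half_window_days:
        return nearest
    return None


# Hypothetical weekly target dates.
weekly = [date(2018, 1, 3), date(2018, 1, 10), date(2018, 1, 17)]

assigned = assign_to_weekly_date(date(2018, 1, 12), weekly)    # 2 days from Jan 10
unassigned = assign_to_weekly_date(date(2018, 1, 25), weekly)  # 8 days from Jan 17

# The sample `date` key stores the target date as a YYYYMMDD integer.
assigned_key = int(assigned.strftime("%Y%m%d"))
print(assigned_key, unassigned)
```

Profiles that fall outside every window are dropped rather than attached to a distant target.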
The default validation split uses calendar year 2018; all other years are
training rows. When patches overlap, keep a date-based split such as this to
avoid spatial train/validation leakage.
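A date-based split is straightforward when dates are stored as YYYYMMDD integers, as in the `date` key. The following sketch uses a hypothetical helper name and example dates:

```python
def split_by_year(dates_yyyymmdd, validation_year=2018):
    """Partition YYYYMMDD integer dates into (train, validation) by calendar year."""
    train, validation = [], []
    for d in dates_yyyymmdd:
        year = d // 10000  # integer division drops the MMDD digits
        (validation if year == validation_year else train).append(d)
    return train, validation


train_dates, val_dates = split_by_year([20171227, 20180103, 20181226, 20190102])
print(train_dates, val_dates)
```

Because every patch for a given date lands on the same side of the split, overlapping patches never place the same pixels and date in both sets.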