Data Export¶
This page describes the GeoTIFF training dataset exported from the downloaded raw products. See Dataset Downloads for data acquisition and Depth Alignment for the ARGO-to-GLORYS vertical projection.
The export goal is a directory that can be shared or copied as one training
dataset: dense fields are stored as aligned GeoTIFF rasters, and sparse ARGO
profiles are stored in a compact grid-indexed profile store. The aligned ARGO
profile zarr is also packageable as a Hugging Face dataset folder before it is
used by the GeoTIFF export.
Training selects the packaged dataset folder at
/work/data/OceanVariableReconstruction with
src/depth_recon/configs/px_space/training_super_config.yaml, which sets
data.dataset.core.dataset_variant: argo_geotiff_gridded.
Aligned ARGO and Hugging Face Layout¶
b_export_enriched_argo_profiles.py creates the enriched profile-level zarr by
aligning EN4/ARGO profiles with GLORYS depth levels and sampling GLORYS, OSTIA,
sea-level, and SSS context at each profile location. By default, it writes:
/data1/datasets/depth_v2/aligned_argo/enriched_argo_profiles.zarr
c_package_huggingface_aligned_argo.py repackages that same zarr into a
Hugging Face-ready folder without changing the zarr schema:
/data1/datasets/depth_v2/aligned_argo/
  enriched_argo_profiles.zarr
  hf_argo_glors_ostia_ssh/
    README.md
    LICENSE
    data/
      argo_glors_ostia_ssh.zarr/
    indices/
      profiles.parquet
      variables.parquet
    examples/
      open_with_xarray.py
      subset_by_region_time.py
    metadata/
      dataset_description.json
      citation.cff
      stac-item.json
The GeoTIFF exporter can read either the direct zarr path or the packaged zarr path:
/data1/datasets/depth_v2/aligned_argo/hf_argo_glors_ostia_ssh/data/argo_glors_ostia_ssh.zarr
Output Layout¶
GeoTIFF staging export root before Hugging Face packaging:
/work/data/depthdif
Staged output files:
/work/data/depthdif/
  manifest.yaml
  rasters/
    glorys/
      thetao/thetao_YYYYMMDD.tif
      so/so_YYYYMMDD.tif
    ostia/
      analysed_sst/analysed_sst_YYYYMMDD.tif
    sealevel/
      adt/adt_YYYYMMDD.tif
    sss/
      sos/sos_YYYYMMDD.tif
      dos/dos_YYYYMMDD.tif
  argo/
    argo_profiles_on_grid.zarr
The Hugging Face package step copies these assets into /work/data/OceanVariableReconstruction, which is the path used by the dataset and dataloader.
Each YYYYMMDD is a GLORYS weekly target date. Files for the same date share
the same CRS, transform, width, height, pixel centers, and nodata convention.
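As an illustration of the layout above, the raster paths for one target date can be assembled from the directory names and the YYYYMMDD stamp. This is a minimal sketch: `RASTER_LAYOUT` and `raster_paths_for_date` are hypothetical names, not part of the exporter, and the example date is arbitrary.

```python
from datetime import date
from pathlib import Path

# Relative directories copied from the staged output tree (hypothetical constant name).
RASTER_LAYOUT = {
    "thetao": "rasters/glorys/thetao",
    "so": "rasters/glorys/so",
    "analysed_sst": "rasters/ostia/analysed_sst",
    "adt": "rasters/sealevel/adt",
    "sos": "rasters/sss/sos",
    "dos": "rasters/sss/dos",
}


def raster_paths_for_date(root, target_date):
    """Return {variable: Path} for one GLORYS weekly target date (hypothetical helper)."""
    stamp = target_date.strftime("%Y%m%d")
    root = Path(root)
    return {
        var: root / rel / f"{var}_{stamp}.tif"
        for var, rel in RASTER_LAYOUT.items()
    }


# Arbitrary example date; real target dates are listed in manifest.yaml.
paths = raster_paths_for_date("/work/data/depthdif", date(2010, 1, 6))
for var, path in paths.items():
    print(var, path)
```

A loader would normally take the list of dates from manifest.yaml rather than guessing them, since only GLORYS weekly target dates have files.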
The export includes GLORYS so rasters and ARGO argo_psal_* variables so the
same on-disk dataset can support salinity-only and joint temperature/salinity
experiments. The GeoTIFF dataloader reads and returns those salinity fields when
the resolved scenario is salinity or joint. Daily SSS sos and dos
rasters are also exported for auxiliary experiments.
Spatial Grid¶
The land-mask GeoTIFF is the authoritative grid:
src/depth_recon/data/dataset_creation/data_download_raw/get_world/world_land_mask_glorys_0p1.tif
The export uses that file for:
- CRS and affine transform
- raster width and height
- 0.1 degree pixel centers
- patch-grid compatibility with training and inference
GLORYS, OSTIA, sea-level, and SSS data are resampled onto this exact grid before quantization. If the source coordinates already match the requested pixel centers, the exporter uses exact nearest-coordinate selection; otherwise it interpolates onto the target axes.
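The choice between selection and interpolation can be sketched along a single axis. This is a simplified one-dimensional illustration, not the exporter's code: `resample_axis`, the tolerance `atol`, and the use of linear interpolation are all assumptions.

```python
import numpy as np


def resample_axis(values, src_coords, dst_coords, atol=1e-6):
    """Resample a 1-D field onto dst_coords (hypothetical helper).

    Returns (resampled_values, mode), where mode is "select" when every
    target center already exists in the source axis and "interp" otherwise.
    """
    values = np.asarray(values, dtype=float)
    src_coords = np.asarray(src_coords, dtype=float)
    dst_coords = np.asarray(dst_coords, dtype=float)

    # Index of the nearest source coordinate for each target center.
    nearest = np.abs(src_coords[None, :] - dst_coords[:, None]).argmin(axis=1)

    # Assumed matching rule: absolute tolerance only.
    if np.allclose(src_coords[nearest], dst_coords, rtol=0.0, atol=atol):
        return values[nearest], "select"

    # np.interp requires increasing x coordinates.
    order = np.argsort(src_coords)
    return np.interp(dst_coords, src_coords[order], values[order]), "interp"


src = np.array([0.05, 0.15, 0.25])
field = np.array([1.0, 2.0, 3.0])
same_grid, mode_same = resample_axis(field, src, [0.15, 0.25])
shifted, mode_shifted = resample_axis(field, src, [0.10, 0.20])
```

Keeping the exact-selection path avoids smoothing fields that are already on the 0.1 degree pixel centers.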
Dense Raster Products¶
GLORYS Temperature¶
Path:
rasters/glorys/thetao/thetao_YYYYMMDD.tif
Contents:
- Source variable: thetao
- Source units: Celsius in GLORYS source files
- Stored physical units after decoding: Kelvin
- File structure: one multiband GeoTIFF per weekly date
- Bands: one band per GLORYS depth level
- Band metadata: depth_m, stretch information, clipped counts, nodata counts
GLORYS weekly files are exported as-is for their target date. They are not averaged across the week.
GLORYS Salinity¶
Path:
rasters/glorys/so/so_YYYYMMDD.tif
Contents:
- Source variable: so
- Stored physical units after decoding: PSU
- File structure: one multiband GeoTIFF per weekly date
- Bands: one band per GLORYS depth level
- Band metadata: depth_m, stretch information, clipped counts, nodata counts
OSTIA Surface Temperature¶
Path:
rasters/ostia/analysed_sst/analysed_sst_YYYYMMDD.tif
Contents:
- Source variable: analysed_sst
- Stored physical units after decoding: Kelvin
- File structure: one single-band GeoTIFF per GLORYS weekly date
- Temporal aggregation: centered 7-day mean around the GLORYS date by default
OSTIA values are normally Kelvin. The exporter preserves Kelvin values and only
adds 273.15 when the source values look Celsius-like.
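A unit check of this kind could look like the sketch below. The exporter's actual rule for "Celsius-like" is not specified here, so the `celsius_like_max` threshold and the `to_kelvin` name are assumptions made purely for illustration.

```python
import numpy as np

KELVIN_OFFSET = 273.15


def to_kelvin(sst, celsius_like_max=100.0):
    """Return SST in Kelvin (hypothetical helper).

    Adds 273.15 only when all finite values fall below celsius_like_max,
    which is an assumed threshold and not the exporter's documented rule.
    """
    sst = np.asarray(sst, dtype=float)
    finite = sst[np.isfinite(sst)]
    if finite.size and finite.max() < celsius_like_max:
        return sst + KELVIN_OFFSET
    return sst


from_celsius = to_kelvin([10.0, np.nan, 25.0])
from_kelvin = to_kelvin([283.15, 298.15])
```

Ocean surface temperatures in Celsius stay far below 100, while Kelvin values stay far above it, so a single threshold separates the two cases in this sketch.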
Sea Level¶
Path:
rasters/sealevel/adt/adt_YYYYMMDD.tif
Contents:
- Source variable: adt
- Stored physical units after decoding: meters
- File structure: one single-band GeoTIFF per GLORYS weekly date
- Temporal aggregation: centered 7-day mean around the GLORYS date by default
SSS Surface Salinity and Density¶
Path:
rasters/sss/<variable>/<variable>_YYYYMMDD.tif
Contents:
- Source variables: sos, dos
- Stored physical units after decoding: PSU for sos, kg/m3 for dos
- File structure: one single-band GeoTIFF per variable and GLORYS weekly date
- Temporal aggregation: centered 7-day mean around the GLORYS date by default
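The centered 7-day mean used for the OSTIA, sea-level, and SSS rasters can be sketched as follows. The function name `centered_mean`, the dictionary input, and the handling of missing days (skipped rather than treated as an error) are assumptions, not the exporter's documented behavior.

```python
from datetime import date, timedelta

import numpy as np


def centered_mean(daily, target, days=7):
    """Average daily 2-D fields in the window centered on target (hypothetical helper).

    daily maps datetime.date to a 2-D array that uses NaN for nodata.
    Days missing from daily are skipped, which is an assumed gap policy.
    """
    half = days // 2
    window = [target + timedelta(days=k) for k in range(-half, half + 1)]
    stack = [daily[d] for d in window if d in daily]
    if not stack:
        raise KeyError(f"no daily fields within {half} days of {target}")
    # nanmean ignores nodata pixels on individual days.
    return np.nanmean(np.stack(stack), axis=0)


# Synthetic daily fields whose value equals the day offset from the base date.
base = date(2010, 1, 1)
daily_fields = {
    base + timedelta(days=i): np.full((2, 2), float(i)) for i in range(10)
}
weekly = centered_mean(daily_fields, base + timedelta(days=4))
```

With `days=7` the window covers the target date plus three days on either side, matching the `--surface-aggregate-days 7` setting in the export command.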
Variables in the Export Products¶
The enriched ARGO zarr and its Hugging Face package contain the same variables. The package adds Parquet indices for discovery only; those indices do not replace the zarr arrays.
| Product | Variables |
|---|---|
| ARGO profile coordinates and provenance | latitude, longitude, profile_date, profile_juld, profile_idx, profile_source_file, valid_observed_depth_count |
| ARGO profiles on GLORYS depths | argo_temp_on_glorys_depth, argo_potm_on_glorys_depth, argo_psal_on_glorys_depth, matching *_valid_on_glorys_depth masks, matching *_qc_on_glorys_depth codes, argo_depth_qc_on_glorys_depth, whole-profile QC fields |
| GLORYS profile context | glorys_thetao, glorys_so, glorys_uo, glorys_vo, glorys_zos, glorys_mlotst, glorys_bottomT, glorys_sithick, glorys_siconc, glorys_usi, glorys_vsi, glorys_temporal_status |
| OSTIA profile context | ostia_analysed_sst, ostia_analysis_error, ostia_sea_ice_fraction, ostia_mask, ostia_temporal_status |
| Sea-level profile context | sealevel_sla, sealevel_err_sla, sealevel_adt, sealevel_ugosa, sealevel_err_ugosa, sealevel_vgosa, sealevel_err_vgosa, sealevel_ugos, sealevel_vgos, sealevel_flag_ice, sealevel_tpa_correction, sealevel_temporal_status |
| SSS profile context | sss_sos, sss_dos, sss_sea_ice_fraction, sss_temporal_status |
| HF Parquet indices | indices/profiles.parquet stores scalar profile metadata and valid-depth counts; indices/variables.parquet stores variable names, dimensions, dtypes, and descriptions. |
| GeoTIFF dense rasters | thetao, so, analysed_sst, adt, sos, dos |
| GeoTIFF compact ARGO zarr | argo_temp_kelvin_uint8, argo_psal_uint8, argo_temp_valid, argo_psal_valid, profile_date, target_date, latitude, longitude, grid_row, grid_col, optional source identifiers |
The raw SSS error variables sos_error and dos_error are not exported to the
enriched ARGO zarr or the GeoTIFF training store. Daily SSS sos and dos are
included as profile context in the aligned ARGO zarr and as dense surface
rasters in the GeoTIFF export.
ARGO Profile Store¶
Path:
argo/argo_profiles_on_grid.zarr
The exporter reads the enriched ARGO profile input by default:
/data1/datasets/depth_v2/aligned_argo/enriched_argo_profiles.zarr
The packaged Hugging Face zarr can be used instead:
/data1/datasets/depth_v2/aligned_argo/hf_argo_glors_ostia_ssh/data/argo_glors_ostia_ssh.zarr
The saved profile store contains the profile information needed by a GeoTIFF loader without redoing expensive profile preprocessing:
- profile_date
- target_date
- latitude
- longitude
- grid_row
- grid_col
- argo_temp_kelvin_uint8
- argo_psal_uint8
- argo_temp_valid
- argo_psal_valid
- optional source profile identifiers
Profiles are assigned to the nearest GLORYS target date inside the centered weekly window and to the nearest raster grid cell. Temperature is converted to Kelvin before quantization.
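The two assignments can be sketched as below. The grid origin, resolution, and shape used as defaults are assumptions for a global 0.1 degree north-up grid; the real values come from the land-mask GeoTIFF. Both function names are hypothetical.

```python
from datetime import date
import math


def nearest_target_date(profile_date, target_dates, window_days=7):
    """Return the nearest target date inside the centered window, else None."""
    best = min(target_dates, key=lambda t: abs((t - profile_date).days))
    if abs((best - profile_date).days) <= window_days // 2:
        return best
    return None


def grid_cell(lat, lon, west=-180.0, north=90.0, res=0.1, height=1800, width=3600):
    """Return (row, col) of the cell whose center is nearest to (lat, lon).

    Default origin, resolution, and shape are assumed, not read from the mask.
    """
    row = min(max(math.floor((north - lat) / res), 0), height - 1)
    col = min(max(math.floor((lon - west) / res), 0), width - 1)
    return row, col


targets = [date(2010, 1, 6), date(2010, 1, 13)]
inside = nearest_target_date(date(2010, 1, 8), targets)
outside = nearest_target_date(date(2010, 1, 20), targets)
cell = grid_cell(-0.05, 0.05)
```

For a regular north-up grid, the cell that contains a point is also the cell with the nearest center, so a floor division is enough and no distance search is needed.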
Quantization¶
All dense rasters are written as tiled uint8 GeoTIFFs with ZSTD compression
when available, otherwise DEFLATE. BIGTIFF=IF_SAFER is enabled.
Encoding:
0..254 = valid stretched values
255 = nodata
decoded = minimum + code / 254 * (maximum - minimum)
The transform, units, valid range, clipped-low count, clipped-high count,
nodata count, quantization step, and worst-case rounding error are written to
GeoTIFF tags and to manifest.yaml.
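A minimal NumPy sketch of this encoding and its inverse is shown below. The function names are hypothetical, and treating non-finite inputs as nodata and rounding to the nearest code are assumptions consistent with the clipped counts and rounding error described above.

```python
import numpy as np

NODATA = 255    # reserved nodata code
MAX_CODE = 254  # largest valid code


def encode_uint8(values, minimum, maximum):
    """Clip to [minimum, maximum], stretch to 0..254, and mark non-finite as 255."""
    values = np.asarray(values, dtype=float)
    valid = np.isfinite(values)
    clipped = np.clip(values, minimum, maximum)
    scaled = (clipped - minimum) / (maximum - minimum) * MAX_CODE
    codes = np.full(values.shape, NODATA, dtype=np.uint8)
    codes[valid] = np.rint(scaled[valid]).astype(np.uint8)
    return codes


def decode_uint8(codes, minimum, maximum):
    """Apply decoded = minimum + code / 254 * (maximum - minimum); nodata becomes NaN."""
    codes = np.asarray(codes)
    decoded = minimum + codes.astype(float) / MAX_CODE * (maximum - minimum)
    return np.where(codes == NODATA, np.nan, decoded)


# Temperature stretch from the table in this section.
t_min, t_max = 270.15, 308.15
codes = encode_uint8([270.15, 308.15, np.nan, 400.0, 289.15], t_min, t_max)
decoded = decode_uint8(codes, t_min, t_max)
```

Values outside the stretch, such as 400.0 K here, are clipped to the nearest end of the range, which is what the clipped-low and clipped-high counts record.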
| Variable family | Stretch | uint8 step | uint8 max error | int8 nonnegative step | int8 nonnegative max error |
|---|---|---|---|---|---|
| Temperature | [270.15, 308.15] K | 0.1496 K | 0.0748 K | 0.3016 K | 0.1508 K |
| Salinity | [30, 40] PSU | 0.0394 PSU | 0.0197 PSU | 0.0794 PSU | 0.0397 PSU |
| Density | [1000, 1035] kg/m3 | 0.1378 kg/m3 | 0.0689 kg/m3 | 0.2778 kg/m3 | 0.1389 kg/m3 |
| Sea height adt | [-2, 2] m | 0.0157 m | 0.0079 m | 0.0317 m | 0.0159 m |
The int8 comparison assumes a signed-byte encoding that uses only 0..126 as
valid values and 127 as nodata. A signed int8 layout remapped across all 255
non-nodata codes would have the same quantization error as uint8, but uint8
keeps byte values and nodata handling explicit for raster readers.
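The step and error columns follow from the stretch width and the number of valid codes: the step is the width divided by the number of intervals, and the worst-case rounding error is half a step. The sketch below recomputes them; `quant_stats` and the dictionary keys are hypothetical names.

```python
def quant_stats(minimum, maximum, valid_codes):
    """Return (step, max rounding error) for codes 0..valid_codes-1."""
    step = (maximum - minimum) / (valid_codes - 1)
    return step, step / 2


# Stretch ranges copied from the table above.
STRETCHES = {
    "temperature_K": (270.15, 308.15),
    "salinity_PSU": (30.0, 40.0),
    "density_kg_m3": (1000.0, 1035.0),
    "adt_m": (-2.0, 2.0),
}

for name, (low, high) in STRETCHES.items():
    u8_step, u8_err = quant_stats(low, high, 255)  # uint8: codes 0..254
    i8_step, i8_err = quant_stats(low, high, 127)  # int8 nonnegative: codes 0..126
    print(
        f"{name}: uint8 step={u8_step:.4f} err={u8_err:.4f}, "
        f"int8 step={i8_step:.4f} err={i8_err:.4f}"
    )
```

Halving the number of valid codes roughly doubles both the step and the rounding error, which is the difference between the uint8 and int8 nonnegative columns.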
Manifest¶
manifest.yaml records the export configuration and output metadata:
- creation script and UTC timestamp
- requested date range
- land-mask grid source, CRS, transform, shape, and resolution
- target weekly dates
- GLORYS depth axis
- per-variable stretch metadata
- per-file paths, source files, compression, and encode statistics
- ARGO profile count and ARGO stretch metadata
The manifest is the stable entry point for downstream loaders: it tells the loader which dates exist, how to decode each byte raster, which SSS rasters are present, and which grid/depth axes the exported arrays use.
Command¶
Run from the repository root:
/work/envs/depth/bin/python -m depth_recon.data.dataset_creation.export_dataset_geotiff.export_dataset_geotiff \
--glorys-dir /data1/datasets/depth_v2/glorys_weekly \
--ostia-dir /data1/datasets/depth_v2/ostia \
--sealevel-dir /data1/datasets/depth_v2/sealevel_daily \
--sss-dir /data1/datasets/depth_v2/sss_daily \
--land-mask-path src/depth_recon/data/dataset_creation/data_download_raw/get_world/world_land_mask_glorys_0p1.tif \
--output-dir /work/data/depthdif \
--start-date 20100101 \
--end-date 20240731 \
--surface-aggregate-days 7 \
--workers 4 \
--rasters-only \
--overwrite
--workers controls process-level parallelism for dense raster dates. Lower it
if source-disk contention or memory pressure becomes the bottleneck. Use
--rasters-only when b_export_enriched_argo_profiles.py has already written
argo/argo_profiles_on_grid.zarr. Use --skip-existing instead of
--overwrite to resume a partial export: existing modality/date rasters are
validated and reused, and missing rasters are written.
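The resume behavior can be pictured as splitting the expected raster paths into files to reuse and files to write. This is only a sketch: `plan_resume` is a hypothetical helper, and its default check (a non-empty file) stands in for the exporter's real validation, which is not described on this page.

```python
from pathlib import Path


def _non_empty_file(path):
    """Placeholder validity check; the exporter's actual validation may differ."""
    return path.is_file() and path.stat().st_size > 0


def plan_resume(expected, is_valid=_non_empty_file):
    """Split expected raster paths into (reuse, write) lists (hypothetical helper)."""
    reuse, write = [], []
    for path in map(Path, expected):
        (reuse if is_valid(path) else write).append(path)
    return reuse, write


# Example with a stand-in validity rule so that no files are needed.
expected_paths = [
    "rasters/glorys/thetao/thetao_20100106.tif",
    "rasters/glorys/so/so_20100106.tif",
]
reuse, write = plan_resume(
    expected_paths, is_valid=lambda p: p.name.startswith("thetao")
)
```

Only the paths in `write` would need to be exported again, which is why resuming with `--skip-existing` is cheaper than rerunning with `--overwrite`.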