=============================
Missing Data Handling in Iris
=============================

This document provides a brief overview of how Iris handles missing data values
when datasets are loaded as cubes, and when cubes are saved or modified.

A missing data value, or fill-value, defines the value used within a dataset to
indicate that data point is missing or not set.
This value is included as part of a dataset's metadata.

For example, in a gridded global ocean dataset, no data values will be recorded
over land, so land points will be missing data.
In such a case, land points could be indicated by being set to the dataset's
missing data value.


Loading
-------

On load, any fill-value or missing data value defined in the loaded dataset
should be used as the ``fill_value`` of the NumPy masked array data attribute of the
:class:`~iris.cube.Cube`. This will only appear when the cube's data is realised.

.. _missing_data_saving:

Saving
------

On save, the fill-value of a cube's masked data array is **not** used in saving data.
Instead, Iris always uses the default fill-value for the fileformat, *except*
when a fill-value is specified by the user via a fileformat-specific saver.

For example::

    >>> iris.save(my_cube, 'my_file.nc', fill_value=-99999)

.. note::
    Not all savers accept the ``fill_value`` keyword argument.

Iris will check for and issue warnings of fill-value 'collisions' (exception:
**NetCDF**, see the heading below).
This basically means that whenever there are unmasked values that would read back
as masked, we issue a warning and suggest a workaround.

This will occur in the following cases:

* where masked data contains *unmasked* points matching the fill-value, or
* where unmasked data contains the fill-value (either the format-specific default fill-value,
  or a fill-value specified by the user in the save call).


NetCDF
~~~~~~

:term:`NetCDF Format`

NetCDF is a special case, because all ordinary variable data is "potentially masked",
owing to the use of default fill values. The default fill-value used depends on the type
of the variable data.

The exceptions to this are:

* One-byte values are not masked unless the variable has an explicit ``_FillValue`` attribute.
  That is, there is no default fill-value for ``byte`` types in NetCDF.
* Data may be tagged with a ``_NoFill`` attribute. This is not currently officially
  documented or widely implemented.
* Small integers create problems by *not* having the exemption applied to byte data.
  Thus, in principle, ``int32`` data cannot use the full range of 2**16 valid values.

Warnings are not issued for NetCDF fill value collisions. Increasingly large
and complex parallel I/O operations unfortunately made this feature
un-maintainable and it was retired in Iris 3.9 (:pull:`5833`).

If you need to know about collisions then you can perform your own checks ahead
of saving. Such operations can be run lazily (:term:`Lazy Data`). Here is an
example::

    >>> default_fill = netCDF4.default_fillvals[my_cube.dtype.str[1:]]
    >>> fill_present = (my_cube.lazy_data() == default_fill).any().compute()

Merging
-------

Merged data should have a fill-value equal to that of the components, if they
all have the same fill-value. If the components have differing fill-values, a
default fill-value will be used instead.


Other Operations
----------------

Other operations, such as :class:`~iris.cube.Cube` arithmetic operations,
generally produce output with a default (NumPy) fill-value. That is, these operations
ignore the fill-values of the input(s) to the operation.