This is the latest unreleased documentation (3.10.0.dev14).

iris.pandas

Provide conversion to and from Pandas data structures.

See also: https://pandas.pydata.org/

iris.pandas.as_cube(pandas_array, copy=True, calendars=None)

Convert a Pandas Series/DataFrame into a 1D/2D Iris Cube.

Parameters:
  • pandas_array (pandas.Series or pandas.DataFrame) – The Pandas object to convert.

  • copy (bool, default=True) – Whether to copy pandas_array, or to create array views where possible. Provided in case of memory limit concerns.

  • calendars (dict, optional) – A dict mapping a dimension to a calendar. Required to convert datetime indices/columns.

Notes

This function will copy your data by default.

Examples

as_cube(series, calendars={0: cf_units.CALENDAR_360_DAY})
as_cube(data_frame, calendars={1: cf_units.CALENDAR_STANDARD})

Since this function converts to/from a Pandas object, laziness will not be preserved.

Deprecated since version 3.3.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_cubes(), which offers richer dimensional intelligence.

iris.pandas.as_cubes(pandas_structure, copy=True, calendars=None, aux_coord_cols=None, cell_measure_cols=None, ancillary_variable_cols=None)

Convert a Pandas Series/DataFrame into n-dimensional Iris Cubes, including dimensional metadata.

The index of pandas_structure will be used for generating the Cube dimension(s) and DimCoord objects. Other dimensional metadata may span multiple dimensions, depending on how the column values vary with the index values.

Parameters:
  • pandas_structure (pandas.Series or pandas.DataFrame) – The Pandas object to convert.

  • copy (bool, default=True) – Whether the Cube data is a copy of the pandas_structure column, or a view of the same array. Arrays other than the data (coords etc.) are always copies. This option is provided to help with memory size concerns.

  • calendars (dict, optional) – Calendar conversions for individual date-time coordinate columns/index-levels e.g. {"my_column": cf_units.CALENDAR_360_DAY}.

  • aux_coord_cols (list of str, optional) – Names of columns to be converted into AuxCoord objects.

  • cell_measure_cols (list of str, optional) – Names of columns to be converted into CellMeasure objects.

  • ancillary_variable_cols (list of str, optional) – Names of columns to be converted into AncillaryVariable objects.

Returns:

One Cube for each column not referenced in aux_coord_cols/cell_measure_cols/ancillary_variable_cols.

Return type:

CubeList
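
As a rough illustration of the return rule above, the sketch below uses plain pandas to work out which columns would be expected to yield a Cube. It does not call Iris; the variable names `metadata_cols` and `data_cols` are illustrative only.

```python
import pandas as pd

my_df = pd.DataFrame({
    "air_temperature": [300, 301, 302, 303],
    "air_pressure": [1000, 1001, 1002, 1003],
    "longitude": [0, 10, 0, 10],
    "latitude": [25, 25, 35, 35],
    "in_region": [True, False, False, False],
}).set_index(["longitude", "latitude"])

# Columns named in any of the *_cols arguments describe metadata, not data.
aux_coord_cols = ["in_region"]
cell_measure_cols = []
ancillary_variable_cols = []
metadata_cols = set(aux_coord_cols + cell_measure_cols + ancillary_variable_cols)

# Every remaining column is expected to yield one Cube in the CubeList.
# (Illustrative names; this is not how Iris computes it internally.)
data_cols = [col for col in my_df.columns if col not in metadata_cols]
print(data_cols)  # ['air_temperature', 'air_pressure']
```

Index levels never appear in this list because they become dimension coordinates rather than data.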

Notes

A DataFrame using columns as a second data dimension will need to be ‘melted’ before conversion. See the Examples for how.

dask.dataframe.DataFrame objects are not supported.

Since this function converts to/from a Pandas object, laziness will not be preserved.
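
The difference between a copy and a view, which the copy parameter controls for the Cube data, can be sketched with NumPy and pandas alone. This is only an analogy for the memory trade-off and assumes a NumPy-backed float Series; it does not show what Iris does internally.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(3.0), name="air_temperature")

# A view reuses the Series' underlying buffer; a copy allocates new memory.
as_view = series.to_numpy(copy=False)
as_copy = series.to_numpy(copy=True)

# Compare each array against the Series' own buffer.
shares_view = np.shares_memory(as_view, series.to_numpy(copy=False))
shares_copy = np.shares_memory(as_copy, series.to_numpy(copy=False))
print(shares_view, shares_copy)  # True False
```

A view saves memory for large arrays, at the cost that the two objects are no longer independent of one another.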

Examples

>>> from iris.pandas import as_cubes
>>> import numpy as np
>>> from pandas import DataFrame, Series

Converting a simple Series:

>>> my_series = Series([300, 301, 302], name="air_temperature")
>>> converted_cubes = as_cubes(my_series)
>>> print(converted_cubes)
0: air_temperature / (unknown)         (unknown: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (unknown: 3)
    Dimension coordinates:
        unknown                             x

A DataFrame, with a custom index becoming the DimCoord:

>>> my_df = DataFrame({
...     "air_temperature": [300, 301, 302],
...     "longitude": [30, 40, 50]
...     })
>>> my_df = my_df.set_index("longitude")
>>> converted_cubes = as_cubes(my_df)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (longitude: 3)
    Dimension coordinates:
        longitude                             x

A DataFrame representing two 3-dimensional datasets, including a 2-dimensional AuxCoord:

>>> my_df = DataFrame({
...     "air_temperature": np.arange(300, 312, 1),
...     "air_pressure": np.arange(1000, 1012, 1),
...     "longitude": [0, 10] * 6,
...     "latitude": [25, 25, 35, 35] * 3,
...     "height": ([0] * 4) + ([100] * 4) + ([200] * 4),
...     "in_region": [True, False, False, False] * 3
... })
>>> print(my_df)
    air_temperature  air_pressure  longitude  latitude  height  in_region
0               300          1000          0        25       0       True
1               301          1001         10        25       0      False
2               302          1002          0        35       0      False
3               303          1003         10        35       0      False
4               304          1004          0        25     100       True
5               305          1005         10        25     100      False
6               306          1006          0        35     100      False
7               307          1007         10        35     100      False
8               308          1008          0        25     200       True
9               309          1009         10        25     200      False
10              310          1010          0        35     200      False
11              311          1011         10        35     200      False
>>> my_df = my_df.set_index(["longitude", "latitude", "height"])
>>> my_df = my_df.sort_index()
>>> converted_cubes = as_cubes(my_df, aux_coord_cols=["in_region"])
>>> print(converted_cubes)
0: air_temperature / (unknown)         (longitude: 2; latitude: 2; height: 3)
1: air_pressure / (unknown)            (longitude: 2; latitude: 2; height: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)         (longitude: 2; latitude: 2; height: 3)
    Dimension coordinates:
        longitude                             x            -          -
        latitude                              -            x          -
        height                                -            -          x
    Auxiliary coordinates:
        in_region                             x            x          -
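
The in_region coordinate above spans longitude and latitude but not height, because its values never change along height. The following sketch reproduces that reasoning with plain pandas. The helper `varying_levels` is hypothetical and is not Iris's own algorithm; it only conveys the idea that a column which is constant along an index level need not span that dimension.

```python
import pandas as pd


def varying_levels(df, column):
    """Return the index level names that `column` varies along (hypothetical helper)."""
    levels = list(df.index.names)
    result = []
    for level in levels:
        others = [name for name in levels if name != level]
        if others:
            # Hold the other levels fixed; more than one distinct value in
            # any group means the column varies along `level`.
            counts = df.groupby(level=others)[column].nunique()
            varies = bool((counts > 1).any())
        else:
            varies = df[column].nunique() > 1
        if varies:
            result.append(level)
    return result


df = pd.DataFrame({
    "longitude": [0, 10] * 6,
    "latitude": [25, 25, 35, 35] * 3,
    "height": ([0] * 4) + ([100] * 4) + ([200] * 4),
    "in_region": [True, False, False, False] * 3,
}).set_index(["longitude", "latitude", "height"])

in_region_levels = varying_levels(df, "in_region")
print(in_region_levels)  # ['longitude', 'latitude']
```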

Pandas uses NaN rather than masking data. The converted Cube can be masked in downstream user code:

>>> my_series = Series([300, np.nan, 302], name="air_temperature")
>>> converted_cube = as_cubes(my_series)[0]
>>> print(converted_cube.data)
[300.  nan 302.]
>>> converted_cube.data = np.ma.masked_invalid(converted_cube.data)
>>> print(converted_cube.data)
[300.0 -- 302.0]

If the DataFrame uses columns as a second dimension, pandas.melt() should be used to convert the data to the expected n-dimensional format:

>>> my_df = DataFrame({
...     "latitude": [35, 25],
...     0: [300, 301],
...     10: [302, 303],
... })
>>> print(my_df)
   latitude    0   10
0        35  300  302
1        25  301  303
>>> my_df = my_df.melt(
...     id_vars=["latitude"],
...     value_vars=[0, 10],
...     var_name="longitude",
...     value_name="air_temperature"
... )
>>> my_df["longitude"] = my_df["longitude"].infer_objects()
>>> print(my_df)
   latitude  longitude  air_temperature
0        35          0              300
1        25          0              301
2        35         10              302
3        25         10              303
>>> my_df = my_df.set_index(["latitude", "longitude"])
>>> my_df = my_df.sort_index()
>>> converted_cube = as_cubes(my_df)[0]
>>> print(converted_cube)
air_temperature / (unknown)         (latitude: 2; longitude: 2)
    Dimension coordinates:
        latitude                             x             -
        longitude                            -             x

iris.pandas.as_data_frame(cube, copy=True, add_aux_coords=False, add_cell_measures=False, add_ancillary_variables=False)

Convert a Cube to a pandas.DataFrame.

dim_coords and data are flattened into a long-style DataFrame. Other aux_coords, cell_measures, ancillary_variables and attributes may optionally be added as additional DataFrame columns.

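Parameters:
  • cube (Cube) – The Cube to convert to a pandas.DataFrame.

  • copy (bool, default=True) – Whether the DataFrame is a copy of the Cube data. This option is provided to help with memory size concerns.

  • add_aux_coords (bool, default=False) – If True, add all aux_coords (including scalar coordinates) to the returned DataFrame.

  • add_cell_measures (bool, default=False) – If True, add cell_measures to the returned DataFrame.

  • add_ancillary_variables (bool, default=False) – If True, add ancillary_variables to the returned DataFrame.

Returns: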

A DataFrame with Cube dimensions forming a MultiIndex.

Return type:

DataFrame

Warning

  1. This documentation is for the new as_data_frame() behaviour, which is currently opt-in to preserve backwards compatibility. The default legacy behaviour is documented in pre-v3.4 documentation (summary: limited to 2-dimensional Cube, with only the data and dim_coords being added). The legacy behaviour will be removed in a future version of Iris, so please opt-in to the new behaviour at your earliest convenience, via iris.Future:

    >>> iris.FUTURE.pandas_ndim = True
    

    Breaking change: to enable the improvements, the new opt-in behaviour flattens multi-dimensional data into a single DataFrame column (the legacy behaviour preserves 2 dimensions via rows and columns).

  2. Where the Cube contains masked values, these become numpy.nan in the returned DataFrame.
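
The masked-to-NaN behaviour described in the warning can be sketched with NumPy and pandas alone. This is an approximation of the effect, not Iris's implementation; the names `filled` and `remasked` are illustrative.

```python
import numpy as np
import pandas as pd

# A masked array standing in for Cube data with one missing value.
masked = np.ma.masked_array([300.0, 301.0, 302.0], mask=[False, True, False])

# Fill masked points with NaN, which is how pandas represents missing data.
filled = masked.filled(np.nan)
series = pd.Series(filled, name="air_temperature")
print(series)

# Going the other way: rebuild a mask from the NaN values.
remasked = np.ma.masked_invalid(series.to_numpy())
print(remasked.mask)
```

Note that NaN is a floating-point value, so this round trip is shown here for float data only.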

Notes

dask.dataframe.DataFrame objects are not supported.

A MultiIndex DataFrame is returned by default. Use reset_index() to return a DataFrame without MultiIndex levels. Pass inplace=True to modify the existing DataFrame object rather than creating a new one.

Cube data dtype is preserved.

Since this function converts to/from a Pandas object, laziness will not be preserved.
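
The difference between the legacy two-dimensional layout and the long-style layout can be sketched with plain pandas. The values and names below are made up for illustration, and the code does not call Iris.

```python
import numpy as np
import pandas as pd

# Legacy-style layout: one dimension as rows, the other as columns.
wide = pd.DataFrame(
    np.array([[300, 302], [301, 303]]),
    index=pd.Index([35, 25], name="latitude"),
    columns=pd.Index([0, 10], name="longitude"),
)

# Long-style layout: every dimension becomes a MultiIndex level and the
# data collapses into a single named column.
long = wide.unstack().rename("air_temperature").to_frame()
long = long.reorder_levels(["latitude", "longitude"]).sort_index()
print(long)
```

The long form scales to any number of dimensions, since each extra dimension is one more index level rather than a new axis.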

Examples

>>> import iris
>>> from iris.pandas import as_data_frame
>>> import pandas as pd
>>> pd.set_option('display.width', 1000)
>>> pd.set_option('display.max_columns', 1000)

Convert a simple Cube:

>>> path = iris.sample_data_path('ostia_monthly.nc')
>>> cube = iris.load_cube(path)
>>> df = as_data_frame(cube)
>>> print(df)
                                          surface_temperature
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271
                              0.833333             301.785004
                              1.666667             301.820984
                              2.500000             301.865234
                              3.333333             301.926819
...                                                       ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938
                              356.666656           298.913147
                              357.500000                  NaN
                              358.333313                  NaN
                              359.166656           298.995148

[419904 rows x 1 columns]

Using add_aux_coords=True maps AuxCoord and scalar coordinate information to the DataFrame:

>>> df = as_data_frame(cube, add_aux_coords=True)
>>> print(df)
                                          surface_temperature  forecast_period forecast_reference_time
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0     2006-04-16 12:00:00
                              0.833333             301.785004                0     2006-04-16 12:00:00
                              1.666667             301.820984                0     2006-04-16 12:00:00
                              2.500000             301.865234                0     2006-04-16 12:00:00
                              3.333333             301.926819                0     2006-04-16 12:00:00
...                                                       ...              ...                     ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938                0     2010-09-16 12:00:00
                              356.666656           298.913147                0     2010-09-16 12:00:00
                              357.500000                  NaN                0     2010-09-16 12:00:00
                              358.333313                  NaN                0     2010-09-16 12:00:00
                              359.166656           298.995148                0     2010-09-16 12:00:00

[419904 rows x 3 columns]

To add netCDF global attribute information to the DataFrame, add a column directly to the DataFrame:

>>> df['STASH'] = str(cube.attributes['STASH'])
>>> print(df)
                                          surface_temperature  forecast_period forecast_reference_time       STASH
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000             301.659271                0     2006-04-16 12:00:00  m01s00i024
                              0.833333             301.785004                0     2006-04-16 12:00:00  m01s00i024
                              1.666667             301.820984                0     2006-04-16 12:00:00  m01s00i024
                              2.500000             301.865234                0     2006-04-16 12:00:00  m01s00i024
                              3.333333             301.926819                0     2006-04-16 12:00:00  m01s00i024
...                                                       ...              ...                     ...         ...
2010-09-16 00:00:00  4.444450 355.833313           298.779938                0     2010-09-16 12:00:00  m01s00i024
                              356.666656           298.913147                0     2010-09-16 12:00:00  m01s00i024
                              357.500000                  NaN                0     2010-09-16 12:00:00  m01s00i024
                              358.333313                  NaN                0     2010-09-16 12:00:00  m01s00i024
                              359.166656           298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 4 columns]

To return a DataFrame without a MultiIndex, use reset_index(). Optionally pass the inplace=True keyword to modify the DataFrame rather than creating a new one:

>>> df.reset_index(inplace=True)
>>> print(df)
                       time  latitude   longitude  surface_temperature  forecast_period forecast_reference_time       STASH
0       2006-04-16 00:00:00 -4.999992    0.000000           301.659271                0     2006-04-16 12:00:00  m01s00i024
1       2006-04-16 00:00:00 -4.999992    0.833333           301.785004                0     2006-04-16 12:00:00  m01s00i024
2       2006-04-16 00:00:00 -4.999992    1.666667           301.820984                0     2006-04-16 12:00:00  m01s00i024
3       2006-04-16 00:00:00 -4.999992    2.500000           301.865234                0     2006-04-16 12:00:00  m01s00i024
4       2006-04-16 00:00:00 -4.999992    3.333333           301.926819                0     2006-04-16 12:00:00  m01s00i024
                     ...       ...         ...                  ...              ...                     ...         ...
419899  2010-09-16 00:00:00  4.444450  355.833313           298.779938                0     2010-09-16 12:00:00  m01s00i024
419900  2010-09-16 00:00:00  4.444450  356.666656           298.913147                0     2010-09-16 12:00:00  m01s00i024
419901  2010-09-16 00:00:00  4.444450  357.500000                  NaN                0     2010-09-16 12:00:00  m01s00i024
419902  2010-09-16 00:00:00  4.444450  358.333313                  NaN                0     2010-09-16 12:00:00  m01s00i024
419903  2010-09-16 00:00:00  4.444450  359.166656           298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 7 columns]

To retrieve a Series from the DataFrame df, select a column:

>>> df['surface_temperature']
0         301.659271
1         301.785004
2         301.820984
3         301.865234
4         301.926819
            ...
419899    298.779938
419900    298.913147
419901           NaN
419902           NaN
419903    298.995148
Name: surface_temperature, Length: 419904, dtype: float32

iris.pandas.as_series(cube, copy=True)

Convert a 1D cube to a Pandas Series.

Parameters:
  • cube (Cube) – The cube to convert to a Pandas Series.

  • copy (bool, default=True) – Whether to make a copy of the data. Must be True for masked data.

Notes

This function will copy your data by default. If you have a large array that cannot be copied, make sure it is not masked and use copy=False.

Since this function converts to/from a Pandas object, laziness will not be preserved.

Deprecated since version 3.4.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_data_frame(), which offers improved multi-dimensional handling.