iris.pandas#
Provide conversion to and from Pandas data structures.
See also: http://pandas.pydata.org/
- iris.pandas.as_cube(pandas_array, copy=True, calendars=None)[source]#
Convert a Pandas Series/DataFrame into a 1D/2D Iris Cube.
Deprecated since version 3.3.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_cubes(), which offers richer dimensional intelligence.
- Parameters:
  pandas_array (pandas.Series or pandas.DataFrame) – The Pandas object to convert.
  copy (bool, default=True) – Whether to copy pandas_array, or to create array views where possible. Provided in case of memory limit concerns.
  calendars (dict, optional) – A dict mapping a dimension to a calendar. Required to convert datetime indices/columns.
Notes
This function will copy your data by default.
Example usage:
as_cube(series, calendars={0: cf_units.CALENDAR_360_DAY})
as_cube(data_frame, calendars={1: cf_units.CALENDAR_STANDARD})
Since this function converts to/from a Pandas object, laziness will not be preserved.
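For a DataFrame input, as_cube maps the index to the first cube dimension and the columns to the second. A minimal pandas-only sketch of such an input (the as_cube call itself is omitted here, since the function is deprecated in favour of as_cubes):

```python
import numpy as np
from pandas import DataFrame

# A 2-D DataFrame: the index would become the first cube dimension,
# the columns the second, if passed to iris.pandas.as_cube.
df = DataFrame(
    np.arange(6.0).reshape(2, 3),
    index=[10, 20],       # e.g. latitude values
    columns=[0, 30, 60],  # e.g. longitude values
)
print(df.shape)  # (2, 3) -> a 2-D cube of shape (2, 3)
```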
- iris.pandas.as_cubes(pandas_structure, copy=True, calendars=None, aux_coord_cols=None, cell_measure_cols=None, ancillary_variable_cols=None)[source]#
Convert a Pandas Series/DataFrame into n-dimensional Iris Cubes, including dimensional metadata.
The index of pandas_structure will be used for generating the Cube dimension(s) and DimCoords. Other dimensional metadata may span multiple dimensions - based on how the column values vary with the index values.
- Parameters:
  pandas_structure (pandas.Series or pandas.DataFrame) – The Pandas object to convert.
  copy (bool, default=True) – Whether the Cube data is a copy of the pandas_structure column, or a view of the same array. Arrays other than the data (coords etc.) are always copies. This option is provided to help with memory size concerns.
  calendars (dict, optional) – Calendar conversions for individual date-time coordinate columns/index-levels, e.g. {"my_column": cf_units.CALENDAR_360_DAY}.
  aux_coord_cols (list of str, optional) – Names of columns to be converted into AuxCoord objects.
  cell_measure_cols (list of str, optional) – Names of columns to be converted into CellMeasure objects.
  ancillary_variable_cols (list of str, optional) – Names of columns to be converted into AncillaryVariable objects.
- Returns:
  One Cube for each column not referenced in aux_coord_cols / cell_measure_cols / ancillary_variable_cols.
- Return type:
  CubeList
Notes
A DataFrame using columns as a second data dimension will need to be 'melted' before conversion. See the Examples for how.
dask.dataframe.DataFrames are not supported.
Since this function converts to/from a Pandas object, laziness will not be preserved.
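The "how the column values vary with the index values" rule can be probed with pandas alone before calling as_cubes. A minimal sketch, assuming a hypothetical surface_type column that takes a single value per latitude and so would span only that dimension:

```python
from pandas import DataFrame

# Long-format table with a two-level index; "surface_type" varies only
# with latitude, so it needs only that one dimension.
df = DataFrame({
    "latitude": [25, 25, 35, 35],
    "longitude": [0, 10, 0, 10],
    "air_temperature": [300.0, 301.0, 302.0, 303.0],
    "surface_type": ["sea", "sea", "land", "land"],
}).set_index(["latitude", "longitude"])

# Group by latitude: if every group holds exactly one unique value,
# the column does not depend on longitude.
per_lat = df.groupby(level="latitude")["surface_type"].nunique()
print(per_lat.max())  # 1 -> depends on latitude only
```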
Examples
>>> from iris.pandas import as_cubes
>>> import numpy as np
>>> from pandas import DataFrame, Series
Converting a simple Series:

>>> my_series = Series([300, 301, 302], name="air_temperature")
>>> converted_cubes = as_cubes(my_series)
>>> print(converted_cubes)
0: air_temperature / (unknown)                 (unknown: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)                 (unknown: 3)
    Dimension coordinates:
        unknown                             x
A DataFrame, with a custom index becoming the DimCoord:

>>> my_df = DataFrame({
...     "air_temperature": [300, 301, 302],
...     "longitude": [30, 40, 50]
... })
>>> my_df = my_df.set_index("longitude")
>>> converted_cubes = as_cubes(my_df)
>>> print(converted_cubes[0])
air_temperature / (unknown)                 (longitude: 3)
    Dimension coordinates:
        longitude                           x
A DataFrame representing two 3-dimensional datasets, including a 2-dimensional AuxCoord:

>>> my_df = DataFrame({
...     "air_temperature": np.arange(300, 312, 1),
...     "air_pressure": np.arange(1000, 1012, 1),
...     "longitude": [0, 10] * 6,
...     "latitude": [25, 25, 35, 35] * 3,
...     "height": ([0] * 4) + ([100] * 4) + ([200] * 4),
...     "in_region": [True, False, False, False] * 3
... })
>>> print(my_df)
    air_temperature  air_pressure  longitude  latitude  height  in_region
0               300          1000          0        25       0       True
1               301          1001         10        25       0      False
2               302          1002          0        35       0      False
3               303          1003         10        35       0      False
4               304          1004          0        25     100       True
5               305          1005         10        25     100      False
6               306          1006          0        35     100      False
7               307          1007         10        35     100      False
8               308          1008          0        25     200       True
9               309          1009         10        25     200      False
10              310          1010          0        35     200      False
11              311          1011         10        35     200      False
>>> my_df = my_df.set_index(["longitude", "latitude", "height"])
>>> my_df = my_df.sort_index()
>>> converted_cubes = as_cubes(my_df, aux_coord_cols=["in_region"])
>>> print(converted_cubes)
0: air_temperature / (unknown)                 (longitude: 2; latitude: 2; height: 3)
1: air_pressure / (unknown)                    (longitude: 2; latitude: 2; height: 3)
>>> print(converted_cubes[0])
air_temperature / (unknown)                 (longitude: 2; latitude: 2; height: 3)
    Dimension coordinates:
        longitude                           x            -          -
        latitude                            -            x          -
        height                              -            -          x
    Auxiliary coordinates:
        in_region                           x            x          -
Pandas uses NaN rather than masking data. Converted Cubes can be masked in downstream user code:

>>> my_series = Series([300, np.NaN, 302], name="air_temperature")
>>> converted_cube = as_cubes(my_series)[0]
>>> print(converted_cube.data)
[300.  nan 302.]
>>> converted_cube.data = np.ma.masked_invalid(converted_cube.data)
>>> print(converted_cube.data)
[300.0 -- 302.0]
If the DataFrame uses columns as a second dimension, pandas.melt() should be used to convert the data to the expected n-dimensional format:

>>> my_df = DataFrame({
...     "latitude": [35, 25],
...     0: [300, 301],
...     10: [302, 303],
... })
>>> print(my_df)
   latitude    0   10
0        35  300  302
1        25  301  303
>>> my_df = my_df.melt(
...     id_vars=["latitude"],
...     value_vars=[0, 10],
...     var_name="longitude",
...     value_name="air_temperature"
... )
>>> my_df["longitude"] = my_df["longitude"].infer_objects()
>>> print(my_df)
   latitude  longitude  air_temperature
0        35          0              300
1        25          0              301
2        35         10              302
3        25         10              303
>>> my_df = my_df.set_index(["latitude", "longitude"])
>>> my_df = my_df.sort_index()
>>> converted_cube = as_cubes(my_df)[0]
>>> print(converted_cube)
air_temperature / (unknown)                 (latitude: 2; longitude: 2)
    Dimension coordinates:
        latitude                            x             -
        longitude                           -             x
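The melt reshape above can be exercised without Iris at all; a runnable pandas-only version of the same steps:

```python
from pandas import DataFrame

# Wide table: longitude values appear as column labels.
wide = DataFrame({"latitude": [35, 25], 0: [300, 301], 10: [302, 303]})

# Melt to the long, one-value-per-row layout that as_cubes expects.
long = wide.melt(
    id_vars=["latitude"],
    value_vars=[0, 10],
    var_name="longitude",
    value_name="air_temperature",
)
long["longitude"] = long["longitude"].infer_objects()
long = long.set_index(["latitude", "longitude"]).sort_index()
print(long.shape)                # (4, 1)
print(list(long.index.names))    # ['latitude', 'longitude']
```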
- iris.pandas.as_data_frame(cube, copy=True, add_aux_coords=False, add_cell_measures=False, add_ancillary_variables=False)[source]#
Convert a Cube to a pandas.DataFrame.
dim_coords and data are flattened into a long-style DataFrame. Other aux_coords, cell_measures, ancillary_variables and attributes may be optionally added as additional DataFrame columns.
- Parameters:
  cube (Cube) – The Cube to be converted to a pandas.DataFrame.
  copy (bool, default=True) – Whether the pandas.DataFrame is a copy of the Cube data. This option is provided to help with memory size concerns.
  add_aux_coords (bool, default=False) – If True, add all aux_coords (including scalar coordinates) to the returned pandas.DataFrame.
  add_cell_measures (bool, default=False) – If True, add cell_measures to the returned pandas.DataFrame.
  add_ancillary_variables (bool, default=False) – If True, add ancillary_variables to the returned pandas.DataFrame.
- Returns:
  A DataFrame with Cube dimensions forming a MultiIndex.
- Return type:
  pandas.DataFrame
Warning
This documentation is for the new as_data_frame() behaviour, which is currently opt-in to preserve backwards compatibility. The default legacy behaviour is documented in pre-v3.4 documentation (summary: limited to 2-dimensional Cubes, with only the data and dim_coords being added). The legacy behaviour will be removed in a future version of Iris, so please opt-in to the new behaviour at your earliest convenience, via iris.Future:

>>> iris.FUTURE.pandas_ndim = True

Breaking change: to enable the improvements, the new opt-in behaviour flattens multi-dimensional data into a single DataFrame column (the legacy behaviour preserves 2 dimensions via rows and columns).
Where the Cube contains masked values, these become numpy.nan in the returned DataFrame.
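The masked-to-NaN behaviour described in the warning can be sketched with NumPy alone (note: this mirrors, rather than calls, the as_data_frame conversion):

```python
import numpy as np

# A masked array, as found in Cube data ...
masked = np.ma.masked_array([300.0, 301.0, 302.0], mask=[False, True, False])

# ... has its masked points materialised as NaN when placed in a
# DataFrame column.
as_floats = masked.filled(np.nan)
print(np.isnan(as_floats[1]))  # True
```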
Notes
dask.dataframe.DataFrames are not supported.
A MultiIndex DataFrame is returned by default. Use reset_index() to return a DataFrame without MultiIndex levels. Use inplace=True to preserve memory object reference.
Cube data dtype is preserved.
Examples
>>> import iris
>>> from iris.pandas import as_data_frame
>>> import pandas as pd
>>> pd.set_option('display.width', 1000)
>>> pd.set_option('display.max_columns', 1000)
Convert a simple Cube:

>>> path = iris.sample_data_path('ostia_monthly.nc')
>>> cube = iris.load_cube(path)
>>> df = as_data_frame(cube)
>>> print(df)
...
                                            surface_temperature
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000               301.659271
                              0.833333               301.785004
                              1.666667               301.820984
                              2.500000               301.865234
                              3.333333               301.926819
...                                                         ...
2010-09-16 00:00:00 4.444450  355.833313             298.779938
                              356.666656             298.913147
                              357.500000                    NaN
                              358.333313                    NaN
                              359.166656             298.995148

[419904 rows x 1 columns]
Using add_aux_coords=True maps AuxCoord and scalar coordinate information to the DataFrame:

>>> df = as_data_frame(cube, add_aux_coords=True)
>>> print(df)
...
                                            surface_temperature  forecast_period forecast_reference_time
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000               301.659271                0     2006-04-16 12:00:00
                              0.833333               301.785004                0     2006-04-16 12:00:00
                              1.666667               301.820984                0     2006-04-16 12:00:00
                              2.500000               301.865234                0     2006-04-16 12:00:00
                              3.333333               301.926819                0     2006-04-16 12:00:00
...                                                         ...              ...                     ...
2010-09-16 00:00:00 4.444450  355.833313             298.779938                0     2010-09-16 12:00:00
                              356.666656             298.913147                0     2010-09-16 12:00:00
                              357.500000                    NaN                0     2010-09-16 12:00:00
                              358.333313                    NaN                0     2010-09-16 12:00:00
                              359.166656             298.995148                0     2010-09-16 12:00:00

[419904 rows x 3 columns]
To add netCDF global attribute information to the DataFrame, add a column directly to the DataFrame:

>>> df['STASH'] = str(cube.attributes['STASH'])
>>> print(df)
...
                                            surface_temperature  forecast_period forecast_reference_time       STASH
time                latitude  longitude
2006-04-16 00:00:00 -4.999992 0.000000               301.659271                0     2006-04-16 12:00:00  m01s00i024
                              0.833333               301.785004                0     2006-04-16 12:00:00  m01s00i024
                              1.666667               301.820984                0     2006-04-16 12:00:00  m01s00i024
                              2.500000               301.865234                0     2006-04-16 12:00:00  m01s00i024
                              3.333333               301.926819                0     2006-04-16 12:00:00  m01s00i024
...                                                         ...              ...                     ...         ...
2010-09-16 00:00:00 4.444450  355.833313             298.779938                0     2010-09-16 12:00:00  m01s00i024
                              356.666656             298.913147                0     2010-09-16 12:00:00  m01s00i024
                              357.500000                    NaN                0     2010-09-16 12:00:00  m01s00i024
                              358.333313                    NaN                0     2010-09-16 12:00:00  m01s00i024
                              359.166656             298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 4 columns]
To return a DataFrame without a MultiIndex, use reset_index(). Optionally use the inplace=True keyword to modify the DataFrame rather than creating a new one:

>>> df.reset_index(inplace=True)
>>> print(df)
...
                       time  latitude   longitude  surface_temperature  forecast_period forecast_reference_time       STASH
0       2006-04-16 00:00:00 -4.999992    0.000000           301.659271                0     2006-04-16 12:00:00  m01s00i024
1       2006-04-16 00:00:00 -4.999992    0.833333           301.785004                0     2006-04-16 12:00:00  m01s00i024
2       2006-04-16 00:00:00 -4.999992    1.666667           301.820984                0     2006-04-16 12:00:00  m01s00i024
3       2006-04-16 00:00:00 -4.999992    2.500000           301.865234                0     2006-04-16 12:00:00  m01s00i024
4       2006-04-16 00:00:00 -4.999992    3.333333           301.926819                0     2006-04-16 12:00:00  m01s00i024
...                     ...       ...         ...                  ...              ...                     ...         ...
419899  2010-09-16 00:00:00  4.444450  355.833313           298.779938                0     2010-09-16 12:00:00  m01s00i024
419900  2010-09-16 00:00:00  4.444450  356.666656           298.913147                0     2010-09-16 12:00:00  m01s00i024
419901  2010-09-16 00:00:00  4.444450  357.500000                  NaN                0     2010-09-16 12:00:00  m01s00i024
419902  2010-09-16 00:00:00  4.444450  358.333313                  NaN                0     2010-09-16 12:00:00  m01s00i024
419903  2010-09-16 00:00:00  4.444450  359.166656           298.995148                0     2010-09-16 12:00:00  m01s00i024

[419904 rows x 7 columns]
To retrieve a Series from the DataFrame, subselect a column:

>>> df['surface_temperature']
0         301.659271
1         301.785004
2         301.820984
3         301.865234
4         301.926819
             ...
419899    298.779938
419900    298.913147
419901           NaN
419902           NaN
419903    298.995148
Name: surface_temperature, Length: 419904, dtype: float32
Notes
Since this function converts to/from a Pandas object, laziness will not be preserved.
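The reset_index() step shown above works on any MultiIndex DataFrame; a self-contained sketch that does not need the Iris sample data:

```python
import pandas as pd

# A small MultiIndex frame standing in for an as_data_frame result.
idx = pd.MultiIndex.from_product(
    [[25, 35], [0, 10]], names=["latitude", "longitude"]
)
df = pd.DataFrame(
    {"surface_temperature": [300.0, 301.0, 302.0, 303.0]}, index=idx
)

# inplace=True modifies df directly, avoiding a second copy in memory.
df.reset_index(inplace=True)
print(list(df.columns))  # ['latitude', 'longitude', 'surface_temperature']
```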
- iris.pandas.as_series(cube, copy=True)[source]#
Convert a 1D cube to a Pandas Series.
Deprecated since version 3.4.0: This function is scheduled for removal in a future release, being replaced by iris.pandas.as_data_frame(), which offers improved multi-dimensional handling.
- Parameters:
  cube (Cube) – The cube to convert to a Pandas Series.
  copy (bool, default=True) – Whether to make a copy of the data. Defaults to True. Must be True for masked data.
Notes
This function will copy your data by default. If you have a large array that cannot be copied, make sure it is not masked and use copy=False.
Notes
Since this function converts to/from a Pandas object, laziness will not be preserved.
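The copy=True requirement for masked data can be illustrated with pandas and NumPy alone (note: a Series cannot view a masked array directly, so masked points must first be materialised, e.g. as NaN):

```python
import numpy as np
import pandas as pd

masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

# Building a Series forces the mask to be resolved into concrete values,
# which is why a view (copy=False) of masked data is not possible.
series = pd.Series(masked.filled(np.nan))
print(series.isna().tolist())  # [False, True, False]
```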