Gridded Datasets

In [1]:
import xarray as xr
import numpy as np
import holoviews as hv
hv.extension('matplotlib')
%opts Scatter3D [size_index=None color_index=3] (cmap='fire')

In the Tabular Data guide we covered how to work with columnar data in HoloViews. Apart from tabular or column-based data, another data format is particularly common in science and engineering: multi-dimensional arrays. The gridded data interfaces allow HoloViews to work with grid-based datasets directly.

Grid-based datasets have two types of dimensions:

  • they have coordinate or key dimensions, which describe the sampling along each axis of the value arrays
  • they have value dimensions, which describe the quantities stored in the multi-dimensional value arrays

Declaring gridded data

All Elements that support a ColumnInterface also support the GridInterface. The simplest example of a multi-dimensional (or more precisely 2D) gridded dataset is an image, which has implicit or explicit x-coordinates, y-coordinates and an array representing the values for each combination of these coordinates. Let us start by declaring an Image with explicit x- and y-coordinates:

In [2]:
img = hv.Image((range(10), range(5), np.random.rand(5, 10)), datatype=['grid'])
img
Out[2]:

In the above example we defined 10 samples along the x-axis and 5 samples along the y-axis, and then a random 5x10 array matching those dimensions. This follows the NumPy (row, column) indexing convention, so the value array has shape (number of y samples, number of x samples). When passing a tuple, HoloViews will use the first gridded data interface, which stores the coordinates and value arrays as a dictionary mapping each dimension name to a NumPy array representing the data:

In [3]:
img.data
Out[3]:
{'x': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'y': array([0, 1, 2, 3, 4]),
 'z': array([[ 0.39038447,  0.5809186 ,  0.12968233,  0.98518853,  0.82353369,
          0.00904825,  0.92951999,  0.06997991,  0.09565921,  0.39314974],
        [ 0.19149875,  0.30422852,  0.51478074,  0.76948423,  0.0908213 ,
          0.47405966,  0.57984901,  0.59721032,  0.40688238,  0.92245316],
        [ 0.20083408,  0.78531743,  0.41305037,  0.13770441,  0.5807749 ,
          0.04929245,  0.75421141,  0.91537635,  0.40221771,  0.82849946],
        [ 0.29525556,  0.31930462,  0.6573146 ,  0.02311893,  0.4155926 ,
          0.78252929,  0.83330492,  0.22257102,  0.50052556,  0.01615106],
        [ 0.36901954,  0.36214848,  0.30293863,  0.6354043 ,  0.00470442,
          0.2823036 ,  0.88763943,  0.92972773,  0.98962421,  0.4832394 ]])}
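
As a sanity check on the shape convention, here is a minimal NumPy-only sketch (not HoloViews code) of the same dictionary layout. The names `grid` and `value_at` are illustrative assumptions. The point is only that the value array has shape (len(y), len(x)), so it is indexed as [row (y), column (x)].

```python
import numpy as np

rng = np.random.default_rng(0)

# Coordinate (key dimension) arrays: 10 samples along x, 5 along y.
xs = np.arange(10)
ys = np.arange(5)

# Value array: one row per y sample, one column per x sample.
zs = rng.random((len(ys), len(xs)))

# Same layout as the dictionary shown above.
grid = {'x': xs, 'y': ys, 'z': zs}

# The value array must match (len(y), len(x)).
assert grid['z'].shape == (len(grid['y']), len(grid['x']))

# Look up the value at y=3, x=7: the row (y) index comes first.
value_at = grid['z'][3, 7]
print(grid['z'].shape)
```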

However, HoloViews also ships with interfaces for xarray and iris, two common libraries for working with multi-dimensional datasets:

In [4]:
xr_img = img.clone(datatype=['xarray'])
arr_img = img.clone(datatype=['image'])
iris_img = img.clone(datatype=['cube'])

print(type(xr_img.data))
print(type(iris_img.data))
print(type(arr_img.data))
<class 'xarray.core.dataset.Dataset'>
<class 'iris.cube.Cube'>
<type 'numpy.ndarray'>

In the case of an Image, HoloViews also has a simple image representation, which stores the data as a single array and converts the x- and y-coordinates to a set of bounds:

In [5]:
print("Array type: %s with bounds %s" % (type(arr_img.data), arr_img.bounds))
Array type: <type 'numpy.ndarray'> with bounds BoundingBox(points=((-0.5,-0.5),(9.5,4.5)))
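
The bounds printed above are consistent with treating each coordinate as the center of a cell, so the outer edges extend half a sample beyond the first and last coordinates. The sketch below reproduces that arithmetic in plain NumPy. `bounds_from_coords` is a hypothetical helper, not a HoloViews function, and it assumes evenly spaced, ascending coordinates.

```python
import numpy as np

def bounds_from_coords(xs, ys):
    """Return (left, bottom, right, top) for evenly spaced cell centers.

    Hypothetical helper for illustration only.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Half of the sample spacing along each axis.
    half_dx = (xs[1] - xs[0]) / 2.0
    half_dy = (ys[1] - ys[0]) / 2.0
    return (float(xs[0] - half_dx), float(ys[0] - half_dy),
            float(xs[-1] + half_dx), float(ys[-1] + half_dy))

bounds = bounds_from_coords(range(10), range(5))
print(bounds)  # (-0.5, -0.5, 9.5, 4.5)
```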

To summarize, the constructor accepts a number of formats. In each of them the shape of the value arrays should match the coordinate arrays:

1. A simple np.ndarray along with (l, b, r, t) bounds
2. A tuple of the coordinate and value arrays
3. A dictionary of the coordinate and value arrays indexed by their dimension names
4. An xarray DataArray or xarray Dataset
5. An Iris cube

Working with a multi-dimensional dataset

A gridded Dataset may have as many dimensions as desired; however, individual Element types only support data of a certain dimensionality. Therefore we usually declare a Dataset to hold our multi-dimensional data and derive lower-dimensional Elements from it.

In [6]:
dataset3d = hv.Dataset((range(3), range(5), range(7), np.random.randn(7, 5, 3)),
                       kdims=['x', 'y', 'z'], vdims=['Value'])
dataset3d
Out[6]:
:Dataset   [x,y,z]   (Value)

Even a 3D array represents volumetric data, which we can only display easily if it contains just a few samples. In this simple case we can get an overview of what the data looks like by casting it to a Scatter3D Element, which will also help us visualize the operations we apply to the data:

In [7]:
hv.Scatter3D(dataset3d)
Out[7]:

Indexing

To explore the dataset we therefore often define a lower-dimensional slice of the array and then convert it to an Element:

In [8]:
dataset3d.select(x=1).to(hv.Image, ['y', 'z']) + hv.Scatter3D(dataset3d.select(x=1))
Out[8]:
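
In terms of the underlying array, selecting x=1 amounts to picking one index along the last axis, since the values were declared with shape (len(z), len(y), len(x)). The following is a minimal NumPy-only sketch with illustrative variable names. It mirrors the idea of the selection, not the HoloViews implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Coordinates matching the dataset3d declaration above.
xs, ys, zs = np.arange(3), np.arange(5), np.arange(7)

# Values laid out as (z, y, x), like np.random.randn(7, 5, 3).
values = rng.standard_normal((len(zs), len(ys), len(xs)))

# Find the index of the coordinate x == 1, then slice the last axis.
x_index = int(np.flatnonzero(xs == 1)[0])
plane = values[:, :, x_index]

# The remaining 2D array has one row per z sample and one column per y sample.
print(plane.shape)  # (7, 5)
```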

Groupby

Another common way to explore the data is to facet or animate it using groupby operations. HoloViews provides a convenient interface to apply groupby operations and select which dimensions to visualize:

In [9]:
(dataset3d.to(hv.Image, ['y', 'z'], 'Value', ['x']) +
hv.HoloMap({x: hv.Scatter3D(dataset3d.select(x=x)) for x in range(3)}, kdims=['x']))
Out[9]:
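
Conceptually, grouping by x yields one 2D (z, y) array per x-coordinate. The sketch below shows that idea with a plain dictionary in NumPy. The name `planes_by_x` is an assumption for illustration, and this is not how HoloViews stores a HoloMap internally.

```python
import numpy as np

rng = np.random.default_rng(0)

xs, ys, zs = np.arange(3), np.arange(5), np.arange(7)
values = rng.standard_normal((len(zs), len(ys), len(xs)))  # (z, y, x)

# One (z, y) plane per x-coordinate, keyed by the coordinate value.
planes_by_x = {int(x): values[:, :, i] for i, x in enumerate(xs)}

for x, plane in planes_by_x.items():
    print(x, plane.shape)
```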

Aggregating

Another common operation is to aggregate the data with a function, thereby reducing a dimension. You can either aggregate the data by passing the dimensions to aggregate by (here 'x' and 'y', which are kept), or reduce a specific dimension (here 'z'). Both give the same result:

In [10]:
hv.Image(dataset3d.aggregate(['x', 'y'], np.mean)) + hv.Image(dataset3d.reduce(z=np.mean))
Out[10]:
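
Conceptually, both calls above collapse the z axis with np.mean while keeping x and y. Assuming the (z, y, x) array layout used when declaring dataset3d, the equivalent NumPy operation is a mean over axis 0. This is a sketch of the idea only, with illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values laid out as (z, y, x), like np.random.randn(7, 5, 3).
values = rng.standard_normal((7, 5, 3))

# Collapse the z axis (axis 0), keeping y and x.
mean_yx = values.mean(axis=0)

print(mean_yx.shape)  # (5, 3)
```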

By aggregating the data we can reduce it to any number of dimensions we want. We can for example compute the spread of values for each z-coordinate and plot it using a Spread and Curve Element. We simply aggregate by that dimension and pass the aggregation functions we want to apply:

In [11]:
hv.Spread(dataset3d.aggregate('z', np.mean, np.std)) * hv.Curve(dataset3d.aggregate('z', np.mean))
Out[11]:
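
In array terms, aggregating by z with np.mean and np.std corresponds to reducing over the other two axes. The NumPy-only sketch below computes the per-z mean and standard deviation, plus the band a Spread would cover under the assumption that it is drawn as mean plus or minus one standard deviation. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values laid out as (z, y, x), like np.random.randn(7, 5, 3).
values = rng.standard_normal((7, 5, 3))

# Reduce over y (axis 1) and x (axis 2), leaving one value per z sample.
z_mean = values.mean(axis=(1, 2))
z_std = values.std(axis=(1, 2))

# Band around the mean, assuming mean +/- one standard deviation.
lower = z_mean - z_std
upper = z_mean + z_std

print(z_mean.shape, z_std.shape)  # (7,) (7,)
```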

It is also possible to generate lower-dimensional views into the dataset, which can be useful to summarize the statistics of the data along a particular dimension. A simple example is a box-whisker plot of the Value for each x-coordinate. Using the .to conversion interface we declare that we want a BoxWhisker Element indexed by the x dimension showing the Value dimension. Additionally, we have to set groupby to an empty list, because by default the interface will group over any remaining dimensions.

In [12]:
dataset3d.to(hv.BoxWhisker, 'x', 'Value', groupby=[])
Out[12]:
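
To make concrete which samples each box summarizes, the sketch below gathers all values sharing an x-coordinate and computes their quartiles with NumPy. It assumes the (z, y, x) layout used for dataset3d. The names are illustrative, and the exact whisker statistics are left to the plotting backend.

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.arange(3)
values = rng.standard_normal((7, 5, 3))  # (z, y, x)

# All 7 * 5 = 35 samples that share each x-coordinate.
samples_by_x = {int(x): values[:, :, i].ravel() for i, x in enumerate(xs)}

# 25th, 50th and 75th percentiles of each group.
quartiles_by_x = {x: np.percentile(s, [25, 50, 75])
                  for x, s in samples_by_x.items()}

for x, q in quartiles_by_x.items():
    print(x, q)
```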

Similarly we can generate a Distribution Element showing the Value dimension, group by the 'x' dimension and then overlay the distributions, giving us another statistical summary of the data:

In [13]:
dataset3d.to(hv.Distribution, [], 'Value', groupby='x').overlay()
Out[13]:

Categorical dimensions

The key dimensions of the multi-dimensional arrays do not have to represent continuous values. We can display datasets with categorical variables as a HeatMap Element:

In [14]:
heatmap = hv.HeatMap((['A', 'B', 'C'], ['a', 'b', 'c', 'd', 'e'], np.random.rand(5, 3)))
heatmap + heatmap.table()
Out[14]:

API

Accessing the data

In order to work with data in different formats, HoloViews defines a general interface to access the data. The dimension_values method returns the underlying arrays.

Key dimensions (coordinates)

By default dimension_values will return the expanded columnar format of the data:

In [15]:
heatmap.dimension_values('x')
Out[15]:
array(['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C',
       'C', 'C'],
      dtype='|S1')

To access just the unique coordinates along a dimension simply supply the expanded=False keyword:

In [16]:
heatmap.dimension_values('x', expanded=False)
Out[16]:
array(['A', 'B', 'C'],
      dtype='|S1')

Finally, we can also get a non-flattened, expanded coordinate array, i.e. a coordinate array of the same shape as the value arrays:

In [17]:
heatmap.dimension_values('x', flat=False)
Out[17]:
array([['A', 'A', 'A', 'A', 'A'],
       ['B', 'B', 'B', 'B', 'B'],
       ['C', 'C', 'C', 'C', 'C']],
      dtype='|S1')
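
The three coordinate views printed above can be reproduced with plain NumPy, which may help clarify how they relate. This is only a sketch that matches the outputs shown. The variable names are assumptions and it does not reflect HoloViews internals.

```python
import numpy as np

xs = np.array(['A', 'B', 'C'])
ys = np.array(['a', 'b', 'c', 'd', 'e'])

# Like expanded=False: just the unique coordinates.
unique_x = xs

# Like the default: each x repeated once per y sample, flattened.
expanded_x = np.repeat(xs, len(ys))

# Like flat=False as printed above: the same values, one row per x.
grid_x = expanded_x.reshape(len(xs), len(ys))

print(expanded_x)
print(grid_x.shape)  # (3, 5)
```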

Value dimensions

When accessing a value dimension the method will also return a flat view of the data:

In [18]:
heatmap.dimension_values('z')
Out[18]:
array([ 0.7228966 ,  0.31783408,  0.93571534,  0.4231442 ,  0.49566044,
        0.28835859,  0.53386977,  0.61899398,  0.08347936,  0.73628744,
        0.49255667,  0.43946026,  0.22211106,  0.29402531,  0.88105038])

We can pass the flat=False argument to access the multi-dimensional array:

In [19]:
heatmap.dimension_values('z', flat=False)
Out[19]:
array([[ 0.7228966 ,  0.28835859,  0.49255667],
       [ 0.31783408,  0.53386977,  0.43946026],
       [ 0.93571534,  0.61899398,  0.22211106],
       [ 0.4231442 ,  0.08347936,  0.29402531],
       [ 0.49566044,  0.73628744,  0.88105038]])
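
Comparing the two outputs above, the flat array lists all values for the first x-coordinate, then the second, and so on. In NumPy terms, that ordering matches flattening the transposed 2D array. The sketch below checks this using the numbers printed above. It describes these outputs only and makes no claim about how HoloViews computes them.

```python
import numpy as np

# The 2D array printed for flat=False (rows: y, columns: x).
z2d = np.array([[0.7228966,  0.28835859, 0.49255667],
                [0.31783408, 0.53386977, 0.43946026],
                [0.93571534, 0.61899398, 0.22211106],
                [0.4231442,  0.08347936, 0.29402531],
                [0.49566044, 0.73628744, 0.88105038]])

# The flat array printed by the default call.
flat_printed = np.array([0.7228966,  0.31783408, 0.93571534, 0.4231442,
                         0.49566044, 0.28835859, 0.53386977, 0.61899398,
                         0.08347936, 0.73628744, 0.49255667, 0.43946026,
                         0.22211106, 0.29402531, 0.88105038])

# Transpose so x becomes the slowest-varying axis, then flatten.
flat = z2d.T.ravel()

print(np.allclose(flat, flat_printed))  # True
```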