Columnar Data

In this tutorial we will explore how to work with columnar data in HoloViews. Columnar data has a fixed list of column headings, with values stored in an arbitrarily long list of rows. Spreadsheets, relational databases, CSV files, and many other typical data sources fit naturally into this format. HoloViews defines an extensible system of interfaces to load, manipulate, and visualize this kind of data, and also allows any of the non-columnar data types to be converted into columnar data for analysis or data interchange.

By default HoloViews will use one of three storage formats for columnar data:

  • A pure Python dictionary containing each column.
  • A purely NumPy-based format for numeric data.
  • A pandas DataFrame.
In [1]:
import numpy as np
import pandas as pd
import holoviews as hv
from IPython.display import HTML
hv.notebook_extension()

Simple Dataset

Usually when working with data we have one or more independent variables, taking the form of categories, labels, discrete sample coordinates, or bins. These variables are what we refer to as key dimensions (or kdims for short) in HoloViews. The observed or dependent variables, on the other hand, are referred to as value dimensions (vdims), and are ordinarily measured or calculated given the independent variables. The simplest useful form of a Dataset object is therefore a column 'x' and a column 'y' corresponding to the key dimension and value dimension respectively. An obvious visual representation of this data is a Table:

In [2]:
xs = range(10)
ys = np.exp(xs)

table = hv.Table((xs, ys), kdims=['x'], vdims=['y'])
table
Out[2]:
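
To make the distinction between key and value dimensions concrete, here is a minimal plain-Python sketch of the table above. It is not HoloViews code: the rows helper is hypothetical and only illustrates that the kdims act as the lookup key for the vdims.

```python
import math

# Columns of the table above, stored as a plain dictionary.
columns = {'x': list(range(10)), 'y': [math.exp(x) for x in range(10)]}
kdims, vdims = ['x'], ['y']   # independent and dependent variables

def rows(columns, kdims, vdims):
    """Yield one (key, value) pair of tuples per row (hypothetical helper)."""
    n = len(columns[kdims[0]])
    for i in range(n):
        key = tuple(columns[d][i] for d in kdims)
        value = tuple(columns[d][i] for d in vdims)
        yield key, value

# Each key tuple identifies exactly one row of values.
lookup = dict(rows(columns, kdims, vdims))
print(lookup[(2,)])   # the y value measured at x = 2
```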

However, this data has many more meaningful visual representations, and therefore the first important concept is that Dataset objects are interchangeable as long as their dimensionality allows it, meaning that you can easily create the different objects from the same data (and cast between the objects once created):

In [3]:
hv.Scatter(table) + hv.Curve(table) + hv.Bars(table)
Out[3]:

Each of these three plots uses the same data, but represents a different assumption about the semantic meaning of that data -- the Scatter plot is appropriate if that data consists of independent samples, the Curve plot is appropriate for samples chosen from an underlying smooth function, and the Bars plot is appropriate for independent categories of data. Since all these plots have the same dimensionality, they can easily be converted to each other, but there is normally only one of these representations that is semantically appropriate for the underlying data. For this particular data, the semantically appropriate choice is Curve, since the y values are samples from the continuous function exp.

As a guide to which Elements can be converted to each other, those of the same dimensionality here should be interchangeable, because of the underlying similarity of their columnar representation:

  • 0D: BoxWhisker, Spikes, Distribution*
  • 1D: Scatter, Curve, ErrorBars, Spread, Bars, BoxWhisker, Regression*
  • 2D: Points, HeatMap, Bars, BoxWhisker, Bivariate*
  • 3D: Scatter3D, Trisurface, VectorField, BoxWhisker, Bars

* - requires Seaborn

This categorization is based only on the kdims, which define the space in which the data has been sampled or defined. An Element can also have any number of value dimensions (vdims), which may be mapped onto various attributes of a plot such as the color, size, and orientation of the plotted items. For a reference of how to use these various Element types, see the Elements Tutorial.

Data types and Constructors

As discussed above, Dataset types provide an extensible interface to store and operate on data in different formats. All interfaces support a number of standard constructors.

Storage formats

Dataset types can be constructed using one of three supported formats, (a) a dictionary of columns, (b) an NxD array with N rows and D columns, or (c) pandas dataframes:

In [4]:
print(hv.Scatter({'x': xs, 'y': ys}) +
      hv.Scatter(np.column_stack([xs, ys])) +
      hv.Scatter(pd.DataFrame({'x': xs, 'y': ys})))
:Layout
   .Scatter.I   :Scatter   [x]   (y)
   .Scatter.II  :Scatter   [x]   (y)
   .Scatter.III :Scatter   [x]   (y)

Literals

In addition to the main storage formats, Dataset Elements support construction from three Python literal formats: (a) an iterator of y-values, (b) a tuple of columns, and (c) an iterator of row tuples.

In [5]:
print(hv.Scatter(ys) + hv.Scatter((xs, ys)) + hv.Scatter(zip(xs, ys)))
:Layout
   .Scatter.I   :Scatter   [x]   (y)
   .Scatter.II  :Scatter   [x]   (y)
   .Scatter.III :Scatter   [x]   (y)

For these inputs, the data will need to be copied to a new data structure, having one of the three storage formats above. By default Dataset will try to construct a simple array, falling back to either pandas dataframes (if available) or the dictionary-based format if the data is not purely numeric. Additionally, the interfaces will try to maintain the provided data's type, so NumPy arrays and pandas DataFrames will always be parsed first by the array and dataframe interfaces respectively.
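
As a rough illustration of how the three literal formats can end up in the same columnar layout, the following plain-Python sketch normalizes each of them into a dictionary of columns. The to_columns function is hypothetical and is not how HoloViews implements its constructors; in particular, using the integer position as the x coordinate for bare y-values is an assumption made here.

```python
def to_columns(data, kdim='x', vdim='y'):
    """Normalize three literal layouts into a dictionary of columns."""
    if isinstance(data, tuple):
        # (b) a tuple of columns: (xs, ys)
        xs, ys = (list(col) for col in data)
        return {kdim: xs, vdim: ys}
    items = list(data)
    if items and isinstance(items[0], tuple):
        # (c) an iterable of (x, y) row tuples
        xs, ys = (list(col) for col in zip(*items))
        return {kdim: xs, vdim: ys}
    # (a) bare y-values: assume the integer position is the x coordinate
    return {kdim: list(range(len(items))), vdim: items}

xs = [10, 20, 30]
ys = [1.0, 2.5, 4.0]
a = to_columns(ys)            # iterable of y-values
b = to_columns((xs, ys))      # tuple of columns
c = to_columns(zip(xs, ys))   # iterable of row tuples
```

Formats (b) and (c) carry the same information and therefore produce identical columns, while (a) has to invent the key dimension.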

In [6]:
df = pd.DataFrame({'x': xs, 'y': ys, 'z': ys*2})
print(type(hv.Scatter(df).data))
<class 'pandas.core.frame.DataFrame'>

Dataset will attempt to parse the supplied data, falling back to each consecutive interface if the previous could not interpret the data. The default list of fallbacks and simultaneously the list of allowed datatypes is:

In [7]:
hv.Dataset.datatype
Out[7]:
['array', 'dataframe', 'dictionary', 'grid', 'ndelement', 'cube', 'xarray']
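
The fallback behaviour can be pictured as a loop over candidate interfaces that stops at the first one able to interpret the data. The sketch below is a simplified plain-Python model with two made-up interfaces (as_array and as_dictionary are hypothetical names); the real interfaces are more involved.

```python
def as_array(data):
    """Accept only purely numeric columns; return a list of rows."""
    cols = list(data.values())
    if not all(isinstance(v, (int, float)) for col in cols for v in col):
        raise ValueError('array interface needs numeric data')
    return [list(row) for row in zip(*cols)]

def as_dictionary(data):
    """Accept any columns of equal length."""
    if len({len(col) for col in data.values()}) > 1:
        raise ValueError('columns differ in length')
    return {name: list(col) for name, col in data.items()}

INTERFACES = {'array': as_array, 'dictionary': as_dictionary}

def parse(data, datatype=('array', 'dictionary')):
    """Return (name, parsed) for the first interface that accepts the data."""
    for name in datatype:
        try:
            return name, INTERFACES[name](data)
        except ValueError:
            continue   # fall back to the next allowed datatype
    raise ValueError(f'no interface in {list(datatype)} could interpret the data')

numeric = parse({'x': [0, 1], 'y': [1.0, 2.7]})
mixed = parse({'x': ['a', 'b'], 'y': [1.0, 2.7]})
forced = parse({'x': [0, 1], 'y': [1.0, 2.7]}, datatype=['dictionary'])
print(numeric[0], mixed[0], forced[0])
```

Restricting the datatype list both selects the storage format and limits what is allowed, which mirrors the dual role of the list described above.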

Note that these include grid-based datatypes, which are not covered in this tutorial. To select a particular storage format explicitly, supply one or more allowed datatypes:

In [8]:
print(type(hv.Scatter((xs, ys), datatype=['array']).data))
print(type(hv.Scatter((xs, ys), datatype=['dictionary']).data))
print(type(hv.Scatter((xs, ys), datatype=['dataframe']).data))
<type 'numpy.ndarray'>
<class 'collections.OrderedDict'>
<class 'pandas.core.frame.DataFrame'>

Sharing Data

Since the formats with labelled columns do not require any specific order, each Element can effectively become a view into a single set of data. By specifying different key and value dimensions, many Elements can show different values, while sharing the same underlying data source.

In [9]:
overlay = hv.Scatter(df, kdims='x', vdims='y') * hv.Scatter(df, kdims='x', vdims='z')
overlay
Out[9]:

We can quickly confirm that the data is actually shared:

In [10]:
overlay.Scatter.I.data is overlay.Scatter.II.data
Out[10]:
True

For columnar data, this approach is much more efficient than creating copies of the data for each Element, and allows for some advanced features like linked brushing in the Bokeh backend.
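
The idea of several views onto one data source can be sketched without HoloViews. The View class below is hypothetical; it only shows that each view stores a reference plus the names of the columns it cares about, so no data is copied.

```python
class View:
    """A lightweight view that stores a reference to shared columns."""
    def __init__(self, data, kdim, vdim):
        self.data = data          # no copy: every view points at one object
        self.kdim, self.vdim = kdim, vdim

    def points(self):
        """Return the (key, value) pairs this view would display."""
        return list(zip(self.data[self.kdim], self.data[self.vdim]))

shared = {'x': [0, 1, 2], 'y': [1.0, 2.0, 4.0], 'z': [2.0, 4.0, 8.0]}
xy = View(shared, 'x', 'y')
xz = View(shared, 'x', 'z')

print(xy.data is xz.data)   # both views hold the same dictionary

# Appending a row to the shared columns is visible through both views.
shared['x'].append(3)
shared['y'].append(8.0)
shared['z'].append(16.0)
print(xy.points()[-1], xz.points()[-1])
```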

Converting to raw data

Dataset types make it easy to export the data to the three basic formats: arrays, dataframes, and a dictionary of columns.

Array
In [11]:
table.array()
Out[11]:
array([[  0.00000000e+00,   1.00000000e+00],
       [  1.00000000e+00,   2.71828183e+00],
       [  2.00000000e+00,   7.38905610e+00],
       [  3.00000000e+00,   2.00855369e+01],
       [  4.00000000e+00,   5.45981500e+01],
       [  5.00000000e+00,   1.48413159e+02],
       [  6.00000000e+00,   4.03428793e+02],
       [  7.00000000e+00,   1.09663316e+03],
       [  8.00000000e+00,   2.98095799e+03],
       [  9.00000000e+00,   8.10308393e+03]])
Pandas DataFrame
In [12]:
HTML(table.dframe().head().to_html())
Out[12]:
x y
0 0.0 1.000000
1 1.0 2.718282
2 2.0 7.389056
3 3.0 20.085537
4 4.0 54.598150
Dataset dictionary
In [13]:
table.columns()
Out[13]:
{'x': array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.]),
 'y': array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
          2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
          4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
          8.10308393e+03])}

Creating tabular data from Elements using the .table and .dframe methods

If you have data in some other HoloViews element and would like to use the columnar data features, you can easily tabularize any of the core Element types into a Table Element, using the .table() method. Similarly, the .dframe() method will convert an Element into a pandas DataFrame. These methods are very useful if you want to then transform the data into a different Element type, or to perform different types of analysis.

Tabularizing simple Elements

For a simple example, we can create a Curve of an exponential function and convert it to a Table with the .table method, with the same result as creating the Table directly from the data as done earlier in this tutorial:

In [14]:
xs = np.arange(10)
curve = hv.Curve(zip(xs, np.exp(xs)))
curve * hv.Scatter(zip(xs, curve)) + curve.table()
Out[14]:

Similarly, we can get a pandas dataframe of the Curve using curve.dframe(). Here we wrap that call as raw HTML to allow automated testing of this notebook, but just calling curve.dframe() would give the same result visually:

In [15]:
HTML(curve.dframe().to_html())
Out[15]:
x y
0 0.0 1.000000
1 1.0 2.718282
2 2.0 7.389056
3 3.0 20.085537
4 4.0 54.598150
5 5.0 148.413159
6 6.0 403.428793
7 7.0 1096.633158
8 8.0 2980.957987
9 9.0 8103.083928

Although 2D image-like objects are not inherently well suited to a flat columnar representation, serializing them by converting to tabular data is a good way to reveal the differences between Image and Raster elements. Rasters are a very simple type of element, using array-like integer indexing of rows and columns from their top-left corner as in computer graphics applications. Conversely, Image elements are a higher-level abstraction that provides a general-purpose continuous Cartesian coordinate system, with x and y increasing to the right and upwards as in mathematical applications, and each point interpreted as a sample representing the pixel in which it is located (and thus centered within that pixel). Given the same data, the .table() representation will show how the data is being interpreted (and accessed) differently in the two cases (as explained in detail in the Continuous Coordinates Tutorial):

In [16]:
%%opts Points (s=200) [size_index=None]
extents = (-1.6,-2.7,2.0,3)
np.random.seed(42)
mat = np.random.rand(3, 3)

img = hv.Image(mat, bounds=extents)
raster = hv.Raster(mat)

img    * hv.Points(img)    + img.table() + \
raster * hv.Points(raster) + raster.table()
Out[16]:
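
The two coordinate conventions can be sketched in plain Python. The helper functions below are hypothetical, and the mapping shown for the Image case (bounds given as left, bottom, right, top, with the first array row at the top) is an assumption based on the description above rather than a copy of the HoloViews implementation.

```python
def raster_coords(nrows, ncols):
    """Integer (column, row) indices counted from the top-left corner."""
    return [(col, row) for row in range(nrows) for col in range(ncols)]

def image_coords(nrows, ncols, bounds):
    """Continuous (x, y) pixel centers inside bounds = (left, bottom, right, top)."""
    left, bottom, right, top = bounds
    dx = (right - left) / ncols   # pixel width
    dy = (top - bottom) / nrows   # pixel height
    # Each sample sits in the middle of its pixel; y decreases with the row index.
    return [(left + (col + 0.5) * dx, top - (row + 0.5) * dy)
            for row in range(nrows) for col in range(ncols)]

extents = (-1.6, -2.7, 2.0, 3)
print(raster_coords(3, 3)[:3])
print([(round(x, 2), round(y, 2)) for x, y in image_coords(3, 3, extents)[:3]])
```

The same 3x3 array therefore yields integer positions in one case and pixel-center coordinates within the declared bounds in the other.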

Tabularizing space containers

Even deeply nested objects can be deconstructed in this way, serializing them to make it easier to get your raw data out of a collection of specialized Element types. Let's say we want to make multiple observations of a noisy signal. We can collect the data into a HoloMap to visualize it and then call .table() to get a columnar object where we can perform operations or transform it to other Element types. Deconstructing nested data in this way only works if the data is homogeneous. In practical terms, the requirement is that your data structure contains Elements (of any types) in these Container types: NdLayout, GridSpace, HoloMap, and NdOverlay, with all dimensions consistent throughout (so that they can all fit into the same set of columns).

Let's go back to the Image example. We will now collect a number of observations of some noisy data into a HoloMap and display it:

In [17]:
obs_hmap = hv.HoloMap({i: hv.Image(np.random.randn(10, 10), bounds=(0,0,3,3))
                   for i in range(3)}, kdims=['Observation'])
obs_hmap
Out[17]:

Now we can serialize this data just as before, where this time we get a four-column (4D) table. The key dimensions of both the HoloMap and the Images, as well as the z-values of each Image, are all merged into a single table. We can visualize the samples we have collected by converting it to a Scatter3D object.

In [18]:
%%opts Layout [fig_size=150] Scatter3D [color_index=3 size_index=None] (cmap='hot' edgecolor='k' s=50)
obs_hmap.table().to.scatter3d() + obs_hmap.table()
Out[18]:

Here the z dimension is shown by color, as in the original images, and the other three dimensions determine where the datapoint is shown in 3D. This way of deconstructing will work for any data structure that satisfies the conditions described above, no matter how nested. If we vary the amount of noise while continuing to perform multiple observations, we can create an NdLayout of HoloMaps, one for each level of noise, and animated by the observation number.

In [19]:
from itertools import product
extents = (0,0,3,3)
error_hmap = hv.HoloMap({(i, j): hv.Image(j*np.random.randn(3, 3), bounds=extents)
                         for i, j in product(range(3), np.linspace(0, 1, 3))},
                        kdims=['Observation', 'noise'])
noise_layout = error_hmap.layout('noise')
noise_layout
Out[19]:

And again, we can easily convert the object to a Table:

In [20]:
%%opts Table [fig_size=150]
noise_layout.table()
Out[20]:
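
Conceptually, tabularizing a nested container means walking the nesting and prepending each container key to the rows of the innermost elements. The following plain-Python sketch shows the idea with nested dictionaries standing in for the NdLayout and HoloMap. The flatten function and the sample values are made up for illustration, and integer grid indices are used in place of the continuous Image coordinates.

```python
def flatten(nested, key=()):
    """Yield flat rows from nested dictionaries whose leaves are 2D grids."""
    if isinstance(nested, dict):
        for k, child in nested.items():
            # Carry the container key down as a leading column.
            yield from flatten(child, key + (k,))
    else:
        # Leaf: a grid stored as a list of rows; emit (x, y, z) samples.
        for y, row in enumerate(nested):
            for x, z in enumerate(row):
                yield key + (x, y, z)

# Outer key: noise level; inner key: observation number.
layout = {0.0: {0: [[0.0, 0.0], [0.0, 0.0]]},
          0.5: {0: [[0.1, -0.2], [0.3, 0.0]],
                1: [[-0.4, 0.2], [0.1, 0.5]]}}

header = ('noise', 'Observation', 'x', 'y', 'z')
table = list(flatten(layout))
print(header)
print(table[4])
```

This only works because every leaf has the same dimensions, which is the homogeneity requirement stated earlier.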

Applying operations to the data

Sorting by columns

Once data is in columnar form, it is simple to apply a variety of operations. For instance, Dataset objects can be sorted by their dimensions using the .sort() method. By default, this method will sort by the key dimensions, but one or more other dimensions can be supplied to sort by those instead:

In [21]:
bars = hv.Bars((['C', 'A', 'B', 'D'], [2, 7, 3, 4]))
bars + bars.sort() + bars.sort(['y'])
Out[21]:
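
For readers who prefer to see the row-level effect, here is a plain-Python sketch of the same three orderings. The sort_rows helper is hypothetical and simply sorts row tuples by the named columns.

```python
bars = [('C', 2), ('A', 7), ('B', 3), ('D', 4)]   # (x, y) rows
columns = ('x', 'y')

def sort_rows(rows, by=('x',)):
    """Return the rows ordered by the named columns, leftmost first."""
    idx = [columns.index(name) for name in by]
    return sorted(rows, key=lambda row: tuple(row[i] for i in idx))

by_key = sort_rows(bars)            # default: sort by the key dimension
by_value = sort_rows(bars, ['y'])   # sort by the value dimension instead
print(by_key)
print(by_value)
```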

Working with categorical or grouped data

Data is often grouped in various ways, and the Dataset interface provides various means to easily compare between groups and apply statistical aggregates. We'll start by generating some synthetic data with two groups along the x-axis and four groups along the y-axis.

In [22]:
n = np.arange(1000)
xs = np.repeat(range(2), 500)
ys = n%4
zs = np.random.randn(1000)
table = hv.Table((xs, ys, zs), kdims=['x', 'y'], vdims=['z'])
table
Out[22]:

Since there are repeated observations of the same x- and y-values, we have to reduce the data before we display it or else use an Element type that supports plotting distributions in this way. The BoxWhisker type allows doing exactly that:

In [23]:
%%opts BoxWhisker [aspect=2 fig_size=200 bgcolor='w']
hv.BoxWhisker(table)
Out[23]:

Aggregating/Reducing dimensions

Most types require the data to be non-duplicated before being displayed. For this purpose, HoloViews makes it easy to aggregate and reduce the data. These two operations are simple inverses of each other--aggregate computes a statistic for each group in the supplied dimensions, while reduce combines all the groups except the supplied dimensions. Supplying only a function and no dimensions will simply aggregate or reduce all available key dimensions.

In [24]:
%%opts Bars [show_legend=False] {+axiswise}
hv.Bars(table).aggregate(function=np.mean) + hv.Bars(table).reduce(x=np.mean)
Out[24]:

(A) aggregates over both the x and y dimension, computing the mean for each x/y group, while (B) reduces the x dimension leaving just the mean for each group along y.
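
The relationship between the two operations can be sketched in plain Python on a handful of made-up rows. The aggregate and reduce functions below are simplified stand-ins, not the HoloViews methods: one groups by the dimensions it is given, the other groups by everything except the dimension it is given.

```python
from collections import defaultdict
from statistics import fmean

# (x, y, z) rows: key dimensions x and y, value dimension z.
rows = [(0, 0, 1.0), (0, 0, 3.0), (0, 1, 2.0),
        (1, 0, 5.0), (1, 1, 4.0), (1, 1, 8.0)]
KDIMS = ('x', 'y')

def aggregate(rows, dimensions, function):
    """Apply `function` to the z values of each group along `dimensions`."""
    idx = [KDIMS.index(d) for d in dimensions]
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in idx)].append(row[-1])
    return {key: function(zs) for key, zs in sorted(groups.items())}

def reduce(rows, dimension, function):
    """Collapse `dimension`, grouping by all remaining key dimensions."""
    keep = [d for d in KDIMS if d != dimension]
    return aggregate(rows, keep, function)

per_xy = aggregate(rows, ['x', 'y'], fmean)   # one mean per x/y group
per_y = reduce(rows, 'x', fmean)              # x removed, one mean per y
print(per_xy)
print(per_y)
```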

Collapsing multiple Dataset Elements

When multiple observations are broken out into a HoloMap they can easily be combined using the collapse method. Here we create a number of Curves with increasingly large y-values. By collapsing them with a function and a spreadfn we can compute the mean curve together with its standard deviation. We then simply cast the collapsed result to a Spread and a Curve Element to visualize them.

In [25]:
hmap = hv.HoloMap({i: hv.Curve(np.arange(10)*i) for i in range(10)})
collapsed = hmap.collapse(function=np.mean, spreadfn=np.std)
hv.Spread(collapsed) * hv.Curve(collapsed) + collapsed.table()
Out[25]:
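
The arithmetic behind this collapse can be reproduced with the standard library alone. The sketch below assumes that collapsing applies the function and the spreadfn across the curves at each x sample; pstdev is used because it is the population standard deviation, which is what np.std computes by default.

```python
from statistics import fmean, pstdev

# Ten curves y = i * x sampled at x = 0..9, keyed by i as in the HoloMap.
xs = list(range(10))
curves = {i: [i * x for x in xs] for i in range(10)}

# Collapse across the keys: one mean and one spread per x sample.
mean_curve = [fmean(ys) for ys in zip(*curves.values())]
spread = [pstdev(ys) for ys in zip(*curves.values())]

print(mean_curve[2], spread[2])
```

Because every curve passes through the origin, the spread is zero at x = 0 and grows linearly with x.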

Working with complex data

In the last section we only scratched the surface of what the Dataset interface can do. It really comes into its own when working with high-dimensional datasets. As an illustration, we'll load a dataset of some macro-economic indicators for OECD countries from 1964-1990, cached on the HoloViews website.

In [26]:
macro_df = pd.read_csv('http://assets.holoviews.org/macro.csv', sep='\t')

dimensions = {'unem':    'Unemployment',
              'capmob':  'Capital Mobility',
              'gdp':     'GDP Growth', 
              'trade':   'Trade',
              'year':    'Year', 
              'country': 'Country'}

macro_df = macro_df.rename(columns=dimensions)

We'll also take this opportunity to set default options for all the following plots.

In [27]:
%output dpi=100
options = hv.Store.options()
opts = hv.Options('plot', aspect=2, fig_size=250, show_frame=False, show_grid=True, legend_position='right')
options.NdOverlay = opts
options.Overlay = opts
Loading the data

As we saw above, we can supply a dataframe to any Dataset type. When dealing with so many dimensions it would be cumbersome to supply all the dimensions explicitly, but luckily Dataset can easily infer the dimensions from the dataframe itself. We simply supply the kdims, and it will infer that all other numeric dimensions should be treated as value dimensions (vdims).

In [28]:
macro = hv.Table(macro_df, kdims=['Year', 'Country'])
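
The inference rule described above can be written down in a few lines of plain Python. The infer_vdims function is a hypothetical illustration of the rule, and the column values are made up rather than taken from macro.csv.

```python
from numbers import Number

def infer_vdims(columns, kdims):
    """Return the numeric columns that were not declared as key dimensions."""
    return [name for name, values in columns.items()
            if name not in kdims
            and all(isinstance(v, Number) for v in values)]

# Made-up values for illustration only.
columns = {'Country': ['A', 'A', 'B'],
           'Year': [1988, 1989, 1988],
           'GDP Growth': [1.0, 2.0, 3.0],
           'Unemployment': [5.0, 6.0, 7.0],
           'Notes': ['', 'revised', '']}

vdims = infer_vdims(columns, kdims=['Year', 'Country'])
print(vdims)
```

Non-numeric columns that are not key dimensions, such as the hypothetical Notes column here, are left out.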

To get an overview of the data we'll quickly sort it and then view the data for one year.

In [29]:
%%opts Table [aspect=1.5 fig_size=300]
macro = macro.sort()
macro[1988]
Out[29]:

Most of the examples above focus on converting a Table to simple Element types, but HoloViews also provides powerful container objects to explore high-dimensional data, such as HoloMap, NdOverlay, NdLayout, and GridSpace. HoloMaps work as a useful interchange format from which you can conveniently convert to the other container types using its .overlay(), .layout(), and .grid() methods. This way we can easily create an overlay of GDP Growth curves by year for each country. Here Year is a key dimension and GDP Growth a value dimension. We are then left with the Country dimension, which we can overlay using the .overlay() method.

In [30]:
%%opts Curve (color=Palette('Set3'))
gdp_curves = macro.to.curve('Year', 'GDP Growth')
gdp_curves.overlay('Country')
Out[30]:
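
What happens to the leftover Country dimension can be pictured as a simple grouping step. The plain-Python sketch below uses made-up rows and a hypothetical to_curves helper: the dimension that is not used by the Curve becomes the key of a mapping from country to curve samples.

```python
from collections import defaultdict

# (Year, Country, GDP Growth) rows with made-up values for illustration.
rows = [(1988, 'A', 1.0), (1988, 'B', 3.0),
        (1989, 'A', 2.0), (1989, 'B', 2.5)]

def to_curves(rows):
    """Group by Country; each group keeps its (Year, GDP Growth) samples."""
    curves = defaultdict(list)
    for year, country, growth in rows:
        curves[country].append((year, growth))
    return dict(curves)

curves = to_curves(rows)   # plays the role of the HoloMap keyed by Country
overlay = sorted(curves)   # the keys that would be drawn together in one plot
print(overlay, curves['A'])
```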

Now that we've extracted the gdp_curves, we can apply some operations to them. As in the simpler example above we will collapse the HoloMap of Curves using a number of functions to visualize the distribution of GDP Growth rates over time. First we find the mean curve with np.std as the spreadfn and cast the result to a Spread type, then we compute the min, mean and max curve in the same way and put them all inside an Overlay.

In [31]:
%%opts Overlay [bgcolor='w' legend_position='top_right'] Curve (color='k' linewidth=1) Spread (facecolor='gray' alpha=0.2)
hv.Spread(gdp_curves.collapse('Country', np.mean, np.std), label='std') *\
hv.Overlay([gdp_curves.collapse('Country', fn).relabel(name)(style=dict(linestyle=ls))
            for name, fn, ls in [('max', np.max, '--'), ('mean', np.mean, '-'), ('min', np.min, '--')]])
Out[31]:

Many HoloViews Element types support multiple kdims, including HeatMap, Points, Scatter, Scatter3D, and Bars. Bars in particular allows you to lay out your data in groups, categories and stacks. By supplying the index of a dimension as a plotting option you can choose whether it is laid out as groups of bars, categories within each group, or stacks. Here we choose to lay out the trade surplus of each country with groups for each year, no categories, and stacked by country. Finally, we choose to color the Bars for each item in the stack.

In [32]:
%opts Bars [bgcolor='w' aspect=3 fig_size=450 show_frame=False]
In [33]:
%%opts Bars [category_index=2 stack_index=0 group_index=1 legend_position='top' legend_cols=7 color_by=['stack']] (color=Palette('Dark2'))
macro.to.bars(['Country', 'Year'], 'Trade', [])
Out[33]:

This plot contains a lot of data, and so it's probably a good idea to focus on specific aspects of it, telling a simpler story about them. For instance, we can use the .select method to restrict the data to a few countries and years, and then customize the palettes (e.g. to use consistent colors per country across multiple analyses).

Palettes can be customized by selecting only a subrange of the underlying cmap to draw the colors from. The Palette draws samples from the colormap using the supplied sample_fn, which by default just draws linear samples but may be overridden with any function that draws samples in the supplied ranges. By slicing the Set1 colormap we draw colors only from the upper half of the palette and then reverse it.

In [34]:
%%opts Bars [padding=0.02 color_by=['group']] (alpha=0.6, color=Palette('Set1', reverse=True)[0.:.2])
countries = {'Belgium', 'Netherlands', 'Sweden', 'Norway'}
macro.to.bars(['Country', 'Year'], 'Unemployment').select(Year=(1978, 1985), Country=countries)
Out[34]:
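
The kind of filtering performed by the selection above can be sketched in plain Python. The select function below is a hypothetical stand-in operating on made-up rows: a tuple is treated as a range and a set as a membership test. Treating the range as half-open (upper bound excluded) is an assumption of this sketch, not a statement about the HoloViews method.

```python
def select(rows, columns, **criteria):
    """Keep rows matching every criterion: a (low, high) tuple is a range,
    a set is a membership test, and anything else must match exactly."""
    def matches(value, spec):
        if isinstance(spec, tuple):
            low, high = spec
            return low <= value < high   # half-open range: an assumption
        if isinstance(spec, (set, frozenset)):
            return value in spec
        return value == spec
    idx = {name: columns.index(name) for name in criteria}
    return [row for row in rows
            if all(matches(row[idx[n]], spec) for n, spec in criteria.items())]

# Made-up (Country, Year, Unemployment) rows for illustration.
cols = ('Country', 'Year', 'Unemployment')
rows = [('Belgium', 1977, 6.0), ('Belgium', 1978, 6.5),
        ('Sweden', 1984, 3.0), ('Sweden', 1985, 2.8),
        ('Japan', 1980, 2.0)]

subset = select(rows, cols, Year=(1978, 1985), Country={'Belgium', 'Sweden'})
print(subset)
```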

Many HoloViews Elements support multiple key and value dimensions. A HeatMap is indexed by two kdims, so we can visualize each of the economic indicators by year and country in a Layout. Layouts are useful for heterogeneous data you want to lay out next to each other.

Before we display the Layout let's apply some styling; we'll suppress the value labels applied to a HeatMap by default and replace them with a colorbar. Additionally we increase the number of xticks that are drawn and rotate them by 90 degrees to avoid overlapping. Flipping the y-axis ensures that the countries appear in alphabetical order. Finally we reduce some of the margins of the Layout and increase the size.

In [35]:
%opts HeatMap [show_values=False xticks=40 xrotation=90 aspect=1.2 invert_yaxis=True colorbar=True]
In [36]:
%%opts Layout [aspect_weight=1 fig_size=150 sublabel_position=(-0.2, 1.)]
hv.Layout([macro.to.heatmap(['Year', 'Country'], value)
           for value in macro.data.columns[2:]]).cols(2)
Out[36]: