Data Storage

Created on Thu Apr 7 22:10:29 2016

@author: Jonas Eschle “Mayou36”

This module contains the data handling. Its main part is the HEPDataStorage class, which takes data, weights, targets and names, and provides automatic conversion, plotting and more.

class raredecay.tools.data_storage.HEPDataStorage(data, index=None, target=None, sample_weights=None, data_name=None, data_name_addition=None, column_alias=None)

Bases: object

Data-storage for data, weights, targets; conversion; plots and more

Initialize instance and load data.

Parameters:
  • data (root-tree dict (make_root_dict()) or DataFrame) –

    The data itself. This can be one of two types:

    • root-tree dict (root-dict): Dictionary which specifies all the information needed to convert a root-tree to an array. It is passed directly to root2rec().
    • pandas DataFrame: A pandas DataFrame. The index (if not explicitly defined) and column names will be taken from it.
  • index (1-D array-like) – The indices of the data that will be used.
  • target (list or 1-D array or int {0, 1}) – Labels the data for the machine learning. Usually the y.
  • sample_weights (Series or array or int {1} or str/dict for root-trees (make_root_dict())) – The new weights for the dataset. If the new weights are a pandas Series, the index must match the internal index. If the data is a root-tree file, a string (naming the branch) or a whole root-dict can be given, pointing to the stored weights.
    Note: If None or 1 is specified, a weight of 1 is assumed for every entry.
  • data_name (str) –
    Name of the data, human-readable. Displayed in the title of plots.
    Example: ‘Bu2K1piee mc’, ‘beta-decay real data’ etc.
  • data_name_addition (str) –
    Additional remarks to the data, human readable. Displayed in the title of plots.
    Example: ‘reweighted’, ‘shuffled’, ‘5 GeV cut applied’ etc.
  • column_alias (dict{str: str, str: str, ...}) – To change the name of a branch. The argument should be a dict looking like {‘current_branch_name_in_root_tree/DataFrame’: ‘desired_name’}. The current_branch has to exist in the root-tree or DataFrame, the desired_name can be anything.
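The effect of a column_alias mapping can be pictured with a plain dict of columns. The following is a minimal sketch only: the helper apply_column_alias and the branch names are hypothetical and not part of raredecay, which applies the alias to root-trees or DataFrames internally.

```python
def apply_column_alias(columns, column_alias):
    """Return a new dict of columns with the aliased names applied.

    Hypothetical helper for illustration only; not part of raredecay.
    """
    missing = [name for name in column_alias if name not in columns]
    if missing:
        # The current branch name has to exist in the data.
        raise KeyError("Unknown columns in alias: {}".format(missing))
    # Columns without an alias keep their current name.
    return {column_alias.get(name, name): values
            for name, values in columns.items()}


# Made-up branch names and values.
data = {"B_PT": [1.2, 3.4], "B_M": [5279.0, 5280.5]}
renamed = apply_column_alias(data, {"B_PT": "pt"})
```

Here renamed holds the same values under the keys pt and B_M, while an alias for a non-existent branch raises an error.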
columns

The columns/branches of the data

copy_storage(columns=None, index=None, add_to_name=' cp')

Return a copy of self (with only some of the columns, indices etc).

Parameters:
  • columns (str or list(str, str, str, ..)) – The columns which will be in the new storage.
  • index (list or array) – The indices of the rows (and corresponding weights, targets etc.) to use for the new storage.
  • add_to_name (str) – An addition to the data_name_addition of the copy.
data

Return the data as is without conversion, e.g. a root-dict, pandas DataFrame etc.

data_name

The name of the data.

data_name_addition

The data name addition.

data_type

Return the data-type like ‘root’, ‘df’ etc.

fold_name

The name of the fold (like fold 2 of 5).

get_LabeledDataStorage(columns=None, index=None, shuffle=False)

Create and return an instance of class “LabeledDataStorage” from the REP repository.

Parameters:
  • columns (str or list(str, str, str, ..)) – The columns to use for the LabeledDataStorage.
  • index (list or array) – The index of the data to use.
  • shuffle (boolean) – Argument is passed to the LabeledDataStorage. If True, the data will be shuffled.
Returns:

out – Return a Labeled Data Storage instance created with the data from inside this instance.

Return type:

LabeledDataStorage instance

get_fold(fold)

Return the specified fold: train and test data as instance of HEPDataStorage.

Parameters:fold (int) – The number of the fold to return. From 0 to n_folds - 1
Returns:out – Return the train and the test data in a HEPDataStorage
Return type:tuple(HEPDataStorage, HEPDataStorage)
get_index()

Return the index used inside the DataStorage. Advanced feature.

get_n_folds()

Return how many folds are currently available or 0 if no folds have been created.

Returns:out – The number of folds which are currently available.
Return type:int
get_name()

Return the human-readable name of the data as a string.

get_targets(index=None)

Return the targets of the data as a pandas Series.

get_weights(index=None, normalize=True, **kwargs)

Return the weights of the specified indices or, if None, return all.

Parameters:
  • normalize (boolean or float > 0) – If True, the weights will be normalized to 1 (the mean is 1). If a float is provided, the mean of the weights will be equal to normalize. So True and 1 will yield the same results.
  • index (list or array) – The index of the data to use.
Returns:

out – Return the weights as pandas Series

Return type:

1-D pandas Series
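The normalize argument rescales the weights so that their mean takes a given value. A minimal sketch of that arithmetic on a plain list follows; normalize_weights is a hypothetical helper, and treating normalize=False as "leave the weights unchanged" is an assumption, since that case is not described above.

```python
def normalize_weights(weights, normalize=True):
    """Scale weights so that their mean equals the requested value.

    Hypothetical helper; it only mirrors the documented meaning of
    `normalize`, not the library's implementation.
    """
    if normalize is False:
        # Assumption: False leaves the weights untouched.
        return list(weights)
    # True means a mean of 1; a float is used as the target mean.
    target_mean = 1.0 if normalize is True else float(normalize)
    current_mean = sum(weights) / len(weights)
    return [w * target_mean / current_mean for w in weights]


weights = [0.5, 1.5, 4.0]
unit_mean = normalize_weights(weights)               # mean becomes 1
scaled = normalize_weights(weights, normalize=2.5)   # mean becomes 2.5
```

As stated in the parameter description, passing True and passing 1 lead to the same result.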

index

The internal index

make_dataset(second_storage=None, index=None, index_2=None, columns=None, weights_ratio=0, shuffle=False, targets_from_data=False)

Create data, targets and weights of the instance (and another one).

In machine-learning, it is very often required to have data, its targets (or labeling, the ‘y’) and the weights. In most cases, we are not only interested in one such set, but need to concatenate it with other data (for example signal and background).

This is exactly what make_dataset does.
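To make the idea concrete, here is a minimal sketch of concatenating data, targets and weights from two sources. It does not use raredecay: each storage is replaced by a plain dict, and concatenate_storages as well as the sample values are hypothetical.

```python
def concatenate_storages(first, second=None):
    """Concatenate data, targets and weights of one or two storages.

    Hypothetical stand-in: each storage is a plain dict with the keys
    'data', 'targets' and 'weights' (lists of equal length).
    """
    storages = [first] if second is None else [first, second]
    data, targets, weights = [], [], []
    for storage in storages:
        data.extend(storage["data"])
        targets.extend(storage["targets"])
        weights.extend(storage["weights"])
    return data, targets, weights


# Signal is labeled 1, background is labeled 0 (made-up values).
signal = {"data": [[0.1, 2.0], [0.3, 1.5]], "targets": [1, 1], "weights": [1.0, 0.8]}
background = {"data": [[0.9, 0.2]], "targets": [0], "weights": [1.2]}
data, targets, weights = concatenate_storages(signal, background)
```

The three returned lists stay aligned row by row, which is the form most classifiers expect as X, y and sample weights.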

Parameters:
  • second_storage (instance of HEPDataStorage) – A second data-storage. If provided, the data/targets/weights will be concatenated and returned as one.
  • index (list or array) – The index for the calling (the first) storage instance. The index of the data to use.
  • index_2 (list(int, int, int, ..)) – The index for the (optional) second storage instance. The index of the data to use.
  • columns (list(str, str, str, ..)) – The columns to be used of both data-storages.
  • weights_ratio (float >= 0) –

    The (relative) normalization. If a second data storage is provided it is assumed (will be changed in future ?) that the two storages can be seen as the two different targets. If zero, nothing happens. If it is bigger than zero, it represents the ratio of the sum of the weights from the first to the second storage. If set to 1, they both are equally weighted. If no second storage is provided, it is the normalization of the storage called.

    Ratio := sum(weights_1) / sum(weights_2) (with a second storage)
    Ratio := sum(weights_1) / mean(weights_1) (without a second storage)

  • shuffle (boolean or int) – If True or int, the dataset will be shuffled before returned. If an int is provided, it will be used as a seed to the pseudo-random generator.
  • targets_from_data – OUTDATED, do not use it. Use two data storages instead, one labeled 0 and one labeled 1.
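For the case with a second storage, the weights_ratio normalization can be sketched as follows. This is an illustration under assumptions, not the library's code: apply_weights_ratio is a hypothetical helper, and the choice to rescale only the first set of weights is made up for the example.

```python
def apply_weights_ratio(weights_1, weights_2, weights_ratio=0):
    """Rescale weights so that sum(weights_1) / sum(weights_2) == weights_ratio.

    Hypothetical helper. Assumption: only the first set of weights is
    rescaled; the documentation above does not say which one is changed.
    """
    if weights_ratio < 0:
        raise ValueError("weights_ratio has to be >= 0")
    if weights_ratio == 0:
        # Zero means: nothing happens.
        return list(weights_1), list(weights_2)
    scale = weights_ratio * sum(weights_2) / sum(weights_1)
    return [w * scale for w in weights_1], list(weights_2)


signal_w, background_w = apply_weights_ratio([1.0, 1.0, 2.0], [0.5, 0.5],
                                             weights_ratio=1)
```

With weights_ratio=1 both samples end up with the same total weight, independent of how many entries each one has.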
make_folds(n_folds=10, shuffle=True)

Create shuffled train-test folds which can be accessed via get_fold().

Split the data into n folds (for usage in KFold validation etc.). Every fold then consists of a train dataset, which contains (n-1)/n of the data, and a test dataset, which contains the remaining 1/n of the data. The folds will be created as HEPDataStorage. To get a certain fold (train-test pair), use get_fold().

Parameters:
  • n_folds (int > 1) – The number of folds to be created from the data. If you want, for example, a simple 2/3-1/3 split, just specify n_folds = 3 and just take one fold.
  • shuffle (boolean or int) – If True or int, shuffle the data before slicing.
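The train-test split described above can be sketched on a bare index. The function below is a hypothetical illustration and not the library's implementation: the strided slicing and the use of an int as random seed are assumptions.

```python
import random


def make_index_folds(index, n_folds=10, shuffle=True):
    """Return a list of (train_index, test_index) pairs, one per fold.

    Hypothetical sketch working on a plain index instead of a storage.
    """
    if n_folds < 2:
        raise ValueError("n_folds has to be larger than 1")
    index = list(index)
    if shuffle is not False:
        # Assumption: an int is used as the seed, True shuffles unseeded.
        seed = None if shuffle is True else shuffle
        random.Random(seed).shuffle(index)
    # Test slice i takes every n_folds-th element, starting at position i.
    test_slices = [index[i::n_folds] for i in range(n_folds)]
    folds = []
    for i, test in enumerate(test_slices):
        # The train part is everything that is not in the test slice.
        train = [idx for j, part in enumerate(test_slices) if j != i
                 for idx in part]
        folds.append((train, test))
    return folds


folds = make_index_folds(range(9), n_folds=3, shuffle=42)
train, test = folds[0]  # a simple 2/3-1/3 split: take just one fold
```

Every element appears in exactly one test slice, so the test parts of all folds together cover the whole data once.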
name

Return the full human-readable name of the data as a string.

pandasDF(columns=None, index=None)

Return a pandas DataFrame representation of the data.

Parameters:
  • columns (str) – Arguments for the root2rec() ls function.
  • index (list or array) – The index of the data to use.
plot(figure=None, columns=None, index=None, title=None, data_name=None, bins=None, log_y_axes=False, plot_range=None, sample_weights=None, importance=3, see_all=False, hist_settings=None)

Draw histograms of the data.

Warning

Only 99.98% of the newest plotted data will be shown to focus on the essential parts (the axis limits will be set accordingly). This implies a risk of cutting off data plotted previously in the same figure (mostly if the datasets do not overlap much). To ensure that all data is plotted, set see_all to True.

Parameters:
  • figure (str or int) – The name of the figure. If the figure already exists, the plots will be plotted in the same window (can be intentional, for example to compare data)
  • columns (str or list(str, str, str, ..)) – The columns of the data to be plotted. If None, all are plotted.
  • index (list or array) – The index of the data to use.
  • title (str) –
    The title of the whole plot (NOT of the subplots). If several titles for the same figures are given, they will be concatenated.
    So for a “simple” title, specify the title only once.
  • data_name (str) –
    Additional (to the data_name and data_name_addition) human-readable name for the legend.
    Examples: “before cut”, “after cut” etc.
  • bins (int) – Number of bins to plot.
  • log_y_axes (boolean) – If True, the y-axes will be scaled logarithmically.
  • plot_range (tuple (float, float) or None) – The lower and upper range of the bins. If None, 99.98% of the data will be plotted automatically.
  • sample_weights (pandas Series) – The weights for the data, which determine how “high” a bin is, i.e. how much each entry counts towards the whole distribution or how “often” it occurs. If None is specified, the weights are taken from the data.
  • importance (int {0, 1, 2, 3, 4, 5}) – The higher the importance, the more likely the output will be printed. All output will be saved anyway if an output path was initialized.
  • see_all (boolean) – If True, all data (not just 99.98%) will be plotted.
  • hist_settings (dict) – A dictionary containing the settings as keywords for the hist() function.
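The 99.98% window mentioned in the warning and in plot_range can be pictured as a quantile cut on the data. The sketch below is an assumption-laden illustration: central_range is a hypothetical helper, and cutting the same fraction from both tails is an assumption, as the exact rule used by plot() is not stated here.

```python
def central_range(values, fraction=0.9998):
    """Return (low, high) limits enclosing roughly the central `fraction`.

    Hypothetical helper. Assumption: the same share of entries is cut
    from the lower and from the upper tail.
    """
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - fraction) / 2.0
    # Position of the lower limit in the sorted data, nearest rank.
    low_pos = int(round(tail * (n - 1)))
    high_pos = (n - 1) - low_pos
    return ordered[low_pos], ordered[high_pos]


# 100000 entries between 0 and 1 plus a single far outlier.
values = [i / 100000 for i in range(100000)] + [1e6]
low, high = central_range(values)
```

The outlier at 1e6 falls outside the returned limits, so a histogram drawn in this range stays focused on the bulk of the data, which is the behaviour that see_all=True switches off.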
plot2Dhist(x_columns, y_columns=None)

Plot a 2D hist of x_columns vs itself or y_columns.

Warning

This can produce a lot of plots (number of x_columns times number of y_columns)!

Parameters:
  • x_columns (list(str, str, str,..)) – The x columns to plot against
  • y_columns (list(str, str, str,..)) – The y columns to plot against
plot2Dscatter(x_branch, y_branch, dot_scale=20, color='b', figure=None)

Plot two columns against each other to see the distribution.

The dot size is proportional to the weights, giving a good overview of both the data and the weights.

Parameters:
  • x_branch (str) – The x column to plot
  • y_branch (str) – The y column to plot
  • dot_scale (int or float) – The overall scaling factor for the dots
  • color (str) – A valid (matplotlib.pyplot-compatible) color
  • figure (str or int or figure) – The figure to be plotted in
Returns:

out – Return the figure

Return type:

figure

plot_correlation(second_storage=None, figure=None, columns=None, method='pearson', plot_importance=5)

Warning

does not support weights. Maybe in the future.

Plot the feature correlation for the data (combined with other data)

Calculate the feature correlations, plot them and return them.

Parameters:
  • second_storage (HEPDataStorage or None) – If a second data-storage is provided, the data will be merged and then the correlation will be calculated. Otherwise, only this data's correlation will be calculated and plotted.
  • method (str {'pearson', 'kendall', 'spearman'}) – The method to calculate the correlation.
  • plot_importance (int {1, 2, 3, 4, 5}) – The higher the value, the more likely the correlation gets plotted. Depends on the plot_verbosity. To make sure the correlation
    • does get plotted, choose 5
    • does not get plotted, choose 1
Returns:

out – Return the feature-correlations in a pandas DataFrame

Return type:

pandas DataFrame
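As a reminder of what the default ‘pearson’ method computes, the following sketch builds an unweighted correlation matrix from plain lists. It is an illustration only: pearson and correlation_matrix are hypothetical helpers, the feature names and values are made up, and the result is a nested dict rather than the DataFrame returned by plot_correlation.

```python
import math


def pearson(x, y):
    """Unweighted Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict {name_a: {name_b: correlation}}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Made-up feature columns.
features = {
    "pt": [1.0, 2.0, 3.0, 4.0],
    "eta": [2.0, 4.0, 6.0, 8.0],
    "phi": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(features)
```

In this toy data pt and eta are perfectly correlated while pt and phi are perfectly anti-correlated; like the method documented above, the sketch ignores weights.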

plot_parallel_coordinates(columns=None, figure=0, second_storage=None)

Plot the parallel coordinates.

Warning

No weights supported so far!

set_data(data, index=None, columns=None, column_alias=None)

Set the data and also change index and columns.

Parameters:
  • data (root-tree dict (make_root_dict()) or DataFrame) – The new data
  • index (list or array) – The index of the data to use.
  • columns (list(str, str, str,..)) – The columns for the data to use
  • column_alias (dict{str: str, str: str, ...}) – To change the name of a branch. The argument should be a dict looking like {‘current_branch_name_in_root_tree/DataFrame’: ‘desired_name’}. The current_branch has to exist in the root-tree or DataFrame, the desired_name can be anything.
set_root_selection(selection, exception_if_failure=True)

Set the selection in a root-file. Only possible if a root-file is provided.

set_targets(targets, index=None)

Set the targets of the data. Either an array-like object or {0, 1}.

set_weights(sample_weights, index=None)

Set the weights of the sample.

Parameters:
  • sample_weights (Series or array or int {1} or str/dict for root-trees (make_root_dict())) – The new weights for the dataset. If the new weights are a pandas Series, the index must match the internal index. If the data is a root-tree file, a string (naming the branch) or a whole root-dict can be given, pointing to the stored weights.
  • index (1-D array or list or None) – The indices for the weights to be set. Only the index given will be set/used as weights.