Physical Analysis with ML

Created on Sat Mar 26 16:49:45 2016

@author: Jonas Eschle “Mayou36”

This module provides high-level functions, which often contain an essential part of a complete MVA. The functions are sometimes quite verbose, both in plotting and in printing, but always also return the important values.

raredecay.analysis.physical_analysis.final_training(real_data, mc_data, bkg_sel, clf='xgb', n_folds=10, columns=None, performance_only=True, metric_vs_cut='punzi', save_real_pred=False, save_mc_pred=False)[source]

Train on bkg and MC, test metric, performance and predict probabilities.

The goal of an MVA is to predict, for each event, a probability that can be used to make further cuts on the data sample and reduce the background.

There are two modes to run:
  • performance_only: train a clf K-folded on the background and the MC, predict, then create the ROC curve and plot a metric. This gives an idea of how well the classifier performs and helps to find the optimal cut-off value on the predictions.
  • prediction_mode (set performance_only to False): train a clf on the bkg and MC and predict, K-folded, the probabilities for all data (bkg, MC and the rest), without any event occurring in both the training set and the test set. If a name is given to save_mc_pred or save_real_pred, the predictions will be saved to the ROOT file the data was taken from.
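
The K-folded, out-of-fold prediction scheme of prediction_mode can be sketched as follows. This is an illustrative stand-in, not raredecay's implementation: a trivial distance-to-class-means score replaces the real clf (e.g. 'xgb'), and the data is a toy sample.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy 1D sample: events labelled bkg (0, as from bkg_sel) and MC/signal (1).
x = np.concatenate([rng.normal(0.0, 1.0, 5_000), rng.normal(1.0, 1.0, 5_000)])
y = np.concatenate([np.zeros(5_000), np.ones(5_000)])

n_folds = 10
pred = np.empty_like(x)
folds = np.array_split(rng.permutation(len(x)), n_folds)
for test_idx in folds:
    train = np.ones(len(x), dtype=bool)
    train[test_idx] = False
    # Hypothetical stand-in "classifier": score by position between the
    # two class means estimated from the training folds only.
    m_bkg = x[train & (y == 0)].mean()
    m_sig = x[train & (y == 1)].mean()
    # Each event gets its prediction from a model that never trained on it.
    pred[test_idx] = (x[test_idx] - m_bkg) / (m_sig - m_bkg)
```

The key property is that every event receives exactly one prediction, produced by a fold that excluded it from training.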
Parameters:
  • real_data (HEPDataStorage) – The real data
  • mc_data (HEPDataStorage) – The MC data (signal)
  • bkg_sel (str or [str]) – A string pointing to a column in the root-tree which tells whether an event belongs to the bkg (1) to train on or not (0). Typically something like (B_M > 5700).
  • clf (str or clf or dict, see make_clf()) – The classifier to be used.
  • n_folds (int > 1) – The number of folds to use for the training
  • columns (list(str, str, str,..)) – The columns to train on
  • performance_only (boolean) – If True, the function is run in performance mode and does not predict but only creates a ROC-curve and a metric-vs-cut.
  • metric_vs_cut (str {'punzi', 'precision'}) – The metric to test on the predictions.
  • save_real_pred (str or False) – If provided, the predictions of the real data will be saved to its root-tree with the branch name specified here.
  • save_mc_pred (str or False) – If provided, the predictions of the MC will be saved to its root-tree with the branch name specified here.
Returns:

out – If a metric_vs_cut is specified, the best cut and the corresponding metric value are returned

Return type:

float, float
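
As a sketch of the metric-vs-cut scan in performance mode: scan cut values on the predicted probabilities and evaluate a figure of merit at each cut. Here the Punzi FOM is taken as eff_sig / (n_sigma/2 + sqrt(B)) with n_sigma = 5, which is a common convention; raredecay's exact definition of 'punzi' may differ, and the predictions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy predicted probabilities: signal (MC) peaks near 1, background near 0.
sig_pred = rng.beta(5, 2, size=10_000)
bkg_pred = rng.beta(2, 5, size=100_000)

def punzi_fom(cut, sig_pred, bkg_pred, n_sigma=5.0):
    """Punzi figure of merit: eps_sig / (n_sigma/2 + sqrt(B))."""
    eff_sig = np.mean(sig_pred > cut)   # signal efficiency of the cut
    n_bkg = np.sum(bkg_pred > cut)      # background events surviving the cut
    return eff_sig / (n_sigma / 2.0 + np.sqrt(n_bkg))

cuts = np.linspace(0.0, 1.0, 201)
foms = np.array([punzi_fom(c, sig_pred, bkg_pred) for c in cuts])
best = int(np.argmax(foms))
best_cut, best_fom = cuts[best], foms[best]
```

In real use, sig_pred and bkg_pred would be the K-folded predictions from the chosen clf rather than toy numbers.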

raredecay.analysis.physical_analysis.reweight(apply_data, real_data=None, mc_data=None, columns=None, reweighter='gb', reweight_cfg=None, n_reweights=1, apply_weights=True)[source]

(Train a reweighter and) apply the reweighter to get new weights.

Train a reweighter on the differences between the real data and the corresponding MC. Then correct the apply data (also MC) in the same way the training MC would have been corrected to look like its real counterpart.

Parameters:
  • apply_data (HEPDataStorage) – The data which shall be corrected
  • real_data (HEPDataStorage) – The real data to train the reweighter on
  • mc_data (HEPDataStorage) – The MC data to train the reweighter on
  • columns (list(str, str, str,..)) – The branches to use for the reweighting process.
  • reweighter ({'gb', 'bins'} or trained hep_ml-reweighter (also pickled)) – Either a string specifying which reweighter to use or an already trained reweighter from the hep_ml-package. The reweighter can also be a file-path (str) to a pickled reweighter.
  • reweight_cfg (dict) – A dict containing all the keywords and values you want to specify as parameters to the reweighter.
  • n_reweights (int) – To get more stable weights, the mean of each weight over many reweighting runs (training and predicting) can be used. The n_reweights specifies how many runs to do.
  • apply_weights (boolean) – If True, the weights will be added to the data directly, therefore the data-storage will be modified.
Returns:

out – Return a dict containing the weights as well as the reweighter. The keywords are:

  • reweighter : The trained reweighter
  • weights : pandas Series containing the new weights of the data.

Return type:

dict
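
A minimal sketch of what a 'bins'-style reweighter does, in one dimension (the hep_ml implementations are multi-dimensional and more refined): compute the per-bin real/MC ratio on the training samples, then look it up for each apply_data event. All data here are toys.

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.2, 1.1, size=50_000)        # real-data branch
mc = rng.normal(0.0, 1.0, size=50_000)          # MC used for training
apply_data = rng.normal(0.0, 1.0, size=20_000)  # independent MC to correct

edges = np.linspace(-4.0, 4.0, 41)
hist_real, _ = np.histogram(real, bins=edges, density=True)
hist_mc, _ = np.histogram(mc, bins=edges, density=True)

# Per-bin correction factor real/MC (1.0 where the MC histogram is empty).
ratio = np.divide(hist_real, hist_mc,
                  out=np.ones_like(hist_real), where=hist_mc > 0)

# Find the bin of every apply_data event and assign its weight.
idx = np.clip(np.digitize(apply_data, edges) - 1, 0, len(ratio) - 1)
weights = ratio[idx]
```

After reweighting, the weighted apply_data distribution approximates the real one within the binned range.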

raredecay.analysis.physical_analysis.reweightCV(real_data, mc_data, columns=None, n_folds=10, reweighter='gb', reweight_cfg=None, n_reweights=1, scoring=True, score_columns=None, n_folds_scoring=10, score_clf='xgb', mayou_score=False, extended_train_similar=False, apply_weights=True)[source]

Reweight data MC/real in a K-Fold way to unbias the reweighting.

Sophisticated reweighting algorithms can be quite sensitive to their hyperparameters. Therefore, it is good to get an estimate of the reweighting quality by reweighting the data itself and “testing” it (comparing how similar the reweighted MC is to the real data). In order to get an unbiased reweighting, a KFolding procedure is applied:

  • the reweighter is trained on (n-1)/n of the data and predicts the weights for the remaining 1/n. This is done n times, resulting in unbiased weights for the MC data.

To know how well the reweighter worked, different strategies can be used and are implemented; for further information see also: mayou36.bitbucket.org/presenation/reweightingCV_quality_measure.pdf

Parameters:
  • real_data (HEPDataStorage) – The real data
  • mc_data (HEPDataStorage) – The MC data
  • columns (list(str, str, str, ...)) – The branches to use for the reweighting.
  • n_folds (int) – Number of folds to split the data into for the reweighting. Usually, the higher the better.
  • reweighter (str {'gb', 'bins'}) – Which reweighter to use, either the Gradient Boosted reweighter or the (normally used) Bins reweighter (both from hep_ml)
  • reweight_cfg (dict) – A dict containing all the keyword arguments for the configuration of the reweighters.
  • n_reweights (int) – As the reweighting often yields different weights depending on random parameters like the splitting of the data, the new weights can be produced by taking the average of the weights over many reweighting runs. n_reweights is the number of reweight runs to average over.
  • scoring (boolean) –

    If True, the data is not only reweighted with KFolding but also several scoring metrics are tested.

    • Data-ROC : The data (reweighted MC and real, mixed) is split into KFolds; a classifier is trained on the training folds and tested on the test fold. This is done K times and the ROC curve is evaluated. It is basically a good measure of how well the two datasets can be distinguished, but it can be “overfitted”: a few overly large single weights can push the AUC significantly below 0.5, so it is a good indication, but not on its own a sufficient quality measure for the reweighter hyper-parameter search.
    • mcreweighted_as_real : the reweighter is trained on (n-1)/n of the data and the last 1/n is then reweighted (as described above). We can train a classifier on the MC (not reweighted) as well as on the real data (so a classifier which “distinguishes” between MC and real) and predict:
      • (not in training used) mc (not reweighted) and label it as if it were real data.
      • (not in training used) mc reweighted and label it as if it were real data.
      • (not in training used) real data and label it real.

      Then we look at the tpr (we cannot look at the ROC, as we only inserted one class of labels: real), i.e. at “how many of the inserted datapoints did the classifier predict as real?”.

      The score for the real data should be the highest and the one for the non-reweighted MC the lowest; the reweighted one should (most probably) lie somewhere in between. The goal is not to maximise the tpr for the reweighted MC by tuning the reweighter hyper-parameters, as large single weights (which occur when overfitting) increase the tpr drastically.

    • train_similar : Probably the most stable score for finding the gb-reweighter hyper-parameters. The data is split into KFolds and a classifier is trained on the reweighted MC together with the real data; it then predicts the (not yet seen) real data. The more of it the classifier can predict as real, the more it was able to learn from the differences between the datasets. This score cannot overfit the same way as the one above: a single, high weight distorts the MC distribution so badly that the classifier will predict nearly every real event as real (only the single point with the high weight is predicted as MC, the rest as real).
  • score_columns (list(str, str, str,..)) – The columns to use for the scoring. They should not be the same as for the reweighting in order to unbias the score. It is usually a good idea to use the same branches as will be used for the selection training later on.
  • n_folds_scoring (int > 1) – The number of folds to split the data into for the scoring described above.
  • score_clf (str or dict or clf) – The classifier to use for the scoring. For an overview of what can be used, see make_clf().
  • mayou_score (boolean) – If True, the experimental mayou_score will be generated.
  • extended_train_similar (boolean) – If True, an experimental score will be generated.
  • apply_weights (boolean) – If True, set the new weights to the MC data in place. This changes the weights in the data-storage.
Returns:

out – The output is a dictionary containing the different scores and/or the new weights. The keywords are:

  • weights : pandas Series containing the new weights
  • mcreweighted_as_real_score : The scores of this method in a dict
  • train_similar : The scores of this method in a dict
  • roc_auc_score : The scores of this method in a dict

Return type:

dict
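
The KFolding scheme described above can be sketched as follows, with a simple 1D histogram ratio as a stand-in for the 'gb'/'bins' reweighters (this is illustrative, not raredecay's implementation): each held-out MC fold gets its weights from a reweighter that never saw it in training.

```python
import numpy as np

rng = np.random.default_rng(2)
real = rng.normal(0.3, 1.0, size=30_000)
mc = rng.normal(0.0, 1.0, size=30_000)
n_folds = 10

def bins_weights(train_real, train_mc, test_mc, edges):
    """Histogram-ratio reweighter trained on (train_real, train_mc)."""
    h_real, _ = np.histogram(train_real, bins=edges, density=True)
    h_mc, _ = np.histogram(train_mc, bins=edges, density=True)
    ratio = np.divide(h_real, h_mc, out=np.ones_like(h_real), where=h_mc > 0)
    idx = np.clip(np.digitize(test_mc, edges) - 1, 0, len(ratio) - 1)
    return ratio[idx]

edges = np.linspace(-4.0, 4.0, 41)
weights = np.empty_like(mc)
folds_mc = np.array_split(rng.permutation(len(mc)), n_folds)
folds_real = np.array_split(rng.permutation(len(real)), n_folds)
for test_mc_idx, test_real_idx in zip(folds_mc, folds_real):
    train_mc = np.ones(len(mc), dtype=bool)
    train_mc[test_mc_idx] = False
    train_real = np.ones(len(real), dtype=bool)
    train_real[test_real_idx] = False
    # Each held-out MC fold is weighted by a reweighter trained without it.
    weights[test_mc_idx] = bins_weights(real[train_real], mc[train_mc],
                                        mc[test_mc_idx], edges)
```

The resulting out-of-fold weights are what the scoring strategies above are evaluated on.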

raredecay.analysis.physical_analysis.feature_exploration(original_data, target_data, features=None, n_folds=10, clf='xgb', roc_auc='single', extended_report=True)[source]

Explore the features by getting the roc auc and their feature importance.

An essential part is to get a rough idea of how discriminating the features are. A classifier is trained on each single feature and on all features together; correlations and feature importances are plotted if wanted.

Parameters:
  • original_data (HEPDataStorage) – One dataset
  • target_data (HEPDataStorage) – The other dataset
  • features (list(str, str, str,..)) – The features/branches/columns to explore
  • n_folds (int > 1) – Number of folds to split the data into to do some training/testing and get an estimate for the feature importance.
  • clf (str or dict or clf, see: make_clf()) – The classifier you want to use.
  • roc_auc ({'single', 'all', 'both'} or False) –

    Whether to make a training/testing with:

    • every single feature (-> n_feature times KFolded training)
    • all features together (-> one KFolded training)
    • both of the above
    • None of them (-> use False)
  • extended_report (boolean) – If True, an extended report will be made including feature importance and more.
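
The roc_auc='single' mode essentially asks how well each feature alone separates the two samples. As an illustrative stand-in (no KFolded classifier, unlike raredecay): for a single feature, using the raw value as the score gives the ROC AUC directly via the Mann-Whitney rank statistic. The column names are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy datasets: one discriminating feature, one useless one.
original = {"B_PT": rng.exponential(1.0, 5_000),
            "B_ETA": rng.normal(3.0, 1.0, 5_000)}
target = {"B_PT": rng.exponential(1.5, 5_000),
          "B_ETA": rng.normal(3.0, 1.0, 5_000)}

def rank_auc(x0, x1):
    """ROC AUC of the raw feature used as score (Mann-Whitney statistic)."""
    scores = np.concatenate([x0, x1])
    ranks = scores.argsort().argsort() + 1.0  # 1-based ranks (no ties here)
    n0, n1 = len(x0), len(x1)
    auc = (ranks[n0:].sum() - n1 * (n1 + 1) / 2.0) / (n0 * n1)
    return max(auc, 1.0 - auc)  # separation in either direction counts

aucs = {feat: rank_auc(original[feat], target[feat]) for feat in original}
```

An AUC near 0.5 marks a feature with no discriminating power; values clearly above 0.5 mark features worth keeping for the training.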
raredecay.analysis.physical_analysis.add_branch_to_rootfile(filename, treename, new_branch, branch_name, overwrite=True)[source]

Add a branch to a given ROOT-Tree.

Add some data (new_branch) to the ROOT file (filename), into its tree (treename), under the branch (branch_name).

Parameters:
  • filename (str) – The name of the file (and its path)
  • treename (str) – The name of the tree to save the data in
  • new_branch (array-like) – The data to add to the root-file
  • branch_name (str) – The name of the branch the data will be written to. This can be either a new branch or an already existing one, which will then be overwritten. No “friend” will be created.
  • overwrite (boolean) – NOT IMPLEMENTED!