Reweighting with the GBReweighter

To reweight the MC to make it more similar to the real data, one can simply bin certain variables and divide the histograms. A more sophisticated approach splits the event-space with decision trees and performs a boosted reweighting. For further information, also see: http://arogozhnikov.github.io/2015/10/09/gradient-boosted-reweighter.html

In this notebook, the GBReweighter from the hep_ml package will be used.
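
For orientation, the bare hep_ml API that is wrapped here looks roughly like this (a minimal, self-contained sketch with toy data; the reweightCV and reweight helpers used below take care of this internally):

import numpy as np
import pandas as pd
from hep_ml.reweight import GBReweighter

# toy stand-ins for MC ("original") and real data ("target")
mc_df = pd.DataFrame(np.random.normal(0.0, 1.0, size=(1000, 2)), columns=['two', 'three'])
real_df = pd.DataFrame(np.random.normal(0.1, 1.1, size=(1000, 2)), columns=['two', 'three'])

reweighter = GBReweighter(n_estimators=25, max_depth=3, min_samples_leaf=200)
reweighter.fit(original=mc_df, target=real_df)  # learn the MC -> real density ratio
new_weights = reweighter.predict_weights(mc_df)  # multiplicative per-event weights for the MC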

Sensible Hyper-parameters

The gradient boosted reweighting approach has its pros but also its cons. The most delicate point is the search for good hyper-parameters: the method is highly sensitive to them, and over-fitting occurs very quickly. See also this presentation: NOT-YET-UPLOADED.

So we will have two steps:

  • find good hyper-parameters
  • reweight the data with the hyper-parameters found above

Let's start with some data initialization (usually, you want to initialize via settings.initialize() beforehand; see also the IO_handling_and_settings-HowTo).

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd

from raredecay.tools.data_storage import HEPDataStorage
from raredecay.analysis.physical_analysis import reweightCV, reweight, add_branch_to_rootfile


# create two random datasets: 1000 rows, 5 columns each
rows = 1000
data1 = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])

data1 = HEPDataStorage(data1,
                       sample_weights=abs(np.random.normal(size=rows)),
                       data_name="random real data")

# second DataStorage
data2 = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])

data2 = HEPDataStorage(data2,
                       sample_weights=abs(np.random.normal(size=rows)),
                       data_name="random MC")

Configuring the reweighter

We will later call a single function that does everything, so it is good practice to define some variables beforehand, such as the hyper-parameters of the reweighter and the columns you want to train the reweighter on.

In [3]:
# we collect all parameters for the reweighter (in this case we use the GBReweighter) in a dict
reweight_cfg = dict(  # GBReweighter configuration; comments give "good" values
        n_estimators=23,  # 20-25
        max_depth=3,  # 3-6 or number of features
        learning_rate=0.1,  # 0.1
        min_samples_leaf=200,  # 200
        # loss_regularization=7,  # 3-8  # probably not yet implemented, only in the newest GitHub version
        gb_args=dict(  # arguments passed to the underlying gradient boosting
            subsample=0.8,  # 0.8
            #random_state=43,
            min_samples_split=200  # 200
            )
        )

# the same goes for the branches we want to use; to use all of them, simply pass None
branches = ['two', 'three']
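
This dict maps directly onto the GBReweighter constructor, with gb_args forwarded to the underlying gradient boosting; presumably the raredecay functions build something equivalent to the following internally:

from hep_ml.reweight import GBReweighter

reweighter = GBReweighter(**reweight_cfg)  # gb_args is passed through to the gradient boosting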

Test reweighting quality

To judge whether these hyper-parameters are good, we first reweight the data itself in a KFold manner and compute several scores that quantify how well we reweighted.
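
Conceptually, the KFold reweighting assigns each MC event its weight from a reweighter that never saw this event during training, which keeps the scores honest with respect to over-fitting. A minimal illustration of the idea with toy data (reweightCV additionally folds the real data and computes the scores for you):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from hep_ml.reweight import GBReweighter

mc = pd.DataFrame(np.random.normal(0.0, 1.0, size=(1000, 2)), columns=['two', 'three'])
real = pd.DataFrame(np.random.normal(0.1, 1.1, size=(1000, 2)), columns=['two', 'three'])

weights = np.ones(len(mc))
for train_idx, test_idx in KFold(n_splits=3, shuffle=True).split(mc):
    rw = GBReweighter(n_estimators=23, max_depth=3, min_samples_leaf=200)
    rw.fit(original=mc.iloc[train_idx], target=real)  # train on two folds of the MC
    weights[test_idx] = rw.predict_weights(mc.iloc[test_idx])  # weight the held-out fold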

In [4]:
output_cv = reweightCV(real_data=data1, mc_data=data2,
                       n_folds=3,  # the folds to use for the reweighting, ~10
                       reweighter='gb',  # Default value. Could also be 'bins'
                       reweight_cfg=reweight_cfg,  # A dict containing the config for the reweighter
                       columns=branches,  # The branches to use for the reweighting
                       scoring=True,  # Whether to obtain scores or not
                       n_folds_scoring=5,  # In how many folds to split the data for certain scorings
                       score_clf='xgb',  # Which clf to use. For more options, also see ml_analysis.make_clf()
                       apply_weights=True  # Adds the weights directly to the mc_data
                       )
/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
=================
Reweighting Kfold
=================
Doing reweighting_Kfold with 3 folds




Training the reweighter
=======================
Reweighter: GBReweighter with config: 
  n_estimators : 23

  gb_args
^^^^^^^^^
    min_samples_split : 200
    subsample : 0.8
  learning_rate : 0.1
  max_depth : 3
  min_samples_leaf : 200 
Data used:
 random test-data2  cp train set fold 1 of 3  and  random test-data1  cp train set fold 1 of 3 
columns used for the reweighter training:
 ['two', 'three']


Using the reweighter:
GBReweighter(gb_args={'min_samples_split': 200, 'subsample': 0.8},
       learning_rate=0.1, loss_regularization=5.0, max_depth=3,
       min_samples_leaf=200, n_estimators=23)
 to reweight random test-data2  cp test set fold 1 of 3


/usr/local/lib/python2.7/dist-packages/sklearn/metrics/classification.py:1115: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
/home/mayou/Documents/uniphysik/Bachelor_thesis/python_workspace/raredecay/raredecay/tools/data_storage.py:187: FutureWarning: Will be removed in the future. Use obj.index instead
  warnings.warn("Will be removed in the future. Use obj.index instead", FutureWarning)

Training the reweighter
=======================
Reweighter: GBReweighter with config: 
  n_estimators : 23

  gb_args
^^^^^^^^^
    min_samples_split : 200
    subsample : 0.8
  learning_rate : 0.1
  max_depth : 3
  min_samples_leaf : 200 
Data used:
 random test-data2  cp train set fold 2 of 3  and  random test-data1  cp train set fold 2 of 3 
columns used for the reweighter training:
 ['two', 'three']


Using the reweighter:
GBReweighter(gb_args={'min_samples_split': 200, 'subsample': 0.8},
       learning_rate=0.1, loss_regularization=5.0, max_depth=3,
       min_samples_leaf=200, n_estimators=23)
 to reweight random test-data2  cp test set fold 2 of 3




Training the reweighter
=======================
Reweighter: GBReweighter with config: 
  n_estimators : 23

  gb_args
^^^^^^^^^
    min_samples_split : 200
    subsample : 0.8
  learning_rate : 0.1
  max_depth : 3
  min_samples_leaf : 200 
Data used:
 random test-data2  cp train set fold 3 of 3  and  random test-data1  cp train set fold 3 of 3 
columns used for the reweighter training:
 ['two', 'three']


Using the reweighter:
GBReweighter(gb_args={'min_samples_split': 200, 'subsample': 0.8},
       learning_rate=0.1, loss_regularization=5.0, max_depth=3,
       min_samples_leaf=200, n_estimators=23)
 to reweight random test-data2  cp test set fold 3 of 3



---------------------
Kfold reweight report
---------------------


Precision scores of classification on reweighted mc
===================================================



Classify the target, average score Reweighted: 0.4782 +- 0.0386


Classify the target, average score mc as real (min): 0.4628 +- 0.0329


Classify the target, average score real as real (max): 0.5271 +- 0.0258



-----------------------------------------------
Clf trained on real/mc reweight, tested on real
-----------------------------------------------
Score train_similar (recall, lower means better):  0.4555 +- 0.0306


No reweighting score:  0.4357


KFold prediction using folds column

------------------
Report of classify
------------------
ROC AUC of XGBoost classifier: 0.5058


KFold prediction using folds column

===================================
ROC AUC of mc reweighted/real KFold
===================================
ROC AUC score: 0.5058



====================
Train similar report
====================
score: 0.4555 +- 0.0306


score max: 0.4357 +- 0.0614


Output

We get a lot of output; the scores are explained in the docs and in the presentation mentioned above. Everything printed is also returned. Let's have a look at the output of the function: it is a dict containing all the scores as well as the new weights (which have also been added to the MC).

In [6]:
print output_cv
{'train_similar': {'score': 0.45550000000000002, 'score_max': 0.43569999999999998, 'score_std': 0.030599999999999999, 'score_max_std': 0.061400000000000003}, 'roc_auc': 0.5058, 'weights': 536    0.545338
657    1.178855
82     4.096259
710    0.303732
263    2.271247
652    0.814733
706    0.128687
171    1.504750
85     1.145765
925    0.481734
588    2.926373
52     0.831615
9      0.800249
600    2.979716
253    1.477198
68     1.358077
292    0.613697
290    0.625958
848    2.449538
701    1.309799
90     1.056359
223    0.766822
201    0.316992
135    1.933890
303    0.472120
976    0.388638
397    0.420802
480    2.587732
37     0.248182
7      1.110666
         ...   
117    1.861048
882    0.068684
781    0.340506
648    2.923045
594    1.834430
609    1.530124
943    1.200565
705    0.015942
723    0.238462
442    1.696918
167    1.072036
101    0.152580
371    0.541372
719    0.720333
48     0.289284
696    1.214315
500    0.571604
621    0.295272
583    1.254412
63     0.663536
449    0.439253
899    0.955351
512    1.114598
748    0.432048
16     0.012738
833    0.731183
225    2.241149
473    0.427155
321    1.643999
416    0.378327
dtype: float64, 'mcreweighted_as_real_score': {'score_min': 0.4628, 'score_max': 0.5271, 'score_reweighted': 0.4782}}
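
The individual entries can be picked out directly, for example:

print output_cv['roc_auc']  # close to 0.5 means the classifier cannot distinguish reweighted MC from real data
print output_cv['train_similar']['score']  # lower means better
new_mc_weights = output_cv['weights']  # pandas Series; already applied to data2, as apply_weights=True was set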

Reweighting

After we are content with our hyper-parameters, we can do the actual reweighting. We need three datasets:

  • MC as well as real data to train the reweighter
  • MC to apply the trained reweighter to, in order to obtain its new weights

As we have already created the first two datasets, we only need the third one (which would be our MC of the decay we want to discover).

In [8]:
data_to_reweight = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])  

data_to_reweight = HEPDataStorage(data_to_reweight,
                                  sample_weights=abs(np.random.normal(size=rows)),
                                  data_name="random MC of decay")

The function call is similar to the one above.

In [9]:
output_reweighting = reweight(real_data=data1, mc_data=data2, apply_data=data_to_reweight,
                              reweighter='gb',  # Default value. The reweighter to use
                              reweight_cfg=reweight_cfg,  # Again the configuration to use for the reweighter
                              columns=branches,  # Same as above, the branches for the reweighting
                              apply_weights=True  # Default value. Add the weights to the "apply_data"
                              )

Training the reweighter
=======================
Reweighter: GBReweighter with config: 
  n_estimators : 23

  gb_args
^^^^^^^^^
    min_samples_split : 200
    subsample : 0.8
  learning_rate : 0.1
  max_depth : 3
  min_samples_leaf : 200 
Data used:
 random MC   and  random real data  
columns used for the reweighter training:
 ['two', 'three']


Using the reweighter:
GBReweighter(gb_args={'min_samples_split': 200, 'subsample': 0.8},
       learning_rate=0.1, loss_regularization=5.0, max_depth=3,
       min_samples_leaf=200, n_estimators=23)
 to reweight random MC of decay 
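
Before saving, a quick sanity check of the new weights is worthwhile; a huge spread or extreme outliers often hint at over-fitting:

new_w = output_reweighting['weights']
print new_w.describe()  # count, mean, std and extremes of the new weights
new_w.hist(bins=50)  # drawn inline thanks to %matplotlib inline above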


Save new weights to file

Often, our data lives in a ROOT tree and we want to add the weights there as well. This can be done with the following function:

In [11]:
new_weights = output_reweighting['weights']
# This won't run without ROOT and rootpy installed:
#add_branch_to_rootfile(filename='/home/data/K2ee_MC.root',  # name of the file
#                       treename='DecayTree',  # name of the tree in the file
#                       new_branch=new_weights,  # the data to add
#                       branch_name='weights_gb'  # name of the new branch; an
#                       )                         # existing one gets overwritten
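
Alternatively, a structured numpy array together with root_numpy's array2root writes the same branch (a sketch; this also requires ROOT plus root_numpy, hence commented out):

arr = np.zeros(len(new_weights), dtype=[('weights_gb', np.float64)])
arr['weights_gb'] = new_weights.values
# from root_numpy import array2root
# array2root(arr, filename='/home/data/K2ee_MC.root', treename='DecayTree', mode='update')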