To reweight the MC to make it more similar to the real data, one can use simple binning of certain variables and divide the histograms. A more sophisticated approach splits the event space with decision trees and reweights in a boosted manner. For further information, see also: http://arogozhnikov.github.io/2015/10/09/gradient-boosted-reweighter.html
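The simple binned approach can be sketched with plain NumPy: histogram both samples with the same bin edges, take the per-bin ratio real/MC, and assign each MC event the weight of its bin. This is a minimal illustration (toy Gaussian samples, hypothetical variable), not the function used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=10000)  # toy "real data" variable
mc = rng.normal(0.3, 1.2, size=10000)    # toy "MC", shifted and wider

# Histogram both samples with the same edges, normalized to unit area
edges = np.linspace(-4, 4, 21)
hist_real, _ = np.histogram(real, bins=edges, density=True)
hist_mc, _ = np.histogram(mc, bins=edges, density=True)

# Per-bin ratio real/MC; empty MC bins get weight 1 to avoid division by zero
ratio = np.divide(hist_real, hist_mc,
                  out=np.ones_like(hist_real), where=hist_mc > 0)

# Each MC event receives the weight of the bin it falls into
bin_idx = np.clip(np.digitize(mc, edges) - 1, 0, len(ratio) - 1)
weights = ratio[bin_idx]
```

After reweighting, the weighted MC distribution follows the real one much more closely; the drawback is that binning scales badly with the number of variables, which is what the tree-based approach addresses.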
In this notebook, the GBReweighter from the hep_ml package will be used.
The gradient boosted reweighting approach has its pros but also its cons. The trickiest part is the search for good hyper-parameters, as the method is highly sensitive to them and over-fits very quickly. See also this presentation: NOT-YET-UPLOADED.
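The over-fitting risk is easy to see in the underlying density-ratio idea. As an illustration only (this is not the exact GBReweighter algorithm, which optimizes its own tree-boosting loss), one can train a gradient-boosted classifier to separate MC from real data and weight each MC event by p/(1-p); evaluating those weights on the very events the model was trained on is exactly the over-fitting trap that the KFold scheme below guards against.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(5000, 2))  # toy "real data"
mc = rng.normal(0.3, 1.0, size=(5000, 2))    # toy "MC", shifted

# Label real=1, mc=0 and train a small boosted classifier to separate them
X = np.vstack([real, mc])
y = np.concatenate([np.ones(len(real)), np.zeros(len(mc))])
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                 learning_rate=0.1)
clf.fit(X, y)

# p/(1-p) estimates the density ratio real/MC -> per-event MC weights
p = clf.predict_proba(mc)[:, 1]
weights = p / np.clip(1.0 - p, 1e-6, None)
```

With deeper trees or more estimators, the classifier memorizes the training events and the weights degenerate; that sensitivity is why the hyper-parameters are validated with out-of-fold scores before the real reweighting.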
So we will have two steps:
1. reweight the data in a KFold manner and compute scores to judge the hyper-parameters
2. train the reweighter on the full samples and apply it to the MC we actually want to reweight
Let's first start with some data initialization (usually you want to initialize via settings.initialize() beforehand! See also the IO_handling_and_settings HowTo).
%matplotlib inline
import numpy as np
import pandas as pd
from raredecay.tools.data_storage import HEPDataStorage
from raredecay.analysis.physical_analysis import reweightCV, reweight, add_branch_to_rootfile
# 5 columns
rows = 1000
data1 = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])
data1 = HEPDataStorage(data1,
sample_weights=abs(np.random.normal(size=rows)),
data_name="random real data")
# Second DataStorage
data2 = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])
data2 = HEPDataStorage(data2,
sample_weights=abs(np.random.normal(size=rows)),
data_name="random MC")
Later we will call a single function that does everything, so it is a good idea to define some of your variables beforehand, such as the hyper-parameters of the reweighter and the columns you want to train the reweighter on.
# we collect all parameters for the reweighter (in this case we use the GBReweighter) in a dict
reweight_cfg = dict( # GB reweighter configuration, comments are "good" values
n_estimators=23, # 20-25
max_depth=3, # 3-6 or number of features
learning_rate=0.1, # 0.1
min_samples_leaf=200, # 200
# loss_regularization=7, # 3-8; probably not yet implemented, only in the newest GitHub version
gb_args=dict(
subsample=0.8, # 0.8
#random_state=43,
min_samples_split=200 # 200
)
)
# The same goes for the branches we want to use. To use all of them, simply pass None
branches = ['two', 'three']
To know whether these hyper-parameters are good, we first reweight the data itself in a KFold manner and compute scores that tell us how well we reweighted.
output_cv = reweightCV(real_data=data1, mc_data=data2,
n_folds=3, # the folds to use for the reweighting, ~10
reweighter='gb', # Default value. Could also be 'bins'
reweight_cfg=reweight_cfg, # A dict containing the config for the reweighter
columns=branches, # The branches to use for the reweighting
scoring=True, # Whether to obtain scores or not
n_folds_scoring=5, # In how many folds to split the data for certain scorings
score_clf='xgb', # Which clf to use. For more options, also see ml_analysis.make_clf()
apply_weights=True # Adds the weights directly to the mc_data
)
We get a lot of output; the scores are explained in the docs and in the presentation mentioned above. Everything that is printed is also returned. Let's have a look at the output of the function: it is a dict containing all the scores as well as the new weights (which have also been added to the MC).
print(output_cv)
Once we are content with our hyper-parameters, we can do the actual reweighting. We need three different datasets:
- the real data to train on
- the MC to train on
- the MC we want to apply the new weights to
As we have already created the first two datasets, we only need the third one (which would be our MC of the decay we want to discover).
data_to_reweight = pd.DataFrame(np.random.normal(size=(rows, 5)), columns=['one', 'two', 'three', 'four', 'five'])
data_to_reweight = HEPDataStorage(data_to_reweight,
sample_weights=abs(np.random.normal(size=rows)),
data_name="random MC of decay")
The function call is then similar to the one above.
output_reweighting = reweight(real_data=data1, mc_data=data2, apply_data=data_to_reweight,
reweighter='gb', # Default value. The reweighter to use
reweight_cfg=reweight_cfg, # Again the configuration to use for the reweighter
columns=branches, # Same as above, the branches for the reweighting
apply_weights=True # Default value. Add the weights to the "apply_data"
)
Often we have our data in a ROOT tree and want to add the weights there as well. This can be done with the following function:
new_weights = output_reweighting['weights']
#add_branch_to_rootfile(filename='/home/data/K2ee_MC.root', # name of the file
# treename='DecayTree', # name of the tree in the file
# new_branch=new_weights, # The data to add
# branch_name='weights_gb' # The name of the new branch
# (or an already existing one -> overwrites it)
# )
# Won't work without ROOT and rootpy