Machine Learning

Created on Sat Mar 26 11:29:01 2016

@author: Jonas Eschle “Mayou36”

The Machine Learning Analysis module consists of machine-learning functions, most of which are wrappers around existing algorithms.

Several “new” formats are introduced by building on the formats available in the underlying libraries (scikit-learn, pandas, numpy etc.), bringing together what belongs together. This takes away the unnecessary work that is otherwise done over and over for simple tasks.

The functions serve as basic tools that already do a lot of the work.

raredecay.analysis.ml_analysis.backward_feature_elimination(original_data, target_data=None, features=None, clf='xgb', n_folds=10, max_feature_elimination=None, max_difference_to_best=0.08, keep_features=None, take_target_from_data=False)[source]

Train and score on each feature subset, eliminating features backwards.

There are several techniques to find out which features have a big impact on the training of the clf and which do not. The most reliable, but also most cost-intensive, one is recursive backward feature elimination. A classifier is first trained on all features and validated with the KFold technique and the ROC AUC. Then one feature is removed and the classifier is trained and tested again; this is done once for every feature. The feature whose removal makes the AUC drop the least is then eliminated, and the next round starts from the beginning with one feature less.

The function ends if any of the following holds:

  • no features are left
  • max_feature_elimination features have been eliminated
  • the time limit max_feature_elimination is reached
  • the difference between the most useless feature's auc and the best auc (the run done with all features at the beginning) is higher than max_difference_to_best
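
A minimal usage sketch (the HEPDataStorage instances mc and real as well as the feature names are assumptions made for illustration; the call itself follows the signature above):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances prepared elsewhere (hypothetical names)
    result = ml_analysis.backward_feature_elimination(
        original_data=mc,
        target_data=real,
        features=['feat_a', 'feat_b', 'feat_c'],  # assumed example feature names
        clf='xgb',
        n_folds=10,
        max_feature_elimination='2:30',  # stop after roughly 2 hours 30 minutes
        max_difference_to_best=0.08,
    )
    # 'roc_auc': ordered dict of removed feature -> roc auc without that feature
    print(result['roc_auc'])
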
Parameters:
  • original_data (HEPDataStorage) – The original data
  • target_data (HEPDataStorage) – The target data
  • features (list(str, str, str,..)) – List of strings containing the features/columns to be used for the hyper-optimization or feature selection.
  • clf (str {'xgb', 'rdf', 'erf', 'gb', 'ada', 'nn'} or config-dict) – For possible options, see also make_clf()
  • n_folds (int > 1) – How many folds you want to split your data in when doing KFold-splits to measure the performance of the classifier.
  • max_feature_elimination (int >= 1 or str "hhhh:mm") – The maximum number of features to eliminate before the procedure stops, or how much time the elimination may (approximately) take. If the time runs out before the other criteria are met (no features left, max_difference_to_best exceeded...), the results obtained so far are returned.
  • max_difference_to_best (float) –

    The maximum difference between the “least worst” feature's auc and the best auc (usually the one obtained with all features) before the elimination stops.

    In other words, features are only eliminated as long as the elimination does not lower the ROC AUC by more than max_difference_to_best compared with the ROC AUC obtained with all features (= the highest ROC AUC).

  • keep_features – A list of features that won't be eliminated. The algorithm does not test the metric with these features removed, which saves quite some time.
  • take_target_from_data (boolean) – Deprecated, will be removed. Use only if target_data is None.
Returns:

out – Return a dictionary containing the evaluation:

  • ‘roc_auc’ : an ordered dict mapping each removed feature to the roc auc evaluated without that feature.
  • ‘scores’ : all the roc auc values with every feature removed once; basically a pandas DataFrame containing all results.

Return type:

dict

raredecay.analysis.ml_analysis.best_metric_cut(mc_data, real_data, prediction_branch, metric='precision', plot_importance=3)[source]

Find the best threshold cut for a given metric.

Test the metric for every possible threshold cut and return the highest value. The metric versus the cut is plotted as well.
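
A minimal usage sketch (the data storages and the prediction branch name are assumptions; the call follows the signature above):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances that already contain a branch
    # with classifier predictions (the branch name 'clf_proba' is assumed here)
    result = ml_analysis.best_metric_cut(
        mc_data=mc,
        real_data=real,
        prediction_branch='clf_proba',
        metric='precision',
        plot_importance=3,
    )
    print(result['best_threshold_cut'], result['best_metric'])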

Parameters:
  • mc_data (HEPDataStorage) – The MC data
  • real_data (HEPDataStorage) – The real data
  • prediction_branch (str) – The branch name containing the predictions to test.
  • metric (str {punzi (punzi_fom()), precision (precision_measure())} or simple metric) – Can be a valid string pointing to a metric or a simple metric taking only tpr and fpr: metric(tpr, fpr, weights=None)
  • plot_importance (int {0, 1, 2, 3, 4, 5}) – The higher the importance, the more likely the plots will be shown. All plots will be saved anyway if an output path was initialized.
Returns:

out – Return a dict containing the best threshold cut as well as the metric value. The keywords are:

  • best_threshold_cut: the best cut on the predictions
  • best_metric: the value of the metric when applying the best cut.

Return type:

dict

raredecay.analysis.ml_analysis.classify(original_data=None, target_data=None, features=None, validation=10, clf='xgb', extended_report=False, get_predictions=False, plot_title=None, curve_name=None, weights_ratio=0, importance=3, plot_importance=3, target_from_data=False, **kwargs)[source]

Train and/or test a classifier, or obtain K-folded predictions.

Classify is a multi-purpose function that does most of the things around machine learning. It can be used for:

  • Training a clf.
    A quite simple task. You give some data, specify a clf and set validation to False (not mandatory actually, but pay attention if validation is set to an integer)
  • Predict data.
    Use either a pre-trained classifier (see above) or specify one with a string, pass the data to predict as validation and pass nothing to original_data or target_data. Set get_predictions to True and you’re done.
  • Get a ROC curve of two datasets with K-Folding.
    Specify the two input datasets (original_data and target_data) and use cross-validation by setting validation to the number of folds.
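
As an illustration of the last use case, a minimal K-folded ROC sketch (the data storages and feature names are assumptions; the call follows the signature above):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances prepared elsewhere
    out = ml_analysis.classify(
        original_data=mc,
        target_data=real,
        features=['feat_a', 'feat_b', 'feat_c'],  # assumed example feature names
        validation=10,  # 10-fold cross-validation
        clf='xgb',
        plot_title='mc vs real, 10-fold ROC',
    )
    # out contains the trained (Folding)classifier and, since validation is
    # given, the ROC AUC score (see the return values below)
    print(out)
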
Parameters:
  • original_data (HEPDataStorage) – The original data for the training
  • target_data (HEPDataStorage or None) – The target data for the training. If None, only the original_data will be used for the training.
  • features (list(str, str, str...)) – List with features/columns to use in training.
  • validation (int >= 1 or HEPDataStorage) –

    You can either do cross-validation or provide a test sample for the data.

    • Cross-validation:
      Enter an integer, which is the number of folds
    • Validation-dataset:
      Enter a HEPDataStorage which contains the data to be tested on. The target labels will be taken from it, so make sure they are not None! To use two datasets, you can also pass a list of at most two data storages.
  • clf (classifier, see make_clf()) – The classifier to be used for the training and predicting. A pretrained classifier can also be passed as argument.
  • extended_report (boolean) – If True, make extended reports on the classifier as well as on the data, including feature correlation, feature importance etc.
  • get_predictions (boolean) – If True, return a dictionary containing the prediction probabilities, the true y-values, weights and more. Have a look at the return values.
  • plot_title (str) – A part of the title of the plots and the general name of the call. It will also be printed in the output to identify the intention with which this function was called.
  • curve_name (str) – A labeling for the plotted data.
  • weights_ratio (int >= 0) – The ratio of the weights, actually the class-weights.
  • importance (int {0, 1, 2, 3, 4, 5}) – The higher the importance, the more likely the output will be printed. All output will be saved anyway if an output path was initialized.
  • plot_importance (int {0, 1, 2, 3, 4, 5}) – The higher the importance, the more likely the plots will be shown. All plots will be saved anyway if an output path was initialized.
  • target_from_data (boolean) – OUTDATED; not encouraged to use. If True, the target labeling (the y) will be taken from the data directly and not created. Otherwise, 0 will be assumed for the original_data and 1 for the target_data.
  • kwargs (additional keyword arguments) –
    • original_test_weights (pandas Series) – Weights for the original test sample if you don’t want to use the same weights as in the training.
    • target_test_weights (pandas Series) – Weights for the target test sample if you don’t want to use the same weights as in the training.
Returns:

  • out (clf) – Return the trained classifier.
  • Note – If validation was chosen to be KFold, the returned classifier will be an instance of FoldingClassifier()!
  • out (float, only if validation is not None) – Return the score (recall or roc auc) of the validation. If only one class (of labels; mostly the case when data for validation is provided) is present, the recall will be computed. Otherwise the ROC AUC is returned (as for cross-validation).
  • out (dict, only if get_predictions is True) – Return a dict containing the predictions, probabilities and more.
    • ‘y_pred’ : predictions of the classifier
    • ‘y_proba’ : prediction probabilities
    • ‘y_true’ : the true labels of the data (if available)
    • ‘weights’ : the weights of the corresponding predictions

raredecay.analysis.ml_analysis.make_clf(clf, n_cpu=None, dict_only=False)[source]

Return a classifier-dict. Takes a str, config-dict, clf-dict or classifier.

This function is used to bring classifiers into the “same” format. It takes several kinds of arguments, extracts the information, sorts it and creates an instance of a classifier if needed.

The currently implemented classifiers are listed below.

Parameters:
  • clf (dict or str or sklearn/REP-classifier) –

    There are several ways to pass the classifier to the function.

    • Pure classifier: You can pass a classifier to the method, either a scikit-learn or a REP classifier.
    • Classifier with name: you can name your classifier by either:
      • using a dict with {‘my_clf_1’: clf}
      • using a dict with {‘name’: ‘my_clf_1’, ‘clf’: clf} where clf refers to the classifier and ‘my_clf_1’ can be any name.
    • Configuration for a clf: Instead of instantiating the clf outside, you can also pass a configuration-dictionary. This has to look like:
      • {‘clf_type’: config-dict, ‘name’: ‘my_clf_1’} (name is optional) where ‘clf_type’ has to be any of the implemented clf-types like ‘xgb’, ‘rdf’, ‘ada’ etc.
    • Get a standard clf: by providing only a string referring to an implemented clf-type, you will get a classifier using the configuration in meta_config.
  • n_cpu (int or None) –

    The number of cpus to use for this classifier. If the classifier is not parallelizable, a corresponding parallel_profile (see also the REP docs) will be created: ‘threads-n’ with n the number of cpus specified before.

    Warning

    This overwrites the global n-cpu settings for this specific classifier.

  • dict_only (boolean) – If True, only a dictionary containing the name, config, clf_type, parallel_profile and n_cpu will be returned; no classifier instance will be created.
Returns:

out – A dictionary containing the name (‘name’) of the classifier as well as the classifier itself (‘clf’). If dict_only is True, no clf will be returned but a ‘clf_type’ as well as a ‘config’ key. If a configuration (and not an already instantiated clf) is given, the dictionary can additionally contain the following values:

  • parallel_profile: the parallel profile (for different REP functions), which is set according to the n_cpus entered as well as the number of cpus the classifier uses. If the classifier itself takes the requested number of cpus, the profile will be None. If the classifier uses only 1 cpu, the profile will be ‘threads-n’ with n = n_cpus.
  • n_cpus: The number of cpus used in the classifier.

Return type:

dict
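
To illustrate the accepted argument forms described above, a minimal sketch (the classifier instances, names and configuration values are assumptions made for illustration):

    from sklearn.ensemble import AdaBoostClassifier
    from raredecay.analysis import ml_analysis

    # standard classifier from a string, using the configuration in meta_config
    clf_dict = ml_analysis.make_clf('xgb', n_cpu=4)

    # named, already instantiated classifier
    clf_dict = ml_analysis.make_clf({'my_ada_1': AdaBoostClassifier(n_estimators=50)})

    # configuration dictionary instead of an instance (assumed example settings)
    clf_dict = ml_analysis.make_clf({'xgb': {'n_estimators': 100}, 'name': 'my_xgb_1'})

    print(clf_dict['name'], clf_dict['clf'])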

raredecay.analysis.ml_analysis.optimize_hyper_parameters(original_data, target_data=None, clf=None, features=None, n_eval=1, n_checks=10, n_folds=10, generator_type='subgrid', take_target_from_data=False, **kwargs)[source]

Optimize the hyper-parameters of a classifier.

Hyper-parameter optimization of a classifier is an important task. Two datasets are required as well as a clf (not an instance, a dict). For more information about which classifiers are valid, see also make_clf().

The optimization does not happen automatically but scans the hyper-parameter space provided. Every clf-parameter that is a list or numpy array is considered a set of points to test. The search technique can be specified under generator_type.

It is possible to set a time limit instead of an n_eval limit. The time needed for a run is estimated and extrapolated. This extrapolation is not very precise; at worst, it can exceed the allowed run time by approximately 20%.
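
A minimal usage sketch (the data storages, feature names and scanned parameter values are assumptions; the clf config-dict follows the format described for make_clf()):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances prepared elsewhere.
    # Every parameter given as a list spans the hyper-parameter space to scan.
    clf_config = {'xgb': {'n_estimators': [50, 100, 200],   # assumed example values
                          'eta': [0.05, 0.1, 0.2]},
                  'name': 'xgb_hyperopt'}
    ml_analysis.optimize_hyper_parameters(
        original_data=mc,
        target_data=real,
        clf=clf_config,
        features=['feat_a', 'feat_b', 'feat_c'],  # assumed example feature names
        n_eval='1:30',  # roughly 1 h 30 min of search instead of a fixed number of evaluations
        n_checks=3,
        n_folds=10,
        generator_type='subgrid',
    )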

Parameters:
  • original_data (HEPDataStorage) – The original data
  • target_data (HEPDataStorage) – The target data
  • clf (config-dict) – For possible options, see also make_clf(). The difference is that for every hyper-parameter you want to have optimised, you use an iterable instead of a single value, e.g. ‘n_estimators’: [1, 2, 3, 4] etc.
  • features (list(str, str, str,..)) – List of strings containing the features/columns to be used for the hyper-optimization.
  • n_eval (int > 1 or str "hh...hh:mm") – How many evaluations should be done; how many points in the hyper-parameter space should be tested. This can either be an integer, which represents the number of evaluations done, or a string in the format of “hours:minutes” (e.g. “3:25”, “1569:01” (quite long...), “0:12”), which represents the approximate time the hyper-parameter search should take (not an exact upper limit).
  • n_checks (1 <= int <= n_folds) – The number of checks to perform on the KFolded dataset. For example, you split your data into 10 folds, but may only want to train/test on 3 different ones.
  • n_folds (int > 1) – How many folds you want to split your data in when doing train/test sets to measure the performance of the classifier.
  • generator_type (str {'subgrid', 'regression', 'random'}) –

    The generator searches the hyper-parameter space. Different generators can be used using different strategies to search for the global maximum.

    • subgrid : for larger grids, first perform the search on smaller subgrids to get to know the rough topology of the space.
    • regression : use an estimator doing regression on the already known hyper-parameter space points to estimate where to test next.
    • random : Randomly choose points in the hyper-parameter space.
  • take_target_from_data (Boolean) – OUTDATED; not encouraged to use. If True, the target labeling (the y) will be taken from the data directly and not created. Otherwise, 0 will be assumed for the original_data and 1 for the target_data.
raredecay.analysis.ml_analysis.reweight_Kfold(mc_data, real_data, columns=None, n_folds=10, reweighter='gb', meta_cfg=None, n_reweights=1, score_columns=None, score_clf='xgb', add_weights_to_data=True, mcreweighted_as_real_score=False)[source]

Kfold-reweight the data by “itself” for scoring and hyper-parameter optimization.

Warning

Do NOT use for the real reweighting process! (except if you really want to reweight the data “by itself”)

If you want to figure out the hyper-parameters for a reweighting process or just want to find out how good the reweighter works, you may want to apply this to the data itself. This means:

  • train a reweighter on mc/real
  • apply it to get new weights for mc
  • compare the mc/real distribution

The problem that arises is biasing your reweighter. As in classification tasks, where you split your data into train/test sets for KFolds, you want to do the same here. Therefore:

  • split the mc data into (n_folds-1)/n_folds (training)
  • train the reweighter on the training mc/complete real (if mcreweighted_as_real_score is True, the real data will be folded too for unbiasing the score)
  • reweight the leftout mc test-fold
  • do this n_folds times
  • getting unbiased weights

The parameters are more or less the same as for reweight_train() and reweight_weights().
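
A minimal usage sketch (the data storages, column names and reweighter configuration are assumptions; the call follows the signature above):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances prepared elsewhere
    new_weights = ml_analysis.reweight_Kfold(
        mc_data=mc,
        real_data=real,
        columns=['feat_a', 'feat_b'],                   # assumed reweighting columns
        n_folds=10,
        reweighter='gb',
        meta_cfg={'n_estimators': 40, 'max_depth': 3},  # assumed GBReweighter settings
        score_columns=['feat_a', 'feat_b', 'feat_c'],   # assumed scoring columns
        score_clf='xgb',
        add_weights_to_data=True,
    )
    # new_weights is a pandas Series with the K-folded, unbiased weights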

Parameters:
  • mc_data (HEPDataStorage) – The Monte-Carlo data, which has to be “fitted” to the real data.
  • real_data (HEPDataStorage) – Same as mc_data but for the real data.
  • columns (list of strings) – The columns/features/branches you want to use for the reweighting.
  • n_folds (int >= 1) – The number of folds to split the data into. Usually, the more folds the “better” the reweighting (especially for small datasets). If n_folds = 1, the data will be reweighted directly and the benefit of Kfolds and the unbiasing disappears.
  • reweighter ({'gb', 'bins'}) – Specify which reweighter to use.
    • gb: GradientBoosted Reweighter from REP
    • bins: Binned Reweighter from REP
  • meta_cfg (dict) – Contains the parameters for the bins/gb-reweighter. See also BinsReweighter() and GBReweighter().
  • n_reweights (int) – As the reweighting often yields different weights depending on random parameters like the splitting of the data, the new weights can be produced by taking the average of the weights over many reweighting runs. n_reweights is the number of reweight runs to average over.
  • score_columns (list(str, str, str,..)) – The columns to use for the scoring. It is often a good idea to use different (and more) columns for the scoring than for the reweighting itself. A good idea is to use the same columns as for the selection later on.
  • score_clf (clf or clf-dict or str) – The classifier to be used for the scoring. Has to be a valid argument to make_clf().
  • add_weights_to_data (boolean) – If True, the new weights will be added (in place) to the mc data and returned. Otherwise, the weights will only be returned.
  • mcreweighted_as_real_score (boolean or str) – If a string, it has to be an implemented classifier in classify. If True, the default (‘xgb’ most probably) will be used. If not False, calculate and print the score. This scoring is based on a clf that is trained on the non-reweighted mc and the real data, tested on the reweighted mc, and then predicts how many events it “thinks” are real datapoints. Intuitively, the classifier learns to distinguish between mc and real and then classifies reweighted mc labeled as real; it tells how “real” the reweighted data looks. So a higher score is better. The drawback of this method is that it is completely blind to over-fitting of the reweighter. To get a relation, the classifier also predicts the mc (which should be a lower limit) as well as the real data (which should be an upper limit). Even though this score does not say a lot about how well the reweighting worked, we can say that if it is higher than the score on the real data, the reweighter has somehow over-fitted (if a classifier cannot classify, say, more than 70% of the real data as real, it should not be able to classify more than 70% of the reweighted mc as real; reweighted mc should not “look more real” than real data).
Returns:

out – Return the new weights.

Return type:

Series

raredecay.analysis.ml_analysis.reweight_train(mc_data, real_data, columns=None, reweighter='gb', reweight_saveas=None, meta_cfg=None, weights_mc=None, weights_real=None)[source]

Return a trained reweighter from a (mc/real) distribution comparison.

Reweighting a distribution means “making two distributions the same” by assigning each event a weight (instead of 1). Mostly, and hence the naming, you want to change the mc distribution towards the real one.
There are two possibilities:
  • normal bins reweighting:
    divides the bins of one distribution by the bins of the other distribution. Easy and fast, but unstable and inaccurate in higher dimensions.
  • Gradient Boosted reweighting:
    uses several decision trees to reweight the bins. Slower, but more accurate. Very useful in higher dimensions. But be aware that you can easily screw things up by overfitting.
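
A minimal usage sketch (the data storages, column names, reweighter configuration and file name are assumptions; the call follows the signature above):

    from raredecay.analysis import ml_analysis

    # mc and real are HEPDataStorage instances prepared elsewhere
    reweighter = ml_analysis.reweight_train(
        mc_data=mc,
        real_data=real,
        columns=['feat_a', 'feat_b'],                   # assumed reweighting columns
        reweighter='gb',
        meta_cfg={'n_estimators': 40, 'max_depth': 3},  # assumed GBReweighter settings
        reweight_saveas='my_gb_reweighter',             # assumed file path/name for saving
    )
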
Parameters:
  • mc_data (HEPDataStorage) – The Monte-Carlo data to compare with the real data.
  • real_data (HEPDataStorage) – Same as mc_data but for the real data.
  • columns (list of strings) – The columns/features/branches you want to use for the reweighting.
  • reweighter ({'gb', 'bins'}) –

    Specify which reweighter to use.

    • gb: The GradientBoosted Reweighter from REP, GBReweighter()
    • bins: The simple bins reweighter from REP, BinsReweighter()
  • reweight_saveas (string) – To save the trained reweighter in addition to returning it. The value is the filepath + name.
  • meta_cfg (dict) – Contains the parameters for the bins/gb-reweighter. See also BinsReweighter() and GBReweighter().
  • weights_mc (numpy.array [n_samples]) – Explicit weights for the Monte-Carlo data. Only specify if you don’t want to use the weights in the HEPDataStorage.
  • weights_real (numpy.array [n_samples]) – Explicit weights for the real data. Only specify if you don’t want to use the weights in the HEPDataStorage.
Returns:

out – The reweighter trained on the data. It can, for example, be used with predict_weights().

Return type:

object of type reweighter

raredecay.analysis.ml_analysis.reweight_weights(reweight_data, reweighter_trained, columns=None, normalize=True, add_weights_to_data=True)[source]

Apply the reweighter to the data and add and/or return the weights.

Can be seen as a wrapper for the predict_weights() method. Additional functionality:

  • Takes a trained reweighter as argument, but can also unpickle one from a file.
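
A minimal usage sketch continuing from the reweight_train() example above (the data storage and column names are assumptions):

    from raredecay.analysis import ml_analysis

    # mc is a HEPDataStorage instance; reweighter is the object returned by reweight_train()
    new_weights = ml_analysis.reweight_weights(
        reweight_data=mc,
        reweighter_trained=reweighter,
        columns=['feat_a', 'feat_b'],  # assumed, same columns as used for training
        normalize=True,
        add_weights_to_data=True,
    )
    # new_weights is a pandas Series of shape [n_samples]
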
Parameters:
  • reweight_data (HEPDataStorage) – The data for which the reweights are to be predicted.
  • reweighter_trained ((pickled) reweighter (from hep_ml)) – The trained reweighter, which predicts the new weights.
  • columns (list(str, str, str,..)) – The columns to use for the reweighting.
  • normalize (boolean or int) – If True, the weights will be normalized (scaled) to the value of normalize.
  • add_weights_to_data (boolean) – If set to False, the weights will only be returned and not updated in the data (HEPDataStorage). If you want to use the data later on in the script with the new weights, set this value to True.
Returns:

out – Return an instance of pandas Series of shape [n_samples] containing the new weights.

Return type:

Series