Today I discovered a paper by Yeh (2000) which discusses how to test differences in F-scores between classifiers for statistical significance. The author suggests, among other things, employing permutation tests for this. Permutation tests simulate* data under the null hypothesis that the classifiers have equal predictive validity / equal F-scores. The p-value is then the fraction of simulated differences that are at least as extreme as the observed difference between the F-scores.
The intuition behind this is that if two models have equally good F-scores, it would not matter which model a prediction comes from. The predictions of the two models are therefore randomly* shuffled between the classifiers for each observation; under the null hypothesis, any resulting difference between the F-scores is purely due to chance.
Permutation tests are a powerful concept that I can highly recommend, but I was not aware that they can even be applied to the evaluation of machine learning models.
*Simulation is necessary because the number of possible permutations grows as 2^n (with n observations) for two classifiers, which quickly becomes infeasible to enumerate exhaustively.
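For illustration, here is a minimal sketch of such a paired permutation test for binary F-scores. The function name `permutation_test_f1`, the parameters `n_permutations` and `seed`, and the use of scikit-learn's `f1_score` are my own choices for this sketch and do not necessarily match the notebook linked below.

```python
# A minimal sketch of an approximate (sampled) permutation test for the
# difference in F1-scores between two classifiers evaluated on the same test set.
import numpy as np
from sklearn.metrics import f1_score


def permutation_test_f1(y_true, pred_a, pred_b, n_permutations=10_000, seed=0):
    """Two-sided test of H0: classifier A and classifier B have equal F-scores."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))

    # Observed absolute difference in F1-scores.
    observed = abs(f1_score(y_true, pred_a) - f1_score(y_true, pred_b))

    count = 0
    for _ in range(n_permutations):
        # For each observation, swap the two models' predictions with probability 0.5.
        swap = rng.random(len(y_true)) < 0.5
        perm_a = np.where(swap, pred_b, pred_a)
        perm_b = np.where(swap, pred_a, pred_b)
        diff = abs(f1_score(y_true, perm_a) - f1_score(y_true, perm_b))
        if diff >= observed:
            count += 1

    # Add 1 to numerator and denominator so the sampled p-value is never exactly 0.
    return (count + 1) / (n_permutations + 1)


# Hypothetical usage with binary labels and two sets of predictions:
# p = permutation_test_f1(y_true, preds_model_a, preds_model_b)
```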
I put together a Python notebook that implements a permutation test for comparing F-scores in binary classification (it's most readable if you follow the link):
Click here to view this notebook in full screen
References
Yeh, A. (2000). More accurate tests for the statistical significance of result differences. arXiv preprint cs/0008005.