Today I discovered a paper by Yeh (2000) which discusses how to test differences in F-scores between classifiers for statistical significance. The author suggests, among other things, employing permutation tests for this. Permutation tests simulate* data under the null hypothesis that the classifiers have equal predictive validity / equal F-scores. The p-value is then the fraction of simulated differences that are at least as extreme as the observed difference between the F-scores.
The intuition behind this is that if two models have equally good F-scores, it would not matter which model a prediction comes from. The predictions of the two models are therefore randomly* shuffled between the classifiers for each observation; under the null hypothesis, any resulting difference between the F-scores is purely due to chance.
Permutation tests are a powerful concept that I can highly recommend, but I was not aware that they can even be applied to the evaluation of machine learning models.
*Simulation is necessary because the number of possible permutations grows as 2^n (with n observations) for two classifiers, which quickly becomes infeasible to enumerate exhaustively.
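For illustration, here is a minimal sketch of such a paired permutation test for binary F-scores. The function name `permutation_test_f1`, the parameters `n_permutations` and `seed`, and the use of scikit-learn's `f1_score` are my own choices for this sketch and do not necessarily match the notebook linked below.

```python
# A minimal sketch of an approximate (sampled) permutation test for the
# difference in F1-scores between two classifiers evaluated on the same test set.
import numpy as np
from sklearn.metrics import f1_score


def permutation_test_f1(y_true, pred_a, pred_b, n_permutations=10_000, seed=0):
    """Two-sided test of H0: classifier A and classifier B have equal F-scores."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))

    # Observed absolute difference in F1-scores.
    observed = abs(f1_score(y_true, pred_a) - f1_score(y_true, pred_b))

    count = 0
    for _ in range(n_permutations):
        # For each observation, swap the two models' predictions with probability 0.5.
        swap = rng.random(len(y_true)) < 0.5
        perm_a = np.where(swap, pred_b, pred_a)
        perm_b = np.where(swap, pred_a, pred_b)
        diff = abs(f1_score(y_true, perm_a) - f1_score(y_true, perm_b))
        if diff >= observed:
            count += 1

    # Add 1 to numerator and denominator so the sampled p-value is never exactly 0.
    return (count + 1) / (n_permutations + 1)


# Hypothetical usage with binary labels and two sets of predictions:
# p = permutation_test_f1(y_true, preds_model_a, preds_model_b)
```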
I put together a Python notebook that implements a permutation test for comparing F-scores in binary classification (it's most readable if you follow the link):
Click here to view this notebook in full screen
References
Yeh, A. (2000). More accurate tests for the statistical significance of result differences. arXiv preprint cs/0008005.