Dear Octa,
you raise one of the crucial points:
"how can a model be correct if it does not explicitly include all variables?"
(e.g. psychology, or more generally points 1 to 6 that Renato mentioned)
This is indeed a frequent point of discussion among certain communities of sports scientists, and in the applied sciences more generally, and not one that can be easily brushed away.
Before talking about this, I would just like to point out, since you mention "sides", that for me there are no "sides" in science. There is only reasoning which is logically correct, and reasoning which is incomplete or faulty.
If two statements intuitively contradict each other, following the reasoning allows one to find out in which sense either is right or wrong, or whether the contradiction is only imagined.
(This is one aspect of an approach called "evidence-based science".
A related, more down-to-earth point: one can be of a different opinion and still be friends. Often it turns out the opinions were not so different after all, since they were based on similar reasoning.)
Going back to the question
"how can a model be correct if it does not explicitly include all variables?"
I think the answer is: there is no reason to assume that a model must include all variables to be correct. After all, it is just a model.
Thus a model can, in principle, be correct without explicitly including all variables.
Of course this depends on what you mean by "correct" (or, more generally, on your epistemological = how-to-gain-understanding beliefs), but if you mean "useful" in whichever sense, this line of reasoning should be fine.
Being "useful", on the other hand, depends on what you use the model for.
In our case, a model is considered useful for predicting performance if it predicts performance with low error.
Whether the prediction error is low can be checked by the statistical methods we used in the manuscript. Essentially, it amounts to predicting many times and computing the prediction error by comparing the predicted values to the true values, which were held out for checking.
The reasoning here is: if the model has predicted well many times, it is plausible to assume that it will continue to do so in the future under similar circumstances.
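As a minimal sketch of what "predicting many times and comparing to held-out values" means in practice, here is a toy in Python. Everything in it is hypothetical: the data are synthetic and the least-squares fit is just a stand-in, not the actual model or data set from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 hypothetical athletes, one feature, noisy performance.
# (Illustrative only -- not the actual data set from the manuscript.)
X = rng.uniform(0, 1, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

# Repeated hold-out: fit on a random subset, predict the held-out rest,
# and record the prediction error each time.
errors = []
for _ in range(100):
    idx = rng.permutation(len(y))
    train, test = idx[:150], idx[150:]
    # Simple least-squares fit as a stand-in for the model.
    A = np.c_[X[train], np.ones(len(train))]
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    pred = np.c_[X[test], np.ones(len(test))] @ coef
    errors.append(np.sqrt(np.mean((pred - y[test]) ** 2)))  # RMSE

print(f"mean held-out RMSE: {np.mean(errors):.3f}")
```

If the mean held-out error stays low across many such splits, that is the evidence for "the model has predicted well many times" in the sense above.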
In principle, any model could have a low error, including one that just uses your phone number, though I would suspect that to be a rather bad model when it comes to athletic performance. But if you found that such a model were useful, then that's how it is, and then you would start wondering why.
What we found is that a model which characterizes an athlete by three numbers is best (on the data set at hand) at predicting performance.
This does not imply anything about what exactly the numbers mean, and in fact the data set does not allow a conclusion in this direction. There are, however, indications that they may somehow capture a combination of the points (1) to (6) that Renato mentioned; this may include psychological or behavioural aspects as well. But we do not quite know, since these were not explicitly measured in the data set.
Citing myself and Duncan from the "discussion" section:
Our analysis provides strong evidence that the three-number-summary captures physiological and/or social/behavioural characteristics of the athletes, e.g., training state, specialization, and which distance an athlete chooses to attempt. While the data set does not allow us to separate these potential influences or to make statements about cause and effect, we conjecture that combining the three-number-summary with specific experimental paradigms will lead to a clarification; further, we conjecture that a combination of the three-number-summary with additional data, e.g. training logs, high-frequency training measurements or clinical parameters, will lead to a better understanding of (C) existing physiological models.
Does this make sense, regarding your question?
Regarding the data we use: it is indeed the better part of performances (only people who finished those official races) in British (not human) history, and strictly speaking our results only apply to similar circumstances. But this is all we had, and it was already difficult to get.
We were aware of this issue, so we checked how this selection may affect the model. What we actually found is that the model seems more stable when learnt from the better athletes, since they perform more consistently at events; this still holds when one tries to predict the less good athletes.
Perhaps we should run a more principled experiment on how stable the model is under selection by performance percentile.
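The mechanics of such a percentile-selection experiment could look roughly like the following Python toy. Again, everything is assumed for illustration: the "ability" and "performance" variables are simulated, the one-parameter fit is a placeholder, and in this crude simulation the selection effect need not match what we observed on the real data; the point is only the shape of the check (train on one percentile slice, evaluate on the rest).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated athletes: a latent "ability" and a noisy observed performance.
# (Hypothetical -- purely to illustrate the percentile-selection design.)
ability = rng.normal(0, 1, size=1000)
performance = ability + rng.normal(0, 0.5, size=1000)

def fit_and_eval(train_mask):
    # Fit a simple linear model on the selected athletes...
    coeffs = np.polyfit(ability[train_mask], performance[train_mask], 1)
    # ...and evaluate the prediction error on everyone else.
    test = ~train_mask
    pred = np.polyval(coeffs, ability[test])
    return np.sqrt(np.mean((pred - performance[test]) ** 2))

# Train on the top 25% of performers, predict the remaining 75%.
top = performance >= np.quantile(performance, 0.75)
print(f"trained on top quartile, RMSE on rest: {fit_and_eval(top):.3f}")

# Baseline for comparison: train on a random 25%.
rand = np.zeros(1000, dtype=bool)
rand[rng.choice(1000, 250, replace=False)] = True
print(f"trained on random quartile, RMSE on rest: {fit_and_eval(rand):.3f}")
```

Sweeping the quantile threshold (0.9, 0.75, 0.5, ...) and plotting the out-of-slice error would give the "stability under selecting performance percentiles" curve directly.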