Thanks for splitting this off into a separate thread. I think it makes much more sense to discuss this away from Sandra's ongoing run.
scam_watcheroo wrote:
To recap, Sam Robson did a detailed analysis of a few days of Mimi's Strava data to try to show what he thinks is authentic data. However, there are major flaws in the assumptions made in his article that included incorrectly using Benfords Law and applying smoothing to Mimi's cadence data on the assumption that Sandra's data must be smoothed when that is in fact not the case as shown by Garmin documentation (Smart Recording samples at various intervals, it does not smooth).
Just to clarify, in the Benford's Law section, my entire point was that I assumed it probably WOULD NOT hold for these data, since they lie within a very narrow range (and this indeed proved to be the case). Nothing in my report suggested it even came close to holding. I was, however, then interested to see whether there was anything suggestive about the least significant digits (note that this is subtly different from Benford's Law, which looks at the digit in a fixed position counted from the left-hand side of the number, whereas I was looking at the first digit from the right-hand side), and there does appear to be a roughly uniform distribution. Thinking about it more, this is likely a result of the fact that Mimi's data are a mixture of two overlapping, roughly normal distributions with a high standard deviation, whilst the faked data are taken from a single distribution (which, importantly, does not vary much from data set to data set).
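To make the first-digit versus last-digit distinction concrete, here is a minimal sketch in Python (not the R code from my report). The cadence values are made up purely for illustration and are not taken from anyone's Strava data, and the function names are my own.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit proportions under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def digit_proportions(values, position):
    """Proportion of each digit at the 'first' (leftmost) or 'last' (rightmost)
    position of a list of positive integers."""
    if position == "first":
        digits = [int(str(v)[0]) for v in values]
    elif position == "last":
        digits = [v % 10 for v in values]
    else:
        raise ValueError("position must be 'first' or 'last'")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in sorted(counts)}


# Hypothetical cadence readings (steps per minute), invented for this example.
cadence = [158, 162, 171, 165, 169, 174, 160, 167, 163, 176]

first = digit_proportions(cadence, "first")
last = digit_proportions(cadence, "last")

print("Benford expects a leading 1 about", round(benford_expected()[1], 3), "of the time")
print("Observed leading digits:", first)
print("Observed last digits:", last)
```

Because every value sits in a narrow band between 100 and 199, the leading digit is always 1 and Benford's Law cannot hold, whereas the last digit is free to spread across 0 to 9.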
The smoothing point is worth my looking into, however. Note that I did not smooth Mimi's data for my analyses, as this would reduce the fidelity. I only looked at smoothing when trying to explain why the high 200+ spm cadence measurements were seen in Mimi's data but not in Sandra's. My assumption was that they were smoothed out in Sandra's data, but this does not seem to be the case. It says on the Garmin website that the "Smart Recording records key points where the fitness device changes direction, speed, heart rate or elevation." Presumably it needs some internal way of distinguishing a key change from a random error. I wonder whether this would skip unfeasibly high recordings like those seen in Mimi's data? The ideal test would be to see data from Sandra in both Smart Recording and 1s capture mode for one day, but I think that she is probably a little busy right now!
Note that neither of these points affects my conclusions, and it seems to me that you agree that the data are genuine. I know that there are other anomalies that still need looking into as well.
One thing that is odd to me is that I do not see the missing cadence data that you have described. I am using the same tool as you to get the raw .gpx data, and then parsing the data in R as per my code (which neither drops data nor fills in values for missing data). Yet I see zero missing values for the cadence (at least in the few files that I analysed). Any idea why this might be? Could something be happening when you port them over to Excel?
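For anyone who wants to check this independently of both R and Excel, here is a rough sketch in Python that counts trackpoints with no cadence value in the raw file. It assumes the cadence is stored in a `cad` element inside each `trkpt` (as in Garmin's TrackPointExtension), which may not match every export tool. The sample GPX below is invented for the example, and `count_missing_cadence` is a name I made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_missing_cadence(gpx_text):
    """Return (total trackpoints, trackpoints with no usable cadence value)."""
    root = ET.fromstring(gpx_text)
    total = 0
    missing = 0
    for elem in root.iter():
        if local_name(elem.tag) != "trkpt":
            continue
        total += 1
        # Assumption: cadence lives in an element whose local name is 'cad'.
        cad = [e for e in elem.iter() if local_name(e.tag) == "cad"]
        if not cad or cad[0].text is None or not cad[0].text.strip():
            missing += 1
    return total, missing


# Invented three-point track; the middle point has no cadence element.
SAMPLE_GPX = """<gpx version="1.1"
    xmlns="http://www.topografix.com/GPX/1/1"
    xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1">
  <trk><trkseg>
    <trkpt lat="34.0" lon="-118.0">
      <extensions><gpxtpx:TrackPointExtension><gpxtpx:cad>82</gpxtpx:cad></gpxtpx:TrackPointExtension></extensions>
    </trkpt>
    <trkpt lat="34.0001" lon="-118.0001"/>
    <trkpt lat="34.0002" lon="-118.0002">
      <extensions><gpxtpx:TrackPointExtension><gpxtpx:cad>84</gpxtpx:cad></gpxtpx:TrackPointExtension></extensions>
    </trkpt>
  </trkseg></trk>
</gpx>"""

total, missing = count_missing_cadence(SAMPLE_GPX)
print(f"{missing} of {total} trackpoints have no cadence value")
```

Running the same count on a real file before and after it goes through Excel would show whether the gaps are in the raw data or appear during the import.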