Not only that, they also ran an experiment with the training temperature turned way up (2.0) and truncation turned off, so that the majority of SFT examples were incoherent (63% IIRC). Yet the model finetuned on these broken examples still improved over baseline.
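To see why temperature 2.0 produces so much incoherence, here's a minimal sketch of temperature scaling (the logit values are made up for illustration): dividing logits by T > 1 flattens the softmax, so low-probability continuations get sampled far more often.

```python
import math

def softmax_with_temperature(logits, T):
    """Convert logits to probabilities at sampling temperature T."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-token vocabulary

for T in (0.5, 1.0, 2.0):
    probs = [round(p, 3) for p in softmax_with_temperature(logits, T)]
    print(f"T={T}: {probs}")
```

At T=0.5 the top token dominates (~0.87); at T=2.0 its mass drops to ~0.51 while the tail tokens roughly double, which is exactly the regime where sampled text starts falling apart.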
Maybe this still makes sense in a vague way: there may be useful signal purely in the model "internalizing" the behavior of its own sampler.
I don't know enough to say anything more formal, but it feels like exposing the model to its own output might help it "learn" to work with the sampler to reach a goal. I suspect this is part of why RL is helpful: aside from shifting the output towards a specific reward (RLVR or RLHF), it's the only place where optimization happens at the level of an actual end-to-end sampled sequence of tokens, rather than at the next-token-logits level as in pretraining. (This is also why the highest-probability completion isn't necessarily the sequence of greedy highest-logit choices.)
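That last point can be made concrete with a toy example. Here's a made-up two-step "language model" (the probabilities are invented purely for illustration) where greedy decoding picks the locally best token but misses the globally most probable sequence:

```python
# Toy two-step "language model": conditional next-token probabilities.
# All numbers are made up for illustration.
step1 = {"A": 0.5, "B": 0.4, "C": 0.1}
step2 = {
    "A": {"x": 0.34, "y": 0.33, "z": 0.33},  # mass spread thin after A
    "B": {"x": 0.90, "y": 0.05, "z": 0.05},  # mass concentrated after B
    "C": {"x": 0.34, "y": 0.33, "z": 0.33},
}

# Greedy decoding: pick the highest-probability token at each step.
t1 = max(step1, key=step1.get)
t2 = max(step2[t1], key=step2[t1].get)
greedy_seq = (t1, t2)
greedy_prob = step1[t1] * step2[t1][t2]

# Exhaustive search: the actual highest-probability full sequence.
best_seq, best_prob = max(
    (((a, b), step1[a] * step2[a][b]) for a in step1 for b in step2[a]),
    key=lambda kv: kv[1],
)

print("greedy:", greedy_seq, greedy_prob)  # ('A', 'x') with prob 0.17
print("best:  ", best_seq, best_prob)      # ('B', 'x') with prob 0.36
```

Greedy commits to "A" (0.5 > 0.4) and ends up with a 0.17 sequence, while starting from the locally worse "B" yields 0.36, because next-token training never sees this sequence-level tradeoff but sampled-rollout RL does.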