This is an interesting study. A similar permutation approach already appears in the Taylorformer paper (https://arxiv.org/pdf/2305.19141v1). There, the authors use a Transformer decoder for continuous processes such as time series: during training, each sequence is randomly shuffled, each element keeps its positional encoding, and the model is trained with a log-likelihood objective on the shuffled sequence.
In that work, the permutation helps with interpolation, extrapolation, and irregularly sampled data. They also show it helps with 'consistency', i.e., the MSE is roughly the same regardless of the order in which the sequence is generated.
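For concreteness, here is a minimal sketch of that shuffled-sequence likelihood step as I understand it (not the authors' code): `model` is a hypothetical decoder that, given past (value, position) pairs and the positions of the next targets, outputs a Gaussian mean and log-variance per step.

```python
# Minimal sketch (assumed interface, not the Taylorformer implementation):
# each sequence is randomly permuted, values stay paired with their original
# positional encodings, and the loss is the NLL of the permuted sequence.
import torch

def shuffled_nll_step(model, x, t):
    """x: (batch, seq_len) values; t: (batch, seq_len) timestamps/positions."""
    batch, seq_len = x.shape
    perm = torch.randperm(seq_len, device=x.device)  # fresh random order each step
    x_perm, t_perm = x[:, perm], t[:, perm]          # shuffle values and positions together

    # Hypothetical decoder call: condition on previous (value, position) pairs
    # plus the position of the next target; predict a Gaussian over the next value.
    mean, log_var = model(x_perm[:, :-1], t_perm[:, :-1], target_pos=t_perm[:, 1:])

    # Negative log-likelihood of the shuffled sequence (Gaussian assumption).
    target = x_perm[:, 1:]
    nll = 0.5 * (log_var + (target - mean) ** 2 / log_var.exp())
    return nll.mean()
```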
The idea of permuting the sequence order also appears in the Transformer Neural Process paper: https://arxiv.org/pdf/2207.04179.
What might this paper add to our understanding or application of these ideas?