So I discovered them in the tweets.
According to the paper "Polyak Parameter Ensemble: Exponential Parameter Growth Leads to Better Generalization", Polyak Ensemble is meant to improve generalization.
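To make the idea a bit more concrete for myself, here is a minimal sketch of classical Polyak-style weight averaging, where the parameters visited during training are averaged into one set of weights. This is only the textbook running average, not necessarily the ensemble scheme the paper proposes; the function name `polyak_average` and the toy `trajectory` are made up for illustration.

```python
def polyak_average(avg, params, step):
    """Update a running mean of parameter vectors; step counts from 1."""
    # Incremental mean: new_avg = avg + (params - avg) / step
    return [a + (p - a) / step for a, p in zip(avg, params)]


# Hypothetical parameter snapshots taken after three training steps.
trajectory = [[1.0, 4.0], [2.0, 2.0], [3.0, 0.0]]

avg = [0.0, 0.0]
for step, params in enumerate(trajectory, start=1):
    avg = polyak_average(avg, params, step)

print(avg)  # running mean of the three snapshots
```

The averaged weights are used for evaluation, while training itself continues on the ordinary, non-averaged weights.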
Another one from the tweet is gradient accumulation. I don’t know if I’ll need it, since I usually make the batch size as large as possible.
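For reference, gradient accumulation means summing the gradients of several small micro-batches before doing one optimizer step, so the update matches what a single larger batch would give. Below is a rough sketch in plain Python on a toy 1-D model `y = w * x`; the helper names `mse_grad` and `accumulated_grad` and the data are assumptions for illustration, and a real training loop would rely on the framework's own autograd instead.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equally sized micro-batches."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal micro-batches"
    num_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(num_steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

full = mse_grad(w, xs, ys)                             # one batch of 4
acc = accumulated_grad(w, xs, ys, micro_batch_size=2)  # two micro-batches of 2
print(full, acc)
```

Both gradients agree, which is why accumulation mostly pays off when the batch size I want does not fit in memory at once.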