Polyak averaging and gradient accumulation in Keras

I discovered both of them through tweets.

According to the paper "Polyak Parameter Ensemble: Exponential Parameter Growth Leads to Better Generalization", the point of a Polyak ensemble is better generalization.
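To make the idea concrete, here is a minimal sketch of classic Polyak averaging in plain Python: run gradient descent as usual, but also keep a running mean of the weights seen along the way. This is the textbook form of the technique, not necessarily the exact procedure from the paper, and the names `descend_with_polyak` and `grad_fn` are made up for this illustration (they are not Keras API).

```python
def descend_with_polyak(grad_fn, w0, lr, steps):
    """Run plain gradient descent and keep a running mean of the iterates.

    Hypothetical helper for illustration only, not part of Keras.
    Returns (last_weights, averaged_weights).
    """
    w = list(w0)
    avg = list(w0)
    for t in range(1, steps + 1):
        grad = grad_fn(w)
        # Ordinary gradient step on the "live" weights.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        # Incremental mean of w_1..w_t: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
        avg = [ai + (wi - ai) / t for ai, wi in zip(avg, w)]
    return w, avg


# Toy problem (an assumption for the demo): minimize 0.5 * (w - 3)^2.
def toy_grad(w):
    return [w[0] - 3.0]


last, averaged = descend_with_polyak(toy_grad, [0.0], lr=0.1, steps=3)
print("last weights:", last)
print("averaged weights:", averaged)
```

The averaged weights are what you would evaluate or save; the live weights keep training as before. In practice an exponential moving average is often used instead of a plain mean, so that early iterates fade out.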

The other technique in the tweet is gradient accumulation. I don't know if I'll need it, since I usually make the batch size as large as possible.
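For reference, gradient accumulation means computing gradients on several small micro-batches, summing them, and applying a single weight update at the end, which imitates a batch larger than what fits in memory. The sketch below is plain Python rather than Keras, and it assumes a one-parameter linear model with a mean squared error loss and equal-size micro-batches; `mse_gradient` and `accumulated_gradient` are hypothetical names for this example.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Average the gradients of consecutive micro-batches.

    Hypothetical helper for illustration; assumes len(data) is a
    multiple of micro_batch_size so every micro-batch has equal size.
    """
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # Accumulate only; the weights are not updated inside this loop.
        total += mse_gradient(w, chunk)
    return total / len(chunks)


# Toy data (an assumption for the demo): y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5

full = mse_gradient(w, data)                # one batch of 4
accum = accumulated_gradient(w, data, 2)    # two micro-batches of 2
print("full-batch gradient:", full)
print("accumulated gradient:", accum)
```

Under these assumptions the two gradients match, so one update with the accumulated gradient is equivalent to one update on the full batch. That equivalence can break for layers whose behavior depends on the batch itself, such as batch normalization.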