Polyak averaging and gradient accumulation in the Keras – Passion is like genius; a miracle.

So I discovered them in the tweets.

Gradient accumulation consists of updating your model weights only every N training steps, using the average of the gradients of the past N steps. It's a nice feature to enable faster convergence when you have a very small batch size, like 4 or 8 (which leads to noisy gradients). pic.twitter.com/7LZc65hZxX
— François Chollet (@fchollet) December 21, 2023

As the paper Polyak Parameter Ensemble: Exponential Parameter Growth Leads to Better Generalization shows, Polyak Ensemble is for better generalization.

Another in the tweet is gradient accumulation. I don’t know if I’ll need it since I often makes the batch size as large as possible.