Study EECC - Insights into Electrical, Electronic, Communication and Computer Engineering: January 2025

2025/01/10

Checkpoints in Machine Learning

Q: What is a checkpoint in machine learning and deep learning?

A: A checkpoint is an intermediate copy of the model saved during training, so that the progress made so far is preserved.

As large language models (LLMs) have developed, models have grown steadily larger, so making good use of model checkpoints has become an important research topic. Some machine learning researchers are investigating how to resume an interrupted training run from a saved checkpoint.

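In deep learning, a checkpoint (檢查站/關口, literally an "inspection station" or "gate") is a model saved during the training process.

Because models keep growing as large language models (LLMs) develop, retaining checkpoints matters, and some researchers are studying how to continue unfinished training from a checkpoint after training has been interrupted.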
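The save-and-resume idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular framework does it: the function names (`save_checkpoint`, `load_checkpoint`, `train`, `demo`) are hypothetical, JSON stands in for a real framework's serializer, and the "model" is a single weight trained with momentum gradient descent on a toy objective. The point it tries to show is that a checkpoint should hold everything needed to continue, here the step counter and the optimizer's velocity as well as the weight.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weight, velocity):
    """Write the training state to `path` via a temp file plus rename."""
    state = {"step": step, "weight": weight, "velocity": velocity}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # Rename last, so an interruption never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0, "velocity": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, stop_after=None, save_every=10,
          lr=0.1, momentum=0.9):
    """Minimize (w - 3)^2 with momentum gradient descent (toy assumption).

    Training starts from the checkpoint at `path` if one exists.
    `stop_after` simulates an interruption at that step.
    """
    state = load_checkpoint(path)
    step, w, v = state["step"], state["weight"], state["velocity"]
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            return w  # simulated crash: progress since the last save is lost
        grad = 2.0 * (w - 3.0)
        v = momentum * v - lr * grad
        w = w + v
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(path, step, w, v)
    return w


def demo():
    """Compare an uninterrupted run with an interrupted-then-resumed run."""
    with tempfile.TemporaryDirectory() as workdir:
        full = train(os.path.join(workdir, "full.json"), total_steps=50)

        ckpt = os.path.join(workdir, "resumed.json")
        train(ckpt, total_steps=50, stop_after=25)    # interrupted at step 25
        resumed_from = load_checkpoint(ckpt)["step"]  # last save was step 20
        resumed = train(ckpt, total_steps=50)         # continue from step 20
        return full, resumed, resumed_from
```

Calling `demo()` returns the final weight of both runs plus the step the second run resumed from. Because the velocity is saved alongside the weight, the resumed run retraces the same trajectory as the uninterrupted one; only the steps after the last save (21 to 25 here) have to be redone.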

References:

Machine Learning Checkpointing (Deepchecks)

Resume Training from Checkpoint Network (MATLAB)

Rojas, E., Kahira, A. N., Meneses, E., Gomez, L. B., & Badia, R. M. (2020). A study of checkpointing in large scale training of deep neural networks. arXiv preprint arXiv:2012.00825.

Xiang, L., Lu, X., Zhang, R., & Hu, Z. (2024, May). SSDC: A Scalable Sparse Differential Checkpoint for Large-scale Deep Recommendation Models. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1-5). IEEE.