DeepSeek V3 is a large language model with 671 billion total parameters, of which 37 billion are activated for each token. It employs a Mixture-of-Experts (MoE) architecture and is pretrained on 14.8 trillion tokens. The model adopts Multi-head Latent Attention (MLA) to reduce inference memory usage and the DeepSeekMoE architecture for economical training. DeepSeek V3 also uses the DualPipe algorithm for pipeline parallelism, which overlaps computation and communication to hide communication overhead. Training relies on an FP8 mixed-precision framework designed to reduce memory and computational costs. The model achieves performance comparable to leading closed-source models while requiring only 2.788 million H800 GPU hours for its full training.
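To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing, the mechanism that lets an MoE model run only a small subset of its parameters for each token. This is an illustration only: the expert count, hidden sizes, and top-k values are placeholder assumptions, not DeepSeek V3's actual configuration, and the gating is a plain softmax top-k rather than the model's specific routing scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy MoE layer: each token is processed by only top_k of n_experts."""
    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        # Select the top_k experts per token; only these experts are computed.
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([4, 64]); only 2 of 8 experts run per token
```

Because only top_k experts execute per token, the compute per token scales with the activated parameters (analogous to the 37 billion activated out of 671 billion total) rather than with the full parameter count.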