Abstract: Machine translation is a central task in natural language processing, and it plays an increasingly important role in political, economic, and cultural exchange. For high-resource language pairs such as Chinese and English, machine translation quality has approached that of human translation. For low-resource languages such as Tibetan, however, translation accuracy still lags because large-scale, publicly available parallel corpora are scarce. In Chinese-Tibetan translation, existing systems handle phrases particularly poorly: phrases such as abbreviations are short yet carry dense semantic information, so their translations are often inaccurate. To help translation models better capture and convey this semantic information, this paper constructs a Chinese-Tibetan phrase translation corpus based on semantic information augmentation. The corpus contains 7,000 Chinese-Tibetan phrase translation entries. The original Chinese-Tibetan phrase data were obtained from the Tibetan Language and Writing Network of Tibet. The augmented semantic information consists of a Chinese explanation of each Chinese phrase and example sentences containing the phrase; this content was generated by large language models and then professionally proofread. The release of this dataset is intended to support the development of Chinese-Tibetan information processing.
Keywords: machine translation; low-resource; Tibetan; phrase; semantic information; dataset