Abstract: Cross-lingual summarization is an important research direction in natural language processing. It aims to generate a summary in a target language from text in a source language, helping people understand and share information across languages. In recent years, with the development of deep learning and pre-training techniques, cross-lingual summarization has made significant progress on high-resource languages. However, because data for low-resource languages such as Tibetan is scarce, research on Tibetan-Chinese cross-lingual summarization is still in its infancy. To promote this research, this paper constructs a dataset for Tibetan-Chinese cross-lingual summary generation, consisting of 2000 samples stored as JSON files. Each JSON file has two keys: text, which holds the news content in Tibetan (the source language), and summary, which holds the news summary in Chinese (the target language). The data was crawled from China Tibetan Netcom. To ensure data quality, during crawling we removed the news agency name, pictures, videos, picture and video descriptions, reporter names, and other irrelevant content, keeping only the body of the news. We then used mature commercial translation tools to translate the Tibetan source-language news summaries into Chinese target-language summaries. To further improve the quality of the dataset, we evaluated the summaries for factual consistency, sufficiency, fluency, and related aspects, and obtained 2000 high-quality samples after screening. The release of this dataset is valuable for advancing research on Tibetan-Chinese cross-lingual summarization.
Keywords: Tibetan-Chinese cross-lingual summarization; Tibetan; low-resource; dataset
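
The sample format described in the abstract (one JSON object with a text key and a summary key) can be illustrated with a minimal Python sketch. The function names, the script-range checks, and the sample content are assumptions made for illustration only. They are not part of the released dataset or its tooling, and the script checks are a rough sanity test rather than the quality evaluation described in the abstract.

```python
import json

# The two keys named in the abstract.
REQUIRED_KEYS = {"text", "summary"}


def parse_sample(raw: str) -> dict:
    """Parse one JSON sample and check that `text` and `summary` are non-empty strings."""
    sample = json.loads(raw)
    if not isinstance(sample, dict):
        raise ValueError("sample must be a JSON object")
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in sorted(REQUIRED_KEYS):
        value = sample[key]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"`{key}` must be a non-empty string")
    return sample


def has_tibetan(s: str) -> bool:
    """True if `s` contains a character from the Unicode Tibetan block (U+0F00 to U+0FFF)."""
    return any("\u0f00" <= ch <= "\u0fff" for ch in s)


def has_chinese(s: str) -> bool:
    """True if `s` contains a CJK Unified Ideograph (U+4E00 to U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in s)


# Hypothetical sample built for this sketch; it is not taken from the dataset.
raw = json.dumps({"text": "བོད་ཡིག", "summary": "藏文"}, ensure_ascii=False)
sample = parse_sample(raw)
print(has_tibetan(sample["text"]), has_chinese(sample["summary"]))
```

A check of this kind could be run over all files before training to catch samples whose source and target fields are swapped or empty.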