Ancient-Modern Chinese Dataset

Introduction

We create a new large-scale Ancient-Modern Chinese parallel corpus containing 1.24M bilingual pairs. To the best of our knowledge, this is the first large-scale, high-quality Ancient-Modern Chinese dataset; it includes 984,611 pairs in the training set, 48,980 pairs in the validation set, and 50,000 pairs in the test set.
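As a minimal sketch of working with a parallel corpus split like the ones above, the snippet below parses sentence pairs from lines of text. Note that the actual file names and on-disk format of this dataset are not specified here; the tab-separated `ancient<TAB>modern` layout is an assumption for illustration only.

```python
def load_pairs(lines):
    """Parse 'ancient<TAB>modern' lines into (ancient, modern) tuples.

    Assumes one sentence pair per line, tab-separated (an assumption,
    not the dataset's documented format). Blank lines are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ancient, modern = line.split("\t", 1)
        pairs.append((ancient, modern))
    return pairs


# Toy example with two hand-written pairs (not taken from the dataset).
sample = [
    "學而時習之\t学习并且时常温习它",
    "",
    "溫故而知新\t温习旧知识从而获得新的理解",
]
pairs = load_pairs(sample)
print(len(pairs))   # number of parsed sentence pairs
print(pairs[0][0])  # ancient side of the first pair
```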

Download

Paper

Ancient-Modern Chinese Translation with a New Large Training Dataset, TALLIP 2019. [Paper]

An Automatic Evaluation Metric for Ancient-Modern Chinese Translation, Neural Computing and Applications (NCAA) 2020.

AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation, arXiv 2020. [Paper]