GitHub - wb14123/couplet-dataset: Dataset for couplets. A database of 700,000 couplets.

Couplet dataset.

Also available on HuggingFace.

This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

An already fetched and cleaned dataset is also available and can be used directly with the seq2seq model. You can download it here.

The downloaded data contains 5 files:

train/in.txt: The inputs (the first line of each couplet). Each line is one input, with words separated by spaces.
train/out.txt: The outputs (the second line of each couplet). Each line is the output for the same line number in train/in.txt, with words separated by spaces.
test/in.txt: Same format as train/in.txt, but with fewer lines.
test/out.txt: Same format as train/out.txt, but with fewer lines.
vocabs: The vocabulary file. <s> and <\s> are added as the first entries; they are used when training the seq2seq model.
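
The layout above can be read with a few lines of Python. This is only a minimal sketch, not part of the repository: the helper names read_pairs and read_vocab are hypothetical, and it assumes the files are UTF-8 encoded and that vocabs holds one entry per line.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def read_pairs(split_dir: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (input_words, output_words) for one split directory (train or test)."""
    base = Path(split_dir)
    # Assumption: both files are UTF-8 encoded.
    with open(base / "in.txt", encoding="utf-8") as fin, \
         open(base / "out.txt", encoding="utf-8") as fout:
        # Line i of in.txt is paired with line i of out.txt.
        for src, tgt in zip(fin, fout):
            # Words are separated by spaces, so a plain split() is enough.
            yield src.split(), tgt.split()


def read_vocab(path: str) -> List[str]:
    """Return the vocabulary entries in file order.

    Assumption: the file holds one entry per line.
    """
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Calling list(read_pairs("train")) from inside the extracted data directory would give one (input, output) pair of word lists per couplet, and read_vocab("vocabs") would give the vocabulary in the order it appears in the file.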
