Couplet dataset.
Also available on HuggingFace.
This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.
This dataset contains more than 700,000 couplets.
scrapy runspider sina_spider.py
The spider stores the fetched data in ./output/.
A fetched and cleaned copy of the dataset is also available and can be used directly with the seq2seq model. You can download it here.
The downloaded data contains 5 files:
train/in.txt: The input (first line) of each couplet. Each line is one input. Words are separated by spaces.
train/out.txt: The output (second line) of each couplet. Each line is the output for the same line in in.txt. Words are separated by spaces.
test/in.txt: Same as train/in.txt but with less data.
test/out.txt: Same as train/out.txt but with less data.
vocabs: Vocabulary file. <s> and <\s> are added as the first entries, which are used when training the seq2seq model.
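The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the project itself: the function name `load_pairs` and the directory name `couplet` are hypothetical, and the files are assumed to be UTF-8 encoded with the downloaded archive extracted so that `train/` and `test/` sit under one directory.

```python
from pathlib import Path


def load_pairs(data_dir, split="train"):
    """Read in.txt and out.txt line by line and pair them up.

    Returns a list of (input_tokens, output_tokens) tuples, where each
    element is the list of space-separated words on that line.
    UTF-8 encoding is an assumption, not stated in the README.
    """
    base = Path(data_dir) / split
    pairs = []
    with open(base / "in.txt", encoding="utf-8") as f_in, \
         open(base / "out.txt", encoding="utf-8") as f_out:
        # Line i of out.txt is the answer to line i of in.txt.
        for first, second in zip(f_in, f_out):
            pairs.append((first.split(), second.split()))
    return pairs
```

For example, `load_pairs("couplet", "test")` would return the pairs from `couplet/test/in.txt` and `couplet/test/out.txt`, assuming the data was extracted to a directory named `couplet`.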