Computer Science > Computation and Language

arXiv:2304.08460v1 (cs)

[Submitted on 17 Apr 2023 (this version), latest version 3 Oct 2024 (v3)]

Title: LongForm: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction

Authors: Abdullatif Köksal, Timo Schick, Anna Korhonen, Hinrich Schütze

Abstract: Instruction tuning enables language models to generalize more effectively and better follow user intent. However, obtaining instruction data can be costly and challenging. Prior works employ methods such as expensive human annotation, crowd-sourced datasets with alignment issues, or generating noisy examples via LLMs. We introduce the LongForm dataset, which is created by leveraging English corpus examples with augmented instructions. We select a diverse set of human-written documents from existing corpora such as C4 and Wikipedia and generate instructions for the given documents via LLMs. This approach provides a cheaper and cleaner instruction-tuning dataset and one suitable for long text generation. We finetune T5, OPT, and LLaMA models on our dataset and show that even smaller LongForm models have good generalization capabilities for text generation. Our models outperform 10x larger language models without instruction tuning on various tasks such as story/recipe generation and long-form question answering. Moreover, LongForm models outperform prior instruction-tuned models such as FLAN-T5 and Alpaca by a large margin. Finally, our models can effectively follow and answer multilingual instructions; we demonstrate this for news generation. We publicly release our data and models: this https URL.
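
The data-construction idea in the abstract (keep a human-written document as the target output and ask an LLM to write an instruction for it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the prompt wording, the `min_chars` filter, the `InstructionExample` record, and the `toy_llm` stand-in are all hypothetical, and a real setup would pass in a function that calls an actual model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical prompt wording; the abstract does not give the actual template.
PROMPT_TEMPLATE = (
    "Write an instruction for which the following text would be a good answer.\n\n"
    "Text: {document}\n\nInstruction:"
)


@dataclass
class InstructionExample:
    instruction: str  # generated by the LLM
    output: str       # the original human-written document, left unchanged


def build_examples(
    documents: Iterable[str],
    llm: Callable[[str], str],
    min_chars: int = 50,  # assumed length filter, not taken from the paper
) -> List[InstructionExample]:
    """Pair each sufficiently long document with an LLM-generated instruction."""
    examples = []
    for doc in documents:
        doc = doc.strip()
        if len(doc) < min_chars:
            continue  # skip fragments too short to serve as long-form targets
        instruction = llm(PROMPT_TEMPLATE.format(document=doc)).strip()
        if instruction:
            examples.append(InstructionExample(instruction=instruction, output=doc))
    return examples


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model call: echoes the first five words of the document."""
    text = prompt.split("Text: ", 1)[1].rsplit("\n\nInstruction:", 1)[0]
    return "Write a passage that begins: " + " ".join(text.split()[:5])


if __name__ == "__main__":
    docs = [
        "short",
        "Preheat the oven to 180 C and grease a round cake tin before mixing the batter.",
    ]
    for example in build_examples(docs, toy_llm):
        print(example.instruction)
```

Because the output side is always an existing human-written text, only the instruction is machine-generated, which is the property the abstract credits for a cheaper and cleaner dataset.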

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2304.08460 [cs.CL]
	(or arXiv:2304.08460v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2304.08460

Submission history

From: Abdullatif Köksal
[v1] Mon, 17 Apr 2023 17:36:35 UTC (690 KB)
[v2] Wed, 14 Feb 2024 18:00:33 UTC (690 KB)
[v3] Thu, 3 Oct 2024 15:46:13 UTC (1,352 KB)
