Task Adapters Generation from Instructions (2024)

Huanxuan Liao1,2, Shizhu He1,2, Yao Xu1,2, Yuanzhe Zhang1,2,
Yanchao Hao3, Shengping Liu4, Kang Liu1,2, Jun Zhao1,2
1 The Key Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
3 Platform and Content Group, Tencent, Beijing, China 4 Unisound, Beijing, China
liaohuanxuan2023@ia.ac.cn {shizhu.he, yao.xu, kliu, jzhao}@nlpr.ia.ac.cn
Corresponding author

Abstract

Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT). However, IFT still relies heavily on instance training of extensive task data, which greatly limits the adaptability of LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount. Contrary to LLMs, humans acquire skills and complete tasks not merely through repeated practice but also by understanding and following instructional guidelines. This paper is dedicated to simulating human learning to address the shortcomings of instance training, focusing on instruction learning to enhance cross-task generalization. Within this context, we introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model in a parameter generation manner based on the given task instructions, without retraining for unseen tasks. Specifically, we utilize knowledge distillation to enhance the consistency between TAGI developed through Learning with Instruction and task-specific models developed through Training with Instance, by aligning the labels, output logits, and adapter parameters between them. TAGI is endowed with cross-task generalization capabilities through a two-stage training process that includes hypernetwork pretraining and finetuning. We evaluate TAGI on the Super-Natural Instructions and P3 datasets. The experimental results demonstrate that TAGI can match or even outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements. Our code will be available at https://github.com/Xnhyacinth/TAGI.

1 Introduction

Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT), which describes different tasks in the same natural language format [3; 6; 22]. However, IFT still relies heavily on instance training of extensive task data {(Description, [Demonstrations], Source, Target)} [36; 38], which faces significant limitations in adapting LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount.

Therefore, for better cross-task generalization, the "zero-shot" learning ability of LLMs is crucial for real-world applications: models learned with instructions can achieve non-trivial performance on unseen tasks with just a single instruction that provides a comprehensive description of the task (e.g., "You will be given sentences in which your task is to recognize the name of a person."). Traditionally, achieving this capability involves meta-training the model by associating each input with specific task instructions [20; 36]. For example, GPT-3 [24] has demonstrated strong "zero-shot" capabilities through meta-training. However, these methods heavily depend on the foundation model's abilities and are inefficient for various unseen tasks [21; 43], as they require reprocessing extensive task instructions and some supplementary task data (e.g., examples from few-shot instances) for each input (see the top of Figure 1).

[Figure 1: Comparison of traditional meta-training (top) and TAGI (bottom).]

In recent years, researchers have begun to explore meta-learning to enhance the cross-task generalization capabilities of LLMs, aiming to construct flexible, reusable and robust task-specific models [1; 33]. For example, task-specific models such as Adapter [11], LoRA [12], and Prefix [17] have been constructed by a hypernetwork [8]. This approach significantly enhances task generalization by processing instructions efficiently, reducing redundant computations [26]. However, these methods heavily depend on a substantial corpus of training instances, which can hinder their capacity to efficiently learn and construct task-specific models based on provided instructions [13].

In fact, contrary to LLMs, humans acquire skills and complete tasks not only through repeated practice but also by understanding and following instructional guidelines [15]. For example, a tourist with basic knowledge of riding vehicles can easily learn to use new ones abroad for the first time with the help of travel guides. This paper aims to mimic the way humans learn skills by understanding instructions. This shift represents an evolution in task model construction, transitioning from traditional instance training to instruction learning. By providing task instructions, the novel paradigm offers an automated solution for generating task-specific adapters and seamlessly integrating them into the base model. This approach aims to streamline the development of task-specific models while enhancing their ability to generalize across diverse tasks with instructions.

Guided by this goal, we introduce Task Adapters Generation from Instructions (TAGI), which converts instructions into task-specific adapters using a hypernetwork. Under the knowledge distillation framework [10; 35], we adapt models to the "Learning with Instruction" paradigm in a manner analogous to the "Training with Instance" paradigm. TAGI enhances the consistency between the task-specific model $\theta_k$ (acting as the teacher) and the vanilla LLM $\theta_0$ combined with the generated task adapters $\Delta_k$ (acting as the student) (see the bottom of Figure 1). This alignment is achieved not only through instance training but also by incorporating parameter learning for task-specific models based on instructions. Specifically, we align the student across the two paradigms, matching not just the targets and logits but also the adapters' parameters via an L2 regularization over instructions, which enhances the understanding of instructions and the ability to generate more effective task-specific adapters. Moreover, TAGI endows the model with task generalization capabilities through a two-stage training process: hypernetwork pretraining on standard text pretraining data (e.g., C4 [28]), followed by finetuning on meta-training tasks. This allows it to generalize effectively across unseen tasks without sacrificing performance.

We evaluate TAGI on the Super-Natural Instructions (SNI) [36] and P3 [29] datasets. Experimental results demonstrate its ability to effectively generate adapters for unseen tasks, surpassing meta-trained models by 2% on SNI and 5% on P3, while reducing computational demands by 60%, and outperforming other hypernetwork models by 7%. Notably, our method requires no additional parameter updates or gradient back-propagation, and it avoids the inefficiency of repeatedly encoding instructions during inference. We summarize our contributions as follows:

  • We propose a novel model construction paradigm, Learning with Instruction, that imitates human learning abilities for the cross-task generalization of LLMs. To the best of our knowledge, this is the first time a task-specific model has been generated via instruction learning, with its capabilities and parameters distilled from a teacher model trained via instance learning.

  • We use a knowledge distillation framework to develop task-specific models within the instruction learning paradigm. By comprehensively aligning model parameters, the TAGI method improves the model's ability to understand instructions and solve unseen tasks more accurately and efficiently.

  • Comprehensive quantitative and qualitative assessments have highlighted the effectiveness of TAGI on two publicly available large-scale instruction datasets, with lower inference costs.

2 Related Work

TAGI draws inspiration from previous research on instruction following, hypernetworks and knowledge distillation. In this section, we will delve into the pioneering work in these areas.

Instruction Following is often used to evaluate the cross-task generalization of LLMs, and it is dedicated to handling any task described in natural language. Recent findings suggest that additional finetuning of LLMs with instructions substantially improves their zero-shot capabilities [6; 37; 38]. Moreover, large-scale multi-task meta-training has been shown to equip models with the ability to address new tasks in zero- or few-shot scenarios, facilitated by standard task formats and prompts [29; 43] alongside concise task instructions and select examples [23; 36]. However, the instructions and examples can significantly escalate the computational burden compared to task-specific models. Existing works attempt to mitigate this issue by creating adapters to separately process instructions and examples [13; 41], at the cost of reduced performance. To overcome these limitations, we introduce a new paradigm that draws on instruction-based learning, simulating instance training to enhance the perception and processing capabilities of LLMs for handling unseen tasks.

Hypernetworks [8; 30] are neural networks that generate weights for other neural networks [4]; they are designed to use fewer parameters to dynamically build task-specific models [9; 32]. Notable works such as HyperTuning [26], HINT [13], and Hypter [41] have all adopted hypernetworks to convert task instructions and demonstrations into adapters for LLMs, and MEND [5] utilizes hypernetworks to compress demonstrations into distilled vectors. Although they all avoid processing lengthy instructions repeatedly and utilize adapters to make training and testing more cost-effective [18], they still incur a performance loss compared to meta-training [7]. The proposed method TAGI incorporates hypernetworks to generate task-specific adapters that are seamlessly integrated into LLMs. Compared to existing hypernetwork-based models, TAGI not only trains at the instance level but also incorporates knowledge distillation to supervise the adapters generated by the hypernetwork, thereby achieving both efficiency and effectiveness.

Knowledge Distillation is a technique in which a smaller model (student) learns to mimic the predictions of a larger model (teacher), aiming to retain performance while reducing computational resources [10]. Indeed, the application of knowledge distillation is the essential difference between the method proposed in this paper and other hypernetwork-based methods such as HINT [13] and Hypter [41]. Recently, some works [31] have utilized knowledge distillation to finetune small language models such as T5 [28], enabling them to behave as if pre-prompted without being given any prompts. Compared with typical knowledge distillation methods for LLMs, our method TAGI further utilizes model parameter alignment and aims to mimic another paradigm of human skill learning. We not only calculate the Kullback–Leibler (KL) divergence [14] between teacher and student models [10], but also compute an L2 regularization between the adapter generated via instruction learning and the task-specific model obtained via instance training.

3 Methods

3.1 Problem Setting

Cross-task Generalization: Given a set of tasks $\mathcal{T} = \{\mathcal{T}_1, \ldots, \mathcal{T}_{|\mathcal{T}|}\}$, where each task $\mathcal{T}_i$ contains a set of (source, target) samples $\mathcal{D}_i = \{(s_1, t_1), \ldots, (s_n, t_n)\}$, we categorize these tasks into three distinct non-overlapping groups for validating out-of-distribution generalization: meta-train ($\mathcal{T}_{train}$), meta-valid ($\mathcal{T}_{valid}$), and meta-test ($\mathcal{T}_{test}$), assuming all tasks adhere to a text-to-text format. For example, $\mathcal{T}_{train}$ comprises tasks like translation and question answering, while $\mathcal{T}_{valid}$ and $\mathcal{T}_{test}$ encompass tasks such as paraphrasing and natural language inference, respectively. Within $\mathcal{T}_{train}$, the goal is to utilize the data for training and transfer knowledge to facilitate learning to resolve the test tasks. For all methods discussed, aside from the original unsupervised pretraining of the language model backbone on separate corpora, model learning primarily takes place through multi-task training on $\mathcal{T}_{train}$.

3.2 Task Adapters Generation from Instructions (TAGI)

In this section, we introduce the detailed method of TAGI. For each (unseen) task, TAGI consists of two core components: a hypernetwork (§3.2.1) that receives task instructions and generates parameter-efficient adapters, and a task-specific model that combines the vanilla LLM with the adapters generated by the hypernetwork.

Unlike traditional meta-training methods, we transition from training with instances to learning with instructions, which not only addresses efficiency issues at the instance level but also incorporates parameter alignment for the task-specific model at the instruction level. The complete process is shown in Figure 2. We initially train the LoRA modules (§3.2.2) on various upstream (seen) tasks using the meta-train datasets ($\mathcal{T}_{train}$): for $N$ distinct upstream tasks, we independently train $N$ LoRA modules, with each module denoted as $\Delta_i$ for task $\mathcal{T}_i \in \mathcal{T}$, presumed to represent the optimal model for its respective task. Subsequently, TAGI builds proprietary models for downstream (unseen) tasks. Its training process comprises two primary phases: hypernetwork pretraining (§3.2.3) and hypernetwork finetuning (§3.2.4), which encompasses distillation and alignment.

[Figure 2: The complete TAGI training process; the right segment illustrates hypernetwork pretraining.]


3.2.1 Hypernetwork for Converting Instructions into LoRA

A pivotal element of our model is the hypernetwork that converts task instructions (descriptions and demonstrations) into a parameter-efficient module. Our hypernetwork comprises two crucial components. The encoder, derived from the vanilla LLM (we find that re-using the encoder from the vanilla LLM works well [13]), is designed to minimize encoding biases by converting task instructions into a continuous contextual representation; this representation is then fused with the LLM input and concatenated with the encoded input for the decoder. The adapter generator, a lightweight and efficient two-layer MLP, converts the encoded instructions into parameter-efficient modules.

Encoder: Prior studies simply concatenated encoded instructions with inputs, overlooking the interactions between them. To address this, we integrate a hierarchical cross-attention layer into the encoder of the LLM to refine the input representation with embedded instruction details. Specifically, for an input $x$ and its corresponding task instruction $i_x$, we first employ the encoder within the hypernetwork to encode the instruction into representations $\textbf{I}_x \in \mathbb{R}^{s \times d}$. Then, we feed $x$ into the model and obtain the output representation $\textbf{S}_l$ from the self-attention sublayer in the $l$-th layer. Finally, $\textbf{S}_l$ is processed through the $l$-th cross-attention layer, resulting in a text representation enriched with instruction information:

$\textbf{F}_l = \text{CrossAttentionLayer}_l(\textbf{S}_l, \textbf{I}_x)$   (1)

where $\text{CrossAttentionLayer}_l$ conducts multi-head attention on the query, key, and value matrices, followed by a residual connection and layer normalization. The final input to the decoder is the concatenation of the encoded instruction and the encoded fusion input, i.e., $(\textbf{I}_x; \textbf{F}_l)$.
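To make the fusion step concrete, below is a minimal PyTorch sketch of Eq. (1); the module name, dimensions, and head count are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class InstructionFusionLayer(nn.Module):
    """Hypothetical sketch of Eq. (1): fuse the self-attention output S_l of the
    l-th encoder layer with the encoded instruction I_x via multi-head
    cross-attention, followed by a residual connection and LayerNorm."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, S_l: torch.Tensor, I_x: torch.Tensor) -> torch.Tensor:
        # Query: input representation S_l (B, T, d); key/value: instruction I_x (B, s, d).
        attn_out, _ = self.cross_attn(query=S_l, key=I_x, value=I_x)
        F_l = self.norm(S_l + attn_out)  # residual connection + layer norm
        return F_l

# The decoder then attends over the concatenation (I_x; F_l):
#   decoder_memory = torch.cat([I_x, F_l], dim=1)
```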

Adapter Generator: Considering efficiency and effectiveness, we utilize a two-layer multi-layer perceptron (MLP) to generate parameter-efficient modules (e.g., LoRA) from the encoded instruction. To differentiate between the query $\mathcal{Q}$ and value $\mathcal{V}$ matrices as well as the layers, we introduce layer ids $\text{idx}_l^{\{\mathcal{Q},\mathcal{V}\}} \in \{0, \ldots, 2 \times \#\text{blocks}\}$ as positional information. We use a unique network for each layer and share it between $\mathcal{Q}$ and $\mathcal{V}$ (i.e., one network generates the LoRA of a given layer).

$\text{LoRA}_l^{\{\mathcal{Q},\mathcal{V}\}} = \text{MLP}_l(\textbf{I}_{x_k};\ \text{idx}_l^{\{\mathcal{Q},\mathcal{V}\}} \mid \text{idx}_l^{\mathcal{Q}} = 2l,\ \text{idx}_l^{\mathcal{V}} = 2l+1)$   (2)

where $\text{LoRA}_l^{\mathcal{Q}}$ and $\text{LoRA}_l^{\mathcal{V}}$ are the $l$-th layer LoRA modules of $\mathcal{Q}$ and $\mathcal{V}$, respectively.
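The following is a hedged sketch of the adapter generator in Eq. (2), assuming a mean-pooled instruction encoding and a learned layer-id embedding; the exact pooling, hidden size, and output parameterization may differ from the paper's implementation.

```python
import torch
import torch.nn as nn

class AdapterGenerator(nn.Module):
    """Hypothetical sketch of Eq. (2): a per-layer two-layer MLP maps the pooled
    instruction encoding, conditioned on a layer-id embedding (idx = 2l for Q,
    2l+1 for V), to flattened LoRA matrices; the network is shared between Q and V."""

    def __init__(self, d_model: int = 1024, d_hidden: int = 512,
                 n_blocks: int = 24, rank: int = 32):
        super().__init__()
        self.id_embed = nn.Embedding(2 * n_blocks, d_model)  # positional layer ids
        out_dim = 2 * d_model * rank  # A (d x r) and B (r x d), flattened together
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, out_dim)
        )

    def forward(self, instr_enc: torch.Tensor, idx: int) -> torch.Tensor:
        # instr_enc: mean-pooled instruction representation of shape (d_model,)
        h = instr_enc + self.id_embed(torch.tensor(idx))
        return self.mlp(h)  # flattened LoRA parameters for one (layer, Q/V) slot
```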

3.2.2 LoRA Tuning for Task-specific Models

LoRA [12] efficiently reduces the number of trainable parameters by decomposing the update of the LLM's attention weight matrix (denoted as $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$) into low-rank matrices. Specifically, LoRA updates the weight matrix as $\mathbf{W}_0 + \delta\mathbf{W} = \mathbf{W}_0 + \mathbf{A}\mathbf{B}$, with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$ being trainable low-rank matrices of rank $r$, significantly smaller than $d$ and $k$. We finetune a robust baseline to derive the LoRA parameters $\Delta_i$ of the task-specific model for the $i$-th task, facilitating LLM instruction learning and parameter alignment. SNI is categorized into 60 types based on task type, while P3 encompasses 36 categories, corresponding to 60 and 36 parameter modules, respectively.
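As a reference point for what the hypernetwork must produce, here is a minimal sketch of a LoRA-augmented linear layer. In TAGI the matrices A and B would be filled in from the adapter generator's output rather than trained directly; the zero/near-zero initialization shown is the usual LoRA convention, not a detail taken from this paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear map: the frozen base weight W_0
    is updated as W_0 + A @ B, with A in R^{d x r} and B in R^{r x k}, r << d, k."""

    def __init__(self, W_0: torch.Tensor, rank: int = 32):
        super().__init__()
        d, k = W_0.shape
        self.W_0 = nn.Parameter(W_0, requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(d, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, k))        # zero-init: delta starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ W_0 + x @ A @ B, i.e., base output plus low-rank update.
        return x @ (self.W_0 + self.A @ self.B)
```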

3.2.3 Hypernetwork Pretraining for Preliminary Generalization

Previous research [5; 26] has demonstrated that pretraining hypernetworks can substantially improve the model's cross-task generalization capabilities. Following HINT [13], we pretrain the hypernetwork on C4 [28] before finetuning it on a diverse multi-task prompt dataset. As illustrated in the right segment of Figure 2, given an input sequence, we partition it into randomly sized segments $a$, $b$, and $c$, where $a$ is fed into the hypernetwork, $b$ into the LLM, and $c$ is the segment to predict. During this stage, training is conducted by minimizing the cross-entropy loss $\mathcal{L}_{\text{pred}}$, ensuring that the hypernetwork learns to recognize instructions and thus enhance generalization ability.

$\mathcal{L}_{\text{pred}} = -\log P_{(\text{LLM} + \text{Hypernetwork}(a))}(c \mid b)$   (3)
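A small sketch of this segmentation scheme follows; the cut-point sampling is an assumption, as the paper only states that the segments are randomly sized.

```python
import random

def split_for_hypernetwork_pretraining(token_ids: list[int]):
    """Sketch of the C4 pretraining setup: split a token sequence into random
    segments a (hypernetwork input), b (LLM input), and c (prediction target).
    Uniform cut points are illustrative; the actual sampling scheme may differ."""
    n = len(token_ids)
    i = random.randint(1, n - 2)          # end of segment a
    j = random.randint(i + 1, n - 1)      # end of segment b
    a, b, c = token_ids[:i], token_ids[i:j], token_ids[j:]
    return a, b, c

# Training then minimizes -log P_{LLM + Hypernetwork(a)}(c | b), i.e., the
# hypernetwork consumes a, the LLM consumes b, and the loss is computed on c.
```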

3.2.4 Hypernetwork Finetuning for Instruction Learning

At this stage, TAGI is finetuned on a multi-task prompt dataset, enabling it to learn to generate optimal parameters from task instructions and thereby generalize effectively to future unseen tasks. Analogous to the pretraining phase, the task instructions (along with some few-shot samples) replace $a$, the main input replaces $b$, and the target replaces $c$. In each iteration, the hypernetwork generates LoRA parameters and encodes the instructions; the generated LoRA is inserted into the model as a parameter-efficient module, and the encoded instructions are integrated with the encoder's embeddings for information fusion and concatenated with the fused encoded input during decoding. Beyond the standard $\mathcal{L}_{\text{pred}}$, we employ knowledge distillation for instruction learning: a strong baseline that combines the complete task instructions with the input serves as the teacher, while the model incorporating the generated LoRA parameters with the input acts as the student. The KL divergence $\mathcal{L}_{\text{kl}}$ measures the discrepancy in word probability distributions between the two models as an implicit learning outcome, and the MSE loss $\mathcal{L}_{\text{ins}}$ measures the difference between the generated parameters and those of the task-specific parameter-efficient modules as an explicit intermediate learning result. The finetuning objective is formulated as follows:

$\mathcal{L}_{\text{ins}} = \text{MSE}(\Delta_i, \text{Hypernetwork}(a))$   (4)
$\mathcal{L}_{\text{kl}} = \text{KL}\left(P_{(\text{LLM}+\Delta_i)}(x \mid (a;b)) \,\|\, P_{(\text{LLM}+\text{Hypernetwork}(a))}(x \mid b)\right)$   (5)
$\mathcal{L}_{\text{finetune}} = \mathcal{L}_{\text{pred}} + \lambda_1 \mathcal{L}_{\text{kl}} + \lambda_2 \mathcal{L}_{\text{ins}}$   (6)

where $a \in \mathcal{T}_i$, $\Delta_i$ is the optimal LoRA module of the $i$-th task, and $\lambda_1$ and $\lambda_2$ are hyper-parameters controlling the importance of distillation during finetuning.
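Putting Eqs. (4)–(6) together, the following sketch shows how the three terms could be combined in PyTorch; the tensor shapes, the flattening of LoRA parameters into vectors, and the default loss weights are assumptions for illustration.

```python
import torch.nn.functional as F

def finetune_loss(student_logits, teacher_logits, target_ids,
                  generated_lora, task_lora, lambda1=1.0, lambda2=1.0):
    """Sketch of Eq. (6): cross-entropy on the target, KL divergence to the
    task-specific teacher (Eq. 5), and MSE between generated and task-specific
    LoRA parameters (Eq. 4). Logits have shape (B, T, V); LoRA params are flat."""
    # L_pred: standard cross-entropy over the vocabulary.
    l_pred = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             target_ids.view(-1))
    # L_kl: KL(teacher || student) over output token distributions
    # (F.kl_div expects log-probs as input and probs as target).
    l_kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    # L_ins: explicit parameter alignment between generated and task LoRA.
    l_ins = F.mse_loss(generated_lora, task_lora)
    return l_pred + lambda1 * l_kl + lambda2 * l_ins
```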

4 Experiments

We first present the datasets (§4.1) and baselines (§4.2) used in our evaluation and then discuss three research questions (RQs):

RQ1: Can the proposed instruction learning paradigm effectively learn the ability of instance training? Can it support cross-task generalization of LLMs? (§4.4)

RQ2: How many foundation tasks does TAGI need to learn to achieve better results? (§4.5)

RQ3: What is the impact of different modules and learning stages on TAGI? (§4.7)

4.1 Datasets

To demonstrate the generality of our method, we evaluate our approach on two popular multi-task instruction datasets (we provide the full list of datasets and more details in A.2): Super-Natural Instructions (SNI) [36] and the T0 split of P3 (P3) [29].

SNI comprises over 1,600 task datasets; each includes a task definition and a set of fixed positive and negative demonstrations. We follow previous research [13; 26] and examine two settings: using only the task definition as the input to the hypernetwork ('Def'), and using the definition along with two few-shot positive examples ('Def + 2 Pos'). We use only the English tasks in the dataset, and the model's generations are evaluated on a set of 119 unseen tasks using ROUGE-L.

P3 is composed of 62 task datasets; the T0 model is trained with these tasks divided into meta-training and meta-test sets. The prompt format takes 0-shot reasoning into consideration and typically includes instructions or possible answer options. We follow precedent work [40] by using the 36 tasks of the T0 training subset to train our model. Evaluation is conducted based on the accuracy scores of multiple-choice questions for the 11 unseen tasks in the meta-test set (MTest11).

Table 1: Characteristics of TAGI and baselines along five dimensions: hypernetwork pretraining (Pre-Train), instruction fusion (Instr. Fus.), low inference cost (Low Infer. Cost), instruction learning (Instr. Learning), and unseen-task generalization (Unseen Task). Methods compared: Simple FT, T0 [29] / Tk-Instruct [36], Hypter [41], HyperTuning [26], HINT [13], and TAGI (Ours).

4.2 Baselines

We compare the characteristics of TAGI against eight primary groups of baselines (as shown in Table 1): 1) No FT: models without finetuning. 2) HyperTuning [26]: models that use a hypernetwork to convert demonstrations into adapters, without instruction fusion. 3) Hypter [41]: hypernetwork-based models without pretraining. 4) HINT [13]: models that pretrain the hypernetwork and concatenate the instruction. 5) T0 and Tk-Instruct: strong baselines fully finetuned on P3 and SNI respectively, with the instruction concatenated. 6) Full FT: models finetuned on the target tasks. 7) Decoder-only models: fully finetuned decoder-only models such as GPT-2 [27] and OPT [42]. 8) FiD-ICL [40]: an ICL method using encoder intermediate fusion.

4.3 Implementations

We limit our scope to encoder-decoder models for our experiments (we discuss encoder-decoder versus decoder-only models in detail in B.1). We use T5-LM-Adapt (https://huggingface.co/google/t5-xl-lm-adapt) and T0 [29] as initializations. The two model groups share the same architecture but differ in weights; T0 uses T5-LM-Adapt for initialization and undergoes multi-task training on the P3 meta-training set. For SNI, only T5-LM-Adapt is considered, and three different sizes are tested: Base (250M), XL (3B), and XXL (11B), with Tk-Instruct [36] as the teacher model. For P3, we experiment with two sets of models of three different sizes: Base (250M), Large (800M), and XL (3B), with only the template as input, while the teacher model is FiD-ICL [40] with 16-shot examples. A.4 contains more implementation details and experimental settings.

4.4 Main Results

Super-Natural Instructions. We report the performance and inference costs of TAGI models and baselines in Table 2. Our analysis and findings yield several key insights:

• Firstly, methods lacking finetuning exhibit subpar performance. As shown in the first row of the table, the performance of No FT is significantly lower than that of other baseline methods by approximately 30 points (except for Hypter), which underscores the critical role of the inductive bias introduced during meta-training in enhancing the model's instruction adherence and cross-task generalization.

Table 2: ROUGE-L on Super-Natural Instructions ('Def' and 'Def + 2 Pos.') and average relative FLOPs.

| Method | Def (Zero-shot) | | | Def + 2 Pos. (Few-shot) | | | Avg. Rel. FLOPs |
| | Base (250M) | XL (3B) | XXL (11B) | Base (250M) | XL (3B) | XXL (11B) | |
| No FT | 8.8 | 14.3 | 26.2 | 9.4 | 13.6 | 30.5 | ×1.0 |
| Tk-Instruct | 35.3 | 48.0 | 53.6 | 42.1 | 54.0 | 62.0 | ×1.0 |
| # Decoder-only models | | | | | | | |
| GPT-2 XL (1.5B) | - | 38.2 | - | - | 45.3 | - | ×0.33 |
| OPT (13B) | - | - | 44.8 | - | - | 51.5 | ×0.36 |
| # Hypernetwork-based models | | | | | | | |
| Hypter | 12.1 | 16.8 | 15.5 | 10.6 | 14.2 | 13.4 | ×0.35 |
| HyperTuning | - | 38.9 | - | - | 48.6 | - | ×0.34 |
| HINT | 33.3 | 47.2 | 51.1 | 41.8 | 53.2 | 56.4 | ×0.37 |
| TAGI (Ours) | 35.3 | 48.4 | 52.3 | 42.5 | 56.3 | 58.4 | ×0.39 |

• Secondly, TAGI demonstrates notable improvements over other hypernetwork-based baselines, with only a marginal increase in inference overhead (see the last column of Table 2). TAGI still outperforms the advanced method HINT (≥ 2 points) while achieving similar computational savings, which highlights the efficacy of instruction learning with knowledge distillation. The underperformance of HINT and HyperTuning may stem from their sole reliance on cross-entropy with the target during meta-training, lacking explicit supervision of the intermediate task-specific module parameters and implicit supervision of the teacher outcome. This deficiency impedes their ability to fully leverage instruction tasks for generating superior adapter parameters during meta-test.

• Thirdly, TAGI consistently matches or even surpasses robust baselines in both zero- and few-shot settings. Comparing TAGI with multi-task finetuning approaches such as Full FT and Tk-Instruct, we observe that TAGI achieves comparable performance (within 0–2.3 points) except at 11B, while utilizing approximately 2.5× fewer FLOPs. TAGI's performance on the 11B model is somewhat lacking, potentially attributable either to insufficient training due to resource limitations or to a decrement in performance stemming from the omission of parameter alignment constraints due to time constraints (we discuss the trend and possible reasons in B.2). In alignment with prior research, TAGI significantly surpasses GPT-2 and OPT-13B in comparisons with decoder-only models (≥ 10 points over GPT-2 and ≥ 7 points over OPT-13B), affirming the superiority of encoder-decoder models within similar meta-learning frameworks. Overall, TAGI fulfills its objective by enhancing cross-task generalization capabilities through instruction learning and striking an optimal balance between performance and efficiency.

Table 3: Accuracy on P3 (MTest11 and HyperT5 averages) and average relative inference time.

| Method | T5-LM | | | T0 | | | Avg. Rel. Infer. Time |
| | Base (250M) | Large (800M) | XL (3B) | Base (250M) | Large (800M) | XL (3B) | |
| # MTest11 Avg. | | | | | | | |
| Zero-shot | 43.9 | 41.5 | 42.6 | 49.1 | 52.4 | 57.6 | ×1.0 |
| Full FT | 44.6 | 45.5 | 47.2 | 51.9 | 56.6 | 61.4 | ×1.0 |
| Metatrain | 44.1 | 52.4 | 53.1 | 50.1 | 52.4 | 56.8 | ×1.0 |
| # ICL-based methods | | | | | | | |
| Concat-ICLα | 44.2 | 47.6 | - | 48.6 | 53.2 | - | ×4.1 |
| FiD-ICLα | 47.0 | 55.2 | 60.0 | 51.0 | 53.4 | 58.2 | ×1.9 |
| Ensemble-ICLα | 44.6 | 54.5 | 52.6 | 49.9 | 53.7 | 57.7 | ×13.2 |
| # Hypernetwork-based models | | | | | | | |
| Hypter | - | - | - | - | - | 56.2 | - |
| HINT | - | - | - | - | - | 60.3 | - |
| TAGI (Ours) | 45.6 | 54.7 | 58.9 | 50.8 | 53.8 | 58.8 | ×0.88 |
| # HyperT5 Avg. (without StoryCloze dataset) | | | | | | | |
| FiD-ICLα | 46.9 | 55.8 | 60.6 | 51.7 | 53.9 | 58.5 | ×1.9 |
| HyperTuning | - | 54.6 | 59.6 | - | - | - | - |
| TAGI (Ours) | 46.7 | 56.0 | 59.8 | 51.7 | 54.6 | 59.2 | ×0.88 |

P3. We report results on the T0 evaluation set in Table 3, with full results in C.2.

• Firstly, examining the ICL-based methods in the middle section, all three ICL aggregation strategies achieve strong performance, underscoring the utility of instructions and demonstrations in aiding LLMs. However, these methods require concatenating extensive demonstrations during both training and inference, which significantly increases computational demands and reduces efficiency (×2–×13.2 inference time). In contrast, TAGI, by processing the task instructions only once, attains comparable or superior accuracy while significantly curtailing the computational burden (×0.88). TAGI trails FiD-ICL [40] by merely 1.2 points on T5-LM, yet it outperforms the other methods (≥ 1 point). For T0, it is only 1.5 points lower than Full FT and exceeds all ICL-based methods. Notably, TAGI requires neither the 16 examples used by the ICL-based methods nor the repeated processing of instructions required by the baselines, significantly reducing inference overhead.

• A comparison of the first three lines of results indicates that, for Large or XL models, initializing with T5-LM outperforms T0. We hypothesize that the process of training T5-LM into T0 might dilute world knowledge or diminish certain specific capabilities, thereby attenuating the benefits derived from meta-training. Conversely, for Base-size models, T0 serves as a more effective initialization point.

• Furthermore, TAGI outperforms competing hypernetwork models (because HINT is designed for TPU and HyperTuning is not open-sourced, we did not measure their inference time; based on the SNI experiments, however, the trend in time expenditure can be inferred to be consistent). Comparing the last two columns, the performance on MTest11 surpasses HINT and HyperTuning by 0.5 and 4.6 points, respectively. Additionally, in the HyperT5 evaluation, the performance exceeds HyperTuning by 1 point. This aligns with prior findings, suggesting that instruction learning augments the hypernetwork's task comprehension and its capacity to generate task-specific adapters.

4.5 Varying Number of Meta-Training Tasks

A fundamental component of our methodology is incorporating parameter alignment in instruction learning. Consequently, it is imperative to examine how varying the number of tasks to which parameter alignment is applied affects outcomes and influences the generalization capabilities of LLMs. To this end, we conduct a comprehensive experimental analysis comparing instruction learning with parameter alignment across a spectrum of task quantities against instruction learning without parameter alignment. Tasks are organized in descending order based on the number of datasets each encompasses; a predetermined number of tasks is then sequentially selected for meta-training. This approach allows us to systematically evaluate the impact of parameter alignment on learning and generalization as the number of tasks varies.

From Figure 3, we find, firstly, that an increase in the number of tasks correlates with improved performance across all methods, suggesting that meta-training across a broader array of tasks enhances the model's instruction-following capabilities. However, the practical limitations of sourcing a sufficient quantity of tasks for meta-training must be acknowledged. Secondly, the TAGI model exhibits lower overall performance in the absence of parameter alignment for instruction learning, yet demonstrates a smaller relative standard deviation and less variability in performance with respect to the number of tasks. This pattern aligns with the expected outcomes of instruction learning, highlighting the efficacy of our approach in bolstering the model's ability to adhere to task instructions and generate task-specific adapters.

[Figure 3: Performance with varying numbers of meta-training tasks.]


4.6 Parameter Size against Performance

We analyze the proportion of generated parameters relative to the total parameter count at various LoRA ranks, and compare this to the performance of the full meta-training finetuning method, as demonstrated in Figure 4 and Table 7. We find that TAGI requires only about 10% of the parameters to outperform full meta-training finetuning, which indicates that the limited parameters generated by the hypernetwork serve as an optimal solution for task completion. The ability to adaptively construct models tailored to specific tasks removes the necessity for additional finetuning, underscoring TAGI's effectiveness and efficiency.

[Figure 4: Generated parameter size versus performance.]


Table 4: Comparison with meta-trained and hypernetwork baselines on SNI (Def, Def + 2 Pos.) and P3, with ablations of TAGI.

| Method | Def | Def + 2 Pos. | P3 |
| Tk-Instruct | 48.0 | 54.0 | - |
| Tk-Instruct-LoRA | 47.5 | 54.6 | - |
| Tk-Instruct-Prefix | 42.6 | 54.2 | - |
| HyperTuning | 38.9 | 48.6 | 59.6 |
| HINT | 47.2 | 53.2 | 60.3 |
| TAGI | 48.4 | 56.3 | 60.6 |
| # Ablation Study | | | |
| w/o pretraining | 47.1 | 55.6 | 58.3 |
| w/o Instr. Fus. | 35.1 | 40.6 | 44.2 |
| w/o $\mathcal{L}_{\text{ce}}$ | 47.6 | 55.4 | 59.8 |
| w/o $\mathcal{L}_{\text{kl}}$ | 45.7 | 53.9 | 57.3 |
| w/o $\mathcal{L}_{\text{ins}}$ | 47.5 | 55.2 | 59.4 |
| w/o Hypernetwork | 43.8 | 50.7 | - |

4.7 Ablation Study

To evaluate the significance of each component of TAGI, we conducted a series of experiments across two meta-task datasets utilizing the T5-LM-XL (3B) model. The results, as depicted in Table 4, highlight that instruction fusion plays a pivotal role in enhancing model performance. This process facilitates dynamic interaction between the input and the instructions, enriching the model's input with additional contextual information, reminiscent of the substantial benefits observed with ICL. Moreover, pretraining emerges as a critical phase: models without it fall markedly behind, indicating that pretraining significantly enhances proficiency in interpreting and executing task instructions. Furthermore, the systematic removal of various components during the finetuning phase leads to a consistent decline in performance, underscoring the integral contribution of each component to the model's overall efficacy.

Compared to meta-learning methods such as LoRA finetuning (rank = 32, "Tk-Instruct-LoRA"), prefix finetuning (num_virtual_tokens = 32, "Tk-Instruct-Prefix"), and full finetuning ("Tk-Instruct"), our TAGI method enhances task comprehension and utilization; achieved through a hypernetwork that dynamically generates LoRA adapters inserted into the LLM based on the input, this leads to better cross-task generalization capabilities. Notably, prefix finetuning excels in the Def + 2 Pos. scenario, likely due to its effective integration of information from the positive examples, whereas it performs less satisfactorily in the Def scenario, indicating that instructions alone are insufficient for optimal results. Comparative analysis with other hypernetwork models reveals that TAGI's ablated variants remain robust, affirming the effectiveness of each step in bolstering TAGI's operational efficiency.

5 Conclusions

In this paper, we introduce an innovative method of instruction learning designed to emulate instance training. This approach enables the model to achieve specified tasks and learn from instructions how to address a category of problems. The proposed TAGI seamlessly integrates the instruction into the input and processes it in a single pass, thereby ensuring minimal inference overhead. Concurrently, we employ a knowledge distillation framework to facilitate instruction learning for distilling skills and aligning task-specific models. This allows the hypernetwork to transform task instructions into an efficient module inserted into the LLM, thereby boosting generalization performance. Remarkably, TAGI consistently equals or surpasses the efficacy of conventional meta-training approaches while requiring fewer FLOPs and obviating the need for additional model parameter updates or gradient back-propagation. Future work will investigate more potent hypernetwork pretraining techniques and develop superior instruction fusion methods to augment the hypernetwork's expressive capability, thereby enhancing the model's ability to generalize to unseen tasks. Moreover, future work will investigate various task type classifications and the generalization effects of cross-modal tasks in instruction learning.

6 Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022YFF0711900), the National Natural Science Foundation of China (No. 62376270 and No. 62276264), and the Youth Innovation Promotion Association CAS.

References

  • [1] Jonathan Baxter. Learning to Learn. Springer US, 1998.
  • [2] Christos Baziotis, Mikel Artetxe, James Cross, and Shruti Bhosale. Multilingual machine translation with hyper-adapters, 2022.
  • [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
  • [4] Vinod Kumar Chauhan, Jiandong Zhou, Ping Lu, Soheila Molaei, and David A. Clifton. A brief review of hypernetworks in deep learning, 2023.
  • [5] Tong Chen, Qirun Dai, Zhijie Deng, and Dequan Wang. Demonstration distillation for efficient in-context learning, 2024.
  • [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
  • [7] Budhaditya Deb, Guoqing Zheng, and Ahmed Hassan Awadallah. Boosting natural language generation from instructions with meta-learning, 2022.
  • [8] David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks, 2016.
  • [9] Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, and Ed H. Chi. HyperPrompt: Prompt-based task-conditioning of transformers, 2022.
  • [10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [11] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP, 2019.
  • [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [13] Hamish Ivison, Akshita Bhagia, Yizhong Wang, Hannaneh Hajishirzi, and Matthew Peters. HINT: Hypernetwork instruction tuning for efficient zero-shot generalisation. ACL, 2023.
  • [14] James M. Joyce. Kullback-Leibler Divergence, pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
  • [15] Sharon Kim, Mahjabeen Raza, and Edward Seidman. Improving 21st-century teaching skills: The key to effective 21st-century learners. Springer US, 2019.
  • [16] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Heike Adel and Shuming Shi, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • [17] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics.
  • [18] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.
  • [19] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The Flan collection: Designing data and methods for effective instruction tuning, 2023.
  • [20] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States, July 2022. Association for Computational Linguistics.
  • [21] Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to GPTk's language, 2022.
  • [22] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022.
  • [23] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions, 2022.
  • [24] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
  • [25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: an imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • [26] Jason Phang, Yi Mao, Pengcheng He, and Weizhu Chen. HyperTuning: Toward adapting large language models without back-propagation, 2022.
  • [27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • [28] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • [29] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. Multitask prompted training enables zero-shot task generalization, 2022.
  • [30] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4:131–139, 1992.
  • [31] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022.
  • [32] Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. HyperGrid transformers: Towards a single model for multiple tasks. In International Conference on Learning Representations, 2021.
  • [33] Sebastian Thrun and Lorien Y. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, 1998.
  • [34] Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization?, 2022.
  • [35] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
  • [36] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
  • [37] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
  • [38] Orion Weller, Nicholas Lourie, Matt Gardner, and Matthew E. Peters. Learning from task descriptions. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1361–1375, Online, November 2020. Association for Computational Linguistics.
  • [39] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.
  • [40] Qinyuan Ye, Iz Beltagy, Matthew Peters, Xiang Ren, and Hannaneh Hajishirzi. FiD-ICL: A fusion-in-decoder approach for efficient in-context learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8158–8185, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [41] Qinyuan Ye and Xiang Ren. Learning to generate task-specific adapters from task description, 2021.
  • [42] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022.
  • [43] Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections, 2021.
  • [44] Ahmet Üstün, Arianna Bisazza, Gosse Bouma, Gertjan van Noord, and Sebastian Ruder. Hyper-X: A unified hypernetwork for multi-task multilingual transfer, 2022.

Appendix A Experimental Settings

A.1 Problem Setting

Meta-Training and Inference: Our methodology rigorously adheres to the protocol outlined in MetaICL [20]. In the meta-train phase, we commence by selecting a task $\mathcal{T}$ from $\mathcal{T}_{train}$, followed by sampling $k$ support examples $\{(x_i^{(s)}, y_i^{(s)})\}$ and $m$ query examples $\{(x_i^{(q)}, y_i^{(q)})\}$ from the chosen task. The proposed hypernetwork is then adjusted to minimize the overall loss, focusing on generating a task model that can accurately predict the target sequences (e.g., answers) for source sequences (e.g., questions). During the meta-test/inference phase, for each novel task in $\mathcal{T}_{test}$, we employ the instructions to create the task-specific adapter, optimizing the model's performance across all query examples $\{(x_i^{(q)}, y_i^{(q)})\}$.

Table 5: Dataset statistics.

| Dataset | Examples per Task | Train | Test |
|---|---|---|---|
| Super-Natural Instructions | 100 | 75,417 | 11,810 |
| P3 | - | 90,897,454 | 2,940,068 |
| P3 (Sampling) | 1000 | 290,000 | 2,940,068 |

A.2 Datasets

During the pretraining phase, we utilized the C4 dataset [28], truncating each sequence to 1024 tokens. For the training phase, we employed the Super-Natural Instructions (SNI) [36] and P3 [29] datasets for meta-training and meta-test. For SNI, we adhered to the default settings [13, 36], which include 100 examples per task for both the training and test splits. For P3, we used the data and prompts provided by T0. All prompts related to the meta-training tasks were included in the meta-training process, while the meta-test phase utilized the evaluation prompts specified by T0 [29]. We treated ANLI R1, R2, and R3 as three distinct tasks, resulting in 11 tasks for the original meta-test in P3 (Meta-Test-11). Due to resource constraints, we deviated from the sampling procedures of prior work, opting to sample 1000 examples per task for each prompt template. This approach yielded a smaller dataset size, as detailed in Table 5. For further information on P3, refer to [29]. Additionally, to facilitate comparison with the HyperTuning method, we excluded the StoryCloze task from the evaluation since it was not included in the datasets for the HyperT5 evaluation.
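The per-prompt subsampling described above can be approximated with the huggingface datasets API as below. The P3 config name is illustrative only (each task/prompt-template pair corresponds to one config); it is not a prescription of the exact configs we used.

```python
from datasets import load_dataset

# Illustrative: one (task, prompt template) pair of P3 corresponds to one config.
ds = load_dataset("bigscience/P3", "cosmos_qa_description_context_question_text", split="train")

# Sample up to 1000 examples per task for each prompt template.
subset = ds.shuffle(seed=42).select(range(min(1000, len(ds))))
```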

A.3 Split Sizes for Varying Number of Meta-Training Tasks

As shown in the SNI and P3 split tables in Appendix D, we present a comprehensive list for the two datasets, including the number of tasks or templates contained in each task category and the task divisions used in the experiments of §4.5. The divisions in the tables are cumulative; thus, the second division includes both the first and the second splits. For SNI, task categories were sorted in descending order by the number of tasks they contain and then divided into the specified sizes (6, 15, 30, 60). For P3, we selected a specified number of tasks (5, 10, 20, 36) based on the task classification in the original paper, which includes categories such as Multiple-Choice QA, Closed-Book QA, Summarization, Structure-To-Text, Paraphrase Identification, Sentiment, Topic Classification, and Extractive QA.
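A minimal sketch of this cumulative split construction, assuming a mapping from task category to task count (the categories and counts are those listed in the Appendix D tables):

```python
def cumulative_splits(category_counts, sizes=(6, 15, 30, 60)):
    """Sort categories by task count (descending) and take cumulative prefixes."""
    ranked = sorted(category_counts, key=category_counts.get, reverse=True)
    return {size: ranked[:size] for size in sizes}

# e.g., for SNI: cumulative_splits({"Question Answering": 157, "Program Execution": 90, ...})
# By construction, the 15-task split contains the 6-task split, and so on.
```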

We obtain all our data from huggingface datasets [16]. In the following, we provide the dataset links:

Additionally, the Super-Natural Instructions dataset (previously known as Natural Instructions-v2) has undergone some changes over time. In our experiments, we use the v2.6 version.

A.4 Implementations

Our implementations are based on huggingface transformers v4.23.1 [39], PyTorch v1.13.1 [25], and DeepSpeed v0.10.0 (https://github.com/microsoft/DeepSpeed). All experiments were conducted on four NVIDIA A100 GPUs with 80GB of memory each and eight NVIDIA A6000 GPUs with 48GB of memory each. Unless otherwise specified, the rank of the LoRA generated by the hypernetwork is 32, and we use the AdamW optimizer with a learning rate of 5e-5 and a linear warmup rate of 0.02. We pre-train all models for 50,000 steps on C4 [28] with a batch size of 8 and sequences of length 1024.
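The optimizer and schedule described above can be set up as follows. This is a sketch under the stated settings, not our exact training script; the placeholder module and step count stand in for the actual model and schedule.

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(8, 8)  # placeholder; stands in for the hypernetwork + backbone
total_steps = 20_000     # illustrative finetuning step count (cf. Table 6)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.02 * total_steps),  # linear warmup rate of 0.02
    num_training_steps=total_steps,
)
```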

A.5 T0-Base/Large/3B

T0 [29] provides model checkpoints only in the 3B and 11B sizes. Additionally, HINT [13] and FiD-ICL [40] re-pretrained T0 and found that the original model was not sufficiently trained, achieving better results after reproduction. We therefore used the T0 model reproduced by FiD-ICL (https://huggingface.co/qinyuany/fid-icl-t0-large) to conduct a series of experiments.
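Loading this checkpoint follows the standard transformers pattern:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "qinyuany/fid-icl-t0-large"  # the FiD-ICL reproduction referenced above
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
```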

Table 6: Training hyperparameters. The "Finetuning" group covers the SNI (Base 250M, XL 3B, XXL 11B) and P3 (Base 250M, Large 800M, XL 3B) columns.

| | LoRA Tuning | Pretraining | SNI Base (250M) | SNI XL (3B) | SNI XXL (11B) | P3 Base (250M) | P3 Large (800M) | P3 XL (3B) |
|---|---|---|---|---|---|---|---|---|
| Max Input Len | 1024 | 1024 | 1024 | 1024 | 1024 | 512 | 512 | 512 |
| Max Output Len | 128 | - | 128 | 128 | 128 | 64 | 64 | 64 |
| Optimizer | adamw | adafactor | adamw | adamw | adamw | adamw | adamw | adamw |
| Learning Rate | 1e-4 | 1e-3 | 1e-4 | 5e-5 | 5e-5 | 1e-4 | 1e-4 | 5e-5 |
| Precision | bf16 | float32 | bf16 | bf16 | bf16 | bf16 | bf16 | bf16 |
| # Training Steps | 10000 | 50000 | 20000 | 20000 | 20000 | 20000 | 20000 | 20000 |
| # Warmup Steps | - | - | 2% of total training steps (all finetuning columns) ||||||
| Batch Size | 8 | 8 | 8 | 2 | 1 | 8 | 4 | 2 |
| Gradient Accumulation | 2 | 1 | 2 | 4 | 2 | 2 | 4 | 4 |
| LoRA Rank | 32 (all settings) ||||||||

A.6 Hyperparameters

The complete set of stable hyperparameters used for the training runs can be found in Table 6.

Appendix B Additional Experiments and Findings

B.1 Why Do We Choose Enc-Dec Models?

Previous work has suggested that models with an encoder-decoder (enc-dec) structure have advantages over decoder-only (dec-only) models in terms of task generalization and instruction-following capabilities [19, 34, 40]. Therefore, in our experiments, we only considered models with an enc-dec structure (T5-LM and T0). Our experimental results confirm that enc-dec models indeed hold an advantage in this comparison, although dec-only models may be more computationally efficient thanks to KV caching and fewer layers. Our method, TAGI, significantly improves performance in various aspects with only a slight increase in computational overhead: on top of the original computation, we encode the task instructions only once.

B.2 T5-LM-XXL Training Trend

In this section, we examine why the T5-LM-XXL (11B) model surpasses the hypernetwork models but falls 1-4 points short of the strong meta-trained baseline Tk-Instruct, as mentioned in §4.4. The primary reason is insufficient training: when replicating the Tk-Instruct experiment with only 20,000 finetuning steps, our results were significantly lower than reported. Consequently, we analyzed the performance of our TAGI model at different numbers of finetuning steps. As shown on the left side of Figure 5, performance steadily and substantially improves with more steps. We therefore reasonably predict that increasing the steps to 50,000 or more could surpass Tk-Instruct. Another possible reason is the lack of parameter alignment for the 11B model due to limited resources. Our previous analysis showed that parameter alignment is crucial, with larger models benefiting more. We therefore analyzed performance when aligning parameters on a small number of tasks. As shown on the right side of Figure 5, performance with parameter alignment for 6 and 15 tasks is better than without alignment. Based on these trends, it can be inferred that performance with full task parameter alignment could surpass Tk-Instruct.

[Figure 5: T5-LM-XXL (11B) performance at different finetuning steps (left) and with parameter alignment over varying numbers of tasks (right).]


B.3 Analysis of Hyperparameters

To explore the optimal hyperparameter settings for our experiments, we conducted a series of tests and analyses using the T5-LM-Base (250M) model. The findings presented in Table 7 reveal that variations in hyperparameters can lead to performance fluctuations, particularly with higher learning rates or fewer finetuning steps. Given the varying pre-training conditions of models of different sizes, a size-specific analysis is essential; however, details on larger models are omitted here due to resource limitations.

We observed that different LoRA rank settings minimally affect performance, leading us to select a balanced rank of 32. Similarly, the impact of the warmup ratio is negligible; thus, based on our experience, we chose a warmup ratio of two percent of the maximum finetuning steps. While more finetuning steps generally correlate with improved performance, excessive finetuning can result in overfitting to the meta-training tasks, thereby diminishing generalizability. Moreover, increased finetuning steps require greater computational resources. Consequently, we determined that the optimal number of finetuning steps is 20,000 based on our experimental outcomes.

Table 7: Effect of learning rate, LoRA rank, training steps, and warmup ratio (T5-LM-Base).

| | Learning Rate |||| LoRA Rank ||| Training Steps ||| Warmup Ratio |||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | 5e-5 | 1e-4 | 3e-4 | 1e-3 | 16 | 32 | 64 | 15000 | 20000 | 25000 | 0.01 | 0.02 | 0.03 |
| SNI, Def + 2 Pos. ||||||||||||||
| Tk-Instruct [36] | 41.3 | 41.8 | 42.2 | 38.9 | - | - | - | 41.4 | 41.8 | 42.1 | 41.5 | 41.8 | 40.6 |
| TAGI (Ours) | 42.1 | 42.5 | 40.3 | 39.7 | 41.8 | 42.5 | 42.3 | 41.8 | 42.5 | 42.4 | 42.3 | 42.5 | 41.9 |
| SNI, Def ||||||||||||||
| Tk-Instruct [36] | 35.0 | 34.2 | 32.6 | 31.7 | - | - | - | 34.4 | 34.2 | 34.5 | 35.0 | 34.2 | 34.3 |
| TAGI (Ours) | 34.3 | 35.3 | 33.5 | 31.8 | 34.8 | 35.3 | 35.4 | 34.2 | 35.3 | 35.4 | 34.8 | 35.3 | 34.9 |
| P3, MTest-11 Avg. ||||||||||||||
| Metatrain | 43.3 | 44.1 | 43.6 | 40.9 | - | - | - | 44.0 | 44.1 | 44.3 | 44.2 | 44.1 | 43.6 |
| TAGI (Ours) | 44.0 | 45.6 | 44.0 | 41.6 | 44.8 | 45.6 | 45.5 | 44.3 | 45.6 | 45.2 | 45.1 | 45.6 | 44.8 |

B.4 How Are $\lambda_1$ and $\lambda_2$ Tuned?

In the experiments, we set $\lambda_1 = 5$ and $\lambda_2 = \mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$. The effects of different $\lambda$ values on the results are illustrated in Figure 6 and Table 8. We kept all other conditions constant and varied only $\lambda$ to perform an ablation experiment in the Def + 2 Pos. scenario.
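For illustration, one way the two weights could enter a combined objective is sketched below. Only $\lambda_2 = \mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$ is taken directly from the text; the assignment of $\lambda_1$ and $\lambda_2$ to the logit-distillation and parameter-alignment terms, respectively, is an assumption of this sketch.

```python
import torch

def combined_loss(loss_ins, loss_logits, loss_params, lambda1=5.0):
    # Assumption: lambda1 scales logit distillation and lambda2 scales
    # parameter alignment; lambda2 adapts to the current instruction loss.
    lambda2 = torch.sigmoid(loss_ins.detach())
    return loss_ins + lambda1 * loss_logits + lambda2 * loss_params
```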

[Figure 6: Effect of different $\lambda$ values on performance in the Def + 2 Pos. scenario.]


Table 8: Ablation on $\lambda_1$ and $\lambda_2$ (RougeL).

| $\lambda_1$ | $\lambda_2$ | RougeL |
|---|---|---|
| 0.5 | $\mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$ | 40.1 |
| 2 | $\mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$ | 40.9 |
| 5 | $\mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$ | 42.5 |
| 10 | $\mathrm{sigmoid}(\mathcal{L}_{\text{ins}})$ | 38.7 |
| 5 | 0.2 | 41.3 |
| 5 | 0.5 | 41.6 |
| 5 | 1.0 | 41.2 |

B.5 Inference Cost

To analyze the computational efficiency of the TAGI model compared to the standard instruction training model (full fine-tuning), consider a scenario in which we must process $n$ samples, each of length $i$, along with a task instruction of length $t$. We assume the output sequence length is negligible and thus ignore it in our computations.

In a typical full fine-tuning setup, such as Tk-Instruct, each input is concatenated with the task instruction, requiring the model to process the combined input sequence. If we denote the number of FLOPs required to process a single token with an encoder-decoder model as $N$, where $N$ is the total number of model parameters, then the total computation cost for all samples can be estimated as
$$\text{FLOPs}_{\text{standard}} = N \cdot n(t + i).$$
Here, each of the $n$ samples includes both the instruction and the sample input, leading to $n(t+i)$ tokens being processed.

Our TAGI model, on the other hand, processes the task instruction only once, regardless of the number of samples. This significantly reduces the computation required, especially as the number of samples or the length of the instruction increases. The total computation cost in this model is given by
$$\text{FLOPs}_{\text{TAGI}} = N \cdot (t + ni).$$
In this case, the instruction of length $t$ is processed only once and each sample is processed separately, resulting in a total of $(t + ni)$ tokens being processed.
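These two estimates are easy to compare numerically; the sketch below uses illustrative values for $N$, $n$, $t$, and $i$ that are not drawn from our experiments.

```python
def flops_standard(N, n, t, i):
    """Full finetuning: every sample is concatenated with the instruction."""
    return N * n * (t + i)

def flops_tagi(N, n, t, i):
    """TAGI: the instruction of length t is encoded once; samples are processed alone."""
    return N * (t + n * i)

# Illustrative values: a 3B-parameter model, 100 samples of length 128,
# and an instruction of length 512.
N, n, t, i = 3e9, 100, 512, 128
print(flops_standard(N, n, t, i) / flops_tagi(N, n, t, i))  # ~4.8x fewer FLOPs for TAGI
```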

Appendix C Extended Results

C.1 Characteristics Comparison of the Proposed TAGI and Other Baselines

Here, we report a full comparison of the baseline methods and the proposed TAGI in Table 9, also visualized in Table 1. We compare the methods across eight dimensions. Finetuning on target tasks yields good performance; however, it necessitates retuning when applied to unseen tasks and thus fails to address them effectively. Strong meta-training baselines excel at handling unseen tasks by enabling models to solve problems based on task-specific instructions. Nevertheless, these methods operate only at the instance level and entail repetitive processing of concatenated instructions as well as comprehensive finetuning, resulting in significant parameter updates and high inference costs.

Hypter [41] first considered tasks at the task level, treating all instances of a task as a unified entity and employing a hypernetwork to generate adapters that represent task-specific models from instructions. Building on this, HyperTuning [26] uses demonstrations to generate adapters and pretrains the hypernetwork to boost its expressive capability. Both strategies avoid feeding the instructions directly into the model and instead rely on the hypernetwork, which reduces parameter updates and lowers computational demands during inference. However, they suffer from notable performance degradation due to the lack of instructional information in the input.

HINT [13] addresses this issue by appending instructions post-encoder, thus eliminating redundant computations. Although these methods facilitate learning at the task level, they do not engage in instruction-based learning, i.e., they do not explicitly supervise the hypernetwork’s generation process to aid in understanding instructions and generating parameters.

The proposed TAGI rectifies these deficiencies by integrating cross-attention for enhanced information fusion and supervised learning of adapter weights on top of HINT. This innovation aids in generalizing to unseen tasks without increasing the computational burden.
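As a schematic illustration of such an adapter generator, the sketch below maps an encoded instruction to LoRA weights for a single target layer via learned queries and cross-attention. The dimensions, pooling, and module names are assumptions of this sketch, not the released architecture.

```python
import torch
import torch.nn as nn

class AdapterGenerator(nn.Module):
    """Hypothetical hypernetwork head: instruction states -> LoRA weights."""
    def __init__(self, d_model=768, rank=32, d_target=768, n_queries=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.to_lora_a = nn.Linear(d_model, d_target * rank)
        self.to_lora_b = nn.Linear(d_model, rank * d_target)
        self.rank, self.d_target = rank, d_target

    def forward(self, instruction_states):                 # (B, L, d_model)
        B = instruction_states.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # learned queries attend to the instruction
        fused, _ = self.cross_attn(q, instruction_states, instruction_states)
        pooled = fused.mean(dim=1)                         # (B, d_model)
        lora_a = self.to_lora_a(pooled).view(B, self.d_target, self.rank)
        lora_b = self.to_lora_b(pooled).view(B, self.rank, self.d_target)
        return lora_a, lora_b                              # low-rank update: W' = W + A @ B
```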

[Table 9: Characteristics comparison of Simple FT, T0 [29] / Tk-Instruct [36], Hypter [41], HyperTuning [26], HINT [13], and TAGI (Ours) across eight dimensions: Meta-Train, Pre-Train, Instruction Concatenation, Instruction Fusion, Low Updated Parameters, Low Inference Cost, Instruction Learning, and Unseen-Task generalization.]

C.2 P3 Full Results

Table 10 reports the per-task performance and average accuracy on P3, as discussed in §4.4.

Table 10: Per-task performance and average accuracy on P3. ANLI reports the aggregate across R1-R3; HyperT5 Avg. excludes SCloze (cf. Appendix A.2).

| Method | ANLI | (R1) | (R2) | (R3) | HSwag | CB | COPA | RTE | WiC | WSC | WGD | SCloze | MTest-11 Avg. | HyperT5 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 33.4 | 33.4 | 33.4 | 33.4 | 25.0 | 50.0 | 50.0 | 52.7 | 50.0 | 63.5 | 50.0 | 50.0 | 44.7 | 46.8 |
| # Base (250M) |||||||||||||||
| T5-LM | 33.4 | 33.3 | 33.5 | 33.5 | 24.7 | 44.3 | 54.3 | 47.9 | 49.7 | 57.9 | 49.8 | 54.1 | 43.9 | 45.2 |
| T5-LM Full FT | 33.8 | 34.5 | 33.4 | 33.5 | 24.8 | 66.5 | 45.7 | 51.1 | 53.7 | 46.3 | 49.8 | 50.9 | 44.6 | 46.5 |
| T5-LM Metatrain | 31.0 | 30.3 | 29.5 | 33.1 | 25.0 | 40.5 | 52.6 | 51.2 | 50.2 | 58.4 | 47.4 | 66.6 | 44.1 | 44.6 |
| T5-LM-FiD | 33.0 | 32.4 | 33.1 | 33.4 | 26.7 | 42.5 | 58.8 | 54.6 | 51.1 | 57.9 | 50.3 | 76.3 | 47.0 | 46.9 |
| T5-LM-TAGI | 32.1 | 31.5 | 31.7 | 33.1 | 25.0 | 44.5 | 54.7 | 53.7 | 52.3 | 60.5 | 50.8 | 64.0 | 45.6 | 46.7 |
| T0 | 32.3 | 31.5 | 32.4 | 33.1 | 26.5 | 45.8 | 65.9 | 69.3 | 51.6 | 56.7 | 51.2 | 76.1 | 49.1 | 49.9 |
| T0 Full FT | 33.5 | 32.6 | 33.9 | 33.9 | 29.1 | 73.2 | 66.3 | 68.0 | 53.1 | 50.9 | 51.0 | 79.0 | 51.9 | 53.1 |
| T0 Metatrain | 32.1 | 31.5 | 31.5 | 33.2 | 29.5 | 50.4 | 64.2 | 68.2 | 47.7 | 61.6 | 52.8 | 80.8 | 50.1 | 50.8 |
| T0-FiD | 32.7 | 31.7 | 32.9 | 33.6 | 26.2 | 54.9 | 68.2 | 68.1 | 51.9 | 60.3 | 51.3 | 82.3 | 51.0 | 51.7 |
| T0-TAGI | 32.7 | 31.1 | 31.9 | 35.0 | 29.8 | 49.3 | 67.1 | 70.0 | 49.0 | 61.2 | 54.4 | 79.6 | 50.8 | 51.7 |
| # Large (800M) |||||||||||||||
| T5-LM | 32.7 | 32.1 | 33.4 | 32.7 | 25.3 | 33.8 | 50.5 | 49.0 | 51.0 | 50.4 | 50.5 | 47.8 | 41.5 | 42.9 |
| T5-LM Full FT | 34.1 | 35.1 | 33.6 | 33.6 | 26.1 | 65.4 | 47.1 | 51.7 | 53.5 | 47.5 | 49.9 | 56.5 | 45.5 | 46.9 |
| T5-LM Metatrain | 31.3 | 30.0 | 30.5 | 33.4 | 27.0 | 60.4 | 77.6 | 71.9 | 47.0 | 56.4 | 54.8 | 87.2 | 52.4 | 53.3 |
| T5-LM-FiD | 34.4 | 33.9 | 33.4 | 35.8 | 28.3 | 60.2 | 81.1 | 72.6 | 50.7 | 63.7 | 55.6 | 91.6 | 55.2 | 55.8 |
| T5-LM-TAGI | 33.7 | 33.5 | 32.5 | 35.1 | 27.8 | 62.9 | 79.0 | 76.1 | 52.9 | 57.9 | 58.2 | 86.2 | 54.7 | 56.0 |
| T0 | 34.1 | 32.2 | 34.2 | 36.0 | 26.1 | 56.8 | 76.6 | 65.3 | 50.8 | 56.4 | 53.9 | 88.4 | 52.4 | 52.5 |
| T0 Full FT | 35.3 | 34.5 | 35.4 | 36.2 | 33.1 | 80.1 | 80.8 | 69.2 | 54.1 | 53.2 | 56.3 | 90.0 | 56.6 | 57.8 |
| T0 Metatrain | 32.9 | 31.5 | 31.8 | 35.5 | 24.5 | 59.4 | 77.0 | 65.1 | 48.8 | 56.7 | 57.6 | 88.0 | 52.4 | 52.8 |
| T0-FiD | 33.4 | 31.8 | 32.8 | 35.7 | 26.1 | 60.7 | 77.6 | 67.1 | 52.1 | 59.1 | 54.7 | 89.5 | 53.4 | 53.9 |
| T0-TAGI | 32.7 | 31.5 | 32.9 | 36.6 | 27.3 | 61.3 | 79.6 | 68.7 | 48.2 | 59.9 | 56.4 | 89.4 | 53.8 | 54.6 |
| HyperT5-Prefix | 33.4 | - | - | - | 32.3 | 60.1 | 73.9 | 71.5 | 51.1 | 63.0 | 51.1 | - | - | 54.6 |
| HyperT5-LoRA | 33.6 | - | - | - | 33.0 | 49.5 | 74.2 | 67.4 | 52.0 | 64.0 | 52.9 | - | - | 53.3 |
| # XL (3B) |||||||||||||||
| T5-LM | 32.7 | 32.2 | 33.4 | 32.7 | 24.6 | 32.7 | 53.1 | 48.8 | 50.8 | 57.6 | 50.9 | 51.4 | 42.6 | 43.9 |
| T5-LM Full FT | 34.6 | 35.5 | 34.3 | 33.9 | 27.1 | 67.8 | 54.8 | 50.7 | 53.7 | 47.7 | 50.7 | 63.3 | 47.2 | 48.4 |
| T5-LM Metatrain | 32.7 | 31.5 | 32.3 | 34.3 | 33.3 | 59.5 | 74.8 | 69.5 | 52.6 | 53.8 | 54.2 | 88.4 | 53.1 | 53.8 |
| T5-LM-FiD | 39.3 | 39.8 | 37.6 | 40.4 | 31.4 | 67.0 | 92.3 | 78.8 | 50.4 | 64.5 | 61.2 | 96.5 | 60.0 | 60.6 |
| T5-LM-TAGI | 37.7 | 37.8 | 36.1 | 39.3 | 32.0 | 68.2 | 89.4 | 76.6 | 53.6 | 61.2 | 59.6 | 94.2 | 58.9 | 59.8 |
| T0 | 38.0 | 38.4 | 35.7 | 40.0 | 26.5 | 67.7 | 82.2 | 80.1 | 53.5 | 57.3 | 57.8 | 94.0 | 57.6 | 57.9 |
| T0 Full FT | 38.5 | 37.5 | 38.8 | 39.2 | 38.7 | 81.9 | 88.0 | 80.1 | 55.9 | 59.5 | 61.4 | 95.0 | 61.4 | 63.0 |
| T0 Metatrain | 37.0 | 37.3 | 33.2 | 40.4 | 24.8 | 66.9 | 81.9 | 78.9 | 52.7 | 60.2 | 55.6 | 92.8 | 56.8 | 57.3 |
| T0-FiD | 38.6 | 39.0 | 36.5 | 40.5 | 28.5 | 62.9 | 87.4 | 74.6 | 52.1 | 62.7 | 61.0 | 95.5 | 58.2 | 58.5 |
| T0-TAGI | 38.7 | 39.5 | 35.6 | 41.0 | 26.5 | 68.7 | 87.8 | 78.2 | 52.2 | 61.8 | 59.8 | 95.6 | 58.8 | 59.2 |
| HyperT5-Prefix | 38.7 | - | - | - | 33.6 | 69.6 | 88.4 | 79.5 | 53.1 | 57.6 | 56.6 | - | - | 59.6 |
| HyperT5-LoRA | 35.3 | - | - | - | 30.8 | 66.4 | 83.3 | 68.5 | 50.3 | 60.0 | 56.1 | - | - | 56.4 |

Appendix D Limitations

Large Language Models. Due to computational constraints, most of our experiments were conducted using models with at most 3B parameters. Given the complexity of our research, we restricted our focus to encoder-decoder models, which have demonstrated superior performance in cross-task generalization [34]; we explore this further in Appendix B.1. Consequently, it remains uncertain whether instruction learning can be effectively scaled to larger models (7B parameters or more) or to commonly used decoder-only models. However, since our method preserves the original model parameters without compromising performance, we anticipate its applicability to broader research in the future.

Training Costs. Although TAGI is computationally efficient during inference, its training cost is significantly higher. This is due to the additional requirements beyond the foundation laid by previous work, including the introduction of knowledge distillation, running a hypernetwork to generate adapters for each batch, and pre-training some downstream task-specific models. Consequently, while TAGI may be highly efficient for inference and suitable for users with limited resources, training a unique TAGI model presents considerable challenges.

Datasets. In the SNI study, our investigation was limited to tasks in English, leaving the generalization capabilities in a multilingual context unexplored. However, given the proven effectiveness of hypernetwork methods in achieving multilingual generalization [2, 44], we are optimistic about the potential directions for our future research in this domain. Furthermore, in P3, we adopted the methodologies of T0 [29] and FiD-ICL [40], concentrating primarily on natural language processing (NLP) tasks amenable to ranking classification. This focus included tasks related to classification and multiple-choice questions but excluded other types of generative tasks. Looking ahead, we aim to develop new research resources and broaden our experimental scope and evaluations to encompass a more diverse array of categories.

Task category splits for SNI (cumulative).

| Task Category | # Num of Tasks |
|---|---|
| First Split (6 Tasks) ||
| Question Answering | 157 |
| Program Execution | 90 |
| Question Generation | 51 |
| Sentiment Analysis | 42 |
| Misc. | 36 |
| Toxic Language Detection | 32 |
| Second Split (15 Tasks) ||
| Text Categorization | 28 |
| Commonsense Classification | 23 |
| Text Matching | 17 |
| Named Entity Recognition | 17 |
| Information Extraction | 17 |
| Wrong Candidate Generation | 15 |
| Text Completion | 14 |
| Question Understanding | 13 |
| Text to Code | 12 |
| Third Split (30 Tasks) ||
| Summarization | 12 |
| Dialogue Generation | 11 |
| Word Semantics | 10 |
| Story Composition | 9 |
| Speaker Identification | 9 |
| Pos Tagging | 9 |
| Linguistic Probing | 9 |
| Fill in The Blank | 8 |
| Text Quality Evaluation | 7 |
| Stereotype Detection | 7 |
| Sentence Composition | 7 |
| Negotiation Strategy Detection | 7 |
| Gender Classification | 7 |
| Coherence Classification | 6 |
| Word Relation Classification | 5 |
| Fourth Split (60 Tasks) ||
| Explanation | 5 |
| Text Simplification | 4 |
| Sentence Perturbation | 4 |
| Paraphrasing | 4 |
| Mathematics | 4 |
| Intent Identification | 4 |
| Dialogue State Tracking | 4 |
| Code to Text | 4 |
| Sentence Ordering | 3 |
| Fact Verification | 3 |
| Answer Verification | 3 |
| Translation | 2 |
| Style Transfer | 2 |
| Stance Detection | 2 |
| Speaker Relation Classification | 2 |
| Question Decomposition | 2 |
| Number Conversion | 2 |
| Irony Detection | 2 |
| Grammar Error Detection | 2 |
| Spelling Error Detection | 1 |
| Spam Classification | 1 |
| Sentence Expansion | 1 |
| Sentence Compression | 1 |
| Punctuation Error Detection | 1 |
| Preposition Prediction | 1 |
| Poem Generation | 1 |
| Entity Relation Classification | 1 |
| Entity Generation | 1 |
| Discourse Relation Classification | 1 |
| Discourse Connective Identification | 1 |

Task splits for P3 (cumulative).

| Task | # Num of Prompts |
|---|---|
| Meta-Train ||
| First Split (5 Tasks) ||
| cosmos_qa | 13 |
| kilt_tasks_hotpotqa | 5 |
| amazon_polarity | 9 |
| cnn_dailymail_3.0.0 | 9 |
| common_gen | 9 |
| Second Split (10 Tasks) ||
| glue_mrpc | 7 |
| adversarial_qa_dbert | 5 |
| ag_news | 7 |
| dream | 5 |
| gigaword | 9 |
| Third Split (20 Tasks) ||
| paws | 12 |
| wiki_qa | 11 |
| ropes | 12 |
| quoref | 11 |
| dbpedia_14 | 4 |
| multi_news | 6 |
| imdb | 10 |
| quail | 13 |
| quartz | 8 |
| wiki_bio | 5 |
| Fourth Split (36 Tasks) ||
| adversarial_qa_dbidaf | 5 |
| adversarial_qa_droberta | 5 |
| duorc_SelfRC | 9 |
| duorc_ParaphraseRC | 9 |
| cos_e_v1.11 | 11 |
| qasc | 8 |
| sciq | 5 |
| glue_qqp | 6 |
| social_i_qa | 6 |
| wiki_hop_original | 9 |
| wiqa | 8 |
| app_reviews | 4 |
| rotten_tomatoes | 10 |
| yelp_review_full | 7 |
| samsum | 7 |
| xsum | 10 |
| Meta-Test ||
| super_glue_wsc.fixed | |
| winogrande_winogrande_xl | |
| super_glue_cb | |
| super_glue_rte | |
| anli (r1/r2/r3) | |
| super_glue_copa | |
| hellaswag | |
| super_glue_wic | |
| story_cloze | |
