Huanxuan Liao1,2, Shizhu He1,2 , Yao Xu1,2, Yuanzhe Zhang1,2,
Yanchao Hao3, Shengping Liu4, Kang Liu1,2, Jun Zhao1,2
1 The Key Laboratory of Cognition and Decision Intelligence for Complex Systems,
Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
3 Platform and Content Group, Tencent, Beijing, China 4 Unisound, Beijing, China
liaohuanxuan2023@ia.ac.cn {shizhu.he, yao.xu, kliu, jzhao}@nlpr.ia.ac.cn
Corresponding author
Abstract
Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT). However, IFT still relies heavily on instance training with extensive task data, which greatly limits the adaptability of LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount. Contrary to LLMs, humans acquire skills and complete tasks not merely through repeated practice but also by understanding and following instructional guidelines. This paper is dedicated to simulating human learning to address the shortcomings of instance training, focusing on instruction learning to enhance cross-task generalization. Within this context, we introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model in a parameter generation manner based on the given task instructions, without retraining for unseen tasks. Specifically, we utilize knowledge distillation to enhance the consistency between TAGI developed through Learning with Instruction and task-specific models developed through Training with Instance, by aligning the labels, output logits, and adapter parameters between them. TAGI is endowed with cross-task generalization capabilities through a two-stage training process that includes hypernetwork pretraining and finetuning. We evaluate TAGI on the Super-Natural Instructions and P3 datasets. The experimental results demonstrate that TAGI can match or even outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements. Our code will be available at https://github.com/Xnhyacinth/TAGI.
1 Introduction
Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT), which describes different tasks in the same natural language format [3; 6; 22]. However, IFT still relies heavily on instance training with extensive task data {(Description, [Demonstrations], Source, Target)} [36; 38], which faces significant limitations in adapting LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount.
Therefore, for better cross-task generalization, the "zero-shot" learning ability of LLMs is crucial for real-world applications: models learned with instructions can achieve non-trivial performance on unseen tasks with just a single instruction that provides a comprehensive description of the task (e.g., "You will be given sentences in which your task is to recognize the name of a person."). Traditionally, achieving this capability involves meta-training the model by associating each input with specific task instructions [20; 36]. For example, GPT-3 [24] has demonstrated strong "zero-shot" capabilities through meta-training. However, these methods heavily depend on the foundation model's abilities and are inefficient for various unseen tasks [21; 43], as they require reprocessing extensive task instructions and some supplementary task data (e.g., examples from few-shot instances) for each input (see the top of Figure 1).
In recent years, researchers have begun to explore meta-learning to enhance the cross-task generalization capabilities of LLMs, aiming to construct flexible, reusable and robust task-specific models [1; 33]. For example, task-specific models such as Adapter [11], LoRA [12], and Prefix [17] have been constructed by a hypernetwork [8]. This approach significantly enhances task generalization by processing instructions efficiently, reducing redundant computations [26]. However, these methods heavily depend on a substantial corpus of training instances, which can hinder their capacity to efficiently learn and construct task-specific models based on provided instructions [13].
In fact, contrary to LLMs, humans acquire skills and complete tasks not only through repeated practice but also by understanding and following instructional guidelines [15]. For example, a tourist with basic knowledge of riding vehicles can easily learn to use new ones abroad for the first time with the help of travel guides. This paper aims to mimic the way humans learn skills by understanding instructions. This shift represents a modest evolution in task model construction, transitioning from traditional instance training models to a contemporary approach focused on instruction learning. By providing task instructions, the novel paradigm offers an automated solution for generating task-specific adapters and seamlessly integrating them into the base model. This approach aims to streamline the development of task-specific models while enhancing their ability to generalize across diverse tasks with instructions.
Guided by this goal, we introduce Task Adapters Generation from Instructions (TAGI), which converts instructions into task-specific adapters using a hypernetwork. Under the knowledge distillation framework [10; 35], we enable models to learn under the "Learning with Instruction" paradigm in a manner analogous to the "Training with Instance" paradigm. TAGI enhances the alignment between the task-specific model (acting as the teacher) and the vanilla LLM combined with the generated task adapters (acting as the student) (see the bottom of Figure 1). This alignment is achieved not only through instance training but also by incorporating parameter learning for task-specific models based on instructions. Specifically, we align the student under the two distinct paradigms, encompassing not just the targets and logits but also the adapters' parameters via an L2 regularization, which improves the model's understanding of instructions and its ability to generate more effective task-specific adapters. Moreover, TAGI endows the model with task generalization capabilities through a two-stage training process: hypernetwork pretraining on standard text pretraining data (e.g., C4 [28]), followed by finetuning on meta-training tasks. This allows it to generalize effectively across unseen tasks without sacrificing performance.
We evaluate TAGI on the Super-Natural Instructions (SNI) [36] and P3 [29] datasets. Experimental results demonstrate its ability to effectively generate adapters for unseen tasks, surpassing meta-trained models by 2% in SNI and 5% in P3 while reducing computational demands by 60%, and outperforming other hypernetwork models by 7%. Notably, our method does not require additional parameter updating or gradient back-propagation, and it avoids the inefficiency of repeatedly encoding instructions during inference. We summarize our contributions as follows:
- We propose a novel model construction paradigm, Learning with Instruction, which imitates human learning abilities for the cross-task generalization of LLMs. To the best of our knowledge, this is the first time a task-specific model has been generated based on instruction learning, with its capabilities and parameters distilled from a teacher model trained on instance learning.
- We use a knowledge distillation framework to develop task-specific models within the instruction learning paradigm. By aligning model parameters comprehensively, the TAGI method improves the model's ability to understand instructions and solve unseen tasks more accurately and efficiently.
- Comprehensive quantitative and qualitative assessments have highlighted the effectiveness of TAGI on two publicly available large-scale instruction datasets, with lower inference costs.
2 Related Work
TAGI draws inspiration from previous research on instruction following, hypernetworks and knowledge distillation. In this section, we will delve into the pioneering work in these areas.
Instruction Following is often used to evaluate the cross-task generalization of LLMs, and it is dedicated to handling any task described in natural language. Recent findings suggest that additional finetuning of LLMs with instructions substantially improves their zero-shot capabilities [6; 37; 38]. Moreover, large-scale multi-task meta-training has been shown to equip models with the ability to address new tasks in zero- or few-shot scenarios, facilitated by standard task formats and prompts [29; 43] alongside concise task instructions and select examples [23; 36]. However, the instructions and examples can significantly escalate the computational burden compared to task-specific models. Existing attempts to mitigate this issue involve creating adapters to separately process instructions and examples [13; 41], at the cost of reduced performance. To overcome these limitations, we introduce a new paradigm that draws on instruction-based learning, simulating instance training to enhance the perception and processing capabilities of LLMs for handling unseen tasks.
Hypernetworks [8; 30] are neural networks that generate weights for other neural networks [4] and are designed to use fewer parameters to dynamically build task-specific models [9; 32]. Notable works such as HyperTuning [26], HINT [13], and Hypter [41] have all adopted hypernetworks to convert task instructions and demonstrations into adapters for LLMs, and MEND [5] utilizes hypernetworks to compress demonstrations into distilled vectors. Although these methods avoid processing lengthy instructions repeatedly and utilize adapters to make training and testing more cost-effective [18], they still incur a performance loss compared to meta-training [7]. The proposed method TAGI likewise utilizes hypernetworks, which are instrumental in generating task-specific adapters that are seamlessly integrated into LLMs. Compared to existing hypernetwork-based models, TAGI not only trains at the instance level but also incorporates knowledge distillation to supervise the adapters generated by the hypernetwork, thereby achieving both efficiency and effectiveness.
Knowledge Distillation is a technique in which a smaller model (student) learns to mimic the predictions of a larger model (teacher), aiming to retain performance while reducing computational resources [10]. Indeed, the application of knowledge distillation is the essential difference between the proposed method and other hypernetwork-based methods such as HINT [13] and Hypter [41]. Recently, some works [31] have utilized knowledge distillation to finetune small language models such as T5 [28], enabling them to behave as if pre-prompted without being given any prompts. Compared with typical knowledge distillation methods for LLMs, the proposed method TAGI further utilizes model parameter alignment and aims to mimic another paradigm of human skill learning. We not only calculate the Kullback-Leibler (KL) divergence [14] between the teacher and student models [10], but also compute the L2 regularization between the adapter generated by instruction learning and the task-specific models obtained by instance training.
3 Methods
3.1 Problem Setting
Cross-task Generalization: Given a set of tasks $\mathcal{T}$, each task $T \in \mathcal{T}$ contains a set of (source, target) samples $\{(x_i, y_i)\}$. We categorize these tasks into three distinct non-overlapping groups for validating out-of-distribution generalization: meta-train ($\mathcal{T}_{train}$), meta-valid ($\mathcal{T}_{valid}$), and meta-test ($\mathcal{T}_{test}$), assuming all tasks adhere to a text-to-text format. For example, $\mathcal{T}_{train}$ comprises tasks like translation and question answering, while $\mathcal{T}_{valid}$ and $\mathcal{T}_{test}$ encompass tasks such as paraphrasing and natural language inference, respectively. The goal is to utilize the data within $\mathcal{T}_{train}$ for training and to transfer knowledge that facilitates learning to resolve the test tasks. For all methods discussed, aside from the original unsupervised pretraining of the language model backbone on separate corpora, model learning primarily takes place through multi-task training on $\mathcal{T}_{train}$.
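To make this setup concrete, the following sketch illustrates the task structure and splits in code; the dataclass layout and the example tasks are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    instruction: str   # natural-language task description
    examples: list     # (source, target) pairs in text-to-text format

# Non-overlapping splits: models are trained only on meta_train;
# meta_test tasks (and their instructions) are never seen during training.
meta_train = [Task("translation", "Translate English to German.", [("Hello.", "Hallo.")])]
meta_valid = [Task("paraphrase", "Rewrite the sentence, keeping its meaning.", [])]
meta_test  = [Task("nli", "Decide whether the premise entails the hypothesis.", [])]
```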
3.2 Task Adapters Generation from Instructions (TAGI)
In this section, we introduce the detailed method of TAGI. For each (unseen) task, TAGI consists of two core components: a hypernetwork (§3.2.1), which receives task instructions and generates parameter-efficient adapters, and a task-specific model, which combines the vanilla LLM with the adapters generated by the hypernetwork.
Unlike traditional meta-training methods, we transition from training with instances to learning with instructions, which not only addresses efficiency issues at the instance level but also incorporates parameter alignment for the task-specific model parameters at the instruction level. The complete process is shown in Figure 2: we initially train the LoRA modules (§3.2.2) on various upstream tasks (seen tasks) with the task datasets of meta-train ($\mathcal{T}_{train}$). For each distinct upstream task, we independently train a LoRA module, denoted as $\theta^{*}_t$ for task $t$, presumed to represent the optimal model for its respective task. Subsequently, TAGI is committed to building proprietary models for downstream tasks (unseen tasks). Its training process is bifurcated into two primary phases: hypernetwork pretraining (§3.2.3) and hypernetwork finetuning (§3.2.4), which encompasses distillation and alignment.
3.2.1 Hypernetwork for Converting Instructions into LoRA
A pivotal element of our model is the hypernetwork that converts task instructions (descriptions and demonstrations) into a parameter-efficient module. Our hypernetwork comprises two crucial components. The encoder, derived from the vanilla LLM (we find that re-using the encoder from the vanilla LLM works well [13]), is designed to minimize encoding biases by converting task instructions into a continuous contextual representation. This representation is then fused with the LLM input and concatenated with the encoded input for the decoder. Additionally, the adapter generator, utilizing a basic MLP design, is both lightweight and efficient, effectively converting encoded instructions into parameter-efficient modules.
Encoder: Prior studies simply concatenated encoded instructions with inputs, overlooking the interactions between them. To address this, we integrate a hierarchical cross-attention layer into the encoder of the LLM to refine the input representation with embedded instruction details. Specifically, for an input $x$ and its corresponding task instruction $I$, we initially employ the encoder within the hypernetwork to encode the instruction into representations $H_I$. Then, we feed $x$ into the model and obtain the output representation $h_l$ from the self-attention sublayer in the $l$-th layer. Ultimately, $h_l$ is processed through the $l$-th cross-attention layer, resulting in a text representation $h'_l$ that is enriched with instruction information:
$$h'_l = \mathrm{CrossAttn}\big(h_l,\, H_I,\, H_I\big) \qquad (1)$$
where $\mathrm{CrossAttn}(\cdot)$ conducts multi-head attention on the query, key, and value matrices, followed by a residual connection and layer normalization. The final input to the decoder is the concatenation of the encoded instruction and the encoded fusion input, i.e., $[H_I;\, H'_x]$, where $H'_x$ denotes the instruction-enriched input encoding.
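As a concrete illustration of Eq. (1), the PyTorch sketch below implements one such cross-attention fusion layer; the module name, dimensions, and single-layer form are assumptions for illustration rather than the exact architecture.

```python
import torch
import torch.nn as nn

class InstructionFusionLayer(nn.Module):
    """Cross-attention sublayer that enriches input states with the encoded
    instruction, followed by a residual connection and layer normalization."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor, h_instr: torch.Tensor) -> torch.Tensor:
        # Query: input states h_l; key/value: instruction states H_I.
        attn_out, _ = self.attn(query=h, key=h_instr, value=h_instr)
        return self.norm(h + attn_out)

# Fuse a batch of input states (B, T, d) with instruction states (B, S, d).
fusion = InstructionFusionLayer(d_model=512, n_heads=8)
h_fused = fusion(torch.randn(2, 16, 512), torch.randn(2, 32, 512))
```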
Adapter Generator: Considering efficiency and effectiveness, we utilize a two-layer multi-layer perceptron (MLP) to generate parameter-efficient modules (e.g., LoRA) from the encoded instruction. To differentiate between the query and value matrices as well as the layers, we introduce layer ids as positional information. We use a unique network for each layer and share it between $W_q$ and $W_v$ (i.e., one network is used for the LoRA generation of a certain layer).
$$\big(A^q_l, B^q_l\big),\ \big(A^v_l, B^v_l\big) = \mathrm{MLP}_l\big([\bar{H}_I;\, e_l]\big) \qquad (2)$$
where $\bar{H}_I$ is the pooled instruction representation, $e_l$ is the layer-id embedding, and $(A^q_l, B^q_l)$ and $(A^v_l, B^v_l)$ are the $l$-th layer LoRA matrices of $W_q$ and $W_v$, respectively.
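A minimal sketch of such a per-layer adapter generator follows; the hidden size, output packing, and the omission of the layer-id embedding are illustrative assumptions (the paper uses one MLP per layer, shared between $W_q$ and $W_v$).

```python
import torch
import torch.nn as nn

class LoRAGenerator(nn.Module):
    """Two-layer MLP mapping a pooled instruction embedding to the LoRA
    factors (A, B) for both W_q and W_v of a single transformer layer."""

    def __init__(self, d_model: int, d_proj: int, rank: int, hidden: int = 1024):
        super().__init__()
        self.d_proj, self.rank = d_proj, rank
        out_dim = 4 * rank * d_proj  # (A, B) for each of W_q and W_v
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, instr_emb: torch.Tensor):
        a_q, b_q, a_v, b_v = self.mlp(instr_emb).chunk(4, dim=-1)
        as_A = lambda t: t.view(-1, self.rank, self.d_proj)  # A: (r, d)
        as_B = lambda t: t.view(-1, self.d_proj, self.rank)  # B: (d, r)
        return (as_A(a_q), as_B(b_q)), (as_A(a_v), as_B(b_v))

# One generator is instantiated per layer and conditioned on that layer's id.
gen = LoRAGenerator(d_model=512, d_proj=512, rank=32)
lora_q, lora_v = gen(torch.randn(1, 512))
```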
3.2.2 LoRA Tuning for Task-specific Models
LoRA [12] efficiently reduces the number of trainable parameters by decomposing the update of the LLM's attention weight matrix $W \in \mathbb{R}^{d \times k}$ into low-rank matrices. Specifically, LoRA updates the weight matrix as $W' = W + BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ being trainable low-rank matrices of rank $r \ll \min(d, k)$, significantly smaller in dimensions than $W$. We finetune a robust baseline to derive the LoRA parameters $\theta^{*}_t$ of the task-specific model for the $t$-th task, facilitating LLM instruction learning and parameter alignment. SNI is categorized into 60 types based on task type, while P3 encompasses 36 categories, corresponding to 60 and 36 parameter modules, respectively.
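The sketch below shows the standard LoRA update described above; the initialization and the restriction to a single linear layer follow common practice and are not specific to this paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update: W' = W + BA."""

    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # A: (r, k)
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # B: (d, r), zero init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Adds x (BA)^T to the frozen output; only A and B receive gradients.
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), rank=32)
y = layer(torch.randn(2, 16, 512))
```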
3.2.3 Hypernetwork Pretraining for Preliminary Generalization
Previous research [5; 26] has demonstrated that pretraining hypernetworks can substantially improve the model's cross-task generalization capabilities. Following HINT [13], we pretrain the hypernetwork on C4 [28] before finetuning it on a diverse multi-task prompt dataset. As illustrated in the right segment of Figure 2, given an input sequence, we partition it into randomly sized segments $s_1$, $s_2$, and $s_3$, where $s_1$ is fed into the hypernetwork, $s_2$ into the LLM, and $s_3$ is the segment to predict. During this stage, training is conducted by minimizing the cross-entropy loss $\mathcal{L}_{CE}$, aiming to ensure that the hypernetwork learns to recognize instructions to enhance generalization ability.
$$\mathcal{L}_{CE} = -\sum_{t=1}^{|s_3|} \log P\big(s_3^{(t)} \mid s_1, s_2, s_3^{(<t)}\big) \qquad (3)$$
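A possible implementation of this random three-way segmentation is sketched below; the minimum segment length is an assumed detail.

```python
import random

def split_for_pretraining(token_ids: list, min_len: int = 16):
    """Randomly partition a sequence into (s1, s2, s3): s1 conditions the
    hypernetwork, s2 is the LLM input, and s3 is the prediction target."""
    n = len(token_ids)
    i = random.randint(min_len, n - 2 * min_len)   # end of s1
    j = random.randint(i + min_len, n - min_len)   # end of s2
    return token_ids[:i], token_ids[i:j], token_ids[j:]

s1, s2, s3 = split_for_pretraining(list(range(1024)))
```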
3.2.4 Hypernetwork Finetuning for Instruction Learning
At this stage, TAGI is finetuned on a multi-task prompt dataset, enabling it to learn to generate optimal parameters from task instructions, thereby ensuring effective generalization to future unseen tasks. Similar to the pretraining phase, task instructions (alongside some few-shot samples) replace $s_1$, the main input replaces $s_2$, and the target replaces $s_3$. In each iteration, the hypernetwork generates LoRA parameters and encodes the instructions. The generated LoRA is inserted into the model as a parameter-efficient module, and the encoded instructions are integrated with the encoder's embeddings for information fusion and concatenated with the fused encoding input during decoding. Beyond the standard $\mathcal{L}_{CE}$, we employ knowledge distillation for instruction learning: a strong baseline combining the complete task instructions and the input serves as the teacher, while the model incorporating the generated LoRA parameters with the input acts as the student. The KL divergence measures the discrepancy in word probability distributions between the two models as an implicit learning outcome, and the MSE loss measures the difference between the generated parameters and those of the task-specific parameter-efficient modules as an explicit learning intermediate result. The formulation of finetuning is as follows:
$$\mathcal{L}_{KD} = \mathrm{KL}\big(P_{tea}(y \mid I, x)\,\big\|\,P_{stu}(y \mid x;\, \hat{\theta}_t)\big) \qquad (4)$$

$$\mathcal{L}_{align} = \big\|\hat{\theta}_t - \theta^{*}_t\big\|_2^2 \qquad (5)$$

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda_1 \mathcal{L}_{KD} + \lambda_2 \mathcal{L}_{align} \qquad (6)$$
where $\hat{\theta}_t$ denotes the LoRA parameters generated by the hypernetwork, $\theta^{*}_t$ is the optimal LoRA module of the $t$-th task, and $\lambda_1$ and $\lambda_2$ are the hyper-parameters that control the importance of distillation in finetuning.
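Putting Eqs. (4)–(6) together, a sketch of the combined finetuning objective might look as follows; the tensor shapes, reduction choices, and default weights are assumptions rather than the exact training code.

```python
import torch.nn.functional as F

def tagi_finetune_loss(student_logits, teacher_logits, labels,
                       gen_lora, task_lora, lam1=1.0, lam2=1.0):
    """Cross-entropy on the targets, KL to the teacher's distribution (Eq. 4),
    and L2 alignment of generated vs. task-specific LoRA parameters (Eq. 5),
    combined as in Eq. (6)."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    align = sum(F.mse_loss(g, t) for g, t in zip(gen_lora, task_lora))
    return ce + lam1 * kd + lam2 * align
```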
4 Experiments
We first present the datasets (§4.1) and baselines (§4.2) used in our evaluation and then discuss three research questions (RQs):
RQ1: Can the proposed instruction learning paradigm effectively learn the ability of instance training? Can it support cross-task generalization of LLMs? (§4.4)
RQ2: How many foundation tasks does TAGI need to learn to achieve better results? (§4.5)
RQ3: What is the impact of different modules and learning stages on TAGI? (§4.7)
4.1 Datasets
To demonstrate the generality of our method, we evaluate our approach on two popular multi-task instruction datasets (we provide the full list of datasets and more details in Appendix A.2): Super-Natural Instructions (SNI) [36] and the T0 split of P3 [29].
SNI comprises over 1,600 task datasets; each includes a task definition and a set of fixed positive and negative demonstrations. We follow previous research [13; 26] and examine two settings: using only the task definition as the input to the hypernetwork ('Def'), and using the definition along with two few-shot positive examples ('Def + 2 Pos'). We use only the English tasks in the dataset, and the model's generations are evaluated on a set of 119 unseen tasks using ROUGE-L.
P3 is composed of 62 task datasets; the T0 model is trained with these tasks divided into meta-training and meta-test sets. The prompt format accounts for zero-shot reasoning and typically includes instructions or possible answer options. We follow prior work [40] by using the 36 tasks of the T0 training subset to train our model. Evaluation is conducted based on the accuracy of multiple-choice questions for the 11 unseen tasks in the meta-test set (MTest11).
| Method | Pre-Train | Instr. Fus. | Low Infer. Cost | Instr. Learning | Unseen Task |
| --- | --- | --- | --- | --- | --- |
| Simple FT | ✗ | ✔ | ✗ | ✗ | ✗ |
| T0 [29] / Tk-Instruct [36] | ✗ | ✔ | ✗ | ✗ | ✔✔✔ |
| Hypter [41] | ✗ | ✗ | ✔ | ✗ | ✔ |
| HyperTuning [26] | ✔ | ✗ | ✔ | ✗ | ✔ |
| HINT [13] | ✔ | ✗ | ✔ | ✗ | ✔✔ |
| TAGI (Ours) | ✔ | ✔ | ✔ | ✔ | ✔✔✔ |
4.2 Baselines
We compare the characteristics of TAGI against eight primary groups of baselines (as shown in Table 1): 1) No FT: models without finetuning. 2) HyperTuning [26]: models that use a hypernetwork to convert demonstrations into adapters, without instruction fusion. 3) Hypter [41]: hypernetwork-based models without pretraining. 4) HINT [13]: models that pretrain the hypernetwork and concatenate instructions. 5) T0 and Tk-Instruct: strong baselines fully finetuned on P3 and SNI, respectively, with instructions concatenated. 6) Full FT: models finetuned on target tasks. 7) Decoder-only models: fully finetuned decoder-only models such as GPT-2 [27] and OPT [42]. 8) FiD-ICL [40]: an ICL method using encoder intermediate fusion.
4.3 Implementations
We limit our scope to encoder-decoder models for our experiments (we discuss the encoder-decoder and decoder-only models in detail in Appendix B.1). We use T5-LM-Adapt (https://huggingface.co/google/t5-xl-lm-adapt) and T0 [29] as initializations in our experiments. The two model groups have the same architectural framework but differ in weights; T0 uses T5-LM-Adapt for initialization and undergoes multi-task training on the P3 meta-training set. For SNI, only T5-LM-Adapt is considered, and three different sizes are tested: Base (250M), XL (3B), and XXL (11B), with the teacher model being Tk-Instruct [36]. For P3, we experimented with two sets of models of three different sizes: Base (250M), Large (800M), and XL (3B), with only the template as input, and the teacher model being FiD-ICL [40] with 16-shot examples. Appendix A.4 contains more implementation details and experimental settings.
4.4 Main Results
Super-Natural Instructions. We report the performance and inference costs of TAGI models and baselines in Table 2. Our analysis and findings yield several key insights:
Firstly, methods lacking finetuning exhibit subpar performance. As shown in the first row of the table, the performance of No FT is lower than the other baseline methods (except for Hypter) by approximately 30 points, which underscores the critical role of the inductive bias introduced during meta-training in enhancing the model's instructional adherence and cross-task generalization.
| Method | Def: Base (250M) | Def: XL (3B) | Def: XXL (11B) | Def + 2 Pos.: Base (250M) | Def + 2 Pos.: XL (3B) | Def + 2 Pos.: XXL (11B) | Avg. Rel. FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No FT | 8.8 | 14.3 | 26.2 | 9.4 | 13.6 | 30.5 | 1.0 |
| Tk-Instruct† | 35.3 | 48.0 | 53.6 | 42.1 | 54.0 | 62.0 | 1.0 |
| *Decoder-only models* | | | | | | | |
| GPT-2 XL (1.5B)∗ | - | 38.2 | - | - | 45.3 | - | 0.33 |
| OPT (13B)∗ | - | - | 44.8 | - | - | 51.5 | 0.36 |
| *Hypernetwork-based models* | | | | | | | |
| Hypter∗ | 12.1 | 16.8 | 15.5 | 10.6 | 14.2 | 13.4 | 0.35 |
| HyperTuning† | - | 38.9 | - | - | 48.6 | - | 0.34 |
| HINT∗ | 33.3 | 47.2 | 51.1 | 41.8 | 53.2 | 56.4 | 0.37 |
| TAGI (Ours) | 35.3 | 48.4 | 52.3‡ | 42.5 | 56.3 | 58.4‡ | 0.39 |
Secondly, TAGI demonstrates notable improvements over other hypernetwork-based baselines, with only a marginal increase in inference overhead (see the last column of Table 2). We find that TAGI still outperforms the advanced method HINT while achieving similar computational savings. This highlights the efficacy of instruction learning with knowledge distillation. The underperformance of HINT and HyperTuning may stem from their sole reliance on cross-entropy with the target during meta-training, lacking explicit supervision of the intermediate task-specific module parameters and implicit supervision of the teacher outcome. This deficiency impedes their ability to fully leverage instruction tasks for generating superior adapter parameters during meta-testing.
Thirdly, TAGI consistently matches or even surpasses robust baselines in both zero- and few-shot settings. Comparing TAGI with multi-task finetuning approaches such as Full FT and Tk-Instruct, we observe that TAGI achieves comparable performance (except for the 11B model) while utilizing approximately 2.5× fewer FLOPs. TAGI's performance on the 11B model is somewhat lacking, potentially attributable to either insufficient training due to resource limitations or a decrement in performance stemming from the omission of parameter alignment constraints due to time constraints (we discuss the trend and possible reasons in Appendix B.2). In alignment with prior research, TAGI significantly surpasses GPT-2 and OPT-13B in comparative analyses with decoder-only models, affirming the superiority of encoder-decoder models within similar meta-learning frameworks. Overall, TAGI fulfills its objective by enhancing cross-task generalization capabilities through instruction learning and striking an optimal balance between performance and efficiency.
| Method | T5-LM: Base (250M) | T5-LM: Large (800M) | T5-LM: XL (3B) | T0: Base (250M) | T0: Large (800M) | T0: XL (3B) | Avg. Rel. Infer. Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *MTest11 Avg.* | | | | | | | |
| Zero-shot | 43.9 | 41.5 | 42.6 | 49.1 | 52.4 | 57.6 | 1.0 |
| Full FT | 44.6 | 45.5 | 47.2 | 51.9 | 56.6 | 61.4 | 1.0 |
| Metatrain ♡ | 44.1 | 52.4 | 53.1 | 50.1 | 52.4 | 56.8 | 1.0 |
| *ICL-based methods* | | | | | | | |
| Concat-ICLα | 44.2 | 47.6 | - | 48.6 | 53.2 | - | 4.1 |
| FiD-ICLα | 47.0 | 55.2 | 60.0 | 51.0 | 53.4 | 58.2 | 1.9 |
| Ensemble-ICLα | 44.6 | 54.5 | 52.6 | 49.9 | 53.7 | 57.7 | 13.2 |
| *Hypernetwork-based models* | | | | | | | |
| Hypter∗ | - | - | - | - | - | 56.2 | - |
| HINT∗ | - | - | - | - | - | 60.3 | - |
| TAGI (Ours) | 45.6 | 54.7 | 58.9 | 50.8 | 53.8 | 58.8 | 0.88 |
| *HyperT5 Avg. (without SCloze dataset)* | | | | | | | |
| FiD-ICLα | 46.9 | 55.8 | 60.6 | 51.7 | 53.9 | 58.5 | 1.9 |
| HyperTuning† | - | 54.6 | 59.6 | - | - | - | - |
| TAGI (Ours) | 46.7 | 56.0 | 59.8 | 51.7 | 54.6 | 59.2 | 0.88 |
P3. We report results on the T0 evaluation set in Table 3, with full results in Appendix C.2.
Firstly, examining the ICL-based methods presented in the middle section, it is evident that all three ICL aggregation strategies achieve superior performance. This underscores the utility of instructions and demonstrations in aiding LLMs. However, these methods require concatenating extensive demonstrations during both training and inference, which significantly increases computational demands and reduces efficiency (2–13.2× inference time). In contrast, TAGI, by processing the task instructions only once, attains comparable or superior accuracy while significantly curtailing computational burdens (0.88×). TAGI is only slightly behind FiD-ICL [40] on T5-LM, yet it outperforms the other methods. For T0, it is only 1.5 points lower than Full FT and exceeds all ICL-based methods. Notably, TAGI requires neither the 16 examples used by the ICL-based methods nor the repeated processing of instructions required by the baselines, significantly reducing inference overhead.
A comparison of the first three lines of results indicates that for large or XL models, initializing with T5-LM outperforms T0. We hypothesize that the process of training T5-LM to transition into T0 might result in the dilution of world knowledge or the diminishment of certain specific capabilities, thereby attenuating the benefits derived from meta-training. Conversely, for models of base size, T0 serves as a more effective initialization point.
Furthermore, TAGI outperforms competing hypernetwork models (because HINT is designed for TPUs and HyperTuning is not open-sourced, we did not measure their inference time; based on the SNI experiments, however, the trend in time expenditure can be inferred to be consistent). By comparing the last two columns, it is evident that the performance on MTest11 surpasses HINT and HyperTuning. Additionally, in the HyperT5 evaluation, the performance exceeds HyperTuning. This aligns with prior findings, suggesting that instruction learning augments the hypernetwork's task comprehension and its capacity to generate task-specific adapters.
4.5 Varying Number of Meta-Training Tasks
A fundamental component of our methodology is incorporating parameter alignment in instruction learning. Consequently, it is imperative to examine how varying the number of tasks to which parameter alignment is applied affects outcomes and influences the generalization capabilities of LLMs. To this end, we conduct a comprehensive experimental analysis comparing the efficacy of instruction learning with parameter alignment across a spectrum of task quantities against instruction learning devoid of parameter alignment. Tasks are organized in descending order based on the number of datasets encompassed within each. Subsequently, a predetermined number of tasks are sequentially selected for meta-training purposes. This approach allows us to systematically evaluate the impact of parameter alignment on learning and generalization as the number of tasks varies.
From Figure 3, we find that, firstly, an increase in the number of tasks correlates with improved performance across all methods, suggesting that meta-training across a broader array of tasks enhances the model's instruction-following capabilities. However, the practical limitations of sourcing a sufficient quantity of tasks for meta-training must be acknowledged. Secondly, the TAGI model exhibits lower overall performance in the absence of parameter alignment for instruction learning, yet it demonstrates a smaller relative standard deviation and less variability in performance with respect to the number of tasks. This pattern aligns with the expected outcomes of instruction learning, highlighting the efficacy of our approach in bolstering the model's ability to adhere to task instructions and generate task-specific adapters.
4.6 Parameter Size against Performance
We analyzed the proportion of generated parameter sizes relative to the total parameter size when generating LoRA of various ranks, and compared this to the performance of the full meta-training finetuning method, as demonstrated in Figure 4 and Table 7. We find that TAGI requires only about 10% of the parameters to outperform full meta-training finetuning, which indicates that the limited parameters generated by the hypernetwork serve as an optimal solution for task completion. The ability to adaptively construct models tailored to specific tasks removes the necessity for additional finetuning, underscoring TAGI's effectiveness and efficiency.
| Method | Def | Def + 2 Pos. | P3 |
| --- | --- | --- | --- |
| Tk-Instruct | 48.0 | 54.0 | - |
| Tk-Instruct-LoRA | 47.5 | 54.6 | - |
| Tk-Instruct-Prefix | 42.6 | 54.2 | - |
| HyperTuning | 38.9 | 48.6 | 59.6 |
| HINT | 47.2 | 53.2 | 60.3 |
| TAGI | 48.4 | 56.3 | 60.6 |
| *Ablation Study* | | | |
| w/o pretraining | 47.1 | 55.6 | 58.3 |
| w/o Instr. Fus. | 35.1 | 40.6 | 44.2 |
| w/o $\mathcal{L}_{CE}$ | 47.6 | 55.4 | 59.8 |
| w/o $\mathcal{L}_{KD}$ | 45.7 | 53.9 | 57.3 |
| w/o $\mathcal{L}_{align}$ | 47.5 | 55.2 | 59.4 |
| w/o Hypernetwork | 43.8 | 50.7 | - |
4.7 Ablation Study
To evaluate the significance of each component within the TAGI model, we conducted a series of experiments across two meta-task datasets utilizing the T5-LM-XL (3B) model. The results, as depicted in Table 4, highlight that instruction fusion plays a pivotal role in enhancing model performance. This process facilitates dynamic interaction between the input and the instructions, enriching the model's input with additional contextual information, reminiscent of the substantial benefits observed with ICL. Moreover, pretraining emerges as a critical phase, markedly improving the capabilities of models that have not undergone pretraining and thereby significantly enhancing their proficiency in interpreting and executing task instructions. Furthermore, the systematic removal of various components during the finetuning phase leads to a consistent decline in performance, underscoring the integral contribution of each component to the model's overall efficacy.
Compared to meta-learning methods such as LoRA finetuning (rank = 32, "Tk-Instruct-LoRA"), prefix finetuning (num_virtual_tokens = 32, "Tk-Instruct-Prefix"), and full finetuning ("Tk-Instruct"), our TAGI method enhances task comprehension and utilization; this is achieved through a hypernetwork that dynamically generates LoRA adapters inserted into the LLM based on the input, leading to better cross-task generalization capabilities. Notably, prefix finetuning excels in the Def + 2 Pos. scenario, likely due to its effective integration of information from positive examples. Conversely, it performs less satisfactorily in the Def scenario, indicating that instructions alone are insufficient for optimal results. Comparative analysis with other hypernetwork models reveals that TAGI's ablation performance remains robust, affirming the effectiveness of each step in bolstering TAGI's operational efficiency.
5 Conclusions
In this paper, we introduce an innovative instruction learning method designed to emulate instance training. This approach enables the model to accomplish specified tasks and learn from instructions how to address a category of problems. The proposed TAGI seamlessly integrates the instruction into the input and processes the instruction simultaneously, thereby ensuring minimal inference overhead. Concurrently, we employ a knowledge distillation framework to facilitate instruction learning, distilling skills and aligning task-specific models. This allows the hypernetwork to transform task instructions into an efficient module inserted into the LLM, thereby boosting generalization performance. Remarkably, TAGI consistently equals or surpasses the efficacy of conventional meta-training approaches while requiring fewer FLOPs and obviating the need for additional model parameter updates or gradient back-propagation. Future work will investigate more potent hypernetwork pretraining techniques and develop superior instruction fusion methods to augment the hypernetwork's expressive capability, thereby enhancing the model's ability to generalize to unseen tasks. Moreover, future work will investigate various task type classifications and the generalization effects of cross-modal tasks in instruction learning.
6 Acknowledgements
This work was supported by National Key R&D Program of China (No. 2022YFF0711900) and the National Natural Science Foundation of China (No.62376270, No.62276264). This work was supported by the Youth Innovation Promotion Association CAS.
References
- [1] Jonathan Baxter. Learning to Learn. Springer US, 1998.
- [2] Christos Baziotis, Mikel Artetxe, James Cross, and Shruti Bhosale. Multilingual machine translation with hyper-adapters, 2022.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- [4] Vinod Kumar Chauhan, Jiandong Zhou, Ping Lu, Soheila Molaei, and David A. Clifton. A brief review of hypernetworks in deep learning, 2023.
- [5] Tong Chen, Qirun Dai, Zhijie Deng, and Dequan Wang. Demonstration distillation for efficient in-context learning, 2024.
- [6] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
- [7] Budhaditya Deb, Guoqing Zheng, and Ahmed Hassan Awadallah. Boosting natural language generation from instructions with meta-learning, 2022.
- [8] David Ha, Andrew Dai, and Quoc V. Le. Hypernetworks, 2016.
- [9] Yun He, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, and Ed H. Chi. HyperPrompt: Prompt-based task-conditioning of transformers, 2022.
- [10] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [11] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP, 2019.
- [12] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- [13] Hamish Ivison, Akshita Bhagia, Yizhong Wang, Hannaneh Hajishirzi, and Matthew Peters. HINT: Hypernetwork instruction tuning for efficient zero-shot generalisation. ACL, 2023.
- [14] James M. Joyce. Kullback-Leibler Divergence, pages 720–722. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
- [15] Sharon Kim, Mahjabeen Raza, and Edward Seidman. Improving 21st-century teaching skills: The key to effective 21st-century learners. Springer US, 2019.
- [16] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In Heike Adel and Shuming Shi, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
- [17] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics.
- [18] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.
- [19] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The Flan collection: Designing data and methods for effective instruction tuning, 2023.
- [20] Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. MetaICL: Learning to learn in context. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States, July 2022. Association for Computational Linguistics.
- [21] Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. Reframing instructional prompts to GPTk's language, 2022.
- [22] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In ACL, 2022.
- [23] Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions, 2022.
- [24] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
- [25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA, 2019.
- [26] Jason Phang, Yi Mao, Pengcheng He, and Weizhu Chen. HyperTuning: Toward adapting large language models without back-propagation, 2022.
- [27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
- [28] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [29] Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Tali Bers, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. Multitask prompted training enables zero-shot task generalization, 2022.
- [30] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4:131–139, 1992.
- [31] Charlie Snell, Dan Klein, and Ruiqi Zhong. Learning by distilling context, 2022.
- [32] Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. HyperGrid transformers: Towards a single model for multiple tasks. In International Conference on Learning Representations, 2021.
- [33] Sebastian Thrun and Lorien Y. Pratt. Learning to learn: Introduction and overview. In Learning to Learn, 1998.
- [34] Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, and Colin Raffel. What language model architecture and pretraining objective work best for zero-shot generalization?, 2022.
- [35] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788, 2020.
- [36] Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
- [37] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022.
- [38] Orion Weller, Nicholas Lourie, Matt Gardner, and Matthew E. Peters. Learning from task descriptions. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1361–1375, Online, November 2020. Association for Computational Linguistics.
- [39] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational Linguistics.
- [40] Qinyuan Ye, Iz Beltagy, Matthew Peters, Xiang Ren, and Hannaneh Hajishirzi. FiD-ICL: A fusion-in-decoder approach for efficient in-context learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8158–8185, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [41] Qinyuan Ye and Xiang Ren. Learning to generate task-specific adapters from task description, 2021.
- [42] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022.
- [43] Ruiqi Zhong, Kristy Lee, Zheng Zhang, and Dan Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections, 2021.
- [44] Ahmet Üstün, Arianna Bisazza, Gosse Bouma, Gertjan van Noord, and Sebastian Ruder. Hyper-X: A unified hypernetwork for multi-task multilingual transfer, 2022.
Appendix A Experimental Settings
A.1 Problem Setting
Meta-Training and Inference: Our methodology rigorously adheres to the protocol outlined in MetaICL [20]. In the meta-train phase, we commence by selecting a task from $\mathcal{T}_{train}$, followed by the sampling of support examples and query examples from the chosen task. The proposed hypernetwork is then adjusted to minimize the overall loss, focusing on generating a task model that can accurately predict the target sequences (e.g., answers) for source sequences (e.g., questions). During the meta-test/inference phase, for each novel task in $\mathcal{T}_{test}$, we employ instructions to create the task-specific adapter, optimizing the model's performance across all query examples.
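The meta-test procedure can be summarized in code as follows; `with_adapters`, `generate`, and the task fields are hypothetical placeholders for the actual adapter-insertion machinery.

```python
def solve_unseen_task(hypernetwork, llm, task):
    """Meta-test sketch: generate the adapter once from the task instruction,
    insert it into the frozen LLM, then answer every query with no gradient
    updates and no repeated instruction re-encoding."""
    lora_params = hypernetwork(task.instruction)   # one forward pass per task
    adapted_llm = llm.with_adapters(lora_params)   # hypothetical insertion API
    return [adapted_llm.generate(query) for query in task.queries]
```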
| Dataset | Examples per Task | Train | Test |
| --- | --- | --- | --- |
| Super-Natural Instructions | 100 | 75,417 | 11,810 |
| P3 | - | 90,897,454 | 2,940,068 |
| P3 (Sampling) | 1000 | 290,000 | 2,940,068 |
A.2 Datasets
During the pretraining phase, we utilized the C4 dataset [28], truncating each sequence to 1024 tokens. For the training phase, we employed the Super-Natural Instructions (SNI) [36] and P3 [29] datasets for meta-training and meta-testing. For SNI, we adhered to the default settings [13; 36], which include 100 examples per task for both the training and test splits. For P3, we used the data and prompts provided by T0. All prompts related to the meta-training tasks were included in the meta-training process, while the meta-test phase utilized the evaluation prompts specified by T0 [29]. We treated ANLI R1, R2, and R3 as three distinct tasks, resulting in 11 tasks for the original meta-test in P3 (Meta-Test-11). Due to resource constraints, we deviated from the sampling procedures of prior work, opting to sample 1000 examples per task for each prompt template. This approach yielded a smaller dataset size, as detailed in Table 5. For further information on P3, refer to [29]. Additionally, to facilitate comparison with the HyperTuning method, we excluded the StoryCloze task from the evaluation since it was not included in the datasets for the HyperT5 evaluation.
A.3 Split Sizes for Varying Number of Meta-Training Tasks
As shown in Appendix D, we present a comprehensive list of the two datasets, including the number of tasks or templates contained in each and the task divisions from the §4.5 experiments. The divisions in the table are cumulative; thus, the second division includes both the first and the second divisions. For SNI, tasks were sorted in descending order based on the number of tasks they contained and then divided into specified sizes (6, 15, 30, 60). For P3, we selected a specified number of tasks (5, 10, 20, 36) based on the task classification in the original paper, which includes categories such as Multiple-Choice QA, Closed-Book QA, Summarization, Structure-To-Text, Paraphrase Identification, Sentiment, Topic Classification, and Extractive QA.
We obtain all our data from huggingface datasets [16]. In the following, we provide the dataset links:
- Super-Natural Instructions: https://github.com/allenai/natural-instructions
- Additionally, the Super-Natural Instructions dataset (previously known as Natural Instructions-v2) has undergone some changes over time. In our experiments, we use the v2.6 version.
A.4 Implementations
Our implementations are based on huggingface transformers v4.23.1 [39] using PyTorch v1.13.1 [25] and deepspeed v0.10.0 (https://github.com/microsoft/DeepSpeed). All experiments were conducted on 4 A100 NVIDIA GPUs, each equipped with 80GB of memory, and eight A6000 NVIDIA GPUs with 48GB of memory. Unless otherwise specified, the rank of the LoRA generated by the hypernetwork is 32, and we use the AdamW optimizer with a learning rate of 5e-5 and a linear warmup rate of 0.02. We pretrain all models for 50,000 steps on C4 [28] with a batch size of 8 samples and sequences of length 1024.
A.5 T0-Base/Large/3B
T0 [29] provides model checkpoints only in sizes 3B and 11B. Additionally, HINT [13] and FiD-ICL [40] re-pretrained T0 and found that the model was not sufficiently trained, achieving better results after reproduction. Therefore, we used the T0 models reproduced by FiD-ICL (e.g., https://huggingface.co/qinyuany/fid-icl-t0-large) to conduct a series of experiments.
| Setting | LoRA Tuning | Pretraining | Finetuning SNI: Base (250M) | Finetuning SNI: XL (3B) | Finetuning SNI: XXL (11B) | Finetuning P3: Base (250M) | Finetuning P3: Large (800M) | Finetuning P3: XL (3B) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Max Input Len | 1024 | 1024 | 1024 | 1024 | 1024 | 512 | 512 | 512 |
| Max Output Len | 128 | - | 128 | 128 | 128 | 64 | 64 | 64 |
| Optimizer | adamw | adafactor | adamw | adamw | adamw | adamw | adamw | adamw |
| Learning Rate | 1e-4 | 1e-3 | 1e-4 | 5e-5 | 5e-5 | 1e-4 | 1e-4 | 5e-5 |
| Precision | bf16 | float32 | bf16 | bf16 | bf16 | bf16 | bf16 | bf16 |
| Training Steps | 10000 | 50000 | 20000 | 20000 | 20000 | 20000 | 20000 | 20000 |
| Warmup Steps | - | - | 2% | 2% | 2% | 2% | 2% | 2% |
| Batch Size | 8 | 8 | 8 | 2 | 1 | 8 | 4 | 2 |
| Gradient Accumulation | 2 | 1 | 2 | 4 | 2 | 2 | 4 | 4 |
| LoRA Rank | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |

Warmup steps for finetuning runs are set to 2% of the total training steps.
A.6 Hyperparameter
The complete stable hyperparameter set used for training runs can be found in Table 6.
Appendix B Additional Experiments and Findings
B.1 Why we choose Enc-Dec Models?
Previous work has suggested that models with an encoder-decoder (enc-dec) structure have advantages over decoder-only (dec-only) models in terms of task generalization and instruction-following capabilities [19; 34; 40]. Therefore, in our experiments, we only considered models with an enc-dec structure (T5-LM and T0). Our experimental results demonstrate that enc-dec models indeed have an advantage, although dec-only models might have higher computational efficiency due to their ability to cache KV states and their fewer layers. Our method, TAGI, significantly improves performance in various aspects with only a slight increase in computational overhead, since it encodes the task instructions only once on top of the original computation.
B.2 T5-LM-XXL Training Trend
In this section, we detail how the performance of the T5-LM-XXL (11B) model surpasses the hypernetwork models but falls short of the meta-trained strong baseline Tk-Instruct by 1–4 points, as mentioned in §4.4. The primary reason is insufficient training; when replicating the Tk-Instruct experiment, our results were significantly lower than reported when finetuning for only 20,000 steps. Consequently, we analyzed the performance of our TAGI model at different finetuning steps. As shown on the left side of Figure 5, performance steadily improves with more steps, showing substantial growth. Thus, we reasonably predict that increasing the steps to 50,000 or more could surpass Tk-Instruct. Another possible reason is the lack of parameter alignment for the 11B model due to limited resources. Our previous analysis has shown that parameter alignment is crucial, with larger models benefiting more. Therefore, we analyzed performance with a small number of tasks for parameter alignment. As shown on the right side of Figure 5, performance with parameter alignment for 6 and 15 tasks is better than without alignment. Based on these trends, it can be inferred that performance with full task parameter alignment could surpass Tk-Instruct.
B.3 Analysis on Hyperparameters
To explore the optimal hyperparameter settings for our experiments, we conducted a series of tests and error analyses using the T5-LM-Base (250M) model. The findings presented in Table 7 reveal that variations in hyperparameters can lead to performance fluctuations, particularly with higher learning rates or reduced finetuning steps. Given the varying pretraining conditions of models of different sizes, a size-specific analysis is essential; however, details on larger models are omitted here due to resource limitations.
We observed that different LoRA rank settings minimally affect performance, leading us to select a balanced rank of 32. Similarly, the impact of the warmup ratio is negligible; thus, based on our experience, we chose a warmup ratio of one percent of the maximum finetuning steps. While more finetuning steps generally correlate with improved performance, excessive finetuning can result in overfitting on the meta-training tasks, thereby diminishing generalizability. Moreover, increased finetuning steps require greater computational resources. Consequently, we determined the optimal number of finetuning steps to be 20,000 based on our experimental outcomes.
| Method | LR 5e-5 | LR 1e-4 | LR 3e-4 | LR 1e-3 | Rank 16 | Rank 32 | Rank 64 | Steps 15000 | Steps 20000 | Steps 25000 | Warmup 0.01 | Warmup 0.02 | Warmup 0.03 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *SNI (Def + 2 Pos.)* | | | | | | | | | | | | | |
| Tk-Instruct [36] | 41.3 | 41.8 | 42.2 | 38.9 | - | - | - | 41.4 | 41.8 | 42.1 | 41.5 | 41.8 | 40.6 |
| TAGI (Ours) | 42.1 | 42.5 | 40.3 | 39.7 | 41.8 | 42.5 | 42.3 | 41.8 | 42.5 | 42.4 | 42.3 | 42.5 | 41.9 |
| *SNI (Def)* | | | | | | | | | | | | | |
| Tk-Instruct [36] | 35.0 | 34.2 | 32.6 | 31.7 | - | - | - | 34.4 | 34.2 | 34.5 | 35.0 | 34.2 | 34.3 |
| TAGI (Ours) | 34.3 | 35.3 | 33.5 | 31.8 | 34.8 | 35.3 | 35.4 | 34.2 | 35.3 | 35.4 | 34.8 | 35.3 | 34.9 |
| *P3 (MTest11 Avg.)* | | | | | | | | | | | | | |
| Metatrain | 43.3 | 44.1 | 43.6 | 40.9 | - | - | - | 44.0 | 44.1 | 44.3 | 44.2 | 44.1 | 43.6 |
| TAGI (Ours) | 44.0 | 45.6 | 44.0 | 41.6 | 44.8 | 45.6 | 45.5 | 44.3 | 45.6 | 45.2 | 45.1 | 45.6 | 44.8 |
B.4 How are $\lambda_1$ and $\lambda_2$ tuned?
In the experiment, we varied $\lambda_1$ and $\lambda_2$ over several values. The effects of these different values on the results are illustrated in Figure 6 and Table 8. We maintained all other conditions constant and only varied $\lambda_1$ and $\lambda_2$ to perform an ablation experiment in the Def + 2 Pos. scenario.
| $\lambda_1$ | $\lambda_2$ | RougeL |
| --- | --- | --- |
| 0.5 | - | 40.1 |
| 2 | - | 40.9 |
| 5 | - | 42.5 |
| 10 | - | 38.7 |
| 5 | 0.2 | 41.3 |
| 5 | 0.5 | 41.6 |
| 5 | 1.0 | 41.2 |
B.5 Inference Cost
To analyze the computational efficiency of the TAGI model compared to the standard instruction training model (full finetuning), consider a scenario where we have to process $n$ samples, each of length $l_s$, along with a task instruction of length $l_i$. We assume the output sequence length is negligible and thus ignore it in our computations.
In a typical full finetuning setup, such as Tk-Instruct, each input is concatenated with the task instruction, requiring the model to process the combined input sequence. If we denote the number of FLOPs required to process a single token with an encoder-decoder model as $2P$, where $P$ is the total number of model parameters, then the total computation cost for all samples can be estimated as $\mathrm{FLOPs}_{FT} = 2P \cdot n\,(l_i + l_s)$. Here, each of the $n$ samples includes both the instruction and the sample input, leading to $n\,(l_i + l_s)$ tokens being processed.
Our TAGI model, on the other hand, processes the task instruction only once, regardless of the number of samples. This unique feature significantly reduces the computation required, especially as the number of samples or the length of the instruction increases. The total computation cost in this model is given by:In this case, the instruction length is processed only once, and each sample is processed separately, resulting in a total of tokens being processed.
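To make the comparison concrete, the two estimates can be computed directly. The sketch below implements both formulas; the model size, sample count, and lengths are illustrative placeholders chosen for exposition, not measurements from our experiments.

```python
def flops_full_ft(n_params, n_samples, sample_len, instr_len):
    """Full finetuning: every sample is concatenated with the instruction,
    so n * (i + s) tokens pass through the model at ~2N FLOPs per token."""
    return 2 * n_params * n_samples * (instr_len + sample_len)

def flops_tagi(n_params, n_samples, sample_len, instr_len):
    """TAGI: the instruction is encoded once by the hypernetwork and each
    sample is processed alone, giving i + n * s tokens in total."""
    return 2 * n_params * (instr_len + n_samples * sample_len)

# Illustrative setting: an 11B-parameter model, 100 samples of 128 tokens,
# and a 512-token instruction (values chosen for exposition only).
N, n, s, i = 11e9, 100, 128, 512
ratio = flops_full_ft(N, n, s, i) / flops_tagi(N, n, s, i)
print(f"{ratio:.1f}x")  # ~4.8x fewer inference FLOPs for TAGI here
```

The ratio $n(i+s)/(i+ns)$ grows with both the instruction length and the number of samples per task, which is why the savings are most pronounced for long instructions amortized over many inputs.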
Appendix C Extended Results
C.1 Characteristics Comparison of the Proposed TAGI and Other Baselines
Here, we report a full comparison between existing methods and the proposed TAGI in Table 9 (summarized in Table 1), covering eight dimensions. Finetuning on target tasks yields good performance, but it requires retuning for every new task and therefore cannot address unseen tasks directly. Strong meta-training baselines excel at handling unseen tasks by enabling models to solve problems from task instructions alone. Nevertheless, these methods operate only at the instance level and repeatedly process the concatenated instructions with comprehensive finetuning, resulting in large parameter updates and high inference costs.
Hypter [41] first considered tasks at the task level, treating instances of the same task as a unified entity and employing a hypernetwork to generate adapters that represent task-specific models from instructions. Building on this, HyperTuning [26] uses demonstrations to generate adapters and pretrains the hypernetwork to boost its expressive capability. Both strategies avoid feeding instructions directly into the model and rely on hypernetworks instead, which reduces parameter updates and lowers computational demands during inference. However, they suffer notable performance degradation because instructional information is absent from the input.
HINT [13] addresses this issue by inserting the encoded instruction after the encoder, eliminating redundant computation. Although these methods enable learning at the task level, they do not perform instruction-based learning, i.e., they do not explicitly supervise the hypernetwork's generation process to help it understand instructions and generate parameters.
The proposed TAGI rectifies these deficiencies by adding cross-attention for richer information fusion and supervised learning of adapter weights on top of the HINT architecture. This aids generalization to unseen tasks without increasing the computational burden (a sketch of the adapter-generation idea follows).
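To illustrate the mechanism, the toy module below generates LoRA adapter weights for a single target layer from an encoded instruction, using cross-attention to fuse the instruction encoding into the generated parameters. This is a minimal sketch of the idea, not our exact architecture; the class name, dimensions, and attention configuration are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdapterGenerator(nn.Module):
    """Toy hypernetwork: maps an encoded instruction to LoRA weights for
    one target layer. Names and dimensions are illustrative only."""
    def __init__(self, enc_dim=1024, hidden_dim=1024, rank=32):
        super().__init__()
        # Learned query slots, one per row of the two LoRA factors.
        self.query = nn.Parameter(torch.randn(2 * rank, enc_dim))
        # Cross-attention lets each parameter slot attend over the full
        # instruction encoding before projection to weight values.
        self.cross_attn = nn.MultiheadAttention(enc_dim, num_heads=8,
                                                batch_first=True)
        self.to_weights = nn.Linear(enc_dim, hidden_dim)
        self.rank = rank

    def forward(self, instr_enc):
        # instr_enc: (batch, instr_len, enc_dim) from a frozen encoder.
        batch = instr_enc.size(0)
        q = self.query.unsqueeze(0).expand(batch, -1, -1)
        fused, _ = self.cross_attn(q, instr_enc, instr_enc)
        w = self.to_weights(fused)                     # (batch, 2r, hidden)
        lora_A = w[:, : self.rank, :]                  # down-projection
        lora_B = w[:, self.rank :, :].transpose(1, 2)  # up-projection
        return lora_A, lora_B                          # delta_W = B @ A
```

Supervising the generated lora_A and lora_B against adapters trained directly on each task is what distinguishes instruction learning from merely conditioning on instructions.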
| Method | Meta-Train | Pre-Train | Instr. Concat. | Instr. Fus. | Low Up. Params | Low Infer. Cost | Instr. Learning | Unseen Task |
|---|---|---|---|---|---|---|---|---|
| Simple FT | ✗ | ✗ | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
| T0 [29] / Tk-Instruct [36] | ✔ | ✗ | ✔ | ✔ | ✗ | ✗ | ✗ | ✔✔✔ |
| Hypter [41] | ✔ | ✗ | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ |
| HyperTuning [26] | ✔ | ✔ | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ |
| HINT [13] | ✔ | ✔ | ✔ | ✗ | ✔ | ✔ | ✗ | ✔✔ |
| TAGI (Ours) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔✔✔ |
C.2 P3 Full Results
Table 10 reports the per-task performance and average accuracy on P3 summarized in §4.4.
| Method | ANLI ◇ | (R1) | (R2) | (R3) | HSwag | CB | COPA | RTE | WiC | WSC | WGD | SCloze | MTest11 Avg. | HyperT5 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 33.4 | 33.4 | 33.4 | 33.4 | 25.0 | 50.0 | 50.0 | 52.7 | 50.0 | 63.5 | 50.0 | 50.0 | 44.7 | 46.8 |
| **Base (250M)** | | | | | | | | | | | | | | |
| T5-LM † | 33.4 | 33.3 | 33.5 | 33.5 | 24.7 | 44.3 | 54.3 | 47.9 | 49.7 | 57.9 | 49.8 | 54.1 | 43.9 | 45.2 |
| T5-LM Full FT † | 33.8 | 34.5 | 33.4 | 33.5 | 24.8 | 66.5 | 45.7 | 51.1 | 53.7 | 46.3 | 49.8 | 50.9 | 44.6 | 46.5 |
| T5-LM Metatrain | 31.0 | 30.3 | 29.5 | 33.1 | 25.0 | 40.5 | 52.6 | 51.2 | 50.2 | 58.4 | 47.4 | 66.6 | 44.1 | 44.6 |
| T5-LM-FiD † | 33.0 | 32.4 | 33.1 | 33.4 | 26.7 | 42.5 | 58.8 | 54.6 | 51.1 | 57.9 | 50.3 | 76.3 | 47.0 | 46.9 |
| T5-LM-TAGI | 32.1 | 31.5 | 31.7 | 33.1 | 25.0 | 44.5 | 54.7 | 53.7 | 52.3 | 60.5 | 50.8 | 64.0 | 45.6 | 46.7 |
| T0 † | 32.3 | 31.5 | 32.4 | 33.1 | 26.5 | 45.8 | 65.9 | 69.3 | 51.6 | 56.7 | 51.2 | 76.1 | 49.1 | 49.9 |
| T0 Full FT † | 33.5 | 32.6 | 33.9 | 33.9 | 29.1 | 73.2 | 66.3 | 68.0 | 53.1 | 50.9 | 51.0 | 79.0 | 51.9 | 53.1 |
| T0 Metatrain | 32.1 | 31.5 | 31.5 | 33.2 | 29.5 | 50.4 | 64.2 | 68.2 | 47.7 | 61.6 | 52.8 | 80.8 | 50.1 | 50.8 |
| T0-FiD † | 32.7 | 31.7 | 32.9 | 33.6 | 26.2 | 54.9 | 68.2 | 68.1 | 51.9 | 60.3 | 51.3 | 82.3 | 51.0 | 51.7 |
| T0-TAGI | 32.7 | 31.1 | 31.9 | 35.0 | 29.8 | 49.3 | 67.1 | 70.0 | 49.0 | 61.2 | 54.4 | 79.6 | 50.8 | 51.7 |
| **Large (800M)** | | | | | | | | | | | | | | |
| T5-LM † | 32.7 | 32.1 | 33.4 | 32.7 | 25.3 | 33.8 | 50.5 | 49.0 | 51.0 | 50.4 | 50.5 | 47.8 | 41.5 | 42.9 |
| T5-LM Full FT † | 34.1 | 35.1 | 33.6 | 33.6 | 26.1 | 65.4 | 47.1 | 51.7 | 53.5 | 47.5 | 49.9 | 56.5 | 45.5 | 46.9 |
| T5-LM Metatrain | 31.3 | 30.0 | 30.5 | 33.4 | 27.0 | 60.4 | 77.6 | 71.9 | 47.0 | 56.4 | 54.8 | 87.2 | 52.4 | 53.3 |
| T5-LM-FiD † | 34.4 | 33.9 | 33.4 | 35.8 | 28.3 | 60.2 | 81.1 | 72.6 | 50.7 | 63.7 | 55.6 | 91.6 | 55.2 | 55.8 |
| T5-LM-TAGI | 33.7 | 33.5 | 32.5 | 35.1 | 27.8 | 62.9 | 79.0 | 76.1 | 52.9 | 57.9 | 58.2 | 86.2 | 54.7 | 56.0 |
| T0 † | 34.1 | 32.2 | 34.2 | 36.0 | 26.1 | 56.8 | 76.6 | 65.3 | 50.8 | 56.4 | 53.9 | 88.4 | 52.4 | 52.5 |
| T0 Full FT † | 35.3 | 34.5 | 35.4 | 36.2 | 33.1 | 80.1 | 80.8 | 69.2 | 54.1 | 53.2 | 56.3 | 90.0 | 56.6 | 57.8 |
| T0 Metatrain | 32.9 | 31.5 | 31.8 | 35.5 | 24.5 | 59.4 | 77.0 | 65.1 | 48.8 | 56.7 | 57.6 | 88.0 | 52.4 | 52.8 |
| T0-FiD † | 33.4 | 31.8 | 32.8 | 35.7 | 26.1 | 60.7 | 77.6 | 67.1 | 52.1 | 59.1 | 54.7 | 89.5 | 53.4 | 53.9 |
| T0-TAGI | 32.7 | 31.5 | 32.9 | 36.6 | 27.3 | 61.3 | 79.6 | 68.7 | 48.2 | 59.9 | 56.4 | 89.4 | 53.8 | 54.6 |
| HyperT5-Prefix ‡ | 33.4 | - | - | - | 32.3 | 60.1 | 73.9 | 71.5 | 51.1 | 63.0 | 51.1 | - | - | 54.6 |
| HyperT5-LoRA ‡ | 33.6 | - | - | - | 33.0 | 49.5 | 74.2 | 67.4 | 52.0 | 64.0 | 52.9 | - | - | 53.3 |
| **XL (3B)** | | | | | | | | | | | | | | |
| T5-LM † | 32.7 | 32.2 | 33.4 | 32.7 | 24.6 | 32.7 | 53.1 | 48.8 | 50.8 | 57.6 | 50.9 | 51.4 | 42.6 | 43.9 |
| T5-LM Full FT † | 34.6 | 35.5 | 34.3 | 33.9 | 27.1 | 67.8 | 54.8 | 50.7 | 53.7 | 47.7 | 50.7 | 63.3 | 47.2 | 48.4 |
| T5-LM Metatrain | 32.7 | 31.5 | 32.3 | 34.3 | 33.3 | 59.5 | 74.8 | 69.5 | 52.6 | 53.8 | 54.2 | 88.4 | 53.1 | 53.8 |
| T5-LM-FiD † | 39.3 | 39.8 | 37.6 | 40.4 | 31.4 | 67.0 | 92.3 | 78.8 | 50.4 | 64.5 | 61.2 | 96.5 | 60.0 | 60.6 |
| T5-LM-TAGI | 37.7 | 37.8 | 36.1 | 39.3 | 32.0 | 68.2 | 89.4 | 76.6 | 53.6 | 61.2 | 59.6 | 94.2 | 58.9 | 59.8 |
| T0 † | 38.0 | 38.4 | 35.7 | 40.0 | 26.5 | 67.7 | 82.2 | 80.1 | 53.5 | 57.3 | 57.8 | 94.0 | 57.6 | 57.9 |
| T0 Full FT † | 38.5 | 37.5 | 38.8 | 39.2 | 38.7 | 81.9 | 88.0 | 80.1 | 55.9 | 59.5 | 61.4 | 95.0 | 61.4 | 63.0 |
| T0 Metatrain | 37.0 | 37.3 | 33.2 | 40.4 | 24.8 | 66.9 | 81.9 | 78.9 | 52.7 | 60.2 | 55.6 | 92.8 | 56.8 | 57.3 |
| T0-FiD † | 38.6 | 39.0 | 36.5 | 40.5 | 28.5 | 62.9 | 87.4 | 74.6 | 52.1 | 62.7 | 61.0 | 95.5 | 58.2 | 58.5 |
| T0-TAGI | 38.7 | 39.5 | 35.6 | 41.0 | 26.5 | 68.7 | 87.8 | 78.2 | 52.2 | 61.8 | 59.8 | 95.6 | 58.8 | 59.2 |
| HyperT5-Prefix ‡ | 38.7 | - | - | - | 33.6 | 69.6 | 88.4 | 79.5 | 53.1 | 57.6 | 56.6 | - | - | 59.6 |
| HyperT5-LoRA ‡ | 35.3 | - | - | - | 30.8 | 66.4 | 83.3 | 68.5 | 50.3 | 60.0 | 56.1 | - | - | 56.4 |
Appendix D Limitations
Large Language Models. Due to computational constraints, most of our experiments were conducted with relatively small models. Given the complexity of our research, we restricted our focus to encoder-decoder models, which have demonstrated superior performance in cross-task generalization [34]; we explore this further in §B.1. Consequently, it remains uncertain whether instruction learning can be effectively scaled to substantially larger models or to the commonly used decoder-only models. However, since our method preserves the original model parameters without compromising performance, we anticipate its applicability to broader research in the future.
Training Costs. Although TAGI is computationally efficient during inference, its training cost is significantly higher. This stems from requirements beyond those of previous work: introducing knowledge distillation, running a hypernetwork to generate adapters for every batch, and pretraining a set of downstream task-specific models. Consequently, while TAGI is highly efficient at inference time and well suited to users with limited resources, training a new TAGI model remains a considerable challenge.
Datasets. In the SNI study, our investigation was limited to English tasks, leaving generalization in multilingual settings unexplored. However, given the proven effectiveness of hypernetwork methods for multilingual generalization [2; 44], we are optimistic about this direction for future research. Furthermore, on P3 we followed the methodologies of T0 [29] and FiD-ICL [40], concentrating primarily on natural language processing (NLP) tasks amenable to ranking classification; this covers classification and multiple-choice tasks but excludes other types of generative tasks. Looking ahead, we aim to develop new research resources and broaden our experimental scope and evaluations to a more diverse array of task categories.
| Task Category | # Tasks |
|---|---|
| **First Split (6 Tasks)** | |
| Question Answering | 157 |
| Program Execution | 90 |
| Question Generation | 51 |
| Sentiment Analysis | 42 |
| Misc. | 36 |
| Toxic Language Detection | 32 |
| **Second Split (15 Tasks)** | |
| Text Categorization | 28 |
| Commonsense Classification | 23 |
| Text Matching | 17 |
| Named Entity Recognition | 17 |
| Information Extraction | 17 |
| Wrong Candidate Generation | 15 |
| Text Completion | 14 |
| Question Understanding | 13 |
| Text to Code | 12 |
| **Third Split (30 Tasks)** | |
| Summarization | 12 |
| Dialogue Generation | 11 |
| Word Semantics | 10 |
| Story Composition | 9 |
| Speaker Identification | 9 |
| Pos Tagging | 9 |
| Linguistic Probing | 9 |
| Fill in The Blank | 8 |
| Text Quality Evaluation | 7 |
| Stereotype Detection | 7 |
| Sentence Composition | 7 |
| Negotiation Strategy Detection | 7 |
| Gender Classification | 7 |
| Coherence Classification | 6 |
| Word Relation Classification | 5 |
| **Fourth Split (60 Tasks)** | |
| Explanation | 5 |
| Text Simplification | 4 |
| Sentence Perturbation | 4 |
| Paraphrasing | 4 |
| Mathematics | 4 |
| Intent Identification | 4 |
| Dialogue State Tracking | 4 |
| Code to Text | 4 |
| Sentence Ordering | 3 |
| Fact Verification | 3 |
| Answer Verification | 3 |
| Translation | 2 |
| Style Transfer | 2 |
| Stance Detection | 2 |
| Speaker Relation Classification | 2 |
| Question Decomposition | 2 |
| Number Conversion | 2 |
| Irony Detection | 2 |
| Grammar Error Detection | 2 |
| Spelling Error Detection | 1 |
| Spam Classification | 1 |
| Sentence Expansion | 1 |
| Sentence Compression | 1 |
| Punctuation Error Detection | 1 |
| Preposition Prediction | 1 |
| Poem Generation | 1 |
| Entity Relation Classification | 1 |
| Entity Generation | 1 |
| Discourse Relation Classification | 1 |
| Discourse Connective Identification | 1 |
| Task | # Prompts |
|---|---|
| **Meta-Train: First Split (5 Tasks)** | |
| cosmos_qa | 13 |
| kilt_tasks_hotpotqa | 5 |
| amazon_polarity | 9 |
| cnn_dailymail_3.0.0 | 9 |
| common_gen | 9 |
| **Second Split (10 Tasks)** | |
| glue_mrpc | 7 |
| adversarial_qa_dbert | 5 |
| ag_news | 7 |
| dream | 5 |
| gigaword | 9 |
| **Third Split (20 Tasks)** | |
| paws | 12 |
| wiki_qa | 11 |
| ropes | 12 |
| quoref | 11 |
| dbpedia_14 | 4 |
| multi_news | 6 |
| imdb | 10 |
| quail | 13 |
| quartz | 8 |
| wiki_bio | 5 |
| **Fourth Split (36 Tasks)** | |
| adversarial_qa_dbidaf | 5 |
| adversarial_qa_droberta | 5 |
| duorc_SelfRC | 9 |
| duorc_ParaphraseRC | 9 |
| cos_e_v1.11 | 11 |
| qasc | 8 |
| sciq | 5 |
| glue_qqp | 6 |
| social_i_qa | 6 |
| wiki_hop_original | 9 |
| wiqa | 8 |
| app_reviews | 4 |
| rotten_tomatoes | 10 |
| yelp_review_full | 7 |
| samsum | 7 |
| xsum | 10 |
| **Meta-Test** | |
| super_glue_wsc.fixed | - |
| winogrande_winogrande_xl | - |
| super_glue_cb | - |
| super_glue_rte | - |
| anli (r1/r2/r3) | - |
| super_glue_copa | - |
| hellaswag | - |
| super_glue_wic | - |
| story_cloze † | - |