Documents
- contrastive noveltyaugmented learning anticipating outliers with large language models
- conal anticipating outliers with large language models
- frozen clip model is an efficient point cloud backbone
- pointclip point cloud understanding by clip
- clip2point transfer clip to point cloud classification with imagedepth pretraining
Abstract
In many task settings, text classification models are likely to encounter examples from novel classes on which they cannot predict correctly. Selective prediction, in which models abstain on low-confidence examples, provides a possible solution, but existing models are often overly confident on unseen classes. To remedy this overconfidence, we introduce Contrastive Novelty-Augmented Learning (CoNAL), a two-step method that generates OOD examples representative of novel classes, then trains to decrease confidence on them. First, we generate OOD examples by prompting a large language model twice: we prompt it to enumerate relevant novel classes, then generate examples from each novel class matching the task format. Second, we train a classifier with a novel contrastive objective that encourages lower confidence on generated OOD examples than training examples. When trained with CoNAL, classifiers improve in their ability to detect and abstain on novel class examples over prior methods by an average of 2.3% in terms of accuracy under the accuracy-coverage curve (AUAC) and 5.5% AUROC across 4 NLP datasets, with no cost to in-distribution accuracy.
Abstract
In many task settings, text classification models are likely to encounter examples from novel classes on which they cannot predict correctly. Selective prediction, in which models abstain on low-confidence examples, provides a possible solution, but existing models are often overly confident on OOD examples. To remedy this overconfidence, we introduce Contrastive Novelty-Augmented Learning (CoNAL), a two-step method that generates OOD examples representative of novel classes, then trains to decrease confidence on them. First, we generate OOD examples by prompting a large language model twice: we prompt it to enumerate relevant novel labels, then generate examples from each novel class matching the task format. Second, we train our classifier with a novel contrastive objective that encourages lower confidence on generated OOD examples than training examples. When trained with CoNAL, classifiers improve in their ability to detect and abstain on OOD examples over prior methods by an average of 2.3% AUAC and 5.5% AUROC across 4 NLP datasets, with no cost to in-distribution accuracy.1
Abstract
The pretraining-finetuning paradigm has demonstrated great success in NLP and2D image fields because of the high-quality representation ability andtransferability of their pretrained models. However, pretraining such a strongmodel is difficult in the 3D point cloud field since the training data islimited and point cloud collection is expensive. This paper introducesEfficient Point Cloud Learning (EPCL), an effective and efficient point cloudlearner for directly training high-quality point cloud models with a frozenCLIP model. Our EPCL connects the 2D and 3D modalities by semantically aligningthe 2D features and point cloud features without paired 2D-3D data.Specifically, the input point cloud is divided into a sequence of tokens anddirectly fed into the frozen CLIP model to learn point cloud representation.Furthermore, we design a task token to narrow the gap between 2D images and 3Dpoint clouds. Comprehensive experiments on 3D detection, semantic segmentation,classification and few-shot learning demonstrate that the 2D CLIP model can bean efficient point cloud backbone and our method achieves state-of-the-artaccuracy on both real-world and synthetic downstream tasks. Code will beavailable.
Abstract
Recently, zero-shot and few-shot learning via Contrastive Vision-LanguagePre-training (CLIP) have shown inspirational performance on 2D visualrecognition, which learns to match images with their corresponding texts inopen-vocabulary settings. However, it remains under explored that whether CLIP,pre-trained by large-scale image-text pairs in 2D, can be generalized to 3Drecognition. In this paper, we identify such a setting is feasible by proposingPointCLIP, which conducts alignment between CLIP-encoded point cloud and 3Dcategory texts. Specifically, we encode a point cloud by projecting it intomulti-view depth maps without rendering, and aggregate the view-wise zero-shotprediction to achieve knowledge transfer from 2D to 3D. On top of that, wedesign an inter-view adapter to better extract the global feature andadaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in2D. By just fine-tuning the lightweight adapter in the few-shot settings, theperformance of PointCLIP could be largely improved. In addition, we observe thecomplementary property between PointCLIP and classical 3D-supervised networks.By simple ensembling, PointCLIP boosts baseline's performance and evensurpasses state-of-the-art models. Therefore, PointCLIP is a promisingalternative for effective 3D point cloud understanding via CLIP under lowresource cost and data regime. We conduct thorough experiments onwidely-adopted ModelNet10, ModelNet40 and the challenging ScanObjectNN todemonstrate the effectiveness of PointCLIP. The code is released athttps://github.com/ZrrSkywalker/PointCLIP.
Abstract
Pre-training across 3D vision and language remains under development becauseof limited training data. Recent works attempt to transfer vision-languagepre-training models to 3D vision. PointCLIP converts point cloud data tomulti-view depth maps, adopting CLIP for shape classification. However, itsperformance is restricted by the domain gap between rendered depth maps andimages, as well as the diversity of depth distributions. To address this issue,we propose CLIP2Point, an image-depth pre-training method by contrastivelearning to transfer CLIP to the 3D domain, and adapt it to point cloudclassification. We introduce a new depth rendering setting that forms a bettervisual effect, and then render 52,460 pairs of images and depth maps fromShapeNet for pre-training. The pre-training scheme of CLIP2Point combinescross-modality learning to enforce the depth features for capturing expressivevisual and textual features and intra-modality learning to enhance theinvariance of depth aggregation. Additionally, we propose a novel Dual-PathAdapter (DPA) module, i.e., a dual-path structure with simplified adapters forfew-shot learning. The dual-path structure allows the joint use of CLIP andCLIP2Point, and the simplified adapter can well fit few-shot tasks withoutpost-search. Experimental results show that CLIP2Point is effective intransferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms PointCLIPand other self-supervised 3D networks, achieving state-of-the-art results onzero-shot and few-shot classification.
Documents
- scalable and transferable blackbox jailbreaks for language models via persona modulation
- prompt injection attack against llmintegrated applications
- do anything now characterizing and evaluating inthewild jailbreak prompts on large language models
- demystifying rce vulnerabilities in llmintegrated apps
- fuzzllm a novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models
Abstract
Despite efforts to align large language models to produce harmless responses,they are still vulnerable to jailbreak prompts that elicit unrestrictedbehaviour. In this work, we investigate persona modulation as a black-boxjailbreaking method to steer a target model to take on personalities that arewilling to comply with harmful instructions. Rather than manually craftingprompts for each persona, we automate the generation of jailbreaks using alanguage model assistant. We demonstrate a range of harmful completions madepossible by persona modulation, including detailed instructions forsynthesising methamphetamine, building a bomb, and laundering money. Theseautomated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is185 times larger than before modulation (0.23%). These prompts also transfer toClaude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%,respectively. Our work reveals yet another vulnerability in commercial largelanguage models and highlights the need for more comprehensive safeguards.
Abstract
Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.
Abstract
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intent use. However, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from public platforms to private ones, posing new challenges for LLM vendors in proactive detection. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and they have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable potential across various downstream tasks. LLM-integrated frameworks, which serve as the essential infrastructure, have given rise to many LLM-integrated web apps. However, some of these frameworks suffer from Remote Code Execution (RCE) vulnerabilities, allowing attackers to execute arbitrary code on apps' servers remotely via prompt injections. Despite the severity of these vulnerabilities, no existing work has been conducted for a systematic investigation of them. This leaves a great challenge on how to detect vulnerabilities in frameworks as well as LLM-integrated apps in real-world scenarios. To fill this gap, we present two novel strategies, including 1) a static analysis-based tool called LLMSmith to scan the source code of the framework to detect potential RCE vulnerabilities and 2) a prompt-based automated testing approach to verify the vulnerability in LLM-integrated web apps. We discovered 13 vulnerabilities in 6 frameworks, including 12 RCE vulnerabilities and 1 arbitrary file read/write vulnerability. 11 of them are confirmed by the framework developers, resulting in the assignment of 7 CVE IDs. After testing 51 apps, we found vulnerabilities in 17 apps, 16 of which are vulnerable to RCE and 1 to SQL injection. We responsibly reported all 17 issues to the corresponding developers and received acknowledgments. Furthermore, we amplify the attack impact beyond achieving RCE by allowing attackers to exploit other app users (e.g. app responses hijacking, user API key leakage) without direct interaction between the attacker and the victim. Lastly, we propose some mitigating strategies for improving the security awareness of both framework and app developers, helping them to mitigate these risks effectively.
Abstract
Jailbreak vulnerabilities in Large Language Models (LLMs), which exploit meticulously crafted prompts to elicit content that violates service guidelines, have captured the attention of research communities. While model owners can defend against individual jailbreak prompts through safety training strategies, this relatively passive approach struggles to handle the broader category of similar jailbreaks. To tackle this issue, we introduce FuzzLLM, an automated fuzzing framework designed to proactively test and discover jailbreak vulnerabilities in LLMs. We utilize templates to capture the structural integrity of a prompt and isolate key features of a jailbreak class as constraints. By integrating different base classes into powerful combo attacks and varying the elements of constraints and prohibited questions, FuzzLLM enables efficient testing with reduced manual effort. Extensive experiments demonstrate FuzzLLM's effectiveness and comprehensiveness in vulnerability discovery across various LLMs.
Documents
- selfexplanation prompting improves dialogue understanding in large language models
- can large language models capture public opinion about global warming an empirical assessment of algorithmic fidelity and bias
- large language models as data preprocessors
- leveraging large language models for scalable vector graphicsdriven image understanding
- lpml llmprompting markup language for mathematical reasoning
Abstract
Task-oriented dialogue (TOD) systems facilitate users in executing various activities via multi-turn dialogues, but Large Language Models (LLMs) often struggle to comprehend these intricate contexts. In this study, we propose a novel"Self-Explanation"prompting strategy to enhance the comprehension abilities of LLMs in multi-turn dialogues. This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks. Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts, demonstrating its potential as a powerful tool in enhancing LLMs' comprehension in complex dialogue tasks.
Abstract
Large language models (LLMs) have demonstrated their potential in socialscience research by emulating human perceptions and behaviors, a conceptreferred to as algorithmic fidelity. This study assesses the algorithmicfidelity and bias of LLMs by utilizing two nationally representative climatechange surveys. The LLMs were conditioned on demographics and/or psychologicalcovariates to simulate survey responses. The findings indicate that LLMs caneffectively capture presidential voting behaviors but encounter challenges inaccurately representing global warming perspectives when relevant covariatesare not included. GPT-4 exhibits improved performance when conditioned on bothdemographics and covariates. However, disparities emerge in LLM estimations ofthe views of certain groups, with LLMs tending to underestimate worry aboutglobal warming among Black Americans. While highlighting the potential of LLMsto aid social science research, these results underscore the importance ofmeticulous conditioning, model selection, survey question format, and biasassessment when employing LLMs for survey simulation. Further investigationinto prompt engineering and algorithm auditing is essential to harness thepower of LLMs while addressing their inherent limitations.
Abstract
Large Language Models (LLMs), typified by OpenAI's GPT series and Meta's LLaMA variants, have marked a significant advancement in artificial intelligence. Trained on vast amounts of text data, LLMs are capable of understanding and generating human-like text across a diverse range of topics. This study expands on the applications of LLMs, exploring their potential in data preprocessing, a critical stage in data mining and analytics applications. We delve into the applicability of state-of-the-art LLMs such as GPT-3.5, GPT-4, and Vicuna-13B for error detection, data imputation, schema matching, and entity matching tasks. Alongside showcasing the inherent capabilities of LLMs, we highlight their limitations, particularly in terms of computational expense and inefficiency. We propose an LLM-based framework for data preprocessing, which integrates cutting-edge prompt engineering techniques, coupled with traditional methods like contextualization and feature selection, to improve the performance and efficiency of these models. The effectiveness of LLMs in data preprocessing is evaluated through an experimental study spanning 12 datasets. GPT-4 emerged as a standout, achieving 100\% accuracy or F1 score on 4 datasets, suggesting LLMs' immense potential in these tasks. Despite certain limitations, our study underscores the promise of LLMs in this domain and anticipates future developments to overcome current hurdles.
Abstract
Recently, large language models (LLMs) have made significant advancements innatural language understanding and generation. However, their potential incomputer vision remains largely unexplored. In this paper, we introduce a new,exploratory approach that enables LLMs to process images using the ScalableVector Graphics (SVG) format. By leveraging the XML-based textual descriptionsof SVG representations instead of raster images, we aim to bridge the gapbetween the visual and textual modalities, allowing LLMs to directly understandand manipulate images without the need for parameterized visual components. Ourmethod facilitates simple image classification, generation, and in-contextlearning using only LLM capabilities. We demonstrate the promise of ourapproach across discriminative and generative tasks, highlighting its (i)robustness against distribution shift, (ii) substantial improvements achievedby tapping into the in-context learning abilities of LLMs, and (iii) imageunderstanding and generation capabilities with human guidance. Our code, data,and models can be found here https://github.com/mu-cai/svg-llm.
Abstract
In utilizing large language models (LLMs) for mathematical reasoning, addressing the errors in the reasoning and calculation present in the generated text by LLMs is a crucial challenge. In this paper, we propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL). We discovered that by prompting LLMs to generate structured text in XML-like markup language, we could seamlessly integrate CoT and the external tool and control the undesired behaviors of LLMs. With our approach, LLMs can utilize Python computation to rectify errors within CoT. We applied our method to ChatGPT (GPT-3.5) to solve challenging mathematical problems and demonstrated that combining CoT and Python REPL through the markup language enhances the reasoning capability of LLMs. Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
Documents
- do promptbased models really understand the meaning of their prompts
- promptbased length controlled generation with reinforcement learning
- tcrallm token compression retrieval augmented large language model for inference cost reduction
- bbtv2 towards a gradientfree future with large language models
- llmlingua compressing prompts for accelerated inference of large language models
Abstract
Recently, a boom of papers has shown extraordinary progress in zero-shot andfew-shot learning with various prompt-based models. It is commonly argued thatprompts help models to learn faster in the same way that humans learn fasterwhen provided with task instructions expressed in natural language. In thisstudy, we experiment with over 30 prompt templates manually written for naturallanguage inference (NLI). We find that models learn just as fast with manyprompts that are intentionally irrelevant or even pathologically misleading asthey do with instructively "good" prompts. Further, such patterns hold even formodels as large as 175 billion parameters (Brown et al., 2020) as well as therecently proposed instruction-tuned models which are trained on hundreds ofprompts (Sanh et al., 2022). That is, instruction-tuned models often producegood predictions with irrelevant and misleading prompts even at zero shots. Insum, notwithstanding prompt-based models' impressive improvement, we findevidence of serious limitations that question the degree to which suchimprovement is derived from models understanding task instructions in waysanalogous to humans' use of task instructions.
Abstract
Large language models (LLMs) like ChatGPT and GPT-4 have attracted great attention given their surprising performance on a wide range of NLP tasks. Length controlled generation of LLMs emerges as an important topic, which enables users to fully leverage the capability of LLMs in more real-world scenarios like generating a proper answer or essay of a desired length. In addition, the autoregressive generation in LLMs is extremely time-consuming, while the ability of controlling this generated length can reduce the inference cost by limiting the length. Therefore, we propose a prompt-based length control method to achieve high-accuracy length controlled generation. In particular, we adopt reinforcement learning with the reward signal given by either trainable or rule-based reward models, which further enhances the length-control ability of LLMs by rewarding outputs that follows pre-defined control instruction. To enable rule-based inference, we also introduce standard prompt extractor to collect the standard control information from users' input. Experiments show that our method significantly improves the accuracy of prompt-based length control for summarization task on popular datasets like CNNDM and NYT. Both the standard prompt extractor and the RL-tuned model have show strong generalization ability to unseen control prompt templates.
Abstract
Since ChatGPT released its API for public use, the number of applicationsbuilt on top of commercial large language models (LLMs) increase exponentially.One popular usage of such models is leveraging its in-context learning abilityand generating responses given user queries leveraging knowledge obtained byretrieval augmentation. One problem of deploying commercial retrieval-augmentedLLMs is the cost due to the additionally retrieved context that largelyincreases the input token size of the LLMs. To mitigate this, we propose atoken compression scheme that includes two methods: summarization compressionand semantic compression. The first method applies a T5-based model that isfine-tuned by datasets generated using self-instruct containing samples withvarying lengths and reduce token size by doing summarization. The second methodfurther compresses the token size by removing words with lower impact on thesemantic. In order to adequately evaluate the effectiveness of the proposedmethods, we propose and utilize a dataset called Food-Recommendation DB (FRDB)focusing on food recommendation for women around pregnancy period or infants.Our summarization compression can reduce 65% of the retrieval token size withfurther 0.3% improvement on the accuracy; semantic compression provides a moreflexible way to trade-off the token size with performance, for which we canreduce the token size by 20% with only 1.6% of accuracy drop.
Abstract
Most downstream adaptation methods tune all or part of the parameters ofpre-trained models (PTMs) through gradient descent, where the tuning costincreases linearly with the growth of the model size. By contrast,gradient-free methods only require the forward computation of the PTM to tunethe prompt, retaining the benefits of efficient tuning and deployment. Though,past work on gradient-free tuning often introduces gradient descent to seek agood initialization of prompt and lacks versatility across tasks and PTMs. Inthis paper, we present BBTv2, an improved version of Black-Box Tuning, to drivePTMs for few-shot learning. We prepend continuous prompts to every layer of thePTM and propose a divide-and-conquer gradient-free algorithm to optimize theprompts at different layers alternately. Extensive experiments across varioustasks and PTMs show that BBTv2 can achieve comparable performance to full modeltuning and state-of-the-art parameter-efficient methods (e.g., Adapter, LoRA,BitFit, etc.) under few-shot settings while maintaining much fewer tunableparameters.
Abstract
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of tokens. To accelerate model inference and reduce cost, this paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity under high compression ratios, a token-level iterative compression algorithm to better model the interdependence between compressed contents, and an instruction tuning based method for distribution alignment between language models. We conduct experiments and analysis over four datasets from different scenarios, i.e., GSM8K, BBH, ShareGPT, and Arxiv-March23; showing that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss. Our code is available at https://aka.ms/LLMLingua.
Documents
- alexatm 20b fewshot learning using a largescale multilingual seq2seq model
- genderspecific machine translation with large language models
- towards effective disambiguation for machine translation with large language models
- democratizing llms for lowresource languages by leveraging their english dominant abilities with linguisticallydiverse prompts
- adaptive machine translation with large language models
Abstract
In this work, we demonstrate that multilingual large-scalesequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoisingand Causal Language Modeling (CLM) tasks, are more efficient few-shot learnersthan decoder-only models on various tasks. In particular, we train a 20 billionparameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B)and show that it achieves state-of-the-art (SOTA) performance on 1-shotsummarization tasks, outperforming a much larger 540B PaLM decoder model.AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially forlow-resource languages, across almost all language pairs supported by the model(Arabic, English, French, German, Hindi, Italian, Japanese, Marathi,Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show inzero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2datasets and provides SOTA performance on multilingual tasks such as XNLI,XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling casefor seq2seq models as a powerful alternative to decoder-only models forLarge-scale Language Model (LLM) training.
Abstract
Decoder-only Large Language Models (LLMs) have demonstrated potential inmachine translation (MT), albeit with performance slightly lagging behindtraditional encoder-decoder Neural Machine Translation (NMT) systems. However,LLMs offer a unique advantage: the ability to control the properties of theoutput through prompts. In this study, we harness this flexibility to exploreLLaMa's capability to produce gender-specific translations for languages withgrammatical gender. Our results indicate that LLaMa can generategender-specific translations with competitive accuracy and gender biasmitigation when compared to NLLB, a state-of-the-art multilingual NMT system.Furthermore, our experiments reveal that LLaMa's translations are robust,showing significant performance drops when evaluated against opposite-genderreferences in gender-ambiguous datasets but maintaining consistency in lessambiguous contexts. This research provides insights into the potential andchallenges of using LLMs for gender-specific translations and highlights theimportance of in-context learning to elicit new tasks in LLMs.
Abstract
Resolving semantic ambiguity has long been recognised as a central challengein the field of Machine Translation. Recent work on benchmarking translationperformance on ambiguous sentences has exposed the limitations of conventionalNeural Machine Translation (NMT) systems, which fail to handle many such cases.Large language models (LLMs) have emerged as a promising alternative,demonstrating comparable performance to traditional NMT models whileintroducing new paradigms for controlling the target outputs. In this paper, westudy the capabilities of LLMs to translate "ambiguous sentences" - i.e. thosecontaining highly polysemous words and/or rare word senses. We also propose twoways to improve their disambiguation capabilities, through a) in-contextlearning and b) fine-tuning on carefully curated ambiguous datasets.Experiments show that our methods can match or outperform state-of-the-artsystems such as DeepL and NLLB in four out of five language directions. Ourresearch provides valuable insights into effectively adapting LLMs to becomebetter disambiguators during Machine Translation. We release our curateddisambiguation corpora and resources athttps://data.statmt.org/ambiguous-europarl.
Abstract
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. However, in low-resource languages, obtaining such hand-picked exemplars can still be challenging, where unsupervised techniques may be necessary. Moreover, competent generative capabilities of LLMs are observed only in high-resource languages, while their performances among under-represented languages fall behind due to pre-training data imbalance. To elicit LLMs' ability onto low-resource languages without any supervised data, we propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. These prompts are then used to create intra-lingual exemplars to perform tasks in the target languages. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages. We also show that fine-tuning a 7B model on data generated from our method helps it perform competitively with a 175B model. In non-English translation tasks, our method even outperforms supervised prompting by up to 3 chrF++ in many low-resource languages. When evaluated on zero-shot multilingual summarization, our method surpasses other English-pivoting baselines by up to 4 ROUGE-L and is also favored by GPT-4.
Abstract
Consistency is a key requirement of high-quality translation. It isespecially important to adhere to pre-approved terminology and adapt tocorrected translations in domain-specific projects. Machine translation (MT)has achieved significant progress in the area of domain adaptation. However,real-time adaptation remains challenging. Large-scale language models (LLMs)have recently shown interesting capabilities of in-context learning, where theylearn to replicate certain input-output text generation patterns, withoutfurther fine-tuning. By feeding an LLM at inference time with a prompt thatconsists of a list of translation pairs, it can then simulate the domain andstyle characteristics. This work aims to investigate how we can utilizein-context learning to improve real-time adaptive MT. Our extensive experimentsshow promising results at translation time. For example, LLMs can adapt to aset of in-domain sentence pairs and/or terminology while translating a newsentence. We observe that the translation quality with few-shot in-contextlearning can surpass that of strong encoder-decoder MT systems, especially forhigh-resource languages. Moreover, we investigate whether we can combine MTfrom strong encoder-decoder models with fuzzy matches, which can furtherimprove translation quality, especially for less supported languages. Weconduct our experiments across five diverse language pairs, namelyEnglish-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French(EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
Documents
- sparks of gpts in edge intelligence for metaverse caching and inference for mobile aigc services
- joint foundation model caching and inference of generative ai services for edge intelligence
- chatrec towards interactive and explainable llmsaugmented recommender system
- optimizing mobileedge aigenerated everything (aigx) services by prompt engineering fundamental, framework, and case study
- utilizing language models for energy load forecasting
Abstract
Aiming at achieving artificial general intelligence (AGI) for Metaverse,pretrained foundation models (PFMs), e.g., generative pretrained transformers(GPTs), can effectively provide various AI services, such as autonomousdriving, digital twins, and AI-generated content (AIGC) for extended reality.With the advantages of low latency and privacy-preserving, serving PFMs ofmobile AI services in edge intelligence is a viable solution for caching andexecuting PFMs on edge servers with limited computing resources and GPU memory.However, PFMs typically consist of billions of parameters that are computationand memory-intensive for edge servers during loading and execution. In thisarticle, we investigate edge PFM serving problems for mobile AIGC services ofMetaverse. First, we introduce the fundamentals of PFMs and discuss theircharacteristic fine-tuning and inference methods in edge intelligence. Then, wepropose a novel framework of joint model caching and inference for managingmodels and allocating resources to satisfy users' requests efficiently.Furthermore, considering the in-context learning ability of PFMs, we propose anew metric to evaluate the freshness and relevance between examples indemonstrations and executing tasks, namely the Age of Context (AoC). Finally,we propose a least context algorithm for managing cached models at edge serversby balancing the tradeoff among latency, energy consumption, and accuracy.
Abstract
With the rapid development of artificial general intelligence (AGI), variousmultimedia services based on pretrained foundation models (PFMs) need to beeffectively deployed. With edge servers that have cloud-level computing power,edge intelligence can extend the capabilities of AGI to mobile edge networks.However, compared with cloud data centers, resource-limited edge servers canonly cache and execute a small number of PFMs, which typically consist ofbillions of parameters and require intensive computing power and GPU memoryduring inference. To address this challenge, in this paper, we propose a jointfoundation model caching and inference framework that aims to balance thetradeoff among inference latency, accuracy, and resource consumption bymanaging cached PFMs and user requests efficiently during the provisioning ofgenerative AI services. Specifically, considering the in-context learningability of PFMs, a new metric named the Age of Context (AoC), is proposed tomodel the freshness and relevance between examples in past demonstrations andcurrent service requests. Based on the AoC, we propose a least context cachingalgorithm to manage cached PFMs at edge servers with historical prompts andinference results. The numerical results demonstrate that the proposedalgorithm can reduce system costs compared with existing baselines byeffectively utilizing contextual information.
Abstract
Large language models (LLMs) have demonstrated their significant potential to be applied for addressing various application tasks. However, traditional recommender systems continue to face great challenges such as poor interactivity and explainability, which actually also hinder their broad deployment in real-world systems. To address these limitations, this paper proposes a novel paradigm called Chat-Rec (ChatGPT Augmented Recommender System) that innovatively augments LLMs for building conversational recommender systems by converting user profiles and historical interactions into prompts. Chat-Rec is demonstrated to be effective in learning user preferences and establishing connections between users and products through in-context learning, which also makes the recommendation process more interactive and explainable. What's more, within the Chat-Rec framework, user's preferences can transfer to different products for cross-domain recommendations, and prompt-based injection of information into LLMs can also handle the cold-start scenarios with new items. In our experiments, Chat-Rec effectively improve the results of top-k recommendations and performs better in zero-shot rating prediction task. Chat-Rec offers a novel approach to improving recommender systems and presents new practical scenarios for the implementation of AIGC (AI generated content) in recommender system studies.
Abstract
As the next-generation paradigm for content creation, AI-Generated Content (AIGC), i.e., generating content automatically by Generative AI (GAI) based on user prompts, has gained great attention and success recently. With the ever-increasing power of GAI, especially the emergence of Pretrained Foundation Models (PFMs) that contain billions of parameters and prompt engineering methods (i.e., finding the best prompts for the given task), the application range of AIGC is rapidly expanding, covering various forms of information for human, systems, and networks, such as network designs, channel coding, and optimization solutions. In this article, we present the concept of mobile-edge AI-Generated Everything (AIGX). Specifically, we first review the building blocks of AIGX, the evolution from AIGC to AIGX, as well as practical AIGX applications. Then, we present a unified mobile-edge AIGX framework, which employs edge devices to provide PFM-empowered AIGX services and optimizes such services via prompt engineering. More importantly, we demonstrate that suboptimal prompts lead to poor generation quality, which adversely affects user satisfaction, edge network performance, and resource utilization. Accordingly, we conduct a case study, showcasing how to train an effective prompt optimizer using ChatGPT and investigating how much improvement is possible with prompt engineering in terms of user experience, quality of generation, and network performance.
Abstract
Energy load forecasting plays a crucial role in optimizing resourceallocation and managing energy consumption in buildings and cities. In thispaper, we propose a novel approach that leverages language models for energyload forecasting. We employ prompting techniques to convert energy consumptiondata into descriptive sentences, enabling fine-tuning of language models. Byadopting an autoregressive generating approach, our proposed method enablespredictions of various horizons of future energy load consumption. Throughextensive experiments on real-world datasets, we demonstrate the effectivenessand accuracy of our proposed method. Our results indicate that utilizinglanguage models for energy load forecasting holds promise for enhancing energyefficiency and facilitating intelligent decision-making in energy systems.
Documents
- fewshot incontext learning for knowledge base question answering
- retrieverewriteanswer a kgtotext enhanced llms framework for knowledge graph question answering
- ontologyenhanced prompttuning for fewshot learning
- an incontext schema understanding method for knowledge base question answering
- knowledge crosswords geometric reasoning over structured knowledge with large language models
Abstract
Question answering over knowledge bases is considered a difficult problem dueto the challenge of generalizing to a wide variety of possible natural languagequestions. Additionally, the heterogeneity of knowledge base schema itemsbetween different knowledge bases often necessitates specialized training fordifferent knowledge base question-answering (KBQA) datasets. To handlequestions over diverse KBQA datasets with a unified training-free framework, wepropose KB-BINDER, which for the first time enables few-shot in-contextlearning over KBQA tasks. Firstly, KB-BINDER leverages large language modelslike Codex to generate logical forms as the draft for a specific question byimitating a few demonstrations. Secondly, KB-BINDER grounds on the knowledgebase to bind the generated draft to an executable one with BM25 score matching.The experimental results on four public heterogeneous KBQA datasets show thatKB-BINDER can achieve a strong performance with only a few in-contextdemonstrations. Especially on GraphQA and 3-hop MetaQA, KB-BINDER can evenoutperform the state-of-the-art trained models. On GrailQA and WebQSP, ourmodel is also on par with other fully-trained models. We believe KB-BINDER canserve as an important baseline for future research. Our code is available athttps://github.com/ltl3A87/KB-BINDER.
Abstract
Despite their competitive performance on knowledge-intensive tasks, large language models (LLMs) still have limitations in memorizing all world knowledge especially long tail knowledge. In this paper, we study the KG-augmented language model approach for solving the knowledge graph question answering (KGQA) task that requires rich world knowledge. Existing work has shown that retrieving KG knowledge to enhance LLMs prompting can significantly improve LLMs performance in KGQA. However, their approaches lack a well-formed verbalization of KG knowledge, i.e., they ignore the gap between KG representations and textual representations. To this end, we propose an answer-sensitive KG-to-Text approach that can transform KG knowledge into well-textualized statements most informative for KGQA. Based on this approach, we propose a KG-to-Text enhanced LLMs framework for solving the KGQA task. Experiments on several KGQA benchmarks show that the proposed KG-to-Text augmented LLMs approach outperforms previous KG-augmented LLMs approaches regarding answer accuracy and usefulness of knowledge statements.
Abstract
Few-shot Learning (FSL) is aimed to make predictions based on a limitednumber of samples. Structured data such as knowledge graphs and ontologylibraries has been leveraged to benefit the few-shot setting in various tasks.However, the priors adopted by the existing methods suffer from challengingknowledge missing, knowledge noise, and knowledge heterogeneity, which hinderthe performance for few-shot learning. In this study, we explore knowledgeinjection for FSL with pre-trained language models and proposeontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop theontology transformation based on the external knowledge graph to address theknowledge missing issue, which fulfills and converts structure knowledge totext. We further introduce span-sensitive knowledge injection via a visiblematrix to select informative knowledge to handle the knowledge noise issue. Tobridge the gap between knowledge and text, we propose a collective trainingalgorithm to optimize representations jointly. We evaluate our proposedOntoPrompt in three tasks, including relation extraction, event extraction, andknowledge graph completion, with eight datasets. Experimental resultsdemonstrate that our approach can obtain better few-shot performance thanbaselines.
Abstract
The Knowledge Base Question Answering (KBQA) task aims to answer naturallanguage questions based on a given knowledge base. As a kind of common methodfor this task, semantic parsing-based ones first convert natural languagequestions to logical forms (e.g., SPARQL queries) and then execute them onknowledge bases to get answers. Recently, Large Language Models (LLMs) haveshown strong abilities in language understanding and may be adopted as semanticparsers in such kinds of methods. However, in doing so, a great challenge forLLMs is to understand the schema of knowledge bases. Therefore, in this paper,we propose an In-Context Schema Understanding (ICSU) method for facilitatingLLMs to be used as a semantic parser in KBQA. Specifically, ICSU adopts theIn-context Learning mechanism to instruct LLMs to generate SPARQL queries withexamples. In order to retrieve appropriate examples from annotatedquestion-query pairs, which contain comprehensive schema information related toquestions, ICSU explores four different retrieval strategies. Experimentalresults on the largest KBQA benchmark, KQA Pro, show that ICSU with all thesestrategies outperforms that with a random retrieval strategy significantly(from 12\% to 78.76\% in accuracy).
Abstract
Large language models (LLMs) are widely adopted in knowledge-intensive tasks and have achieved impressive performance thanks to their knowledge abilities. While LLMs have demonstrated outstanding performance on atomic or linear (multi-hop) QA tasks, whether they can reason in knowledge-rich scenarios with interweaving constraints remains an underexplored problem. In this work, we propose geometric reasoning over structured knowledge, where pieces of knowledge are connected in a graph structure and models need to fill in the missing information. Such geometric knowledge reasoning would require the ability to handle structured knowledge, reason with uncertainty, verify facts, and backtrack when an error occurs. We propose Knowledge Crosswords, a multi-blank QA dataset where each problem consists of a natural language question representing the geometric constraints of an incomplete entity network, where LLMs are tasked with working out the missing entities while meeting all factual constraints. Knowledge Crosswords contains 2,101 individual problems, covering various knowledge domains and further divided into three difficulty levels. We conduct extensive experiments to evaluate existing LLM prompting approaches on the Knowledge Crosswords benchmark. We additionally propose two new approaches, Staged Prompting and Verify-All, to augment LLMs' ability to backtrack and verify structured constraints. Our results demonstrate that while baseline approaches perform well on easier problems but struggle with hard ones, our proposed Verify-All outperforms other methods by a large margin and is more robust with hard problems. Further analysis reveals that LLMs' ability of geometric reasoning over structured knowledge is still far from robust or perfect, susceptible to confounders such as the order of options, certain structural patterns, assumption of existence of correct answer, and more.
Documents
- review of large vision models and visual prompt engineering
- prompt engineering for healthcare methodologies and applications
- geotechnical parrot tales (gpt) harnessing large language models in geotechnical engineering
- from web catalogs to google a retrospective study of web search engines sustainable development
- unleashing the potential of prompt engineering in large language models a comprehensive review
Abstract
Visual prompt engineering is a fundamental technology in the field of visual and image Artificial General Intelligence, serving as a key component for achieving zero-shot capabilities. As the development of large vision models progresses, the importance of prompt engineering becomes increasingly evident. Designing suitable prompts for specific visual tasks has emerged as a meaningful research direction. This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering, exploring the latest advancements in visual prompt engineering. We present influential large models in the visual domain and a range of prompt engineering methods employed on these models. It is our hope that this review provides a comprehensive and systematic description of prompt engineering methods based on large visual models, offering valuable insights for future researchers in their exploration of this field.
Abstract
This review will introduce the latest advances in prompt engineering in the field of natural language processing (NLP) for the medical domain. First, we will provide a brief overview of the development of prompt engineering and emphasize its significant contributions to healthcare NLP applications such as question-answering systems, text summarization, and machine translation. With the continuous improvement of general large language models, the importance of prompt engineering in the healthcare domain is becoming increasingly prominent. The aim of this article is to provide useful resources and bridges for healthcare NLP researchers to better explore the application of prompt engineering in this field. We hope that this review can provide new ideas and inspire ample possibilities for research and application in medical NLP.
Abstract
The widespread adoption of large language models (LLMs), such as OpenAI's ChatGPT, could revolutionize various industries, including geotechnical engineering. However, GPT models can sometimes generate plausible-sounding but false outputs, leading to hallucinations. In this article, we discuss the importance of prompt engineering in mitigating these risks and harnessing the full potential of GPT for geotechnical applications. We explore the challenges and pitfalls associated with LLMs and highlight the role of context in ensuring accurate and valuable responses. Furthermore, we examine the development of context-specific search engines and the potential of LLMs to become a natural interface for complex tasks, such as data analysis and design. We also develop a unified interface using natural language to handle complex geotechnical engineering tasks and data analysis. By integrating GPT into geotechnical engineering workflows, professionals can streamline their work and develop sustainable and resilient infrastructure systems for the future.
Abstract
This study presents a review of search engines and search engine optimization and shows how the search engine landscape relates to sustainable development. We have used a narrative review research method and described three main topics: the past and present of web catalogs and search engines; current knowledge about the dominant types of search results presented in Google search; and methods of search engine optimization. Technical elements of important website areas related to technical website auditing are discussed. We summarize our research with several key findings on how web search engines are involved in sustainable development and offer a glimpse into the future use of web searching with the help of artificial intelligence chats and prompt engineering.
Abstract
This paper delves into the pivotal role of prompt engineering in unleashingthe capabilities of Large Language Models (LLMs). Prompt engineering is theprocess of structuring input text for LLMs and is a technique integral tooptimizing the efficacy of LLMs. This survey elucidates foundational principlesof prompt engineering, such as role-prompting, one-shot, and few-shotprompting, as well as more advanced methodologies such as the chain-of-thoughtand tree-of-thoughts prompting. The paper sheds light on how externalassistance in the form of plugins can assist in this task, and reduce machinehallucination by retrieving external knowledge. We subsequently delineateprospective directions in prompt engineering research, emphasizing the need fora deeper understanding of structures and the role of agents in ArtificialIntelligence-Generated Content (AIGC) tools. We discuss how to assess theefficacy of prompt methods from different perspectives and using differentmethods. Finally, we gather information about the application of promptengineering in such fields as education and programming, showing itstransformative potential. This comprehensive survey aims to serve as a friendlyguide for anyone venturing through the big world of LLMs and promptengineering.
Documents
- braininspired globallocal learning incorporated with neuromorphic computing
- right to be forgotten in the era of large language models implications, challenges, and solutions
- menucraft interactive menu system design with large language models
- alphagpt humanai interactive alpha mining for quantitative investment
- large language model for multiobjective evolutionary optimization
Abstract
Two main routes of learning methods exist at present including error-drivenglobal learning and neuroscience-oriented local learning. Integrating them intoone network may provide complementary learning capabilities for versatilelearning scenarios. At the same time, neuromorphic computing holds greatpromise, but still needs plenty of useful algorithms and algorithm-hardwareco-designs for exploiting the advantages. Here, we report a neuromorphic hybridlearning model by introducing a brain-inspired meta-learning paradigm and adifferentiable spiking model incorporating neuronal dynamics and synapticplasticity. It can meta-learn local plasticity and receive top-down supervisioninformation for multiscale synergic learning. We demonstrate the advantages ofthis model in multiple different tasks, including few-shot learning, continuallearning, and fault-tolerance learning in neuromorphic vision sensors. Itachieves significantly higher performance than single-learning methods, andshows promise in empowering neuromorphic applications revolution. We furtherimplemented the hybrid model in the Tianjic neuromorphic platform by exploitingalgorithm-hardware co-designs and proved that the model can fully utilizeneuromorphic many-core architecture to develop hybrid computation paradigm.
Abstract
The Right to be Forgotten (RTBF) was first established as the result of the ruling of Google Spain SL, Google Inc. v AEPD, Mario Costeja Gonz\'alez, and was later included as the Right to Erasure under the General Data Protection Regulation (GDPR) of European Union to allow individuals the right to request personal data be deleted by organizations. Specifically for search engines, individuals can send requests to organizations to exclude their information from the query results. It was a significant emergent right as the result of the evolution of technology. With the recent development of Large Language Models (LLMs) and their use in chatbots, LLM-enabled software systems have become popular. But they are not excluded from the RTBF. Compared with the indexing approach used by search engines, LLMs store, and process information in a completely different way. This poses new challenges for compliance with the RTBF. In this paper, we explore these challenges and provide our insights on how to implement technical solutions for the RTBF, including the use of differential privacy, machine unlearning, model editing, and prompt engineering. With the rapid advancement of AI and the increasing need of regulating this powerful technology, learning from the case of RTBF can provide valuable lessons for technical practitioners, legal experts, organizations, and authorities.
Abstract
Menu system design is a challenging task involving many design options andvarious human factors. For example, one crucial factor that designers need toconsider is the semantic and systematic relation of menu commands. However,capturing these relations can be challenging due to limited availableresources. With the advancement of neural language models, large languagemodels can utilize their vast pre-existing knowledge in designing and refiningmenu systems. In this paper, we propose MenuCraft, an AI-assisted designer formenu design that enables collaboration between the designer and a dialoguesystem to design menus. MenuCraft offers an interactive language-based menudesign tool that simplifies the menu design process and enables easycustomization of design options. MenuCraft supports a variety of interactionsthrough dialog that allows performing zero/few-shot learning.
Abstract
One of the most important tasks in quantitative investment research is miningnew alphas (effective trading signals or factors). Traditional alpha miningmethods, either hand-crafted factor synthesizing or algorithmic factor mining(e.g., search with genetic programming), have inherent limitations, especiallyin implementing the ideas of quants. In this work, we propose a new alphamining paradigm by introducing human-AI interaction, and a novel promptengineering algorithmic framework to implement this paradigm by leveraging thepower of large language models. Moreover, we develop Alpha-GPT, a newinteractive alpha mining system framework that provides a heuristic way to``understand'' the ideas of quant researchers and outputs creative, insightful,and effective alphas. We demonstrate the effectiveness and advantage ofAlpha-GPT via a number of alpha mining experiments.
Abstract
Multiobjective evolutionary algorithms (MOEAs) are major methods for solvingmultiobjective optimization problems (MOPs). Many MOEAs have been proposed inthe past decades, of which the search operators need a carefully handcrafteddesign with domain knowledge. Recently, some attempts have been made to replacethe manually designed operators in MOEAs with learning-based operators (e.g.,neural network models). However, much effort is still required for designingand training such models, and the learned operators might not generalize wellon new problems. To tackle the above challenges, this work investigates a novelapproach that leverages the powerful large language model (LLM) to design MOEAoperators. With proper prompt engineering, we successfully let a general LLMserve as a black-box search operator for decomposition-based MOEA (MOEA/D) in azero-shot manner. In addition, by learning from the LLM behavior, we furtherdesign an explicit white-box operator with randomness and propose a new versionof decomposition-based MOEA, termed MOEA/D-LO. Experimental studies ondifferent test benchmarks show that our proposed method can achieve competitiveperformance with widely used MOEAs. It is also promising to see the operatoronly learned from a few instances can have robust generalization performance onunseen problems with quite different patterns and settings. The results revealthe potential benefits of using pre-trained LLMs in the design of MOEAs.
Documents
- a mlllm pairing for better code comment classification
- development of metaprompts for large language models to screen titles and abstracts for diagnostic test accuracy reviews
- short answer grading using oneshot prompting and text similarity scoring model
- stance detection with supervised, zeroshot, and fewshot applications
- benchmarking a foundation llm on its ability to relabel structure names in accordance with the aapm tg263 report
Abstract
The "Information Retrieval in Software Engineering (IRSE)" at FIRE 2023shared task introduces code comment classification, a challenging task thatpairs a code snippet with a comment that should be evaluated as either usefulor not useful to the understanding of the relevant code. We answer the codecomment classification shared task challenge by providing a two-foldevaluation: from an algorithmic perspective, we compare the performance ofclassical machine learning systems and complement our evaluations from adata-driven perspective by generating additional data with the help of largelanguage model (LLM) prompting to measure the potential increase inperformance. Our best model, which took second place in the shared task, is aNeural Network with a Macro-F1 score of 88.401% on the provided seed data and a1.5% overall increase in performance on the data generated by the LLM.
Abstract
Systematic reviews (SRs) are a critical component of evidence-based medicine, but the process of screening titles and abstracts is time-consuming. This study aimed to develop and externally validate a method using large language models to classify abstracts for diagnostic test accuracy (DTA) systematic reviews, thereby reducing the human workload. We used a previously collected dataset for developing DTA abstract classifiers and applied prompt engineering. We developed an optimized meta-prompt for Generative Pre-trained Transformer (GPT)-3.5-turbo and GPT-4 to classify abstracts. In the external validation dataset 1, the prompt with GPT-3.5 turbo showed a sensitivity of 0.988, and a specificity of 0.298. GPT-4 showed a sensitivity of 0.982, and a specificity of 0.677. In the external validation dataset 2, GPT-3.5 turbo showed a sensitivity of 0.919, and a specificity of 0.434. GPT-4 showed a sensitivity of 0.806, and a specificity of 0.740. If we included eligible studies from among the references of the identified studies, GPT-3.5 turbo had no critical misses, while GPT-4 had some misses. Our study indicates that GPT-3.5 turbo can be effectively used to classify abstracts for DTA systematic reviews. Further studies using other dataset are warranted to confirm our results. Additionally, we encourage the use of our framework and publicly available dataset for further exploration of more effective classifiers using other LLMs and prompts (https://github.com/youkiti/ARE/).
Abstract
In this study, we developed an automated short answer grading (ASAG) model that provided both analytic scores and final holistic scores. Short answer items typically consist of multiple sub-questions, and providing an analytic score and the text span relevant to each sub-question can increase the interpretability of the automated scores. Furthermore, they can be used to generate actionable feedback for students. Despite these advantages, most studies have focused on predicting only holistic scores due to the difficulty in constructing dataset with manual annotations. To address this difficulty, we used large language model (LLM)-based one-shot prompting and a text similarity scoring model with domain adaptation using small manually annotated dataset. The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71 on a subset of the publicly available ASAG dataset. The model achieved a substantial improvement over the majority baseline.
Abstract
Stance detection is the identification of an author's beliefs about a subjectfrom a document. Researchers widely rely on sentiment analysis to accomplishthis. However, recent research has show that sentiment analysis is only looselycorrelated with stance, if at all. This paper advances methods in text analysisby precisely defining the task of stance detection, providing a generalizedframework for the task, and then presenting three distinct approaches forperforming stance detection: supervised classification, zero-shotclassification with NLI classifiers, and in-context learning. In doing so, Idemonstrate how zero-shot and few-shot language classifiers can replace humanlabelers for a variety of tasks and discuss how their application andlimitations differ from supervised classifiers. Finally, I demonstrate anapplication of zero-shot stance detection by replicating Block Jr et al.(2022).
Abstract
Purpose: To introduce the concept of using large language models (LLMs) tore-label structure names in accordance with the American Association ofPhysicists in Medicine (AAPM) Task Group (TG)-263 standard, and to establish abenchmark for future studies to reference. Methods and Materials: The Generative Pre-trained Transformer (GPT)-4application programming interface (API) was implemented as a Digital Imagingand Communications in Medicine (DICOM) storage server, which upon receiving astructure set DICOM file, prompts GPT-4 to re-label the structure names of bothtarget volumes and normal tissues according to the AAPM TG-263. Three diseasesites, prostate, head and neck, and thorax were selected for evaluation. Foreach disease site category, 150 patients were randomly selected for manuallytuning the instructions prompt (in batches of 50) and 50 patients were randomlyselected for evaluation. Structure names that were considered were those thatwere most likely to be relevant for studies utilizing structure contours formany patients. Results: The overall re-labeling accuracy of both target volumes and normaltissues for prostate, head and neck, and thorax cases was 96.0%, 98.5%, and96.9% respectively. Re-labeling of target volumes was less accurate on averageexcept for prostate - 100%, 93.1%, and 91.1% respectively. Conclusions: Given the accuracy of GPT-4 in re-labeling structure names ofboth target volumes and normal tissues as presented in this work, LLMs arepoised to be the preferred method for standardizing structure names inradiation oncology, especially considering the rapid advancements in LLMcapabilities that are likely to continue.
Documents
- rethink the effectiveness of text data augmentation an empirical analysis
- can language models be biomedical knowledge bases
- adaprompt adaptive model training for promptbased nlp
- can large language models truly understand prompts a case study with negated prompts
- does correction remain a problem for large language models
Abstract
In recent years, language models (LMs) have made remarkable progress inadvancing the field of natural language processing (NLP). However, the impactof data augmentation (DA) techniques on the fine-tuning (FT) performance ofthese LMs has been a topic of ongoing debate. In this study, we evaluate theeffectiveness of three different FT methods in conjugation withback-translation across an array of 7 diverse NLP tasks, includingclassification and regression types, covering single-sentence and sentence-pairtasks. Contrary to prior assumptions that DA does not contribute to theenhancement of LMs' FT performance, our findings reveal that continuedpre-training on augmented data can effectively improve the FT performance ofthe downstream tasks. In the most favourable case, continued pre-trainingimproves the performance of FT by more than 10% in the few-shot learningsetting. Our finding highlights the potential of DA as a powerful tool forbolstering LMs' performance.
Abstract
Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.
Abstract
Prompt-based learning, with its capability to tackle zero-shot and few-shot NLP tasks, has gained much attention in community. The main idea is to bridge the gap between NLP downstream tasks and language modeling (LM), by mapping these tasks into natural language prompts, which are then filled by pre-trained language models (PLMs). However, for prompt learning, there are still two salient gaps between NLP tasks and pretraining. First, prompt information is not necessarily sufficiently present during LM pretraining. Second, task-specific data are not necessarily well represented during pretraining. We address these two issues by proposing AdaPrompt, adaptively retrieving external data for continual pretraining of PLMs by making use of both task and prompt characteristics. In addition, we make use of knowledge in Natural Language Inference models for deriving adaptive verbalizers. Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings. In addition, in zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35\% relative error reduction.
Abstract
Previous work has shown that there exists a scaling law between the size of Language Models (LMs) and their zero-shot performance on different downstream NLP tasks. In this work, we show that this phenomenon does not hold when evaluating large LMs on tasks with negated prompts, but instead shows an inverse scaling law. We evaluate 9 different tasks with negated prompts on (1) pretrained LMs (OPT&GPT-3) of varying sizes (125M - 175B), (2) LMs further pretrained to generalize to novel prompts (InstructGPT), (3) LMs provided with few-shot examples, and (4) LMs fine-tuned specifically on negated prompts; all LM types perform worse on negated prompts as they scale and show a huge performance gap between the human performance when comparing the average score on both original and negated prompts. By highlighting a critical limitation of existing LMs and methods, we urge the community to develop new approaches of developing LMs that actually follow the given instructions. We provide the code and the datasets to explore negated prompts at https://github.com/joeljang/negated-prompts-for-llms
Abstract
As large language models, such as GPT, continue to advance the capabilitiesof natural language processing (NLP), the question arises: does the problem ofcorrection still persist? This paper investigates the role of correction in thecontext of large language models by conducting two experiments. The firstexperiment focuses on correction as a standalone task, employing few-shotlearning techniques with GPT-like models for error correction. The secondexperiment explores the notion of correction as a preparatory task for otherNLP tasks, examining whether large language models can tolerate and performadequately on texts containing certain levels of noise or errors. By addressingthese experiments, we aim to shed light on the significance of correction inthe era of large language models and its implications for various NLPapplications.
Documents
- visorgpt learning visual prior via generative pretraining
- qaclims questionanswer cross language image matching for weakly supervised semantic segmentation
- visionlanguage pretraining basics, recent advances, and future trends
- multimodal classifiers for openvocabulary object detection
- a survey on segment anything model (sam) vision foundation model meets prompt engineering
Abstract
Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts many vision tasks. For example, in conditional image synthesis, spatial conditions failing to adhere to the prior can result in visually inaccurate synthetic results. This work aims to explicitly learn the visual prior and enable the customization of sampling. Inspired by advances in language modeling, we propose to learn Visual prior via Generative Pre-Training, dubbed VisorGPT. By discretizing visual locations of objects, e.g., bounding boxes, human pose, and instance masks, into sequences, VisorGPT can model visual prior through likelihood maximization. Besides, prompt engineering is investigated to unify various visual locations and enable customized sampling of sequential outputs from the learned prior. Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet. Code will be released at https://github.com/Sierkinhane/VisorGPT.
Abstract
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation (WSSS), allowing the localization of object regions in an image using only image-level labels. However, existing CAM methods suffer from under-activation of target object regions and false-activation of background regions due to the fact that a lack of detailed supervision can hinder the model's ability to understand the image as a whole. In this paper, we propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS), leveraging the vision-language foundation model to maximize the text-based understanding of images and guide the generation of activation maps. First, a series of carefully designed questions are posed to the VQA (Visual Question Answering) model with Question-Answer Prompt Engineering (QAPE) to generate a corpus of both foreground target objects and backgrounds that are adaptive to query images. We then employ contrastive learning in a Region Image Text Contrastive (RITC) network to compare the obtained foreground and background regions with the generated corpus. Our approach exploits the rich textual information from the open vocabulary as additional supervision, enabling the model to generate high-quality CAMs with a more complete object region and reduce false-activation of background regions. We conduct extensive analysis to validate the proposed method and show that our approach performs state-of-the-art on both PASCAL VOC 2012 and MS COCO datasets.
Abstract
This paper surveys vision-language pre-training (VLP) methods for multimodalintelligence that have been developed in the last few years. We group theseapproaches into three categories: ($i$) VLP for image-text tasks, such as imagecaptioning, image-text retrieval, visual question answering, and visualgrounding; ($ii$) VLP for core computer vision tasks, such as (open-set) imageclassification, object detection, and segmentation; and ($iii$) VLP forvideo-text tasks, such as video captioning, video-text retrieval, and videoquestion answering. For each category, we present a comprehensive review ofstate-of-the-art methods, and discuss the progress that has been made andchallenges still being faced, using specific systems and models as casestudies. In addition, for each category, we discuss advanced topics beingactively explored in the research community, such as big foundation models,unified modeling, in-context few-shot learning, knowledge, robustness, andcomputer vision in the wild, to name a few.
Abstract
The goal of this paper is open-vocabulary object detection (OVOD) $\unicode{x2013}$ building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) using multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
Abstract
Segment anything model (SAM) developed by Meta AI Research has recently attracted significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is capable of segmenting any object on a certain image. In the original SAM work, the authors turned to zero-short transfer tasks (like edge detection) for evaluating the performance of SAM. Recently, numerous works have attempted to investigate the performance of SAM in various scenarios to recognize and segment objects. Moreover, numerous projects have emerged to show the versatility of SAM as a foundation model by combining it with other models, like Grounding DINO, Stable Diffusion, ChatGPT, etc. With the relevant papers and projects increasing exponentially, it is challenging for the readers to catch up with the development of SAM. To this end, this work conducts the first yet comprehensive survey on SAM. This is an ongoing project and we intend to update the manuscript on a regular basis. Therefore, readers are welcome to contact us if they complete new works related to SAM so that we can include them in our next version.
Documents
- multitask pretraining of modular prompt for chinese fewshot learning
- list lite prompted selftraining makes parameterefficient fewshot learners
- prompting electra fewshot learning with discriminative pretrained models
- stt soft template tuning for fewshot adaptation
- attempt parameterefficient multitask tuning via attentional mixtures of soft prompts
Abstract
Prompt tuning is a parameter-efficient approach to adapting pre-trainedlanguage models to downstream tasks. Although prompt tuning has been shown tomatch the performance of full model tuning when training data is sufficient, ittends to struggle in few-shot learning settings. In this paper, we presentMulti-task Pre-trained Modular Prompt (MP2) to boost prompt tuning for few-shotlearning. MP2 is a set of combinable prompts pre-trained on 38 Chinese tasks.On downstream tasks, the pre-trained prompts are selectively activated andcombined, leading to strong compositional generalization to unseen tasks. Tobridge the gap between pre-training and fine-tuning, we formulate upstream anddownstream tasks into a unified machine reading comprehension task. Extensiveexperiments under two learning paradigms, i.e., gradient descent and black-boxtuning, show that MP2 significantly outperforms prompt tuning, full modeltuning, and prior prompt pre-training methods in few-shot settings. Inaddition, we demonstrate that MP2 can achieve surprisingly fast and strongadaptation to downstream tasks by merely learning 8 parameters to combine thepre-trained modular prompts.
Abstract
We present a new method LiST is short for Lite Prompted Self-Training forparameter-efficient fine-tuning of large pre-trained language models (PLMs) forfew-shot learning. LiST improves over recent methods that adopt prompt-basedfine-tuning (FN) using two key techniques. The first is the use ofself-training to leverage large amounts of unlabeled data for prompt-based FNin few-shot settings. We use self-training in conjunction with meta-learningfor re-weighting noisy pseudo-prompt labels. Self-training is expensive as itrequires updating all the model parameters repetitively. Therefore, we use asecond technique for light-weight fine-tuning where we introduce a small numberof task-specific parameters that are fine-tuned during self-training whilekeeping the PLM encoder frozen. Our experiments show that LiST can effectivelyleverage unlabeled data to improve the model performance for few-shot learning.Additionally, the fine-tuning is efficient as it only updates a smallpercentage of parameters and the overall model footprint is reduced sinceseveral tasks can share a common PLM encoder as backbone. A comprehensive studyon six NLU tasks demonstrate LiST to improve by 35% over classic fine-tuningand 6% over prompt-based FN with 96% reduction in number of trainableparameters when fine-tuned with no more than 30 labeled examples from eachtask. With only 14M tunable parameters, LiST outperforms GPT-3 in-contextlearning by 33% on few-shot NLU tasks.
Abstract
Pre-trained masked language models successfully perform few-shot learning byformulating downstream tasks as text infilling. However, as a strongalternative in full-shot settings, discriminative pre-trained models likeELECTRA do not fit into the paradigm. In this work, we adapt prompt-basedfew-shot learning to ELECTRA and show that it outperforms masked languagemodels in a wide range of tasks. ELECTRA is pre-trained to distinguish if atoken is generated or original. We naturally extend that to prompt-basedfew-shot learning by training to score the originality of the target optionswithout introducing new parameters. Our method can be easily adapted to tasksinvolving multi-token predictions without extra computation overhead. Analysisshows that ELECTRA learns distributions that align better with downstreamtasks.
Abstract
Prompt tuning has been an extremely effective tool to adapt a pre-trained model to downstream tasks. However, standard prompt-based methods mainly consider the case of sufficient data of downstream tasks. It is still unclear whether the advantage can be transferred to the few-shot regime, where only limited data are available for each downstream task. Although some works have demonstrated the potential of prompt-tuning under the few-shot setting, the main stream methods via searching discrete prompts or tuning soft prompts with limited data are still very challenging. Through extensive empirical studies, we find that there is still a gap between prompt tuning and fully fine-tuning for few-shot learning. To bridge the gap, we propose a new prompt-tuning framework, called Soft Template Tuning (STT) 1. STT combines manual and auto prompts, and treats down-stream classification tasks as a masked language modeling task. Comprehensive evaluation on different settings suggests STT can close the gap between fine-tuning and prompt-based methods without introducing additional parameters. Significantly, it can even outperform the time- and resource-consuming fine-tuning method on sentiment classification tasks.
Abstract
This work introduces a new multi-task, parameter-efficient language model(LM) tuning method that learns to transfer knowledge across different tasks viaa mixture of soft prompts-small prefix embedding vectors pre-trained fordifferent tasks. Our method, called ATTEMPT (ATTEntional Mixtures of PromptTuning), obtains source prompts as encodings of large-scale source tasks into asmall number of parameters and trains an attention module to interpolate thesource prompts and a newly initialized target prompt for every instance in thetarget task. During training, only the target task prompt and the attentionweights, which are shared between tasks in multi-task training, are updated,while the original LM and source prompts are intact. ATTEMPT is highlyparameter-efficient (e.g., updates 2,300 times fewer parameters than fullfine-tuning) while achieving high task performance using knowledge fromhigh-resource tasks. Moreover, it is modular using pre-trained soft prompts,and can flexibly add or remove source prompts for effective knowledge transfer.Our experimental results across 21 diverse NLP datasets show that ATTEMPTsignificantly outperforms prompt tuning and outperforms or matches fullyfine-tuned or other parameter-efficient tuning approaches that use over tentimes more parameters. Finally, ATTEMPT outperforms previous work in few-shotlearning settings.
Documents
- factchecking complex claims with programguided reasoning
- legoprover neural theorem proving with growing libraries
- the student becomes the master matching gpt3 on scientific factual error correction
- lmcanvas objectoriented interaction to personalize large language modelpowered writing environments
- abscribe rapid exploration of multiple writing variations in humanai cowriting tasks using large language models
Abstract
Fact-checking real-world claims often requires collecting multiple pieces ofevidence and applying complex multi-step reasoning. In this paper, we presentProgram-Guided Fact-Checking (ProgramFC), a novel fact-checking model thatdecomposes complex claims into simpler sub-tasks that can be solved using ashared library of specialized functions. We first leverage the in-contextlearning ability of large language models to generate reasoning programs toguide the verification process. Afterward, we execute the program by delegatingeach sub-task to the corresponding sub-task handler. This process makes ourmodel both explanatory and data-efficient, providing clear explanations of itsreasoning process and requiring minimal training data. We evaluate ProgramFC ontwo challenging fact-checking datasets and show that it outperforms sevenfact-checking baselines across different settings of evidence availability,with explicit output programs that benefit human debugging. Our codes and dataare publicly available at https://github.com/mbzuai-nlp/ProgramFC.
Abstract
Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during the whole theorem proving process. However, as we all know, creating new useful theorems or even new theories is not only helpful but crucial and necessary for advancing mathematics and proving harder and deeper results. In this work, we present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving. By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process. These skills are further evolved (by prompting an LLM) to enrich the library on another scale. Modular and reusable skills are constantly added to the library to enable tackling increasingly intricate mathematical problems. Moreover, the learned library further bridges the gap between human proofs and formal proofs by making it easier to impute missing steps. LEGO-Prover advances the state-of-the-art pass rate on miniF2F-valid (48.0% to 57.0%) and miniF2F-test (45.5% to 47.1%). During the proving process, LEGO-Prover also manages to generate over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%. We also release our code and all the generated skills.
Abstract
Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work, we introduce a claim correction system that makes no domain assumptions and does not require a verifier but is able to outperform existing methods by an order of magnitude — achieving 94% correction accuracy on the SciFact dataset, and 62.5% on the SciFact-Open dataset, compared to the next best meth-ods 0.5% and 1.50% respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method is competitive with the very LLM that was used to generate the annotated dataset — with GPT3.5 achieving 89.5% and 60% correction accuracy on SciFact and SciFact-Open, despite using 1250 times as many parameters as our model.
Abstract
Large language models (LLMs) can enhance writing by automating or supporting specific tasks in writers' workflows (e.g., paraphrasing, creating analogies). Leveraging this capability, a collection of interfaces have been developed that provide LLM-powered tools for specific writing tasks. However, these interfaces provide limited support for writers to create personal tools for their own unique tasks, and may not comprehensively fulfill a writer's needs -- requiring them to continuously switch between interfaces during writing. In this work, we envision LMCanvas, an interface that enables writers to create their own LLM-powered writing tools and arrange their personal writing environment by interacting with"blocks"in a canvas. In this interface, users can create text blocks to encapsulate writing and LLM prompts, model blocks for model parameter configurations, and connect these to create pipeline blocks that output generations. In this workshop paper, we discuss the design for LMCanvas and our plans to develop this concept.
Abstract
Exploring alternative ideas by rewriting text is integral to the writing process. State-of-the-art large language models (LLMs) can simplify writing variation generation. However, current interfaces pose challenges for simultaneous consideration of multiple variations: creating new versions without overwriting text can be difficult, and pasting them sequentially can clutter documents, increasing workload and disrupting writers' flow. To tackle this, we present ABScribe, an interface that supports rapid, yet visually structured, exploration of writing variations in human-AI co-writing tasks. With ABScribe, users can swiftly produce multiple variations using LLM prompts, which are auto-converted into reusable buttons. Variations are stored adjacently within text segments for rapid in-place comparisons using mouse-over interactions on a context toolbar. Our user study with 12 writers shows that ABScribe significantly reduces task workload (d = 1.20, p<0.001), enhances user perceptions of the revision process (d = 2.41, p<0.001) compared to a popular baseline workflow, and provides insights into how writers explore variations using LLMs.
Documents
- incontext learning learns label relationships but is not conventional learning
- ambiguityaware incontext learning with large language models
- exploring demonstration ensembling for incontext learning
- incontext example selection with influences
- privacypreserving incontext learning for large language models
Abstract
The predictions of Large Language Models (LLMs) on downstream tasks oftenimprove significantly when including examples of the input--label relationshipin the context. However, there is currently no consensus about how thisin-context learning (ICL) ability of LLMs works. For example, while Xie et al.(2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022)argue ICL does not even learn label relationships from in-context examples. Inthis paper, we provide novel insights into how ICL leverages label information,revealing both capabilities and limitations. To ensure we obtain acomprehensive picture of ICL behavior, we study probabilistic aspects of ICLpredictions and thoroughly examine the dynamics of ICL as more examples areprovided. Our experiments show that ICL predictions almost always depend onin-context labels, and that ICL can learn truly novel tasks in-context.However, we also find that ICL struggles to fully overcome predictionpreferences acquired from pre-training data, and, further, that ICL does notconsider all in-context information equally.
Abstract
In-context learning (ICL) i.e. showing LLMs only a few task-specificdemonstrations has led to downstream gains with no task-specific fine-tuningrequired. However, LLMs are sensitive to the choice of prompts, and therefore acrucial research question is how to select good demonstrations for ICL. Oneeffective strategy is leveraging semantic similarity between the ICLdemonstrations and test inputs by using a text retriever, which however issub-optimal as that does not consider the LLM's existing knowledge about thattask. From prior work (Min et al., 2022), we already know that labels pairedwith the demonstrations bias the model predictions. This leads us to ourhypothesis whether considering LLM's existing knowledge about the task,especially with respect to the output label space can help in a betterdemonstration selection strategy. Through extensive experimentation on threetext classification tasks, we find that it is beneficial to not only choosesemantically similar ICL demonstrations but also to choose those demonstrationsthat help resolve the inherent label ambiguity surrounding the test example.Interestingly, we find that including demonstrations that the LLM previouslymis-classified and also fall on the test example's decision boundary, bringsthe most performance gain.
Abstract
In-context learning (ICL) operates by showing language models (LMs) examplesof input-output pairs for a given task, i.e., demonstrations. The standardapproach for ICL is to prompt the LM with concatenated demonstrations followedby the test input. This approach suffers from some issues. First, concatenationoffers almost no control over the contribution of each demo to the modelprediction. This can be sub-optimal when some demonstrations are irrelevant tothe test example. Second, due to the input length limit of some transformermodels, it might be infeasible to fit many examples into the context,especially when dealing with long-input tasks. In this work, we exploreDemonstration Ensembling (DENSE) as an alternative to simple concatenation.DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations andthen combines the output probabilities resulting from each subset to producethe final prediction. We study different ensembling methods using GPT-j andexperiment on 12 language tasks. Our experiments show weighted max ensemblingto outperform vanilla concatenation by as large as 2.4 average points. Codeavailable at https://github.com/mukhal/icl-ensembling.
Abstract
In-context learning (ICL) is a powerful paradigm emerged from large languagemodels (LLMs). Despite its promises, ICL performance is known to be highlysensitive to input examples. In this work, we use $\textit{in-contextinfluences}$ to analyze few-shot ICL performance directly from the in-contextexamples. Our proposed influence-based example selection method can identifyboth positive and negative examples, outperforming several baselines whenevaluated on 9 SuperGLUE tasks. Our analysis uncovers up to a $16.3\%$performance gap between using the most negative in-context examples compared tothe most positive. In a case study, we apply our influence-based framework toquantify the phenomena of recency bias in example ordering for few-shot ICL.
Abstract
In-context learning (ICL) is an important capability of Large Language Models(LLMs), enabling these models to dynamically adapt based on specific,in-context exemplars, thereby improving accuracy and relevance. However, LLM'sresponses may leak the sensitive private information contained in in-contextexemplars. To address this challenge, we propose Differentially PrivateIn-context Learning (DP-ICL), a general paradigm for privatizing ICL tasks. Thekey idea for DP-ICL paradigm is generating differentially private responsesthrough a noisy consensus among an ensemble of LLM's responses based ondisjoint exemplar sets. Based on the general paradigm of DP-ICL, we instantiateseveral techniques showing how to privatize ICL for text classification andlanguage generation. We evaluate DP-ICL on four text classification benchmarksand two language generation tasks, and our empirical results show that DP-ICLachieves a strong utility-privacy tradeoff.
Documents
- citeprompt using prompts to identify citation intent in scientific papers
- large language models can share images, too!
- clip model is an efficient continual learner
- stprompt semanticguided and taskdriven prompts for effective fewshot classification
- zeroshot and fewshot learning for lung cancer multilabel classification using vision transformer
Abstract
Citations in scientific papers not only help us trace the intellectual lineage but also are a useful indicator of the scientific significance of the work. Citation intents prove beneficial as they specify the role of the citation in a given context. We present a tool Citeprompt which uses the hitherto unexplored approach of prompt learning for citation intent classification. We argue that with the proper choice of the pretrained language model, the prompt template, and the prompt verbalizer, we can not only get results that are better than or comparable to those obtained with the state-of-the-art methods but also do it with much less exterior information about the scientific document. We report state-of-the-art results on the ACL-ARC dataset, and also show significant improvement on the SciCite dataset over all baseline models except one. As suitably large labelled datasets for citation intent classification can be quite hard to find, in a first, we propose the conversion of this task to the few-shot and zero-shot settings. For the ACL-ARC dataset, we report a 53.86% F1 score for the zero-shot setting, which improves to 63.61% and 66.99% for the 5-shot and 10-shot settings respectively.
Abstract
This paper explores the image-sharing capability of Large Language Models(LLMs), such as InstructGPT, ChatGPT, and GPT-4, in a zero-shot setting,without the help of visual foundation models. Inspired by the two-stage processof image-sharing in human dialogues, we propose a two-stage framework thatallows LLMs to predict potential image-sharing turns and generate related imagedescriptions using our effective restriction-based prompt template. Withextensive experiments, we unlock the \textit{image-sharing} capability of LLMsin zero-shot prompting, with GPT-4 achieving the best performance.Additionally, we uncover the emergent \textit{image-sharing} ability inzero-shot prompting, demonstrating the effectiveness of restriction-basedprompts in both stages of our framework. Based on this framework, we augmentthe PhotoChat dataset with images generated by Stable Diffusion at predictedturns, namely PhotoChat++. To our knowledge, this is the first study to assessthe image-sharing ability of LLMs in a zero-shot setting without visualfoundation models. The source code and the dataset will be released afterpublication.
Abstract
The continual learning setting aims to learn new tasks over time without forgetting the previous ones. The literature reports several significant efforts to tackle this problem with limited or no access to previous task data. Among such efforts, typical solutions offer sophisticated techniques involving memory replay, knowledge distillation, model regularization, and dynamic network expansion. The resulting methods have a retraining cost at each learning task, dedicated memory requirements, and setting-specific design choices. In this work, we show that a frozen CLIP (Contrastive Language-Image Pretraining) model offers as-tounding continual learning performance without any fine-tuning (zero-shot eval-uation). We evaluate CLIP under a variety of settings including class-incremental, domain-incremental and task-agnostic incremental learning on five popular benchmarks (ImageNet-100 & 1K, CORe50, CIFAR-100, and TinyImageNet). Without any bells and whistles, the CLIP model outperforms the state-of-the-art continual learning approaches in majority of the settings. We show the effect on CLIP model’s performance by varying text inputs with simple prompt templates. To the best of our knowledge, this is the first work to report the CLIP zero-shot performance in a continual setting. We advocate the use of this strong yet embarrass-ingly simple baseline for future comparisons in the continual learning tasks. Code is available at https://github.com/vgthengane/Continual-CLIP .
Abstract
The effectiveness of prompt learning has been demonstrated in differentpre-trained language models. By formulating suitable template and choosingrepresentative label mapping, prompt learning can be used as an efficientknowledge probe. However, finding suitable prompt in existing methods requiresmultiple experimental attempts or appropriate vector initialization onformulating suitable template and choosing representative label mapping, whichit is more common in few-shot learning tasks. Motivating by PLM workingprocess, we try to construct the prompt from task semantic perspective and thuspropose the STPrompt -Semantic-guided and Task-driven Prompt model.Specifically, two novel prompts generated from the semantic dependency tree(Dep-prompt) and task-specific metadata description (Meta-prompt), are firstlyconstructed in a prompt augmented pool, and the proposed model wouldautomatically select a suitable semantic prompt to motivating the promptlearning process. Our results show that the proposed model achieves thestate-of-the-art performance in five different datasets of few-shot textclassification tasks, which prove that more semantic and significant promptscould assume as a better knowledge proving tool.
Abstract
Lung cancer is the leading cause of cancer-related death worldwide. Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the most common histologic subtypes of non-small-cell lung cancer (NSCLC). Histology is an essential tool for lung cancer diagnosis. Pathologists make classifications according to the dominant subtypes. Although morphology remains the standard for diagnosis, significant tool needs to be developed to elucidate the diagnosis. In our study, we utilize the pre-trained Vision Transformer (ViT) model to classify multiple label lung cancer on histologic slices (from dataset LC25000), in both Zero-Shot and Few-Shot settings. Then we compare the performance of Zero-Shot and Few-Shot ViT on accuracy, precision, recall, sensitivity and specificity. Our study show that the pre-trained ViT model has a good performance in Zero-Shot setting, a competitive accuracy ($99.87\%$) in Few-Shot setting ({epoch = 1}) and an optimal result ($100.00\%$ on both validation set and test set) in Few-Shot seeting ({epoch = 5}).
Documents
- continuous prompt tuning based textual entailment model for ecommerce entity typing
- chemical identification and indexing in pubmed articles via bert and texttotext approaches
- promptner prompt locating and typing for named entity recognition
- learning incontext learning for named entity recognition
- inverse is better! fast and accurate prompt for fewshot slot tagging
Abstract
The explosion of e-commerce has caused the need for processing and analysis of product titles, like entity typing in product titles. However, the rapid activity in e-commerce has led to the rapid emergence of new entities, which is difficult for general entity typing. Besides, product titles in e-commerce have very different language styles from text data in general domain. In order to handle new entities in product titles and address the special language styles of product titles in e-commerce domain, we propose our textual entailment model with continuous prompt tuning based hypotheses and fusion embeddings for e-commerce entity typing. First, we reformulate entity typing into a textual entailment problem to handle new entities that are not present during training. Second, we design a model to automatically generate textual entailment hypotheses using a continuous prompt tuning method, which can generate better textual entailment hypotheses without manual design. Third, we utilize the fusion embeddings of BERT embedding and Char-acterBERT embedding to solve the problem that the language styles of product titles in e-commerce are different from that of general domain. To analyze the effect of each contribution, we compare the performance of entity typing and textual entailment model, and conduct ablation studies on continuous prompt tuning and fusion embeddings. We also evaluate the impact of different prompt template initialization for the continuous prompt tuning. We show our proposed model improves the average F1 score by around 2% compared to the baseline BERT entity typing model.
Abstract
The Biocreative VII Track-2 challenge consists of named entity recognition,entity-linking (or entity-normalization), and topic indexing tasks -- withentities and topics limited to chemicals for this challenge. Named entityrecognition is a well-established problem and we achieve our best performancewith BERT-based BioMegatron models. We extend our BERT-based approach to theentity linking task. After the second stage of pretraining BioBERT with ametric-learning loss strategy called self-alignment pretraining (SAP), we linkentities based on the cosine similarity between their SAP-BioBERT wordembeddings. Despite the success of our named entity recognition experiments, wefind the chemical indexing task generally more challenging. In addition to conventional NER methods, we attempt both named entityrecognition and entity linking with a novel text-to-text or "prompt" basedmethod that uses generative language models such as T5 and GPT. We achieveencouraging results with this new approach.
Abstract
Prompt learning is a new paradigm for utilizing pre-trained language models and has achieved great success in many tasks. To adopt prompt learning in the NER task, two kinds of methods have been explored from a pair of symmetric perspectives, populating the template by enumerating spans to predict their entity types or constructing type-specific prompts to locate entities. However, these methods not only require a multi-round prompting manner with a high time overhead and computational cost, but also require elaborate prompt templates, that are difficult to apply in practical scenarios. In this paper, we unify entity locating and entity typing into prompt learning, and design a dual-slot multi-prompt template with the position slot and type slot to prompt locating and typing respectively. Multiple prompts can be input to the model simultaneously, and then the model extracts all entities by parallel predictions on the slots. To assign labels for the slots during training, we design a dynamic template filling mechanism that uses the extended bipartite graph matching between prompts and the ground-truth entities. We conduct experiments in various settings, including resource-rich flat and nested NER datasets and low-resource in-domain and cross-domain datasets. Experimental results show that the proposed model achieves a significant performance improvement, especially in the cross-domain few-shot setting, which outperforms the state-of-the-art model by +7.7% on average.
Abstract
Named entity recognition in real-world applications suffers from thediversity of entity types, the emergence of new entity types, and the lack ofhigh-quality annotations. To address the above problems, this paper proposes anin-context learning-based NER approach, which can effectively inject in-contextNER ability into PLMs and recognize entities of novel types on-the-fly usingonly a few demonstrative instances. Specifically, we model PLMs as ameta-function $\mathcal{ \lambda_ {\text{instruction, demonstrations, text}}.M}$, and a new entity extractor can be implicitly constructed by applying newinstruction and demonstrations to PLMs, i.e., $\mathcal{ (\lambda . M)}$(instruction, demonstrations) $\to$ $\mathcal{F}$ where $\mathcal{F}$ will bea new entity extractor, i.e., $\mathcal{F}$: text $\to$ entities. To inject theabove in-context NER ability into PLMs, we propose a meta-function pre-trainingalgorithm, which pre-trains PLMs by comparing the (instruction,demonstration)-initialized extractor with a surrogate golden extractor.Experimental results on 4 few-shot NER datasets show that our method caneffectively inject in-context NER ability into PLMs and significantlyoutperforms the PLMs+fine-tuning counterparts.
Abstract
Prompting methods recently achieve impressive success in few-shot learning.These methods modify input samples with prompt sentence pieces, and decodelabel tokens to map samples to corresponding labels. However, such a paradigmis very inefficient for the task of slot tagging. Since slot tagging samplesare multiple consecutive words in a sentence, the prompting methods have toenumerate all n-grams token spans to find all the possible slots, which greatlyslows down the prediction. To tackle this, we introduce an inverse paradigm forprompting. Different from the classic prompts mapping tokens to labels, wereversely predict slot values given slot types. Such inverse prompting onlyrequires a one-turn prediction for each slot type and greatly speeds up theprediction. Besides, we propose a novel Iterative Prediction Strategy, fromwhich the model learns to refine predictions by considering the relationsbetween different slot types. We find, somewhat surprisingly, the proposedmethod not only predicts faster but also significantly improves the effect(improve over 6.1 F1-scores on 10-shot setting) and achieves newstate-of-the-art performance.
Documents
- zeroshot temporal relation extraction with chatgpt
- chatgpt evaluation on sentence level relations a focus on temporal, causal, and discourse relations
- daprompt deterministic assumption prompt learning for event causality identification
- is chatgpt a good causal reasoner a comprehensive evaluation
- getting sick after seeing a doctor diagnosing and mitigating knowledge conflicts in event temporal reasoning
Abstract
The goal of temporal relation extraction is to infer the temporal relation between two events in the document. Supervised models are dominant in this task. In this work, we investigate ChatGPT’s ability on zero-shot temporal relation extraction. We designed three different prompt techniques to break down the task and evaluate ChatGPT. Our experiments show that ChatGPT’s performance has a large gap with that of supervised methods and can heavily rely on the design of prompts. We further demonstrate that ChatGPT can infer more small relation classes correctly than supervised methods. The current shortcomings of ChatGPT on temporal relation extraction are also discussed in this paper. We found that ChatGPT cannot keep consistency during temporal inference and it fails in actively long-dependency temporal inference.
Abstract
This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT's promising performance across various tasks, we conduct extensive evaluations on the whole test sets of 13 datasets, including temporal and causal relations, PDTB2.0-based and dialogue-based discourse relations, and downstream applications on discourse understanding. To achieve reliable results, we adopt three tailored prompt templates for each task, including the zero-shot prompt template, zero-shot prompt engineering (PE) template, and in-context learning (ICL) prompt template, to establish the initial baseline scores for all popular sentence-pair relation classification tasks for the first time. We find that ChatGPT exhibits strong performance in detecting and reasoning about causal relations, while it may not be proficient in identifying the temporal order between two events. It can recognize most discourse relations with existing explicit discourse connectives, but the implicit discourse relation still remains a challenging task. Meanwhile, ChatGPT performs poorly in the dialogue discourse parsing task that requires structural understanding in a dialogue before being aware of the discourse relation.
Abstract
Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a deterministic assumption on the existence of causal relation between two events and then evaluate its rationality to either accept or reject the assumption. The design motivation is to try the most utilization of the encyclopedia-like knowledge embedded in a pre-trained language model. In light of such considerations, we propose a deterministic assumption prompt learning model, called DAPrompt, for the ECI task. In particular, we design a simple deterministic assumption template concatenating with the input event pair, which includes two masks as predicted events' tokens. We use the probabilities of predicted events to evaluate the assumption rationality for the final event causality decision. Experiments on the EventStoryLine corpus and Causal-TimeBank corpus validate our design objective in terms of significant performance improvements over the state-of-the-art algorithms.
Abstract
Causal reasoning ability is crucial for numerous NLP applications. Despite the impressive emerging ability of ChatGPT in various NLP tasks, it is unclear how well ChatGPT performs in causal reasoning. In this paper, we conduct the first comprehensive evaluation of the ChatGPT's causal reasoning capabilities. Experiments show that ChatGPT is not a good causal reasoner, but a good causal explainer. Besides, ChatGPT has a serious hallucination on causal reasoning, possibly due to the reporting biases between causal and non-causal relationships in natural language, as well as ChatGPT's upgrading processes, such as RLHF. The In-Context Learning (ICL) and Chain-of-Thought (CoT) techniques can further exacerbate such causal hallucination. Additionally, the causal reasoning ability of ChatGPT is sensitive to the words used to express the causal concept in prompts, and close-ended prompts perform better than open-ended prompts. For events in sentences, ChatGPT excels at capturing explicit causality rather than implicit causality, and performs better in sentences with lower event density and smaller lexical distance between events. The code is available on https://github.com/ArrogantL/ChatGPT4CausalReasoning .
Abstract
Event temporal reasoning aims at identifying the temporal relations betweentwo or more events. However, knowledge conflicts arise when there is a mismatchbetween the actual temporal relations of events in the context and the priorknowledge or biases learned by the model. We first systematically definedistinct kinds of bias in event temporal reasoning, which include eventrelation prior bias, tense bias, narrative bias, and dependency bias, asindicators to study knowledge conflicts. To mitigate such event-relatedknowledge conflict, we introduce a Counterfactual Data Augmentation basedmethod that can be applied to both Pre-trained Language Models (PLMs) and LargeLanguage Models (LLMs) either as additional training data or demonstrations forIn-Context Learning. Experiments suggest the importance of mitigating knowledgeconflicts in event temporal reasoning tasks for reducing hallucination andhighlight the potential of counterfactual data augmentation for improving modelperformance.
Documents
- s3dst structured opendomain dialogue segmentation and state tracking in the era of llms
- roco dialectic multirobot collaboration with large language models
- building cooperative embodied agents modularly with large language models
- mindagent emergent gaming interaction
- scalable multirobot collaboration with large language models centralized or decentralized systems
Abstract
The traditional Dialogue State Tracking (DST) problem aims to track user preferences and intents in user-agent conversations. While sufficient for task-oriented dialogue systems supporting narrow domain applications, the advent of Large Language Model (LLM)-based chat systems has introduced many real-world intricacies in open-domain dialogues. These intricacies manifest in the form of increased complexity in contextual interactions, extended dialogue sessions encompassing a diverse array of topics, and more frequent contextual shifts. To handle these intricacies arising from evolving LLM-based chat systems, we propose joint dialogue segmentation and state tracking per segment in open-domain dialogue systems. Assuming a zero-shot setting appropriate to a true open-domain dialogue system, we propose S3-DST, a structured prompting technique that harnesses Pre-Analytical Recollection, a novel grounding mechanism we designed for improving long context tracking. To demonstrate the efficacy of our proposed approach in joint segmentation and state tracking, we evaluate S3-DST on a proprietary anonymized open-domain dialogue dataset, as well as publicly available DST and segmentation datasets. Across all datasets and settings, S3-DST consistently outperforms the state-of-the-art, demonstrating its potency and robustness the next generation of LLM-based chat systems.
Abstract
We propose a novel approach to multi-robot collaboration that harnesses the power of pre-trained large language models (LLMs) for both high-level communication and low-level path planning. Robots are equipped with LLMs to discuss and collectively reason task strategies. They then generate sub-task plans and task space waypoint paths, which are used by a multi-arm motion planner to accelerate trajectory planning. We also provide feedback from the environment, such as collision checking, and prompt the LLM agents to improve their plan and waypoints in-context. For evaluation, we introduce RoCoBench, a 6-task benchmark covering a wide range of multi-robot collaboration scenarios, accompanied by a text-only dataset for agent representation and reasoning. We experimentally demonstrate the effectiveness of our approach -- it achieves high success rates across all tasks in RoCoBench and adapts to variations in task semantics. Our dialog setup offers high interpretability and flexibility -- in real world experiments, we show RoCo easily incorporates human-in-the-loop, where a user can communicate and collaborate with a robot agent to complete tasks together. See project website https://project-roco.github.io for videos and code.
Abstract
Large Language Models (LLMs) have demonstrated impressive planning abilities in single-agent embodied tasks across various domains. However, their capacity for planning and communication in multi-agent cooperation remains unclear, even though these are crucial skills for intelligent embodied agents. In this paper, we present a novel framework that utilizes LLMs for multi-agent cooperation and tests it in various embodied environments. Our framework enables embodied agents to plan, communicate, and cooperate with other embodied agents or humans to accomplish long-horizon tasks efficiently. We demonstrate that recent LLMs, such as GPT-4, can surpass strong planning-based methods and exhibit emergent effective communication using our framework without requiring fine-tuning or few-shot prompting. We also discover that LLM-based agents that communicate in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for embodied AI and lays the foundation for future research in multi-agent cooperation. Videos can be found on the project website https://vis-www.cs.umass.edu/Co-LLM-Agents/.
Abstract
Large Language Models (LLMs) have the capacity of performing complex scheduling in a multi-agent system and can coordinate these agents into completing sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community has insufficient benchmarks towards building general multi-agents collaboration infrastructure that encompass both LLM and human-NPCs collaborations. In this work, we propose a novel infrastructure - MindAgent - to evaluate planning and coordination emergent capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming framework, to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish an in-context learning on few-shot prompt with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that dispatch a multi-agent collaboration efficiency and supervise multiple agents playing the game simultaneously. We conduct comprehensive evaluations with new auto-metric CoS for calculating the collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted in existing broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.
Abstract
A flurry of recent work has demonstrated that pre-trained large language models (LLMs) can be effective task planners for a variety of single-robot tasks. The planning performance of LLMs is significantly improved via prompting techniques, such as in-context learning or re-prompting with state feedback, placing new importance on the token budget for the context window. An under-explored but natural next direction is to investigate LLMs as multi-robot task planners. However, long-horizon, heterogeneous multi-robot planning introduces new challenges of coordination while also pushing up against the limits of context window length. It is therefore critical to find token-efficient LLM planning frameworks that are also able to reason about the complexities of multi-robot coordination. In this work, we compare the task success rate and token efficiency of four multi-agent communication frameworks (centralized, decentralized, and two hybrid) as applied to four coordination-dependent multi-agent 2D task scenarios for increasing numbers of agents. We find that a hybrid framework achieves better task success rates across all four tasks and scales better to more agents. We further demonstrate the hybrid frameworks in 3D simulations where the vision-to-text problem and dynamical errors are considered. See our project website https://yongchao98.github.io/MIT-REALM-Multi-Robot/ for prompts, videos, and code.
Documents
- large language modelempowered agents for simulating macroeconomic activities
- subjectdriven texttoimage generation via apprenticeship learning
- fewshot emotion recognition in conversation with sequential prototypical networks
- exploring the intersection of large language models and agentbased modeling via prompt engineering
- data race detection using large language models
Abstract
The advent of the Web has brought about a paradigm shift in traditionaleconomics, particularly in the digital economy era, enabling the preciserecording and analysis of individual economic behavior. This has led to agrowing emphasis on data-driven modeling in macroeconomics. In macroeconomicresearch, Agent-based modeling (ABM) emerged as an alternative, evolvingthrough rule-based agents, machine learning-enhanced decision-making, and, morerecently, advanced AI agents. However, the existing works are suffering fromthree main challenges when endowing agents with human-like decision-making,including agent heterogeneity, the influence of macroeconomic trends, andmultifaceted economic factors. Large language models (LLMs) have recentlygained prominence in offering autonomous human-like characteristics. Therefore,leveraging LLMs in macroeconomic simulation presents an opportunity to overcometraditional limitations. In this work, we take an early step in introducing anovel approach that leverages LLMs in macroeconomic simulation. We designprompt-engineering-driven LLM agents to exhibit human-like decision-making andadaptability in the economic environment, with the abilities of perception,reflection, and decision-making to address the abovementioned challenges.Simulation experiments on macroeconomic activities show that LLM-empoweredagents can make realistic work and consumption decisions and emerge morereasonable macroeconomic phenomena than existing rule-based or AI agents. Ourwork demonstrates the promising potential to simulate macroeconomics based onLLM and its human-like characteristics.
Abstract
Recent text-to-image generation models like DreamBooth have made remarkableprogress in generating highly customized images of a target subject, byfine-tuning an ``expert model'' for a given subject from a few examples.However, this process is expensive, since a new expert model must be learnedfor each subject. In this paper, we present SuTI, a Subject-drivenText-to-Image generator that replaces subject-specific fine tuning within-context learning. Given a few demonstrations of a new subject, SuTI caninstantly generate novel renditions of the subject in different scenes, withoutany subject-specific optimization. SuTI is powered by apprenticeship learning,where a single apprentice model is learned from data generated by a massivenumber of subject-specific expert models. Specifically, we mine millions ofimage clusters from the Internet, each centered around a specific visualsubject. We adopt these clusters to train a massive number of expert models,each specializing in a different subject. The apprentice model SuTI then learnsto imitate the behavior of these fine-tuned experts. SuTI can generatehigh-quality and customized subject-specific images 20x faster thanoptimization-based SoTA methods. On the challenging DreamBench andDreamBench-v2, our human evaluation shows that SuTI significantly outperformsexisting models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt,Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.
Abstract
Several recent studies on dyadic human-human interactions have been done onconversations without specific business objectives. However, many companiesmight benefit from studies dedicated to more precise environments such as aftersales services or customer satisfaction surveys. In this work, we placeourselves in the scope of a live chat customer service in which we want todetect emotions and their evolution in the conversation flow. This contextleads to multiple challenges that range from exploiting restricted, small andmostly unlabeled datasets to finding and adapting methods for such context.Wetackle these challenges by using Few-Shot Learning while making the hypothesisit can serve conversational emotion classification for different languages andsparse labels. We contribute by proposing a variation of Prototypical Networksfor sequence labeling in conversation that we name ProtoSeq. We test thismethod on two datasets with different languages: daily conversations in Englishand customer service chat conversations in French. When applied to emotionclassification in conversations, our method proved to be competitive even whencompared to other ones.
Abstract
The final frontier for simulation is the accurate representation of complex, real-world social systems. While agent-based modeling (ABM) seeks to study the behavior and interactions of agents within a larger system, it is unable to faithfully capture the full complexity of human-driven behavior. Large language models (LLMs), like ChatGPT, have emerged as a potential solution to this bottleneck by enabling researchers to explore human-driven interactions in previously unimaginable ways. Our research investigates simulations of human interactions using LLMs. Through prompt engineering, inspired by Park et al. (2023), we present two simulations of believable proxies of human behavior: a two-agent negotiation and a six-agent murder mystery game.
Abstract
Large language models (LLMs) are demonstrating significant promise as analternate strategy to facilitate analyses and optimizations of high-performancecomputing programs, circumventing the need for resource-intensive manual toolcreation. In this paper, we explore a novel LLM-based data race detectionapproach combining prompting engineering and fine-tuning techniques. We createa dedicated dataset named DRB-ML, which is derived from DataRaceBench, withfine-grain labels showing the presence of data race pairs and their associatedvariables, line numbers, and read/write information. DRB-ML is then used toevaluate representative LLMs and fine-tune open-source ones. Our experimentshows that LLMs can be a viable approach to data race detection. However, theystill cannot compete with traditional data race detection tools when we needdetailed information about variable pairs causing data races.
Documents
- how are prompts different in terms of sensitivity
- prompt position really matters in fewshot and zeroshot nlu tasks
- flatnessaware prompt selection improves accuracy and sample efficiency
- how to prompt llms for texttosql a study in zeroshot, singledomain, and crossdomain settings
- analyzing chainofthought prompting in large language models via gradientbased feature attributions
Abstract
In-context learning (ICL) has become one of the most popular learningparadigms. While there is a growing body of literature focusing on promptengineering, there is a lack of systematic analysis comparing the effects ofprompts across different models and tasks. To address this gap, we present acomprehensive prompt analysis based on the sensitivity of a function. Ouranalysis reveals that sensitivity is an unsupervised proxy for modelperformance, as it exhibits a strong negative correlation with accuracy. We usegradient-based saliency scores to empirically demonstrate how different promptsaffect the relevance of input tokens to the output, resulting in differentlevels of sensitivity. Furthermore, we introduce sensitivity-aware decodingwhich incorporates sensitivity estimation as a penalty term in the standardgreedy decoding. We show that this approach is particularly helpful wheninformation in the input is scarce. Our work provides a fresh perspective onthe analysis of prompts, and contributes to a better understanding of themechanism of ICL.
Abstract
Prompt-based models have made remarkable advancements in the fields of zero-shot and few-shot learning, attracting a lot of attention from researchers. Developing an effective prompt template plays a critical role. However, prior studies have mainly focused on prompt vocabulary selection or embedding initialization with the reserved prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position option for natural language understanding tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt position used in prior studies is often sub-optimal for both zero-shot and few-shot settings. These findings suggest prompt position optimisation as an interesting research direction alongside the existing focus on prompt engineering.
Abstract
With growing capabilities of large language models, prompting them has become the dominant way to access them. This has motivated the development of strategies for automatically selecting effective language prompts. In this paper, we introduce prompt flatness, a new metric to quantify the expected utility of a language prompt. This metric is inspired by flatness regularization in statistical learning that quantifies the robustness of the model towards its parameter perturbations. We provide theoretical foundations for this metric and its relationship with other prompt selection metrics, providing a comprehensive understanding of existing methods. Empirically, we show that combining prompt flatness with existing metrics improves both performance and sample efficiency. Our metric outperforms the previous prompt selection metrics with an average increase of 5% in accuracy and 10% in Pearson correlation across 6 classification benchmarks.
Abstract
Large language models (LLMs) with in-context learning have demonstratedremarkable capability in the text-to-SQL task. Previous research has promptedLLMs with various demonstration-retrieval strategies and intermediate reasoningsteps to enhance the performance of LLMs. However, those works often employvaried strategies when constructing the prompt text for text-to-SQL inputs,such as databases and demonstration examples. This leads to a lack ofcomparability in both the prompt constructions and their primary contributions.Furthermore, selecting an effective prompt construction has emerged as apersistent problem for future research. To address this limitation, wecomprehensively investigate the impact of prompt constructions across varioussettings and provide insights for future work.
Abstract
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks. While understanding why CoT prompting is effective is crucial to ensuring that this phenomenon is a consequence of desired model behavior, little work has addressed this; nonetheless, such an understanding is a critical prerequisite for responsible model deployment. We address this question by leveraging gradient-based feature attribution methods which produce saliency scores that capture the influence of input tokens on model output. Specifically, we probe several open-source LLMs to investigate whether CoT prompting affects the relative importances they assign to particular input tokens. Our results indicate that while CoT prompting does not increase the magnitude of saliency scores attributed to semantically relevant tokens in the prompt compared to standard few-shot prompting, it increases the robustness of saliency scores to question perturbations and variations in model output.
Documents
- fashionsap symbols and attributes prompt for finegrained fashion visionlanguage pretraining
- language model acceptability judgements are not always robust to context
- moconvq unified physicsbased motion control via scalable discrete representations
- you are an expert linguistic annotator limits of llms as analyzers of abstract meaning representation
- rcot detecting and rectifying factual inconsistency in reasoning by reversing chainofthought
Abstract
Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modalities fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine- grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP gets SOTA performances for four popular fashion tasks. The ablation study also shows the proposed abstract fashion symbols, and the attribute prompt method enables the model to acquire fine-grained semantics in the fashion domain effectively. The obvious performance gains from FashionSAP provide a new baseline for future fashion task research.11The source code is available at https://github.com/hssip/FashionSAP
Abstract
Targeted syntactic evaluations of language models ask whether models showstable preferences for syntactically acceptable content over minimal-pairunacceptable inputs. Most targeted syntactic evaluation datasets ask models tomake these judgements with just a single context-free sentence as input. Thisdoes not match language models' training regime, in which input sentences arealways highly contextualized by the surrounding corpus. This mismatch raises animportant question: how robust are models' syntactic judgements in differentcontexts? In this paper, we investigate the stability of language models'performance on targeted syntactic evaluations as we vary properties of theinput context: the length of the context, the types of syntactic phenomena itcontains, and whether or not there are violations of grammaticality. We findthat model judgements are generally robust when placed in randomly sampledlinguistic contexts. However, they are substantially unstable for contextscontaining syntactic structures matching those in the critical test content.Among all tested models (GPT-2 and five variants of OPT), we significantlyimprove models' judgements by providing contexts with matching syntacticstructures, and conversely significantly worsen them using unacceptablecontexts with matching but violated syntactic structures. This effect isamplified by the length of the context, except for unrelated inputs. We showthat these changes in model performance are not explainable by simple featuresmatching the context and the test inputs, such as lexical overlap anddependency overlap. This sensitivity to highly specific syntactic features ofthe context can only be explained by the models' implicit in-context learningabilities.
Abstract
In this work, we present MoConVQ, a novel unified framework for physics-basedmotion control leveraging scalable discrete representations. Building uponvector quantized variational autoencoders (VQ-VAE) and model-basedreinforcement learning, our approach effectively learns motion embeddings froma large, unstructured dataset spanning tens of hours of motion examples. Theresultant motion representation not only captures diverse motion skills butalso offers a robust and intuitive interface for various applications. Wedemonstrate the versatility of MoConVQ through several applications: universaltracking control from various motion sources, interactive character controlwith latent motion representations using supervised learning, physics-basedmotion generation from natural language descriptions using the GPT framework,and, most interestingly, seamless integration with large language models (LLMs)with in-context learning to tackle complex and abstract tasks.
Abstract
Large language models (LLMs) show amazing proficiency and fluency in the useof language. Does this mean that they have also acquired insightful linguisticknowledge about the language, to an extent that they can serve as an "expertlinguistic annotator"? In this paper, we examine the successes and limitationsof the GPT-3, ChatGPT, and GPT-4 models in analysis of sentence meaningstructure, focusing on the Abstract Meaning Representation (AMR; Banarescu etal. 2013) parsing formalism, which provides rich graphical representations ofsentence meaning structure while abstracting away from surface forms. Wecompare models' analysis of this semantic structure across two settings: 1)direct production of AMR parses based on zero- and few-shot prompts, and 2)indirect partial reconstruction of AMR via metalinguistic natural languagequeries (e.g., "Identify the primary event of this sentence, and the predicatecorresponding to that event."). Across these settings, we find that models canreliably reproduce the basic format of AMR, and can often capture core event,argument, and modifier structure -- however, model outputs are prone tofrequent and major errors, and holistic analysis of parse acceptability showsthat even with few-shot demonstrations, models have virtually 0% success inproducing fully accurate parses. Eliciting natural language responses producessimilar patterns of errors. Overall, our findings indicate that these modelsout-of-the-box can capture aspects of semantic structure, but there remain keylimitations in their ability to support fully accurate semantic analyses orparses.
Abstract
Large language Models (LLMs) have achieved promising performance on arithmetic reasoning tasks by incorporating step-by-step chain-of-thought (CoT) prompting. However, LLMs face challenges in maintaining factual consistency during reasoning, exhibiting tendencies to condition overlooking, question misinterpretation, and condition hallucination over given problems. Existing methods use coarse-grained feedback (e.g., whether the answer is correct) to improve factual consistency. In this work, we propose RCoT (Reversing Chain-of-Thought), a novel method to improve LLMs' reasoning abilities by automatically detecting and rectifying factual inconsistency in LLMs, generated solutions. To detect factual inconsistency, RCoT first asks LLMs to reconstruct the problem based on generated solutions. Then fine-grained comparisons between the original problem and the reconstructed problem expose the factual inconsistency in the original solutions. To rectify the solution, RCoT formulates detected factual inconsistency into fine-grained feedback to guide LLMs in revising solutions. Experimental results demonstrate improvements of RCoT over standard CoT, Self-Consistency and Self-Refine across seven arithmetic datasets. Moreover, we find that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities (e.g., ChatGPT reaches 94.6% accuracy on GSM8K), encouraging the community to further explore the fine-grained feedback generation methods.
Documents
- structured chainofthought prompting for code generation
- acecoder utilizing existing code to enhance code generation
- a study on prompt design, advantages and limitations of chatgpt for deep learning program repair
- a lightweight framework for highquality code generation
- large language models are fewshot summarizers multiintent comment generation via incontext learning
Abstract
Large Language Models (LLMs) (e.g., ChatGPT) have shown impressiveperformance in code generation. LLMs take prompts as inputs, andChain-of-Thought (CoT) prompting is the state-of-the-art prompting technique.CoT prompting asks LLMs first to generate CoTs (i.e., intermediate naturallanguage reasoning steps) and then output the code. However, CoT prompting isdesigned for natural language generation and has low accuracy in codegeneration. In this paper, we propose Structured CoTs (SCoTs) and present a novelprompting technique for code generation, named SCoT prompting. Our motivationis source code contains rich structural information and any code can becomposed of three program structures (i.e., sequence, branch, and loopstructures). Intuitively, structured intermediate reasoning steps make forstructured source code. Thus, we ask LLMs to use program structures to buildCoTs, obtaining SCoTs. Then, LLMs generate the final code based on SCoTs.Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to thinkabout how to solve requirements from the view of source code and further theperformance of LLMs in code generation. We apply SCoT prompting to two LLMs(i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval,MBPP, and MBCPP). (1) SCoT prompting outperforms the state-of-the-art baseline- CoT prompting by up to 13.79% in Pass@1. (2) Human evaluation shows humandevelopers prefer programs from SCoT prompting. (3) SCoT prompting is robust toexamples and achieves substantial improvements.
Abstract
Large Language Models (LLMs) have shown great success in code generation.LLMs take as the input a prompt and output the code. A key question is how tomake prompts (i.e., Prompting Techniques). Existing prompting techniques aredesigned for natural language generation and have low accuracy in codegeneration. In this paper, we propose a new prompting technique named AceCoder. Ourmotivation is that code generation meets two unique challenges (i.e.,requirement understanding and code implementation). AceCoder contains two novelmechanisms (i.e., guided code generation and example retrieval) to solve thesechallenges. (1) Guided code generation asks LLMs first to analyze requirementsand output an intermediate preliminary (e.g., test cases). The preliminary isused to clarify requirements and tell LLMs "what to write". (2) Exampleretrieval selects similar programs as examples in prompts, which provide lotsof relevant content (e.g., algorithms, APIs) and teach LLMs "how to write". Weapply AceCoder to three LLMs (e.g., Codex) and evaluate it on three publicbenchmarks using the Pass@k. Results show that AceCoder can significantlyimprove the performance of LLMs on code generation. (1) In terms of Pass@1,AceCoder outperforms the state-of-the-art baseline by up to 56.4% in MBPP,70.7% in MBJP, and 88.4% in MBJSP. (2) AceCoder is effective in LLMs withdifferent sizes (i.e., 6B to 13B) and different languages (i.e., Python, Java,and JavaScript). (3) Human evaluation shows human developers prefer programsfrom AceCoder.
Abstract
ChatGPT has revolutionized many research and industrial fields. ChatGPT has shown great potential in software engineering to boost various traditional tasks such as program repair, code understanding, and code generation. However, whether automatic program repair (APR) applies to deep learning (DL) programs is still unknown. DL programs, whose decision logic is not explicitly encoded in the source code, have posed unique challenges to APR. While to repair DL programs, an APR approach needs to not only parse the source code syntactically but also needs to understand the code intention. With the best prior work, the performance of fault localization is still far less than satisfactory (only about 30\%). Therefore, in this paper, we explore ChatGPT's capability for DL program repair by asking three research questions. (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT's repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? On top of that, we categorize the common aspects useful for prompt design for DL program repair. Also, we propose various prompt templates to facilitate the performance and summarize the advantages and disadvantages of ChatGPT's abilities such as detecting bad code smell, code refactoring, and detecting API misuse/deprecation.
Abstract
In recent years, the use of automated source code generation utilizing transformer-based generative models has expanded, and these models can generate functional code according to the requirements of the developers. However, recent research revealed that these automatically generated source codes can contain vulnerabilities and other quality issues. Despite researchers' and practitioners' attempts to enhance code generation models, retraining and fine-tuning large language models is time-consuming and resource-intensive. Thus, we describe FRANC, a lightweight framework for recommending more secure and high-quality source code derived from transformer-based code generation models. FRANC includes a static filter to make the generated code compilable with heuristics and a quality-aware ranker to sort the code snippets based on a quality score. Moreover, the framework uses prompt engineering to fix persistent quality issues. We evaluated the framework with five Python and Java code generation models and six prompt datasets, including a newly created one in this work (SOEval). The static filter improves 9% to 46% Java suggestions and 10% to 43% Python suggestions regarding compilability. The average improvement over the NDCG@10 score for the ranking system is 0.0763, and the repairing techniques repair the highest 80% of prompts. FRANC takes, on average, 1.98 seconds for Java; for Python, it takes 0.08 seconds.
Abstract
Code comment generation aims at generating natural language descriptions fora code snippet to facilitate developers' program comprehension activities.Despite being studied for a long time, a bottleneck for existing approaches isthat given a code snippet, they can only generate one comment while developersusually need to know information from diverse perspectives such as what is thefunctionality of this code snippet and how to use it. To tackle thislimitation, this study empirically investigates the feasibility of utilizinglarge language models (LLMs) to generate comments that can fulfill developers'diverse intents. Our intuition is based on the facts that (1) the code and itspairwise comment are used during the pre-training process of LLMs to build thesemantic connection between the natural language and programming language, and(2) comments in the real-world projects, which are collected for thepre-training, usually contain different developers' intents. We thus postulatethat the LLMs can already understand the code from different perspectives afterthe pre-training. Indeed, experiments on two large-scale datasets demonstratethe rationale of our insights: by adopting the in-context learning paradigm andgiving adequate prompts to the LLM (e.g., providing it with ten or moreexamples), the LLM can significantly outperform a state-of-the-art supervisedlearning approach on generating comments with multiple intents. Results alsoshow that customized strategies for constructing the prompts andpost-processing strategies for reranking the results can both boost the LLM'sperformances, which shed light on future research directions for using LLMs toachieve comment generation.
Documents
- calibrating llmbased evaluator
- evallm interactive evaluation of large language model prompts on userdefined criteria
- benchmarking cognitive biases in large language models as evaluators
- prompting a large language model to generate diverse motivational messages a comparison with humanwritten messages
- noise2music textconditioned music generation with diffusion models
Abstract
Recent advancements in large language models (LLMs) on language modeling andemergent capabilities make them a promising reference-free evaluator of naturallanguage generation quality, and a competent alternative to human evaluation.However, hindered by the closed-source or high computational demand to host andtune, there is a lack of practice to further calibrate an off-the-shelfLLM-based evaluator towards better human alignment. In this work, we proposeAutoCalibrate, a multi-stage, gradient-free approach to automatically calibrateand align an LLM-based evaluator toward human preference. Instead of explicitlymodeling human preferences, we first implicitly encompass them within a set ofhuman labels. Then, an initial set of scoring criteria is drafted by thelanguage model itself, leveraging in-context learning on different few-shotexamples. To further calibrate this set of criteria, we select the bestperformers and re-draft them with self-refinement. Our experiments on multipletext quality evaluation datasets illustrate a significant improvement incorrelation with expert evaluation through calibration. Our comprehensivequalitative analysis conveys insightful intuitions and observations on theessence of effective scoring criteria.
Abstract
By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
Abstract
Large Language Models (LLMs) have recently been shown to be effective asautomatic evaluators with simple prompting and in-context learning. In thiswork, we assemble 15 LLMs of four different size ranges and evaluate theiroutput responses by preference ranking from the other LLMs as evaluators, suchas System Star is better than System Square. We then evaluate the quality ofranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators(CoBBLEr), a benchmark to measure six different cognitive biases in LLMevaluation outputs, such as the Egocentric bias where a model prefers to rankits own outputs highly in evaluation. We find that LLMs are biased text qualityevaluators, exhibiting strong indications on our bias benchmark (average of 40%of comparisons across all models) within each of their evaluations thatquestion their robustness as evaluators. Furthermore, we examine thecorrelation between human and machine preferences and calculate the averageRank-Biased Overlap (RBO) score to be 49.6%, indicating that machinepreferences are misaligned with humans. According to our findings, LLMs maystill be unable to be utilized for automatic annotation aligned with humanpreferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
Abstract
Large language models (LLMs) are increasingly capable and prevalent, and can be used to produce creative content. The quality of content is influenced by the prompt used, with more specific prompts that incorporate examples generally producing better results. On from this, it could be seen that using instructions written for crowdsourcing tasks (that are specific and include examples to guide workers) could prove effective LLM prompts. To explore this, we used a previous crowdsourcing pipeline that gave examples to people to help them generate a collectively diverse corpus of motivational messages. We then used this same pipeline to generate messages using GPT-4, and compared the collective diversity of messages from: (1) crowd-writers, (2) GPT-4 using the pipeline, and (3&4) two baseline GPT-4 prompts. We found that the LLM prompts using the crowdsourcing pipeline caused GPT-4 to produce more diverse messages than the two baseline prompts. We also discuss implications from messages generated by both human writers and LLMs.
Abstract
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
Documents
- prompt engineering for textbased generative art
- a taxonomy of prompt modifiers for texttoimage generation
- coaudit tools to help humans doublecheck aigenerated content
- prompt sapper llmempowered software engineering infrastructure for ainative services
- constructing dreams using generative ai
Abstract
Text-based generative art has seen an explosion of interest in 2021. Online communities around text-based generative art as a novel digital medium have quickly emerged. This short paper identifies five types of prompt modifiers used by practitioners in the community of text-based generative art based on a 3-month ethnographic study on Twitter. The novel taxonomy of prompt modifiers provides researchers a conceptual starting point for investigating the practices of text-based generative art, but also may help practitioners of text-based generative art improve their images. The paper concludes with a discussion of research opportunities in the space of text-based generative art and the broader implications of prompt engineering from the perspective of human-AI interaction in future applications beyond the use case of text-based generative art.
Abstract
Text-to-image generation has seen an explosion of interest since 2021. Today,beautiful and intriguing digital images and artworks can be synthesized fromtextual inputs ("prompts") with deep generative models. Online communitiesaround text-to-image generation and AI generated art have quickly emerged. Thispaper identifies six types of prompt modifiers used by practitioners in theonline community based on a 3-month ethnographic study. The novel taxonomy ofprompt modifiers provides researchers a conceptual starting point forinvestigating the practice of text-to-image generation, but may also helppractitioners of AI generated art improve their images. We further outline howprompt modifiers are applied in the practice of "prompt engineering." Wediscuss research opportunities of this novel creative practice in the field ofHuman-Computer Interaction (HCI). The paper concludes with a discussion ofbroader implications of prompt engineering from the perspective of Human-AIInteraction (HAI) in future applications beyond the use case of text-to-imagegeneration and AI generated art.
Abstract
Users are increasingly being warned to check AI-generated content for correctness. Still, as LLMs (and other generative models) generate more complex output, such as summaries, tables, or code, it becomes harder for the user to audit or evaluate the output for quality or correctness. Hence, we are seeing the emergence of tool-assisted experiences to help the user double-check a piece of AI-generated content. We refer to these as co-audit tools. Co-audit tools complement prompt engineering techniques: one helps the user construct the input prompt, while the other helps them check the output response. As a specific example, this paper describes recent research on co-audit tools for spreadsheet computations powered by generative models. We explain why co-audit experiences are essential for any application of generative AI where quality is important and errors are consequential (as is common in spreadsheet computations). We propose a preliminary list of principles for co-audit, and outline research challenges.
Abstract
Foundation models, such as GPT-4, DALL-E have brought unprecedented AI"operating system"effect and new forms of human-AI interaction, sparking a wave of innovation in AI-native services, where natural language prompts serve as executable"code"directly (prompt as executable code), eliminating the need for programming language as an intermediary and opening up the door to personal AI. Prompt Sapper has emerged in response, committed to support the development of AI-native services by AI chain engineering. It creates a large language model (LLM) empowered software engineering infrastructure for authoring AI chains through human-AI collaborative intelligence, unleashing the AI innovation potential of every individual, and forging a future where everyone can be a master of AI innovation. This article will introduce the R\&D motivation behind Prompt Sapper, along with its corresponding AI chain engineering methodology and technical practices.
Abstract
Generative AI tools introduce new and accessible forms of media creation foryouth. They also raise ethical concerns about the generation of fake media,data protection, privacy and ownership of AI-generated art. Since generative AIis already being used in products used by youth, it is critical that theyunderstand how these tools work and how they can be used or misused. In thiswork, we facilitated students' generative AI learning through expression oftheir imagined future identities. We designed a learning workshop - Dreamingwith AI - where students learned about the inner workings of generative AItools, used text-to-image generation algorithms to create their imaged futuredreams, reflected on the potential benefits and harms of generative AI toolsand voiced their opinions about policies for the use of these tools inclassrooms. In this paper, we present the learning activities and experiencesof 34 high school students who engaged in our workshops. Students reachedcreative learning objectives by using prompt engineering to create their futuredreams, gained technical knowledge by learning the abilities, limitations,text-visual mappings and applications of generative AI, and identified mostpotential societal benefits and harms of generative AI.
Documents
- a survey of large language models for autonomous driving
- beyond task performance evaluating and reducing the flaws of large multimodal models with incontext learning
- bioinformatics in plant breeding and research on disease resistance
- the mystery and fascination of llms a comprehensive survey on the interpretation and analysis of emergent abilities
- llm4dyg can large language models solve problems on dynamic graphs
Abstract
Autonomous driving technology, a catalyst for revolutionizing transportationand urban mobility, has the tend to transition from rule-based systems todata-driven strategies. Traditional module-based systems are constrained bycumulative errors among cascaded modules and inflexible pre-set rules. Incontrast, end-to-end autonomous driving systems have the potential to avoiderror accumulation due to their fully data-driven training process, althoughthey often lack transparency due to their ``black box" nature, complicating thevalidation and traceability of decisions. Recently, large language models(LLMs) have demonstrated abilities including understanding context, logicalreasoning, and generating answers. A natural thought is to utilize theseabilities to empower autonomous driving. By combining LLM with foundationvision models, it could open the door to open-world understanding, reasoning,and few-shot learning, which current autonomous driving systems are lacking. Inthis paper, we systematically review a research line about \textit{LargeLanguage Models for Autonomous Driving (LLM4AD)}. This study evaluates thecurrent state of technological advancements, distinctly outlining the principalchallenges and prospective directions for the field. For the convenience ofresearchers in academia and industry, we provide real-time updates on thelatest advances in the field as well as relevant open-source resources via thedesignated link: https://github.com/Thinklab-SJTU/Awesome-LLM4AD.
Abstract
Following the success of Large Language Models (LLMs), Large MultimodalModels (LMMs), such as the Flamingo model and its subsequent competitors, havestarted to emerge as natural steps towards generalist agents. However,interacting with recent LMMs reveals major limitations that are hardly capturedby the current evaluation benchmarks. Indeed, task performances (e.g., VQAaccuracy) alone do not provide enough clues to understand their realcapabilities, limitations, and to which extent such models are aligned to humanexpectations. To refine our understanding of those flaws, we deviate from thecurrent evaluation paradigm and propose the EvALign-ICL framework, in which we(1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture suchas OpenFlamingo and IDEFICS) on 5 different axes; hallucinations, abstention,compositionality, explainability and instruction following. Our evaluation onthese axes reveals major flaws in LMMs. To efficiently address these problems,and inspired by the success of in-context learning (ICL) in LLMs, (2) weexplore ICL as a solution and study how it affects these limitations. Based onour ICL study, (3) we push ICL further and propose new multimodal ICLapproaches such as; Multitask-ICL, Chain-of-Hindsight-ICL, andSelf-Correcting-ICL. Our findings are as follows; (1) Despite their success,LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICLon LMMs flaws is nuanced; despite its effectiveness for improvedexplainability, abstention, and instruction following, ICL does not improvecompositional abilities, and actually even amplifies hallucinations. (3) Theproposed ICL variants are promising as post-hoc approaches to efficientlytackle some of those flaws. The code is available here:https://evalign-icl.github.io/
Abstract
In the context of plant breeding, bioinformatics can empower genetic and genomic selection to determine the optimal combination of genotypes that will produce a desired phenotype and help expedite the isolation of these new varieties. Bioinformatics is also instrumental in collecting and processing plant phenotypes, which facilitates plant breeding. Robots that use automated and digital technologies to collect and analyze different types of information to monitor the environment in which plants grow, analyze the environmental stresses they face, and promptly optimize suboptimal and adverse growth conditions accordingly, have helped plant research and saved human resources. In this paper, we describe the use of various bioinformatics databases and algorithms and explore their potential applications in plant breeding and for research on plant disease resistance.
Abstract
Understanding emergent abilities, such as in-context learning (ICL) andchain-of-thought (CoT) prompting in large language models (LLMs), is of utmostimportance. This importance stems not only from the better utilization of thesecapabilities across various tasks, but also from the proactive identificationand mitigation of potential risks, including concerns of truthfulness, bias,and toxicity, that may arise alongside these capabilities. In this paper, wepresent a thorough survey on the interpretation and analysis of emergentabilities of LLMs. First, we provide a concise introduction to the backgroundand definition of emergent abilities. Then, we give an overview of advancementsfrom two perspectives: 1) a macro perspective, emphasizing studies on themechanistic interpretability and delving into the mathematical foundationsbehind emergent abilities; and 2) a micro-perspective, concerning studies thatfocus on empirical interpretability by examining factors associated with theseabilities. We conclude by highlighting the challenges encountered andsuggesting potential avenues for future research. We believe that our workestablishes the basis for further exploration into the interpretation ofemergent abilities.
Abstract
In an era marked by the increasing adoption of Large Language Models (LLMs)for various tasks, there is a growing focus on exploring LLMs' capabilities inhandling web data, particularly graph data. Dynamic graphs, which capturetemporal network evolution patterns, are ubiquitous in real-world web data.Evaluating LLMs' competence in understanding spatial-temporal information ondynamic graphs is essential for their adoption in web applications, whichremains unexplored in the literature. In this paper, we bridge the gap viaproposing to evaluate LLMs' spatial-temporal understanding abilities on dynamicgraphs, to the best of our knowledge, for the first time. Specifically, wepropose the LLM4DyG benchmark, which includes nine specially designed tasksconsidering the capability evaluation of LLMs from both temporal and spatialdimensions. Then, we conduct extensive experiments to analyze the impacts ofdifferent data generators, data statistics, prompting techniques, and LLMs onthe model performance. Finally, we propose Disentangled Spatial-TemporalThoughts (DST2) for LLMs on dynamic graphs to enhance LLMs' spatial-temporalunderstanding abilities. Our main observations are: 1) LLMs have preliminaryspatial-temporal understanding abilities on dynamic graphs, 2) Dynamic graphtasks show increasing difficulties for LLMs as the graph size and densityincrease, while not sensitive to the time span and data generation mechanism,3) the proposed DST2 prompting method can help to improve LLMs'spatial-temporal understanding abilities on dynamic graphs for most tasks. Thedata and codes will be open-sourced at publication time.
Documents
- the c4h, tat, hppr and hppd genes prompted engineering of rosmarinic acid biosynthetic pathway in salvia miltiorrhiza hairy root cultures
- beyond yes and no improving zeroshot llm rankers via scoring finegrained relevance labels
- multilabel fewshot icd coding as autoregressive generation with prompt
- generating novel leads for drug discovery using llms with logical feedback
- unsupervised contrastconsistent ranking with language models
Abstract
Rational engineering to produce biologically active plant compounds has been greatly impeded by our poor understanding of the regulatory and metabolic pathways underlying the biosynthesis of these compounds. Here we capitalized on our previously described gene-to-metabolite network in order to engineer rosmarinic acid (RA) biosynthesis pathway for the production of beneficial RA and lithospermic acid B (LAB) in Salvia miltiorrhiza hairy root cultures. Results showed their production was greatly elevated by (1) overexpression of single gene, including cinnamic acid 4-hydroxylase (c4h), tyrosine aminotransferase (tat), and 4-hydroxyphenylpyruvate reductase (hppr), (2) overexpression of both tat and hppr, and (3) suppression of 4-hydroxyphenylpyruvate dioxygenase (hppd). Co-expression of tat/hppr produced the most abundant RA (906 mg/liter) and LAB (992 mg/liter), which were 4.3 and 3.2-fold more than in their wild-type (wt) counterparts respectively. And the value of RA concentration was also higher than that reported before, that produced by means of nutrient medium optimization or elicitor treatment. It is the first report of boosting RA and LAB biosynthesis through genetic manipulation, providing an effective approach for their large-scale commercial production by using hairy root culture systems as bioreactors.
Abstract
Zero-shot text rankers powered by recent LLMs achieve remarkable rankingperformance by simply prompting. Existing prompts for pointwise LLM rankersmostly ask the model to choose from binary relevance labels like "Yes" and"No". However, the lack of intermediate relevance label options may cause theLLM to provide noisy or biased answers for documents that are partiallyrelevant to the query. We propose to incorporate fine-grained relevance labelsinto the prompt for LLM rankers, enabling them to better differentiate amongdocuments with different levels of relevance to the query and thus derive amore accurate ranking. We study two variants of the prompt template, coupledwith different numbers of relevance levels. Our experiments on 8 BEIR data setsshow that adding fine-grained relevance labels significantly improves theperformance of LLM rankers.
Abstract
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F130.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.
Abstract
Large Language Models (LLMs) can be used as repositories of biological and chemical information to generate pharmacological lead compounds. However, for LLMs to focus on specific drug targets typically require experimentation with progressively more refined prompts. Results thus become dependent not just on what is known about the target, but also on what is known about the prompt-engineering. In this paper, we separate the prompt into domain-constraints that can be written in a standard logical form, and a simple text-based query. We investigate whether LLMs can be guided, not by refining prompts manually, but by refining the the logical component automatically, keeping the query unchanged. We describe an iterative procedure LMLF (“Language Models with Logical Feedback”) in which the constraints are progressively refined using a logical notion of generalisation. On any iteration, newly generated instances are verified against the constraint, providing “logical-feedback” for the next iteration’s refinement of the constraints. We evaluate LMLF using two well-known targets (inhibition of the Janus Kinase 2; and Dopamine Receptor D2); and two different LLMs (GPT-3 and PaLM). We show that LMLF, starting with the same logical constraints and query text, can guide both LLMs to generate potential leads. We find: (a) Binding affinities of LMLF-generated molecules are skewed towards higher binding affinities than those from existing baselines; LMLF results in generating molecules that are skewed towards higher binding affinities than without logical feedback; (c) Assessment by a computational chemist suggests that LMLF generated compounds may be novel inhibitors. These findings suggest that LLMs with logical feedback may provide a mechanism for generating new leads without requiring the domain-specialist to acquire sophisticated skills in prompt-engineering.
Abstract
Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank reviews by sentiment. Recent work focuses on pairwise, pointwise, and listwise prompting techniques to elicit a language model's ranking knowledge. However, we find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce. This motivates us to explore an alternative approach that is inspired by an unsupervised probing method called Contrast-Consistent Search (CCS). The idea is to train a probing model guided by a logical constraint: a model's representation of a statement and its negation must be mapped to contrastive true-false poles consistently across multiple statements. We hypothesize that similar constraints apply to ranking tasks where all items are related via consistent pairwise or listwise comparisons. To this end, we extend the binary CCS method to Contrast-Consistent Ranking (CCR) by adapting existing ranking methods such as the Max-Margin Loss, Triplet Loss, and Ordinal Regression objective. Our results confirm that, for the same language model, CCR probing outperforms prompting and even performs on a par with prompting much larger language models.
Documents
- human emotion knowledge representation emerges in large language model and supports discrete emotion inference
- tool documentation enables zeroshot toolusage with large language models
- languagespecific representation of emotionconcept knowledge causally supports emotion inference
- att3d amortized textto3d object synthesis
- do physicians know how to prompt the need for automatic prompt optimization help in clinical note generation
Abstract
How humans infer discrete emotions is a fundamental research question in the field of psychology. While conceptual knowledge about emotions (emotion knowledge) has been suggested to be essential for emotion inference, evidence to date is mostly indirect and inconclusive. As the large language models (LLMs) have been shown to support effective representations of various human conceptual knowledge, the present study further employed artificial neurons in LLMs to investigate the mechanism of human emotion inference. With artificial neurons activated by prompts, the LLM (RoBERTa) demonstrated a similar conceptual structure of 27 discrete emotions as that of human behaviors. Furthermore, the LLM-based conceptual structure revealed a human-like reliance on 14 underlying conceptual attributes of emotions for emotion inference. Most importantly, by manipulating attribute-specific neurons, we found that the corresponding LLM's emotion inference performance deteriorated, and the performance deterioration was correlated to the effectiveness of representations of the conceptual attributes on the human side. Our findings provide direct evidence for the emergence of emotion knowledge representation in large language models and suggest its casual support for discrete emotion inference. # These authors contributed equally: liming16@tsinghua.org.cn, yushengsu.thu@gmail.com * Corresponding authors: {liuzy, dzhang}@tsinghua.edu.cn The source code can be obtained from https://github.com/thunlp/Model_Emotion.
Abstract
Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation. Third, we highlight the benefits of tool documentations by tackling image generation and video tracking using just-released unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
Abstract
Understanding how language supports emotion inference remains a topic ofdebate in emotion science. The present study investigated whetherlanguage-derived emotion-concept knowledge would causally support emotioninference by manipulating the language-specific knowledge representations inlarge language models. Using the prompt technique, 14 attributes of emotionconcepts were found to be represented by distinct artificial neuronpopulations. By manipulating these attribute-related neurons, the majority ofthe emotion inference tasks showed performance deterioration compared to randommanipulations. The attribute-specific performance deterioration was related tothe importance of different attributes in human mental space. Our findingsprovide causal evidence in support of a language-based mechanism for emotioninference and highlight the contributions of emotion-concept knowledge.
Abstract
Text-to-3D modelling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework - Amortized text-to-3D (ATT3D) - enables knowledge-sharing between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations.
Abstract
This study examines the effect of prompt engineering on the performance ofLarge Language Models (LLMs) in clinical note generation. We introduce anAutomatic Prompt Optimization (APO) framework to refine initial prompts andcompare the outputs of medical experts, non-medical experts, and APO-enhancedGPT3.5 and GPT4. Results highlight GPT4 APO's superior performance instandardizing prompt quality across clinical note sections. A human-in-the-loopapproach shows that experts maintain content quality post-APO, with apreference for their own modifications, suggesting the value of expertcustomization. We recommend a two-phase optimization process, leveragingAPO-GPT4 for consistency and expert input for personalization.
Documents
- marked personas using natural language prompts to measure stereotypes in language models
- embedding democratic values into social media ais via societal objective functions
- casteist but not racist quantifying disparities in large language model bias between india and the west
- sociocultural norm similarities and differences via situational alignment and explainable textual entailment
- social simulacra creating populated prototypes for social computing systems
Abstract
To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling.Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones.We find that the portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. The words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. An intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. These representational harms have concerning implications for downstream applications like story generation.
Abstract
Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes to use to train such models, however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. Removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media AIs.
Abstract
Large Language Models (LLMs), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame and compare bias levels between the Indian and Western contexts. To do this, we develop a novel dataset which we call Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially as compared to the Western context. We finally investigate Instruction Prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5. The findings of this work highlight the need for including more diverse voices when evaluating LLMs.
Abstract
Designing systems that can reason across cultures requires that they are grounded in the norms of the contexts in which they operate. However, current research on developing computational models of social norms has primarily focused on American society. Here, we propose a novel approach to discover and compare descriptive social norms across Chinese and American cultures. We demonstrate our approach by leveraging discussions on a Chinese Q&A platform (Zhihu) and the existing SocialChemistry dataset as proxies for contrasting cultural axes, align social situations cross-culturally, and extract social norms from texts using in-context learning. Embedding Chain-of-Thought prompting in a human-AI collaborative framework, we build a high-quality dataset of 3,069 social norms aligned with social situations across Chinese and American cultures alongside corresponding free-text explanations. To test the ability of models to reason about social norms across cultures, we introduce the task of explainable social norm entailment, showing that existing models under 3B parameters have significant room for improvement in both automatic and human evaluation. Further analysis of cross-cultural norm differences based on our dataset shows empirical alignment with the social orientations framework, revealing several situational and descriptive nuances in norms across these cultures.
Abstract
Social computing prototypes probe the social behaviors that may arise in an envisioned system design. This prototyping practice is currently limited to recruiting small groups of people. Unfortunately, many challenges do not arise until a system is populated at a larger scale. Can a designer understand how a social system might behave when populated, and make adjustments to the design before the system falls prey to such challenges? We introduce social simulacra, a prototyping technique that generates a breadth of realistic social interactions that may emerge when a social computing system is populated. Social simulacra take as input the designer’s description of a community’s design—goal, rules, and member personas—and produce as output an instance of that design with simulated behavior, including posts, replies, and anti-social behaviors. We demonstrate that social simulacra shift the behaviors that they generate appropriately in response to design changes, and that they enable exploration of “what if?” scenarios where community members or moderators intervene. To power social simulacra, we contribute techniques for prompting a large language model to generate thousands of distinct community members and their social interactions with each other; these techniques are enabled by the observation that large language models’ training data already includes a wide variety of positive and negative behavior on social media platforms. In evaluations, we show that participants are often unable to distinguish social simulacra from actual community behavior and that social computing designers successfully refine their social computing designs when using social simulacra.
Documents
- instruction tuning for fewshot aspectbased sentiment analysis
- transforming sentiment analysis in the financial domain with chatgpt
- automating governing knowledge commons and contextual integrity (gkcci) privacy policy annotations with large language models
- prompt tuning large language models on personalized aspect extraction for recommendations
- toward unified controllable text generation via regular expression instruction
Abstract
Aspect-based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task which involves four elements from user-generated texts:aspect term, aspect category, opinion term, and sentiment polarity. Most computational approaches focus on some of the ABSA sub-taskssuch as tuple (aspect term, sentiment polarity) or triplet (aspect term, opinion term, sentiment polarity) extraction using either pipeline or joint modeling approaches. Recently, generative approaches have been proposed to extract all four elements as (one or more) quadrupletsfrom text as a single task. In this work, we take a step further and propose a unified framework for solving ABSA, and the associated sub-tasksto improve the performance in few-shot scenarios. To this end, we fine-tune a T5 model with instructional prompts in a multi-task learning fashion covering all the sub-tasks, as well as the entire quadruple prediction task. In experiments with multiple benchmark datasets, we show that the proposed multi-task prompting approach brings performance boost (by absolute 8.29 F1) in the few-shot learning setting.
Abstract
Financial sentiment analysis plays a crucial role in decoding market trends and guiding strategic trading decisions. Despite the deployment of advanced deep learning techniques and language models to refine sentiment analysis in finance, this study breaks new ground by investigating the potential of large language models, particularly ChatGPT 3.5, in financial sentiment analysis, with a strong emphasis on the foreign exchange market (forex). Employing a zero-shot prompting approach, we examine multiple ChatGPT prompts on a meticulously curated dataset of forex-related news headlines, measuring performance using metrics such as precision, recall, f1-score, and Mean Absolute Error (MAE) of the sentiment class. Additionally, we probe the correlation between predicted sentiment and market returns as an additional evaluation approach. ChatGPT, compared to FinBERT, a well-established sentiment analysis model for financial texts, exhibited approximately 35\% enhanced performance in sentiment classification and a 36\% higher correlation with market returns. By underlining the significance of prompt engineering, particularly in zero-shot contexts, this study spotlights ChatGPT's potential to substantially boost sentiment analysis in financial applications. By sharing the utilized dataset, our intention is to stimulate further research and advancements in the field of financial services.
Abstract
Identifying contextual integrity (CI) and governing knowledge commons (GKC)parameters in privacy policy texts can facilitate normative privacy analysis.However, GKC-CI annotation has heretofore required manual or crowdsourcedeffort. This paper demonstrates that high-accuracy GKC-CI parameter annotationof privacy policies can be performed automatically using large language models.We fine-tune 18 open-source and proprietary models on 21,588 GKC-CI annotationsfrom 16 ground truth privacy policies. Our best-performing model (fine-tunedGPT-3.5 Turbo with prompt engineering) has an accuracy of 86%, exceeding theperformance of prior crowdsourcing approaches despite the complexity of privacypolicy texts and the nuance of the GKC-CI annotation task. We apply ourbest-performing model to privacy policies from 164 popular online services,demonstrating the effectiveness of scaling GKC-CI annotation for dataexploration. We make all annotated policies as well as the training data andscripts needed to fine-tune our best-performing model publicly available forfuture research.
Abstract
Existing aspect extraction methods mostly rely on explicit or ground truth aspect information, or using data mining or machine learning approaches to extract aspects from implicit user feedback such as user reviews. It however remains under-explored how the extracted aspects can help generate more meaningful recommendations to the users. Meanwhile, existing research on aspect-based recommendations often relies on separate aspect extraction models or assumes the aspects are given, without accounting for the fact the optimal set of aspects could be dependent on the recommendation task at hand. In this work, we propose to combine aspect extraction together with aspect-based recommendations in an end-to-end manner, achieving the two goals together in a single framework. For the aspect extraction component, we leverage the recent advances in large language models and design a new prompt learning mechanism to generate aspects for the end recommendation task. For the aspect-based recommendation component, the extracted aspects are concatenated with the usual user and item features used by the recommendation model. The recommendation task mediates the learning of the user embeddings and item embeddings, which are used as soft prompts to generate aspects. Therefore, the extracted aspects are personalized and contextualized by the recommendation task. We showcase the effectiveness of our proposed method through extensive experiments on three industrial datasets, where our proposed framework significantly outperforms state-of-the-art baselines in both the personalized aspect extraction and aspect-based recommendation tasks. In particular, we demonstrate that it is necessary and beneficial to combine the learning of aspect extraction and aspect-based recommendation together. We also conduct extensive ablation studies to understand the contribution of each design component in our framework.
Abstract
Controllable text generation is a fundamental aspect of natural languagegeneration, with numerous methods proposed for different constraint types.However, these approaches often require significant architectural or decodingmodifications, making them challenging to apply to additional constraints orresolve different constraint combinations. To address this, our paperintroduces Regular Expression Instruction (REI), which utilizes aninstruction-based mechanism to fully exploit regular expressions' advantages touniformly model diverse constraints. Specifically, our REI supports all popularfine-grained controllable generation constraints, i.e., lexical, positional,and length, as well as their complex combinations, via regular expression-styleinstructions. Our method only requires fine-tuning on medium-scale languagemodels or few-shot, in-context learning on large language models, and requiresno further adjustment when applied to various constraint combinations.Experiments demonstrate that our straightforward approach yields high successrates and adaptability to various constraints while maintaining competitivenessin automatic metrics and outperforming most previous baselines.
Documents
- automatic chain of thought prompting in large language models
- boosting language models reasoning with chainofknowledge prompting
- synthetic prompting generating chainofthought demonstrations for large language models
- pal programaided language models
- lambada backward chaining for automated reasoning in natural language
Abstract
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like"Let's think step by step"to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the"Let's think step by step"prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot
Abstract
Recently, Chain-of-Thought (CoT) prompting has delivered success on complex reasoning tasks, which aims at designing a simple prompt like ``Let's think step by step'' or multiple in-context exemplars with well-designed rationales to elicit Large Language Models (LLMs) to generate intermediate reasoning steps. However, the generated rationales often come with mistakes, making unfactual and unfaithful reasoning chains. To mitigate this brittleness, we propose a novel Chain-of-Knowledge (CoK) prompting, where we aim at eliciting LLMs to generate explicit pieces of knowledge evidence in the form of structure triple. This is inspired by our human behaviors, i.e., we can draw a mind map or knowledge map as the reasoning evidence in the brain before answering a complex question. Benefiting from CoK, we additionally introduce a F^2-Verification method to estimate the reliability of the reasoning chains in terms of factuality and faithfulness. For the unreliable response, the wrong evidence can be indicated to prompt the LLM to rethink. Extensive experiments demonstrate that our method can further improve the performance of commonsense, factual, symbolic, and arithmetic reasoning tasks.
Abstract
Large language models can perform various reasoning tasks by using chain-of-thought prompting, which guides them to find answers through step-by-step demonstrations. However, the quality of the prompts depends on the demonstrations given to the models, and creating many of them by hand is costly. We introduce Synthetic prompting, a method that leverages a few handcrafted examples to prompt the model to generate more examples by itself, and selects effective demonstrations to elicit better reasoning. Our method alternates between a backward and forward process to generate new examples. The backward process generates a question that match a sampled reasoning chain, so that the question is solvable and clear. The forward process produces a more detailed reasoning chain for the question, improving the quality of the example. We evaluate our method on numerical, symbolic, and algorithmic reasoning tasks, and show that it outperforms existing prompting techniques.
Abstract
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as"chain-of-thought'', which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/ .
Abstract
Remarkable progress has been made on automated reasoning with natural text, by using Large Language Models (LLMs) and methods such as Chain-of-Thought prompting and Selection-Inference. These techniques search for proofs in the forward direction from axioms to the conclusion, which suffers from a combinatorial explosion of the search space, and thus high failure rates for problems requiring longer chains of reasoning. The classical automated reasoning literature has shown that reasoning in the backward direction (i.e. from intended conclusion to supporting axioms) is significantly more efficient at proof-finding. Importing this intuition into the LM setting, we develop a Backward Chaining algorithm, called LAMBADA, that decomposes reasoning into four sub-modules, that are simply implemented by few-shot prompted LLM inference. We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods on two challenging logical reasoning datasets, particularly when deep and accurate proof chains are required.
Documents
- a simple baseline for knowledgebased visual question answering
- unihd at tsar2022 shared task is compute all we need for lexical simplification
- visual prompting for adversarial robustness
- pre visionlanguage prompt learning with reparameterization encoder
- adversarial robustness of promptbased fewshot learning for natural language understanding
Abstract
This paper is on the problem of Knowledge-Based Visual Question Answering(KB-VQA). Recent works have emphasized the significance of incorporating bothexplicit (through external databases) and implicit (through LLMs) knowledge toanswer questions requiring external knowledge effectively. A common limitationof such approaches is that they consist of relatively complicated pipelines andoften heavily rely on accessing GPT-3 API. Our main contribution in this paperis to propose a much simpler and readily reproducible pipeline which, in anutshell, is based on efficient in-context learning by prompting LLaMA (1 and2) using question-informative captions as contextual information. Contrary torecent approaches, our method is training-free, does not require access toexternal databases or APIs, and yet achieves state-of-the-art accuracy on theOK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies tounderstand important aspects of our method. Our code is publicly available athttps://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA
Abstract
Previous state-of-the-art models for lexical simplification consist of complex pipelines with several components, each of which requires deep technical knowledge and fine-tuned interaction to achieve its full potential. As an alternative, we describe a frustratingly simple pipeline based on prompted GPT-3 responses, beating competing approaches by a wide margin in settings with few training instances. Our best-performing submission to the English language track of the TSAR-2022 shared task consists of an “ensemble” of six different prompt templates with varying context levels. As a late-breaking result, we further detail a language transfer technique that allows simplification in languages other than English. Applied to the Spanish and Portuguese subset, we achieve state-of-the-art results with only minor modification to the original prompts. Aside from detailing the implementation and setup, we spend the remainder of this work discussing the particularities of prompting and implications for future work. Code for the experiments is available online at https://github.com/dennlinger/TSAR-2022-Shared-Task.
Abstract
In this work, we leverage visual prompting (VP) to improve adversarial robustness of a fixed, pre-trained model at test time. Compared to conventional adversarial defenses, VP allows us to design universal (i.e., data-agnostic) input prompting templates, which have plug-and-play capabilities at test time to achieve desired model performance without introducing much computation overhead. Although VP has been successfully applied to improving model generalization, it remains elusive whether and how it can be used to defend against adversarial attacks. We investigate this problem and show that the vanilla VP approach is not effective in adversarial defense since a universal input prompt lacks the capacity for robust learning against sample-specific adversarial perturbations. To circumvent it, we propose a new VP method, termed Class-wise Adversarial Visual Prompting (C-AVP), to generate class-wise visual prompts so as to not only leverage the strengths of ensemble prompts but also optimize their interrelations to improve model robustness. Our experiments show that C-AVP outperforms the conventional VP method, with 2.1× standard accuracy gain and 2× robust accuracy gain. Compared to classical test-time defenses, C-AVP also yields a 42× inference time speedup. Code is available at https://github.com/Phoveran/vp-for-adversarial-robustness.
Abstract
Large pre-trained vision-language models such as CLIP have demonstrated great potential in zero-shot transferability to downstream tasks. However, to attain optimal performance, the manual selection of prompts is necessary to improve alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge for deploying such models in practice since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, recent work Context Optimization (CoOp) introduced the concept of prompt learning to the vision domain using learnable textual tokens. While CoOp can achieve substantial improvements over manual prompts, its learned context is worse generalizable to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE) - a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, PRE achieves a notable enhancement of 5.60% in average accuracy on New classes and 3% in Harmonic mean compared to CoOp in the 16-shot setting, all achieved within a good training time.
Abstract
State-of-the-art few-shot learning (FSL) methods leverage prompt-basedfine-tuning to obtain remarkable results for natural language understanding(NLU) tasks. While much of the prior FSL methods focus on improving downstreamtask performance, there is a limited understanding of the adversarialrobustness of such methods. In this work, we conduct an extensive study ofseveral state-of-the-art FSL methods to assess their robustness to adversarialperturbations. To better understand the impact of various factors towardsrobustness (or the lack of it), we evaluate prompt-based FSL methods againstfully fine-tuned models for aspects such as the use of unlabeled data, multipleprompts, number of few-shot examples, model size and type. Our results on sixGLUE tasks indicate that compared to fully fine-tuned models, vanilla FSLmethods lead to a notable relative drop in task performance (i.e., are lessrobust) in the face of adversarial perturbations. However, using (i) unlabeleddata for prompt-based FSL and (ii) multiple prompts flip the trend. We furtherdemonstrate that increasing the number of few-shot examples and model size leadto increased adversarial robustness of vanilla FSL methods. Broadly, our worksheds light on the adversarial robustness evaluation of prompt-based FSLmethods for NLU tasks.
Documents
- exploring generative ai assisted feedback writing for students' written responses to a physics conceptual question with prompt engineering and fewshot learning
- exploring efl students' prompt engineering in humanai story writing an activity theory perspective
- cases of efl secondary students' prompt engineering pathways to complete a writing task with chatgpt
- the utility of large language models and generative ai for education research
- retrievalaugmented generation to improve math questionanswering tradeoffs between groundedness and human preference
Abstract
Instructor's feedback plays a critical role in students' development ofconceptual understanding and reasoning skills. However, grading student writtenresponses and providing personalized feedback can take a substantial amount oftime. In this study, we explore using GPT-3.5 to write feedback to studentwritten responses to conceptual questions with prompt engineering and few-shotlearning techniques. In stage one, we used a small portion (n=20) of thestudent responses on one conceptual question to iteratively train GPT. Four ofthe responses paired with human-written feedback were included in the prompt asexamples for GPT. We tasked GPT to generate feedback to the other 16 responses,and we refined the prompt after several iterations. In stage two, we gave fourstudent researchers the 16 responses as well as two versions of feedback, onewritten by the authors and the other by GPT. Students were asked to rate thecorrectness and usefulness of each feedback, and to indicate which one wasgenerated by GPT. The results showed that students tended to rate the feedbackby human and GPT equally on correctness, but they all rated the feedback by GPTas more useful. Additionally, the successful rates of identifying GPT'sfeedback were low, ranging from 0.1 to 0.6. In stage three, we tasked GPT togenerate feedback to the rest of the student responses (n=65). The feedback wasrated by four instructors based on the extent of modification needed if theywere to give the feedback to students. All the instructors rated approximately70% of the feedback statements needing only minor or no modification. Thisstudy demonstrated the feasibility of using Generative AI as an assistant togenerating feedback for student written responses with only a relatively smallnumber of examples. An AI assistance can be one of the solutions tosubstantially reduce time spent on grading student written responses.
Abstract
This study applies Activity Theory to investigate how English as a foreign language (EFL) students prompt generative artificial intelligence (AI) tools during short story writing. Sixty-seven Hong Kong secondary school students created generative-AI tools using open-source language models and wrote short stories with them. The study collected and analyzed the students' generative-AI tools, short stories, and written reflections on their conditions or purposes for prompting. The research identified three main themes regarding the purposes for which students prompt generative-AI tools during short story writing: a lack of awareness of purposes, overcoming writer's block, and developing, expanding, and improving the story. The study also identified common characteristics of students' activity systems, including the sophistication of their generative-AI tools, the quality of their stories, and their school's overall academic achievement level, for their prompting of generative-AI tools for the three purposes during short story writing. The study's findings suggest that teachers should be aware of students' purposes for prompting generative-AI tools to provide tailored instructions and scaffolded guidance. The findings may also help designers provide differentiated instructions for users at various levels of story development when using a generative-AI tool.
Abstract
ChatGPT is a state-of-the-art (SOTA) chatbot. Although it has potential to support English as a foreign language (EFL) students' writing, to effectively collaborate with it, a student must learn to engineer prompts, that is, the skill of crafting appropriate instructions so that ChatGPT produces desired outputs. However, writing an appropriate prompt for ChatGPT is not straightforward for non-technical users who suffer a trial-and-error process. This paper examines the content of EFL students' ChatGPT prompts when completing a writing task and explores patterns in the quality and quantity of the prompts. The data come from iPad screen recordings of secondary school EFL students who used ChatGPT and other SOTA chatbots for the first time to complete the same writing task. The paper presents a case study of four distinct pathways that illustrate the trial-and-error process and show different combinations of prompt content and quantity. The cases contribute evidence for the need to provide prompt engineering education in the context of the EFL writing classroom, if students are to move beyond an individual trial-and-error process, learning a greater variety of prompt content and more sophisticated prompts to support their writing.
Abstract
The use of natural language processing (NLP) techniques in engineering education can provide valuable insights into the underlying processes involved in generating text. While accessing these insights can be labor-intensive if done manually, recent advances in NLP and large language models have made it a realistic option for individuals. This study explores and evaluates a combination of clustering, summarization, and prompting techniques to analyze over 1,000 student essays in which students discussed their career interests. The specific assignment prompted students to define and explain their career goals as engineers. Using text embedding representations of student responses, we clustered the responses together to identify thematically similar statements from students. The clustered responses were then summarized to quickly identify career interest themes. We also used a set of a priori codes about career satisfaction and sectors to demonstrate an alternative approach to using these generative text models to analyze student writing. The results of this study demonstrate the feasibility and usefulness of NLP techniques in engineering education research. By automating the initial analysis of student essays, researchers and educators can more efficiently and accurately identify key themes and patterns in student writing. The methods presented in this paper have broader applications for engineering education and research purposes beyond analyzing student essays. By explaining these methods to the engineering education community, readers can utilize them in their own contexts.
Abstract
For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) has led to a surge of interest in automating portions of the tutoring process - including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context - such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.
Documents
- images in language space exploring the suitability of large language models for vision & language tasks
- detecting hate speech with gpt3
- demystifying prompts in language models via perplexity estimation
- on measuring social biases in promptbased multitask learning
- learning performanceimproving code edits
Abstract
Large language models have demonstrated robust performance on variouslanguage tasks using zero-shot or few-shot learning paradigms. While beingactively researched, multimodal models that can additionally handle images asinput have yet to catch up in size and generality with language-only models. Inthis work, we ask whether language-only models can be utilised for tasks thatrequire visual input -- but also, as we argue, often require a strong reasoningcomponent. Similar to some recent related work, we make visual informationaccessible to the language model using separate verbalisation models.Specifically, we investigate the performance of open-source, open-accesslanguage models against GPT-3 on five vision-language tasks when giventextually-encoded visual information. Our results suggest that language modelsare effective for solving vision-language tasks even with limited samples. Thisapproach also enhances the interpretability of a model's output by providing ameans of tracing the output back through the verbalised image content.
Abstract
Sophisticated language models such as OpenAI's GPT-3 can generate hatefultext that targets marginalized groups. Given this capacity, we are interestedin whether large language models can be used to identify hate speech andclassify text as sexist or racist. We use GPT-3 to identify sexist and racisttext passages with zero-, one-, and few-shot learning. We find that with zero-and one-shot learning, GPT-3 can identify sexist or racist text with an averageaccuracy between 55 per cent and 67 per cent, depending on the category of textand type of learning. With few-shot learning, the model's accuracy can be ashigh as 85 per cent. Large language models have a role to play in hate speechdetection, and with further development they could eventually be used tocounter hate speech.
Abstract
Language models can be prompted to perform a wide variety of zero- andfew-shot learning problems. However, performance varies significantly with thechoice of prompt, and we do not yet understand why this happens or how to pickthe best prompts. In this work, we analyze the factors that contribute to thisvariance and establish a new empirical hypothesis: the performance of a promptis coupled with the extent to which the model is familiar with the language itcontains. Over a wide range of tasks, we show that the lower the perplexity ofthe prompt is, the better the prompt is able to perform the task. As a result,we devise a method for creating prompts: (1) automatically extend a small seedset of manually written prompts by paraphrasing using GPT3 and backtranslationand (2) choose the lowest perplexity prompts to get significant gains inperformance.
Abstract
Large language models trained on a mixture of NLP tasks that are convertedinto a text-to-text format using prompts, can generalize into novel forms oflanguage and handle novel tasks. A large body of work within prompt engineeringattempts to understand the effects of input forms and prompts in achievingsuperior performance. We consider an alternative measure and inquire whetherthe way in which an input is encoded affects social biases promoted in outputs.In this paper, we study T0, a large-scale multi-task text-to-text languagemodel trained using prompt-based learning. We consider two different forms ofsemantically equivalent inputs: question-answer format and premise-hypothesisformat. We use an existing bias benchmark for the former BBQ and create thefirst bias benchmark in natural language inference BBNLI with hand-writtenhypotheses while also converting each benchmark into the other form. Theresults on two benchmarks suggest that given two different formulations ofessentially the same input, T0 conspicuously acts more biased in questionanswering form, which is seen during training, compared to premise-hypothesisform which is unlike its training examples. Code and data are released underhttps://github.com/feyzaakyurek/bbnli.
Abstract
The waning of Moore's Law has shifted the focus of the tech industry towards alternative methods for continued performance gains. While optimizing compilers are a standard tool to help increase program efficiency, programmers continue to shoulder much responsibility in crafting and refactoring code with better performance characteristics. In this paper, we investigate the ability of large language models (LLMs) to suggest functionally correct, performance improving code edits. We hypothesize that language models can suggest such edits in ways that would be impractical for static analysis alone. We investigate these questions by curating a large-scale dataset of Performance-Improving Edits, PIE. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program's performance. We use PIE to evaluate and improve the capacity of large language models. Specifically, use examples from PIE to fine-tune multiple variants of CODEGEN, a billion-scale Transformer-decoder model. Additionally, we use examples from PIE to prompt OpenAI's CODEX using a few-shot prompting. By leveraging PIE, we find that both CODEX and CODEGEN can generate performance-improving edits, with speedups of more than 2.5x for over 25% of the programs, for C++ and Python, even after the C++ programs were compiled using the O3 optimization level. Crucially, we show that PIE allows CODEGEN, an open-sourced and 10x smaller model than CODEX, to match the performance of CODEX on this challenging task. Overall, this work opens new doors for creating systems and methods that can help programmers write efficient code.
Documents
- investigating the applicability of selfassessment tests for personality measurement of large language models
- increasing probability mass on answer choices does not always improve accuracy
- improving the reliability of large language models by leveraging uncertaintyaware incontext learning
- retrieving supporting evidence for generative question answering
- neuro symbolic reasoning for planning counterexample guided inductive synthesis using large language models and satisfiability solving
Abstract
As large language models (LLM) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of"personality"of LLMs using personality self-assessment tests. In this paper, we take three such studies on personality measurement of LLMs that use personality self-assessment tests created to study human behavior. We use the prompts used in these three different papers to measure the personality of the same LLM. We find that all three prompts lead very different personality scores. This simple test reveals that personality self-assessment scores in LLMs depend on the subjective choice of the prompter. Since we don't know the ground truth value of personality scores for LLMs as there is no correct answer to such questions, there's no way of claiming if one prompt is more or less correct than the other. We then introduce the property of option order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple choice question (MCQ) questions, we argue that the scores should also be robust to not just the prompt template but also the order in which the options are presented. This test unsurprisingly reveals that the answers to the self-assessment tests are not robust to the order of the options. These simple tests, done on ChatGPT and Llama2 models show that self-assessment personality tests created for humans are not appropriate for measuring personality in LLMs.
Abstract
When pretrained language models (LMs) are applied to discriminative taskssuch as multiple-choice questions, they place probability mass on vocabularytokens that aren't among the given answer choices. Spreading probability massacross multiple surface forms with identical meaning (such as "bath" and"bathtub") is thought to cause an underestimation of a model's trueperformance, referred to as the "surface form competition" (SFC) hypothesis.This has motivated the introduction of various probability normalizationmethods. However, many core questions remain unanswered. How do we measure SFC?Are there direct ways of reducing it, and does doing so improve taskperformance? We propose a mathematical formalism for SFC which allows us to quantify andbound its impact for the first time. We identify a simple method for reducingit -- namely, increasing probability mass on the given answer choices by a)including them in the prompt and b) using in-context learning with even justone example. We show this method eliminates the impact of SFC in the majorityof instances. Our experiments on three diverse datasets and six LMs revealseveral additional surprising findings. For example, both normalization andprompting methods for reducing SFC can be ineffective or even detrimental totask performance for some LMs. We conclude with practical insights foreffectively prompting LMs for multiple-choice tasks.
Abstract
In recent years, large-scale language models (LLMs) have gained attention for their impressive text generation capabilities. However, these models often face the challenge of"hallucination,"which undermines their reliability. In this study, we introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty. Human-defined methods for estimating uncertainty typically assume that"uncertainty is lower when the model's response is correct compared to when it is incorrect."However, setting a precise threshold to distinguish correctness is challenging. Therefore, we introduce uncertainty information as an intermediary variable that implicitly influences the model's behavior. Our innovative uncertainty-aware in-context learning framework involves fine-tuning the LLM using a calibration dataset. Our aim is to improve the model's responses by filtering out answers with high uncertainty while considering the model's knowledge limitations. We evaluate the model's knowledge by examining multiple responses to the same question for the presence of a correct answer. When the model lacks relevant knowledge, the response should indicate that the question cannot be answered. Conversely, when the model has relevant knowledge, the response should provide the correct answer. Extensive experiments confirm the effectiveness of our framework, leading to two key findings. First, the logit output values of the LLM partly reflect inherent uncertainty. Second, our model autonomously recognizes uncertainty, resulting in improved responses.
Abstract
Current large language models (LLMs) can exhibit near-human levels of performance on many natural language-based tasks, including open-domain question answering. Unfortunately, at this time, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report two simple experiments to automatically validate generated answers against a corpus. We base our experiments on questions and passages from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of sparse retrieval, dense retrieval and neural rerankers. In the first experiment, we validate the generated answer in its entirety. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate if the generated answer can be supported by the retrieved answer. In the second experiment, we consider the generated answer at a more granular level, prompting the LLM to extract a list of factual statements from the answer and verifying each statement separately. We query the corpus with each factual statement and then present the LLM with the statement and the corresponding retrieved evidence. The LLM is prompted to indicate if the statement can be supported and make necessary edits using the retrieved material. With an accuracy of over 80%, we find that an LLM is capable of verifying its generated answer when a corpus of supporting material is provided. However, manual assessment of a random sample of questions reveals that incorrect generated answers are missed by this verification process. While this verification process can reduce hallucinations, it can not entirely eliminate them.
Abstract
Generative large language models (LLMs) with instruct training such as GPT-4 can follow human-provided instruction prompts and generate human-like responses to these prompts. Apart from natural language responses, they have also been found to be effective at generating formal artifacts such as code, plans, and logical specifications from natural language prompts. Despite their remarkably improved accuracy, these models are still known to produce factually incorrect or contextually inappropriate results despite their syntactic coherence - a phenomenon often referred to as hallucination. This limitation makes it difficult to use these models to synthesize formal artifacts that are used in safety-critical applications. Unlike tasks such as text summarization and question-answering, bugs in code, plan, and other formal artifacts produced by LLMs can be catastrophic. We posit that we can use the satisfiability modulo theory (SMT) solvers as deductive reasoning engines to analyze the generated solutions from the LLMs, produce counterexamples when the solutions are incorrect, and provide that feedback to the LLMs exploiting the dialog capability of instruct-trained LLMs. This interaction between inductive LLMs and deductive SMT solvers can iteratively steer the LLM to generate the correct response. In our experiments, we use planning over the domain of blocks as our synthesis task for evaluating our approach. We use GPT-4, GPT3.5 Turbo, Davinci, Curie, Babbage, and Ada as the LLMs and Z3 as the SMT solver. Our method allows the user to communicate the planning problem in natural language; even the formulation of queries to SMT solvers is automatically generated from natural language. Thus, the proposed technique can enable non-expert users to describe their problems in natural language, and the combination of LLMs and SMT solvers can produce provably correct solutions.
Documents
- fineval a chinese financial domain knowledge evaluation benchmark for large language models
- claret pretraining a correlationaware contexttoevent transformer for eventcentric generation and classification
- tallrec an effective and efficient tuning framework to align large language model with recommendation
- what can large language models do in chemistry a comprehensive benchmark on eight tasks
- chef a comprehensive evaluation framework for standardized assessment of multimodal large language models
Abstract
Large language models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, yet their efficacy in more challenging and domain-specific tasks remains largely unexplored. This paper presents FinEval, a benchmark specifically designed for the financial domain knowledge in the LLMs. FinEval is a collection of high-quality multiple-choice questions covering Finance, Economy, Accounting, and Certificate. It includes 4,661 questions spanning 34 different academic subjects. To ensure a comprehensive model performance evaluation, FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts. Evaluating state-of-the-art Chinese and English LLMs on FinEval, the results show that only GPT-4 achieved an accuracy close to 70% in different prompt settings, indicating significant growth potential for LLMs in the financial domain knowledge. Our work offers a more comprehensive financial knowledge evaluation benchmark, utilizing data of mock exams and covering a wide range of evaluated LLMs.
Abstract
Generating new events given context with correlated ones plays a crucial rolein many event-centric reasoning tasks. Existing works either limit their scopeto specific scenarios or overlook event-level correlations. In this paper, wepropose to pre-train a general Correlation-aware context-to-Event Transformer(ClarET) for event-centric reasoning. To achieve this, we propose three novelevent-centric objectives, i.e., whole event recovering, contrastiveevent-correlation encoding and prompt-based event locating, which highlightevent-level correlations with effective training. The proposed ClarET isapplicable to a wide range of event-centric reasoning scenarios, consideringits versatility of (i) event-correlation types (e.g., causal, temporal,contrast), (ii) application formulations (i.e., generation and classification),and (iii) reasoning types (e.g., abductive, counterfactual and endingreasoning). Empirical fine-tuning results, as well as zero- and few-shotlearning, on 9 benchmarks (5 generation and 4 classification tasks covering 4reasoning types with diverse event correlations), verify its effectiveness andgeneralization ability.
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance acrossdiverse domains, thereby prompting researchers to explore their potential foruse in recommendation systems. Initial attempts have leveraged the exceptionalcapabilities of LLMs, such as rich knowledge and strong generalization throughIn-context Learning, which involves phrasing the recommendation task asprompts. Nevertheless, the performance of LLMs in recommendation tasks remainssuboptimal due to a substantial disparity between the training tasks for LLMsand recommendation tasks, as well as inadequate recommendation data duringpre-training. To bridge the gap, we consider building a Large RecommendationLanguage Model by tunning LLMs with recommendation data. To this end, wepropose an efficient and effective Tuning framework for Aligning LLMs withRecommendation, namely TALLRec. We have demonstrated that the proposed TALLRecframework can significantly enhance the recommendation capabilities of LLMs inthe movie and book domains, even with a limited dataset of fewer than 100samples. Additionally, the proposed framework is highly efficient and can beexecuted on a single RTX 3090 with LLaMA-7B. Furthermore, the fine-tuned LLMexhibits robust cross-domain generalization. Our code and data are available athttps://github.com/SAI990323/TALLRec.
Abstract
Large Language Models (LLMs) with strong abilities in natural languageprocessing tasks have emerged and have been applied in various kinds of areassuch as science, finance and software engineering. However, the capability ofLLMs to advance the field of chemistry remains unclear. In this paper, ratherthan pursuing state-of-the-art performance, we aim to evaluate capabilities ofLLMs in a wide range of tasks across the chemistry domain. We identify threekey chemistry-related capabilities including understanding, reasoning andexplaining to explore in LLMs and establish a benchmark containing eightchemistry tasks. Our analysis draws on widely recognized datasets facilitatinga broad exploration of the capacities of LLMs within the context of practicalchemistry. Five LLMs (GPT-4, GPT-3.5, Davinci-003, Llama and Galactica) areevaluated for each chemistry task in zero-shot and few-shot in-context learningsettings with carefully selected demonstration examples and specially craftedprompts. Our investigation found that GPT-4 outperformed other models and LLMsexhibit different competitive levels in eight chemistry tasks. In addition tothe key findings from the comprehensive benchmark analysis, our work providesinsights into the limitation of current LLMs and the impact of in-contextlearning settings on LLMs' performance across various chemistry tasks. The codeand datasets used in this study are available athttps://github.com/ChemFoundationModels/ChemLLMBench.
Abstract
Multimodal Large Language Models (MLLMs) have shown impressive abilities ininteracting with visual content with myriad potential downstream tasks.However, even though a list of benchmarks has been proposed, the capabilitiesand limitations of MLLMs are still not comprehensively understood, due to alack of a standardized and holistic evaluation framework. To this end, wepresent the first Comprehensive Evaluation Framework (ChEF) that canholistically profile each MLLM and fairly compare different MLLMs. First, westructure ChEF as four modular components, i.e., Scenario as scalablemultimodal datasets, Instruction as flexible instruction retrieving formulae,Inferencer as reliable question answering strategies, and Metric as indicativetask-specific score functions. Based on them, ChEF facilitates versatileevaluations in a standardized framework, and new evaluations can be built bydesigning new Recipes (systematic selection of these four components). Notably,current MLLM benchmarks can be readily summarized as recipes of ChEF. Second,we introduce 6 new recipes to quantify competent MLLMs' desired capabilities(or called desiderata, i.e., calibration, in-context learning, instructionfollowing, language performance, hallucination, and robustness) as reliableagents that can perform real-world multimodal interactions. Third, we conduct alarge-scale evaluation of 9 prominent MLLMs on 9 scenarios and 6 desiderata.Our evaluation summarized over 20 valuable observations concerning thegeneralizability of MLLMs across various scenarios and the composite capabilityof MLLMs required for multimodal interactions. We will publicly release all thedetailed implementations for further analysis, as well as an easy-to-usemodular toolkit for the integration of new recipes and models, so that ChEF canbe a growing evaluation framework for the MLLM community.
Documents
- an empirical evaluation of using large language models for automated unit test generation
- effective test generation using pretrained large language models and mutation testing
- are hard examples also harder to explain a study with human and modelgenerated explanations
- algo synthesizing algorithmic programs with generated oracle verifiers
- evaluating the decency and consistency of data validation tests generated by llms
Abstract
Unit tests play a key role in ensuring the correctness of software. However,manually creating unit tests is a laborious task, motivating the need forautomation. Large Language Models (LLMs) have recently been applied to thisproblem, utilizing additional training or few-shot learning on examples ofexisting tests. This paper presents a large-scale empirical evaluation on theeffectiveness of LLMs for automated unit test generation without additionaltraining or manual effort, providing the LLM with the signature andimplementation of the function under test, along with usage examples extractedfrom documentation. We also attempt to repair failed generated tests byre-prompting the model with the failing test and error message. We implementour approach in TestPilot, a test generation tool for JavaScript thatautomatically generates unit tests for all API functions in an npm package. Weevaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with atotal of 1,684 API functions. The generated tests achieve a median statementcoverage of 70.2% and branch coverage of 52.8%, significantly improving onNessie, a recent feedback-directed JavaScript test generation technique, whichachieves only 51.3% statement coverage and 25.6% branch coverage. We also findthat 92.8% of TestPilot's generated tests have no more than 50% similarity withexisting tests (as measured by normalized edit distance), with none of thembeing exact copies. Finally, we run TestPilot with two additional LLMs,OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, weobserved similar results with the former (68.2% median statement coverage), andsomewhat worse results with the latter (54.0% median statement coverage),suggesting that the effectiveness of the approach is influenced by the size andtraining set of the LLM, but does not fundamentally depend on the specificmodel.
Abstract
One of the critical phases in software development is software testing.Testing helps with identifying potential bugs and reducing maintenance costs.The goal of automated test generation tools is to ease the development of testsby suggesting efficient bug-revealing tests. Recently, researchers haveleveraged Large Language Models (LLMs) of code to generate unit tests. Whilethe code coverage of generated tests was usually assessed, the literature hasacknowledged that the coverage is weakly correlated with the efficiency oftests in bug detection. To improve over this limitation, in this paper, weintroduce MuTAP for improving the effectiveness of test cases generated by LLMsin terms of revealing bugs by leveraging mutation testing. Our goal is achievedby augmenting prompts with surviving mutants, as those mutants highlight thelimitations of test cases in detecting bugs. MuTAP is capable of generatingeffective test cases in the absence of natural language descriptions of theProgram Under Test (PUTs). We employ different LLMs within MuTAP and evaluatetheir performance on different benchmarks. Our results show that our proposedmethod is able to detect up to 28% more faulty human-written code snippets.Among these, 17% remained undetected by both the current state-of-the-art fullyautomated test generation tool (i.e., Pynguin) and zero-shot/few-shot learningapproaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57%on synthetic buggy code, outperforming all other approaches in our evaluation.Our findings suggest that although LLMs can serve as a useful tool to generatetest cases, they require specific post-processing steps to enhance theeffectiveness of the generated test cases which may suffer from syntactic orfunctional errors and may be ineffective in detecting certain types of bugs andtesting corner cases PUTs.
Abstract
Recent work on explainable NLP has shown that few-shot prompting can enable large pre-trained language models (LLMs) to generate grammatical and factual natural language explanations for data labels. In this work, we study the connection between explainability and sample hardness by investigating the following research question – “Are LLMs and humans equally good at explaining data labels for both easy and hard samples?” We answer this question by first collecting human-written explanations in the form of generalizable commonsense rules on the task of Winograd Schema Challenge (Winogrande dataset). We compare these explanations with those generated by GPT-3 while varying the hardness of the test samples as well as the in-context samples. We observe that (1) GPT-3 explanations are as grammatical as human explanations regardless of the hardness of the test samples, (2) for easy examples, GPT-3 generates highly supportive explanations but human explanations are more generalizable, and (3) for hard examples, human explanations are significantly better than GPT-3 explanations both in terms of label-supportiveness and generalizability judgements. We also find that hardness of the in-context examples impacts the quality of GPT-3 explanations. Finally, we show that the supportiveness and generalizability aspects of human explanations are also impacted by sample hardness, although by a much smaller margin than models.
Abstract
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness. ALGO first generates a reference oracle by prompting an LLM to exhaustively enumerate all the combinations of relevant variables. This oracle is then utilized to guide an arbitrary search strategy in exploring the algorithm space and to verify the synthesized algorithms. Our study shows that the LLM-generated oracles are correct for 88% of the cases. With the oracles as verifiers, ALGO can be integrated with any existing code generation model in a model-agnostic manner to enhance its performance. Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT, the current state-of-the-art model on CodeContests. We can also get 1.3x better pass rate over the ChatGPT Code Interpreter on unseen problems. The problem set we used for testing, the prompts we used, the verifier and solution programs, and the test cases generated by ALGO are available at https://github.com/zkx06111/ALGO.
Abstract
We investigated the potential of large language models (LLMs) in developingdataset validation tests. We carried out 96 experiments each for both GPT-3.5and GPT-4, examining different prompt scenarios, learning modes, temperaturesettings, and roles. The prompt scenarios were: 1) Asking for expectations, 2)Asking for expectations with a given context, 3) Asking for expectations afterrequesting a simulation, and 4) Asking for expectations with a provided datasample. For learning modes, we tested: 1) zero-shot, 2) one-shot, and 3)few-shot learning. We also tested four temperature settings: 0, 0.4, 0.6, and1. Furthermore, two distinct roles were considered: 1) "helpful assistant", 2)"expert data scientist". To gauge consistency, every setup was tested fivetimes. The LLM-generated responses were benchmarked against a gold standardsuite, created by an experienced data scientist knowledgeable about the data inquestion. We find there are considerable returns to the use of few-shotlearning, and that the more explicit the data setting can be the better. Thebest LLM configurations complement, rather than substitute, the gold standardresults. This study underscores the value LLMs can bring to the data cleaningand preparation stages of the data science workflow.
Documents
- one step of gradient descent is provably the optimal incontext learner with one layer of linear selfattention
- trained transformers learn linear models incontext
- how many pretraining tasks are needed for incontext learning of linear regression
- transformers as statisticians provable incontext learning with incontext algorithm selection
- causallm is not optimal for incontext learning
Abstract
Recent works have empirically analyzed in-context learning and shown thattransformers trained on synthetic linear regression tasks can learn toimplement ridge regression, which is the Bayes-optimal predictor, givensufficient capacity [Aky\"urek et al., 2023], while one-layer transformers withlinear self-attention and no MLP layer will learn to implement one step ofgradient descent (GD) on a least-squares linear regression objective [vonOswald et al., 2022]. However, the theory behind these observations remainspoorly understood. We theoretically study transformers with a single layer oflinear self-attention, trained on synthetic noisy linear regression data.First, we mathematically show that when the covariates are drawn from astandard Gaussian distribution, the one-layer transformer which minimizes thepre-training loss will implement a single step of GD on the least-squareslinear regression objective. Then, we find that changing the distribution ofthe covariates and weight vector to a non-isotropic Gaussian distribution has astrong impact on the learned algorithm: the global minimizer of thepre-training loss now implements a single step of $\textit{pre-conditioned}$GD. However, if only the distribution of the responses is changed, then thisdoes not have a large effect on the learned algorithm: even when the responsecomes from a more general family of $\textit{nonlinear}$ functions, the globalminimizer of the pre-training loss still implements a single step of GD on aleast-squares linear regression objective.
Abstract
Attention-based neural networks such as transformers have demonstrated aremarkable ability to exhibit in-context learning (ICL): Given a short promptsequence of tokens from an unseen task, they can formulate relevant per-tokenand next-token predictions without any parameter updates. By embedding asequence of labeled training data and unlabeled test data as a prompt, thisallows for transformers to behave like supervised learning algorithms. Indeed,recent work has shown that when training transformer architectures over randominstances of linear regression problems, these models' predictions mimic thoseof ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, weinvestigate the dynamics of ICL in transformers with a single linearself-attention layer trained by gradient flow on linear regression tasks. Weshow that despite non-convexity, gradient flow with a suitable randominitialization finds a global minimum of the objective function. At this globalminimum, when given a test prompt of labeled examples from a new predictiontask, the transformer achieves prediction error competitive with the bestlinear predictor over the test prompt distribution. We additionallycharacterize the robustness of the trained transformer to a variety ofdistribution shifts and show that although a number of shifts are tolerated,shifts in the covariate distribution of the prompts are not. Motivated by this,we consider a generalized ICL setting where the covariate distributions canvary across prompts. We show that although gradient flow succeeds at finding aglobal minimum in this setting, the trained transformer is still brittle undermild covariate shifts. We complement this finding with experiments on large,nonlinear transformer architectures which we show are more robust undercovariate shifts.
Abstract
Transformers pretrained on diverse tasks exhibit remarkable in-contextlearning (ICL) capabilities, enabling them to solve unseen tasks solely basedon input contexts without adjusting model parameters. In this paper, we studyICL in one of its simplest setups: pretraining a linearly parameterizedsingle-layer linear attention model for linear regression with a Gaussianprior. We establish a statistical task complexity bound for the attention modelpretraining, showing that effective pretraining only requires a small number ofindependent tasks. Furthermore, we prove that the pretrained model closelymatches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, byachieving nearly Bayes optimal risk on unseen tasks under a fixed contextlength. These theoretical findings complement prior experimental research andshed light on the statistical foundations of ICL.
Abstract
Neural sequence models based on the transformer architecture havedemonstrated remarkable \emph{in-context learning} (ICL) abilities, where theycan perform new tasks when prompted with training and test examples, withoutany parameter update to the model. This work first provides a comprehensivestatistical theory for transformers to perform ICL. Concretely, we show thattransformers can implement a broad class of standard machine learningalgorithms in context, such as least squares, ridge regression, Lasso, learninggeneralized linear models, and gradient descent on two-layer neural networks,with near-optimal predictive power on various in-context data distributions.Using an efficient implementation of in-context gradient descent as theunderlying mechanism, our transformer constructions admit mild size bounds, andcan be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show thattransformers can implement more complex ICL procedures involving\emph{in-context algorithm selection}, akin to what a statistician can do inreal life -- A \emph{single} transformer can adaptively select different baseICL algorithms -- or even perform qualitatively different tasks -- on differentinput sequences, without any explicit prompting of the right algorithm or task.We both establish this in theory by explicit constructions, and also observethis phenomenon experimentally. In theory, we construct two general mechanismsfor algorithm selection with concrete examples: pre-ICL testing, and post-ICLvalidation. As an example, we use the post-ICL validation mechanism toconstruct a transformer that can perform nearly Bayes-optimal ICL on achallenging task -- noisy linear models with mixed noise levels.Experimentally, we demonstrate the strong in-context algorithm selectioncapabilities of standard transformer architectures.
Abstract
Recent empirical evidence indicates that transformer based in-contextlearning performs better when using a prefix language model (prefixLM), inwhich in-context samples can all attend to each other, compared to causallanguage models (causalLM), which use auto-regressive attention that prohibitsin-context samples to attend to future samples. While this result is intuitive,it is not understood from a theoretical perspective. In this paper we take atheoretical approach and analyze the convergence behavior of prefixLM andcausalLM under a certain parameter construction. Our analysis shows that bothLM types converge to their stationary points at a linear rate, but that whileprefixLM converges to the optimal solution of linear regression, causalLMconvergence dynamics follows that of an online gradient descent algorithm,which is not guaranteed to be optimal even as the number of samples growsinfinitely. We supplement our theoretical claims with empirical experimentsover synthetic and real tasks and using various types of transformers. Ourexperiments verify that causalLM consistently underperforms prefixLM in allsettings.
Documents
- passive learning of active causal strategies in agents and language models
- training on thin air improve image classification with generated data
- federated fewshot learning for mobile nlp
- federated large language model a position paper
- prompting to distill boosting datafree knowledge distillation via reinforced prompt
Abstract
What can be learned about causality and experimentation from passive data? This question is salient given recent successes of passively-trained language models in interactive domains such as tool use. Passive learning is inherently limited. However, we show that purely passive learning can in fact allow an agent to learn generalizable strategies for determining and using causal structures, as long as the agent can intervene at test time. We formally illustrate that learning a strategy of first experimenting, then seeking goals, can allow generalization from passive learning in principle. We then show empirically that agents trained via imitation on expert data can indeed generalize at test time to infer and use causal links which are never present in the training data; these agents can also generalize experimentation strategies to novel variable sets never observed in training. We then show that strategies for causal intervention and exploitation can be generalized from passive data even in a more complex environment with high-dimensional observations, with the support of natural language explanations. Explanations can even allow passive learners to generalize out-of-distribution from perfectly-confounded training data. Finally, we show that language models, trained only on passive next-word prediction, can generalize causal intervention strategies from a few-shot prompt containing examples of experimentation, together with explanations and reasoning. These results highlight the surprising power of passive learning of active causal strategies, and may help to understand the behaviors and capabilities of language models.
Abstract
Acquiring high-quality data for training discriminative models is a crucialyet challenging aspect of building effective predictive systems. In this paper,we present Diffusion Inversion, a simple yet effective method that leveragesthe pre-trained generative model, Stable Diffusion, to generate diverse,high-quality training data for image classification. Our approach captures theoriginal data distribution and ensures data coverage by inverting images to thelatent space of Stable Diffusion, and generates diverse novel training imagesby conditioning the generative model on noisy versions of these vectors. Weidentify three key components that allow our generated images to successfullysupplant the original dataset, leading to a 2-3x enhancement in samplecomplexity and a 6.5x decrease in sampling time. Moreover, our approachconsistently outperforms generic prompt-based steering methods and KNNretrieval baseline across a wide range of datasets. Additionally, wedemonstrate the compatibility of our approach with widely-used dataaugmentation techniques, as well as the reliability of the generated data insupporting various neural architectures and enhancing few-shot learning.
Abstract
Natural language processing (NLP) sees rich mobile applications. To supportvarious language understanding tasks, a foundation NLP model is oftenfine-tuned in a federated, privacy-preserving setting (FL). This processcurrently relies on at least hundreds of thousands of labeled training samplesfrom mobile clients; yet mobile users often lack willingness or knowledge tolabel their data. Such an inadequacy of data labels is known as a few-shotscenario; it becomes the key blocker for mobile NLP applications. For the first time, this work investigates federated NLP in the few-shotscenario (FedFSL). By retrofitting algorithmic advances of pseudo labeling andprompt learning, we first establish a training pipeline that deliverscompetitive accuracy when only 0.05% (fewer than 100) of the training data islabeled and the remaining is unlabeled. To instantiate the workflow, we furtherpresent a system FeS, addressing the high execution cost with novel designs.(1) Curriculum pacing, which injects pseudo labels to the training workflow ata rate commensurate to the learning progress; (2) Representational diversity, amechanism for selecting the most learnable data, only for which pseudo labelswill be generated; (3) Co-planning of a model's training depth and layercapacity. Together, these designs reduce the training delay, client energy, andnetwork traffic by up to 46.0$\times$, 41.2$\times$ and 3000.0$\times$,respectively. Through algorithm/system co-design, FFNLP demonstrates that FLcan apply to challenging settings where most training samples are unlabeled.
Abstract
Large scale language models (LLM) have received significant attention and found diverse applications across various domains, but their development encounters challenges in real-world scenarios. These challenges arise due to the scarcity of public domain data availability and the need to maintain privacy with respect to private domain data. To address these issues, federated learning (FL) has emerged as a promising technology that enables collaborative training of shared models while preserving decentralized data. We propose the concept of federated LLM, which comprises three key components, i.e., federated LLM pre-training, federated LLM fine-tuning, and federated LLM prompt engineering. For each component, we discuss its advantage over traditional LLM training methods and propose specific engineering strategies for implementation. Furthermore, we explore the novel challenges introduced by the integration of FL and LLM. We analyze existing solutions and identify potential obstacles faced by these solutions within the context of federated LLM.
Abstract
Data-free knowledge distillation (DFKD) conducts knowledge distillation via eliminating the dependence of original training data, and has recently achieved impressive results in accelerating pre-trained language models. At the heart of DFKD is to reconstruct a synthetic dataset by inverting the parameters of the uncompressed model. Prior DFKD approaches, however, have largely relied on hand-crafted priors of the target data distribution for the reconstruction, which can be inevitably biased and often incompetent to capture the intrinsic distributions. To address this problem, we propose a prompt-based method, termed as PromptDFD, that allows us to take advantage of learned language priors, which effectively harmonizes the synthetic sentences to be semantically and grammatically correct. Specifically, PromptDFD leverages a pre-trained generative model to provide language priors and introduces a reinforced topic prompter to control data synthesis, making the generated samples thematically relevant and semantically plausible, and thus friendly to downstream tasks. As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance. In some cases, PromptDFD even gives rise to results on par with those from the data-driven knowledge distillation with access to the original training data.
Documents
- semanticoriented unlabeled priming for largescale language models
- few shot rationale generation using selftraining with dual teachers
- towards zerolabel language learning
- small visual language models can also be openended fewshot learners
- what do llms know about financial markets a case study on reddit market sentiment analysis
Abstract
Due to the high costs associated with finetuning large language models,various recent works propose to adapt them to specific tasks without anyparameter updates through in-context learning. Unfortunately, for in-contextlearning there is currently no way to leverage unlabeled data, which is oftenmuch easier to obtain in large quantities than labeled examples. In this work,we therefore investigate ways to make use of unlabeled examples to improve thezero-shot performance of pretrained language models without any finetuning: Weintroduce Semantic-Oriented Unlabeled Priming (SOUP), a method that classifiesexamples by retrieving semantically similar unlabeled examples, assigninglabels to them in a zero-shot fashion, and then using them for in-contextlearning. We also propose bag-of-contexts priming, a new priming strategy thatis more suitable for our setting and enables the usage of more examples thanfit into the context window.
Abstract
Self-rationalizing models that also generate a free-text explanation fortheir predicted labels are an important tool to build trustworthy AIapplications. Since generating explanations for annotated labels is a laboriousand costly pro cess, recent models rely on large pretrained language models(PLMs) as their backbone and few-shot learning. In this work we explore aself-training approach leveraging both labeled and unlabeled data to furtherimprove few-shot models, under the assumption that neither human writtenrationales nor annotated task labels are available at scale. We introduce anovel dual-teacher learning framework, which learns two specialized teachermodels for task prediction and rationalization using self-training and distillstheir knowledge into a multi-tasking student model that can jointly generatethe task label and rationale. Furthermore, we formulate a new loss function,Masked Label Regularization (MLR) which promotes explanations to be stronglyconditioned on predicted labels. Evaluation on three public datasetsdemonstrate that the proposed methods are effective in modeling task labels andgenerating faithful rationales.
Abstract
This paper explores zero-label learning in Natural Language Processing (NLP),whereby no human-annotated data is used anywhere during training and models aretrained purely on synthetic data. At the core of our framework is a novelapproach for better leveraging the powerful pretrained language models.Specifically, inspired by the recent success of few-shot inference on GPT-3, wepresent a training data creation procedure named Unsupervised Data Generation(UDG), which leverages few-shot prompts to synthesize high-quality trainingdata without real human annotations. Our method enables zero-label learning aswe train task-specific models solely on the synthetic data, yet we achievebetter or comparable results from strong baseline models trained onhuman-labeled data. Furthermore, when mixed with labeled data, our approachserves as a highly effective data augmentation procedure, achieving newstate-of-the-art results on the SuperGLUE benchmark.
Abstract
We present Self-Context Adaptation (SeCAt), a self-supervised approach thatunlocks open-ended few-shot abilities of small visual language models. Ourproposed adaptation algorithm explicitly learns from symbolic, yetself-supervised training tasks. Specifically, our approach imitates imagecaptions in a self-supervised way based on clustering a large pool of imagesfollowed by assigning semantically-unrelated names to clusters. By doing so, weconstruct the `self-context', a training signal consisting of interleavedsequences of image and pseudo-caption pairs and a query image for which themodel is trained to produce the right pseudo-caption. We demonstrate theperformance and flexibility of SeCAt on several multimodal few-shot datasets,spanning various granularities. By using models with approximately 1Bparameters we outperform the few-shot abilities of much larger models, such asFrozen and FROMAGe. SeCAt opens new possibilities for research in open-endedfew-shot learning that otherwise requires access to large or proprietarymodels.
Abstract
Market sentiment analysis on social media content requires knowledge of both financial markets and social media jargon, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce Chain-of-Thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model’s competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation.
Documents
- what can transformers learn incontext a case study of simple function classes
- metaincontext learning in large language models
- metavl transferring incontext learning ability from language models to visionlanguage models
- rethinking the role of scale for incontext learning an interpretabilitybased case study at 66 billion scale
- an explanation of incontext learning as implicit bayesian inference
Abstract
In-context learning refers to the ability of a model to condition on a promptsequence consisting of in-context examples (input-output pairs corresponding tosome task) along with a new query input, and generate the corresponding output.Crucially, in-context learning happens only at inference time without anyparameter updates to the model. While large language models such as GPT-3exhibit some ability to perform in-context learning, it is unclear what therelationship is between tasks on which this succeeds and what is present in thetraining data. To make progress towards understanding in-context learning, weconsider the well-defined problem of training a model to in-context learn afunction class (e.g., linear functions): that is, given data derived from somefunctions in the class, can we train a model to in-context learn "most"functions from this class? We show empirically that standard Transformers canbe trained from scratch to perform in-context learning of linear functions --that is, the trained model is able to learn unseen linear functions fromin-context examples with performance comparable to the optimal least squaresestimator. In fact, in-context learning is possible even under two forms ofdistribution shift: (i) between the training data of the model andinference-time prompts, and (ii) between the in-context examples and the queryinput during inference. We also show that we can train Transformers toin-context learn more complex function classes -- namely sparse linearfunctions, two-layer neural networks, and decision trees -- with performancethat matches or exceeds task-specific learning algorithms. Our code and modelsare available at https://github.com/dtsip/in-context-learning .
Abstract
Large language models have shown tremendous performance in a variety oftasks. In-context learning -- the ability to improve at a task after beingprovided with a number of demonstrations -- is seen as one of the maincontributors to their success. In the present paper, we demonstrate that thein-context learning abilities of large language models can be recursivelyimproved via in-context learning itself. We coin this phenomenonmeta-in-context learning. Looking at two idealized domains, a one-dimensionalregression task and a two-armed bandit task, we show that meta-in-contextlearning adaptively reshapes a large language model's priors over expectedtasks. Furthermore, we find that meta-in-context learning modifies thein-context learning strategies of such models. Finally, we extend our approachto a benchmark of real-world regression problems where we observe competitiveperformance to traditional learning algorithms. Taken together, our workimproves our understanding of in-context learning and paves the way towardadapting large language models to the environment they are applied purelythrough meta-in-context learning rather than traditional finetuning.
Abstract
Large-scale language models have shown the ability to adapt to a new task viaconditioning on a few demonstrations (i.e., in-context learning). However, inthe vision-language domain, most large-scale pre-trained vision-language (VL)models do not possess the ability to conduct in-context learning. How can weenable in-context learning for VL models? In this paper, we study aninteresting hypothesis: can we transfer the in-context learning ability fromthe language domain to VL domain? Specifically, we first meta-trains a languagemodel to perform in-context learning on NLP tasks (as in MetaICL); then wetransfer this model to perform VL tasks by attaching a visual encoder. Ourexperiments suggest that indeed in-context learning ability can be transferredcross modalities: our model considerably improves the in-context learningcapability on VL tasks and can even compensate for the size of the modelsignificantly. On VQA, OK-VQA, and GQA, our method could outperform thebaseline model while having 20 times fewer parameters.
Abstract
Language models have been shown to perform better with an increase in scaleon a wide variety of tasks via the in-context learning paradigm. In this paper,we investigate the hypothesis that the ability of a large language model toin-context learn-perform a task is not uniformly spread across all of itsunderlying components. Using a 66 billion parameter language model (OPT-66B)across a diverse set of 14 downstream tasks, we find this is indeed the case:$\sim$70% of attention heads and $\sim$20% of feed forward networks can beremoved with minimal decline in task performance. We find substantial overlapin the set of attention heads (un)important for in-context learning acrosstasks and number of in-context examples. We also address our hypothesis througha task-agnostic lens, finding that a small set of attention heads in OPT-66Bscore highly on their ability to perform primitive induction operationsassociated with in-context learning, namely, prefix matching and copying. Theseinduction heads overlap with task-specific important heads, reinforcingarguments by Olsson et al. (arXiv:2209.11895) regarding induction headgenerality to more sophisticated behaviors associated with in-context learning.Overall, our study provides several insights that indicate large languagemodels may be under-trained for in-context learning and opens up questions onhow to pre-train language models to more effectively perform in-contextlearning.
Abstract
Large language models (LMs) such as GPT-3 have the surprising ability to doin-context learning, where the model learns to do a downstream task simply byconditioning on a prompt consisting of input-output examples. The LM learnsfrom these examples without being explicitly pretrained to learn. Thus, it isunclear what enables in-context learning. In this paper, we study howin-context learning can emerge when pretraining documents have long-rangecoherence. Here, the LM must infer a latent document-level concept to generatecoherent next tokens during pretraining. At test time, in-context learningoccurs when the LM also infers a shared latent concept between examples in aprompt. We prove when this occurs despite a distribution mismatch betweenprompts and pretraining data in a setting where the pretraining distribution isa mixture of HMMs. In contrast to messy large-scale datasets used to train LMscapable of in-context learning, we generate a small-scale synthetic dataset(GINC) where Transformers and LSTMs both exhibit in-context learning. Beyondthe theory, experiments on GINC exhibit large-scale real-world phenomenaincluding improved in-context performance with model scaling (despite the samepretraining loss), sensitivity to example order, and instances where zero-shotis better than few-shot in-context learning.
Documents
- summqa at mediqachat 2023incontext learning with gpt4 for medical summarization
- scalable approach to medical wearable postmarket surveillance
- spec a soft promptbased calibration on mitigating performance variability in clinical notes summarization
- early diagnostic markers of lateonset neonatal sepsis
- the potential and pitfalls of using a large language model such as chatgpt or gpt4 as a clinical assistant
Abstract
Medical dialogue summarization is challenging due to the unstructured natureof medical conversations, the use of medical terminology in gold summaries, andthe need to identify key information across multiple symptom sets. We present anovel system for the Dialogue2Note Medical Summarization tasks in the MEDIQA2023 Shared Task. Our approach for section-wise summarization (Task A) is atwo-stage process of selecting semantically similar dialogues and using thetop-k similar dialogues as in-context examples for GPT-4. For full-notesummarization (Task B), we use a similar solution with k=1. We achieved 3rdplace in Task A (2nd among all teams), 4th place in Task B Division WiseSummarization (2nd among all teams), 15th place in Task A Section HeaderClassification (9th among all teams), and 8th place among all teams in Task B.Our results highlight the effectiveness of few-shot prompting for this task,though we also identify several weaknesses of prompting-based approaches. Wecompare GPT-4 performance with several finetuned baselines. We find that GPT-4summaries are more abstractive and shorter. We make our code publiclyavailable.
Abstract
Objective We sought to develop a weak supervision-based approach to demonstrate feasibility of post-market surveillance of wearable devices that render AF pre-diagnosis. Materials and Methods Two approaches were evaluated to reduce clinical note labeling overhead for creating a training set for a classifier: one using programmatic codes, and the other using prompts to large language models (LLMs). Probabilistically labeled notes were then used to fine-tune a classifier, which identified patients with AF pre-diagnosis mentions in a note. A retrospective cohort study was conducted, where the baseline characteristics and subsequent care patterns of patients identified by the classifier were compared against those who did not receive pre-diagnosis. Results Label model derived from prompt-based labeling heuristics using LLMs (precision = 0.67, recall = 0.83, F1 = 0.74) nearly achieved the performance of code-based heuristics (precision = 0.84, recall = 0.72, F1 = 0.77), while cutting down the cost to create a labeled training set. The classifier learned on the labeled notes accurately identified patients with AF pre-diagnosis (precision = 0.85, recall = 0.81, F1 = 0.83). Those patients who received pre-diagnosis exhibited different demographic and comorbidity characteristics, and were enriched for anticoagulation and eventual diagnosis of AF. At the index diagnosis, existence of pre-diagnosis did not stratify patients on clinical characteristics, but did correlate with anticoagulant prescription. Discussion and Conclusion Our work establishes the feasibility of an EHR-based surveillance system for wearable devices that render AF pre-diagnosis. Further work is necessary to generalize these findings for patient populations at other sites.
Abstract
Electronic health records (EHRs) store an extensive array of patient information, encompassing medical histories, diagnoses, treatments, and test outcomes. These records are crucial for enabling healthcare providers to make well-informed decisions regarding patient care. Summarizing clinical notes further assists healthcare professionals in pinpointing potential health risks and making better-informed decisions. This process contributes to reducing errors and enhancing patient outcomes by ensuring providers have access to the most pertinent and current patient data. Recent research has shown that incorporating prompts with large language models (LLMs) substantially boosts the efficacy of summarization tasks. However, we show that this approach also leads to increased output variance, resulting in notably divergent outputs even when prompts share similar meanings. To tackle this challenge, we introduce a model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization. Experimental findings on multiple clinical note tasks and LLMs indicate that our method not only bolsters performance but also effectively curbs variance for various LLMs, providing a more uniform and dependable solution for summarizing vital medical information.
Abstract
Objective: Early diagnosis of nosocomial infections in newborns is a great challenge, because in the initial phase of systemic infection, clinical symptoms are often non-specific, and routinely used hematological markers are not sufficiently informative. The aim of this study was to determine the potential of early inflammatory markers to diagnose late-onset neonatal sepsis—procalcitonin (PCT), interleukin 6 (IL-6), interleukin 8 (IL-8) and endocan (ESM-1). Material and methods: A prospective clinical–epidemiological study was conducted in a third-level NICU in Pleven, Bulgaria. Patients with suspected late-onset sepsis and healthy controls were tested. A sandwich ELISA method was used to measure the serum concentrations of biomarkers. Results: Sixty newborns were included, of which 35% symptomatic and infected, 33.3% symptomatic but uninfected and 31.7% asymptomatic controls. The mean values of PCT, IL-6, I/T index and PLT differ significantly in the three groups. For ESM-1, IL-8 and CRP, the difference was statistically insignificant. The best sensitivity (78%) and negative predictive value (84%) was found for IL-6. The combinations of PCT + IL-6 and PCT + IL-6+ I/T+ PLT showed very good diagnostic potential. Conclusion: The introduction into the routine practice of indicators such as PCT and IL-6 may provide an opportunity to promptly optimize the diagnostic and therapeutic approach to LOS.
Abstract
Recent studies have demonstrated promising performance of ChatGPT and GPT-4 on several medical domain tasks. However, none have assessed its performance using a large-scale real-world electronic health record database, nor have evaluated its utility in providing clinical diagnostic assistance for patients across a full range of disease presentation. We performed two analyses using ChatGPT and GPT-4, one to identify patients with specific medical diagnoses using a real-world large electronic health record database and the other, in providing diagnostic assistance to healthcare workers in the prospective evaluation of hypothetical patients. Our results show that GPT-4 across disease classification tasks with chain of thought and few-shot prompting can achieve performance as high as 96% F1 scores. For patient assessment, GPT-4 can accurately diagnose three out of four times. However, there were mentions of factually incorrect statements, overlooking crucial medical findings, recommendations for unnecessary investigations and overtreatment. These issues coupled with privacy concerns, make these models currently inadequate for real world clinical use. However, limited data and time needed for prompt engineering in comparison to configuration of conventional machine learning workflows highlight their potential for scalability across healthcare applications.
Documents
- lowresource authorship style transfer can nonfamous authors be imitated
- controlling personality style in dialogue with zeroshot promptbased learning
- fewshot adaptation for parsing contextual utterances with llms
- promptandrerank a method for zeroshot and fewshot arbitrary textual style transfer with small language models
- log parsing with promptbased fewshot learning
Abstract
Authorship style transfer involves altering text to match the style of atarget author whilst preserving the original meaning. Existing unsupervisedapproaches like STRAP have largely focused on style transfer to target authorswith many examples of their writing style in books, speeches, or otherpublished works. This high-resource training data requirement (often greaterthan 100,000 words) makes these approaches primarily useful for style transferto published authors, politicians, or other well-known figures and authorshipstyles, while style transfer to non-famous authors has not been well-studied.We introduce the \textit{low-resource authorship style transfer} task, a morechallenging class of authorship style transfer where only a limited amount oftext in the target author's style may exist. In our experiments, wespecifically choose source and target authors from Reddit and style transfertheir Reddit posts, limiting ourselves to just 16 posts (on average ~500 words)of the target author's style. Style transfer accuracy is typically measured byhow often a classifier or human judge will classify an output as written by thetarget author. Recent authorship representations models excel at authorshipidentification even with just a few writing samples, making automaticevaluation of this task possible for the first time through evaluation metricswe propose. Our results establish an in-context learning technique we developas the strongest baseline, though we find current approaches do not yet achievemastery of this challenging task. We release our data and implementations toencourage further investigation.
Abstract
Prompt-based or in-context learning has achieved high zero-shot performance on many natural language generation (NLG) tasks. Here we explore the performance of prompt-based learning for simultaneously controlling the personality and the semantic accuracy of an NLG for task-oriented dialogue. We experiment with prompt-based learning on the PERSONAGE restaurant recommendation corpus to generate semantically and stylistically-controlled text for 5 different Big-5 personality types: agreeable, disagreeable, conscientious, unconscientious, and extravert. We test two different classes of discrete prompts to generate utterances for a particular personality style: (1) prompts that demonstrate generating directly from a meaning representation that includes a personality specification; and (2) prompts that rely on first converting the meaning representation to a textual pseudo-reference, and then using the pseudo-reference in a textual style transfer (TST) prompt. In each case, we show that we can vastly improve performance by over-generating outputs and ranking them, testing several ranking functions based on automatic metrics for semantic accuracy, personality-match, and fluency. We also test whether NLG personality demonstrations from the restaurant domain can be used with meaning representations for the video game domain to generate personality stylized utterances about video games. Our findings show that the TST prompts produces the highest semantic accuracy (78.46% for restaurants and 87.6% for video games) and personality accuracy (100% for restaurants and 97% for video games). Our results on transferring personality style to video game utterances are surprisingly good. To our knowledge, there is no previous work testing the application of prompt-based learning to simultaneously controlling both style and semantic accuracy in NLG.
Abstract
We evaluate the ability of semantic parsers based on large language models(LLMs) to handle contextual utterances. In real-world settings, there typicallyexists only a limited number of annotated contextual utterances due toannotation cost, resulting in an imbalance compared to non-contextualutterances. Therefore, parsers must adapt to contextual utterances with a fewtraining examples. We examine four major paradigms for doing so inconversational semantic parsing i.e., Parse-with-Utterance-History,Parse-with-Reference-Program, Parse-then-Resolve, and Rewrite-then-Parse. Tofacilitate such cross-paradigm comparisons, we constructSMCalFlow-EventQueries, a subset of contextual examples from SMCalFlow withadditional annotations. Experiments with in-context learning and fine-tuningsuggest that Rewrite-then-Parse is the most promising paradigm whenholistically considering parsing accuracy, annotation cost, and error types.
Abstract
We propose a method for arbitrary textual style transfer (TST)—the task of transforming a text into any given style—utilizing general-purpose pre-trained language models. Our method, Prompt-and-Rerank, is based on a mathematical formulation of the TST task, decomposing it into three constituent components: textual similarity, target style strength, and fluency. Our method uses zero-shot or few-shot prompting to obtain a set of candidate generations in the target style, and then re-ranks them according to the three components. Our method enables small pre-trained language models to perform on par with state-of-the-art large-scale models while using two orders of magnitude less compute and memory. We also investigate the effect of model size and prompt design (e.g., prompt paraphrasing and delimiter-pair choice) on style transfer quality across seven diverse textual style transfer datasets, finding, among other things, that delimiter-pair choice has a large impact on performance, and that models have biases on the direction of style transfer.
Abstract
Logs generated by large-scale software systems provide crucial informationfor engineers to understand the system status and diagnose problems of thesystems. Log parsing, which converts raw log messages into structured data, isthe first step to enabling automated log analytics. Existing log parsersextract the common part as log templates using statistical features. However,these log parsers often fail to identify the correct templates and parametersbecause: 1) they often overlook the semantic meaning of log messages, and 2)they require domain-specific knowledge for different log datasets. To addressthe limitations of existing methods, in this paper, we propose LogPPT tocapture the patterns of templates using prompt-based few-shot learning. LogPPTutilises a novel prompt tuning method to recognise keywords and parametersbased on a few labelled log data. In addition, an adaptive random samplingalgorithm is designed to select a small yet diverse training set. We haveconducted extensive experiments on 16 public log datasets. The experimentalresults show that LogPPT is effective and efficient for log parsing.
Documents
- fewshot reranking for multihop qa via language model prompting
- generate rather than retrieve large language models are strong context generators
- robut a systematic study of table qa robustness against humanannotated adversarial perturbations
- augmented embeddings for custom retrievals
- interleaving retrieval with chainofthought reasoning for knowledgeintensive multistep questions
Abstract
We study few-shot reranking for multi-hop QA with open-domain questions. Toalleviate the need for a large number of labeled question-document pairs forretriever training, we propose PromptRank, which relies on large languagemodels prompting for multi-hop path reranking. PromptRank first constructs aninstruction-based prompt that includes a candidate document path and thencomputes the relevance score between a given question and the path based on theconditional likelihood of the question given the path prompt according to alanguage model. PromptRank yields strong retrieval performance on HotpotQA withonly 128 training examples compared to state-of-the-art methods trained onthousands of examples -- 73.6 recall@10 by PromptRank vs. 77.8 by PathRetrieverand 77.5 by multi-hop dense retrieval. Code available athttps://github.com/mukhal/PromptRank
Abstract
Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts, resulting in the generated documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue system. Notably, GenRead achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Lastly, we demonstrate the model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.
Abstract
Despite significant progress having been made in question answering ontabular data (Table QA), it's unclear whether, and to what extent existingTable QA models are robust to task-specific perturbations, e.g., replacing keyquestion entities or shuffling table columns. To systematically study therobustness of Table QA models, we propose a benchmark called RobuT, whichbuilds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) andincludes human-annotated adversarial perturbations in terms of table header,table content, and question. Our results indicate that both state-of-the-artTable QA models and large language models (e.g., GPT-3) with few-shot learningfalter in these adversarial sets. We propose to address this problem by usinglarge language models to generate adversarial examples to enhance training,which significantly improves the robustness of Table QA models. Our data andcode is publicly available at https://github.com/yilunzhao/RobuT.
Abstract
Information retrieval involves selecting artifacts from a corpus that are most relevant to a given search query. The flavor of retrieval typically used in classical applications can be termed as homogeneous and relaxed, where queries and corpus elements are both natural language (NL) utterances (homogeneous) and the goal is to pick most relevant elements from the corpus in the Top-K, where K is large, such as 10, 25, 50 or even 100 (relaxed). Recently, retrieval is being used extensively in preparing prompts for large language models (LLMs) to enable LLMs to perform targeted tasks. These new applications of retrieval are often heterogeneous and strict -- the queries and the corpus contain different kinds of entities, such as NL and code, and there is a need for improving retrieval at Top-K for small values of K, such as K=1 or 3 or 5. Current dense retrieval techniques based on pretrained embeddings provide a general-purpose and powerful approach for retrieval, but they are oblivious to task-specific notions of similarity of heterogeneous artifacts. We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval. Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding. We empirically validate our approach by showing improvements over the state-of-the-art general-purpose embeddings-based baseline.
Abstract
Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning.
Documents
- solving probability and statistics problems by program synthesis
- a neural network solves, explains, and generates university math problems by program synthesis and fewshot learning at human level
- limits of an ai program for solving college math problems
- using large language model to solve and explain physics word problems approaching human level
- thought propagation an analogical approach to complex reasoning with large language models
Abstract
We solve university level probability and statistics questions by programsynthesis using OpenAI's Codex, a Transformer trained on text and fine-tuned oncode. We transform course problems from MIT's 18.05 Introduction to Probabilityand Statistics and Harvard's STAT110 Probability into programming tasks. Wethen execute the generated code to get a solution. Since these course questionsare grounded in probability, we often aim to have Codex generate probabilisticprograms that simulate a large number of probabilistic dependencies to computeits solution. Our approach requires prompt engineering to transform thequestion from its original form to an explicit, tractable form that results ina correct program and solution. To estimate the amount of work needed totranslate an original question into its tractable form, we measure thesimilarity between original and transformed questions. Our work is the first tointroduce a new dataset of university-level probability and statistics problemsand solve these problems in a scalable fashion using the program synthesiscapabilities of large language models.
Abstract
We demonstrate that a neural network pre-trained on text and fine-tuned oncode solves mathematics course problems, explains solutions, and generates newquestions at a human level. We automatically synthesize programs using few-shotlearning and OpenAI's Codex transformer and execute them to solve courseproblems at 81% automatic accuracy. We curate a new dataset of questions fromMIT's largest mathematics courses (Single Variable and Multivariable Calculus,Differential Equations, Introduction to Probability and Statistics, LinearAlgebra, and Mathematics for Computer Science) and Columbia University'sComputational Linear Algebra. We solve questions from a MATH dataset (onPrealgebra, Algebra, Counting and Probability, Intermediate Algebra, NumberTheory, and Precalculus), the latest benchmark of advanced mathematics problemsdesigned to assess mathematical reasoning. We randomly sample questions andgenerate solutions with multiple modalities, including numbers, equations, andplots. The latest GPT-3 language model pre-trained on text automatically solvesonly 18.8% of these university questions using zero-shot learning and 30.8%using few-shot learning and the most recent chain of thought prompting. Incontrast, program synthesis with few-shot learning using Codex fine-tuned oncode generates programs that automatically solve 81% of these questions. Ourapproach improves the previous state-of-the-art automatic solution accuracy onthe benchmark topics from 8.8% to 81.1%. We perform a survey to evaluate thequality and difficulty of generated questions. This work is the first toautomatically solve university-level mathematics course questions at a humanlevel and the first work to explain and generate university-level mathematicscourse questions at scale, a milestone for higher education.
Abstract
Drori et al. (2022) report that "A neural network solves, explains, andgenerates university math problems by program synthesis and few-shot learningat human level ... [It] automatically answers 81\% of university-levelmathematics problems." The system they describe is indeed impressive; however,the above description is very much overstated. The work of solving the problemsis done, not by a neural network, but by the symbolic algebra package Sympy.Problems of various formats are excluded from consideration. The so-called"explanations" are just rewordings of lines of code. Answers are marked ascorrect that are not in the form specified in the problem. Most seriously, itseems that in many cases the system uses the correct answer given in the testcorpus to guide its path to solving the problem.
Abstract
Our work demonstrates that large language model (LLM) pre-trained on textscan not only solve pure math word problems, but also physics word problems,whose solution requires calculation and inference based on prior physicalknowledge. We collect and annotate the first physics word problemdataset-PhysQA, which contains over 1000 junior high school physics wordproblems (covering Kinematics, Mass&Density, Mechanics, Heat, Electricity).Then we use OpenAI' s GPT3.5 to generate the answer of these problems and foundthat GPT3.5 could automatically solve 49.3% of the problems through zero-shotlearning and 73.2% through few-shot learning. This result demonstrates that byusing similar problems and their answers as prompt, LLM could solve elementaryphysics word problems approaching human level performance. In addition tosolving problems, GPT3.5 can also summarize the knowledge or topics covered bythe problems, provide relevant explanations, and generate new physics wordproblems based on the input. Our work is the first research to focus on theautomatic solving, explanation, and generation of physics word problems acrossvarious types and scenarios, and we achieve an acceptable and state-of-the-artaccuracy. This underscores the potential of LLMs for further applications insecondary education.
Abstract
Large Language Models (LLMs) have achieved remarkable success in reasoning tasks with the development of prompting methods. However, existing prompting approaches cannot reuse insights of solving similar problems and suffer from accumulated errors in multi-step reasoning, since they prompt LLMs to reason \textit{from scratch}. To address these issues, we propose \textbf{\textit{Thought Propagation} (TP)}, which explores the analogous problems and leverages their solutions to enhance the complex reasoning ability of LLMs. These analogous problems are related to the input one, with reusable solutions and problem-solving strategies. Thus, it is promising to propagate insights of solving previous analogous problems to inspire new problem-solving. To achieve this, TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one. Then, TP reuses the results of analogous problems to directly yield a new solution or derive a knowledge-intensive plan for execution to amend the initial solution obtained from scratch. TP is compatible with existing prompting approaches, allowing plug-and-play generalization and enhancement in a wide range of tasks without much labor in task-specific prompt engineering. Experiments across three challenging tasks demonstrate TP enjoys a substantial improvement over the baselines by an average of 12\% absolute increase in finding the optimal solutions in Shortest-path Reasoning, 13\% improvement of human preference in Creative Writing, and 15\% enhancement in the task completion rate of LLM-Agent Planning.
Documents
- instanceaware prompt learning for language understanding and generation
- fewshot multimodal sentiment analysis based on multimodal probabilistic fusion prompts
- fewshot stance detection via targetaware prompt distillation
- psg promptbased sequence generation for acronym extraction
- knowprompt knowledgeaware prompttuning with synergistic optimization for relation extraction
Abstract
Recently, prompt learning has become a new paradigm to utilize pre-trainedlanguage models (PLMs) and achieves promising results in downstream tasks witha negligible increase of parameters. The current usage of discrete andcontinuous prompts assumes that the prompt is fixed for a specific task and allsamples in the task share the same prompt. However, a task may contain quitediverse samples in which some are easy and others are difficult, and diverseprompts are desirable. In this paper, we propose an instance-aware promptlearning method that learns a different prompt for each instance. Specifically,we suppose that each learnable prompt token has a different contribution todifferent instances, and we learn the contribution by calculating the relevancescore between an instance and each prompt token. The contribution weightedprompt would be instance aware. We apply our method to both unidirectional andbidirectional PLMs on both language understanding and generation tasks.Extensive experiments demonstrate that our method obtains considerableimprovements compared to strong baselines. Especially, our method achieves thestate-of-the-art on the SuperGLUE few-shot learning benchmark.
Abstract
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media. However, existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect. Thus, there is a need to address the challenge of few-shot multimodal sentiment analysis. To tackle this problem, we propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario. Specifically, we start by introducing a Consistently Distributed Sampling approach called CDS, which ensures that the few-shot dataset has the same category distribution as the full dataset. Unlike previous approaches primarily using prompts based on the text modality, we design unified multimodal prompts to reduce discrepancies between different modalities and dynamically incorporate multimodal demonstrations into the context of each multimodal instance. To enhance the model's robustness, we introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for each input. Our extensive experiments on six datasets demonstrate the effectiveness of our approach. First, our method outperforms strong baselines in the multimodal few-shot setting. Furthermore, under the same amount of data (1% of the full dataset), our CDS-based experimental results significantly outperform those based on previously sampled datasets constructed from the same number of instances of each class.
Abstract
Stance detection aims to identify whether the author of a text is in favorof, against, or neutral to a given target. The main challenge of this taskcomes two-fold: few-shot learning resulting from the varying targets and thelack of contextual information of the targets. Existing works mainly focus onsolving the second issue by designing attention-based models or introducingnoisy external knowledge, while the first issue remains under-explored. In thispaper, inspired by the potential capability of pre-trained language models(PLMs) serving as knowledge bases and few-shot learners, we propose tointroduce prompt-based fine-tuning for stance detection. PLMs can provideessential contextual information for the targets and enable few-shot learningvia prompts. Considering the crucial role of the target in stance detectiontask, we design target-aware prompts and propose a novel verbalizer. Instead ofmapping each label to a concrete word, our verbalizer maps each label to avector and picks the label that best captures the correlation between thestance and the target. Moreover, to alleviate the possible defect of dealingwith varying targets with a single hand-crafted prompt, we propose to distillthe information learned from multiple prompts. Experimental results show thesuperior performance of our proposed model in both full-data and few-shotscenarios.
Abstract
Acronym extraction aims to find acronyms (i.e., short-forms) and theirmeanings (i.e., long-forms) from the documents, which is important forscientific document understanding (SDU@AAAI-22) tasks. Previous works aredevoted to modeling this task as a paragraph-level sequence labeling problem.However, it lacks the effective use of the external knowledge, especially whenthe datasets are in a low-resource setting. Recently, the prompt-based methodwith the vast pre-trained language model can significantly enhance theperformance of the low-resourced downstream tasks. In this paper, we propose aPrompt-based Sequence Generation (PSG) method for the acronym extraction task.Specifically, we design a template for prompting the extracted acronym textswith auto-regression. A position extraction algorithm is designed forextracting the position of the generated answers. The results on the acronymextraction of Vietnamese and Persian in a low-resource setting show that theproposed method outperforms all other competitive state-of-the-art (SOTA)methods.
Abstract
Recently, prompt-tuning has achieved promising results for specific few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and it is cumbersome and time-consuming to obtain a suitable label word. Furthermore, there exists abundant semantic and prior knowledge among the relation labels that cannot be ignored. To this end, we focus on incorporating knowledge among relation labels into prompt-tuning for relation extraction and propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words. Then, we synergistically optimize their representation with structured constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach. Our code and datasets are available in GitHub1 for reproducibility.
Documents
- better integrating vision and semantics for improving fewshot classification
- multiprompt with depth partitioned crossmodal learning
- multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation
- not all features matter enhancing fewshot clip with adaptive prior refinement
- gopro generate and optimize prompts in clip using selfsupervised learning
Abstract
Some recent methods address few-shot classification by integrating visual and semantic prototypes. However, they usually ignore the difference in feature structure between the visual and semantic modalities, which leads to limited performance improvements. In this paper, we propose a novel method, called bimodal integrator (BMI), to better integrate visual and semantic prototypes. In BMI, we first construct a latent space for each modality via a variational autoencoder, and then align the semantic latent space to the visual latent space. Through this semantics-to-vision alignment, the semantic modality is mapped to the visual latent space and has the same feature structure as the visual modality. As a result, the visual and semantic prototypes can be better integrated. In addition, based on the multivariate Gaussian distribution and the prompt engineering, a data augmentation scheme is designed to ensure the accuracy of modality alignment during the training process. Experimental results demonstrate that BMI significantly improves few-shot classification, making simple baselines outperform the most advanced methods on miniImageNet and tieredImageNet datasets.
Abstract
In recent years, soft prompt learning methods have been proposed to fine-tune large-scale vision-language pre-trained models for various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input for models with frozen parameters. However, they often employ a single prompt to describe class contexts, failing to capture categories' diverse attributes adequately. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we incorporate prior information from manually designed templates and learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a $79.28$ harmonic mean, averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), demonstrating significant competitiveness compared to state-of-the-art prompting methods.
Abstract
Emotion Recognition in Conversation (ERC) plays an important role in driving the development of human-machine interaction. Emotions can exist in multiple modalities, and multimodal ERC mainly faces two problems: (1) the noise problem in the cross-modal information fusion process, and (2) the prediction problem of less sample emotion labels that are semantically similar but different categories. To address these issues and fully utilize the features of each modality, we adopted the following strategies: first, deep emotion cues extraction was performed on modalities with strong representation ability, and feature filters were designed as multimodal prompt information for modalities with weak representation ability. Then, we designed a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, allowing prompt information to participate in encoding textual features and being fused with multi-level textual information to obtain better multimodal fusion features. Finally, we used the Hybrid Contrastive Learning (HCL) strategy to optimize the model's ability to handle labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of multimodal fusion and supervised contrastive learning to mine the information of labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art models in ERC on two benchmark datasets.
Abstract
The popularity of Contrastive Language-Image Pre-training (CLIP) haspropelled its application to diverse downstream vision tasks. To improve itscapacity on downstream tasks, few-shot learning has become a widely-adoptedtechnique. However, existing methods either exhibit limited performance orsuffer from excessive learnable parameters. In this paper, we propose APE, anAdaptive Prior rEfinement method for CLIP's pre-trained knowledge, whichachieves superior accuracy with high computational efficiency. Via a priorrefinement module, we analyze the inter-class disparity in the downstream dataand decouple the domain-specific knowledge from the CLIP-extracted cache model.On top of that, we introduce two model variants, a training-free APE and atraining-required APE-T. We explore the trilateral affinities between the testimage, prior cache model, and textual representations, and only enable alightweight category-residual module to be trained. For the average accuracyover 11 benchmarks, both APE and APE-T attain state-of-the-art and respectivelyoutperform the second-best by +1.59% and +1.99% under 16 shots with x30 lesslearnable parameters.
Abstract
Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically learn such prompts, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, considering the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms the state-of-the-art prompting techniques on three challenging domain generalization tasks across multiple benchmarks by a significant margin. Our code is available at https://github.com/mainaksingha01/GOPro.
Documents
- jvnv a corpus of japanese emotional speech with verbal content and nonverbal expressions
- megatts 2 zeroshot texttospeech with arbitrary length speech prompts
- naturalspeech 2 latent diffusion models are natural and zeroshot speech and singing synthesizers
- audioldm 2 learning holistic audio generation with selfsupervised pretraining
- scaling asr improves zero and few shot learning
Abstract
We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to produce emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that JVNV has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that there still exists a gap between the performance of synthesizing read-aloud speech and emotional speech, and adding NVs in the speech makes the task even harder, which brings new challenges for this task and makes JVNV a valuable resource for relevant works in the future. To our best knowledge, JVNV is the first speech corpus that generates scripts automatically using large language models.
Abstract
Zero-shot text-to-speech aims at synthesizing voices with unseen speechprompts. Previous large-scale multispeaker TTS models have successfullyachieved this goal with an enrolled recording within 10 seconds. However, mostof them are designed to utilize only short speech prompts. The limitedinformation in short speech prompts significantly hinders the performance offine-grained identity imitation. In this paper, we introduce Mega-TTS 2, ageneric zero-shot multispeaker TTS model that is capable of synthesizing speechfor unseen speakers with arbitrary-length prompts. Specifically, we 1) design amulti-reference timbre encoder to extract timbre information from multiplereference speeches; 2) and train a prosody language model with arbitrary-lengthspeech prompts; With these designs, our model is suitable for prompts ofdifferent lengths, which extends the upper bound of speech quality forzero-shot text-to-speech. Besides arbitrary-length prompts, we introducearbitrary-source prompts, which leverages the probabilities derived frommultiple P-LLM outputs to produce expressive and controlled prosody.Furthermore, we propose a phoneme-level auto-regressive duration model tointroduce in-context learning capabilities to duration modeling. Experimentsdemonstrate that our method could not only synthesize identity-preservingspeech with a short prompt of an unseen speaker but also achieve improvedperformance with longer speech prompts. Audio samples can be found inhttps://mega-tts.github.io/mega2_demo/.
Abstract
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.
Abstract
Although audio generation shares commonalities across different types ofaudio, such as speech, music, and sound effects, designing models for each typerequires careful consideration of specific objectives and biases that cansignificantly differ from those of other types. To bring us closer to a unifiedperspective of audio generation, this paper proposes a framework that utilizesthe same learning method for speech, music, and sound effect generation. Ourframework introduces a general representation of audio, called "language ofaudio" (LOA). Any audio can be translated into LOA based on AudioMAE, aself-supervised pre-trained representation learning model. In the generationprocess, we translate any modalities into LOA by using a GPT-2 model, and weperform self-supervised audio generation learning with a latent diffusion modelconditioned on LOA. The proposed framework naturally brings advantages such asin-context learning abilities and reusable self-supervised pretrained AudioMAEand latent diffusion models. Experiments on the major benchmarks oftext-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-artor competitive performance against previous approaches. Our code, pretrainedmodel, and demo are available at https://audioldm.github.io/audioldm2.
Abstract
With 4.5 million hours of English speech from 10 different sources across 120countries and models of up to 10 billion parameters, we explore the frontiersof scale for automatic speech recognition. We propose data selection techniquesto efficiently scale training data to find the most valuable samples in massivedatasets. To efficiently scale model sizes, we leverage various optimizationssuch as sparse transducer loss and model sharding. By training 1-10B parameteruniversal English ASR models, we push the limits of speech recognitionperformance across many domains. Furthermore, our models learn powerful speechrepresentations with zero and few-shot capabilities on novel domains and stylesof speech, exceeding previous results across multiple in-house and publicbenchmarks. For speakers with disorders due to brain damage, our best zero-shotand few-shot models achieve 22% and 60% relative improvement on the AphasiaBanktest set, respectively, while realizing the best performance on public socialmedia videos. Furthermore, the same universal model reaches equivalentperformance with 500x less in-domain data on the SPGISpeech financial-domaindataset.
Documents
- inboxbart get instructions into biomedical multitask learning
- sweeping heterogeneity with smart mops mixture of prompts for llm task adaptation
- decomposed prompting a modular approach for solving complex tasks
- jiuzhang 20 a unified chinese pretrained language model for multitask mathematical problem solving
- evaluating the instructionfollowing robustness of large language models to prompt injection
Abstract
Single-task models have proven pivotal in solving specific tasks; however, they have limitations in real-world applications where multi-tasking is necessary and domain shifts are exhibited. Recently, instructional prompts have shown significant improvement towards multi-task generalization; however, the effect of instructional prompts and Multi-Task Learning (MTL) has not been systematically studied in the biomedical domain. Motivated by this, this paper explores the impact of instructional prompts for biomedical MTL. We introduce the BoX, a collection of 32 instruction tasks for Biomedical NLP across (X) various categories. Using this meta-dataset, we propose a unified model termed In-BoXBART, that can jointly learn all tasks of the BoX without any task-specific modules. To the best of our knowledge, this is the first attempt to propose a unified model in the biomedical domain and use instructions to achieve generalization across several biomedical tasks. Experimental results indicate that the proposed model: 1) outperforms the single-task baseline by ~3% and multi-task (without instruction) baseline by ~18% on an average, and 2) shows ~23% improvement compared to the single-task baseline in few-shot learning (i.e., 32 instances per task) on an average. Our analysis indicates that there is significant room for improvement across tasks in the BoX, implying the scope for future research direction.
Abstract
Large Language Models (LLMs) have the ability to solve a variety of tasks, such as text summarization and mathematical questions, just out of the box, but they are often trained with a single task in mind. Due to high computational costs, the current trend is to use prompt instruction tuning to better adjust monolithic, pretrained LLMs for new -- but often individual -- downstream tasks. Thus, how one would expand prompt tuning to handle -- concomitantly -- heterogeneous tasks and data distributions is a widely open question. To address this gap, we suggest the use of \emph{Mixture of Prompts}, or MoPs, associated with smart gating functionality: the latter -- whose design is one of the contributions of this paper -- can identify relevant skills embedded in different groups of prompts and dynamically assign combined experts (i.e., collection of prompts), based on the target task. Additionally, MoPs are empirically agnostic to any model compression technique applied -- for efficiency reasons -- as well as instruction data source and task composition. In practice, MoPs can simultaneously mitigate prompt training"interference"in multi-task, multi-source scenarios (e.g., task and data heterogeneity across sources), as well as possible implications from model approximations. As a highlight, MoPs manage to decrease final perplexity from $\sim20\%$ up to $\sim70\%$, as compared to baselines, in the federated scenario, and from $\sim 3\%$ up to $\sim30\%$ in the centralized scenario.
Abstract
Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on long-context multi-hop QA task, we can more effectively teach the sub-tasks via our separate sub-tasks prompts; and on open-domain multi-hop QA, we can incorporate a symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks. Datasets, Code and Prompts available at https://github.com/allenai/DecomP.
Abstract
Although pre-trained language models~(PLMs) have recently advanced theresearch progress in mathematical reasoning, they are not specially designed asa capable multi-task solver, suffering from high cost for multi-task deployment(\eg a model copy for a task) and inferior performance on complex mathematicalproblems in practical applications. To address these issues, in this paper, wepropose \textbf{JiuZhang~2.0}, a unified Chinese PLM specially for multi-taskmathematical problem solving. Our idea is to maintain a moderate-sized modeland employ the \emph{cross-task knowledge sharing} to improve the modelcapacity in a multi-task setting. Specially, we construct aMixture-of-Experts~(MoE) architecture for modeling mathematical text, so as tocapture the common mathematical knowledge across tasks. For optimizing the MoEarchitecture, we design \emph{multi-task continual pre-training} and\emph{multi-task fine-tuning} strategies for multi-task adaptation. Thesetraining strategies can effectively decompose the knowledge from the task dataand establish the cross-task sharing via expert networks. In order to furtherimprove the general capacity of solving different complex tasks, we leveragelarge language models~(LLMs) as complementary models to iteratively refine thegenerated solution by our PLM, via in-context learning. Extensive experimentshave demonstrated the effectiveness of our model.
Abstract
Large Language Models (LLMs) have shown remarkable proficiency in followinginstructions, making them valuable in customer-facing applications. However,their impressive capabilities also raise concerns about the amplification ofrisks posed by adversarial instructions, which can be injected into the modelinput by third-party attackers to manipulate LLMs' original instructions andprompt unintended actions and content. Therefore, it is crucial to understandLLMs' ability to accurately discern which instructions to follow to ensuretheir safe deployment in real-world scenarios. In this paper, we propose apioneering benchmark for automatically evaluating the robustness ofinstruction-following LLMs against adversarial instructions injected in theprompt. The objective of this benchmark is to quantify the extent to which LLMsare influenced by injected adversarial instructions and assess their ability todifferentiate between these injected adversarial instructions and original userinstructions. Through experiments conducted with state-of-the-artinstruction-following LLMs, we uncover significant limitations in theirrobustness against adversarial instruction injection attacks. Furthermore, ourfindings indicate that prevalent instruction-tuned models are prone to being``overfitted'' to follow any instruction phrase in the prompt without trulyunderstanding which instructions should be followed. This highlights the needto address the challenge of training models to comprehend prompts instead ofmerely following instruction phrases and completing the text. The data and codecan be found at \url{https://github.com/Leezekun/Adv-Instruct-Eval}.
Documents
- comparative analysis of gpt4 and human graders in evaluating praise given to students in synthetic dialogues
- comparative analysis of gpt4 and human graders in evaluating human tutors giving praise to students
- improving language model negotiation with selfplay and incontext learning from ai feedback
- chatgpt4pcg competition characterlike level generation for science birds
- large language models are biased to overestimate profoundness
Abstract
Research suggests that providing specific and timely feedback to human tutorsenhances their performance. However, it presents challenges due to thetime-consuming nature of assessing tutor performance by human evaluators. Largelanguage models, such as the AI-chatbot ChatGPT, hold potential for offeringconstructive feedback to tutors in practical settings. Nevertheless, theaccuracy of AI-generated feedback remains uncertain, with scant researchinvestigating the ability of models like ChatGPT to deliver effective feedback.In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in atutor-student setting. We use two different prompting approaches, the zero-shotchain of thought and the few-shot chain of thought, to identify specificcomponents of effective praise based on five criteria. These approaches arethen compared to the results of human graders for accuracy. Our goal is toassess the extent to which GPT-4 can accurately identify each praise criterion.We found that both zero-shot and few-shot chain of thought approaches yieldcomparable results. GPT-4 performs moderately well in identifying instanceswhen the tutor offers specific and immediate praise. However, GPT-4underperforms in identifying the tutor's ability to deliver sincere praise,particularly in the zero-shot prompting scenario where examples of sinceretutor praise statements were not provided. Future work will focus on enhancingprompt engineering, developing a more general tutoring rubric, and evaluatingour method using real-life tutoring dialogues.
Abstract
Research suggests that providing specific and timely feedback to human tutors enhances their performance. However, it presents challenges due to the time-consuming nature of assessing tutor performance by human evaluators. Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. Nevertheless, the accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback. In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a tutor-student setting. We use two different prompting approaches, the zero-shot chain of thought and the few-shot chain of thought, to identify specific components of effective praise based on five criteria. These approaches are then compared to the results of human graders for accuracy. Our goal is to assess the extent to which GPT-4 can accurately identify each praise criterion. We found that both zero-shot and few-shot chain of thought approaches yield comparable results. GPT-4 performs moderately well in identifying instances when the tutor offers specific and immediate praise. However, GPT-4 underperforms in identifying the tutor's ability to deliver sincere praise, particularly in the zero-shot prompting scenario where examples of sincere tutor praise statements were not provided. Future work will focus on enhancing prompt engineering, developing a more general tutoring rubric, and evaluating our method using real-life tutoring dialogues.
Abstract
We study whether multiple large language models (LLMs) can autonomouslyimprove each other in a negotiation game by playing, reflecting, andcriticizing. We are interested in this question because if LLMs were able toimprove each other, it would imply the possibility of creating strong AI agentswith minimal human intervention. We ask two LLMs to negotiate with each other,playing the roles of a buyer and a seller, respectively. They aim to reach adeal with the buyer targeting a lower price and the seller a higher one. Athird language model, playing the critic, provides feedback to a player toimprove the player's negotiation strategies. We let the two agents playmultiple rounds, using previous negotiation history and AI feedback asin-context demonstrations to improve the model's negotiation strategyiteratively. We use different LLMs (GPT and Claude) for different roles and usethe deal price as the evaluation metric. Our experiments reveal multipleintriguing findings: (1) Only a subset of the language models we consider canself-play and improve the deal price from AI feedback, weaker models either donot understand the game's rules or cannot incorporate AI feedback for furtherimprovement. (2) Models' abilities to learn from the feedback differ whenplaying different roles. For example, it is harder for Claude-instant toimprove as the buyer than as the seller. (3) When unrolling the game tomultiple rounds, stronger agents can consistently improve their performance bymeaningfully using previous experiences and iterative AI feedback, yet have ahigher risk of breaking the deal. We hope our work provides insightful initialexplorations of having models autonomously improve each other with game playingand AI feedback.
Abstract
This paper presents the first ChatGPT4PCG Competition at the 2023 IEEE Conference on Games. The objective of this competition is for participants to create effective prompts for ChatGPT--enabling it to generate Science Birds levels with high stability and character-like qualities--fully using their creativity as well as prompt engineering skills. ChatGPT is a conversational agent developed by OpenAI. Science Birds is selected as the competition platform because designing an Angry Birds-like level is not a trivial task due to the in-game gravity; the quality of the levels is determined by their stability. To lower the entry barrier to the competition, we limit the task to the generation of capitalized English alphabetical characters. We also allow only a single prompt to be used for generating all the characters. Here, the quality of the generated levels is determined by their stability and similarity to the given characters. A sample prompt is provided to participants for their reference. An experiment is conducted to determine the effectiveness of several modified versions of this sample prompt on level stability and similarity by testing them on several characters. To the best of our knowledge, we believe that ChatGPT4PCG is the first competition of its kind and hope to inspire enthusiasm for prompt engineering in procedural content generation.
Abstract
Recent advancements in natural language processing by large language models(LLMs), such as GPT-4, have been suggested to approach Artificial GeneralIntelligence. And yet, it is still under dispute whether LLMs possess similarreasoning abilities to humans. This study evaluates GPT-4 and various otherLLMs in judging the profoundness of mundane, motivational, and pseudo-profoundstatements. We found a significant statement-to-statement correlation betweenthe LLMs and humans, irrespective of the type of statements and the promptingtechnique used. However, LLMs systematically overestimate the profoundness ofnonsensical statements, with the exception of Tk-instruct, which uniquelyunderestimates the profoundness of statements. Only few-shot learning prompts,as opposed to chain-of-thought prompting, draw LLMs ratings closer to humans.Furthermore, this work provides insights into the potential biases induced byReinforcement Learning from Human Feedback (RLHF), inducing an increase in thebias to overestimate the profoundness of statements.