Documents
- chemical identification and indexing in pubmed articles via bert and texttotext approaches
- daprompt deterministic assumption prompt learning for event causality identification
- relationprompt leveraging prompts to generate synthetic data for zeroshot relation triplet extraction
- continuous prompt tuning based textual entailment model for ecommerce entity typing
- multilabel fewshot icd coding as autoregressive generation with prompt
Abstract
The Biocreative VII Track-2 challenge consists of named entity recognition,entity-linking (or entity-normalization), and topic indexing tasks -- withentities and topics limited to chemicals for this challenge. Named entityrecognition is a well-established problem and we achieve our best performancewith BERT-based BioMegatron models. We extend our BERT-based approach to theentity linking task. After the second stage of pretraining BioBERT with ametric-learning loss strategy called self-alignment pretraining (SAP), we linkentities based on the cosine similarity between their SAP-BioBERT wordembeddings. Despite the success of our named entity recognition experiments, wefind the chemical indexing task generally more challenging. In addition to conventional NER methods, we attempt both named entityrecognition and entity linking with a novel text-to-text or "prompt" basedmethod that uses generative language models such as T5 and GPT. We achieveencouraging results with this new approach.
Abstract
Event Causality Identification (ECI) aims at determining whether there is a causal relation between two event mentions. Conventional prompt learning designs a prompt template to first predict an answer word and then maps it to the final decision. Unlike conventional prompts, we argue that predicting an answer word may not be a necessary prerequisite for the ECI task. Instead, we can first make a deterministic assumption on the existence of causal relation between two events and then evaluate its rationality to either accept or reject the assumption. The design motivation is to try the most utilization of the encyclopedia-like knowledge embedded in a pre-trained language model. In light of such considerations, we propose a deterministic assumption prompt learning model, called DAPrompt, for the ECI task. In particular, we design a simple deterministic assumption template concatenating with the input event pair, which includes two masks as predicted events' tokens. We use the probabilities of predicted events to evaluate the assumption rationality for the final event causality decision. Experiments on the EventStoryLine corpus and Causal-TimeBank corpus validate our design objective in terms of significant performance improvements over the state-of-the-art algorithms.
Abstract
Despite the importance of relation extraction in building and representing knowledge, less research is focused on generalizing to unseen relations types. We introduce the task setting of Zero-Shot Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation extraction methods. Given an input sentence, each extracted triplet consists of the head entity, relation label, and tail entity where the relation label is not seen at the training stage. To solve ZeroRTE, we propose to synthesize relation examples by prompting language models to generate structured texts. Concretely, we unify language model prompts and structured text approaches to design a structured prompt template for generating synthetic relation samples when conditioning on relation label prompts (RelationPrompt). To overcome the limitation for extracting multiple relation triplets in a sentence, we design a novel Triplet Search Decoding method. Experiments on FewRel and Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation classification. Our code and data are available at github.com/declare-lab/RelationPrompt.
Abstract
The explosion of e-commerce has caused the need for processing and analysis of product titles, like entity typing in product titles. However, the rapid activity in e-commerce has led to the rapid emergence of new entities, which is difficult for general entity typing. Besides, product titles in e-commerce have very different language styles from text data in general domain. In order to handle new entities in product titles and address the special language styles of product titles in e-commerce domain, we propose our textual entailment model with continuous prompt tuning based hypotheses and fusion embeddings for e-commerce entity typing. First, we reformulate entity typing into a textual entailment problem to handle new entities that are not present during training. Second, we design a model to automatically generate textual entailment hypotheses using a continuous prompt tuning method, which can generate better textual entailment hypotheses without manual design. Third, we utilize the fusion embeddings of BERT embedding and Char-acterBERT embedding to solve the problem that the language styles of product titles in e-commerce are different from that of general domain. To analyze the effect of each contribution, we compare the performance of entity typing and textual entailment model, and conduct ablation studies on continuous prompt tuning and fusion embeddings. We also evaluate the impact of different prompt template initialization for the continuous prompt tuning. We show our proposed model improves the average F1 score by around 2% compared to the baseline BERT entity typing model.
Abstract
Automatic International Classification of Diseases (ICD) coding aims to assign multiple ICD codes to a medical note with an average of 3,000+ tokens. This task is challenging due to the high-dimensional space of multi-label assignment (155,000+ ICD code candidates) and the long-tail challenge - Many ICD codes are infrequently assigned yet infrequent ICD codes are important clinically. This study addresses the long-tail challenge by transforming this multi-label classification task into an autoregressive generation task. Specifically, we first introduce a novel pretraining objective to generate free text diagnoses and procedures using the SOAP structure, the medical logic physicians use for note documentation. Second, instead of directly predicting the high dimensional space of ICD codes, our model generates the lower dimension of text descriptions, which then infers ICD codes. Third, we designed a novel prompt template for multi-label classification. We evaluate our Generation with Prompt (GPsoap) model with the benchmark of all code assignment (MIMIC-III-full) and few shot ICD code assignment evaluation benchmark (MIMIC-III-few). Experiments on MIMIC-III-few show that our model performs with a marco F130.2, which substantially outperforms the previous MIMIC-III-full SOTA model (marco F1 4.3) and the model specifically designed for few/zero shot setting (marco F1 18.7). Finally, we design a novel ensemble learner, a cross-attention reranker with prompts, to integrate previous SOTA and our best few-shot coding predictions. Experiments on MIMIC-III-full show that our ensemble learner substantially improves both macro and micro F1, from 10.4 to 14.6 and from 58.2 to 59.1, respectively.
Documents
- gopro generate and optimize prompts in clip using selfsupervised learning
- regionblip a unified multimodal pretraining framework for holistic and regional comprehension
- direct multimodal fewshot learning of speech and images
- vast a visionaudiosubtitletext omnimodality foundation model and dataset
- anovl adapting visionlanguage models for unified zeroshot anomaly localization
Abstract
Large-scale foundation models, such as CLIP, have demonstrated remarkable success in visual recognition tasks by embedding images in a semantically rich space. Self-supervised learning (SSL) has also shown promise in improving visual recognition by learning invariant features. However, the combination of CLIP with SSL is found to face challenges due to the multi-task framework that blends CLIP's contrastive loss and SSL's loss, including difficulties with loss weighting and inconsistency among different views of images in CLIP's output space. To overcome these challenges, we propose a prompt learning-based model called GOPro, which is a unified framework that ensures similarity between various augmented views of input images in a shared image-text embedding space, using a pair of learnable image and text projectors atop CLIP, to promote invariance and generalizability. To automatically learn such prompts, we leverage the visual content and style primitives extracted from pre-trained CLIP and adapt them to the target task. In addition to CLIP's cross-domain contrastive loss, we introduce a visual contrastive loss and a novel prompt consistency loss, considering the different views of the images. GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner. Empirical evaluations demonstrate that GOPro outperforms the state-of-the-art prompting techniques on three challenging domain generalization tasks across multiple benchmarks by a significant margin. Our code is available at https://github.com/mainaksingha01/GOPro.
Abstract
In this work, we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end, we propose to extract features corresponding to regional objects as soft prompts for LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning. To effectively extract regional features from regular image features and irregular point cloud features, we present a novel and unified position-assisted feature extraction module. Furthermore, training an MLLM from scratch is highly time-consuming. Thus, we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2, an impressive MLLM, and optimize the modality-specific Lora parameters in Q-Former and LLM for each newly introduced modality. The freezing of the Q-Former eliminates the need for extensive pre-training on massive image-text data. The freezed Q-Former pre-trained from massive image-text data is also beneficial for the pre-training on image-region-text data. We name our framework RegionBLIP. We pre-train RegionBLIP on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results verify that \Ours{} can preserve the image comprehension capability of BILP-2 and further gain a comprehension of the newly introduced point cloud modality and regional objects. The Data, Code, and Pre-trained models will be available at https://github.com/mightyzau/RegionBLIP.
Abstract
We propose direct multimodal few-shot models that learn a shared embeddingspace of spoken words and images from only a few paired examples. Imagine anagent is shown an image along with a spoken word describing the object in thepicture, e.g. pen, book and eraser. After observing a few paired examples ofeach class, the model is asked to identify the "book" in a set of unseenpictures. Previous work used a two-step indirect approach relying on learnedunimodal representations: speech-speech and image-image comparisons areperformed across the support set of given speech-image pairs. We propose twodirect models which instead learn a single multimodal space where inputs fromdifferent modalities are directly comparable: a multimodal triplet network(MTriplet) and a multimodal correspondence autoencoder (MCAE). To train thesedirect models, we mine speech-image pairs: the support set is used to pair upunlabelled in-domain speech and images. In a speech-to-image digit matchingtask, direct models outperform indirect models, with the MTriplet achieving thebest multimodal five-shot accuracy. We show that the improvements are due tothe combination of unsupervised and transfer learning in the direct models, andthe absence of two-step compounding errors.
Abstract
Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we resort to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.
Abstract
Contrastive Language-Image Pre-training (CLIP) models have shown promising performance on zero-shot visual recognition tasks by learning visual representations under natural language supervision. Recent studies attempt the use of CLIP to tackle zero-shot anomaly detection by matching images with normal and abnormal state prompts. However, since CLIP focuses on building correspondence between paired text prompts and global image-level representations, the lack of patch-level vision to text alignment limits its capability on precise visual anomaly localization. In this work, we introduce a training-free adaptation (TFA) framework of CLIP for zero-shot anomaly localization. In the visual encoder, we innovate a training-free value-wise attention mechanism to extract intrinsic local tokens of CLIP for patch-level local description. From the perspective of text supervision, we particularly design a unified domain-aware contrastive state prompting template. On top of the proposed TFA, we further introduce a test-time adaptation (TTA) mechanism to refine anomaly localization results, where a layer of trainable parameters in the adapter is optimized using TFA's pseudo-labels and synthetic noise-corrupted tokens. With both TFA and TTA adaptation, we significantly exploit the potential of CLIP for zero-shot anomaly localization and demonstrate the effectiveness of our proposed methods on various datasets.
Documents
- one step of gradient descent is provably the optimal incontext learner with one layer of linear selfattention
- how many pretraining tasks are needed for incontext learning of linear regression
- transformers as statisticians provable incontext learning with incontext algorithm selection
- trained transformers learn linear models incontext
- braininspired globallocal learning incorporated with neuromorphic computing
Abstract
Recent works have empirically analyzed in-context learning and shown thattransformers trained on synthetic linear regression tasks can learn toimplement ridge regression, which is the Bayes-optimal predictor, givensufficient capacity [Aky\"urek et al., 2023], while one-layer transformers withlinear self-attention and no MLP layer will learn to implement one step ofgradient descent (GD) on a least-squares linear regression objective [vonOswald et al., 2022]. However, the theory behind these observations remainspoorly understood. We theoretically study transformers with a single layer oflinear self-attention, trained on synthetic noisy linear regression data.First, we mathematically show that when the covariates are drawn from astandard Gaussian distribution, the one-layer transformer which minimizes thepre-training loss will implement a single step of GD on the least-squareslinear regression objective. Then, we find that changing the distribution ofthe covariates and weight vector to a non-isotropic Gaussian distribution has astrong impact on the learned algorithm: the global minimizer of thepre-training loss now implements a single step of $\textit{pre-conditioned}$GD. However, if only the distribution of the responses is changed, then thisdoes not have a large effect on the learned algorithm: even when the responsecomes from a more general family of $\textit{nonlinear}$ functions, the globalminimizer of the pre-training loss still implements a single step of GD on aleast-squares linear regression objective.
Abstract
Transformers pretrained on diverse tasks exhibit remarkable in-contextlearning (ICL) capabilities, enabling them to solve unseen tasks solely basedon input contexts without adjusting model parameters. In this paper, we studyICL in one of its simplest setups: pretraining a linearly parameterizedsingle-layer linear attention model for linear regression with a Gaussianprior. We establish a statistical task complexity bound for the attention modelpretraining, showing that effective pretraining only requires a small number ofindependent tasks. Furthermore, we prove that the pretrained model closelymatches the Bayes optimal algorithm, i.e., optimally tuned ridge regression, byachieving nearly Bayes optimal risk on unseen tasks under a fixed contextlength. These theoretical findings complement prior experimental research andshed light on the statistical foundations of ICL.
Abstract
Neural sequence models based on the transformer architecture havedemonstrated remarkable \emph{in-context learning} (ICL) abilities, where theycan perform new tasks when prompted with training and test examples, withoutany parameter update to the model. This work first provides a comprehensivestatistical theory for transformers to perform ICL. Concretely, we show thattransformers can implement a broad class of standard machine learningalgorithms in context, such as least squares, ridge regression, Lasso, learninggeneralized linear models, and gradient descent on two-layer neural networks,with near-optimal predictive power on various in-context data distributions.Using an efficient implementation of in-context gradient descent as theunderlying mechanism, our transformer constructions admit mild size bounds, andcan be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show thattransformers can implement more complex ICL procedures involving\emph{in-context algorithm selection}, akin to what a statistician can do inreal life -- A \emph{single} transformer can adaptively select different baseICL algorithms -- or even perform qualitatively different tasks -- on differentinput sequences, without any explicit prompting of the right algorithm or task.We both establish this in theory by explicit constructions, and also observethis phenomenon experimentally. In theory, we construct two general mechanismsfor algorithm selection with concrete examples: pre-ICL testing, and post-ICLvalidation. As an example, we use the post-ICL validation mechanism toconstruct a transformer that can perform nearly Bayes-optimal ICL on achallenging task -- noisy linear models with mixed noise levels.Experimentally, we demonstrate the strong in-context algorithm selectioncapabilities of standard transformer architectures.
Abstract
Attention-based neural networks such as transformers have demonstrated aremarkable ability to exhibit in-context learning (ICL): Given a short promptsequence of tokens from an unseen task, they can formulate relevant per-tokenand next-token predictions without any parameter updates. By embedding asequence of labeled training data and unlabeled test data as a prompt, thisallows for transformers to behave like supervised learning algorithms. Indeed,recent work has shown that when training transformer architectures over randominstances of linear regression problems, these models' predictions mimic thoseof ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, weinvestigate the dynamics of ICL in transformers with a single linearself-attention layer trained by gradient flow on linear regression tasks. Weshow that despite non-convexity, gradient flow with a suitable randominitialization finds a global minimum of the objective function. At this globalminimum, when given a test prompt of labeled examples from a new predictiontask, the transformer achieves prediction error competitive with the bestlinear predictor over the test prompt distribution. We additionallycharacterize the robustness of the trained transformer to a variety ofdistribution shifts and show that although a number of shifts are tolerated,shifts in the covariate distribution of the prompts are not. Motivated by this,we consider a generalized ICL setting where the covariate distributions canvary across prompts. We show that although gradient flow succeeds at finding aglobal minimum in this setting, the trained transformer is still brittle undermild covariate shifts. We complement this finding with experiments on large,nonlinear transformer architectures which we show are more robust undercovariate shifts.
Abstract
Two main routes of learning methods exist at present including error-drivenglobal learning and neuroscience-oriented local learning. Integrating them intoone network may provide complementary learning capabilities for versatilelearning scenarios. At the same time, neuromorphic computing holds greatpromise, but still needs plenty of useful algorithms and algorithm-hardwareco-designs for exploiting the advantages. Here, we report a neuromorphic hybridlearning model by introducing a brain-inspired meta-learning paradigm and adifferentiable spiking model incorporating neuronal dynamics and synapticplasticity. It can meta-learn local plasticity and receive top-down supervisioninformation for multiscale synergic learning. We demonstrate the advantages ofthis model in multiple different tasks, including few-shot learning, continuallearning, and fault-tolerance learning in neuromorphic vision sensors. Itachieves significantly higher performance than single-learning methods, andshows promise in empowering neuromorphic applications revolution. We furtherimplemented the hybrid model in the Tianjic neuromorphic platform by exploitingalgorithm-hardware co-designs and proved that the model can fully utilizeneuromorphic many-core architecture to develop hybrid computation paradigm.
Documents
- supercharging academic writing with generative ai framework, techniques, and caveats
- constitutionmaker interactively critiquing large language models by converting feedback into principles
- exploring efl students' prompt engineering in humanai story writing an activity theory perspective
- choice over control how users write with large language models using diegetic and nondiegetic prompting
- lmcanvas objectoriented interaction to personalize large language modelpowered writing environments
Abstract
Academic writing is an indispensable yet laborious part of the researchenterprise. This Perspective maps out principles and methods for usinggenerative artificial intelligence (AI), specifically large language models(LLMs), to elevate the quality and efficiency of academic writing. We introducea human-AI collaborative framework that delineates the rationale (why), process(how), and nature (what) of AI engagement in writing. The framework pinpointsboth short-term and long-term reasons for engagement and their underlyingmechanisms (e.g., cognitive offloading and imaginative stimulation). It revealsthe role of AI throughout the writing process, conceptualized through atwo-stage model for human-AI collaborative writing, and the nature of AIassistance in writing, represented through a model of writing-assistance typesand levels. Building on this framework, we describe effective promptingtechniques for incorporating AI into the writing routine (outlining, drafting,and editing) as well as strategies for maintaining rigorous scholarship,adhering to varied journal policies, and avoiding overreliance on AI.Ultimately, the prudent integration of AI into academic writing can ease thecommunication burden, empower authors, accelerate discovery, and promotediversity in science.
Abstract
Large language model (LLM) prompting is a promising new approach for users tocreate and customize their own chatbots. However, current methods for steeringa chatbot's outputs, such as prompt engineering and fine-tuning, do not supportusers in converting their natural feedback on the model's outputs to changes inthe prompt or model. In this work, we explore how to enable users tointeractively refine model outputs through their feedback, by helping themconvert their feedback into a set of principles (i.e. a constitution) thatdictate the model's behavior. From a formative study, we (1) found that usersneeded support converting their feedback into principles for the chatbot and(2) classified the different principle types desired by users. Inspired bythese findings, we developed ConstitutionMaker, an interactive tool forconverting user feedback into principles, to steer LLM-based chatbots. WithConstitutionMaker, users can provide either positive or negative feedback innatural language, select auto-generated feedback, or rewrite the chatbot'sresponse; each mode of feedback automatically generates a principle that isinserted into the chatbot's prompt. In a user study with 14 participants, wecompare ConstitutionMaker to an ablated version, where users write their ownprinciples. With ConstitutionMaker, participants felt that their principlescould better guide the chatbot, that they could more easily convert theirfeedback into principles, and that they could write principles moreefficiently, with less mental demand. ConstitutionMaker helped users identifyways to improve the chatbot, formulate their intuitive responses to the modelinto feedback, and convert this feedback into specific and clear principles.Together, these findings inform future tools that support the interactivecritiquing of LLM outputs.
Abstract
This study applies Activity Theory to investigate how English as a foreign language (EFL) students prompt generative artificial intelligence (AI) tools during short story writing. Sixty-seven Hong Kong secondary school students created generative-AI tools using open-source language models and wrote short stories with them. The study collected and analyzed the students' generative-AI tools, short stories, and written reflections on their conditions or purposes for prompting. The research identified three main themes regarding the purposes for which students prompt generative-AI tools during short story writing: a lack of awareness of purposes, overcoming writer's block, and developing, expanding, and improving the story. The study also identified common characteristics of students' activity systems, including the sophistication of their generative-AI tools, the quality of their stories, and their school's overall academic achievement level, for their prompting of generative-AI tools for the three purposes during short story writing. The study's findings suggest that teachers should be aware of students' purposes for prompting generative-AI tools to provide tailored instructions and scaffolded guidance. The findings may also help designers provide differentiated instructions for users at various levels of story development when using a generative-AI tool.
Abstract
We propose a conceptual perspective on prompts for Large Language Models (LLMs) that distinguishes between (1) diegetic prompts (part of the narrative, e.g. “Once upon a time, I saw a fox...”), and (2) non-diegetic prompts (external, e.g. “Write about the adventures of the fox.”). With this lens, we study how 129 crowd workers on Prolific write short texts with different user interfaces (1 vs 3 suggestions, with/out non-diegetic prompts; implemented with GPT-3): When the interface offered multiple suggestions and provided an option for non-diegetic prompting, participants preferred choosing from multiple suggestions over controlling them via non-diegetic prompts. When participants provided non-diegetic prompts it was to ask for inspiration, topics or facts. Single suggestions in particular were guided both with diegetic and non-diegetic information. This work informs human-AI interaction with generative models by revealing that (1) writing non-diegetic prompts requires effort, (2) people combine diegetic and non-diegetic prompting, and (3) they use their draft (i.e. diegetic information) and suggestion timing to strategically guide LLMs.
Abstract
Large language models (LLMs) can enhance writing by automating or supporting specific tasks in writers' workflows (e.g., paraphrasing, creating analogies). Leveraging this capability, a collection of interfaces have been developed that provide LLM-powered tools for specific writing tasks. However, these interfaces provide limited support for writers to create personal tools for their own unique tasks, and may not comprehensively fulfill a writer's needs -- requiring them to continuously switch between interfaces during writing. In this work, we envision LMCanvas, an interface that enables writers to create their own LLM-powered writing tools and arrange their personal writing environment by interacting with"blocks"in a canvas. In this interface, users can create text blocks to encapsulate writing and LLM prompts, model blocks for model parameter configurations, and connect these to create pipeline blocks that output generations. In this workshop paper, we discuss the design for LMCanvas and our plans to develop this concept.
Documents
- fewshot learning with multilingual language models
- alexatm 20b fewshot learning using a largescale multilingual seq2seq model
- fewshot learning for sentence pair classification and its applications in software engineering
- discrete and soft prompting for multilingual models
- multitask pretraining of modular prompt for chinese fewshot learning
Abstract
Large-scale generative language models such as GPT-3 are competitive few-shotlearners. While these models are known to be able to jointly represent manydifferent languages, their training data is dominated by English, potentiallylimiting their cross-lingual generalization. In this work, we trainmultilingual generative language models on a corpus covering a diverse set oflanguages, and study their few- and zero-shot learning capabilities in a widerange of tasks. Our largest model with 7.5 billion parameters sets new state ofthe art in few-shot learning in more than 20 representative languages,outperforming GPT-3 of comparable size in multilingual commonsense reasoning(with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in4-shot settings) and natural language inference (+5.4% in each of 0-shot and4-shot settings). On the FLORES-101 machine translation benchmark, our modeloutperforms GPT-3 on 171 out of 182 directions with 32 training examples, whilesurpassing the official supervised baseline in 45 directions. We conduct anin-depth analysis of different multilingual prompting approaches, showing inparticular that strong few-shot learning performance across languages can beachieved via cross-lingual transfer through both templates and demonstrationexamples. Finally, we evaluate our models in social value tasks such as hatespeech detection in five languages and find it has limitations similar tocomparable sized GPT-3 models.
Abstract
In this work, we demonstrate that multilingual large-scalesequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoisingand Causal Language Modeling (CLM) tasks, are more efficient few-shot learnersthan decoder-only models on various tasks. In particular, we train a 20 billionparameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B)and show that it achieves state-of-the-art (SOTA) performance on 1-shotsummarization tasks, outperforming a much larger 540B PaLM decoder model.AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially forlow-resource languages, across almost all language pairs supported by the model(Arabic, English, French, German, Hindi, Italian, Japanese, Marathi,Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show inzero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2datasets and provides SOTA performance on multilingual tasks such as XNLI,XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling casefor seq2seq models as a powerful alternative to decoder-only models forLarge-scale Language Model (LLM) training.
Abstract
Few-shot learning-the ability to train models with access to limited data-hasbecome increasingly popular in the natural language processing (NLP) domain, aslarge language models such as GPT and T0 have been empirically shown to achievehigh performance in numerous tasks with access to just a handful of labeledexamples. Smaller language models such as BERT and its variants have also beenshown to achieve strong performance with just a handful of labeled exampleswhen combined with few-shot learning algorithms like pattern-exploitingtraining (PET) and SetFit. The focus of this work is to investigate theperformance of alternative few-shot learning approaches with BERT-based models.Specifically, vanilla fine-tuning, PET and SetFit are compared for numerousBERT-based checkpoints over an array of training set sizes. To facilitate thisinvestigation, applications of few-shot learning are considered in softwareengineering. For each task, high-performance techniques and their associatedmodel checkpoints are identified through detailed empirical analysis. Ourresults establish PET as a strong few-shot learning approach, and our analysisshows that with just a few hundred labeled examples it can achieve performancenear that of fine-tuning on full-sized data sets.
Abstract
It has been shown for English that discrete and soft prompting performstrongly in few-shot learning with pretrained language models (PLMs). In thispaper, we show that discrete and soft prompting perform better than finetuningin multilingual cases: Crosslingual transfer and in-language training ofmultilingual natural language inference. For example, with 48 English trainingexamples, finetuning obtains 33.74% accuracy in crosslingual transfer, barelysurpassing the majority baseline (33.33%). In contrast, discrete and softprompting outperform finetuning, achieving 36.43% and 38.79%. We alsodemonstrate good performance of prompting with training data in multiplelanguages other than English.
Abstract
Prompt tuning is a parameter-efficient approach to adapting pre-trainedlanguage models to downstream tasks. Although prompt tuning has been shown tomatch the performance of full model tuning when training data is sufficient, ittends to struggle in few-shot learning settings. In this paper, we presentMulti-task Pre-trained Modular Prompt (MP2) to boost prompt tuning for few-shotlearning. MP2 is a set of combinable prompts pre-trained on 38 Chinese tasks.On downstream tasks, the pre-trained prompts are selectively activated andcombined, leading to strong compositional generalization to unseen tasks. Tobridge the gap between pre-training and fine-tuning, we formulate upstream anddownstream tasks into a unified machine reading comprehension task. Extensiveexperiments under two learning paradigms, i.e., gradient descent and black-boxtuning, show that MP2 significantly outperforms prompt tuning, full modeltuning, and prior prompt pre-training methods in few-shot settings. Inaddition, we demonstrate that MP2 can achieve surprisingly fast and strongadaptation to downstream tasks by merely learning 8 parameters to combine thepre-trained modular prompts.
Documents
- generate rather than retrieve large language models are strong context generators
- query rewriting for retrievalaugmented large language models
- query2doc query expansion with large language models
- s$^3$hqa a threestage approach for multihop texttable hybrid question answering
- robut a systematic study of table qa robustness against humanannotated adversarial perturbations
Abstract
Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts, resulting in the generated documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue system. Notably, GenRead achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Lastly, we demonstrate the model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.
Abstract
Large Language Models (LLMs) play powerful, black-box readers in the retrieve-then-read pipeline, making remarkable progress in knowledge-intensive tasks. This work introduces a new framework, Rewrite-Retrieve-Read instead of the previous retrieve-then-read for the retrieval-augmented LLMs from the perspective of the query rewriting. Unlike prior studies focusing on adapting either the retriever or the reader, our approach pays attention to the adaptation of the search query itself, for there is inevitably a gap between the input text and the needed knowledge in retrieval. We first prompt an LLM to generate the query, then use a web search engine to retrieve contexts. Furthermore, to better align the query to the frozen modules, we propose a trainable scheme for our pipeline. A small language model is adopted as a trainable rewriter to cater to the black-box LLM reader. The rewriter is trained using the feedback of the LLM reader by reinforcement learning. Evaluation is conducted on downstream tasks, open-domain QA and multiple-choice QA. Experiments results show consistent performance improvement, indicating that our framework is proven effective and scalable, and brings a new framework for retrieval-augmented LLM.
Abstract
This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
Abstract
Answering multi-hop questions over hybrid factual knowledge from the giventext and table (TextTableQA) is a challenging task. Existing models mainlyadopt a retriever-reader framework, which have several deficiencies, such asnoisy labeling in training retriever, insufficient utilization of heterogeneousinformation over text and table, and deficient ability for different reasoningoperations. In this paper, we propose a three-stage TextTableQA frameworkS3HQA, which comprises of retriever, selector, and reasoner. We use a retrieverwith refinement training to solve the noisy labeling problem. Then, a hybridselector considers the linked relationships between heterogeneous data toselect the most relevant factual knowledge. For the final stage, instead ofadapting a reading comprehension module like in previous methods, we employ ageneration-based reasoner to obtain answers. This includes two approaches: arow-wise generator and an LLM prompting generator~(first time used in thistask). The experimental results demonstrate that our method achievescompetitive results in the few-shot setting. When trained on the full dataset,our approach outperforms all baseline methods, ranking first on the HybridQAleaderboard.
Abstract
Despite significant progress having been made in question answering ontabular data (Table QA), it's unclear whether, and to what extent existingTable QA models are robust to task-specific perturbations, e.g., replacing keyquestion entities or shuffling table columns. To systematically study therobustness of Table QA models, we propose a benchmark called RobuT, whichbuilds upon existing Table QA datasets (WTQ, WikiSQL-Weak, and SQA) andincludes human-annotated adversarial perturbations in terms of table header,table content, and question. Our results indicate that both state-of-the-artTable QA models and large language models (e.g., GPT-3) with few-shot learningfalter in these adversarial sets. We propose to address this problem by usinglarge language models to generate adversarial examples to enhance training,which significantly improves the robustness of Table QA models. Our data andcode is publicly available at https://github.com/yilunzhao/RobuT.
Documents
- geotechnical parrot tales (gpt) harnessing large language models in geotechnical engineering
- prompt engineering for healthcare methodologies and applications
- safety analysis in the era of large language models a case study of stpa using chatgpt
- is gpt4 a good trader
- software testing with large language model survey, landscape, and vision
Abstract
The widespread adoption of large language models (LLMs), such as OpenAI's ChatGPT, could revolutionize various industries, including geotechnical engineering. However, GPT models can sometimes generate plausible-sounding but false outputs, leading to hallucinations. In this article, we discuss the importance of prompt engineering in mitigating these risks and harnessing the full potential of GPT for geotechnical applications. We explore the challenges and pitfalls associated with LLMs and highlight the role of context in ensuring accurate and valuable responses. Furthermore, we examine the development of context-specific search engines and the potential of LLMs to become a natural interface for complex tasks, such as data analysis and design. We also develop a unified interface using natural language to handle complex geotechnical engineering tasks and data analysis. By integrating GPT into geotechnical engineering workflows, professionals can streamline their work and develop sustainable and resilient infrastructure systems for the future.
Abstract
This review will introduce the latest advances in prompt engineering in the field of natural language processing (NLP) for the medical domain. First, we will provide a brief overview of the development of prompt engineering and emphasize its significant contributions to healthcare NLP applications such as question-answering systems, text summarization, and machine translation. With the continuous improvement of general large language models, the importance of prompt engineering in the healthcare domain is becoming increasingly prominent. The aim of this article is to provide useful resources and bridges for healthcare NLP researchers to better explore the application of prompt engineering in this field. We hope that this review can provide new ideas and inspire ample possibilities for research and application in medical NLP.
Abstract
Can safety analysis make use of Large Language Models (LLMs)? A case studyexplores Systems Theoretic Process Analysis (STPA) applied to AutomaticEmergency Brake (AEB) and Electricity Demand Side Management (DSM) systemsusing ChatGPT. We investigate how collaboration schemes, input semanticcomplexity, and prompt guidelines influence STPA results. Comparative resultsshow that using ChatGPT without human intervention may be inadequate due toreliability related issues, but with careful design, it may outperform humanexperts. No statistically significant differences are found when varying theinput semantic complexity or using common prompt guidelines, which suggests thenecessity for developing domain-specific prompt engineering. We also highlightfuture challenges, including concerns about LLM trustworthiness and thenecessity for standardisation and regulation in this domain.
Abstract
Recently, large language models (LLMs), particularly GPT-4, have demonstrated significant capabilities in various planning and reasoning tasks \cite{cheng2023gpt4,bubeck2023sparks}. Motivated by these advancements, there has been a surge of interest among researchers to harness the capabilities of GPT-4 for the automated design of quantitative factors that do not overlap with existing factor libraries, with an aspiration to achieve alpha returns \cite{webpagequant}. In contrast to these work, this study aims to examine the fidelity of GPT-4's comprehension of classic trading theories and its proficiency in applying its code interpreter abilities to real-world trading data analysis. Such an exploration is instrumental in discerning whether the underlying logic GPT-4 employs for trading is intrinsically reliable. Furthermore, given the acknowledged interpretative latitude inherent in most trading theories, we seek to distill more precise methodologies of deploying these theories from GPT-4's analytical process, potentially offering invaluable insights to human traders. To achieve this objective, we selected daily candlestick (K-line) data from specific periods for certain assets, such as the Shanghai Stock Index. Through meticulous prompt engineering, we guided GPT-4 to analyze the technical structures embedded within this data, based on specific theories like the Elliott Wave Theory. We then subjected its analytical output to manual evaluation, assessing its interpretative depth and accuracy vis-\`a-vis these trading theories from multiple dimensions. The results and findings from this study could pave the way for a synergistic amalgamation of human expertise and AI-driven insights in the realm of trading.
Abstract
Pre-trained large language models (LLMs) have recently emerged as abreakthrough technology in natural language processing and artificialintelligence, with the ability to handle large-scale datasets and exhibitremarkable performance across a wide range of tasks. Meanwhile, softwaretesting is a crucial undertaking that serves as a cornerstone for ensuring thequality and reliability of software products. As the scope and complexity ofsoftware systems continue to grow, the need for more effective software testingtechniques becomes increasingly urgent, and making it an area ripe forinnovative approaches such as the use of LLMs. This paper provides acomprehensive review of the utilization of LLMs in software testing. Itanalyzes 52 relevant studies that have used LLMs for software testing, fromboth the software testing and LLMs perspectives. The paper presents a detaileddiscussion of the software testing tasks for which LLMs are commonly used,among which test case preparation and program repair are the mostrepresentative ones. It also analyzes the commonly used LLMs, the types ofprompt engineering that are employed, as well as the accompanied techniqueswith these LLMs. It also summarizes the key challenges and potentialopportunities in this direction. This work can serve as a roadmap for futureresearch in this area, highlighting potential avenues for exploration, andidentifying gaps in our current understanding of the use of LLMs in softwaretesting.
Documents
- larger language models do incontext learning differently
- rethinking the role of demonstrations what makes incontext learning work
- true fewshot learning with language models
- rethinking the role of scale for incontext learning an interpretabilitybased case study at 66 billion scale
- incontext example selection with influences
Abstract
We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups-ICL with flipped labels and ICL with semantically-unrelated labels-across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task. The ability to do SUL-ICL also emerges primarily with scale, and large-enough language models can even perform linear classification in a SUL-ICL setting. Finally, we evaluate instruction-tuned models and find that instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former.
Abstract
Large language models (LMs) are able to in-context learn -- perform a newtask via inference alone by conditioning on a few input-label pairs(demonstrations) and making predictions for new inputs. However, there has beenlittle understanding of how the model learns and which aspects of thedemonstrations contribute to end task performance. In this paper, we show thatground truth demonstrations are in fact not required -- randomly replacinglabels in the demonstrations barely hurts performance on a range ofclassification and multi-choce tasks, consistently over 12 different modelsincluding GPT-3. Instead, we find that other aspects of the demonstrations arethe key drivers of end task performance, including the fact that they provide afew examples of (1) the label space, (2) the distribution of the input text,and (3) the overall format of the sequence. Together, our analysis provides anew way of understanding how and why in-context learning works, while openingup new questions about how much can be learned from large language modelsthrough inference alone.
Abstract
Pretrained language models (LMs) perform well on many tasks even whenlearning from a few examples, but prior work uses many held-out examples totune various aspects of learning, such as hyperparameters, training objectives,and natural language templates ("prompts"). Here, we evaluate the few-shotability of LMs when such held-out examples are unavailable, a setting we calltrue few-shot learning. We test two model selection criteria, cross-validationand minimum description length, for choosing LM prompts and hyperparameters inthe true few-shot setting. On average, both marginally outperform randomselection and greatly underperform selection based on held-out examples.Moreover, selection criteria often prefer models that perform significantlyworse than randomly-selected ones. We find similar results even when takinginto account our uncertainty in a model's true performance during selection, aswell as when varying the amount of computation and number of examples used forselection. Overall, our findings suggest that prior work significantlyoverestimated the true few-shot ability of LMs given the difficulty of few-shotmodel selection.
Abstract
Language models have been shown to perform better with an increase in scaleon a wide variety of tasks via the in-context learning paradigm. In this paper,we investigate the hypothesis that the ability of a large language model toin-context learn-perform a task is not uniformly spread across all of itsunderlying components. Using a 66 billion parameter language model (OPT-66B)across a diverse set of 14 downstream tasks, we find this is indeed the case:$\sim$70% of attention heads and $\sim$20% of feed forward networks can beremoved with minimal decline in task performance. We find substantial overlapin the set of attention heads (un)important for in-context learning acrosstasks and number of in-context examples. We also address our hypothesis througha task-agnostic lens, finding that a small set of attention heads in OPT-66Bscore highly on their ability to perform primitive induction operationsassociated with in-context learning, namely, prefix matching and copying. Theseinduction heads overlap with task-specific important heads, reinforcingarguments by Olsson et al. (arXiv:2209.11895) regarding induction headgenerality to more sophisticated behaviors associated with in-context learning.Overall, our study provides several insights that indicate large languagemodels may be under-trained for in-context learning and opens up questions onhow to pre-train language models to more effectively perform in-contextlearning.
Abstract
In-context learning (ICL) is a powerful paradigm emerged from large languagemodels (LLMs). Despite its promises, ICL performance is known to be highlysensitive to input examples. In this work, we use $\textit{in-contextinfluences}$ to analyze few-shot ICL performance directly from the in-contextexamples. Our proposed influence-based example selection method can identifyboth positive and negative examples, outperforming several baselines whenevaluated on 9 SuperGLUE tasks. Our analysis uncovers up to a $16.3\%$performance gap between using the most negative in-context examples compared tothe most positive. In a case study, we apply our influence-based framework toquantify the phenomena of recency bias in example ordering for few-shot ICL.
Documents
- scalable and transferable blackbox jailbreaks for language models via persona modulation
- do anything now characterizing and evaluating inthewild jailbreak prompts on large language models
- marked personas using natural language prompts to measure stereotypes in language models
- prompt injection attack against llmintegrated applications
- red teaming language models with language models
Abstract
Despite efforts to align large language models to produce harmless responses,they are still vulnerable to jailbreak prompts that elicit unrestrictedbehaviour. In this work, we investigate persona modulation as a black-boxjailbreaking method to steer a target model to take on personalities that arewilling to comply with harmful instructions. Rather than manually craftingprompts for each persona, we automate the generation of jailbreaks using alanguage model assistant. We demonstrate a range of harmful completions madepossible by persona modulation, including detailed instructions forsynthesising methamphetamine, building a bomb, and laundering money. Theseautomated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is185 times larger than before modulation (0.23%). These prompts also transfer toClaude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%,respectively. Our work reveals yet another vulnerability in commercial largelanguage models and highlights the need for more comprehensive safeguards.
Abstract
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intent use. However, a particular type of adversarial prompts, known as jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study on jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from public platforms to private ones, posing new challenges for LLM vendors in proactive detection. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and they have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.
Abstract
To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling.Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones.We find that the portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. The words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. An intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. These representational harms have concerning implications for downstream applications like story generation.
Abstract
Large Language Models (LLMs), renowned for their superior proficiency in language comprehension and generation, stimulate a vibrant ecosystem of applications around them. However, their extensive assimilation into various services introduces significant security risks. This study deconstructs the complexities and implications of prompt injection attacks on actual LLM-integrated applications. Initially, we conduct an exploratory analysis on ten commercial applications, highlighting the constraints of current attack strategies in practice. Prompted by these limitations, we subsequently formulate HouYi, a novel black-box prompt injection attack technique, which draws inspiration from traditional web injection attacks. HouYi is compartmentalized into three crucial elements: a seamlessly-incorporated pre-constructed prompt, an injection prompt inducing context partition, and a malicious payload designed to fulfill the attack objectives. Leveraging HouYi, we unveil previously unknown and severe attack outcomes, such as unrestricted arbitrary LLM usage and uncomplicated application prompt theft. We deploy HouYi on 36 actual LLM-integrated applications and discern 31 applications susceptible to prompt injection. 10 vendors have validated our discoveries, including Notion, which has the potential to impact millions of users. Our investigation illuminates both the possible risks of prompt injection attacks and the possible tactics for mitigation.
Abstract
Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM. We evaluate the target LM’s replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot’s own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.
Documents
- pal programaided language models
- structured chainofthought prompting for code generation
- lpml llmprompting markup language for mathematical reasoning
- limits of an ai program for solving college math problems
- solving probability and statistics problems by program synthesis
Abstract
Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks, when provided with a few examples at test time ("few-shot prompting"). Much of this success can be attributed to prompting methods such as"chain-of-thought'', which employ LLMs for both understanding the problem description by decomposing it into steps, as well as solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, LLMs often make logical and arithmetic mistakes in the solution part, even when the problem is decomposed correctly. In this paper, we present Program-Aided Language models (PAL): a novel approach that uses the LLM to read natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We demonstrate this synergy between a neural LLM and a symbolic interpreter across 13 mathematical, symbolic, and algorithmic reasoning tasks from BIG-Bench Hard and other benchmarks. In all these natural language reasoning tasks, generating code using an LLM and reasoning using a Python interpreter leads to more accurate results than much larger models. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM8K benchmark of math word problems, surpassing PaLM-540B which uses chain-of-thought by absolute 15% top-1. Our code and data are publicly available at http://reasonwithpal.com/ .
Abstract
Large Language Models (LLMs) (e.g., ChatGPT) have shown impressiveperformance in code generation. LLMs take prompts as inputs, andChain-of-Thought (CoT) prompting is the state-of-the-art prompting technique.CoT prompting asks LLMs first to generate CoTs (i.e., intermediate naturallanguage reasoning steps) and then output the code. However, CoT prompting isdesigned for natural language generation and has low accuracy in codegeneration. In this paper, we propose Structured CoTs (SCoTs) and present a novelprompting technique for code generation, named SCoT prompting. Our motivationis source code contains rich structural information and any code can becomposed of three program structures (i.e., sequence, branch, and loopstructures). Intuitively, structured intermediate reasoning steps make forstructured source code. Thus, we ask LLMs to use program structures to buildCoTs, obtaining SCoTs. Then, LLMs generate the final code based on SCoTs.Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to thinkabout how to solve requirements from the view of source code and further theperformance of LLMs in code generation. We apply SCoT prompting to two LLMs(i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval,MBPP, and MBCPP). (1) SCoT prompting outperforms the state-of-the-art baseline- CoT prompting by up to 13.79% in Pass@1. (2) Human evaluation shows humandevelopers prefer programs from SCoT prompting. (3) SCoT prompting is robust toexamples and achieves substantial improvements.
Abstract
In utilizing large language models (LLMs) for mathematical reasoning, addressing the errors in the reasoning and calculation present in the generated text by LLMs is a crucial challenge. In this paper, we propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL). We discovered that by prompting LLMs to generate structured text in XML-like markup language, we could seamlessly integrate CoT and the external tool and control the undesired behaviors of LLMs. With our approach, LLMs can utilize Python computation to rectify errors within CoT. We applied our method to ChatGPT (GPT-3.5) to solve challenging mathematical problems and demonstrated that combining CoT and Python REPL through the markup language enhances the reasoning capability of LLMs. Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
Abstract
Drori et al. (2022) report that "A neural network solves, explains, andgenerates university math problems by program synthesis and few-shot learningat human level ... [It] automatically answers 81\% of university-levelmathematics problems." The system they describe is indeed impressive; however,the above description is very much overstated. The work of solving the problemsis done, not by a neural network, but by the symbolic algebra package Sympy.Problems of various formats are excluded from consideration. The so-called"explanations" are just rewordings of lines of code. Answers are marked ascorrect that are not in the form specified in the problem. Most seriously, itseems that in many cases the system uses the correct answer given in the testcorpus to guide its path to solving the problem.
Abstract
We solve university level probability and statistics questions by programsynthesis using OpenAI's Codex, a Transformer trained on text and fine-tuned oncode. We transform course problems from MIT's 18.05 Introduction to Probabilityand Statistics and Harvard's STAT110 Probability into programming tasks. Wethen execute the generated code to get a solution. Since these course questionsare grounded in probability, we often aim to have Codex generate probabilisticprograms that simulate a large number of probabilistic dependencies to computeits solution. Our approach requires prompt engineering to transform thequestion from its original form to an explicit, tractable form that results ina correct program and solution. To estimate the amount of work needed totranslate an original question into its tractable form, we measure thesimilarity between original and transformed questions. Our work is the first tointroduce a new dataset of university-level probability and statistics problemsand solve these problems in a scalable fashion using the program synthesiscapabilities of large language models.