Documents
- Can Language Models Solve Graph Problems in Natural Language?
- Retrieve-Rewrite-Answer: A KG-to-Text Enhanced LLMs Framework for Knowledge Graph Question Answering
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks
- Adaptive-Solver Framework for Dynamic Strategy Selection in Large Language Model Reasoning
- JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving
Abstract
Large language models (LLMs) are increasingly adopted for a variety of tasks with implicit graphical structures, such as planning in robotics, multi-hop question answering or knowledge probing, structured commonsense reasoning, and more. While LLMs have advanced the state-of-the-art on these tasks with structure implications, whether LLMs could explicitly process textual descriptions of graphs and structures, map them to grounded conceptual spaces, and perform structured operations remains underexplored. To this end, we propose NLGraph (Natural Language Graph), a comprehensive benchmark of graph-based problem solving designed in natural language. NLGraph contains 29,370 problems, covering eight graph reasoning tasks with varying complexity from simple tasks such as connectivity and shortest path up to complex problems such as maximum flow and simulating graph neural networks. We evaluate LLMs (GPT-3/4) with various prompting approaches on the NLGraph benchmark and find that 1) language models do demonstrate preliminary graph reasoning abilities, 2) the benefit of advanced prompting and in-context learning diminishes on more complex graph problems, while 3) LLMs are also (un)surprisingly brittle in the face of spurious correlations in graph and problem settings. We then propose Build-a-Graph Prompting and Algorithmic Prompting, two instruction-based approaches to enhance LLMs in solving natural language graph problems. Build-a-Graph and Algorithmic prompting improve the performance of LLMs on NLGraph by 3.07% to 16.85% across multiple tasks and settings, while how to solve the most complicated graph reasoning tasks in our setup with language models remains an open research question. The NLGraph benchmark and evaluation code are available at https://github.com/Arthur-Heng/NLGraph.
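The Build-a-Graph idea, prepending an instruction that makes the model reconstruct the graph before answering, can be illustrated with a minimal prompt-construction sketch. The wording below is a hypothetical paraphrase for illustration, not the paper's exact prompt:

```python
def build_a_graph_prompt(edges, source, target):
    """Hypothetical Build-a-Graph style prompt: ask the model to first
    reconstruct the graph from the textual edge list, then answer a
    connectivity query about it."""
    edge_text = ". ".join(
        f"There is an edge between node {u} and node {v}" for u, v in edges
    )
    return (
        f"In an undirected graph, {edge_text}. "
        "Let's construct the graph with the nodes and edges first. "
        f"Question: Is there a path between node {source} and node {target}?"
    )

print(build_a_graph_prompt([(0, 1), (1, 2)], 0, 2))
```

The same template generalizes to the other NLGraph tasks by swapping the final question (e.g. shortest path or maximum flow) while keeping the graph-construction instruction fixed.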
Abstract
Despite their competitive performance on knowledge-intensive tasks, large language models (LLMs) still have limitations in memorizing all world knowledge, especially long-tail knowledge. In this paper, we study the KG-augmented language model approach for solving the knowledge graph question answering (KGQA) task that requires rich world knowledge. Existing work has shown that retrieving KG knowledge to enhance LLM prompting can significantly improve LLMs' performance in KGQA. However, their approaches lack a well-formed verbalization of KG knowledge, i.e., they ignore the gap between KG representations and textual representations. To this end, we propose an answer-sensitive KG-to-Text approach that can transform KG knowledge into well-textualized statements most informative for KGQA. Based on this approach, we propose a KG-to-Text enhanced LLMs framework for solving the KGQA task. Experiments on several KGQA benchmarks show that the proposed KG-to-Text augmented LLMs approach outperforms previous KG-augmented LLMs approaches regarding answer accuracy and usefulness of knowledge statements.
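The KG-to-Text step can be pictured with a toy, template-based verbalizer. Note this is only a fixed-template stand-in for illustration; the paper instead learns an answer-sensitive generator:

```python
def verbalize(triples):
    """Turn (subject, predicate, object) KG triples into plain statements
    that can be prepended to an LLM prompt. Template-based toy version:
    assumes predicates like 'capital_of' that read naturally as
    '<s> is the <p> <o>'."""
    return " ".join(
        f"{s} is the {p.replace('_', ' ')} {o}." for s, p, o in triples
    )

context = verbalize([("Paris", "capital_of", "France")])
print(context)  # -> Paris is the capital of France.
```

A KGQA prompt would then place this verbalized context before the question, so the model reasons over fluent text rather than raw triples.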
Abstract
Few-shot prompting is a surprisingly powerful way to use Large Language Models (LLMs) to solve various tasks. However, this approach struggles as the task complexity increases or when the individual reasoning steps of the task themselves are hard to learn, especially when embedded in more complex tasks. To address this, we propose Decomposed Prompting, a new approach to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks that can be delegated to a library of prompting-based LLMs dedicated to these sub-tasks. This modular structure allows each prompt to be optimized for its specific sub-task, further decomposed if necessary, and even easily replaced with more effective prompts, trained models, or symbolic functions if desired. We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting using GPT3. On symbolic reasoning tasks, we can further decompose sub-tasks that are hard for LLMs into even simpler solvable sub-tasks. When the complexity comes from the input length, we can recursively decompose the task into the same task but with smaller inputs. We also evaluate our approach on textual multi-step reasoning tasks: on a long-context multi-hop QA task, we can more effectively teach the sub-tasks via our separate sub-task prompts; and on open-domain multi-hop QA, we can incorporate symbolic information retrieval within our decomposition framework, leading to improved performance on both tasks. Datasets, code, and prompts are available at https://github.com/allenai/DecomP.
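The modular structure can be sketched as a controller chaining sub-task handlers, where each handler could equally be a few-shot prompt, a trained model, or a symbolic function. The toy task and handler names below are invented for illustration:

```python
# Toy "concatenate the first letter of each word" task, solved by
# chaining two sub-task handlers. In Decomposed Prompting each handler
# would itself be a dedicated few-shot prompt (or, as here, a symbolic
# function swapped in for a sub-task LLMs find easy to get wrong).

def first_letter_handler(words):
    # sub-task 1: extract the first letter of each word
    return [w[0] for w in words]

def concat_handler(letters):
    # sub-task 2: symbolic join, no model call needed
    return "".join(letters)

def decomposed_solve(words):
    # controller: delegate each sub-task in sequence
    return concat_handler(first_letter_handler(words))

print(decomposed_solve(["decomposed", "prompting"]))  # -> dp
```

The point of the design is that either handler can be upgraded independently, e.g. replacing `first_letter_handler` with its own prompt, without touching the rest of the pipeline.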
Abstract
Large Language Models (LLMs) are showcasing impressive ability in handling complex reasoning tasks. In real-world situations, problems often span a spectrum of complexities. Humans inherently adjust their problem-solving approaches based on task complexity. However, most methodologies that leverage LLMs tend to adopt a uniform approach: utilizing consistent models, prompting methods, and degrees of problem decomposition, regardless of the problem complexity. This inflexibility can bring unnecessary computational overhead or sub-optimal performance. To address this problem, we introduce an Adaptive-Solver framework. It strategically modulates solving strategies based on the difficulties of the problems. Given an initial solution, the framework functions with two primary modules. The initial evaluation module assesses the adequacy of the current solution. If improvements are needed, the subsequent adaptation module comes into play. Within this module, three key adaptation strategies are employed: (1) Model Adaptation: Switching to a stronger LLM when a weaker variant is inadequate. (2) Prompting Method Adaptation: Alternating between different prompting techniques to suit the problem's nuances. (3) Decomposition Granularity Adaptation: Breaking down a complex problem into more fine-grained sub-questions to enhance solvability. Through such dynamic adaptations, our framework not only enhances computational efficiency but also elevates the overall performance. This dual benefit ensures both the efficiency of the system for simpler tasks and the precision required for more complex questions. Experimental results from complex reasoning tasks reveal that the prompting method adaptation and decomposition granularity adaptation enhance performance across all tasks. Furthermore, the model adaptation approach significantly reduces API costs (up to 50%) while maintaining superior performance.
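The evaluate-then-adapt loop can be sketched as escalation along a ladder of solver configurations. All names below are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal sketch of an Adaptive-Solver style loop: try configurations in
# order of increasing cost; an evaluation module decides whether the
# current solution is adequate before escalating.

SOLVER_LADDER = [
    ("weak-model", "zero-shot-cot"),
    ("weak-model", "plan-and-solve"),
    ("strong-model", "plan-and-solve"),
]

def solve(problem, model, method):
    # stand-in: a real system would call an LLM API here
    return f"answer({problem}; {model}; {method})"

def adequate(solution):
    # stand-in evaluation module (e.g. answer-consistency checks);
    # for this demo it only accepts solutions from the strong model
    return "strong-model" in solution

def adaptive_solve(problem):
    solution, config = None, None
    for config in SOLVER_LADDER:
        solution = solve(problem, *config)
        if adequate(solution):
            break  # cheapest adequate configuration wins
    return solution, config

sol, cfg = adaptive_solve("What is 17 * 24?")
print(cfg)  # -> ('strong-model', 'plan-and-solve')
```

The cost saving comes from the early `break`: easy problems exit at the cheap configurations and never pay for the strong model.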
Abstract
Although pre-trained language models (PLMs) have recently advanced the research progress in mathematical reasoning, they are not specially designed as a capable multi-task solver, suffering from high cost for multi-task deployment (e.g., a model copy per task) and inferior performance on complex mathematical problems in practical applications. To address these issues, in this paper, we propose JiuZhang 2.0, a unified Chinese PLM specially for multi-task mathematical problem solving. Our idea is to maintain a moderate-sized model and employ cross-task knowledge sharing to improve the model capacity in a multi-task setting. Specially, we construct a Mixture-of-Experts (MoE) architecture for modeling mathematical text, so as to capture the common mathematical knowledge across tasks. For optimizing the MoE architecture, we design multi-task continual pre-training and multi-task fine-tuning strategies for multi-task adaptation. These training strategies can effectively decompose the knowledge from the task data and establish the cross-task sharing via expert networks. In order to further improve the general capacity of solving different complex tasks, we leverage large language models (LLMs) as complementary models to iteratively refine the generated solution by our PLM, via in-context learning. Extensive experiments have demonstrated the effectiveness of our model.
Documents
- Noise2Music: Text-conditioned Music Generation with Diffusion Models
- Investigating Prompt Engineering in Diffusion Models
- A Taxonomy of Prompt Modifiers for Text-to-Image Generation
- Prompt Engineering for Text-Based Generative Art
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
Abstract
We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and utilized in succession to generate high-fidelity music. We explore two options for the intermediate representation, one using a spectrogram and the other using audio with lower fidelity. We find that the generated audio is not only able to faithfully reflect key elements of the text prompt such as genre, tempo, instruments, mood, and era, but goes beyond to ground fine-grained semantics of the prompt. Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models. Generated examples: https://google-research.github.io/noise2music
Abstract
With the spread of the use of Text2Img diffusion models such as DALL-E 2, Imagen, Midjourney and Stable Diffusion, one challenge that artists face is selecting the right prompts to achieve the desired artistic output. We present techniques for measuring the effect that specific words and phrases in prompts have, and (in the Appendix) present guidance on the selection of prompts to produce desired effects.
Abstract
Text-to-image generation has seen an explosion of interest since 2021. Today, beautiful and intriguing digital images and artworks can be synthesized from textual inputs ("prompts") with deep generative models. Online communities around text-to-image generation and AI generated art have quickly emerged. This paper identifies six types of prompt modifiers used by practitioners in the online community based on a 3-month ethnographic study. The novel taxonomy of prompt modifiers provides researchers a conceptual starting point for investigating the practice of text-to-image generation, but may also help practitioners of AI generated art improve their images. We further outline how prompt modifiers are applied in the practice of "prompt engineering." We discuss research opportunities of this novel creative practice in the field of Human-Computer Interaction (HCI). The paper concludes with a discussion of broader implications of prompt engineering from the perspective of Human-AI Interaction (HAI) in future applications beyond the use case of text-to-image generation and AI generated art.
Abstract
Text-based generative art has seen an explosion of interest in 2021. Online communities around text-based generative art as a novel digital medium have quickly emerged. This short paper identifies five types of prompt modifiers used by practitioners in the community of text-based generative art based on a 3-month ethnographic study on Twitter. The novel taxonomy of prompt modifiers provides researchers a conceptual starting point for investigating the practices of text-based generative art, but may also help practitioners of text-based generative art improve their images. The paper concludes with a discussion of research opportunities in the space of text-based generative art and the broader implications of prompt engineering from the perspective of human-AI interaction in future applications beyond the use case of text-based generative art.
Abstract
In the text-to-image generation field, recent remarkable progress in Stable Diffusion makes it possible to generate rich kinds of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by giving fine-grained guidance (e.g., sketch and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation. Our code and settings are available at https://layoutllm-t2i.github.io.
Documents
- Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models
- Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
- Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems
- Stress Testing Chain-of-Thought Prompting for Large Language Models
- Automatic Chain of Thought Prompting in Large Language Models
Abstract
Large language models can perform various reasoning tasks by using chain-of-thought prompting, which guides them to find answers through step-by-step demonstrations. However, the quality of the prompts depends on the demonstrations given to the models, and creating many of them by hand is costly. We introduce Synthetic Prompting, a method that leverages a few handcrafted examples to prompt the model to generate more examples by itself, and selects effective demonstrations to elicit better reasoning. Our method alternates between a backward and forward process to generate new examples. The backward process generates a question that matches a sampled reasoning chain, so that the question is solvable and clear. The forward process produces a more detailed reasoning chain for the question, improving the quality of the example. We evaluate our method on numerical, symbolic, and algorithmic reasoning tasks, and show that it outperforms existing prompting techniques.
Abstract
Large language models (LLMs) have recently been shown to deliver impressive performance in various NLP tasks. To tackle multi-step reasoning tasks, few-shot chain-of-thought (CoT) prompting includes a few manually crafted step-by-step reasoning demonstrations which enable LLMs to explicitly generate reasoning steps and improve their reasoning task accuracy. To eliminate the manual effort, Zero-shot-CoT concatenates the target problem statement with "Let's think step by step" as an input prompt to LLMs. Despite the success of Zero-shot-CoT, it still suffers from three pitfalls: calculation errors, missing-step errors, and semantic misunderstanding errors. To address the missing-step errors, we propose Plan-and-Solve (PS) Prompting. It consists of two components: first, devising a plan to divide the entire task into smaller subtasks, and then carrying out the subtasks according to the plan. To address the calculation errors and improve the quality of generated reasoning steps, we extend PS prompting with more detailed instructions and derive PS+ prompting. We evaluate our proposed prompting strategy on ten datasets across three reasoning problems. The experimental results over GPT-3 show that our proposed zero-shot prompting consistently outperforms Zero-shot-CoT across all datasets by a large margin, is comparable to or exceeds Zero-shot-Program-of-Thought Prompting, and has comparable performance with 8-shot CoT prompting on the math reasoning problem. The code can be found at https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting.
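As a rough illustration, the difference between the Zero-shot-CoT and PS+ approaches boils down to swapping the trigger instruction appended to the question. The PS+ wording below paraphrases the paper's style and may differ from the released prompt:

```python
ZERO_SHOT_COT = "Let's think step by step."

# Paraphrased PS+-style trigger: devise a plan first, then execute it
# with explicit attention to variables and numerical calculation.
PS_PLUS = (
    "Let's first understand the problem, extract relevant variables and "
    "their corresponding numerals, and devise a plan. Then, let's carry "
    "out the plan, calculate intermediate variables (paying attention to "
    "correct numerical calculation), solve the problem step by step, and "
    "show the answer."
)

def make_prompt(question, trigger=PS_PLUS):
    # both methods share the same zero-shot Q/A scaffold
    return f"Q: {question}\nA: {trigger}"

print(make_prompt("A robe takes 2 bolts of blue fiber and half that "
                  "much white fiber. How many bolts in total?"))
```

Because only the trigger string changes, the two methods are directly comparable at identical cost per query, which is what makes the reported margins attributable to the instruction itself.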
Abstract
While forward reasoning (i.e. find the answer given the question) has been explored extensively in the recent literature, backward reasoning is relatively unexplored. We examine the backward reasoning capabilities of LLMs on Math Word Problems (MWPs): given a mathematical question and its answer, with some details omitted from the question, can LLMs effectively retrieve the missing information? In this paper, we formally define the backward reasoning task on math word problems and modify three datasets to evaluate this task: GSM8k, SVAMP and MultiArith. Our findings show a significant drop in the accuracy of models on backward reasoning compared to forward reasoning across four SOTA LLMs (GPT4, GPT3.5, PaLM-2, and LLaMa-2). Utilizing the specific format of this task, we propose three novel techniques that improve performance: Rephrase reformulates the given problem into a forward reasoning problem, PAL-Tools combines the idea of Program-Aided LLMs to produce a set of equations that can be solved by an external solver, and Check your Work exploits the availability of natural verifier of high accuracy in the forward direction, interleaving solving and verification steps. Finally, realizing that each of our base methods correctly solves a different set of problems, we propose a novel Bayesian formulation for creating an ensemble over these base methods aided by a verifier to further boost the accuracy by a significant margin. Extensive experimentation demonstrates that our techniques successively improve the performance of LLMs on the backward reasoning task, with the final ensemble-based method resulting in a substantial performance gain compared to the raw LLMs with standard prompting techniques such as chain-of-thought.
Abstract
This report examines the effectiveness of Chain-of-Thought (CoT) prompting in improving the multi-step reasoning abilities of large language models (LLMs). Inspired by previous studies (Min et al., 2022), we analyze the impact of three types of CoT prompt perturbations, namely CoT order, CoT values, and CoT operators, on the performance of GPT-3 on various tasks. Our findings show that incorrect CoT prompting leads to poor performance on accuracy metrics. Correct values in the CoT are crucial for predicting correct answers. Moreover, incorrect demonstrations, where the CoT operators or the CoT order are wrong, do not affect the performance as drastically when compared to the value-based perturbations. This research deepens our understanding of CoT prompting and opens some new questions regarding the capability of LLMs to learn reasoning in context.
Abstract
Large language models (LLMs) can perform complex reasoning by generating intermediate reasoning steps. Providing these steps for prompting demonstrations is called chain-of-thought (CoT) prompting. CoT prompting has two major paradigms. One leverages a simple prompt like "Let's think step by step" to facilitate step-by-step thinking before answering a question. The other uses a few manual demonstrations one by one, each composed of a question and a reasoning chain that leads to an answer. The superior performance of the second paradigm hinges on the hand-crafting of task-specific demonstrations one by one. We show that such manual efforts may be eliminated by leveraging LLMs with the "Let's think step by step" prompt to generate reasoning chains for demonstrations one by one, i.e., let's think not just step by step, but also one by one. However, these generated chains often come with mistakes. To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations. We propose an automatic CoT prompting method: Auto-CoT. It samples questions with diversity and generates reasoning chains to construct demonstrations. On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations. Code is available at https://github.com/amazon-research/auto-cot
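A highly simplified sketch of the Auto-CoT recipe follows, with the paper's embedding-based clustering replaced by a crude length feature and the model call stubbed out; both substitutions are illustrative assumptions:

```python
def call_llm(prompt):
    # hypothetical stand-in for an actual LLM call
    return "<model-generated reasoning chain>"

def auto_cot_demos(questions, n_clusters=2):
    """1) partition questions into clusters (a crude length-based proxy
    for the paper's sentence-embedding clustering), 2) pick one
    representative per cluster for diversity, 3) elicit its reasoning
    chain with the Zero-shot-CoT trigger."""
    ordered = sorted(questions, key=len)
    size = max(1, len(ordered) // n_clusters)
    clusters = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    demos = []
    for cluster in clusters[:n_clusters]:
        q = cluster[0]  # representative question for this cluster
        chain = call_llm(f"Q: {q}\nA: Let's think step by step.")
        demos.append((q, chain))
    return demos

demos = auto_cot_demos(["a?", "bb?", "ccc?", "dddd?"])
print(len(demos))  # -> 2
```

The diversity constraint is the load-bearing part: because each demonstration comes from a different cluster, a mistaken generated chain contaminates at most one region of the question space.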
Documents
- Stance Detection with Supervised, Zero-Shot, and Few-Shot Applications
- Explainable Depression Symptom Detection in Social Media
- Embedding Democratic Values into Social Media AIs via Societal Objective Functions
- Framing the News: From Human Perception to Large Language Model Inferences
- MindWatch: A Smart Cloud-Based AI Solution for Suicide Ideation Detection Leveraging Large Language Models
Abstract
Stance detection is the identification of an author's beliefs about a subject from a document. Researchers widely rely on sentiment analysis to accomplish this. However, recent research has shown that sentiment analysis is only loosely correlated with stance, if at all. This paper advances methods in text analysis by precisely defining the task of stance detection, providing a generalized framework for the task, and then presenting three distinct approaches for performing stance detection: supervised classification, zero-shot classification with NLI classifiers, and in-context learning. In doing so, I demonstrate how zero-shot and few-shot language classifiers can replace human labelers for a variety of tasks and discuss how their application and limitations differ from supervised classifiers. Finally, I demonstrate an application of zero-shot stance detection by replicating Block Jr et al. (2022).
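The zero-shot NLI formulation can be sketched by casting each candidate stance as a hypothesis and picking the most-entailed one. `nli_entailment_score` below is a toy keyword-overlap stand-in, not a real NLI model, and the hypothesis template is an illustrative assumption:

```python
def nli_entailment_score(premise, hypothesis):
    # toy heuristic stand-in: a real system would score entailment
    # with a trained NLI model instead of keyword overlap
    return sum(w in premise.lower() for w in hypothesis.lower().split())

def zero_shot_stance(document, target,
                     stances=("favor", "against", "neutral")):
    """Cast each stance label as an NLI hypothesis about the document
    and return the label whose hypothesis scores highest."""
    hypotheses = {s: f"The author is {s} regarding {target}." for s in stances}
    return max(stances, key=lambda s: nli_entailment_score(document, hypotheses[s]))

print(zero_shot_stance("I am against the new policy regarding taxes.", "taxes"))
```

The appeal of the formulation is that adding a new stance label requires only a new hypothesis string, with no labeled training data.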
Abstract
Users of social platforms often perceive these sites as supportive spaces to post about their mental health issues. Those conversations contain important traces about individuals' health risks. Recently, researchers have exploited this online information to construct mental health detection models, which aim to identify users at risk on platforms like Twitter, Reddit or Facebook. Most of these models are centred on achieving good classification results, ignoring the explainability and interpretability of the decisions. Recent research has pointed out the importance of using clinical markers, such as the use of symptoms, to improve trust in the computational models by health professionals. In this paper, we propose using transformer-based architectures to detect and explain the appearance of depressive symptom markers in the users' writings. We present two approaches: i) train one model to classify and a separate model to explain the classifier's decision, and ii) unify the two tasks simultaneously using a single model. Additionally, for this latter approach, we also investigated the performance of recent conversational LLMs when using in-context learning. Our natural language explanations enable clinicians to interpret the models' decisions based on validated symptoms, enhancing trust in the automated process. We evaluate our approach using recent symptom-based datasets, employing both offline and expert-in-the-loop metrics to assess the quality of the explanations generated by our models. The experimental results show that it is possible to achieve good classification results while generating interpretable symptom-based explanations.
Abstract
Can we design artificial intelligence (AI) systems that rank our social media feeds to consider democratic values such as mitigating partisan animosity as part of their objective functions? We introduce a method for translating established, vetted social scientific constructs into AI objective functions, which we term societal objective functions, and demonstrate the method with application to the political science construct of anti-democratic attitudes. Traditionally, we have lacked observable outcomes to use to train such models; however, the social sciences have developed survey instruments and qualitative codebooks for these constructs, and their precision facilitates translation into detailed prompts for large language models. We apply this method to create a democratic attitude model that estimates the extent to which a social media post promotes anti-democratic attitudes, and test this democratic attitude model across three studies. In Study 1, we first test the attitudinal and behavioral effectiveness of the intervention among US partisans (N=1,380) by manually annotating (alpha=.895) social media posts with anti-democratic attitude scores and testing several feed ranking conditions based on these scores. Removal (d=.20) and downranking feeds (d=.25) reduced participants' partisan animosity without compromising their experience and engagement. In Study 2, we scale up the manual labels by creating the democratic attitude model, finding strong agreement with manual labels (rho=.75). Finally, in Study 3, we replicate Study 1 using the democratic attitude model instead of manual labels to test its attitudinal and behavioral impact (N=558), and again find that the feed downranking using the societal objective function reduced partisan animosity (d=.25). This method presents a novel strategy to draw on social science theory and methods to mitigate societal harms in social media AIs.
Abstract
Identifying the frames of news is important to understand the articles' vision, intention, message to be conveyed, and which aspects of the news are emphasized. Framing is a widely studied concept in journalism, and has emerged as a new topic in computing, with the potential to automate processes and facilitate the work of journalism professionals. In this paper, we study this issue with articles related to the Covid-19 anti-vaccine movement. First, to understand the perspectives used to treat this theme, we developed a protocol for human labeling of frames for 1786 headlines of No-Vax movement articles of European newspapers from 5 countries. Headlines are key units in the written press, and worthy of analysis as many people only read headlines (or use them to guide their decision for further reading). Second, considering advances in Natural Language Processing (NLP) with large language models, we investigated two approaches for frame inference of news headlines: first with a GPT-3.5 fine-tuning approach, and second with GPT-3.5 prompt engineering. Our work contributes to the study and analysis of the performance that these models have to facilitate journalistic tasks like classification of frames, while understanding whether the models are able to replicate human perception in the identification of these frames.
Abstract
Suicide, a serious public health concern affecting millions of individuals worldwide, refers to the intentional act of ending one's own life. Mental health issues such as depression, frustration, and hopelessness can directly or indirectly influence the emergence of suicidal thoughts. Early identification of these thoughts is crucial for timely diagnosis. In recent years, advances in artificial intelligence (AI) and natural language processing (NLP) have paved the way for revolutionizing mental health support and education. In this proof-of-concept study, we have created MindWatch, a cutting-edge tool that harnesses the power of AI-driven language models to serve as a valuable computer-aided system for mental health professionals, with two important goals: early symptom detection and personalized psychoeducation. We utilized ALBERT and Bio-Clinical BERT language models and fine-tuned them with the Reddit dataset to build the classifiers. We evaluated the performance of bi-LSTM, ALBERT, Bio-Clinical BERT, OpenAI GPT-3.5 (via prompt engineering), and an ensembled voting classifier to detect suicide ideation. For personalized psychoeducation, we used the state-of-the-art Llama 2 foundation model leveraging prompt engineering. The tool is developed in the Amazon Web Services environment. All models performed exceptionally well, with accuracy and precision/recall greater than 92%. ALBERT performed better (AUC=.98) compared to the zero-shot classification accuracies obtained from OpenAI GPT-3.5 Turbo (ChatGPT) on hidden datasets (AUC=.91). Furthermore, we observed that the inconclusiveness rate of the Llama 2 model is low when tested on a few examples. This study emphasizes how transformer models can help provide customized psychoeducation to individuals dealing with mental health issues.
By tailoring content to address their unique mental health conditions, treatment choices, and self-help resources, this approach empowers individuals to actively engage in their recovery journey. Additionally, these models have the potential to advance the automated detection of depressive disorders.
Documents
- Exploring EFL Students' Prompt Engineering in Human-AI Story Writing: An Activity Theory Perspective
- Co-audit: Tools to Help Humans Double-Check AI-Generated Content
- Game of Tones: Faculty Detection of GPT-4 Generated Content in University Assessments
- Supercharging Academic Writing with Generative AI: Framework, Techniques, and Caveats
- Cases of EFL Secondary Students' Prompt Engineering Pathways to Complete a Writing Task with ChatGPT
Abstract
This study applies Activity Theory to investigate how English as a foreign language (EFL) students prompt generative artificial intelligence (AI) tools during short story writing. Sixty-seven Hong Kong secondary school students created generative-AI tools using open-source language models and wrote short stories with them. The study collected and analyzed the students' generative-AI tools, short stories, and written reflections on their conditions or purposes for prompting. The research identified three main themes regarding the purposes for which students prompt generative-AI tools during short story writing: a lack of awareness of purposes, overcoming writer's block, and developing, expanding, and improving the story. The study also identified common characteristics of students' activity systems, including the sophistication of their generative-AI tools, the quality of their stories, and their school's overall academic achievement level, for their prompting of generative-AI tools for the three purposes during short story writing. The study's findings suggest that teachers should be aware of students' purposes for prompting generative-AI tools to provide tailored instructions and scaffolded guidance. The findings may also help designers provide differentiated instructions for users at various levels of story development when using a generative-AI tool.
Abstract
Users are increasingly being warned to check AI-generated content for correctness. Still, as LLMs (and other generative models) generate more complex output, such as summaries, tables, or code, it becomes harder for the user to audit or evaluate the output for quality or correctness. Hence, we are seeing the emergence of tool-assisted experiences to help the user double-check a piece of AI-generated content. We refer to these as co-audit tools. Co-audit tools complement prompt engineering techniques: one helps the user construct the input prompt, while the other helps them check the output response. As a specific example, this paper describes recent research on co-audit tools for spreadsheet computations powered by generative models. We explain why co-audit experiences are essential for any application of generative AI where quality is important and errors are consequential (as is common in spreadsheet computations). We propose a preliminary list of principles for co-audit, and outline research challenges.
Abstract
This study explores the robustness of university assessments against the use of OpenAI's Generative Pre-trained Transformer 4 (GPT-4) generated content and evaluates the ability of academic staff to detect its use when supported by the Turnitin Artificial Intelligence (AI) detection tool. The research involved twenty-two GPT-4 generated submissions being created and included in the assessment process to be marked by fifteen different faculty members. The study reveals that although the detection tool identified 91% of the experimental submissions as containing some AI-generated content, the total detected content was only 54.8%. This suggests that the use of adversarial prompt engineering techniques is an effective method of evading AI detection tools and highlights that improvements to AI detection software are needed. Using the Turnitin AI detection tool, faculty reported 54.5% of the experimental submissions to the academic misconduct process, suggesting the need for increased awareness of and training in these tools. Genuine submissions received a mean score of 54.4, whereas AI-generated content scored 52.3, indicating the comparable performance of GPT-4 in real-life situations. Recommendations include adjusting assessment strategies to make them more resistant to the use of AI tools, using AI-inclusive assessment where possible, and providing comprehensive training programs for faculty and students. This research contributes to understanding the relationship between AI-generated content and academic assessment, urging further investigation to preserve academic integrity.
Abstract
Academic writing is an indispensable yet laborious part of the research enterprise. This Perspective maps out principles and methods for using generative artificial intelligence (AI), specifically large language models (LLMs), to elevate the quality and efficiency of academic writing. We introduce a human-AI collaborative framework that delineates the rationale (why), process (how), and nature (what) of AI engagement in writing. The framework pinpoints both short-term and long-term reasons for engagement and their underlying mechanisms (e.g., cognitive offloading and imaginative stimulation). It reveals the role of AI throughout the writing process, conceptualized through a two-stage model for human-AI collaborative writing, and the nature of AI assistance in writing, represented through a model of writing-assistance types and levels. Building on this framework, we describe effective prompting techniques for incorporating AI into the writing routine (outlining, drafting, and editing) as well as strategies for maintaining rigorous scholarship, adhering to varied journal policies, and avoiding overreliance on AI. Ultimately, the prudent integration of AI into academic writing can ease the communication burden, empower authors, accelerate discovery, and promote diversity in science.
Abstract
ChatGPT is a state-of-the-art (SOTA) chatbot. Although it has potential to support English as a foreign language (EFL) students' writing, to effectively collaborate with it, a student must learn to engineer prompts, that is, the skill of crafting appropriate instructions so that ChatGPT produces desired outputs. However, writing an appropriate prompt for ChatGPT is not straightforward for non-technical users who suffer a trial-and-error process. This paper examines the content of EFL students' ChatGPT prompts when completing a writing task and explores patterns in the quality and quantity of the prompts. The data come from iPad screen recordings of secondary school EFL students who used ChatGPT and other SOTA chatbots for the first time to complete the same writing task. The paper presents a case study of four distinct pathways that illustrate the trial-and-error process and show different combinations of prompt content and quantity. The cases contribute evidence for the need to provide prompt engineering education in the context of the EFL writing classroom, if students are to move beyond an individual trial-and-error process, learning a greater variety of prompt content and more sophisticated prompts to support their writing.
Documents
- effective test generation using pretrained large language models and mutation testing
- a study on prompt design, advantages and limitations of chatgpt for deep learning program repair
- acecoder utilizing existing code to enhance code generation
- structured chainofthought prompting for code generation
- fixing hardware security bugs with large language models
Abstract
One of the critical phases in software development is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code to generate unit tests. While the code coverage of generated tests was usually assessed, the literature has acknowledged that the coverage is weakly correlated with the efficiency of tests in bug detection. To improve over this limitation, in this paper, we introduce MuTAP for improving the effectiveness of test cases generated by LLMs in terms of revealing bugs by leveraging mutation testing. Our goal is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MuTAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs). We employ different LLMs within MuTAP and evaluate their performance on different benchmarks. Our results show that our proposed method is able to detect up to 28% more faulty human-written code snippets. Among these, 17% remained undetected by both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57% on synthetic buggy code, outperforming all other approaches in our evaluation. Our findings suggest that although LLMs can serve as a useful tool to generate test cases, they require specific post-processing steps to enhance the effectiveness of the generated test cases, which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and testing corner cases of PUTs.
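The mutant-augmentation step described above can be sketched as a simple prompt builder. This is an illustrative reconstruction, not MuTAP's actual code or prompt wording: surviving mutants, i.e. mutated copies of the program under test (PUT) that the current suite fails to kill, are appended so the next LLM call targets exactly the behaviors the suite misses.

```python
# Hypothetical sketch of mutation-guided prompt augmentation (names and
# prompt text are invented for illustration, not MuTAP's own format).

def augment_prompt(put_source: str, test_suite: str, surviving_mutants: list) -> str:
    """Build a follow-up prompt asking for tests that kill surviving mutants."""
    parts = [
        "Program under test:",
        put_source,
        "Current test suite:",
        test_suite,
        "The mutants below survive the suite. Write new tests that kill them:",
    ]
    for i, mutant in enumerate(surviving_mutants, start=1):
        parts.append(f"# Mutant {i}\n{mutant}")
    return "\n\n".join(parts)

prompt = augment_prompt(
    "def add(a, b):\n    return a + b",
    "assert add(2, 2) == 4",
    ["def add(a, b):\n    return a * b"],  # survives: add(2, 2) == 2 * 2
)
```

The chosen toy mutant shows why coverage alone is weak: the existing assertion passes on the mutant, so only a new, mutant-aware test can expose it.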
Abstract
ChatGPT has revolutionized many research and industrial fields. ChatGPT has shown great potential in software engineering to boost various traditional tasks such as program repair, code understanding, and code generation. However, whether automatic program repair (APR) applies to deep learning (DL) programs is still unknown. DL programs, whose decision logic is not explicitly encoded in the source code, have posed unique challenges to APR. To repair DL programs, an APR approach needs not only to parse the source code syntactically but also to understand the code's intention. With the best prior work, the performance of fault localization is still far from satisfactory (only about 30%). Therefore, in this paper, we explore ChatGPT's capability for DL program repair by asking three research questions. (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT's repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? On top of that, we categorize the common aspects useful for prompt design for DL program repair. Also, we propose various prompt templates to facilitate the performance and summarize the advantages and disadvantages of ChatGPT's abilities such as detecting bad code smells, code refactoring, and detecting API misuse/deprecation.
Abstract
Large Language Models (LLMs) have shown great success in code generation. LLMs take a prompt as input and output the code. A key question is how to make prompts (i.e., Prompting Techniques). Existing prompting techniques are designed for natural language generation and have low accuracy in code generation. In this paper, we propose a new prompting technique named AceCoder. Our motivation is that code generation meets two unique challenges (i.e., requirement understanding and code implementation). AceCoder contains two novel mechanisms (i.e., guided code generation and example retrieval) to solve these challenges. (1) Guided code generation asks LLMs first to analyze requirements and output an intermediate preliminary (e.g., test cases). The preliminary is used to clarify requirements and tell LLMs "what to write". (2) Example retrieval selects similar programs as examples in prompts, which provide lots of relevant content (e.g., algorithms, APIs) and teach LLMs "how to write". We apply AceCoder to three LLMs (e.g., Codex) and evaluate it on three public benchmarks using Pass@k. Results show that AceCoder can significantly improve the performance of LLMs on code generation. (1) In terms of Pass@1, AceCoder outperforms the state-of-the-art baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. (2) AceCoder is effective in LLMs with different sizes (i.e., 6B to 13B) and different languages (i.e., Python, Java, and JavaScript). (3) Human evaluation shows human developers prefer programs from AceCoder.
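The example-retrieval mechanism above can be illustrated with a minimal sketch. Real systems use stronger retrievers; plain Jaccard word overlap stands in here, and the corpus entries are invented:

```python
# Toy example retrieval: pick programs whose requirements are lexically
# similar to the new requirement, to be placed in the prompt as examples.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_examples(requirement: str, corpus: list, k: int = 1) -> list:
    """Return the k corpus entries whose requirement best matches."""
    return sorted(corpus, key=lambda ex: jaccard(requirement, ex["req"]),
                  reverse=True)[:k]

corpus = [
    {"req": "reverse a string", "code": "def rev(s): return s[::-1]"},
    {"req": "sum a list of numbers", "code": "def total(xs): return sum(xs)"},
]
best = retrieve_examples("reverse the given string", corpus)
```

The retrieved entries would then be prepended to the prompt alongside the intermediate preliminary (e.g., generated test cases).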
Abstract
Large Language Models (LLMs) (e.g., ChatGPT) have shown impressive performance in code generation. LLMs take prompts as inputs, and Chain-of-Thought (CoT) prompting is the state-of-the-art prompting technique. CoT prompting asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the code. However, CoT prompting is designed for natural language generation and has low accuracy in code generation. In this paper, we propose Structured CoTs (SCoTs) and present a novel prompting technique for code generation, named SCoT prompting. Our motivation is that source code contains rich structural information and any code can be composed of three program structures (i.e., sequence, branch, and loop structures). Intuitively, structured intermediate reasoning steps make for structured source code. Thus, we ask LLMs to use program structures to build CoTs, obtaining SCoTs. Then, LLMs generate the final code based on SCoTs. Compared to CoT prompting, SCoT prompting explicitly constrains LLMs to think about how to solve requirements from the view of source code, further improving the performance of LLMs in code generation. We apply SCoT prompting to two LLMs (i.e., ChatGPT and Codex) and evaluate it on three benchmarks (i.e., HumanEval, MBPP, and MBCPP). (1) SCoT prompting outperforms the state-of-the-art baseline (CoT prompting) by up to 13.79% in Pass@1. (2) Human evaluation shows human developers prefer programs from SCoT prompting. (3) SCoT prompting is robust to examples and achieves substantial improvements.
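A prompt in this style can be sketched as follows; the wording is illustrative and not the paper's exact template, but it shows the key constraint of reasoning through the three program structures before emitting code:

```python
# Hypothetical SCoT-style prompt builder: the model must draft its plan
# using only sequence, branch, and loop structures, then write final code.

def scot_prompt(requirement: str) -> str:
    return "\n".join([
        f"Requirement: {requirement}",
        "",
        "First write a structured chain of thought using only these program structures:",
        "- sequence: straight-line steps",
        "- branch: if/else decisions",
        "- loop: for/while iteration",
        "",
        "Then generate the final code from that structured plan.",
    ])

p = scot_prompt("return the largest element of a non-empty list")
```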
Abstract
Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work we consider how LLMs may be leveraged to automatically repair security relevant bugs present in hardware designs. We focus on bug repair in code written in the Hardware Description Language Verilog. For this study we build a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identifying the best parameters for the LLM. We show that an ensemble of LLMs can repair all ten of our benchmarks. This ensemble outperforms the state-of-the-art Cirfix hardware bug repair tool on its own suite of bugs. These results show that LLMs can repair hardware security bugs and the framework is an important step towards the ultimate goal of an automated end-to-end bug repair framework.
Documents
- multitask pretraining of modular prompt for chinese fewshot learning
- prompting electra fewshot learning with discriminative pretrained models
- differentiable entailment for parameter efficient few shot learning
- discrete and soft prompting for multilingual models
- fewshot learning for sentence pair classification and its applications in software engineering
Abstract
Prompt tuning is a parameter-efficient approach to adapting pre-trained language models to downstream tasks. Although prompt tuning has been shown to match the performance of full model tuning when training data is sufficient, it tends to struggle in few-shot learning settings. In this paper, we present Multi-task Pre-trained Modular Prompt (MP2) to boost prompt tuning for few-shot learning. MP2 is a set of combinable prompts pre-trained on 38 Chinese tasks. On downstream tasks, the pre-trained prompts are selectively activated and combined, leading to strong compositional generalization to unseen tasks. To bridge the gap between pre-training and fine-tuning, we formulate upstream and downstream tasks into a unified machine reading comprehension task. Extensive experiments under two learning paradigms, i.e., gradient descent and black-box tuning, show that MP2 significantly outperforms prompt tuning, full model tuning, and prior prompt pre-training methods in few-shot settings. In addition, we demonstrate that MP2 can achieve surprisingly fast and strong adaptation to downstream tasks by merely learning 8 parameters to combine the pre-trained modular prompts.
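The "merely learning 8 parameters" adaptation can be made concrete with a toy sketch: each modular prompt is a fixed vector, and a downstream task learns only one scalar weight per prompt, mixed through a softmax. The dimensions and values below are made up for illustration.

```python
# Toy softmax-weighted combination of pre-trained modular prompt vectors:
# only the per-prompt weights (one scalar each) are learnable downstream.
import math

def combine_prompts(prompts, weights):
    """Softmax over `weights`, then weighted sum of the prompt vectors."""
    exps = [math.exp(w) for w in weights]
    z = sum(exps)
    probs = [e / z for e in exps]
    dim = len(prompts[0])
    return [sum(p * vec[d] for p, vec in zip(probs, prompts)) for d in range(dim)]

# Two 2-dimensional "prompts" with equal weights mix to their average.
mixed = combine_prompts([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With 8 modular prompts this amounts to optimizing exactly 8 scalars per task, which is why adaptation can be so fast.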
Abstract
Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.
Abstract
Few-shot learning allows pre-trained language models to adapt to downstream tasks while using a limited number of training examples. However, practical applications are limited when all model parameters must be optimized. In this work we apply a new technique for parameter-efficient few-shot learning while adopting a strict definition of parameter efficiency. Our training method combines 1) intermediate training by reformulating natural language tasks as entailment tasks \cite{wang_entailment_2021} and 2) differentiable optimization of template and label tokens \cite{zhang_differentiable_2021}. We quantify the tradeoff between parameter efficiency and performance in the few-shot regime and propose a simple, model-agnostic approach that can be extended to any task. By achieving competitive performance while only optimizing 3\% of a model's parameters and allowing for batched inference, we allow for more efficient practical deployment of models.
Abstract
It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting perform better than finetuning in multilingual cases: Crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning obtains 33.74% accuracy in crosslingual transfer, barely surpassing the majority baseline (33.33%). In contrast, discrete and soft prompting outperform finetuning, achieving 36.43% and 38.79%. We also demonstrate good performance of prompting with training data in multiple languages other than English.
Abstract
Few-shot learning, the ability to train models with access to limited data, has become increasingly popular in the natural language processing (NLP) domain, as large language models such as GPT and T0 have been empirically shown to achieve high performance in numerous tasks with access to just a handful of labeled examples. Smaller language models such as BERT and its variants have also been shown to achieve strong performance with just a handful of labeled examples when combined with few-shot learning algorithms like pattern-exploiting training (PET) and SetFit. The focus of this work is to investigate the performance of alternative few-shot learning approaches with BERT-based models. Specifically, vanilla fine-tuning, PET and SetFit are compared for numerous BERT-based checkpoints over an array of training set sizes. To facilitate this investigation, applications of few-shot learning are considered in software engineering. For each task, high-performance techniques and their associated model checkpoints are identified through detailed empirical analysis. Our results establish PET as a strong few-shot learning approach, and our analysis shows that with just a few hundred labeled examples it can achieve performance near that of fine-tuning on full-sized data sets.
Documents
- legoprover neural theorem proving with growing libraries
- longllmlingua accelerating and enhancing llms in long context scenarios via prompt compression
- memoryefficient finetuning of compressed large language models via sub4bit integer quantization
- tool documentation enables zeroshot toolusage with large language models
- tcrallm token compression retrieval augmented large language model for inference cost reduction
Abstract
Despite the success of large language models (LLMs), the task of theorem proving still remains one of the hardest reasoning tasks that is far from being fully solved. Prior methods using language models have demonstrated promising results, but they still struggle to prove even middle school level theorems. One common limitation of these methods is that they assume a fixed theorem library during the whole theorem proving process. However, as we all know, creating new useful theorems or even new theories is not only helpful but crucial and necessary for advancing mathematics and proving harder and deeper results. In this work, we present LEGO-Prover, which employs a growing skill library containing verified lemmas as skills to augment the capability of LLMs used in theorem proving. By constructing the proof modularly, LEGO-Prover enables LLMs to utilize existing skills retrieved from the library and to create new skills during the proving process. These skills are further evolved (by prompting an LLM) to enrich the library on another scale. Modular and reusable skills are constantly added to the library to enable tackling increasingly intricate mathematical problems. Moreover, the learned library further bridges the gap between human proofs and formal proofs by making it easier to impute missing steps. LEGO-Prover advances the state-of-the-art pass rate on miniF2F-valid (48.0% to 57.0%) and miniF2F-test (45.5% to 47.1%). During the proving process, LEGO-Prover also manages to generate over 20,000 skills (theorems/lemmas) and adds them to the growing library. Our ablation study indicates that these newly added skills are indeed helpful for proving theorems, resulting in an improvement from a success rate of 47.1% to 50.4%. We also release our code and all the generated skills.
Abstract
In long context scenarios, large language models (LLMs) face three main challenges: higher computational/financial cost, longer latency, and inferior performance. Some studies reveal that the performance of LLMs depends on both the density and the position of the key (question-relevant) information in the input prompt. Inspired by these findings, we propose LongLLMLingua for prompt compression towards improving LLMs' perception of the key information to simultaneously address the three challenges. We conduct evaluation on a wide range of long context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. The experimental results show that the LongLLMLingua compressed prompt can derive higher performance with much less cost. The latency of the end-to-end system is also reduced. For example, on the NaturalQuestions benchmark, LongLLMLingua gains a performance boost of up to 17.1% over the original prompt with ~4x fewer tokens as input to GPT-3.5-Turbo. It can derive cost savings of $28.5 and $27.4 per 1,000 samples from the LongBench and ZeroScrolls benchmarks, respectively. Additionally, when compressing prompts of ~10k tokens at a compression rate of 2x-10x, LongLLMLingua can speed up the end-to-end latency by 1.4x-3.8x. Our code is available at https://aka.ms/LLMLingua.
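The question-aware, coarse-grained side of this idea can be sketched with a toy filter: score each context document by its relevance to the question and keep only the best ones. LongLLMLingua itself uses model-based (perplexity-style) importance scores, not word overlap, so the snippet below is only an analogy with invented data.

```python
# Toy coarse-grained prompt compression: rank documents by lexical overlap
# with the question, then keep the top `keep` documents.

def compress_context(question: str, docs: list, keep: int = 1) -> list:
    q = set(question.lower().split())
    def score(doc: str) -> int:
        return len(q & set(doc.lower().split()))
    return sorted(docs, key=score, reverse=True)[:keep]

docs = [
    "The capital of France is Paris.",
    "Bananas are rich in potassium.",
]
kept = compress_context("what is the capital of france", docs, keep=1)
```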
Abstract
Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer state during fine-tuning, the inherent size of pre-trained LLM weights continues to be a pressing concern. Even though quantization techniques are widely proposed to ease memory demands and accelerate LLM inference, most of these techniques are geared towards the deployment phase. To bridge this gap, this paper presents Parameter-Efficient and Quantization-aware Adaptation (PEQA), a simple yet effective method that combines the advantages of PEFT with quantized LLMs. By updating solely the quantization scales, PEQA can be directly applied to quantized LLMs, ensuring seamless task transitions. Parallel to existing PEFT methods, PEQA significantly reduces the memory overhead associated with the optimizer state. Furthermore, it leverages the advantages of quantization to substantially reduce model sizes. Even after fine-tuning, the quantization structure of a PEQA-tuned LLM remains intact, allowing for accelerated inference at the deployment stage. We employ PEQA-tuning for task-specific adaptation on LLMs with up to 65 billion parameters. To assess the logical reasoning and language comprehension of PEQA-tuned LLMs, we fine-tune low-bit quantized LLMs using an instruction dataset. Our results show that even when LLMs are quantized to below 4-bit precision, their capabilities in language modeling, few-shot in-context learning, and comprehension can be resiliently restored to (or even improved over) their full-precision original performances with PEQA.
Abstract
Today, large language models (LLMs) are taught to use new tools by providing a few demonstrations of the tool's usage. Unfortunately, demonstrations are hard to acquire, and can result in undesirable biased usage if the wrong demonstration is chosen. Even in the rare scenario that demonstrations are readily available, there is no principled selection protocol to determine how many and which ones to provide. As tasks grow more complex, the selection search grows combinatorially and invariably becomes intractable. Our work provides an alternative to demonstrations: tool documentation. We advocate the use of tool documentation, descriptions for the individual tool usage, over demonstrations. We substantiate our claim through three main empirical findings on 6 tasks across both vision and language modalities. First, on existing benchmarks, zero-shot prompts with only tool documentation are sufficient for eliciting proper tool usage, achieving performance on par with few-shot prompts. Second, on a newly collected realistic tool-use dataset with hundreds of available tool APIs, we show that tool documentation is significantly more valuable than demonstrations, with zero-shot documentation significantly outperforming few-shot without documentation. Third, we highlight the benefits of tool documentations by tackling image generation and video tracking using just-released unseen state-of-the-art models as tools. Finally, we highlight the possibility of using tool documentation to automatically enable new applications: by using nothing more than the documentation of GroundingDino, Stable Diffusion, XMem, and SAM, LLMs can re-invent the functionalities of the just-released Grounded-SAM and Track Anything models.
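A zero-shot documentation prompt of the kind described above can be sketched as follows; the tool names and docstrings are invented for illustration, and the instruction wording is not the paper's own:

```python
# Toy zero-shot tool-use prompt built purely from documentation:
# one line of docs per tool, no demonstrations.

def tool_prompt(question: str, tools: dict) -> str:
    lines = ["You may call the following tools; answer with TOOL_NAME(args).", ""]
    for name, doc in sorted(tools.items()):
        lines.append(f"- {name}: {doc}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)

prompt = tool_prompt(
    "How many people appear in photo.jpg?",
    {
        "detect_objects": "detect_objects(image, label) -> list of bounding boxes",
        "count": "count(items) -> number of items",
    },
)
```

Because each tool carries its own description, adding a new tool means adding one documentation line rather than curating new demonstrations.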
Abstract
Since ChatGPT released its API for public use, the number of applications built on top of commercial large language models (LLMs) has increased exponentially. One popular usage of such models is leveraging their in-context learning ability to generate responses to user queries using knowledge obtained by retrieval augmentation. One problem of deploying commercial retrieval-augmented LLMs is the cost due to the additionally retrieved context that largely increases the input token size of the LLMs. To mitigate this, we propose a token compression scheme that includes two methods: summarization compression and semantic compression. The first method applies a T5-based model, fine-tuned on datasets generated using self-instruct and containing samples of varying lengths, to reduce token size by summarization. The second method further compresses the token size by removing words with lower impact on the semantics. In order to adequately evaluate the effectiveness of the proposed methods, we propose and utilize a dataset called Food-Recommendation DB (FRDB) focusing on food recommendation for women around the pregnancy period or for infants. Our summarization compression can reduce the retrieval token size by 65% with a further 0.3% improvement in accuracy; semantic compression provides a more flexible way to trade off token size against performance, by which we can reduce the token size by 20% with only a 1.6% accuracy drop.
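The semantic-compression step can be illustrated with a toy version: drop words whose removal least changes the meaning. Here "low impact" is approximated by a small hand-written stopword list; the method described above is model-based, so this is only a stand-in.

```python
# Toy semantic compression: strip low-impact words (approximated by
# stopwords) from the retrieved context before it enters the prompt.

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "that"}

def semantic_compress(text: str) -> str:
    kept = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

out = semantic_compress("the cost of the retrieved context is a problem")
```

Even this crude filter shows the trade-off: fewer input tokens at the risk of discarding words that carried more meaning than the heuristic assumed.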
Documents
- fewshot event detection an empirical study and a unified view
- fewshot multimodal sentiment analysis based on multimodal probabilistic fusion prompts
- instanceaware prompt learning for language understanding and generation
- prefer prompt ensemble learning via feedbackreflectrefine
- fewshot stance detection via targetaware prompt distillation
Abstract
Few-shot event detection (ED) has been widely studied, but this has brought noticeable discrepancies, e.g., various motivations, tasks, and experimental settings, that hinder the understanding of models and future progress. This paper presents a thorough empirical study, a unified view of ED models, and a better unified baseline. For fair evaluation, we compare 12 representative methods on three datasets, which are roughly grouped into prompt-based and prototype-based models for detailed analysis. Experiments consistently demonstrate that prompt-based methods, including ChatGPT, still significantly trail prototype-based methods in terms of overall performance. To investigate their superior performance, we break down their design elements along several dimensions and build a unified framework on prototype-based methods. Under such a unified view, each prototype-based method can be viewed as a combination of different modules from these design elements. We further combine all advantageous modules and propose a simple yet effective baseline, which outperforms existing methods by a large margin (e.g., 2.7% F1 gains under the low-resource setting).
Abstract
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media. However, existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect. Thus, there is a need to address the challenge of few-shot multimodal sentiment analysis. To tackle this problem, we propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario. Specifically, we start by introducing a Consistently Distributed Sampling approach called CDS, which ensures that the few-shot dataset has the same category distribution as the full dataset. Unlike previous approaches primarily using prompts based on the text modality, we design unified multimodal prompts to reduce discrepancies between different modalities and dynamically incorporate multimodal demonstrations into the context of each multimodal instance. To enhance the model's robustness, we introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for each input. Our extensive experiments on six datasets demonstrate the effectiveness of our approach. First, our method outperforms strong baselines in the multimodal few-shot setting. Furthermore, under the same amount of data (1% of the full dataset), our CDS-based experimental results significantly outperform those based on previously sampled datasets constructed from the same number of instances of each class.
Abstract
Recently, prompt learning has become a new paradigm to utilize pre-trained language models (PLMs) and achieves promising results in downstream tasks with a negligible increase of parameters. The current usage of discrete and continuous prompts assumes that the prompt is fixed for a specific task and all samples in the task share the same prompt. However, a task may contain quite diverse samples in which some are easy and others are difficult, and diverse prompts are desirable. In this paper, we propose an instance-aware prompt learning method that learns a different prompt for each instance. Specifically, we suppose that each learnable prompt token has a different contribution to different instances, and we learn the contribution by calculating the relevance score between an instance and each prompt token. The contribution-weighted prompt would be instance aware. We apply our method to both unidirectional and bidirectional PLMs on both language understanding and generation tasks. Extensive experiments demonstrate that our method obtains considerable improvements compared to strong baselines. Especially, our method achieves the state-of-the-art on the SuperGLUE few-shot learning benchmark.
Abstract
As an effective tool for eliciting the power of Large Language Models (LLMs), prompting has recently demonstrated unprecedented abilities across a variety of complex tasks. To further improve the performance, prompt ensemble has attracted substantial interest for tackling the hallucination and instability of LLMs. However, existing methods usually adopt a two-stage paradigm, which requires a pre-prepared set of prompts with substantial manual effort, and is unable to perform directed optimization for different weak learners. In this paper, we propose a simple, universal, and automatic method named PREFER (Prompt Ensemble learning via Feedback-Reflect-Refine) to address the stated limitations. Specifically, given the fact that weak learners are supposed to focus on hard examples during boosting, PREFER builds a feedback mechanism for reflecting on the inadequacies of existing weak learners. Based on this, the LLM is required to automatically synthesize new prompts for iterative refinement. Moreover, to enhance stability of the prompt effect evaluation, we propose a novel prompt bagging method involving forward and backward thinking, which is superior to majority voting and is beneficial for both feedback and weight calculation in boosting. Extensive experiments demonstrate that our PREFER achieves state-of-the-art performance in multiple types of tasks by a significant margin. We have made our code publicly available.
Abstract
Stance detection aims to identify whether the author of a text is in favor of, against, or neutral to a given target. The main challenge of this task is two-fold: few-shot learning resulting from the varying targets, and the lack of contextual information about the targets. Existing works mainly focus on solving the second issue by designing attention-based models or introducing noisy external knowledge, while the first issue remains under-explored. In this paper, inspired by the potential capability of pre-trained language models (PLMs) serving as knowledge bases and few-shot learners, we propose to introduce prompt-based fine-tuning for stance detection. PLMs can provide essential contextual information for the targets and enable few-shot learning via prompts. Considering the crucial role of the target in the stance detection task, we design target-aware prompts and propose a novel verbalizer. Instead of mapping each label to a concrete word, our verbalizer maps each label to a vector and picks the label that best captures the correlation between the stance and the target. Moreover, to alleviate the possible defect of dealing with varying targets with a single hand-crafted prompt, we propose to distill the information learned from multiple prompts. Experimental results show the superior performance of our proposed model in both full-data and few-shot scenarios.
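The label-to-vector verbalizer idea can be sketched in a few lines: each label owns a vector, and prediction picks the label whose vector correlates best (here via a dot product) with the model's stance representation. The vectors below are tiny made-up stand-ins for learned embeddings, not the paper's actual parameterization.

```python
# Toy vector verbalizer: score each label vector against the stance
# representation and return the best-correlated label.

def pick_label(stance_vec, label_vectors):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return max(label_vectors, key=lambda lab: dot(stance_vec, label_vectors[lab]))

labels = {
    "favor":   [1.0, 0.0],
    "against": [-1.0, 0.0],
    "neutral": [0.0, 1.0],
}
pred = pick_label([0.9, 0.1], labels)
```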
Documents
- understanding stereotypes in language models towards robust measurement and zeroshot debiasing
- optr exploring the role of explanations in finetuning and prompting for reasoning skills of large language models
- casteist but not racist quantifying disparities in large language model bias between india and the west
- beyond task performance evaluating and reducing the flaws of large multimodal models with incontext learning
- how are prompts different in terms of sensitivity
Abstract
Generated texts from large pretrained language models have been shown to exhibit a variety of harmful, human-like biases about various demographics. These findings prompted large efforts aiming to understand and measure such effects, with the goal of providing benchmarks that can guide the development of techniques mitigating these stereotypical associations. However, as recent research has pointed out, the current benchmarks lack a robust experimental setup, consequently hindering the inference of meaningful conclusions from their evaluation metrics. In this paper, we extend these arguments and demonstrate that existing techniques and benchmarks aiming to measure stereotypes tend to be inaccurate and consist of a high degree of experimental noise that severely limits the knowledge we can gain from benchmarking language models based on them. Accordingly, we propose a new framework for robustly measuring and quantifying biases exhibited by generative language models. Finally, we use this framework to investigate GPT-3's occupational gender bias and propose prompting techniques for mitigating these biases without the need for fine-tuning.
Abstract
We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as a representative of such models. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the Super-NaturalInstructions benchmark, covering 26 distinct reasoning skills, utilizing three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations on different reasoning skills. Our findings reveal that having explanations in the few-shot exemplar has no significant impact on the model’s performance when the model is finetuned, while positively affecting the non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as we incorporate explanations during prompting and finetuning. Finally, we offer insights on which reasoning skills benefit the most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as skills that exhibit negligible or negative effects.
Abstract
Large Language Models (LLMs), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame and compare bias levels between the Indian and Western contexts. To do this, we develop a novel dataset which we call Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially as compared to the Western context. We finally investigate Instruction Prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5. The findings of this work highlight the need for including more diverse voices when evaluating LLMs.
Abstract
Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to which extent such models are aligned to human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm and propose the EvALign-ICL framework, in which we (1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture, such as OpenFlamingo and IDEFICS) on 5 different axes: hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. To efficiently address these problems, and inspired by the success of in-context learning (ICL) in LLMs, (2) we explore ICL as a solution and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL approaches such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows: (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs' flaws is nuanced; despite its effectiveness for improved explainability, abstention, and instruction following, ICL does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://evalign-icl.github.io/
Abstract
In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompts across different models and tasks. To address this gap, we present a comprehensive prompt analysis based on the sensitivity of a function. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding, which incorporates sensitivity estimation as a penalty term in the standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts, and contributes to a better understanding of the mechanism of ICL.
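The penalty-term idea behind sensitivity-aware decoding can be sketched in a few lines (an illustrative simplification, not the paper's code; the function name, penalty weight, and toy scores are assumptions): at each greedy step, a candidate token's log-probability is discounted by its estimated sensitivity before picking the argmax.

```python
# Illustrative sketch (assumed interface, not the paper's implementation):
# one step of greedy decoding where each candidate token's score is its
# log-probability minus a sensitivity penalty.

def sensitivity_aware_pick(logprobs, sensitivities, penalty=0.5):
    """Return the index maximizing logprob - penalty * sensitivity."""
    scores = [lp - penalty * s for lp, s in zip(logprobs, sensitivities)]
    return scores.index(max(scores))

# Token 0 is slightly more likely but far more sensitive to prompt
# perturbations, so the penalized score prefers token 1.
print(sensitivity_aware_pick([-1.0, -1.2], [2.0, 0.1]))  # -> 1
```

With the penalty set to zero this reduces to ordinary greedy decoding; the paper's contribution lies in how the sensitivity estimates themselves are obtained.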
Documents
- prompts matter insights and strategies for prompt engineering in automated software traceability
- retrievalaugmented chainofthought in semistructured domains
- building emotional support chatbots in the era of llms
- leveraging large language models for scalable vector graphicsdriven image understanding
- overprompt enhancing chatgpt capabilities through an efficient incontext learning approach
Abstract
Large Language Models (LLMs) have the potential to revolutionize automated traceability by overcoming the challenges faced by previous methods and introducing new possibilities. However, the optimal utilization of LLMs for automated traceability remains unclear. This paper explores the process of prompt engineering to extract link predictions from an LLM. We provide detailed insights into our approach for constructing effective prompts, offering our lessons learned. Additionally, we propose multiple strategies for leveraging LLMs to generate traceability links, improving upon previous zero-shot methods on the ranking of candidate links after prompt refinement. The primary objective of this paper is to inspire and assist future researchers and engineers by highlighting the process of constructing traceability prompts to effectively harness LLMs for advancing automatic traceability.
Abstract
Applying existing question answering (QA) systems to specialized domains like law and finance presents challenges that necessitate domain expertise. Although large language models (LLMs) have shown impressive language comprehension and in-context learning capabilities, their inability to handle very long inputs/contexts is well known. Tasks specific to these domains need significant background knowledge, leading to contexts that can often exceed the maximum length that existing LLMs can process. This study explores leveraging the semi-structured nature of legal and financial data to efficiently retrieve relevant context, enabling the use of LLMs for domain-specialized QA. The resulting system outperforms contemporary models and also provides useful explanations for the answers, encouraging the integration of LLMs into legal and financial NLP systems for future research.
Abstract
The integration of emotional support into various conversational scenarios presents profound societal benefits, such as social interactions, mental health counseling, and customer service. However, there are unsolved challenges that hinder real-world applications in this field, including limited data availability and the absence of well-accepted model training paradigms. This work endeavors to navigate these challenges by harnessing the capabilities of Large Language Models (LLMs). We introduce an innovative methodology that synthesizes human insights with the computational prowess of LLMs to curate an extensive emotional support dialogue dataset. Our approach is initiated with a meticulously designed set of dialogues spanning diverse scenarios as generative seeds. By utilizing the in-context learning potential of ChatGPT, we recursively generate an ExTensible Emotional Support dialogue dataset, named ExTES. Following this, we deploy advanced tuning techniques on the LLaMA model, examining the impact of diverse training strategies, ultimately yielding an LLM meticulously optimized for emotional support interactions. An exhaustive assessment of the resultant model showcases its proficiency in offering emotional support, marking a pivotal step in the realm of emotional support bots and paving the way for subsequent research and implementations.
Abstract
Recently, large language models (LLMs) have made significant advancements in natural language understanding and generation. However, their potential in computer vision remains largely unexplored. In this paper, we introduce a new, exploratory approach that enables LLMs to process images using the Scalable Vector Graphics (SVG) format. By leveraging the XML-based textual descriptions of SVG representations instead of raster images, we aim to bridge the gap between the visual and textual modalities, allowing LLMs to directly understand and manipulate images without the need for parameterized visual components. Our method facilitates simple image classification, generation, and in-context learning using only LLM capabilities. We demonstrate the promise of our approach across discriminative and generative tasks, highlighting its (i) robustness against distribution shift, (ii) substantial improvements achieved by tapping into the in-context learning abilities of LLMs, and (iii) image understanding and generation capabilities with human guidance. Our code, data, and models can be found here: https://github.com/mu-cai/svg-llm.
Abstract
The exceptional performance of pre-trained large language models has revolutionised various applications, but their adoption in production environments is hindered by prohibitive costs and inefficiencies, particularly when utilising long prompts. This paper proposes OverPrompt, an in-context learning method aimed at improving LLM efficiency and performance by processing multiple inputs in parallel. Evaluated across diverse datasets, OverPrompt enhances task efficiency and integrates a diverse range of examples for improved performance. Particularly, it amplifies fact-checking and sentiment analysis tasks when supplemented with contextual information. Synthetic data grouping further enhances performance, suggesting a viable approach for data augmentation.
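The "multiple inputs in parallel" idea can be sketched as a prompt template (a hypothetical template, not OverPrompt's actual one): several instances are numbered inside a single prompt so one LLM call labels them all, amortizing the instruction and context over the batch.

```python
# Illustrative sketch (hypothetical template, not OverPrompt's code): pack
# several classification inputs into one prompt so a single LLM call can
# label them all.

def build_batched_prompt(task_instruction, inputs):
    lines = [task_instruction, ""]
    for i, text in enumerate(inputs, start=1):
        lines.append(f"{i}. {text}")
    lines.append("")
    lines.append("Answer with one label per line, numbered to match.")
    return "\n".join(lines)

prompt = build_batched_prompt(
    "Classify the sentiment of each review as positive or negative.",
    ["Great battery life.", "Screen died in a week."],
)
print(prompt)
```

The cost saving comes from sending the instruction (and any shared context) once per batch instead of once per input.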
Documents
- generate rather than retrieve large language models are strong context generators
- retrieving supporting evidence for generative question answering
- beyond factuality a comprehensive evaluation of large language models as knowledge generators
- atlas fewshot learning with retrieval augmented language models
- fewshot incontext learning for knowledge base question answering
Abstract
Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts, resulting in generated documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue systems. Notably, GenRead achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by +4.0 and +3.9, without retrieving any documents from any external knowledge source. Lastly, we demonstrate the model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.
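The two-stage generate-then-read pipeline can be sketched as two chained prompts (an illustrative sketch with assumed prompt wording; the released GenRead code at the linked repository is the authoritative version). Here `llm` is any callable mapping a prompt string to a completion string, stubbed for demonstration.

```python
# Illustrative sketch of generate-then-read (assumed prompt wording, not
# the GenRead implementation): first prompt an LLM to write a contextual
# document for the question, then answer conditioned on that document.

def generate_then_read(question, llm):
    doc = llm(f"Generate a background document to answer: {question}")
    answer = llm(f"Document: {doc}\nQuestion: {question}\nAnswer:")
    return answer

# Stub LLM, for demonstration only.
def stub_llm(prompt):
    if prompt.startswith("Generate"):
        return "Paris is the capital of France."
    return "Paris"

print(generate_then_read("What is the capital of France?", stub_llm))  # -> Paris
```

The clustering-based variant in the abstract would call the first stage several times with distinct prompts and read all resulting documents together.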
Abstract
Current large language models (LLMs) can exhibit near-human levels of performance on many natural language-based tasks, including open-domain question answering. Unfortunately, at this time, they also convincingly hallucinate incorrect answers, so that responses to questions must be verified against external sources before they can be accepted at face value. In this paper, we report two simple experiments to automatically validate generated answers against a corpus. We base our experiments on questions and passages from the MS MARCO (V1) test collection, and a retrieval pipeline consisting of sparse retrieval, dense retrieval and neural rerankers. In the first experiment, we validate the generated answer in its entirety. After presenting a question to an LLM and receiving a generated answer, we query the corpus with the combination of the question + generated answer. We then present the LLM with the combination of the question + generated answer + retrieved answer, prompting it to indicate if the generated answer can be supported by the retrieved answer. In the second experiment, we consider the generated answer at a more granular level, prompting the LLM to extract a list of factual statements from the answer and verifying each statement separately. We query the corpus with each factual statement and then present the LLM with the statement and the corresponding retrieved evidence. The LLM is prompted to indicate if the statement can be supported and make necessary edits using the retrieved material. With an accuracy of over 80%, we find that an LLM is capable of verifying its generated answer when a corpus of supporting material is provided. However, manual assessment of a random sample of questions reveals that incorrect generated answers are missed by this verification process. While this verification process can reduce hallucinations, it can not entirely eliminate them.
Abstract
Large language models (LLMs) outperform information retrieval techniques for downstream knowledge-intensive tasks when being prompted to generate world knowledge. However, community concerns abound regarding the factuality and potential implications of using this uncensored knowledge. In light of this, we introduce CONNER, a COmpreheNsive kNowledge Evaluation fRamework, designed to systematically and automatically evaluate generated knowledge from six important perspectives -- Factuality, Relevance, Coherence, Informativeness, Helpfulness and Validity. We conduct an extensive empirical analysis of the generated knowledge from three different types of LLMs on two widely studied knowledge-intensive tasks, i.e., open-domain question answering and knowledge-grounded dialogue. Surprisingly, our study reveals that the factuality of generated knowledge, even if lower, does not significantly hinder downstream tasks. Instead, the relevance and coherence of the outputs are more important than small factual mistakes. Further, we show how to use CONNER to improve knowledge-intensive tasks by designing two strategies: Prompt Engineering and Knowledge Selection. Our evaluation code and LLM-generated knowledge with human annotations will be released to facilitate future research.
Abstract
Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameter model by 3% despite having 50x fewer parameters.
Abstract
Question answering over knowledge bases is considered a difficult problem due to the challenge of generalizing to a wide variety of possible natural language questions. Additionally, the heterogeneity of knowledge base schema items between different knowledge bases often necessitates specialized training for different knowledge base question-answering (KBQA) datasets. To handle questions over diverse KBQA datasets with a unified training-free framework, we propose KB-BINDER, which for the first time enables few-shot in-context learning over KBQA tasks. Firstly, KB-BINDER leverages large language models like Codex to generate logical forms as the draft for a specific question by imitating a few demonstrations. Secondly, KB-BINDER grounds on the knowledge base to bind the generated draft to an executable one with BM25 score matching. The experimental results on four public heterogeneous KBQA datasets show that KB-BINDER can achieve a strong performance with only a few in-context demonstrations. Especially on GraphQA and 3-hop MetaQA, KB-BINDER can even outperform the state-of-the-art trained models. On GrailQA and WebQSP, our model is also on par with other fully-trained models. We believe KB-BINDER can serve as an important baseline for future research. Our code is available at https://github.com/ltl3A87/KB-BINDER.
Documents
- towards zerolabel language learning
- selfalignment with instruction backtranslation
- towards practical fewshot federated nlp
- prompting to distill boosting datafree knowledge distillation via reinforced prompt
- multistage collaborative knowledge distillation from large language models
Abstract
This paper explores zero-label learning in Natural Language Processing (NLP), whereby no human-annotated data is used anywhere during training and models are trained purely on synthetic data. At the core of our framework is a novel approach for better leveraging the powerful pretrained language models. Specifically, inspired by the recent success of few-shot inference on GPT-3, we present a training data creation procedure named Unsupervised Data Generation (UDG), which leverages few-shot prompts to synthesize high-quality training data without real human annotations. Our method enables zero-label learning as we train task-specific models solely on the synthetic data, yet we achieve better or comparable results from strong baseline models trained on human-labeled data. Furthermore, when mixed with labeled data, our approach serves as a highly effective data augmentation procedure, achieving new state-of-the-art results on the SuperGLUE benchmark.
Abstract
We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
Abstract
Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is often scarce, presenting an additional challenge. To address these challenges, we first introduce a data generator for federated few-shot learning tasks, which encompasses the quantity and skewness of scarce labeled data in a realistic setting. Subsequently, we propose AUG-FedPrompt, a prompt-based federated learning system that exploits abundant unlabeled data for data augmentation. Our experiments indicate that AUG-FedPrompt can perform on par with full-set fine-tuning with a limited amount of labeled data. However, such competitive performance comes at a significant system cost.
Abstract
Data-free knowledge distillation (DFKD) conducts knowledge distillation via eliminating the dependence of original training data, and has recently achieved impressive results in accelerating pre-trained language models. At the heart of DFKD is to reconstruct a synthetic dataset by inverting the parameters of the uncompressed model. Prior DFKD approaches, however, have largely relied on hand-crafted priors of the target data distribution for the reconstruction, which can be inevitably biased and often incompetent to capture the intrinsic distributions. To address this problem, we propose a prompt-based method, termed as PromptDFD, that allows us to take advantage of learned language priors, which effectively harmonizes the synthetic sentences to be semantically and grammatically correct. Specifically, PromptDFD leverages a pre-trained generative model to provide language priors and introduces a reinforced topic prompter to control data synthesis, making the generated samples thematically relevant and semantically plausible, and thus friendly to downstream tasks. As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance. In some cases, PromptDFD even gives rise to results on par with those from the data-driven knowledge distillation with access to the original training data.
Abstract
We study semi-supervised sequence prediction tasks where labeled data are too scarce to effectively finetune a model and at the same time few-shot prompting of a large language model (LLM) has suboptimal performance. This happens when a task, such as parsing, is expensive to annotate and also unfamiliar to a pretrained LLM. In this paper, we present a discovery that student models distilled from a prompted LLM can often generalize better than their teacher on such tasks. Leveraging this finding, we propose a new distillation method, multistage collaborative knowledge distillation from an LLM (MCKD), for such tasks. MCKD first prompts an LLM using few-shot in-context learning to produce pseudolabels for unlabeled data. Then, at each stage of distillation, a pair of students are trained on disjoint partitions of the pseudolabeled data. Each student subsequently produces new and improved pseudolabels for the unseen partition to supervise the next round of student(s). We show the benefit of multistage cross-partition labeling on two constituency parsing tasks. On CRAFT biomedical parsing, 3-stage MCKD with 50 labeled examples matches the performance of supervised finetuning with 500 examples and outperforms the prompted LLM and vanilla KD by 7.5% and 3.7% parsing F1, respectively.
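One MCKD stage, with its cross-partition relabeling, can be sketched with stub models (an illustrative sketch; the function names and toy "training" routine are assumptions, not the paper's code): two students train on disjoint halves of the pseudolabeled data, then each relabels the half it did not see.

```python
# Illustrative sketch (stub models, not the MCKD implementation) of one
# distillation stage: students train on disjoint halves of pseudolabeled
# data, then each labels the unseen half for the next round.

def mckd_stage(examples, pseudolabels, train, predict):
    half = len(examples) // 2
    a_x, b_x = examples[:half], examples[half:]
    a_y, b_y = pseudolabels[:half], pseudolabels[half:]
    student_a = train(a_x, a_y)   # trained only on partition A
    student_b = train(b_x, b_y)   # trained only on partition B
    # Cross-partition labeling: each student labels the half it never saw.
    new_labels = [predict(student_b, x) for x in a_x] + \
                 [predict(student_a, x) for x in b_x]
    return new_labels

# Toy stand-ins: "training" memorizes the majority label, and prediction
# returns it regardless of input.
def train(xs, ys):
    return max(set(ys), key=ys.count)

def predict(model, x):
    return model

print(mckd_stage(["w", "x", "y", "z"], ["N", "N", "V", "V"], train, predict))
```

Repeating the stage feeds each round's `new_labels` back in as the next round's pseudolabels, which is the "multistage" part of the method.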
Documents
- regionblip a unified multimodal pretraining framework for holistic and regional comprehension
- vast a visionaudiosubtitletext omnimodality foundation model and dataset
- frozen clip model is an efficient point cloud backbone
- pointclip point cloud understanding by clip
- qaclims questionanswer cross language image matching for weakly supervised semantic segmentation
Abstract
In this work, we investigate extending the comprehension of Multi-modal Large Language Models (MLLMs) to regional objects. To this end, we propose to extract features corresponding to regional objects as soft prompts for the LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning. To effectively extract regional features from regular image features and irregular point cloud features, we present a novel and unified position-assisted feature extraction module. Furthermore, training an MLLM from scratch is highly time-consuming. Thus, we propose incrementally extending existing pre-trained MLLMs to comprehend more modalities and the regional objects of those modalities. Specifically, we freeze the Q-Former from BLIP-2, an impressive MLLM, and optimize the modality-specific LoRA parameters in the Q-Former and LLM for each newly introduced modality. The freezing of the Q-Former eliminates the need for extensive pre-training on massive image-text data. The frozen Q-Former pre-trained on massive image-text data is also beneficial for the pre-training on image-region-text data. We name our framework RegionBLIP. We pre-train RegionBLIP on image-region-text, point-cloud-text, and point-cloud-region-text data. Experimental results verify that RegionBLIP can preserve the image comprehension capability of BLIP-2 and further gain a comprehension of the newly introduced point cloud modality and regional objects. The data, code, and pre-trained models will be available at https://github.com/mightyzau/RegionBLIP.
Abstract
Vision and text have been fully explored in contemporary video-text foundational models, while other modalities such as audio and subtitles in videos have not received sufficient attention. In this paper, we resort to establish connections between multi-modality video tracks, including Vision, Audio, and Subtitle, and Text by exploring an automatically generated large-scale omni-modality video caption dataset called VAST-27M. Specifically, we first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions. Then, we employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts into omni-modality captions. Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA). Extensive experiments have been conducted to demonstrate the effectiveness of our proposed VAST-27M corpus and VAST foundation model. VAST achieves 22 new state-of-the-art results on various cross-modality benchmarks. Code, model and dataset will be released at https://github.com/TXH-mercury/VAST.
Abstract
The pretraining-finetuning paradigm has demonstrated great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field since the training data is limited and point cloud collection is expensive. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP model. Our EPCL connects the 2D and 3D modalities by semantically aligning the 2D features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a sequence of tokens and directly fed into the frozen CLIP model to learn point cloud representation. Furthermore, we design a task token to narrow the gap between 2D images and 3D point clouds. Comprehensive experiments on 3D detection, semantic segmentation, classification and few-shot learning demonstrate that the 2D CLIP model can be an efficient point cloud backbone and our method achieves state-of-the-art accuracy on both real-world and synthetic downstream tasks. Code will be available.
Abstract
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training (CLIP) have shown inspirational performance on 2D visual recognition, which learns to match images with their corresponding texts in open-vocabulary settings. However, it remains underexplored whether CLIP, pre-trained by large-scale image-text pairs in 2D, can be generalized to 3D recognition. In this paper, we identify that such a setting is feasible by proposing PointCLIP, which conducts alignment between CLIP-encoded point clouds and 3D category texts. Specifically, we encode a point cloud by projecting it into multi-view depth maps without rendering, and aggregate the view-wise zero-shot predictions to achieve knowledge transfer from 2D to 3D. On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D. By just fine-tuning the lightweight adapter in the few-shot settings, the performance of PointCLIP can be largely improved. In addition, we observe the complementary property between PointCLIP and classical 3D-supervised networks. By simple ensembling, PointCLIP boosts the baseline's performance and even surpasses state-of-the-art models. Therefore, PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and data regime. We conduct thorough experiments on the widely-adopted ModelNet10, ModelNet40 and the challenging ScanObjectNN to demonstrate the effectiveness of PointCLIP. The code is released at https://github.com/ZrrSkywalker/PointCLIP.
Abstract
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation (WSSS), allowing the localization of object regions in an image using only image-level labels. However, existing CAM methods suffer from under-activation of target object regions and false-activation of background regions due to the fact that a lack of detailed supervision can hinder the model's ability to understand the image as a whole. In this paper, we propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS), leveraging the vision-language foundation model to maximize the text-based understanding of images and guide the generation of activation maps. First, a series of carefully designed questions are posed to the VQA (Visual Question Answering) model with Question-Answer Prompt Engineering (QAPE) to generate a corpus of both foreground target objects and backgrounds that are adaptive to query images. We then employ contrastive learning in a Region Image Text Contrastive (RITC) network to compare the obtained foreground and background regions with the generated corpus. Our approach exploits the rich textual information from the open vocabulary as additional supervision, enabling the model to generate high-quality CAMs with a more complete object region and reduce false-activation of background regions. We conduct extensive analysis to validate the proposed method and show that our approach performs state-of-the-art on both PASCAL VOC 2012 and MS COCO datasets.
Documents
- adaptive machine translation with large language models
- alexatm 20b fewshot learning using a largescale multilingual seq2seq model
- chainofdictionary prompting elicits translation in large language models
- democratizing llms for lowresource languages by leveraging their english dominant abilities with linguisticallydiverse prompts
- genderspecific machine translation with large language models
Abstract
Consistency is a key requirement of high-quality translation. It is especially important to adhere to pre-approved terminology and adapt to corrected translations in domain-specific projects. Machine translation (MT) has achieved significant progress in the area of domain adaptation. However, real-time adaptation remains challenging. Large-scale language models (LLMs) have recently shown interesting capabilities of in-context learning, where they learn to replicate certain input-output text generation patterns, without further fine-tuning. By feeding an LLM at inference time with a prompt that consists of a list of translation pairs, it can then simulate the domain and style characteristics. This work aims to investigate how we can utilize in-context learning to improve real-time adaptive MT. Our extensive experiments show promising results at translation time. For example, LLMs can adapt to a set of in-domain sentence pairs and/or terminology while translating a new sentence. We observe that the translation quality with few-shot in-context learning can surpass that of strong encoder-decoder MT systems, especially for high-resource languages. Moreover, we investigate whether we can combine MT from strong encoder-decoder models with fuzzy matches, which can further improve translation quality, especially for less supported languages. We conduct our experiments across five diverse language pairs, namely English-to-Arabic (EN-AR), English-to-Chinese (EN-ZH), English-to-French (EN-FR), English-to-Kinyarwanda (EN-RW), and English-to-Spanish (EN-ES).
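The prompting scheme described above, feeding the LLM a list of translation pairs followed by the new source sentence, can be sketched as a simple prompt builder. The `Language:` line format is an illustrative assumption, not the paper's exact template.

```python
# Minimal sketch of in-context adaptive MT prompting: in-domain
# translation pairs (e.g. fuzzy matches or terminology) are listed
# before the new source sentence, and the model completes the final
# target line in the demonstrated domain and style.

def build_mt_prompt(pairs, source, src_lang="English", tgt_lang="French"):
    """pairs: list of (source, target) demonstrations; returns the prompt."""
    lines = []
    for src, tgt in pairs:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
    lines.append(f"{src_lang}: {source}")
    lines.append(f"{tgt_lang}:")  # left open for the model to complete
    return "\n".join(lines)
```

Swapping the demonstration list at inference time is what makes the adaptation "real-time": no fine-tuning step is needed between requests.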
Abstract
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show that in the zero-shot setting, AlexaTM 20B outperforms GPT-3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, PAWS-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.
Abstract
Large language models (LLMs) have shown surprisingly good performance in multilingual neural machine translation (MNMT) even when trained without parallel data. Yet, despite the fact that the amount of training data is gigantic, they still struggle with translating rare words, particularly for low-resource languages. Even worse, it is usually unrealistic to retrieve relevant demonstrations for in-context learning with low-resource languages on LLMs, which restricts the practical use of LLMs for translation -- how should we mitigate this problem? To this end, we present a novel method, CoD, which augments LLMs with prior knowledge from chains of multilingual dictionaries for a subset of input words to elicit translation abilities in LLMs. Extensive experiments indicate that augmenting ChatGPT with CoD elicits large gains of up to 13x chrF++ points for MNMT (3.08 to 42.63 for English to Serbian written in Cyrillic script) on the FLORES-200 full devtest set. We further demonstrate the importance of chaining the multilingual dictionaries, as well as the superiority of CoD over few-shot demonstration for low-resource languages.
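The dictionary-chaining idea above can be sketched as a prompt constructor: for selected rare input words, translations are chained across several languages and prepended to the translation request. The chaining phrasing below is an assumption based on the description, not the exact CoD template.

```python
# Hedged sketch of chain-of-dictionary style prompting: `chains` maps an
# input word to an ordered list of (language, translation) dictionary
# entries; each chain becomes one hint line placed before the request.

def chain_of_dictionary_prompt(sentence, chains, tgt_lang):
    """chains: {word: [(lang, translation), ...]}; returns the full prompt."""
    hints = []
    for word, links in chains.items():
        chain = " means ".join(f'"{t}" in {lang}' for lang, t in links)
        hints.append(f'"{word}" means {chain}.')
    hint_block = "\n".join(hints)
    return f"{hint_block}\nTranslate into {tgt_lang}: {sentence}"
```

Ablating the chain down to a single bilingual entry would correspond to the non-chained baseline the paper compares against.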
Abstract
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. However, in low-resource languages, obtaining such hand-picked exemplars can still be challenging, where unsupervised techniques may be necessary. Moreover, competent generative capabilities of LLMs are observed only in high-resource languages, while their performances among under-represented languages fall behind due to pre-training data imbalance. To elicit LLMs' ability onto low-resource languages without any supervised data, we propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. These prompts are then used to create intra-lingual exemplars to perform tasks in the target languages. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages. We also show that fine-tuning a 7B model on data generated from our method helps it perform competitively with a 175B model. In non-English translation tasks, our method even outperforms supervised prompting by up to 3 chrF++ in many low-resource languages. When evaluated on zero-shot multilingual summarization, our method surpasses other English-pivoting baselines by up to 4 ROUGE-L and is also favored by GPT-4.
Abstract
Decoder-only Large Language Models (LLMs) have demonstrated potential in machine translation (MT), albeit with performance slightly lagging behind traditional encoder-decoder Neural Machine Translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompts. In this study, we harness this flexibility to explore LLaMa's capability to produce gender-specific translations for languages with grammatical gender. Our results indicate that LLaMa can generate gender-specific translations with competitive accuracy and gender bias mitigation when compared to NLLB, a state-of-the-art multilingual NMT system. Furthermore, our experiments reveal that LLaMa's translations are robust, showing significant performance drops when evaluated against opposite-gender references in gender-ambiguous datasets but maintaining consistency in less ambiguous contexts. This research provides insights into the potential and challenges of using LLMs for gender-specific translations and highlights the importance of in-context learning to elicit new tasks in LLMs.
Documents
- developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer
- evaluation of gpt35 and gpt4 for supporting realworld information needs in healthcare delivery
- text2cohort democratizing the nci imaging data commons with natural language cohort discovery
- cxrllava multimodal large language model for interpreting chest xray images
- the student becomes the master matching gpt3 on scientific factual error correction
Abstract
Purpose: We aimed to evaluate the time and cost of developing prompts for a large language model (LLM) tailored to extract clinical factors from the records of breast cancer patients, as well as their accuracy. Materials and Methods: We collected data from surgical pathology and ultrasound reports of breast cancer patients who underwent radiotherapy from 2020 to 2022. We extracted the information using the Generative Pre-trained Transformer (GPT) for Sheets and Docs extension plugin and termed this the “LLM” method. The time and cost of developing the prompts with the LLM method were assessed and compared with those spent collecting information with the “full manual” and “LLM-assisted manual” methods. To assess accuracy, 340 patients were randomly selected, and the information extracted by the LLM method was compared with that collected by the “full manual” method. Results: Data from 2,931 patients were collected. We developed 12 prompts for the Extract function and 12 for the Format function to extract and standardize the information. The overall accuracy was 87.7%; for lymphovascular invasion, it was 98.2%. Developing and processing the prompts took 3.5 hours and 15 minutes, respectively. Utilizing the ChatGPT application programming interface cost US $65.8, and when factoring in the estimated wage, the total cost was US $95.4. In an estimated comparison, the “LLM-assisted manual” and “LLM” methods were time- and cost-efficient compared to the “full manual” method. Conclusion: Developing prompts for an LLM to derive clinical factors was an efficient way to extract crucial information from large volumes of medical records. This study demonstrates the potential of natural language processing with LLMs in breast cancer patients. Prompts from the current study can be reused in other research to collect clinical information.
Abstract
Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no question did a majority of physicians deem either LLM response harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 unable to be assessed. There were 29 responses with no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 discordant, and 3 unable to be assessed. There were 35 responses with no majority. Responses from both LLMs were largely devoid of overt harm, but less than 20% of the responses agreed with an answer from the informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general-purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general-purpose models.
Abstract
The Imaging Data Commons (IDC) is a cloud-based database that provides researchers with open access to cancer imaging data, with the goal of facilitating collaboration in medical imaging research. However, querying the IDC database for cohort discovery and access to imaging data has a significant learning curve for researchers due to its complex nature. We developed Text2Cohort, a large language model (LLM) based toolkit to facilitate user-friendly and intuitive natural language cohort discovery in the IDC. Text2Cohort translates user input into IDC database queries using prompt engineering and autocorrection and returns the query's response to the user. Autocorrection resolves errors in queries by passing the errors back to the model for interpretation and correction. We evaluated Text2Cohort on 50 natural language user inputs ranging from information extraction to cohort discovery. The resulting queries and outputs were verified by two computer scientists to measure Text2Cohort's accuracy and F1 score. Text2Cohort successfully generated queries and their responses with 88% accuracy and an F1 score of 0.94. However, it failed to generate queries for 6/50 (12%) user inputs due to syntax and semantic errors. Our results indicate that Text2Cohort succeeded at generating queries with correct responses but occasionally failed due to a lack of understanding of the data schema. Despite these shortcomings, Text2Cohort demonstrates the utility of LLMs in enabling researchers to discover and curate cohorts hosted on the IDC with high accuracy using natural language in a more intuitive and user-friendly way.
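The autocorrection loop described above (execute the generated query; on failure, feed the error back to the model) can be sketched generically. `generate` and `execute` are hypothetical callables standing in for the LLM and the IDC database; the retry budget is an assumption.

```python
# Illustrative sketch of an autocorrection loop: a generated query is
# executed, and on failure the error message is passed back to the model
# so it can emit a corrected query. Not Text2Cohort's actual code.

def query_with_autocorrect(user_input, generate, execute, max_retries=3):
    """generate(user_input, error) -> query string; execute(query) -> rows."""
    query = generate(user_input, error=None)
    for _ in range(max_retries):
        try:
            return execute(query)
        except Exception as err:
            # Hand the error text back to the model for correction.
            query = generate(user_input, error=str(err))
    return execute(query)  # final attempt; raises if still invalid
```

The reported 12% failure cases would correspond to inputs where this loop exhausts its retries without producing a valid query.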
Abstract
Purpose: Recent advancements in large language models (LLMs) have expanded their capabilities in a multimodal fashion, potentially replicating the image interpretation of human radiologists. This study aimed to develop an open-source multimodal large language model for interpreting chest X-ray images (CXR-LLaVA). We also examined the effect of prompt engineering and model parameters such as temperature and nucleus sampling. Materials and Methods: For training, we collected 659,287 publicly available CXRs: 417,336 CXRs had labels for certain radiographic abnormalities (dataset 1); 241,951 CXRs provided free-text radiology reports (dataset 2). After pre-training ResNet50 as an image encoder, contrastive language-image pre-training was used to align CXRs and corresponding radiographic abnormalities. Then, the Large Language Model Meta AI-2 was fine-tuned using dataset 2, which was refined using GPT-4 to generate various question answering scenarios. The code can be found at https://github.com/ECOFRI/CXR_LLaVA. Results: In the test set, we observed that the model's performance fluctuated based on its parameters. On average, it achieved an F1 score of 0.34 for five pathologic findings (atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), which improved to 0.46 through prompt engineering. In the independent set, the model achieved an average F1 score of 0.30 for the same pathologic findings. Notably, for the pediatric chest radiograph dataset, which was unseen during training, the model differentiated abnormal radiographs with an F1 score ranging from 0.84 to 0.85. Conclusion: CXR-LLaVA demonstrates promising potential in CXR interpretation. Both prompt engineering and model parameter adjustments can play pivotal roles in interpreting CXRs.
Abstract
Due to the prohibitively high cost of creating error correction datasets, most Factual Claim Correction methods rely on a powerful verification model to guide the correction process. This leads to a significant drop in performance in domains like Scientific Claim Correction, where good verification models do not always exist. In this work, we introduce a claim correction system that makes no domain assumptions and does not require a verifier, yet is able to outperform existing methods by an order of magnitude, achieving 94% correction accuracy on the SciFact dataset and 62.5% on the SciFact-Open dataset, compared to 0.5% and 1.50% for the next best methods, respectively. Our method leverages the power of prompting with LLMs during training to create a richly annotated dataset that can be used for fully supervised training and regularization. We additionally use a claim-aware decoding procedure to improve the quality of corrected claims. Our method is competitive with the very LLM that was used to generate the annotated dataset, with GPT-3.5 achieving 89.5% and 60% correction accuracy on SciFact and SciFact-Open, despite using 1,250 times as many parameters as our model.
Documents
- do anything now characterizing and evaluating inthewild jailbreak prompts on large language models
- defending against alignmentbreaking attacks via robustly aligned llm
- latent jailbreak a benchmark for evaluating text safety and output robustness of large language models
- scalable and transferable blackbox jailbreaks for language models via persona modulation
- llm self defense by self examination, llms know they are being tricked
Abstract
The misuse of large language models (LLMs) has garnered significant attention from the general public and LLM vendors. In response, efforts have been made to align LLMs with human values and intended use. However, a particular type of adversarial prompt, known as the jailbreak prompt, has emerged and continuously evolved to bypass the safeguards and elicit harmful content from LLMs. In this paper, we conduct the first measurement study of jailbreak prompts in the wild, with 6,387 prompts collected from four platforms over six months. Leveraging natural language processing technologies and graph-based community detection methods, we discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts are increasingly shifting from public platforms to private ones, posing new challenges for LLM vendors in proactive detection. To assess the potential harm caused by jailbreak prompts, we create a question set comprising 46,800 samples across 13 forbidden scenarios. Our experiments show that current LLMs and safeguards cannot adequately defend against jailbreak prompts in all scenarios. In particular, we identify two highly effective jailbreak prompts which achieve 0.99 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and which have persisted online for over 100 days. Our work sheds light on the severe and evolving threat landscape of jailbreak prompts. We hope our study can help the research community and LLM vendors promote safer and more regulated LLMs.
Abstract
Recently, Large Language Models (LLMs) have made significant advancements and are now widely used across various domains. Unfortunately, there has been a rising concern that LLMs can be misused to generate harmful or malicious content. Though a line of research has focused on aligning LLMs with human values and preventing them from producing inappropriate content, such alignments are usually vulnerable and can be bypassed by alignment-breaking attacks via adversarially optimized or handcrafted jailbreaking prompts. In this work, we introduce a Robustly Aligned LLM (RA-LLM) to defend against potential alignment-breaking attacks. RA-LLM can be directly constructed upon an existing aligned LLM with a robust alignment checking function, without requiring any expensive retraining or fine-tuning process of the original LLM. Furthermore, we also provide a theoretical analysis for RA-LLM to verify its effectiveness in defending against alignment-breaking attacks. Through real-world experiments on open-source large language models, we demonstrate that RA-LLM can successfully defend against both state-of-the-art adversarial prompts and popular handcrafted jailbreaking prompts by reducing their attack success rates from nearly 100% to around 10% or less.
Abstract
Considerable research efforts have been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, an excessive focus on sensitivity to certain topics can compromise the model's robustness in following instructions, thereby impacting its overall performance in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models without considering their robustness. In this paper, we propose a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach. To comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset in which each prompt embeds a malicious instruction. Specifically, we instruct the model to complete a regular task, such as translation, with the text to be translated containing malicious instructions. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs with respect to the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.
Abstract
Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesising methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times larger than before modulation (0.23%). These prompts also transfer to Claude 2 and Vicuna with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in commercial large language models and highlights the need for more comprehensive safeguards.
Abstract
Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent LLMs against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2.
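The screening scheme described above (embed the generated content in a pre-defined prompt and have a second LLM instance classify it) can be sketched generically. The `llm` callable and the screening template below are assumptions for illustration, not the paper's exact wording.

```python
# Minimal sketch of a self-examination defense: the induced response is
# inserted into a fixed screening prompt, and another LLM call predicts
# whether the text is harmful before it is returned to the user.

SCREEN_TEMPLATE = (
    "Does the following text contain harmful content? "
    "Answer yes or no.\n\nText: {response}"
)

def is_harmful(response, llm):
    """Ask the screening instance for a yes/no verdict on the response."""
    verdict = llm(SCREEN_TEMPLATE.format(response=response))
    return verdict.strip().lower().startswith("yes")

def guarded_generate(prompt, llm):
    """Generate a response, then withhold it if the screen flags it."""
    response = llm(prompt)
    if is_harmful(response, llm):
        return "[response withheld by self-defense filter]"
    return response
```

Note that, as the abstract emphasizes, this requires no fine-tuning or input preprocessing: the defense is entirely a second inference call.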
Documents
- gpt3 models are poor fewshot learners in the biomedical domain
- can gpt3 perform statutory reasoning
- how good are commercial large language models on african languages
- instruction induction from few examples to natural language task descriptions
- controlled text generation with natural language instructions
Abstract
Deep neural language models have set new breakthroughs in many tasks of Natural Language Processing (NLP). Recent work has shown that deep transformer language models (pretrained on large amounts of text) can achieve high levels of task-specific few-shot performance comparable to state-of-the-art models. However, the ability of these large language models in few-shot transfer learning has not yet been explored in the biomedical domain. We investigated the performance of two powerful transformer language models, i.e. GPT-3 and BioBERT, in few-shot settings on various biomedical NLP tasks. The experimental results showed that, to a great extent, both models underperform a language model fine-tuned on the full training data. Although GPT-3 had already achieved near state-of-the-art results in few-shot knowledge transfer on open-domain NLP tasks, it could not perform as effectively as BioBERT, which is orders of magnitude smaller than GPT-3. Given that BioBERT was already pretrained on large biomedical text corpora, our study suggests that language models may largely benefit from in-domain pretraining in task-specific few-shot learning. However, in-domain pretraining seems not to be sufficient; novel pretraining and few-shot learning strategies are required in the biomedical NLP domain.
Abstract
Statutory reasoning is the task of reasoning with facts and statutes, which are rules written in natural language by a legislature. It is a basic legal skill. In this paper we explore the capabilities of the most capable GPT-3 model, text-davinci-003, on an established statutory-reasoning dataset called SARA. We consider a variety of approaches, including dynamic few-shot prompting, chain-of-thought prompting, and zero-shot prompting. While we achieve results with GPT-3 that are better than the previous best published results, we also identify several types of clear errors it makes. We investigate why these errors happen. We discover that GPT-3 has imperfect prior knowledge of the actual U.S. statutes on which SARA is based. More importantly, we create simple synthetic statutes, which GPT-3 is guaranteed not to have seen during training. We find GPT-3 performs poorly at answering straightforward questions about these simple synthetic statutes.
Abstract
Recent advancements in Natural Language Processing (NLP) have led to the proliferation of large pretrained language models. These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. They have also been exposed as commercial APIs as a form of language-model-as-a-service, with great adoption. However, their performance on African languages is largely unknown. We present a preliminary analysis of commercial large language models on two tasks (machine translation and text classification) across eight African languages, spanning different language families and geographical areas. Our results suggest that commercial language models produce below-par performance on African languages. We also find that they perform better on text classification than machine translation. In general, our findings present a call to action to ensure African languages are well represented in commercial large language models, given their growing popularity.
Abstract
Large language models are able to perform a task by conditioning on a few input-output demonstrations - a paradigm known as in-context learning. We show that language models can explicitly infer an underlying task from a few demonstrations by prompting them to generate a natural language instruction that fits the examples. To explore this ability, we introduce the instruction induction challenge, compile a dataset consisting of 24 tasks, and define a novel evaluation metric based on executing the generated instruction. We discover that, to a large extent, the ability to generate instructions does indeed emerge when using a model that is both large enough and aligned to follow instructions; InstructGPT achieves 65.7% of human performance in our execution-based metric, while the original GPT-3 model reaches only 9.8% of human performance. This surprising result suggests that instruction induction might be a viable learning paradigm in and of itself, where instead of fitting a set of latent continuous parameters to the data, one searches for the best description in the natural language hypothesis space.
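The two-stage setup above (induce an instruction from demonstrations, then score it by execution) can be sketched as follows. The templates and the `llm` callable are hypothetical stand-ins, not the benchmark's actual prompts or metric implementation.

```python
# Hedged sketch of instruction induction: the model sees input-output
# pairs and is asked to state the instruction; the induced instruction
# is then scored by executing it on held-out inputs (exact match).

INDUCE_TEMPLATE = (
    "Here are input-output pairs:\n{demos}\n"
    "The instruction was:"
)

def induce_instruction(pairs, llm):
    """Prompt the model to verbalize the task behind the demonstrations."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in pairs)
    return llm(INDUCE_TEMPLATE.format(demos=demos)).strip()

def execution_accuracy(instruction, test_pairs, llm):
    """Execution-based metric: run the induced instruction on held-out inputs."""
    correct = 0
    for x, y in test_pairs:
        pred = llm(f"{instruction}\nInput: {x}\nOutput:").strip()
        correct += (pred == y)
    return correct / len(test_pairs)
```

The execution-based metric is the key design choice: it rewards instructions that actually work when followed, not just ones that sound plausible.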
Abstract
Large language models generate fluent texts and can follow natural language instructions to solve a wide range of tasks without task-specific training. Nevertheless, it is notoriously difficult to control their generation to satisfy the various constraints required by different applications. In this work, we present InstructCTG, a controlled text generation framework that incorporates different constraints by conditioning on natural language descriptions and demonstrations of the constraints. In particular, we first extract the underlying constraints of natural texts through a combination of off-the-shelf NLP tools and simple heuristics. We then verbalize the constraints into natural language instructions to form weakly supervised training data. By prepending natural language descriptions of the constraints and a few demonstrations, we fine-tune a pre-trained language model to incorporate various types of constraints. Compared to existing search-based or score-based methods, InstructCTG is more flexible to different constraint types and has a much smaller impact on the generation quality and speed because it does not modify the decoding procedure. Additionally, InstructCTG allows the model to adapt to new constraints without re-training through the use of few-shot task generalization and in-context learning abilities of instruction-tuned language models.
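The extract-then-verbalize step above can be sketched with toy heuristics. The constraint types (length, keyword coverage) and the instruction wording below are illustrative assumptions; the actual framework uses off-the-shelf NLP tools to extract a richer set of constraints.

```python
# Illustrative sketch of constraint verbalization for weakly supervised
# training data: simple heuristics extract constraints from a reference
# text, which are then rendered as a natural language instruction to be
# prepended to the training example.

def extract_constraints(text, keywords):
    """Toy extractor: word count plus which watched keywords appear."""
    present = [k for k in keywords if k in text]
    return {"num_words": len(text.split()), "keywords": present}

def verbalize(constraints):
    """Render extracted constraints as a natural language instruction."""
    parts = [f"Write about {constraints['num_words']} words."]
    if constraints["keywords"]:
        parts.append("Include the words: " + ", ".join(constraints["keywords"]) + ".")
    return " ".join(parts)
```

Because the constraints live in the instruction rather than in the decoder, new constraint types can be added by writing new verbalizers instead of modifying the decoding procedure.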
Documents
- sparks of gpts in edge intelligence for metaverse caching and inference for mobile aigc services
- joint foundation model caching and inference of generative ai services for edge intelligence
- plum prompt learning using metaheuristic
- emerging technology in acute resuscitation monitoring
- optimizing mobileedge aigenerated everything (aigx) services by prompt engineering fundamental, framework, and case study
Abstract
Aiming at achieving artificial general intelligence (AGI) for the Metaverse, pretrained foundation models (PFMs), e.g., generative pretrained transformers (GPTs), can effectively provide various AI services, such as autonomous driving, digital twins, and AI-generated content (AIGC) for extended reality. With the advantages of low latency and privacy preservation, serving PFMs for mobile AI services in edge intelligence is a viable solution for caching and executing PFMs on edge servers with limited computing resources and GPU memory. However, PFMs typically consist of billions of parameters that are computation- and memory-intensive for edge servers during loading and execution. In this article, we investigate edge PFM serving problems for mobile AIGC services of the Metaverse. First, we introduce the fundamentals of PFMs and discuss their characteristic fine-tuning and inference methods in edge intelligence. Then, we propose a novel framework of joint model caching and inference for managing models and allocating resources to satisfy users' requests efficiently. Furthermore, considering the in-context learning ability of PFMs, we propose a new metric to evaluate the freshness and relevance between examples in demonstrations and executing tasks, namely the Age of Context (AoC). Finally, we propose a least context algorithm for managing cached models at edge servers by balancing the tradeoff among latency, energy consumption, and accuracy.
Abstract
With the rapid development of artificial general intelligence (AGI), various multimedia services based on pretrained foundation models (PFMs) need to be effectively deployed. With edge servers that have cloud-level computing power, edge intelligence can extend the capabilities of AGI to mobile edge networks. However, compared with cloud data centers, resource-limited edge servers can only cache and execute a small number of PFMs, which typically consist of billions of parameters and require intensive computing power and GPU memory during inference. To address this challenge, in this paper, we propose a joint foundation model caching and inference framework that aims to balance the tradeoff among inference latency, accuracy, and resource consumption by managing cached PFMs and user requests efficiently during the provisioning of generative AI services. Specifically, considering the in-context learning ability of PFMs, a new metric named the Age of Context (AoC) is proposed to model the freshness and relevance between examples in past demonstrations and current service requests. Based on the AoC, we propose a least context caching algorithm to manage cached PFMs at edge servers with historical prompts and inference results. The numerical results demonstrate that the proposed algorithm can reduce system costs compared with existing baselines by effectively utilizing contextual information.
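The least-context eviction idea above can be sketched as a toy cache policy: each cached model carries a context score that grows with relevant demonstrations and decays with their age, and the model with the least context is evicted first. The scoring function and decay factor are assumptions for illustration, not the paper's AoC formula.

```python
# Toy sketch of least-context-style eviction under an Age-of-Context
# idea: recent, relevant demonstrations contribute more to a cached
# model's context score; the lowest-scoring model is evicted.

def context_score(hits, ages, decay=0.9):
    """Freshness-weighted relevance: hits[i] is the relevance of example i,
    ages[i] how many requests ago it was observed."""
    return sum(h * (decay ** a) for h, a in zip(hits, ages))

def evict_least_context(cache):
    """cache: {model_name: (hits, ages)}; returns the model to evict."""
    return min(cache, key=lambda m: context_score(*cache[m]))
```

A full implementation would fold latency, energy, and accuracy terms into the score, as the tradeoff described in the abstract suggests.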
Abstract
Since the emergence of large language models, prompt learning has become a popular method for optimizing and customizing these models. Special prompts, such as Chain-of-Thought, have even revealed previously unknown reasoning capabilities within these models. However, the progress of discovering effective prompts has been slow, driving a desire for general prompt optimization methods. Unfortunately, few existing prompt learning methods satisfy the criteria of being truly "general", i.e., automatic, discrete, black-box, gradient-free, and interpretable all at once. In this paper, we introduce metaheuristics, a branch of discrete non-convex optimization methods with over 100 options, as a promising approach to prompt learning. Within our paradigm, we test six typical methods: hill climbing, simulated annealing, genetic algorithms with/without crossover, tabu search, and harmony search, demonstrating their effectiveness in black-box prompt learning and Chain-of-Thought prompt tuning. Furthermore, we show that these methods can be used to discover more human-understandable prompts that were previously unknown, opening the door to a cornucopia of possibilities in prompt optimization. We release all the code at https://github.com/research4pan/Plum.
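The simplest of the metaheuristics listed above, hill climbing, can be sketched as discrete, gradient-free search over prompt tokens. The neighbor operator (single-token substitution) and the black-box `score` function are illustrative assumptions, not the library's actual interface.

```python
import random

# Minimal sketch of black-box prompt optimization via hill climbing:
# propose a random single-token edit, keep it only if the (black-box,
# gradient-free) score improves. Discrete and interpretable by design.

def hill_climb_prompt(prompt, score, vocab, steps=100, seed=0):
    """score(prompt) -> float, e.g. task accuracy on a dev set."""
    rng = random.Random(seed)
    best, best_score = prompt, score(prompt)
    for _ in range(steps):
        tokens = best.split()
        i = rng.randrange(len(tokens))
        tokens[i] = rng.choice(vocab)       # single-token substitution
        candidate = " ".join(tokens)
        s = score(candidate)
        if s > best_score:                  # accept only improvements
            best, best_score = candidate, s
    return best, best_score
```

Simulated annealing or tabu search would differ only in the acceptance rule, which is what makes metaheuristics a natural family for this problem.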
Abstract
Fluid optimization in the resuscitation of shock became the mainstay of treatment following the advent of Early Goal-Directed Therapy (EGDT) by Rivers et al. in 2001 [1]. Patients presenting in shock require prompt optimization of volume status and cardiac output to ensure adequate perfusion. Poor optimization may be associated with prolonged hospital and intensive care unit stays. The prior gold standard, pulmonary artery catheterization, is rarely available in the emergency department setting and its invasive nature has led to recent re-evaluation of its clinical utility. However, there are new monitoring technologies that are being studied in the intensive care unit setting that may soon be available in emergency departments to aid in nursing and physician decision making to improve acute resuscitation.
Abstract
As the next-generation paradigm for content creation, AI-Generated Content (AIGC), i.e., generating content automatically by Generative AI (GAI) based on user prompts, has gained great attention and success recently. With the ever-increasing power of GAI, especially the emergence of Pretrained Foundation Models (PFMs) that contain billions of parameters and prompt engineering methods (i.e., finding the best prompts for the given task), the application range of AIGC is rapidly expanding, covering various forms of information for humans, systems, and networks, such as network designs, channel coding, and optimization solutions. In this article, we present the concept of mobile-edge AI-Generated Everything (AIGX). Specifically, we first review the building blocks of AIGX, the evolution from AIGC to AIGX, as well as practical AIGX applications. Then, we present a unified mobile-edge AIGX framework, which employs edge devices to provide PFM-empowered AIGX services and optimizes such services via prompt engineering. More importantly, we demonstrate that suboptimal prompts lead to poor generation quality, which adversely affects user satisfaction, edge network performance, and resource utilization. Accordingly, we conduct a case study, showcasing how to train an effective prompt optimizer using ChatGPT and investigating how much improvement is possible with prompt engineering in terms of user experience, quality of generation, and network performance.
Documents
- thespian multicharacter text roleplaying game agents
- reward design with language models
- robotic interestingness via humaninformed fewshot object detection
- roco dialectic multirobot collaboration with large language models
- adaplanner adaptive planning from feedback with language models
Abstract
Text-adventure games and text role-playing games are grand challenges for reinforcement learning game playing agents. Text role-playing games are open-ended environments where an agent must faithfully play a particular character. We consider the distinction between characters and actors, where an actor agent has the ability to play multiple characters. We present a framework we call a thespian agent that can learn to emulate multiple characters along with a soft prompt that can be used to direct it as to which character to play at any time. We further describe an attention mechanism that allows the agent to learn new characters that are based on previously learned characters in a few-shot fashion. We show that our agent outperforms the state-of-the-art agent framework in multi-character learning and few-shot learning.
Abstract
Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
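A sketch of the proxy-reward idea. Here `llm_judge` is a placeholder for the actual GPT-3 call, and the yes/no parsing is our simplification of how a verdict would be mapped to a scalar reward:

```python
def proxy_reward(user_prompt, episode_summary, llm_judge):
    # The user's one-time prompt plus a textual summary of the RL agent's
    # behavior is sent to the LLM; its verdict becomes the reward signal.
    query = (
        f"{user_prompt}\n\n"
        f"Observed behavior: {episode_summary}\n"
        "Does this behavior match the desired behavior? Answer yes or no."
    )
    verdict = llm_judge(query).strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```

In the RL loop, this function simply replaces a hand-written reward function; the agent's update rule is unchanged.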
Abstract
Interestingness recognition is crucial for decision making in autonomous exploration for mobile robots. Previous methods proposed an unsupervised online learning approach that can adapt to environments and detect interesting scenes quickly, but lack the ability to adapt to human-informed interesting objects. To solve this problem, we introduce a human-interactive framework, AirInteraction, that can detect human-informed objects via few-shot online learning. To reduce the communication bandwidth, we first apply an online unsupervised learning algorithm on the unmanned vehicle for interestingness recognition and then only send the potential interesting scenes to a base station for human inspection. The human operator is able to draw and provide bounding box annotations for particular interesting objects, which are sent back to the robot to detect similar objects via few-shot learning. Using only a few human-labeled examples, the robot can learn novel interesting object categories during the mission and detect interesting scenes that contain the objects. We evaluate our method on various interesting scene recognition datasets. To the best of our knowledge, it is the first human-informed few-shot object detection framework for autonomous exploration.
Abstract
We propose a novel approach to multi-robot collaboration that harnesses the power of pre-trained large language models (LLMs) for both high-level communication and low-level path planning. Robots are equipped with LLMs to discuss and collectively reason about task strategies. They then generate sub-task plans and task space waypoint paths, which are used by a multi-arm motion planner to accelerate trajectory planning. We also provide feedback from the environment, such as collision checking, and prompt the LLM agents to improve their plan and waypoints in-context. For evaluation, we introduce RoCoBench, a 6-task benchmark covering a wide range of multi-robot collaboration scenarios, accompanied by a text-only dataset for agent representation and reasoning. We experimentally demonstrate the effectiveness of our approach -- it achieves high success rates across all tasks in RoCoBench and adapts to variations in task semantics. Our dialog setup offers high interpretability and flexibility -- in real-world experiments, we show RoCo easily incorporates human-in-the-loop, where a user can communicate and collaborate with a robot agent to complete tasks together. See project website https://project-roco.github.io for videos and code.
Abstract
Large language models (LLMs) have recently demonstrated potential as autonomous agents for sequential decision-making tasks. However, most existing methods either take actions greedily without planning or rely on static plans that are not adaptable to environmental feedback. Consequently, the sequential decision-making performance of LLM agents degenerates as problem complexity and plan horizons increase. We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback. In AdaPlanner, the LLM agent adaptively refines its plan from feedback with both in-plan and out-of-plan refinement strategies. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities. Furthermore, we propose a skill discovery mechanism that leverages successful plans as few-shot exemplars, enabling the agent to plan and refine with fewer task demonstrations. Our experiments in the ALFWorld and MiniWoB++ environments demonstrate that AdaPlanner outperforms state-of-the-art baselines by 3.73% and 4.11% while utilizing 2x and 600x fewer samples, respectively.
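The closed-loop refine-from-feedback structure can be sketched as below, with `make_plan`, `execute`, and `refine` as stand-ins for the LLM planner, the environment, and the out-of-plan refinement prompt, respectively (the real system also performs in-plan refinement, which this sketch omits):

```python
def adaptive_plan_loop(make_plan, execute, refine, task, max_rounds=3):
    # Closed-loop planning: run the current plan and, whenever the
    # environment reports failure, feed that feedback back to the LLM
    # to produce a revised plan (out-of-plan refinement).
    plan = make_plan(task)
    for _ in range(max_rounds):
        success, feedback = execute(plan)
        if success:
            return plan
        plan = refine(plan, feedback)
    return plan
```

A static-plan baseline corresponds to `max_rounds=1` with no `refine` call, which is exactly what degrades as horizons grow.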
Documents
- review of large vision models and visual prompt engineering
- prompt engineering for healthcare methodologies and applications
- how understanding large language models can inform their use in physics education
- unleashing the potential of prompt engineering in large language models a comprehensive review
- geotechnical parrot tales (gpt) harnessing large language models in geotechnical engineering
Abstract
Visual prompt engineering is a fundamental technology in the field of visual and image Artificial General Intelligence, serving as a key component for achieving zero-shot capabilities. As the development of large vision models progresses, the importance of prompt engineering becomes increasingly evident. Designing suitable prompts for specific visual tasks has emerged as a meaningful research direction. This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering, exploring the latest advancements in visual prompt engineering. We present influential large models in the visual domain and a range of prompt engineering methods employed on these models. It is our hope that this review provides a comprehensive and systematic description of prompt engineering methods based on large visual models, offering valuable insights for future researchers in their exploration of this field.
Abstract
This review will introduce the latest advances in prompt engineering in the field of natural language processing (NLP) for the medical domain. First, we will provide a brief overview of the development of prompt engineering and emphasize its significant contributions to healthcare NLP applications such as question-answering systems, text summarization, and machine translation. With the continuous improvement of general large language models, the importance of prompt engineering in the healthcare domain is becoming increasingly prominent. The aim of this article is to provide useful resources and bridges for healthcare NLP researchers to better explore the application of prompt engineering in this field. We hope that this review can provide new ideas and inspire ample possibilities for research and application in medical NLP.
Abstract
The paper aims to fulfil three main functions: (1) to serve as an introduction for the physics education community to the functioning of Large Language Models (LLMs), (2) to present a series of illustrative examples demonstrating how prompt-engineering techniques can impact LLM performance on conceptual physics tasks, and (3) to discuss potential implications of the understanding of LLMs and prompt engineering for physics teaching and learning. We first summarise existing research on the performance of a popular LLM-based chatbot (ChatGPT) on physics tasks. We then give a basic account of how LLMs work, illustrate essential features of their functioning, and discuss their strengths and limitations. Equipped with this knowledge, we discuss some challenges with generating useful output with ChatGPT-4 in the context of introductory physics, paying special attention to conceptual questions and problems. We then provide a condensed overview of relevant literature on prompt engineering and demonstrate through illustrative examples how selected prompt-engineering techniques can be employed to improve ChatGPT-4's output on conceptual introductory physics problems. Qualitatively studying these examples provides additional insights into ChatGPT's functioning and its utility in physics problem solving. Finally, we consider how insights from the paper can inform the use of LLMs in the teaching and learning of physics.
Abstract
This paper delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). Prompt engineering is the process of structuring input text for LLMs and is a technique integral to optimizing the efficacy of LLMs. This survey elucidates foundational principles of prompt engineering, such as role-prompting, one-shot, and few-shot prompting, as well as more advanced methodologies such as the chain-of-thought and tree-of-thoughts prompting. The paper sheds light on how external assistance in the form of plugins can assist in this task, and reduce machine hallucination by retrieving external knowledge. We subsequently delineate prospective directions in prompt engineering research, emphasizing the need for a deeper understanding of structures and the role of agents in Artificial Intelligence-Generated Content (AIGC) tools. We discuss how to assess the efficacy of prompt methods from different perspectives and using different methods. Finally, we gather information about the application of prompt engineering in such fields as education and programming, showing its transformative potential. This comprehensive survey aims to serve as a friendly guide for anyone venturing through the big world of LLMs and prompt engineering.
Abstract
The widespread adoption of large language models (LLMs), such as OpenAI's ChatGPT, could revolutionize various industries, including geotechnical engineering. However, GPT models can sometimes generate plausible-sounding but false outputs, known as hallucinations. In this article, we discuss the importance of prompt engineering in mitigating these risks and harnessing the full potential of GPT for geotechnical applications. We explore the challenges and pitfalls associated with LLMs and highlight the role of context in ensuring accurate and valuable responses. Furthermore, we examine the development of context-specific search engines and the potential of LLMs to become a natural interface for complex tasks, such as data analysis and design. We also develop a unified interface using natural language to handle complex geotechnical engineering tasks and data analysis. By integrating GPT into geotechnical engineering workflows, professionals can streamline their work and develop sustainable and resilient infrastructure systems for the future.
Documents
- coveragebased example selection for incontext learning
- exploring demonstration ensembling for incontext learning
- larger language models do incontext learning differently
- incontext learning learns label relationships but is not conventional learning
- compositional exemplars for incontext learning
Abstract
In-context learning (ICL), the ability of large language models to perform novel tasks by conditioning on a prompt with a few task examples, requires these examples to be informative about the test instance. The standard approach of independently ranking and selecting the most similar examples selects redundant examples while omitting important information. In this work, we show that BERTScore-Recall (BSR) selects better examples that demonstrate more of the salient aspects, e.g. reasoning patterns, of the test input. We further extend BSR and many standard metrics to easily optimizable set-level metrics, giving still better coverage of those salient aspects. On 15 datasets spanning 6 tasks and with 7 diverse LLMs, we show that (1) BSR is the superior metric for in-context example selection across the board, and (2) for compositional tasks, set selection using Set-BSR outperforms independent ranking by up to 17 points on average and, despite being training-free, surpasses methods that leverage task- or LLM-specific training.
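The set-level idea can be illustrated with a crude lexical stand-in for BERTScore-Recall (the real metric matches contextual BERT token embeddings, not surface tokens, so treat this purely as a sketch of the greedy coverage step):

```python
def greedy_set_select(test_input, pool, k=2):
    # Set-level selection: greedily add the example whose tokens most
    # increase coverage of the test input, rather than ranking examples
    # independently (which tends to pick redundant near-duplicates).
    test_tokens = set(test_input.split())
    chosen, covered = [], set()
    remaining = list(pool)
    for _ in range(min(k, len(remaining))):
        best = max(remaining,
                   key=lambda ex: len((covered | set(ex.split())) & test_tokens))
        chosen.append(best)
        covered |= set(best.split())
        remaining.remove(best)
    return chosen
```

Independent ranking would pick the two "sort" examples below, since each individually matches best; the set-level greedy instead adds the complementary "descending" example.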
Abstract
In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for a given task, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from some issues. First, concatenation offers almost no control over the contribution of each demo to the model prediction. This can be sub-optimal when some demonstrations are irrelevant to the test example. Second, due to the input length limit of some transformer models, it might be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-J and experiment on 12 language tasks. Our experiments show weighted max ensembling to outperform vanilla concatenation by as much as 2.4 average points. Code available at https://github.com/mukhal/icl-ensembling.
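The combination step can be sketched as follows. `lm_probs` stands in for a call to the LM conditioned on one bucket of demonstrations, and uniform-weight averaging is an assumption on our part (the paper's best variant weights buckets and uses max rather than averaging):

```python
def dense_average(buckets, lm_probs, weights=None):
    # DENSE-style ensembling: score each label under every demonstration
    # bucket separately, then combine the per-bucket distributions into
    # one normalized prediction.
    weights = weights or [1.0] * len(buckets)
    combined = {}
    for bucket, w in zip(buckets, weights):
        for label, p in lm_probs(bucket).items():
            combined[label] = combined.get(label, 0.0) + w * p
    total = sum(combined.values())
    return {label: p / total for label, p in combined.items()}
```

Because each bucket is a separate forward pass, no single prompt has to fit all demonstrations, which addresses the context-length issue directly.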
Abstract
We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups, ICL with flipped labels and ICL with semantically-unrelated labels, across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task. The ability to do SUL-ICL also emerges primarily with scale, and large-enough language models can even perform linear classification in a SUL-ICL setting. Finally, we evaluate instruction-tuned models and find that instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former.
Abstract
The predictions of Large Language Models (LLMs) on downstream tasks often improve significantly when including examples of the input-label relationship in the context. However, there is currently no consensus about how this in-context learning (ICL) ability of LLMs works. For example, while Xie et al. (2021) liken ICL to a general-purpose learning algorithm, Min et al. (2022) argue ICL does not even learn label relationships from in-context examples. In this paper, we provide novel insights into how ICL leverages label information, revealing both capabilities and limitations. To ensure we obtain a comprehensive picture of ICL behavior, we study probabilistic aspects of ICL predictions and thoroughly examine the dynamics of ICL as more examples are provided. Our experiments show that ICL predictions almost always depend on in-context labels, and that ICL can learn truly novel tasks in-context. However, we also find that ICL struggles to fully overcome prediction preferences acquired from pre-training data, and, further, that ICL does not consider all in-context information equally.
Abstract
Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability, where the model learns to do an unseen task via a prompt consisting of input-output examples as the demonstration, without any parameter updates. The performance of ICL is highly dominated by the quality of the selected in-context examples. However, previous selection methods are mostly based on simple heuristics, leading to sub-optimal performance. In this work, we formulate in-context example selection as a subset selection problem. We propose CEIL (Compositional Exemplars for In-context Learning), which is instantiated by Determinantal Point Processes (DPPs) to model the interaction between the given input and in-context examples, and optimized through a carefully-designed contrastive learning objective to obtain preference from LMs. We validate CEIL on 12 classification and generation datasets from 7 distinct NLP tasks, including sentiment analysis, paraphrase detection, natural language inference, commonsense reasoning, open-domain question answering, code generation, and semantic parsing. Extensive experiments demonstrate not only the state-of-the-art performance but also the transferability and compositionality of CEIL, shedding new light on effective and efficient in-context learning. Our code is released at https://github.com/HKUNLP/icl-ceil.
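Setting CEIL's learned kernel aside, the DPP machinery itself can be sketched with greedy MAP inference over a toy similarity kernel. The kernel values below are made up for illustration; in CEIL the kernel is learned contrastively from LM feedback:

```python
def det(m):
    # Laplace expansion; fine for the tiny kernels in this sketch
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def greedy_dpp_select(K, k):
    # Greedy MAP inference for a DPP: repeatedly add the item that most
    # increases det(K_S). High determinant favors exemplars that are
    # individually relevant yet mutually diverse.
    S = []
    for _ in range(k):
        def gain(i):
            T = S + [i]
            return det([[K[a][b] for b in T] for a in T])
        best = max((i for i in range(len(K)) if i not in S), key=gain)
        S.append(best)
    return sorted(S)
```

With items 0 and 1 as near-duplicates, the DPP objective skips the redundant second copy in favor of the diverse item 2, which is exactly the behavior independent top-k ranking lacks.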
Documents
- one step of gradient descent is provably the optimal incontext learner with one layer of linear selfattention
- incontext learning through the bayesian prism
- pretraining task diversity and the emergence of nonbayesian incontext learning for regression
- trained transformers learn linear models incontext
- transformers as statisticians provable incontext learning with incontext algorithm selection
Abstract
Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Aky\"urek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of $\textit{pre-conditioned}$ GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of $\textit{nonlinear}$ functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.
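The learned algorithm is easy to state concretely: starting from $w = 0$, one GD step on $L(w) = \|Xw - y\|^2 / (2n)$ gives $w_1 = (\eta / n) X^\top y$. A tiny numeric sketch (the learning rate is arbitrary here; in the trained transformer it is effectively absorbed into the attention weights):

```python
def one_step_gd_prediction(X, y, x_test, lr=0.5):
    # One gradient step on the least-squares loss ||Xw - y||^2 / (2n)
    # starting from w = 0; the gradient there is -X^T y / n,
    # so w1 = (lr / n) X^T y.
    n, d = len(X), len(X[0])
    w1 = [lr * sum(X[i][j] * y[i] for i in range(n)) / n for j in range(d)]
    # Predict on the test point with the once-updated weights
    return sum(xj * wj for xj, wj in zip(x_test, w1))
```

For `X = [[1, 0], [0, 1]]`, `y = [2, 4]`, and `lr = 0.5`, this yields `w1 = [0.5, 1.0]`, so the prediction on `x_test = [1, 1]` is `1.5`.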
Abstract
In-context learning is one of the surprising and useful features of large language models. How it works is an active area of research. Recently, stylized meta-learning-like setups have been devised that train these models on a sequence of input-output pairs $(x, f(x))$ from a function class using the language modeling loss and observe generalization to unseen functions from the same class. One of the main discoveries in this line of research has been that for several problems such as linear regression, trained transformers learn algorithms for learning functions in context. However, the inductive biases of these models resulting in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. It has been shown that high-capacity transformers mimic the Bayesian predictor for linear regression. In this paper, we show empirical evidence of transformers exhibiting the behavior of this ideal learner across different linear and non-linear function classes. We also extend the previous setups to work in the multitask setting and verify that transformers can do in-context learning in this setup as well, and the Bayesian perspective sheds light on this setting also. Finally, via the example of learning Fourier series, we study the inductive bias for in-context learning. We find that in-context learning may or may not have simplicity bias depending on the pretraining data distribution.
Abstract
Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $\textit{new}$ tasks that are very different from those seen during pretraining? To probe this question, we examine ICL's performance on linear regression while varying the diversity of tasks in the pretraining dataset. We empirically demonstrate a $\textit{task diversity threshold}$ for the emergence of ICL. Below this threshold, the pretrained transformer cannot solve unseen regression tasks, instead behaving like a Bayesian estimator with the $\textit{non-diverse pretraining task distribution}$ as the prior. Beyond this threshold, the transformer significantly outperforms this estimator; its behavior aligns with that of ridge regression, corresponding to a Gaussian prior over $\textit{all tasks}$, including those not seen during pretraining. Thus, when pretrained on data with task diversity greater than the threshold, transformers $\textit{can}$ optimally solve fundamentally new tasks in-context. Importantly, this capability hinges on it deviating from the Bayes-optimal estimator with the pretraining distribution as the prior. This study also explores the effect of regularization, model capacity, and task structure and underscores, in a concrete example, the critical role of task diversity, alongside data and model scale, in the emergence of ICL. Code is available at https://github.com/mansheej/icl-task-diversity.
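The ridge behavior above is exactly Bayes-optimal inference under an isotropic Gaussian prior over tasks: with $w \sim \mathcal{N}(0, \tau^2 I)$ and $y = Xw + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, the posterior mean is

$$
\hat{w} \;=\; \mathbb{E}[w \mid X, y] \;=\; \left(X^\top X + \tfrac{\sigma^2}{\tau^2} I\right)^{-1} X^\top y,
$$

i.e., ridge regression with regularization strength $\lambda = \sigma^2 / \tau^2$. The below-threshold estimator corresponds to replacing this broad Gaussian prior with the narrow empirical distribution over the pretraining tasks.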
Abstract
Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
Abstract
Neural sequence models based on the transformer architecture have demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they can perform new tasks when prompted with training and test examples, without any parameter update to the model. This work first provides a comprehensive statistical theory for transformers to perform ICL. Concretely, we show that transformers can implement a broad class of standard machine learning algorithms in context, such as least squares, ridge regression, Lasso, learning generalized linear models, and gradient descent on two-layer neural networks, with near-optimal predictive power on various in-context data distributions. Using an efficient implementation of in-context gradient descent as the underlying mechanism, our transformer constructions admit mild size bounds, and can be learned with polynomially many pretraining sequences. Building on these ``base'' ICL algorithms, intriguingly, we show that transformers can implement more complex ICL procedures involving \emph{in-context algorithm selection}, akin to what a statistician can do in real life: a \emph{single} transformer can adaptively select different base ICL algorithms, or even perform qualitatively different tasks, on different input sequences, without any explicit prompting of the right algorithm or task. We both establish this in theory by explicit constructions and also observe this phenomenon experimentally. In theory, we construct two general mechanisms for algorithm selection with concrete examples: pre-ICL testing and post-ICL validation. As an example, we use the post-ICL validation mechanism to construct a transformer that can perform nearly Bayes-optimal ICL on a challenging task: noisy linear models with mixed noise levels. Experimentally, we demonstrate the strong in-context algorithm selection capabilities of standard transformer architectures.
Documents
- prompt tuning large language models on personalized aspect extraction for recommendations
- s3dst structured opendomain dialogue segmentation and state tracking in the era of llms
- reasoning before responding integrating commonsensebased causality explanation for empathetic response generation
- fewshot adaptation for parsing contextual utterances with llms
- chatrec towards interactive and explainable llmsaugmented recommender system
Abstract
Existing aspect extraction methods mostly rely on explicit or ground truth aspect information, or use data mining or machine learning approaches to extract aspects from implicit user feedback such as user reviews. It however remains under-explored how the extracted aspects can help generate more meaningful recommendations to the users. Meanwhile, existing research on aspect-based recommendations often relies on separate aspect extraction models or assumes the aspects are given, without accounting for the fact that the optimal set of aspects could be dependent on the recommendation task at hand. In this work, we propose to combine aspect extraction together with aspect-based recommendations in an end-to-end manner, achieving the two goals together in a single framework. For the aspect extraction component, we leverage the recent advances in large language models and design a new prompt learning mechanism to generate aspects for the end recommendation task. For the aspect-based recommendation component, the extracted aspects are concatenated with the usual user and item features used by the recommendation model. The recommendation task mediates the learning of the user embeddings and item embeddings, which are used as soft prompts to generate aspects. Therefore, the extracted aspects are personalized and contextualized by the recommendation task. We showcase the effectiveness of our proposed method through extensive experiments on three industrial datasets, where our proposed framework significantly outperforms state-of-the-art baselines in both the personalized aspect extraction and aspect-based recommendation tasks. In particular, we demonstrate that it is necessary and beneficial to combine the learning of aspect extraction and aspect-based recommendation together. We also conduct extensive ablation studies to understand the contribution of each design component in our framework.
Abstract
The traditional Dialogue State Tracking (DST) problem aims to track user preferences and intents in user-agent conversations. While sufficient for task-oriented dialogue systems supporting narrow domain applications, the advent of Large Language Model (LLM)-based chat systems has introduced many real-world intricacies in open-domain dialogues. These intricacies manifest in the form of increased complexity in contextual interactions, extended dialogue sessions encompassing a diverse array of topics, and more frequent contextual shifts. To handle these intricacies arising from evolving LLM-based chat systems, we propose joint dialogue segmentation and state tracking per segment in open-domain dialogue systems. Assuming a zero-shot setting appropriate to a true open-domain dialogue system, we propose S3-DST, a structured prompting technique that harnesses Pre-Analytical Recollection, a novel grounding mechanism we designed for improving long context tracking. To demonstrate the efficacy of our proposed approach in joint segmentation and state tracking, we evaluate S3-DST on a proprietary anonymized open-domain dialogue dataset, as well as publicly available DST and segmentation datasets. Across all datasets and settings, S3-DST consistently outperforms the state-of-the-art, demonstrating its potency and robustness for the next generation of LLM-based chat systems.
Abstract
Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user's perspective (user's desires and reactions) and the system's perspective (system's intentions and reactions). We enhance ChatGPT's ability to reason about the system's perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.
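The two-perspective elicitation could look something like the sketch below. The exact wording of the query is an assumption; the point is simply that the prompt requests causality explanations for both the user's and the system's side before a response is generated:

```python
# Illustrative prompt asking an LLM for causality explanations from both
# the user's and the system's perspective. The phrasing is a hypothetical
# approximation of the kind of query the paper describes.

def causality_prompt(context):
    return (
        f"Dialogue context:\n{context}\n\n"
        "Explain the emotional causality from both perspectives:\n"
        "1. User's desires:\n"
        "2. User's reactions:\n"
        "3. System's intentions:\n"
        "4. System's reactions:\n"
    )

p = causality_prompt("User: I failed my driving test again.")
```

The filled-in explanations would then be fed as additional conditioning to the response generator (ChatGPT or a T5-based model, per the abstract).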
Abstract
We evaluate the ability of semantic parsers based on large language models (LLMs) to handle contextual utterances. In real-world settings, there typically exists only a limited number of annotated contextual utterances due to annotation cost, resulting in an imbalance compared to non-contextual utterances. Therefore, parsers must adapt to contextual utterances with a few training examples. We examine four major paradigms for doing so in conversational semantic parsing, i.e., Parse-with-Utterance-History, Parse-with-Reference-Program, Parse-then-Resolve, and Rewrite-then-Parse. To facilitate such cross-paradigm comparisons, we construct SMCalFlow-EventQueries, a subset of contextual examples from SMCalFlow with additional annotations. Experiments with in-context learning and fine-tuning suggest that Rewrite-then-Parse is the most promising paradigm when holistically considering parsing accuracy, annotation cost, and error types.
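The Rewrite-then-Parse paradigm decomposes the problem into two context-free steps, as in the sketch below. Both components are stubbed with toy rules purely for illustration; in the paper's setting each would be an LLM, and the function names and heuristics here are hypothetical:

```python
# Sketch of Rewrite-then-Parse: first rewrite the contextual utterance into
# a standalone one, then parse it with a context-independent parser.
# Both stages are toy rule-based stubs standing in for LLM calls.

def rewrite(history, utterance):
    """Resolve context-dependent references using the dialogue history."""
    if utterance.lower().startswith("what about") and history:
        prev = history[-1]
        topic = utterance[len("what about"):].strip(" ?")
        # Crude slot substitution into the previous request's frame.
        return prev.replace("tomorrow", topic)
    return utterance

def parse(utterance):
    """Stub context-independent semantic parser."""
    return {"intent": "find_events", "when": utterance.split()[-1].rstrip("?")}

history = ["Any events tomorrow?"]
standalone = rewrite(history, "What about Friday?")
program = parse(standalone)
```

A practical appeal of this decomposition is annotation cost: rewrites are plain-text targets that are far cheaper to label than full programs with reference annotations.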
Abstract
Large language models (LLMs) have demonstrated significant potential for addressing a variety of application tasks. However, traditional recommender systems continue to face great challenges such as poor interactivity and explainability, which hinder their broad deployment in real-world systems. To address these limitations, this paper proposes a novel paradigm called Chat-Rec (ChatGPT Augmented Recommender System) that augments LLMs for building conversational recommender systems by converting user profiles and historical interactions into prompts. Chat-Rec is demonstrated to be effective in learning user preferences and establishing connections between users and products through in-context learning, which also makes the recommendation process more interactive and explainable. Moreover, within the Chat-Rec framework, users' preferences can be transferred across products for cross-domain recommendations, and prompt-based injection of information into LLMs can also handle cold-start scenarios with new items. In our experiments, Chat-Rec effectively improves the results of top-k recommendations and performs better on the zero-shot rating prediction task. Chat-Rec offers a novel approach to improving recommender systems and presents new practical scenarios for applying AIGC (AI-generated content) in recommender system research.
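Converting a profile and interaction history into a prompt might look like the following sketch. The template fields and wording are illustrative assumptions, not Chat-Rec's actual prompt:

```python
# Sketch of turning a user profile and interaction history into a
# conversational recommendation prompt, in the spirit of Chat-Rec.
# Field names and phrasing are hypothetical.

def build_rec_prompt(profile, history, candidates, k=3):
    return (
        f"User profile: {profile}\n"
        f"Previously liked: {', '.join(history)}\n"
        f"Candidate items: {', '.join(candidates)}\n"
        f"Recommend the top {k} candidates for this user and explain why."
    )

prompt = build_rec_prompt(
    profile="enjoys sci-fi, dislikes horror",
    history=["Dune", "The Martian"],
    candidates=["Interstellar", "It", "Arrival", "Saw"],
)
```

Because the LLM sees preferences in plain text rather than as latent vectors, the same prompt naturally supports explanation ("and explain why") and cold-start items that never appeared in training interactions.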
Documents
- prompt-based zero-shot relation extraction with semantic knowledge augmentation
- ccprompt: counterfactual contrastive prompt-tuning for many-class classification
- relationprompt: leveraging prompts to generate synthetic data for zero-shot relation triplet extraction
- knowprompt: knowledge-aware prompt-tuning with synergistic optimization for relation extraction
- continuous prompt tuning based textual entailment model for e-commerce entity typing
Abstract
In relation triplet extraction (RTE), recognizing unseen (new) relations for which there are no training instances is a challenging task. Efforts have been made to recognize unseen relations based on question-answering models or relation descriptions. However, these approaches miss the semantic information about connections between seen and unseen relations. In this paper, we propose a prompt-based model with semantic knowledge augmentation (ZS-SKA) to recognize unseen relations under the zero-shot setting. We present a new word-level analogy-based sentence translation rule and generate augmented instances with unseen relations from instances with seen relations using that rule. We design prompts with weighted virtual label construction based on an external knowledge graph to integrate semantic knowledge learned from seen relations. Instead of using the actual label sets in the prompt template, we construct weighted virtual label words. We learn the representations of both seen and unseen relations with augmented instances and prompts. We then calculate the distance between the generated representations using prototypical networks to predict unseen relations. Extensive experiments conducted on three public datasets, FewRel, Wiki-ZSL, and NYT, show that ZS-SKA outperforms state-of-the-art methods under zero-shot scenarios. Our experimental results also demonstrate the effectiveness and robustness of ZS-SKA.
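The prototypical-network step at the end is standard and can be sketched concretely. The relation names and 2-D vectors below are toy values; real representations would come from the prompt-encoded instances:

```python
# Sketch of the prototypical-network prediction step: each relation's
# prototype is the mean of its instance representations, and a query is
# assigned to the nearest prototype by Euclidean distance.
import math

def prototype(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(query, protos):
    return min(protos, key=lambda rel: euclidean(query, protos[rel]))

# Toy 2-D representations of augmented instances per relation.
protos = {
    "founded_by": prototype([[1.0, 0.0], [0.9, 0.1]]),
    "located_in": prototype([[0.0, 1.0], [0.1, 0.9]]),
}
pred = predict([0.8, 0.2], protos)  # nearest to the "founded_by" prototype
```

Because unseen relations get prototypes from the augmented (translated) instances, the same nearest-prototype rule applies uniformly to seen and unseen labels.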
Abstract
With the success of the prompt-tuning paradigm in Natural Language Processing (NLP), various prompt templates have been proposed to further stimulate specific knowledge for downstream tasks, e.g., machine translation, text generation, relation extraction, and so on. Existing prompt templates are mainly shared among all training samples and carry only the task description. However, training samples are quite diverse, and a shared task description cannot stimulate the unique task-related information in each training sample, especially for tasks with a finite label space. To exploit this unique task-related information, we imitate the human decision process, which seeks the contrastive attributes between an objective fact and its potential counterfactuals. Thus, we propose the Counterfactual Contrastive Prompt-Tuning (CCPrompt) approach for many-class classification, e.g., relation classification, topic classification, and entity typing. Compared with simple classification tasks, these tasks have more complex finite label spaces and place more demands on prompts. First, we prune the finite label space to construct fact-counterfactual pairs. Then, we exploit the contrastive attributes by projecting training instances onto every fact-counterfactual pair. We further set up global prototypes corresponding to all contrastive attributes for selecting valid contrastive attributes as additional tokens in the prompt template. Finally, simple Siamese representation learning is employed to enhance the robustness of the model. We conduct experiments on relation classification, topic classification, and entity typing tasks in both fully supervised and few-shot settings. The results indicate that our model outperforms prior baselines.
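The first step, pruning the label space into fact-counterfactual pairs, can be sketched as follows. The pairing rule (top-scored label as the fact, remaining top-k labels as counterfactuals) is a plausible reading of the abstract, not a confirmed detail; scores and labels are toy values:

```python
# Sketch of pruning a finite label space into fact-counterfactual pairs:
# the top-scored label acts as the "fact" and the other top-k labels serve
# as its counterfactuals. The pairing heuristic is an assumption.

def fact_counterfactual_pairs(scores, k=3):
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    fact, counterfactuals = ranked[0], ranked[1:]
    return [(fact, cf) for cf in counterfactuals]

scores = {
    "per:employee_of": 0.70,
    "org:founded_by": 0.20,
    "per:schools_attended": 0.08,
    "no_relation": 0.02,
}
pairs = fact_counterfactual_pairs(scores)
```

Each pair then defines a contrast axis onto which the instance is projected, yielding the contrastive attributes that feed the prompt template.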
Abstract
Despite the importance of relation extraction in building and representing knowledge, little research has focused on generalizing to unseen relation types. We introduce the task setting of Zero-Shot Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation extraction methods. Given an input sentence, each extracted triplet consists of the head entity, relation label, and tail entity, where the relation label is not seen at the training stage. To solve ZeroRTE, we propose to synthesize relation examples by prompting language models to generate structured texts. Concretely, we unify language model prompts and structured text approaches to design a structured prompt template for generating synthetic relation samples when conditioning on relation label prompts (RelationPrompt). To overcome the limitation in extracting multiple relation triplets from a sentence, we design a novel Triplet Search Decoding method. Experiments on the FewRel and Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation classification. Our code and data are available at github.com/declare-lab/RelationPrompt.
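A structured prompt of this kind pairs a conditioning prefix with a parser for the structured continuation, roughly as below. The "Relation / Context / Head Entity / Tail Entity" field names approximate RelationPrompt's style but are not guaranteed to match the released templates exactly:

```python
# Sketch of a structured prompt for synthesizing relation samples: the LM is
# conditioned on a relation label and continues with a sentence plus marked
# entities, which are then parsed back into a triplet. Field names are an
# approximation of the RelationPrompt style.

def generation_prompt(relation):
    # The LM would continue this prefix with a synthetic structured sample.
    return f"Relation: {relation}. Context:"

def parse_generation(text):
    # Expected continuation shape: "<sentence> Head Entity: <h>, Tail Entity: <t>."
    sentence, rest = text.split(" Head Entity: ")
    head, tail = rest.rstrip(".").split(", Tail Entity: ")
    return {"context": sentence.strip(), "head": head, "tail": tail}

sample = parse_generation(
    "Alan Turing studied at Princeton. Head Entity: Alan Turing, Tail Entity: Princeton."
)
```

Synthetic samples generated this way for an unseen relation label can then train an ordinary extractor, which is what makes the zero-shot triplet setting tractable.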
Abstract
Recently, prompt-tuning has achieved promising results on specific few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and obtaining a suitable label word is cumbersome and time-consuming. Furthermore, there exists abundant semantic and prior knowledge among the relation labels that cannot be ignored. To this end, we focus on incorporating knowledge among relation labels into prompt-tuning for relation extraction and propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words. Then, we synergistically optimize their representation with structured constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach. Our code and datasets are available on GitHub for reproducibility.
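The template shape this implies can be sketched at the string level. The bracketed tokens below are placeholders for learned (virtual) embeddings, not real vocabulary items, and the layout is an illustrative guess at the KnowPrompt-style construction:

```python
# Sketch of a KnowPrompt-style template: learnable virtual type words wrap
# the entity mentions and a [MASK] slot stands in for the relation's virtual
# answer word. Bracketed tokens denote learned embeddings, not literal text.

def knowprompt_template(sentence, head, tail):
    return f"{sentence} [SUB-TYPE] {head} [MASK] [OBJ-TYPE] {tail}"

t = knowprompt_template("Turing was born in London.", "Turing", "London")
```

At training time, the [SUB-TYPE]/[OBJ-TYPE] embeddings and the answer-word embeddings behind [MASK] would be optimized jointly under the structured constraints the abstract mentions, rather than hand-picked from the vocabulary.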
Abstract
The explosion of e-commerce has created a need for processing and analyzing product titles, such as entity typing in product titles. However, the rapid pace of e-commerce has led to the rapid emergence of new entities, which are difficult for general entity typing methods to handle. Besides, product titles in e-commerce have very different language styles from text in the general domain. To handle new entities in product titles and address their special language styles, we propose a textual entailment model with continuous prompt-tuning-based hypotheses and fusion embeddings for e-commerce entity typing. First, we reformulate entity typing as a textual entailment problem to handle new entities that are not present during training. Second, we design a model that automatically generates textual entailment hypotheses using a continuous prompt tuning method, which produces better hypotheses without manual design. Third, we utilize fusion embeddings of BERT and CharacterBERT to address the gap between the language styles of e-commerce product titles and the general domain. To analyze the effect of each contribution, we compare the performance of the entity typing and textual entailment models, and conduct ablation studies on continuous prompt tuning and fusion embeddings. We also evaluate the impact of different prompt template initializations for continuous prompt tuning. Our proposed model improves the average F1 score by around 2% compared to the baseline BERT entity typing model.
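The entailment reformulation can be sketched concretely. The fixed hypothesis pattern "{entity} is a {type}." below is only an illustration: the paper learns the hypothesis via continuous prompt tuning instead of hand-writing it, and the toy scores stand in for a real entailment model:

```python
# Sketch of recasting entity typing as textual entailment: each candidate
# type yields one premise-hypothesis pair, and the type whose hypothesis is
# most entailed wins. Scores here are toy stand-ins for an entailment model.

def hypotheses(title, entity, types):
    """One (premise, hypothesis) pair per candidate type."""
    return [(title, f"{entity} is a {t}.") for t in types]

def classify(entail_scores, types):
    """Pick the type with the highest entailment score."""
    return max(zip(entail_scores, types))[1]

pairs = hypotheses("Apple iPhone 15 Pro 256GB", "iPhone 15 Pro", ["phone", "laptop"])
pred = classify([0.9, 0.1], ["phone", "laptop"])
```

Since new types only add new hypotheses rather than new classifier heads, this setup handles entity types that never appeared during training.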