| 20.12 | Google | USENIX Security 2021 | Extracting Training Data from Large Language Models | Verbatim Text Sequences&Rank Likelihood |
| 22.11 | AE Studio | NeurIPS 2022 (ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models | Prompt Injection&Misaligned |
| 23.02 | Saarland University | arXiv | Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | Adversarial Prompting&Indirect Prompt Injection&LLM-Integrated Applications |
| 23.04 | Hong Kong University of Science and Technology | EMNLP 2023 (Findings) | Multi-step Jailbreaking Privacy Attacks on ChatGPT | Privacy&Jailbreaks |
| 23.05 | Jinan University, Hong Kong University of Science and Technology, Nanyang Technological University, Zhejiang University | EMNLP 2023 | Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models | Backdoor Attacks |
| 23.05 | Nanyang Technological University, University of New South Wales, Virginia Tech | arXiv | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Jailbreak&Prompt Engineering |
| 23.06 | Princeton University | ICML 2023 (Workshop) | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Visual Language Models&Adversarial Attacks&AI Alignment |
| 23.06 | Nanyang Technological University, University of New South Wales, Huazhong University of Science and Technology, Southern University of Science and Technology, Tianjin University | arXiv | Prompt Injection attack against LLM-integrated Applications | LLM-integrated Applications&Security Risks&Prompt Injection Attacks |
| 23.06 | Google | arXiv | Are aligned neural networks adversarially aligned? | Multimodal&Jailbreak |
| 23.07 | CMU | arXiv | Universal and Transferable Adversarial Attacks on Aligned Language Models | Jailbreak&Transferable Attack&Adversarial Attack |
| 23.07 | Language Technologies Institute, Carnegie Mellon University | arXiv | Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success | Prompt Extraction&Attack Success Measurement&Defensive Strategies |
| 23.07 | Nanyang Technological University | NDSS 2024 | MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots | Jailbreak&Reverse-Engineering&Automatic Generation |
| 23.07 | Cornell Tech | arXiv | Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs | Multi-Modal LLMs&Indirect Instruction Injection&Adversarial Perturbations |
| 23.07 | UNC Chapel Hill, Google DeepMind, ETH Zurich | AdvML Frontiers Workshop 2023 | Backdoor Attacks for In-Context Learning with Language Models | Backdoor Attacks&In-Context Learning |
| 23.07 | Google DeepMind | arXiv | A LLM Assisted Exploitation of AI-Guardian | Adversarial Machine Learning&AI-Guardian&Defense Robustness |
| 23.08 | CISPA Helmholtz Center for Information Security, NetApp | arXiv | “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Jailbreak Prompts&Adversarial Prompts&Proactive Detection |
| 23.09 | Ben-Gurion University, DeepKeep | arXiv | Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Genetic Algorithm&Adversarial Prompt&Black Box Jailbreak |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arXiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | University of California Santa Barbara, Fudan University, Shanghai AI Laboratory | arXiv | Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | AI Safety&Malicious Use&Fine-tuning |
| 23.10 | Peking University | arXiv | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | In-Context Learning&Adversarial Attacks&In-Context Demonstrations |
| 23.10 | University of Maryland College Park, Adobe Research | arXiv | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Adversarial Attacks&Interpretability&Jailbreaking |
| 23.11 | MBZUAI | arXiv | Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks | Adversarially-synthesized Texts&Word-level Attacks&Evaluation |
| 23.11 | Palisade Research | arXiv | BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B | Remove Safety Fine-tuning |
| 23.11 | University of Twente | ICNLSP 2023 | Efficient Black-Box Adversarial Attacks on Neural Text Detectors | Misclassification&Adversarial attacks |
| 23.11 | PRISM AI, Harmony Intelligence, Leap Laboratories | arXiv | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation | Persona-modulation Attacks&Jailbreaks&Automated Prompt |
| 23.11 | Tsinghua University | arXiv | Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Typographic Attack&Multi-modal&Safety Evaluation |
| 23.11 | Huazhong University of Science and Technology, Tsinghua University | arXiv | Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Membership Inference Attacks&Privacy and Security |
| 23.11 | Nanjing University, Meituan Inc | arXiv | A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Jailbreak Prompts&Safety Alignment&Safeguard Effectiveness |
| 23.11 | Google DeepMind | arXiv | Frontier Language Models Are Not Robust to Adversarial Arithmetic or "What Do I Need To Say So You Agree 2+2=5?" | Adversarial Arithmetic&Model Robustness&Adversarial Attacks |
| 23.11 | University of Illinois Chicago, Texas A&M University | arXiv | DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Pre-trained Language Models | Adversarial Attack&Distribution-Aware&LoRA-Based Attack |
| 23.11 | Illinois Institute of Technology | arXiv | Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment | Backdoor Activation Attack&Large Language Models&AI Safety&Activation Steering&Trojan Steering Vectors |
| 23.11 | Wayne State University | arXiv | Hijacking Large Language Models via Adversarial In-Context Learning | Adversarial Attacks&Gradient-Based Prompt Search&Adversarial Suffixes |
| 23.11 | Hong Kong Baptist University, Shanghai Jiao Tong University, Shanghai AI Laboratory, The University of Sydney | arXiv | DeepInception: Hypnotize Large Language Model to Be Jailbreaker | Jailbreak&DeepInception |
| 23.11 | Xi’an Jiaotong-Liverpool University | arXiv | Generating Valid and Natural Adversarial Examples with Large Language Models | Adversarial examples&Text classification |
| 23.11 | Michigan State University | arXiv | Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems | Transferable Attacks&AI Systems&Adversarial Attacks |
| 23.11 | Tsinghua University, Kuaishou Technology | arXiv | Evil Geniuses: Delving into the Safety of LLM-based Agents | LLM-based Agents&Safety&Malicious Attacks |
| 23.11 | Cornell University | arXiv | Language Model Inversion | Model Inversion&Prompt Reconstruction&Privacy |
| 23.11 | ETH Zurich | arXiv | Universal Jailbreak Backdoors from Poisoned Human Feedback | RLHF&Backdoor Attacks |
| 23.11 | UC Santa Cruz, UNC-Chapel Hill | arXiv | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Vision Large Language Models&Safety Evaluation&Adversarial Robustness |
| 23.11 | Texas Tech University | arXiv | Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles | Social Engineering&Security&Prompt Engineering |
| 23.11 | Johns Hopkins University | arXiv | Instruct2Attack: Language-Guided Semantic Adversarial Attacks | Language-guided Attacks&Latent Diffusion Models&Adversarial Attack |
| 23.11 | Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, ETH Zurich | arXiv | Scalable Extraction of Training Data from (Production) Language Models | Extractable Memorization&Data Extraction&Adversary Attacks |
| 23.11 | University of Maryland, Mila, Towards AI, Stanford, Technical University of Sofia, University of Milan, NYU | arXiv | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | Prompt Hacking&Security Threats |
| 23.11 | University of Washington, UIUC, Pennsylvania State University, University of Chicago | arXiv | Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications | LLM-Integrated Applications&Attack Surfaces |
| 23.11 | Jinan University, Guangzhou Xuanyuan Research Institute Co. Ltd., The Hong Kong Polytechnic University | arXiv | TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4 | Prompt-based Learning&Backdoor Attack |
| 23.12 | The Pennsylvania State University | arXiv | Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Backdoor Injection&Safety Alignment |
| 23.12 | Drexel University | arXiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Yale University, Robust Intelligence | arXiv | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Tree of Attacks with Pruning (TAP)&Jailbreaking&Prompt Generation |
| 23.12 | Independent (Now at Google DeepMind) | arXiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | Harbin Institute of Technology | arXiv | Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak | Jailbreak Attack&Inherent Response Tendency&Affirmation Tendency |
| 23.12 | University of Wisconsin-Madison | arXiv | DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Code Generation&Adversarial Attacks&Cybersecurity |
| 23.12 | Carnegie Mellon University, IBM Research | arXiv | Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks | Data Poisoning Attacks&Natural Language Generation&Cybersecurity |
| 23.12 | Purdue University | NeurIPS 2023 (Workshop) | Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | Knowledge Extraction&Interrogation Techniques&Cybersecurity |
| 23.12 | Sungkyunkwan University, University of Tennessee | arXiv | Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers’ Coding Practices with Insecure Suggestions from Poisoned AI Models | Poisoning Attacks&Software Development |
| 23.12 | North Carolina State University, New York University, Stanford University | arXiv | Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning | Federated Learning&Privacy Attacks |
| 23.12 | Korea Advanced Institute of Science and Technology, Graduate School of AI | arXiv | Hijacking Context in Large Multi-modal Models | Large Multi-modal Models&Context Hijacking |
| 23.12 | Xi’an Jiaotong University, Nanyang Technological University, Singapore Management University | arXiv | A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection | Jailbreaking Detection&Multi-Modal |
| 23.12 | Logistic and Supply Chain MultiTech R&D Centre (LSCM) | UbiSec 2023 | A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models | Cybersecurity Attacks&Defense Strategies |
| 23.12 | University of Illinois Urbana-Champaign, VMware Research | arXiv | Bypassing the Safety Training of Open-Source LLMs with Priming Attacks | Safety Training&Priming Attacks |
| 23.12 | Delft University of Technology | ICSE 2024 | Traces of Memorisation in Large Language Models for Code | Code Memorisation&Data Extraction Attacks |
| 23.12 | University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft | arXiv | Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense |
| 23.12 | Nanjing University of Aeronautics and Astronautics | NLPCC 2023 | Punctuation Matters! Stealthy Backdoor Attack for Language Models | Backdoor Attack&PuncAttack&Stealthiness |
| 23.12 | FAR AI, McGill University, MILA, Jagiellonian University | arXiv | Exploiting Novel GPT-4 APIs | Fine-Tuning&Knowledge Retrieval&Security Vulnerabilities |
| 23.12 | EPFL | | Adversarial Attacks on GPT-4 via Simple Random Search | Adversarial Attacks&Random Search&Jailbreak |
| 24.01 | Logistic and Supply Chain MultiTech R&D Centre (LSCM) | CSDE 2023 | A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models | Evaluation&Prompt Injection&Cyber Security |
| 24.01 | University of Southern California | arXiv | The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance | Prompt Engineering&Text Classification&Jailbreaks |
| 24.01 | Virginia Tech, Renmin University of China, UC Davis, Stanford University | arXiv | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | AI Safety&Persuasion Adversarial Prompts&Jailbreak |
| 24.01 | Anthropic, Redwood Research, Mila Quebec AI Institute, University of Oxford | arXiv | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | Deceptive Behavior&Safety Training&Backdoored Behavior&Adversarial Training |
| 24.01 | Jinan University, Nanyang Technological University, Beijing Institute of Technology, Pazhou Lab | arXiv | Universal Vulnerabilities in Large Language Models: In-Context Learning Backdoor Attacks | In-context Learning&Security&Backdoor Attacks |
| 24.01 | Carnegie Mellon University | arXiv | Combating Adversarial Attacks with Multi-Agent Debate | Adversarial Attacks&Multi-Agent Debate&Red Team |
| 24.01 | Fudan University | arXiv | Open the Pandora’s Box of LLMs: Jailbreaking LLMs through Representation Engineering | LLM Security&Representation Engineering |
| 24.01 | Northwestern University, New York University, University of Liverpool, Rutgers University | arXiv | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset |
| 24.01 | Kyushu Institute of Technology | arXiv | All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks | Jailbreak Attacks&Black-box Method |
| 24.01 | MIT | arXiv | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Jailbreaking&Model Safety |
| 24.01 | Aalborg University | arXiv | Text Embedding Inversion Attacks on Multilingual Language Models | Text Embedding&Inversion Attacks&Multilingual Language Models |
| 24.01 | University of Illinois Urbana-Champaign, University of Washington, Western Washington University | arXiv | BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | Chain-of-Thought Prompting&Backdoor Attacks |
| 24.01 | The University of Hong Kong, Zhejiang University | arXiv | Red Teaming Visual Language Models | Vision-Language Models&Red Teaming |
| 24.01 | University of California Santa Barbara, Sea AI Lab Singapore, Carnegie Mellon University | arXiv | Weak-to-Strong Jailbreaking on Large Language Models | Jailbreaking&Adversarial Prompts&AI Safety |
| 24.02 | Boston University | arXiv | Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks | Large Vision-Language Models&Typographic Attacks&Self-Generated Attacks |
| 24.02 | Copenhagen Business School, Temple University | arXiv | An Early Categorization of Prompt Injection Attacks on Large Language Models | Prompt Injection&Categorization |
| 24.02 | Michigan State University, Okinawa Institute of Science and Technology (OIST) | arXiv | Data Poisoning for In-context Learning | In-context learning&Data poisoning&Security |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Conversation Reconstruction Attack Against GPT Models | Conversation Reconstruction Attack&Privacy risks&Security |
| 24.02 | University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft | arXiv | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Automated Red Teaming&Robust Refusal |
| 24.02 | University of Washington, University of Virginia, Allen Institute for Artificial Intelligence | arXiv | Do Membership Inference Attacks Work on Large Language Models? | Membership Inference Attacks&Privacy&Security |
| 24.02 | Pennsylvania State University, Wuhan University, Illinois Institute of Technology | arXiv | PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models | Knowledge Poisoning Attacks&Retrieval-Augmented Generation |
| 24.02 | Purdue University, University of Massachusetts at Amherst | arXiv | Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Jailbreaking LLM&Optimization |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Comprehensive Assessment of Jailbreak Attacks Against LLMs | Jailbreak Attacks&Attack Methods&Policy Alignment |
| 24.02 | UC Berkeley | arXiv | StruQ: Defending Against Prompt Injection with Structured Queries | Prompt Injection Attacks&Structured Queries&Defense Mechanisms |
| 24.02 | Nanyang Technological University, Huazhong University of Science and Technology, University of New South Wales | arXiv | PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning | Jailbreak Attacks&Retrieval Augmented Generation (RAG) |
| 24.02 | Sea AI Lab, Southern University | arXiv | Test-Time Backdoor Attacks on Multimodal Large Language Models | Backdoor Attacks&Multimodal Large Language Models (MLLMs)&Adversarial Test Images |
| 24.02 | University of Illinois at Urbana–Champaign, University of California, San Diego, Allen Institute for AI | arXiv | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Jailbreaks&Controllable Attack Generation |
| 24.02 | ISCAS, NTU | arXiv | Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues | Jailbreak Attacks&Indirect Attack&Puzzler |
| 24.02 | École Polytechnique Fédérale de Lausanne, University of Wisconsin-Madison | arXiv | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Jailbreaking Attacks&Contextual Interaction&Multi-Round Interactions |
| 24.02 | University of Electronic Science and Technology of China, CISPA Helmholtz Center for Information Security, NetApp | arXiv | Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization | Customization&Instruction Backdoor Attacks&GPTs |
| 24.02 | Shanghai Artificial Intelligence Laboratory | arXiv | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | LLM Conversation Safety&Attacks&Defenses |
| 24.02 | UC Berkeley, New York University | arXiv | PAL: Proxy-Guided Black-Box Attack on Large Language Models | Black-Box Attack&Proxy-Guided Attack&PAL |
| 24.02 | Center for Human-Compatible AI, UC Berkeley | arXiv | A StrongREJECT for Empty Jailbreaks | Jailbreaks&Benchmarking&StrongREJECT |
| 24.02 | Arizona State University | arXiv | Jailbreaking Proprietary Large Language Models using Word Substitution Cipher | Jailbreak&Word Substitution Cipher&Attack Success Rate |
| 24.02 | Renmin University of China, Beijing, Peking University, WeChat AI | arXiv | Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents | Backdoor Attacks&Agent Safety&Framework |
| 24.02 | University of Washington, UIUC, Western Washington University, University of Chicago | arXiv | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | ASCII Art&Jailbreak Attacks&Safety Alignment |
| 24.02 | Jinan University, Nanyang Technological University, Zhejiang University, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sony Research | arXiv | Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning | Weight-Poisoning Backdoor Attacks&Parameter-Efficient Fine-Tuning (PEFT)&Poisoned Sample Identification Module (PSIM) |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Prompt Stealing Attacks Against Large Language Models | Prompt Engineering&Security |
| 24.02 | University of New South Wales Australia, Delft University of Technology The Netherlands, Nanyang Technological University Singapore | arXiv | LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study | Jailbreak Attacks&Defense Techniques |
| 24.02 | Wayne State University, University of Michigan-Flint | arXiv | Learning to Poison Large Language Models During Instruction Tuning | Data Poisoning&Backdoor Attacks |
| 24.02 | Nanyang Technological University, Zhejiang University, The Chinese University of Hong Kong | arXiv | Backdoor Attacks on Dense Passage Retrievers for Disseminating Misinformation | Dense Passage Retrieval&Backdoor Attacks&Misinformation |
| 24.02 | University of Michigan | arXiv | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails | Universal Adversarial Prefixes&Guard Models |
| 24.02 | Meta | arXiv | Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts | Adversarial Prompts&Quality-Diversity&Safety |
| 24.02 | Fudan University | arXiv | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Personalized Encryption&Safety Mechanisms |
| 24.02 | Carnegie Mellon University | arXiv | Attacking LLM Watermarks by Exploiting Their Strengths | LLM Watermarks&Adversarial Attacks |
| 24.02 | Beihang University | arXiv | From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings | Adversarial Suffix&Text Embedding Translation |
| 24.02 | University of Maryland College Park | arXiv | Fast Adversarial Attacks on Language Models In One GPU Minute | Adversarial Attacks&BEAST&Computational Efficiency |
| 24.02 | Beijing University of Posts and Telecommunications, University of Michigan | arXiv | Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue | Multi-turn Dialogue&Safety Vulnerability |
| 24.02 | University of California, The Hong Kong University of Science and Technology, University of Maryland | arXiv | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Jailbreaking Attacks&Prompt Decomposition |
| 24.02 | Massachusetts Institute of Technology, MIT-IBM Watson AI Lab | arXiv | Curiosity-Driven Red-Teaming for Large Language Models | Curiosity-Driven Exploration&Red Teaming |
| 24.02 | SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences, School of Cyber Security, University of Chinese Academy of Sciences, Tsinghua University, RealAI | arXiv | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Jailbreaking&Large Language Models&Adversarial Attacks |
| 24.03 | Rice University, Samsung Electronics America | arXiv | LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario | Low-Rank Adaptation (LoRA)&Backdoor Attacks&Model Security |
| 24.03 | The University of Hong Kong | arXiv | ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Vision-Language Models&Data Poisoning&Jailbreaking Attack |
| 24.03 | SPRING Lab, EPFL | arXiv | Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks | Prompt Injection Attacks&Optimization-Based Approach&Security |
| 24.03 | Shanghai University of Finance and Economics, Southern University of Science and Technology | arXiv | Tastle: Distract Large Language Models for Automatic Jailbreak Attack | Jailbreak Attack&Black-box Framework |
| 24.03 | Google DeepMind, ETH Zurich, University of Washington, OpenAI, McGill University | arXiv | Stealing Part of a Production Language Model | Model Stealing&Language Models&Security |
| 24.03 | University of Edinburgh | arXiv | Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks | Prompt Injection Attacks&Machine Translation&Inverse Scaling |
| 24.03 | Nanyang Technological University | arXiv | BadEdit: Backdooring Large Language Models by Model Editing | Backdoor Attacks&Model Editing&Security |
| 24.03 | Fudan University, Shanghai AI Laboratory | arXiv | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models | Jailbreak Attacks&Security&Framework |
| 24.03 | Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai Engineering Research Center of AI & Robotics | arXiv | Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction | Vision-Language Pre-trained Model&Adversarial Transferability&Black-Box Attack |
| 24.03 | Microsoft | arXiv | Securing Large Language Models: Threats, Vulnerabilities, and Responsible Practices | Security Risks&Vulnerabilities |
| 24.03 | Carnegie Mellon University | arXiv | Jailbreaking is Best Solved by Definition | Jailbreak Attacks&Adaptive Attacks |
| 24.03 | ShanghaiTech University | arXiv | LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models | Universal Adversarial Triggers&Prompt-based Learning&Natural Language Attack |
| 24.03 | Huazhong University of Science and Technology, Lehigh University, University of Notre Dame, Duke University | arXiv | Optimization-based Prompt Injection Attack to LLM-as-a-Judge | Prompt Injection Attack&LLM-as-a-Judge&Optimization |
| 24.03 | Washington University in St. Louis, University of Wisconsin-Madison, John Burroughs School | USENIX Security 2024 | Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models | Jailbreak Prompts&Security |
| 24.04 | University of Pennsylvania, ETH Zurich, EPFL, Sony AI | arXiv | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Jailbreaking Attacks&Robustness Benchmark |
| 24.04 | Microsoft Azure, Microsoft, Microsoft Research | arXiv | The Crescendo Multi-Turn LLM Jailbreak Attack | Jailbreak Attacks&Multi-Turn Interaction |
| 24.04 | EPFL | arXiv | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Adaptive Attacks&Jailbreaking |
| 24.04 | The Ohio State University, University of Wisconsin-Madison | arXiv | JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | Multimodal Large Language Models&Jailbreak Attacks&Benchmark |