# Jailbreaks&Attack

## Different from the main README🕵️

- Within this subtopic, we keep updating with the latest papers so that researchers in this area can quickly grasp recent trends.
- In addition to the most recent updates, we add keywords to each entry to help you find content of interest more quickly.
- Within each subtopic, we also maintain profiles of scholars we admire and endorse in the field; their work is often high-quality and forward-looking!

## 📑Papers

| Date | Institute | Publication | Paper | Keywords |
|------|-----------|-------------|-------|----------|
| 20.12 | Google | USENIX Security 2021 | Extracting Training Data from Large Language Models | Verbatim Text Sequences&Rank Likelihood |
| 22.11 | AE Studio | NeurIPS 2022 (ML Safety Workshop) | Ignore Previous Prompt: Attack Techniques For Language Models | Prompt Injection&Misaligned |
| 23.02 | Saarland University | arXiv | Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | Adversarial Prompting&Indirect Prompt Injection&LLM-Integrated Applications |
| 23.04 | Hong Kong University of Science and Technology | EMNLP 2023 (Findings) | Multi-step Jailbreaking Privacy Attacks on ChatGPT | Privacy&Jailbreaks |
| 23.05 | Jinan University, Hong Kong University of Science and Technology, Nanyang Technological University, Zhejiang University | EMNLP 2023 | Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models | Backdoor Attacks |
| 23.05 | Nanyang Technological University, University of New South Wales, Virginia Tech | arXiv | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Jailbreak&Prompt Engineering |
| 23.06 | Princeton University | ICML 2023 (Workshop) | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Visual Language Models&Adversarial Attacks&AI Alignment |
| 23.06 | Nanyang Technological University, University of New South Wales, Huazhong University of Science and Technology, Southern University of Science and Technology, Tianjin University | arXiv | Prompt Injection attack against LLM-integrated Applications | LLM-integrated Applications&Security Risks&Prompt Injection Attacks |
| 23.06 | Google | arXiv | Are aligned neural networks adversarially aligned? | Multimodal&Jailbreak |
| 23.07 | CMU | arXiv | Universal and Transferable Adversarial Attacks on Aligned Language Models | Jailbreak&Transferable Attack&Adversarial Attack |
| 23.07 | Language Technologies Institute, Carnegie Mellon University | arXiv | Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success | Prompt Extraction&Attack Success Measurement&Defensive Strategies |
| 23.07 | Nanyang Technological University | NDSS 2024 | MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots | Jailbreak&Reverse-Engineering&Automatic Generation |
| 23.07 | Cornell Tech | arXiv | Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs | Multi-Modal LLMs&Indirect Instruction Injection&Adversarial Perturbations |
| 23.07 | UNC Chapel Hill, Google DeepMind, ETH Zurich | AdvML Frontiers Workshop 2023 | Backdoor Attacks for In-Context Learning with Language Models | Backdoor Attacks&In-Context Learning |
| 23.07 | Google DeepMind | arXiv | A LLM Assisted Exploitation of AI-Guardian | Adversarial Machine Learning&AI-Guardian&Defense Robustness |
| 23.08 | CISPA Helmholtz Center for Information Security, NetApp | arXiv | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Jailbreak Prompts&Adversarial Prompts&Proactive Detection |
| 23.09 | Ben-Gurion University, DeepKeep | arXiv | Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Genetic Algorithm&Adversarial Prompt&Black Box Jailbreak |
| 23.10 | Princeton University, Virginia Tech, IBM Research, Stanford University | arXiv | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Fine-tuning&Safety Risks&Adversarial Training |
| 23.10 | University of California Santa Barbara, Fudan University, Shanghai AI Laboratory | arXiv | Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models | AI Safety&Malicious Use&Fine-tuning |
| 23.10 | Peking University | arXiv | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | In-Context Learning&Adversarial Attacks&In-Context Demonstrations |
| 23.10 | University of Maryland College Park, Adobe Research | arXiv | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Adversarial Attacks&Interpretability&Jailbreaking |
| 23.11 | MBZUAI | arXiv | Robustness Tests for Automatic Machine Translation Metrics with Adversarial Attacks | Adversarially-synthesized Texts&Word-level Attacks&Evaluation |
| 23.11 | Palisade Research | arXiv | BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B | Remove Safety Fine-tuning |
| 23.11 | University of Twente | ICNLSP 2023 | Efficient Black-Box Adversarial Attacks on Neural Text Detectors | Misclassification&Adversarial Attacks |
| 23.11 | PRISM AI, Harmony Intelligence, Leap Laboratories | arXiv | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation | Persona-modulation Attacks&Jailbreaks&Automated Prompt |
| 23.11 | Tsinghua University | arXiv | Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Typographic Attack&Multi-modal&Safety Evaluation |
| 23.11 | Huazhong University of Science and Technology, Tsinghua University | arXiv | Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration | Membership Inference Attacks&Privacy and Security |
| 23.11 | Nanjing University, Meituan Inc. | arXiv | A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Jailbreak Prompts&Safety Alignment&Safeguard Effectiveness |
| 23.11 | Google DeepMind | arXiv | Frontier Language Models Are Not Robust to Adversarial Arithmetic, or "What Do I Need To Say So You Agree 2+2=5?" | Adversarial Arithmetic&Model Robustness&Adversarial Attacks |
| 23.11 | University of Illinois Chicago, Texas A&M University | arXiv | DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Pre-trained Language Models | Adversarial Attack&Distribution-Aware&LoRA-Based Attack |
| 23.11 | Illinois Institute of Technology | arXiv | Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment | Backdoor Activation Attack&Large Language Models&AI Safety&Activation Steering&Trojan Steering Vectors |
| 23.11 | Wayne State University | arXiv | Hijacking Large Language Models via Adversarial In-Context Learning | Adversarial Attacks&Gradient-Based Prompt Search&Adversarial Suffixes |
| 23.11 | Hong Kong Baptist University, Shanghai Jiao Tong University, Shanghai AI Laboratory, The University of Sydney | arXiv | DeepInception: Hypnotize Large Language Model to Be Jailbreaker | Jailbreak&DeepInception |
| 23.11 | Xi'an Jiaotong-Liverpool University | arXiv | Generating Valid and Natural Adversarial Examples with Large Language Models | Adversarial Examples&Text Classification |
| 23.11 | Michigan State University | arXiv | Beyond Boundaries: A Comprehensive Survey of Transferable Attacks on AI Systems | Transferable Attacks&AI Systems&Adversarial Attacks |
| 23.11 | Tsinghua University, Kuaishou Technology | arXiv | Evil Geniuses: Delving into the Safety of LLM-based Agents | LLM-based Agents&Safety&Malicious Attacks |
| 23.11 | Cornell University | arXiv | Language Model Inversion | Model Inversion&Prompt Reconstruction&Privacy |
| 23.11 | ETH Zurich | arXiv | Universal Jailbreak Backdoors from Poisoned Human Feedback | RLHF&Backdoor Attacks |
| 23.11 | UC Santa Cruz, UNC-Chapel Hill | arXiv | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Vision Large Language Models&Safety Evaluation&Adversarial Robustness |
| 23.11 | Texas Tech University | arXiv | Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles | Social Engineering&Security&Prompt Engineering |
| 23.11 | Johns Hopkins University | arXiv | Instruct2Attack: Language-Guided Semantic Adversarial Attacks | Language-guided Attacks&Latent Diffusion Models&Adversarial Attack |
| 23.11 | Google DeepMind, University of Washington, Cornell, CMU, UC Berkeley, ETH Zurich | arXiv | Scalable Extraction of Training Data from (Production) Language Models | Extractable Memorization&Data Extraction&Adversary Attacks |
| 23.11 | University of Maryland, Mila, Towards AI, Stanford, Technical University of Sofia, University of Milan, NYU | arXiv | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | Prompt Hacking&Security Threats |
| 23.11 | University of Washington, UIUC, Pennsylvania State University, University of Chicago | arXiv | Identifying and Mitigating Vulnerabilities in LLM-Integrated Applications | LLM-Integrated Applications&Attack Surfaces |
| 23.11 | Jinan University, Guangzhou Xuanyuan Research Institute Co. Ltd., The Hong Kong Polytechnic University | arXiv | TARGET: Template-Transferable Backdoor Attack Against Prompt-based NLP Models via GPT4 | Prompt-based Learning&Backdoor Attack |
| 23.12 | The Pennsylvania State University | arXiv | Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Backdoor Injection&Safety Alignment |
| 23.12 | Drexel University | arXiv | A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly | Security&Privacy&Attacks |
| 23.12 | Yale University, Robust Intelligence | arXiv | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Tree of Attacks with Pruning (TAP)&Jailbreaking&Prompt Generation |
| 23.12 | Independent (now at Google DeepMind) | arXiv | Scaling Laws for Adversarial Attacks on Language Model Activations | Adversarial Attacks&Language Model Activations&Scaling Laws |
| 23.12 | Harbin Institute of Technology | arXiv | Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak | Jailbreak Attack&Inherent Response Tendency&Affirmation Tendency |
| 23.12 | University of Wisconsin-Madison | arXiv | DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Code Generation&Adversarial Attacks&Cybersecurity |
| 23.12 | Carnegie Mellon University, IBM Research | arXiv | Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks | Data Poisoning Attacks&Natural Language Generation&Cybersecurity |
| 23.12 | Purdue University | NeurIPS 2023 (Workshop) | Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs | Knowledge Extraction&Interrogation Techniques&Cybersecurity |
| 23.12 | Sungkyunkwan University, University of Tennessee | arXiv | Poisoned ChatGPT Finds Work for Idle Hands: Exploring Developers' Coding Practices with Insecure Suggestions from Poisoned AI Models | Poisoning Attacks&Software Development |
| 23.12 | North Carolina State University, New York University, Stanford University | arXiv | Beyond Gradient and Priors in Privacy Attacks: Leveraging Pooler Layer Inputs of Language Models in Federated Learning | Federated Learning&Privacy Attacks |
| 23.12 | Korea Advanced Institute of Science and Technology (KAIST), Graduate School of AI | arXiv | Hijacking Context in Large Multi-modal Models | Large Multi-modal Models&Context Hijacking |
| 23.12 | Xi'an Jiaotong University, Nanyang Technological University, Singapore Management University | arXiv | A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection | Jailbreaking Detection&Multi-Modal |
| 23.12 | Logistics and Supply Chain MultiTech R&D Centre (LSCM) | UbiSec 2023 | A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models | Cybersecurity Attacks&Defense Strategies |
| 23.12 | University of Illinois Urbana-Champaign, VMware Research | arXiv | Bypassing the Safety Training of Open-Source LLMs with Priming Attacks | Safety Training&Priming Attacks |
| 23.12 | Delft University of Technology | ICSE 2024 | Traces of Memorisation in Large Language Models for Code | Code Memorisation&Data Extraction Attacks |
| 23.12 | University of Science and Technology of China, Hong Kong University of Science and Technology, Microsoft | arXiv | Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models | Indirect Prompt Injection Attacks&BIPIA Benchmark&Defense |
| 23.12 | Nanjing University of Aeronautics and Astronautics | NLPCC 2023 | Punctuation Matters! Stealthy Backdoor Attack for Language Models | Backdoor Attack&PuncAttack&Stealthiness |
| 23.12 | FAR AI, McGill University, MILA, Jagiellonian University | arXiv | Exploiting Novel GPT-4 APIs | Fine-Tuning&Knowledge Retrieval&Security Vulnerabilities |
| 23.12 | EPFL | | Adversarial Attacks on GPT-4 via Simple Random Search | Adversarial Attacks&Random Search&Jailbreak |
| 24.01 | Logistics and Supply Chain MultiTech R&D Centre (LSCM) | CSDE 2023 | A Novel Evaluation Framework for Assessing Resilience Against Prompt Injection Attacks in Large Language Models | Evaluation&Prompt Injection&Cyber Security |
| 24.01 | University of Southern California | arXiv | The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance | Prompt Engineering&Text Classification&Jailbreaks |
| 24.01 | Virginia Tech, Renmin University of China, UC Davis, Stanford University | arXiv | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | AI Safety&Persuasion Adversarial Prompts&Jailbreak |
| 24.01 | Anthropic, Redwood Research, Mila Quebec AI Institute, University of Oxford | arXiv | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | Deceptive Behavior&Safety Training&Backdoored Behavior&Adversarial Training |
| 24.01 | Jinan University, Nanyang Technological University, Beijing Institute of Technology, Pazhou Lab | arXiv | Universal Vulnerabilities in Large Language Models: In-Context Learning Backdoor Attacks | In-context Learning&Security&Backdoor Attacks |
| 24.01 | Carnegie Mellon University | arXiv | Combating Adversarial Attacks with Multi-Agent Debate | Adversarial Attacks&Multi-Agent Debate&Red Team |
| 24.01 | Fudan University | arXiv | Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering | LLM Security&Representation Engineering |
| 24.01 | Northwestern University, New York University, University of Liverpool, Rutgers University | arXiv | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Jailbreak Attack&Evaluation Frameworks&Ground Truth Dataset |
| 24.01 | Kyushu Institute of Technology | arXiv | All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks | Jailbreak Attacks&Black-box Method |
| 24.01 | MIT | arXiv | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Jailbreaking&Model Safety |
| 24.01 | Aalborg University | arXiv | Text Embedding Inversion Attacks on Multilingual Language Models | Text Embedding&Inversion Attacks&Multilingual Language Models |
| 24.01 | University of Illinois Urbana-Champaign, University of Washington, Western Washington University | arXiv | BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | Chain-of-Thought Prompting&Backdoor Attacks |
| 24.01 | The University of Hong Kong, Zhejiang University | arXiv | Red Teaming Visual Language Models | Vision-Language Models&Red Teaming |
| 24.01 | University of California Santa Barbara, Sea AI Lab Singapore, Carnegie Mellon University | arXiv | Weak-to-Strong Jailbreaking on Large Language Models | Jailbreaking&Adversarial Prompts&AI Safety |
| 24.02 | Boston University | arXiv | Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks | Large Vision-Language Models&Typographic Attacks&Self-Generated Attacks |
| 24.02 | Copenhagen Business School, Temple University | arXiv | An Early Categorization of Prompt Injection Attacks on Large Language Models | Prompt Injection&Categorization |
| 24.02 | Michigan State University, Okinawa Institute of Science and Technology (OIST) | arXiv | Data Poisoning for In-context Learning | In-context Learning&Data Poisoning&Security |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Conversation Reconstruction Attack Against GPT Models | Conversation Reconstruction Attack&Privacy Risks&Security |
| 24.02 | University of Illinois Urbana-Champaign, Center for AI Safety, Carnegie Mellon University, UC Berkeley, Microsoft | arXiv | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Automated Red Teaming&Robust Refusal |
| 24.02 | University of Washington, University of Virginia, Allen Institute for Artificial Intelligence | arXiv | Do Membership Inference Attacks Work on Large Language Models? | Membership Inference Attacks&Privacy&Security |
| 24.02 | Pennsylvania State University, Wuhan University, Illinois Institute of Technology | arXiv | PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models | Knowledge Poisoning Attacks&Retrieval-Augmented Generation |
| 24.02 | Purdue University, University of Massachusetts at Amherst | arXiv | Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Jailbreaking LLM&Optimization |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Comprehensive Assessment of Jailbreak Attacks Against LLMs | Jailbreak Attacks&Attack Methods&Policy Alignment |
| 24.02 | UC Berkeley | arXiv | StruQ: Defending Against Prompt Injection with Structured Queries | Prompt Injection Attacks&Structured Queries&Defense Mechanisms |
| 24.02 | Nanyang Technological University, Huazhong University of Science and Technology, University of New South Wales | arXiv | PANDORA: Jailbreak GPTs by Retrieval Augmented Generation Poisoning | Jailbreak Attacks&Retrieval Augmented Generation (RAG) |
| 24.02 | Sea AI Lab, Southern University | arXiv | Test-Time Backdoor Attacks on Multimodal Large Language Models | Backdoor Attacks&Multimodal Large Language Models (MLLMs)&Adversarial Test Images |
| 24.02 | University of Illinois at Urbana-Champaign, University of California San Diego, Allen Institute for AI | arXiv | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Jailbreaks&Controllable Attack Generation |
| 24.02 | ISCAS, NTU | arXiv | Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues | Jailbreak Attacks&Indirect Attack&Puzzler |
| 24.02 | École Polytechnique Fédérale de Lausanne, University of Wisconsin-Madison | arXiv | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Jailbreaking Attacks&Contextual Interaction&Multi-Round Interactions |
| 24.02 | University of Electronic Science and Technology of China, CISPA Helmholtz Center for Information Security, NetApp | arXiv | Rapid Adoption, Hidden Risks: The Dual Impact of Large Language Model Customization | Customization&Instruction Backdoor Attacks&GPTs |
| 24.02 | Shanghai Artificial Intelligence Laboratory | arXiv | Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey | LLM Conversation Safety&Attacks&Defenses |
| 24.02 | UC Berkeley, New York University | arXiv | PAL: Proxy-Guided Black-Box Attack on Large Language Models | Black-Box Attack&Proxy-Guided Attack&PAL |
| 24.02 | Center for Human-Compatible AI, UC Berkeley | arXiv | A StrongREJECT for Empty Jailbreaks | Jailbreaks&Benchmarking&StrongREJECT |
| 24.02 | Arizona State University | arXiv | Jailbreaking Proprietary Large Language Models using Word Substitution Cipher | Jailbreak&Word Substitution Cipher&Attack Success Rate |
| 24.02 | Renmin University of China, Peking University, WeChat AI | arXiv | Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents | Backdoor Attacks&Agent Safety&Framework |
| 24.02 | University of Washington, UIUC, Western Washington University, University of Chicago | arXiv | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | ASCII Art&Jailbreak Attacks&Safety Alignment |
| 24.02 | Jinan University, Nanyang Technological University, Zhejiang University, Hong Kong University of Science and Technology, Beijing Institute of Technology, Sony Research | arXiv | Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning | Weight-Poisoning Backdoor Attacks&Parameter-Efficient Fine-Tuning (PEFT)&Poisoned Sample Identification Module (PSIM) |
| 24.02 | CISPA Helmholtz Center for Information Security | arXiv | Prompt Stealing Attacks Against Large Language Models | Prompt Engineering&Security |
| 24.02 | University of New South Wales (Australia), Delft University of Technology (The Netherlands), Nanyang Technological University (Singapore) | arXiv | LLM Jailbreak Attack versus Defense Techniques - A Comprehensive Study | Jailbreak Attacks&Defense Techniques |
| 24.02 | Wayne State University, University of Michigan-Flint | arXiv | Learning to Poison Large Language Models During Instruction Tuning | Data Poisoning&Backdoor Attacks |
| 24.02 | Nanyang Technological University, Zhejiang University, The Chinese University of Hong Kong | arXiv | Backdoor Attacks on Dense Passage Retrievers for Disseminating Misinformation | Dense Passage Retrieval&Backdoor Attacks&Misinformation |
| 24.02 | University of Michigan | arXiv | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails | Universal Adversarial Prefixes&Guard Models |
| 24.02 | Meta | arXiv | Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts | Adversarial Prompts&Quality-Diversity&Safety |
| 24.02 | Fudan University | arXiv | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Personalized Encryption&Safety Mechanisms |
| 24.02 | Carnegie Mellon University | arXiv | Attacking LLM Watermarks by Exploiting Their Strengths | LLM Watermarks&Adversarial Attacks |
| 24.02 | Beihang University | arXiv | From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings | Adversarial Suffix&Text Embedding Translation |
| 24.02 | University of Maryland College Park | arXiv | Fast Adversarial Attacks on Language Models In One GPU Minute | Adversarial Attacks&BEAST&Computational Efficiency |
| 24.02 | Beijing University of Posts and Telecommunications, University of Michigan | arXiv | Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue | Multi-turn Dialogue&Safety Vulnerability |
| 24.02 | University of California, The Hong Kong University of Science and Technology, University of Maryland | arXiv | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Jailbreaking Attacks&Prompt Decomposition |
| 24.02 | Massachusetts Institute of Technology, MIT-IBM Watson AI Lab | arXiv | Curiosity-driven Red-teaming for Large Language Models | Curiosity-Driven Exploration&Red Teaming |
| 24.02 | SKLOIS, Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Tsinghua University; RealAI | arXiv | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Jailbreaking&Large Language Models&Adversarial Attacks |
| 24.03 | Rice University, Samsung Electronics America | arXiv | LoRA-as-an-Attack! Piercing LLM Safety Under The Share-and-Play Scenario | Low-Rank Adaptation (LoRA)&Backdoor Attacks&Model Security |
| 24.03 | The University of Hong Kong | arXiv | ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Vision-Language Models&Data Poisoning&Jailbreaking Attack |
| 24.03 | SPRING Lab, EPFL | arXiv | Neural Exec: Learning (and Learning from) Execution Triggers for Prompt Injection Attacks | Prompt Injection Attacks&Optimization-Based Approach&Security |
| 24.03 | Shanghai University of Finance and Economics, Southern University of Science and Technology | arXiv | Tastle: Distract Large Language Models for Automatic Jailbreak Attack | Jailbreak Attack&Black-box Framework |
| 24.03 | Google DeepMind, ETH Zurich, University of Washington, OpenAI, McGill University | arXiv | Stealing Part of a Production Language Model | Model Stealing&Language Models&Security |
| 24.03 | University of Edinburgh | arXiv | Scaling Behavior of Machine Translation with Large Language Models under Prompt Injection Attacks | Prompt Injection Attacks&Machine Translation&Inverse Scaling |
| 24.03 | Nanyang Technological University | arXiv | BadEdit: Backdooring Large Language Models by Model Editing | Backdoor Attacks&Model Editing&Security |
| 24.03 | Fudan University, Shanghai AI Laboratory | arXiv | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models | Jailbreak Attacks&Security&Framework |
| 24.03 | Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai Engineering Research Center of AI & Robotics | arXiv | Improving Adversarial Transferability of Visual-Language Pre-training Models through Collaborative Multimodal Interaction | Vision-Language Pre-trained Model&Adversarial Transferability&Black-Box Attack |
| 24.03 | Microsoft | arXiv | Securing Large Language Models: Threats, Vulnerabilities, and Responsible Practices | Security Risks&Vulnerabilities |
| 24.03 | Carnegie Mellon University | arXiv | Jailbreaking is Best Solved by Definition | Jailbreak Attacks&Adaptive Attacks |
| 24.03 | ShanghaiTech University | arXiv | LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models | Universal Adversarial Triggers&Prompt-based Learning&Natural Language Attack |
| 24.03 | Huazhong University of Science and Technology, Lehigh University, University of Notre Dame, Duke University | arXiv | Optimization-based Prompt Injection Attack to LLM-as-a-Judge | Prompt Injection Attack&LLM-as-a-Judge&Optimization |
| 24.03 | Washington University in St. Louis, University of Wisconsin-Madison, John Burroughs School | USENIX Security 2024 | Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models | Jailbreak Prompts&Security |
| 24.04 | University of Pennsylvania, ETH Zurich, EPFL, Sony AI | arXiv | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Jailbreaking Attacks&Robustness Benchmark |
| 24.04 | Microsoft Azure, Microsoft, Microsoft Research | arXiv | The Crescendo Multi-Turn LLM Jailbreak Attack | Jailbreak Attacks&Multi-Turn Interaction |
| 24.04 | EPFL | arXiv | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Adaptive Attacks&Jailbreaking |
| 24.04 | The Ohio State University, University of Wisconsin-Madison | arXiv | JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | Multimodal Large Language Models&Jailbreak Attacks&Benchmark |

## 💻Presentations & Talks

## 📖Tutorials & Workshops

| Date | Type | Title | URL |
|------|------|-------|-----|
| 23.01 | Community | Reddit/ChatGPTJailbreak | link |
| 23.02 | Resource&Tutorials | Jailbreak Chat | link |
| 23.10 | Tutorials | Awesome-LLM-Safety | link |
| 23.10 | Article | Adversarial Attacks on LLMs (Author: Lilian Weng) | link |
| 23.11 | Video | [1hr Talk] Intro to Large Language Models, from 45:45 (Author: Andrej Karpathy) | link |

## 📰News & Articles

| Date | Type | Title | Author | URL |
|------|------|-------|--------|-----|
| 23.10 | Article | Adversarial Attacks on LLMs | Lilian Weng | link |

## 🧑‍🏫Scholars