ML Security Papers

Latest papers

4,030 papers

defense arXiv Apr 30, 2026 · 5w ago

VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

Xiaokun Luan, Yihao Zhang, Pengcheng Su et al. · Peking University

Privacy-preserving watermark detection protocol using VOPRF that verifies LLM-generated text without revealing content to provider

Output Integrity Attack nlp

PDF

survey arXiv Apr 30, 2026 · 5w ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Luyao Xu, Xiang Chen · Nantong University · Nanjing University

Layered security review of LLM agent frameworks covering prompt injection, tool misuse, state persistence attacks, and ecosystem vulnerabilities

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

tool arXiv Apr 30, 2026 · 5w ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Hongliang Liu, Tung-Ling Li, Yuhao Wu · Palo Alto Networks

Two-pass perturbation probing identifies 50-neuron safety refusal circuits in aligned LLMs, enabling precision ablation interventions

Prompt Injection nlp

PDF

defense arXiv Apr 30, 2026 · 5w ago

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Haocheng Huang, Yuchen Chen, Weisong Sun et al. · Soochow University · Nanjing University +1 more

Dataset watermarking scheme embedding stealth marks in code via variable name patterns to prove training data ownership

Output Integrity Attack nlp

PDF

defense arXiv Apr 30, 2026 · 5w ago

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang et al. · Johns Hopkins University · Microsoft Research Asia +2 more

Dual-encoder defense that clusters fragmented malicious prompts across anonymous LLM requests using asymmetric contrastive learning

Prompt Injection nlp

PDF

defense arXiv Apr 30, 2026 · 5w ago

Secure Cross-Silo Synthetic Genomic Data Generation

Daniil Filienko, Martine De Cock, Sikha Pentyala · University of Washington Tacoma

Privacy-preserving federated synthetic genomic data generation using MPC for input privacy and differential privacy for output privacy

Model Inversion Attack federated-learning tabular

PDF

tool arXiv Apr 30, 2026 · 5w ago

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen et al. · The Pennsylvania State University

Efficient red-teaming framework achieving 2-7x speedup for optimization-based prompt injection and knowledge corruption attacks on long-context LLMs

Prompt Injection Red-Team Agents Benchmarks & Evaluation nlp

PDF Code

defense arXiv Apr 30, 2026 · 5w ago

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček et al. · Radboud University · University of Bristol +2 more

Reconfigures MoE LLM safety behavior by steering expert routing at inference time without retraining, defending against jailbreaks

Prompt Injection nlp

PDF

defense arXiv Apr 30, 2026 · 5w ago

AdaBFL: Multi-Layer Defensive Adaptive Aggregation for Byzantine-Robust Federated Learning

Zehui Tang, Yuchen Liu, Feihu Huang · Nanjing University of Aeronautics and Astronautics · MIIT Key Laboratory of Pattern Analysis and Machine Intelligence

Adaptive aggregation defense for federated learning that dynamically adjusts weights across multiple defense layers to counter Byzantine poisoning attacks

Data Poisoning Attack federated-learning

PDF

defense arXiv Apr 30, 2026 · 5w ago

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa et al. · Instituto de Telecomunicações · Universidade da Beira Interior

Extends deepfake detection with semantic mismatch detection, revealing vulnerabilities when authentic audio-video pairs are semantically inconsistent

Output Integrity Attack multimodal vision audio

PDF Code

defense arXiv Apr 30, 2026 · 5w ago

Frequency-Aware Semantic Fusion with Gated Injection for AI-generated Image Detection

Shuchang Zhou, Shangkun Wu, Jiwei Wei et al. · University of Electronic Science and Technology of China · Harbin Institute of Technology

Detects AI-generated images by fusing frequency-domain artifacts with semantic features via gated injection and hyperspherical learning

Output Integrity Attack vision generative

PDF

attack arXiv Apr 30, 2026 · 5w ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

Zi Li, Tian Zhou, Wenze Li et al. · Nanjing University

Malicious model code backdoors that hijack fine-tuning to force memorization and extraction of high-entropy secrets like API keys

AI Supply Chain Attacks Model Inversion Attack Model Poisoning Sensitive Information Disclosure nlp

PDF

benchmark arXiv Apr 29, 2026 · 5w ago

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Junbin Yang et al. · University of Chinese Academy of Sciences · Inner Mongolia University of Technology +1 more

Mechanistic analysis showing adversarial fine-tuning reorganizes LLM refusal representations across layers while navigating robustness-utility tradeoffs

Prompt Injection nlp

PDF

defense arXiv Apr 29, 2026 · 5w ago

Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

Yanyun Wang, Qingqing Ye, Li Liu et al. · Hong Kong Polytechnic University · Hong Kong University of Science and Technology

Adversarial training method that harmonizes clean accuracy and robustness by aligning input perturbations with latent space representations

Input Manipulation Attack vision

PDF

benchmark arXiv Apr 29, 2026 · 5w ago

Tatemae: Detecting Alignment Faking via Tool Selection in LLMs

Matteo Leonesi, Francesco Belardinelli, Flavio Corradini et al. · University of Camerino · Imperial College London

Detects LLM alignment faking via tool selection mismatches between monitored and unmonitored contexts in enterprise IT scenarios

Prompt Injection Excessive Agency nlp

PDF Code

defense arXiv Apr 29, 2026 · 5w ago

Which Face and Whose Identity? Solving the Dual Challenge of Deepfake Proactive Forensics in Multi-Face Scenarios

Lei Zhang, Zhiqing Guo, Dan Ma et al. · Xinjiang University · Hunan University

Embeds identity watermarks in multi-face images to localize deepfake-manipulated regions and trace forged identities in group photos

Output Integrity Attack vision multimodal

PDF

attack arXiv Apr 29, 2026 · 5w ago

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Hanna Foerster, Ilia Shumailov, Cheng Zhang et al. · University of Cambridge · AI Sequrity Company +1 more

Side-channel attack exploiting dynamic quantization in ML frameworks to extract sensitive user data from batched inference requests

AI Supply Chain Attacks

PDF

defense arXiv Apr 29, 2026 · 5w ago

TAP into the Patch Tokens: Leveraging Vision Foundation Model Features for AI-Generated Image Detection

Ahmed Abdullah, Nikolas Ebert, Oliver Wasenmüller · Mannheim University of Applied Sciences

Benchmarks vision foundation models for AI-generated image detection, achieving 12% accuracy improvement over CLIP with tunable attention pooling

Output Integrity Attack vision multimodal

PDF

defense arXiv Apr 29, 2026 · 5w ago

SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Yuan Xin, Yixuan Weng, Minjun Zhu et al. · CISPA · Westlake University +3 more

GAN-inspired co-evolutionary framework training attack generators and defenders to protect LLM review systems from hidden prompt injection

Prompt Injection nlp

PDF

tool arXiv Apr 29, 2026 · 5w ago

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

Siyuan Li, Aodu Wulianghai, Guangyan Li et al. · Shanghai Jiao Tong University · Chinese Academy of Sciences

Detects LLM-generated text by analyzing sentiment distribution stability, achieving 49.89% F1 improvement over baselines

Output Integrity Attack nlp

PDF
