Article

Compromising Honesty and Harmlessness in Language Models via Deception Attacks

L. Vaugrante, F. Carlon, M. Menke, and T. Hagendorff.
(2025)

Metadata

BibTeX key: vaugrante2025compromisinghonestyharmlessnesslanguage
entry type: article
year: 2025
eprint: 2502.08301
archiveprefix: arXiv
primaryclass: cs.CL
url: https://arxiv.org/abs/2502.08301

Tags

iris
