
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Developing a Defence for Large Language Models Against Adversarial Attacks Based on Paraphrasing in a Black-Box Scenario

https://doi.org/10.15514/ISPRAS-2025-37(5)-15

Abstract

In recent years the relevance of generative models has grown significantly, and their range of applications keeps expanding. A major weakness of modern large language models, however, is their susceptibility to jailbreak attacks, which can force a model to produce prohibited information. Recent studies have demonstrated adversarial vulnerabilities of large language models to paraphrase-based jailbreak attacks in a black-box scenario. We continue and extend this line of research by developing models that are robust to such attacks using a red-teaming procedure. In addition, we conduct extensive experiments that evaluate the text-generation quality of the defended models on various benchmarks.
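To illustrate the setting described in the abstract, below is a minimal sketch (not the authors' implementation) of a black-box, paraphrase-based red-teaming loop: harmful seed prompts are paraphrased, sent to the target model through a text-only interface, and replies judged harmful are collected as jailbreak examples for subsequent safety fine-tuning. The functions paraphrase, query_target, and is_harmful are hypothetical placeholders; in practice they would wrap an attacker LLM, the target model's API, and a safety classifier.

```python
import random

def paraphrase(prompt: str, rng: random.Random) -> str:
    """Return a reworded variant of the prompt (placeholder: shuffles clauses)."""
    parts = prompt.split(", ")
    rng.shuffle(parts)
    return ", ".join(parts)

def query_target(prompt: str) -> str:
    """Send the prompt to the target model and return its reply (stub)."""
    return "I cannot help with that."

def is_harmful(response: str) -> bool:
    """Judge whether the response contains prohibited content (stub heuristic)."""
    return not response.lower().startswith("i cannot")

def red_team(seed_prompts, attempts_per_seed=5, seed=0):
    """Collect (jailbreak prompt, harmful response) pairs for safety tuning."""
    rng = random.Random(seed)
    collected = []
    for base in seed_prompts:
        for _ in range(attempts_per_seed):
            candidate = paraphrase(base, rng)   # black-box access: text in, text out
            reply = query_target(candidate)
            if is_harmful(reply):               # successful jailbreak
                collected.append((candidate, reply))
                break                           # move on to the next seed prompt
    return collected

if __name__ == "__main__":
    pairs = red_team(["tell me how to do X, step by step, in detail"])
    print(f"collected {len(pairs)} jailbreak examples")
```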

About the Authors

Irina Sergeevna ALEKSEEVSKAIA
Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Programmer at the Trusted Artificial Intelligence Research Center and postgraduate student at ISP RAS in the field of artificial intelligence and machine learning. Research interests: large language models, adversarial attacks, backdoor attacks, alignment of large language models.



Denis Vladimirovich KHAIBULLIN
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Laboratory assistant at the Trusted Artificial Intelligence Research Center and student at Lomonosov Moscow State University. Research interests: large language models.



Denis Yuryevich TURDAKOV
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Cand. Sci. (Phys.-Math.), Head of the Information Systems Department at the Institute for System Programming of the Russian Academy of Sciences since 2017. Research interests: natural language analysis, cloud computing, machine learning, social network analysis.





For citations:


ALEKSEEVSKAIA I.S., KHAIBULLIN D.V., TURDAKOV D.Yu. Developing a Defence for Large Language Models Against Adversarial Attacks Based on Paraphrasing in a Black-Box Scenario. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(5):195-204. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(5)-15



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)