This study investigates how targeted attacks can compromise the reliability and applications of large language models (LLMs) in educational assessment, highlighting security vulnerabilities that are frequently underestimated in current AI-supported learning environments. As LLMs and other AI tools are increasingly being integrated into grading, providing feedback, and supporting the evaluation workflow, educators are adopting them for their potential to increase efficiency and scalability. However, this rapid adoption also introduces new risks. An unexplored threat is prompt injection, whereby a student acting as an attacker embeds malicious instructions within seemingly regular assignment submissions to influence the model’s behaviour and obtain a more favourable evaluation. To the best of our knowledge, this is the first systematic comparative study to investigate the vulnerability of popular LLMs within a real-world educational context. We analyse a significant representative scenario involving prompt injection in exam assessment to highlight how easily such manipulations can bypass the teacher’s oversight and distort results, thereby disrupting the entire evaluation process. By modelling the structure and behavioural patterns of LLMs under attack, we aim to clarify the underlying mechanisms and expose their limitations when used in educational settings.
When AI Is Fooled: Hidden Risks in LLM-Assisted Grading
Milani A.
;
2025-01-01
Abstract
This study investigates how targeted attacks can compromise the reliability and applications of large language models (LLMs) in educational assessment, highlighting security vulnerabilities that are frequently underestimated in current AI-supported learning environments. As LLMs and other AI tools are increasingly being integrated into grading, providing feedback, and supporting the evaluation workflow, educators are adopting them for their potential to increase efficiency and scalability. However, this rapid adoption also introduces new risks. An unexplored threat is prompt injection, whereby a student acting as an attacker embeds malicious instructions within seemingly regular assignment submissions to influence the model’s behaviour and obtain a more favourable evaluation. To the best of our knowledge, this is the first systematic comparative study to investigate the vulnerability of popular LLMs within a real-world educational context. We analyse a significant representative scenario involving prompt injection in exam assessment to highlight how easily such manipulations can bypass the teacher’s oversight and distort results, thereby disrupting the entire evaluation process. By modelling the structure and behavioural patterns of LLMs under attack, we aim to clarify the underlying mechanisms and expose their limitations when used in educational settings.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


