Design and Evaluation of an AI-Assisted Grading Tool for Introductory Programming Assignments: An Experience Report
In a typical introductory programming course, grading student-submitted programs involves an autograder that compiles and runs each program and tests its functionality against predefined test cases, paying no attention to the source code itself. However, in an educational setting, grading based on inspection of the source code is required for two main reasons: (1) awarding partial marks to ‘partially correct’ code that fails the test-case check, and (2) awarding marks (or penalties) based on source code quality or specific criteria the instructor may have laid out in the problem statement (e.g., ‘implement sorting using bubble sort’). Grading based on studying the source code, however, can be highly time-consuming when the course has a large enrollment. In this paper, we present the design and evaluation of an AI assistant for source code grading, which we have named TA Buddy. TA Buddy is powered by Code Llama, a large language model trained especially for code-related tasks, which we fine-tuned on a dataset of graded programs. Given a problem statement, student code submissions, and a grading rubric, TA Buddy can be asked to generate suggested grades, i.e., ratings for the various rubric criteria, for each submission. The human teaching assistant (TA) can then accept or overrule these grades. We evaluated TA Buddy-assisted manual grading against ‘pure’ manual grading and found that grading time was reduced by 24% while grade agreement between the two approaches remained at 90%.
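To make the workflow concrete, the following is a minimal sketch of how a fine-tuned Code Llama model could be prompted with a problem statement, rubric, and student submission to produce suggested rubric ratings. It is an illustration only, not the paper's implementation: the model checkpoint, prompt format, rating scale, and function names are all assumptions.

# Minimal sketch (hypothetical, for illustration) of a TA Buddy-style
# grading call. Assumes the Hugging Face transformers library and an
# assumed Code Llama checkpoint; the paper's actual fine-tuned model,
# prompt template, and output format are not specified here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-Instruct-hf"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def suggest_grades(problem: str, rubric: list[str], code: str) -> str:
    """Build a grading prompt and return the model's raw rubric ratings."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    prompt = (
        "You are a teaching assistant grading a student program.\n"
        f"Problem statement:\n{problem}\n\n"
        f"Rubric criteria:\n{criteria}\n\n"
        f"Student submission:\n{code}\n\n"
        "For each criterion, output a rating (0-5) and a one-line reason."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, i.e. the suggested ratings.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

In a deployment along the lines described above, the returned ratings would be shown to the human TA as suggestions, to be accepted or overruled per criterion.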