Evaluating GPT for use in K-12 Block Based CS Instruction Using a Transpiler and Prompt Engineering
Though the increased availability of Large Language Models (LLMs) presents significant potential for changing the way students learn to program, the text-based nature of the available tools precludes block-based languages from much of that innovation. In an attempt to remedy this, we identify the strengths and weaknesses of using a transpiler to leverage the existing training of commercially available LLMs with Scratch, a visual block-based programming language.
We evaluate an LLM’s performance on two common classroom tasks in a Scratch curriculum using only prompt engineering. We evaluate the LLM’s ability to: 1) create project solutions that compile and satisfy project requirements, and 2) analyze student projects’ completion of project requirements.
In both cases, we find that prompt engineering alone is insufficient to reliably produce high-quality results. For projects of medium complexity, the LLM-generated solutions consistently failed to follow correct syntax or, in the few instances with correct syntax, to produce correct solutions. When used for auto-grading, we found a correlation between scores assigned by the autograder and those generated by the LLM, but the discrepancies between the "real" scores and the scores assigned by the LLM remained too great for the tool to be reliable in a classroom setting.