Evaluating GPT for Use in K-12 Block-Based CS Instruction Using a Transpiler and Prompt Engineering
Though the increased availability of Large Language Models (LLMs) presents significant potential to change the way students learn to program, the text-based nature of the available tools precludes block-based languages from much of that innovation. To remedy this, we identify the strengths and weaknesses of using a transpiler to leverage the existing training of commercially available LLMs for Scratch, a visual block-based programming language.
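To make the transpiler idea concrete, the sketch below converts a tiny, hypothetical line-based text syntax into a simplified version of the block dictionaries found in Scratch 3.0's project.json format. This is an illustration under our own assumptions, not the paper's actual transpiler: the textual syntax is invented, and the block records keep only an opcode, parent/next links, and one generic argument.

```python
# Minimal sketch of one direction of a text <-> Scratch transpiler.
# The line-based textual syntax is hypothetical; the opcodes follow
# Scratch 3.0's project.json naming, but block fields are simplified.
import json
import uuid

OPCODES = {
    "when green flag clicked": "event_whenflagclicked",
    "move":                    "motion_movesteps",
    "turn":                    "motion_turnright",
}

def transpile(text: str) -> dict:
    """Turn a tiny text script into a simplified Scratch block dictionary."""
    blocks, prev_id = {}, None
    for raw in text.strip().splitlines():
        line = raw.strip()
        for phrase, opcode in OPCODES.items():
            if line.startswith(phrase):
                arg = line[len(phrase):].strip()
                block_id = uuid.uuid4().hex[:8]
                blocks[block_id] = {
                    "opcode": opcode,
                    "parent": prev_id,
                    "next": None,
                    # Real project.json uses typed "inputs" (e.g. STEPS,
                    # DEGREES); one generic argument is kept for brevity.
                    "arg": arg or None,
                }
                if prev_id is not None:
                    blocks[prev_id]["next"] = block_id
                prev_id = block_id
                break
    return blocks

if __name__ == "__main__":
    script = """
    when green flag clicked
    move 10
    turn 15
    """
    print(json.dumps(transpile(script), indent=2))
```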
Using only prompt engineering, we evaluate an LLM's performance on two common classroom tasks in a Scratch curriculum: 1) creating project solutions that compile and satisfy project requirements, and 2) analyzing whether student projects complete those requirements.
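The following sketch shows what the two prompt-engineered tasks might look like in code. The prompt wording and the call_llm() placeholder are our own illustrative assumptions, not the study's actual prompts or model interface.

```python
# Hypothetical sketch of the two classroom tasks via prompt engineering.
GENERATION_PROMPT = """You are a Scratch tutor. Scratch projects are given
in a line-based text syntax (one block per line, e.g. "move 10").
Write a script that satisfies every requirement below.

Requirements:
{requirements}

Respond with only the script, one block per line."""

GRADING_PROMPT = """You are grading a Scratch project submitted as text
(one block per line). For each requirement, answer "met" or "not met".

Requirements:
{requirements}

Student script:
{script}"""

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a commercial LLM."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_solution(requirements: str) -> str:
    """Task 1: ask the LLM for a project solution in the text syntax."""
    return call_llm(GENERATION_PROMPT.format(requirements=requirements))

def grade_project(requirements: str, script: str) -> str:
    """Task 2: ask the LLM to check a student project against a rubric."""
    return call_llm(GRADING_PROMPT.format(requirements=requirements,
                                          script=script))
```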
In both cases, our results indicate that prompt engineering alone is insufficient to reliably produce high-quality results. For projects of medium complexity, the LLM-generated solutions consistently failed to follow correct syntax or, in the few instances with correct syntax, failed to produce correct solutions. When used for auto-grading, the scores assigned by the LLM correlated with those assigned by the autograder, but the discrepancies between the 'real' scores and the LLM's scores remained too great for the tool to be reliable in a classroom setting.
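The score comparison described above might be computed along the following lines; the score values here are made-up placeholders, not data from the study.

```python
# Illustrative comparison of autograder vs. LLM-assigned scores.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov  = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

autograder = [10, 7, 9, 4, 8]   # placeholder "real" scores
llm_scores = [ 8, 6, 9, 6, 5]   # placeholder LLM-assigned scores

r = pearson(autograder, llm_scores)
mad = mean(abs(a - b) for a, b in zip(autograder, llm_scores))
print(f"correlation r = {r:.2f}, mean absolute discrepancy = {mad:.1f}")
```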