Analyzing Pedagogical Quality and Efficiency of LLM Responses with TA Feedback to Live Student Questions
The rapid enrollment growth in computing, coupled with the increasing integration of online learning, makes technology for learning at scale all the more pertinent. While Large Language Models (LLMs) have emerged as a promising avenue for automated student question-answering, ensuring consistently effective instruction in responses (relevance, factuality, and style) remains a key challenge. To develop better LLM educational assistants, we therefore need a fine-grained analysis of the pedagogical qualities of human instructor answers and of where state-of-the-art (SOTA) automated LLM-powered pipelines fall short.
In this work, we build EdBot, a Retrieval-Augmented Generation (RAG) pipeline based on GPT-4 that answers student questions in a course's online discussion forum. We assess the pedagogical effectiveness of EdBot's forum responses through expert Teaching Assistant (TA) evaluation, and we go one step further by having TAs edit and improve each response. We then analyze both the LLM responses and the TA edits in depth to identify the essential characteristics of a high-quality pedagogical response. Our evaluation yields three key insights: (1) EdBot can give relevant and factual answers in an educational style for both content and assignment questions; (2) most TA edits are deletions made to improve the pedagogical style of a response rather than to fix its factuality or relevance; and (3) EdBot improves TA efficiency by reducing the effort required to respond to student questions in large-scale courses.
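The abstract does not specify EdBot's implementation, but the following minimal sketch illustrates the general shape of a GPT-4-based RAG pipeline for forum question answering. The client setup, embedding model, retriever, prompt wording, and all function names here are illustrative assumptions, not the authors' actual system.

```python
# Minimal RAG sketch for forum question answering (illustrative, not EdBot's
# actual implementation). Assumes an OpenAI-style client and an
# embedding-based retriever over course documents.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts):
    """Embed a batch of texts (embedding model name is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def retrieve(question, docs, doc_vecs, k=3):
    """Return the k course documents most similar (by cosine) to the question."""
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]


def answer(question, docs, doc_vecs):
    """Answer a student question grounded in the retrieved course material."""
    context = "\n---\n".join(retrieve(question, docs, doc_vecs))
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a teaching assistant. Answer using only the "
                        "provided course material; be concise and pedagogically "
                        "helpful."},
            {"role": "user",
             "content": f"Course material:\n{context}\n\nStudent question: {question}"},
        ],
    )
    return resp.choices[0].message.content


# Usage sketch: docs would hold course notes and assignment handouts.
# doc_vecs = embed(docs)
# print(answer("What does problem 2 ask us to prove?", docs, doc_vecs))
```

In a deployment like the one described, the generated answer would be posted as a draft for a TA to review and edit rather than sent directly to students.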