Data scientists often need to read and understand messy, undocumented code that relies on large software libraries. What makes data science experts more effective than novices at this task? To understand expert practices, we conducted a think-aloud study in which 4 novice and 5 expert data scientists reasoned about an unfamiliar data analysis script of realistic complexity that used the Python pandas library. Surprisingly, familiarity with the pandas package was of relatively minor importance for experts. Instead, experts consistently performed three practices that novices did not: experts examined the data in detail rather than fixating on surface-level code features; experts consistently verified their assumptions about how the data was transformed; and experts navigated lengthy program outputs in a goal-directed way. Using these findings, we provide a practical set of guidelines for data science pedagogy and for future tools to support data science learners.