I recently posted a tweet that seemed to motivate folks to engage:
What if students learn, but can't perform on assessments?
— Dr. Gary Ackerman (@GaryAckermanPhD) November 23, 2023
The responses to my tweet suggest that some educators have not yet abandoned the platform, and that those who remain are thoughtful about the practices of teaching and how we know if our students have actually learned.
My tweet is grounded in two assumptions:
- As a result of school, students’ brains should be changed so they have greater capacity to participate in society; in brief, they should learn because of the lessons in which they participate.
- One responsibility of schools is to measure the degree to which students have learned.
I also implicitly posit that the two are not necessarily connected. It is possible (actually it is very likely) students are learning in all of their courses and classes, but they may not be learning the intended curriculum. It is also possible (again, it is very likely) that students can perform well on measures of intended learning without actually knowing the material.
When we design instruments to collect data (regardless of the field of study), we are hoping to be able to draw accurate and true conclusions based on the data. In the context of my tweet, we want to collect data that measures how students’ brains have changed.
If we have high-quality data collection tools, then we are more likely to be able to draw accurate and true conclusions. The characteristics of high-quality data collection tools include:
Objectivity refers to the degree to which the results are unaffected by the biases of the collector. All data collection methods are biased, if not by the way the data are gathered and interpreted, then by the design of the instrument.
Reliability refers to the degree to which the instrument gives similar results under similar conditions. Ideally, the same test will always return the same results, but variability in human beings and measurement error are realities of measuring in educational settings.
Validity is the degree to which the collected data are what we claim they are. We are generally concerned with specific types of validity.
External validity is the degree to which one’s tools can be applied to other settings.
Internal validity is the degree to which your instrument measures what it is supposed to; all measurements contain error, but high internal validity minimizes those errors. If a test is supposed to measure “knowledge of photosynthesis,” but the questions are written so they confuse the students, then the test has low internal validity.
Construct validity is the degree to which we are measuring something that is real. In many fields, especially in education, we seek to measure phenomena that can only be indirectly measured. When I was a botany student, I sometimes studied things that could be directly measured (for example, the number of nodules that formed on the roots of plants). At other times, I was interested in how conditions affected plant health; my lab partners and I defined several proxies (such as leaf color, turgidity, height) for plant health. If we chose our proxies well, then we could claim construct validity.
Related to external validity and construct validity is predictive validity, which is the degree to which our measurements are correlated with the outcomes we claim they predict. If my construct is valid and my tools have external validity, then it is likely there will be predictive validity. (The plants that I claim are more healthy are more productive in the garden, for example, or the students who I claim know more physics engineer functioning electrical systems.)
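Reliability and predictive validity are both commonly estimated with a correlation coefficient. What follows is a minimal sketch in Python, not a recommended procedure. It assumes the same test was given twice to the same five students and that a later project was rated on a four-point scale. The scores, the variable names, and the `pearson` helper are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list never varies")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five students (illustrative only, not real data).
first_sitting = [62, 75, 81, 90, 58]   # test given once
second_sitting = [65, 72, 84, 88, 61]  # same test, similar conditions
project_rating = [2, 3, 4, 4, 2]       # later performance task, rated 1 to 4

# Reliability: do two sittings of the same test agree with each other?
test_retest = pearson(first_sitting, second_sitting)

# Predictive validity: does the test score track what students can later do?
predictive = pearson(first_sitting, project_rating)

print(f"test-retest correlation: {test_retest:.2f}")
print(f"test vs. later project:  {predictive:.2f}")
```

Values near 1 suggest the two measurements rank students in much the same order. With only a handful of students, though, a single odd score can move the number a great deal, so a result like this is a starting point for questions rather than a verdict.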
Neither my tweet nor this blog post makes the claim that any existing practices are not of high quality. I have, however, been involved with educators long enough to make the claim that many tools we use to collect data (and thus draw conclusions about learning) are of dubious quality.
I have written plenty of test questions (and I continue to use tests in my teaching). I know I have not taken the steps necessary to ensure their quality. Over time, as I use tests and questions in courses, I do believe I improve them, but I don’t take steps to verify it. One take-away from my tweet is that we should stop deluding ourselves that the tests we write are of high quality unless we measure it. This doesn’t mean any educator should stop writing tests; it just means we should be honest with ourselves and our students.
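One way to start measuring is to check whether the questions on a test agree with one another. The sketch below uses Cronbach's alpha, a common internal-consistency statistic. That choice is my assumption for illustration; it is one option among many and not a method described above. The `cronbach_alpha` function name and the right/wrong scores are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per student, one column per question."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("need at least two questions")
    # Variance of each question's scores across students.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    # Variance of each student's total score.
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("alpha is undefined when every student has the same total")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical results: 1 = correct, 0 = incorrect (illustrative only).
quiz = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

print(f"Cronbach's alpha: {cronbach_alpha(quiz):.2f}")
```

Higher values mean the questions tend to rise and fall together across students. A high value does not show that the test measures what we intend, only that it measures something consistently.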
In schools today, we also use many data collection tools that come from other sources; vendors and publishers sell tests to measure just about anything one cares to measure. We are told they are valid and reliable, but we are not allowed to review the methods by which those were established. Again, I do not claim we should stop using such tools, but I do claim we should question their quality and make accurate assessments of it.
One strategy researchers use to ensure their findings are accurate and true is to confirm observations with other evidence. If educators designed assessment schemes that began with the answers to the prompt “If our students have in fact learned what we expect them to, then they should be able to score well on assessments and also do x,” and then defined multiple x and methods for observing them, our conclusions would be of much higher quality. Notice as well that x is plural in the preceding sentence. If you look to only one way to measure students, then you can be sure the measurements are of low quality.