The Quiz and AI: Reconsidering what outcomes really matter

The emergence of generative AI has sparked ‘wicked problem’ questions among educators: What is a quiz for? Is it still useful? Measuring what students know through quizzing them suddenly feels inadequate when an AI system can answer factual (and deliberative) questions faster and more accurately than a human. However, beneath this dilemma sits something more interesting: an opportunity to rethink the pedagogical foundation upon which quiz-based assessment rests. To understand this properly, we need to start not with AI, but with learning objectives themselves.

The testing effect

Put simply, students who engage in retrieval practice, through methods such as low-stakes quizzes, learn more and retain information longer than those who merely reread the material. This is the testing effect: attempting to recall information from memory prompts effortful reconstruction, forging durable neural pathways (Brown et al., 2014). The ability to recall, construct, and apply knowledge quickly across new contexts is also key to educational transfer.

Thus, if well designed, quizzes do not just measure learning; they produce learning. The cognitive science here warrants attention. When you attempt to retrieve information from memory to answer a quiz question, you are doing something qualitatively different from passively absorbing information. Your brain must search, reconstruct, and verify. This struggle is the engine of lasting memory formation. Feedback matters, certainly, but the learning benefit comes primarily from the retrieval attempt itself, not just from the feedback that follows (Brown et al., 2014).

This finding has implications for quiz design. It suggests that the educational value of a quiz lies not in its ability to measure what students know, but in its power to elicit effortful cognitive engagement with material such as weekly readings or lecture notes. A quiz should be designed as a learning tool, not merely a measurement instrument.

When framed this way, Artificial Intelligence, in the form of Large Language Models, becomes a threat precisely because it eliminates the cognitive struggle. If a student uses AI to answer a quiz, they bypass the retrieval process entirely. They may complete the assessment, but they have not engaged the cognitive mechanism that drives learning. The quiz becomes a hollow ritual: completed, perhaps with excellent grades, but educationally barren.

Validity and AI

This raises an important question: What does a quiz score truly indicate when AI assistance is available? To assess its validity, we need to consider whether the quiz measures what we intend it to measure. If our goal is to determine whether a student can accurately answer factual questions, the answer is clearly no. In that case, we are measuring their ability to use an AI tool rather than their factual knowledge. If we are using the AI Assessment Scale, now common in many universities, we might need to set it to ‘Full AI’ to reflect this.

Imagine if we could find a way to prevent the use of AI during quizzes (for example, by making them supervised). We could argue that the quiz is intended to measure or promote ‘cognitive struggle’ (and catalyse the testing effect). However, in a world where AI is ubiquitous, and professionals, myself included, often use it as a fact-checking tool, evaluating student knowledge through quizzes without considering real-world applications may be misguided. This method assesses skills that do not accurately reflect how students will function in their personal and professional lives.

This shift in thinking is worth considering. The validity crisis created by AI is not just about cheating; it is a deeper conceptual issue (Dawson et al., 2024). AI pushes us to clarify what we value in learning and to design assessments that accurately measure those outcomes.

If we prioritise the retrieval of factual information, AI has rendered that learning objective largely obsolete, not due to cheating, but because this skill no longer aligns with professional practice (one can simply look it up). However, if we value skills such as analytical judgment, contextual reasoning, and the ability to evaluate and use AI-generated content critically, then our assessment design must evolve accordingly, including the use of quizzes.

This is not to say that quizzes are ineffective; they need to be better designed with their intended objectives in mind. Critics of quizzes raise valid points that should be acknowledged, but there is no easy fix: simply replacing them with another assessment type will not resolve the validity issue.

Formative versus summative: a critical distinction

Here a crucial distinction comes into play: the difference between formative and summative assessment.

  • Formative assessment is low-stakes. It is designed to provide feedback, identify learning gaps, and drive improvement. A formative quiz does not contribute significantly to a final grade; it exists to reveal what students do not yet understand. This diagnostic function remains powerful and intact, whether used early in a course or as an ongoing check, even with AI in the picture.
  • Summative assessment is high-stakes. It is used to certify competency, assign grades that carry real weight, and determine achievement of learning outcomes. When quiz scores contribute substantially to a final grade, and especially when they are used to make decisions about students, the validity and integrity of the measurement are paramount.

Summative online quizzes that lack supervision are essentially invalid as measures of student knowledge, especially given the prevalence of AI tools. This is because unsupervised contexts fundamentally break the assessment’s validity contract: quiz scores no longer reflect what students know but rather their ability to use AI systems to retrieve answers. Without the cognitive retrieval effort that underpins the testing effect (the mechanism by which quizzes produce durable learning), high-stakes scores become meaningless certifications.

This issue is particularly relevant in fully online, asynchronous courses that lack human oversight, such as common workplace compliance tests. Instead of measuring what students truly know, these assessments reflect what they can accomplish with AI.

The practical implication is that quizzes should always be formative, low-stakes learning tools, carefully designed to prompt retrieval (the testing effect) and provide feedback, rather than high-stakes summative instruments. This shift aligns with both research on the testing effect and the practical realities of AI-saturated assessment environments (Brown et al., 2014; Corbin et al., 2025; Dawson et al., 2024).

Deep processing

Another important principle from the learning sciences is that deeper processing leads to more durable learning. A related principle, transfer-appropriate processing, holds that learning is most effective when the cognitive effort required to answer a quiz or other question aligns with the cognitive demands of the skill we are trying to develop (Adesope et al., 2017).

For instance, a multiple-choice question such as “What is the capital of Latvia?” requires minimal cognitive effort. It is vulnerable to AI and educationally thin. A question that asks “Analyse the historical approaches to nation-building by three European countries in the 19th century and assess which strategy was most effective in shaping their modern identities and aligning with the prevailing political ideologies of the time” requires analysis, synthesis, and judgment. When converted into a quiz question, it is far more resistant to AI completion and supports deeper learning.

This is not new advice. Bloom’s taxonomy has been around for years; higher-order thinking questions are well-established best practice. However, AI makes the stakes clearer. Low-order questions, pure recall, and simple comprehension are now pedagogically vulnerable: they neither produce strong learning outcomes nor withstand AI assistance. The implication is that quiz design should deliberately target higher-order cognitive processes.

Educational psychologists have likewise long known a counterintuitive fact: conditions that slow learning in the short term often strengthen it in the long term. Spacing quizzes over time, varying the contexts in which students encounter material, and mixing problem types all create friction during learning. They make tasks more difficult (Brown et al., 2014). This design strategy serves dual purposes: it strengthens the testing effect while simultaneously making the assessment more resistant to AI shortcuts, particularly when questions are made highly specific to the material at hand.

Three principles for quiz redesign

From all of this, three principles emerge for thinking about quizzes in an AI-saturated educational environment.

  1. Prioritise formative use. Quizzes should serve as learning tools: frequent, low-stakes, and designed to prompt effortful retrieval practice and provide corrective feedback. This use remains educationally valuable. The testing effect depends on the retrieval effort, and a student who completes a low-stakes formative quiz without learning anything has primarily harmed themselves.
  2. Shift toward higher-order cognition. Quiz questions should target analytical, evaluative, and synthetic thinking. Pure recall questions are pedagogically weak and AI-vulnerable. Higher-order questions are both educationally stronger and more resistant to simple AI completion.
  3. Design for cognitive struggle. Deliberately structure quizzes to require effort: multi-part reasoning, integration across contexts, and a requirement that students show intermediate thinking before seeing answers. These design features strengthen learning while reducing the utility of quick AI shortcuts.

What about the facts?

There is a legitimate tension here that deserves acknowledgement. Not all learning objectives involve higher-order thinking. Sometimes students genuinely need to know facts. Medical students need to know anatomy. History students need to know dates and events. Chemistry students need to memorise the periodic table. The AI critique does not eliminate the need for foundational knowledge. It challenges whether quizzes are the right tool for assessing and supporting that knowledge in environments where instant information retrieval is ubiquitous.

For educators working within existing institutional constraints, the practical path may be clearer than the theoretical resolution of one of the many ‘wicked problems’ posed by AI in assessment (Corbin et al., 2025).

Conduct an audit of your current quizzes. Ask: What are these quizzes measuring? What cognitive processes do they require? Are they formative or summative? If they are summative and conducted online without supervision, can they still be considered valid measures of student knowledge in the presence of AI?

For formative quizzes, maintain frequency and low-stakes conditions. Ensure they prompt effortful retrieval and provide meaningful feedback. In this context, students who rely on AI to complete the quizzes harm their own learning, a problem that will surface when they encounter higher-order assessments later in the course.

For summative quizzes, consider two options: (a) move them to supervised, in-person environments where you can regulate access to tools, or (b) redesign them around higher-order cognitive demands that are difficult to complete, even with AI, without a genuine understanding of the material. Additionally, consider reformulating some summative assessments as authentic tasks: assignments that require students to apply their knowledge and skills in contexts that closely resemble real-world professional practice, incorporating appropriate use of tools.

Conclusion

In many ways, the AI disruption of quiz-based assessment is pulling us back toward educational first principles rather than forcing us into uncharted territory. We have long known that retrieval practice supports learning. We have known for decades that higher-order thinking matters more than recall. We have known that assessment validity requires alignment between what we claim to measure and what students need to know and do.

AI did not invent these truths; it simply made their violation harder to ignore. A quiz that relies on low-order factual retrieval was never particularly educationally valid, even before AI existed. The presence of AI makes that weakness impossible to overlook. The reassuring implication is this: educators who redesign quizzes around principles supported by learning research (frequent, formative use; higher-order cognitive demands; integration of knowledge into authentic contexts) will find that their assessments are simultaneously more educationally effective and more resistant to AI shortcuts.

References

Adesope, O. O., Trevisan, D. A., & Sundararajan, N. (2017). Rethinking the use of tests: A meta-analysis of practice testing. Review of Educational Research, 87(3), 659–701. https://doi.org/10.3102/0034654316689306

Brown, P. C., Roediger, H. L., III, & McDaniel, M. A. (2014). Make it stick: The science of successful learning. Belknap Press of Harvard University Press.

Corbin, T., Bearman, M., Boud, D., & Dawson, P. (2025). The wicked problem of AI and assessment. Assessment & Evaluation in Higher Education. Advance online publication. https://doi.org/10.1080/02602938.2025.2553340

Dawson, P., Bearman, M., Dollinger, M., & Boud, D. (2024). Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7), 1005–1016. https://doi.org/10.1080/02602938.2024.2386662

Gonsalves, C. (2023). On ChatGPT: What promise remains for multiple-choice assessment? Journal of Learning Development in Higher Education, 27. https://kclpure.kcl.ac.uk/portal/files/202623537/1009_On_ChatGPT_what_promise_remains_for_multiple_choice_assessment.pdf

Newton, P. M., & Xiromeriti, M. (2024). ChatGPT performance on multiple-choice question examinations in higher education: A pragmatic scoping review. Assessment & Evaluation in Higher Education, 49(7), 1070–1086. https://doi.org/10.1080/02602938.2023.2299059

Declaration of Generative AI Use in “The Quiz and AI: Reconsidering what outcomes really matter” using the AIAS Framework

This work was produced using AIAS Level “AI Collaboration”. Per the AI Assessment Scale (AIAS) framework for ethical academic writing, generative AI tools were employed to assist with drafting, refinement, critique, and clarity improvement across all sections (~1400 words total), incorporating learning science principles (testing effect, transfer-appropriate processing, deep processing), quiz redesign principles, and citations.

Specific AI Usage:
Tool: Perplexity AI (research version circa January 2026).
Purpose: (1) Refined argumentation on testing effect validity in AI-saturated quiz environments; (2) Improved clarity/structure of deep processing and formative/summative distinctions; (3) Suggested alternative phrasings for pedagogical concepts; (4) Critiqued flow of three quiz redesign principles.

Process: AI generated ~30% initial phrasing/suggestions from iterative prompts; human author provided core theoretical framework, verified all citations/learning science claims, critically evaluated outputs (~70% substantive revisions), rejected misaligned suggestions, and ensured pedagogical accuracy.

Human Accountability: The author takes full responsibility for the integrity of the content, its originality, and all claims. All facts/theories were cross-checked against academic knowledge and cited sources. AI served only as a collaborative assistant; the final text reflects human judgment, critical synthesis, and excludes unverified outputs.
