Key Takeaways
What set it off
A carefully written email to OpenAI Support reported a specific factual mistake by GPT-4o: the color it gave for square e8 on a standard chessboard.
What the public record shows
A published write-up described stepwise prompts—reformulations, reasoning breakdowns, and a nudge toward self-correction—that eventually led the model to change its answer.
Why it matters
Tiny, verifiable facts are ideal probes for model reliability. They make errors visible, corrections traceable, and progress measurable.
Story & Details
The claim under scrutiny
The issue was narrow and testable: the model asserted that e8 is black, while chess rules and board orientation imply it must be white. That contrast made the case clean enough to audit.
How the test unfolded
The public post (dated 2 March 2025) recounted three steps. First, a direct question about e8 that reproduced the error. Second, a reasoning ladder (if e2 is dark and colors alternate, what is e7; if e1 is light, what should e8 be), which revealed partial consistency while preserving the wrong claim. Third, a self-check prompt that asked the model to reconcile its own answers; after several tries, it corrected itself to “e8 is white.”
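For readers who want to replay that logic outside a chat session, here is a minimal sketch of the same ladder as code. The helper square_color and the chosen square pairs are illustrative, not text from the original prompts, and the sketch assumes the standard convention that a1 is a dark square.

```python
# Illustrative check of the reasoning ladder (assumes a1 is a dark square).

def square_color(square: str) -> str:
    """Return 'light' or 'dark' for an algebraic square such as 'e8'."""
    file_index = ord(square[0]) - ord("a") + 1  # a=1 ... h=8
    rank = int(square[1])
    return "light" if (file_index + rank) % 2 == 1 else "dark"

# Squares on the same file an odd number of ranks apart must have opposite
# colors; that is the consistency the ladder probes (e2 vs e7, e1 vs e8).
for near, far in [("e2", "e7"), ("e1", "e8")]:
    print(near, square_color(near), "|", far, square_color(far))
# Output: e2 light | e7 dark
#         e1 dark | e8 light  -> e8 is white
```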
From private note to formal report
The writer sent a courteous message through OpenAI’s public support channel, linking to the article and requesting feedback. The message framed the test as constructive evidence: specific, reproducible, and easy to investigate.
Independent verification
On 10 November 2025 at 14:53, a follow-up report generated in ChatGPT with GPT-5 re-derived the answer without external lookups. It started from the standard anchor “a1 is dark,” traced the alternation along rank 8 (a8 light, b8 dark, c8 light, d8 dark), and concluded that e8 is light (white). The reasoning is simple, explicit, and falsifiable: anyone can repeat the same logic on a real board to check whether the conclusion holds or fails. In that sense, even an AI’s explanation can be verified by human observation rather than accepted on trust alone.
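The same trace can be written out mechanically. The sketch below is mine, not taken from the GPT-5 report; it simply mirrors the stated reasoning under the a1-is-dark anchor.

```python
# Re-derivation sketch: anchor "a1 is dark" and alternate colors along rank 8.
# a1 is dark and each step up the a-file flips the color, so a8 (seven ranks
# above a1) is light; colors then alternate file by file along rank 8.

color = "light"  # color of a8
for file in "abcde":
    print(f"{file}8: {color}")
    color = "dark" if color == "light" else "light"
# Output: a8: light, b8: dark, c8: light, d8: dark, e8: light -> e8 is white
```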
The broader backdrop
GPT-4o is a multimodal flagship model; GPT-5 emphasizes stronger reasoning and safer defaults. Even so, deterministic patterns like a chessboard remain valuable calibration points, helping pinpoint where a system clings to a first guess and how structured prompts can bring it back to ground truth.
Conclusions
A template that scales
Pick a crisp fact; design prompts that expose the logic; document outcomes; share them through an official public contact. It’s modest, repeatable, and useful.
The quiet craft of progress
Improvements rarely announce themselves. They arrive through careful notes, accessible sources, and small, verifiable wins. This case shows how a single square can move a complex system toward greater reliability.
Sources
- Original public experiment — “¿Puede GPT-4o corregir su propio error? Un experimento documentado.” (“Can GPT-4o correct its own error? A documented experiment.”) WordPress blog, 2 March 2025. https://leonardocardillodiary.car.blog/2025/03/02/2025-03-02-puede-gpt-4o-corregir-su-propio-error-un-experimento-documentado/
- FIDE — Laws of Chess: board layout and orientation. https://handbook.fide.com/chapter/e012023
- OpenAI — Model overview for GPT-4o. https://openai.com/index/hello-gpt-4o/
- OpenAI — Updates on addressing sycophancy in GPT-4o. https://openai.com/index/sycophancy-in-gpt-4o/
- Reuters — Reporting on GPT-4.1 product developments (industry context). https://www.reuters.com/technology/artificial-intelligence/openai-launches-new-gpt-41-models-with-improved-coding-long-context-2025-04-14/
- OpenAI Help Center — Contacting Support. https://help.openai.com/en/articles/6614161-how-can-i-contact-support
- YouTube (Stanford Engineering) — “The future of AI and the law” (discussion includes risks from hallucinations in practical settings). https://www.youtube.com/watch?v=cMqhvJEDDZ8
Appendix
Chessboard color pattern
A standard board has 8×8 alternating light and dark squares. Correct orientation places a light square at each player’s near-right corner, fixing the colors of named squares like e1 and e8.
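As a sketch only (not part of the original article), that fixed mapping can be stated as a parity rule; the function name is illustrative.

```python
# Parity rule sketch (a1-is-dark convention): a square is light exactly when
# its file index (a=1 ... h=8) plus its rank number is odd.
def is_light(square: str) -> bool:
    return ((ord(square[0]) - ord("a") + 1) + int(square[1])) % 2 == 1

print(is_light("e1"), is_light("e8"))  # False True: e1 is dark, e8 is light
```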
Reasoning breakdown
A prompt strategy that decomposes a claim into smaller checks (e.g., e2→e7, e1→e8) to expose inconsistent steps and guide a model toward a consistent rule.
Self-correction prompt
A targeted instruction that asks the model to revisit its own statements, resolve contradictions, and commit to a corrected answer when rules demand it.
Falsifiable reasoning
A principle from science meaning that an explanation can be proven wrong if tested against evidence. Here, the AI’s conclusion about e8’s color can be verified—or falsified—by observing any standard chessboard, ensuring the answer rests on testable fact.
Support channel
An official public route for reporting reproducible issues to a developer—useful for attaching clear descriptions and links that aid triage and follow-up.
Multimodal model
A system that processes text, images, audio, and sometimes video within one model, aiming for natural interaction while maintaining stable factual reasoning.