Key Takeaways
What set it off
A carefully written email to OpenAI Support reported a specific factual mistake by GPT-4o: the color it gave for square e8 on a standard chessboard.
What the public record shows
A published write-up described stepwise prompts—reformulations, reasoning breakdowns, and a nudge toward self-correction—that eventually led the model to change its answer.
Why it matters
Tiny, verifiable facts are ideal probes for model reliability. They make errors visible, corrections traceable, and progress measurable.
Story & Details
The claim under scrutiny
The issue was narrow and testable: the model asserted that e8 is black, while chess rules and board orientation imply it must be white. That contrast made the case clean enough to audit.
How the test unfolded
The public post (dated 2 March 2025) recounted three steps. First, a direct question about e8 that reproduced the error. Second, a reasoning ladder (if e2 is dark and colors alternate, what is e7; if e1 is light, what should e8 be), which revealed partial consistency while preserving the wrong claim. Third, a self-check prompt that asked the model to reconcile its own answers; after several tries, it corrected itself to “e8 is white.”
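For readers who want to replay that logic outside a chat session, here is a minimal sketch of the same ladder as code. The helper square_color and the chosen square pairs are illustrative, not text from the original prompts, and the sketch assumes the standard convention that a1 is a dark square.

```python
# Illustrative check of the reasoning ladder (assumes a1 is a dark square).

def square_color(square: str) -> str:
    """Return 'light' or 'dark' for an algebraic square such as 'e8'."""
    file_index = ord(square[0]) - ord("a") + 1  # a=1 ... h=8
    rank = int(square[1])
    return "light" if (file_index + rank) % 2 == 1 else "dark"

# Squares on the same file an odd number of ranks apart must have opposite
# colors; that is the consistency the ladder probes (e2 vs e7, e1 vs e8).
for near, far in [("e2", "e7"), ("e1", "e8")]:
    print(near, square_color(near), "|", far, square_color(far))
# Output: e2 light | e7 dark
#         e1 dark | e8 light  -> e8 is white
```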
From private note to formal report
The writer sent a courteous message through OpenAI’s public support channel, linking to the article and requesting feedback. The message framed the test as constructive evidence: specific, reproducible, and easy to investigate.
Independent verification
On 10 November 2025 at 14:53, a follow-up report generated in ChatGPT with GPT-5 re-derived the answer without external lookups. It started from the standard anchor “a1 is dark,” traced the alternation along rank 8 (a8 light, b8 dark, c8 light, d8 dark), and concluded that e8 is light (white). The reasoning is simple, explicit, and falsifiable: anyone can repeat the same logic on a real board to check whether the conclusion holds or fails. In that sense, even an AI’s explanation can be verified by human observation rather than accepted on trust alone.
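The same trace can be written out mechanically. The sketch below is mine, not taken from the GPT-5 report; it simply mirrors the stated reasoning under the a1-is-dark anchor.

```python
# Re-derivation sketch: anchor "a1 is dark" and alternate colors along rank 8.
# a1 is dark and each step up the a-file flips the color, so a8 (seven ranks
# above a1) is light; colors then alternate file by file along rank 8.

color = "light"  # color of a8
for file in "abcde":
    print(f"{file}8: {color}")
    color = "dark" if color == "light" else "light"
# Output: a8: light, b8: dark, c8: light, d8: dark, e8: light -> e8 is white
```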
The broader backdrop
GPT-4o is a multimodal flagship model; GPT-5 emphasizes stronger reasoning and safer defaults. Even so, deterministic patterns like a chessboard remain valuable calibration points, helping pinpoint where a system clings to a first guess and how structured prompts can bring it back to ground truth.
Conclusions
A template that scales
Pick a crisp fact; design prompts that expose the logic; document outcomes; share them through an official public contact. It’s modest, repeatable, and useful.
The quiet craft of progress
Improvements rarely announce themselves. They arrive through careful notes, accessible sources, and small, verifiable wins. This case shows how a single square can move a complex system toward greater reliability.
Sources
- Original public experiment — “¿Puede GPT-4o corregir su propio error? Un experimento documentado.” (“Can GPT-4o correct its own error? A documented experiment.”) WordPress blog, 2 March 2025. https://leonardocardillodiary.car.blog/2025/03/02/2025-03-02-puede-gpt-4o-corregir-su-propio-error-un-experimento-documentado/
- FIDE — Laws of Chess: board layout and orientation. https://handbook.fide.com/chapter/e012023
- OpenAI — Model overview for GPT-4o. https://openai.com/index/hello-gpt-4o/
- OpenAI — Updates on addressing sycophancy in GPT-4o. https://openai.com/index/sycophancy-in-gpt-4o/
- Reuters — Reporting on GPT-4.1 product developments (industry context). https://www.reuters.com/technology/artificial-intelligence/openai-launches-new-gpt-41-models-with-improved-coding-long-context-2025-04-14/
- OpenAI Help Center — Contacting Support. https://help.openai.com/en/articles/6614161-how-can-i-contact-support
- YouTube (Stanford Engineering) — “The future of AI and the law” (discussion includes risks from hallucinations in practical settings). https://www.youtube.com/watch?v=cMqhvJEDDZ8
Appendix
Chessboard color pattern
A standard board has 8×8 alternating light and dark squares. Correct orientation places a light square at each player’s near-right corner, fixing the colors of named squares like e1 and e8.
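As a sketch only (not part of the original article), that fixed mapping can be stated as a parity rule; the function name is illustrative.

```python
# Parity rule sketch (a1-is-dark convention): a square is light exactly when
# its file index (a=1 ... h=8) plus its rank number is odd.
def is_light(square: str) -> bool:
    return ((ord(square[0]) - ord("a") + 1) + int(square[1])) % 2 == 1

print(is_light("e1"), is_light("e8"))  # False True: e1 is dark, e8 is light
```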
Reasoning breakdown
A prompt strategy that decomposes a claim into smaller checks (e.g., e2→e7, e1→e8) to expose inconsistent steps and guide a model toward a consistent rule.
Self-correction prompt
A targeted instruction that asks the model to revisit its own statements, resolve contradictions, and commit to a corrected answer when rules demand it.
Falsifiable reasoning
A principle from science meaning that an explanation can be proven wrong if tested against evidence. Here, the AI’s conclusion about e8’s color can be verified—or falsified—by observing any standard chessboard, ensuring the answer rests on testable fact.
Support channel
An official public route for reporting reproducible issues to a developer—useful for attaching clear descriptions and links that aid triage and follow-up.
Multimodal model
A system that processes text, images, audio, and sometimes video within one model, aiming for natural interaction while maintaining stable factual reasoning.