Cut AI deck-review noise 72% in a five-day sprint without hiding real findings
In a five-day sprint, I built an internal PowerPoint QA app that combined deterministic brand-rule checks with gated AI judgment, cutting a noisy first pass from 110 findings to 31 while preserving recall on the annotated deck.
Client: Technical evaluation brief for an internal brand-QA workflow
Timeline: Five-day solo sprint, May 14-18, 2026, with a hard Monday 5pm ET handoff
Stack: FastAPI, React, SQLite, Python PowerPoint parsing, LibreOffice rendering, Claude Sonnet 4.6
Challenge
The hard part was not finding more brand violations. It was making a reviewer trust the ones that survived. Consulting and corporate decks carry strict rules for fonts, colors, table treatment, confidentiality footers, title conventions, and terminology, plus softer judgment calls like whether a headline says anything useful. Manual review misses violations across a 24- to 50-slide deck. Naive AI review has the opposite problem: it flags inherited theme styling, invents violations, and leaves the reviewer sorting guesses instead of checking evidence.
The brief had a tight evaluation constraint: ship something a technical reviewer could open, test, and inspect by Monday at 5pm ET. This could not be a demo tuned to one file. It had to show where code should be trusted, where AI should be allowed in, and how every finding could be verified against the slide itself.
Approach
I treated AI as the judgment layer, not the inspection engine. Anything the brand book could express as a rule became deterministic: typography overrides, title length and punctuation, table styling, banned terms, draft markers, confidentiality footers, bullet structure, and legend consistency. AI was reserved for three checks where code would be brittle: headline quality, parallel structure, and cross-slide naming consistency.
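A minimal sketch of what two of those deterministic rules might look like, assuming a word limit for titles and a banned-terms list. The names `Finding`, `check_title`, and `check_banned_terms`, the threshold, and the terms are all hypothetical; the actual brand book values are not part of this write-up.

```python
from dataclasses import dataclass

# Hypothetical threshold and terms, for illustration only.
MAX_TITLE_WORDS = 12
BANNED_TERMS = ("synergy", "best-in-class")


@dataclass(frozen=True)
class Finding:
    slide: int
    rule: str
    evidence: str


def check_title(slide: int, title: str) -> list[Finding]:
    """Deterministic title checks: word count and trailing period."""
    findings = []
    if len(title.split()) > MAX_TITLE_WORDS:
        findings.append(Finding(slide, "title-length", title))
    if title.rstrip().endswith("."):
        findings.append(Finding(slide, "title-punctuation", title))
    return findings


def check_banned_terms(slide: int, text: str) -> list[Finding]:
    """Flag each banned term that appears in the text, case-insensitively."""
    lowered = text.lower()
    return [Finding(slide, "banned-term", term) for term in BANNED_TERMS if term in lowered]
```

Rules like these return the same result on every run and carry their own evidence, which is why they need no model call.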
The key decision was refusing to let model confidence stand in for proof. Every AI finding had to pass gates before it reached the reviewer: the quoted text had to exist verbatim in the parsed slide data, and the structural claim had to match the extracted structure. I also rejected a “drop mode” review pass after experiments showed it hid real issues. Detection stayed additive and auditable. A later review call could summarize, group themes, and assign P0-P3 priority, but it could not delete a finding.
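The quote gate can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: the finding shape (`slide`, `quote`), the `parsed` mapping of slide number to text blocks, and the function names are hypothetical, and collapsing whitespace before matching is my own relaxation of "verbatim" to tolerate line wraps.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace so line wraps in parsed text do not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def passes_quote_gate(quoted: str, slide_texts: list[str]) -> bool:
    """Return True only if the model's quoted text exists in the parsed slide text."""
    needle = normalize(quoted)
    if not needle:
        return False  # an empty quote proves nothing
    return any(needle in normalize(t) for t in slide_texts)


def gate_findings(ai_findings: list[dict], parsed: dict[int, list[str]]) -> list[dict]:
    """Keep AI findings whose quote is verifiable on the slide they cite."""
    return [
        f for f in ai_findings
        if passes_quote_gate(f["quote"], parsed.get(f["slide"], []))
    ]
```

A finding that cites text the parser never saw, or a slide that does not exist, is dropped before the reviewer sees it, regardless of how confident the model sounded.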
The largest trust fix was treating inherited PowerPoint theme styling as correct. If a font, color, or size was None, that usually meant the slide was following the master theme, not breaking the brand. Flagging only explicit overrides removed a class of false positives that would have made the reviewer stop believing the tool.
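A minimal sketch of the override-only check, assuming styling has already been extracted into a simple record where `None` means "inherited from the theme". `RunStyle`, `BRAND`, and the brand values are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunStyle:
    """Hypothetical extracted styling for one text run; None means inherited."""
    font: Optional[str] = None
    size_pt: Optional[float] = None
    color_hex: Optional[str] = None


# Hypothetical brand values, for illustration only.
BRAND = RunStyle(font="Arial", size_pt=14.0, color_hex="1A1A1A")


def explicit_override_violations(style: RunStyle) -> list[str]:
    """Report only attributes that are explicitly set and differ from the brand value."""
    violations = []
    for attr in ("font", "size_pt", "color_hex"):
        actual = getattr(style, attr)
        if actual is None:
            continue  # inherited from the master theme: treated as correct
        if actual != getattr(BRAND, attr):
            violations.append(attr)
    return violations
```

A run with no explicit styling yields no findings, so a deck that simply follows its master never generates typography noise.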
Outcome
By the handoff deadline, the system had cut the noisy first pass from 110 findings to 31 while still catching 20 of 22 detectable reviewer-noted issues on the annotated sample deck. That is 91% recall against reviewer speaker notes and 67% precision against those notes after the trust pass. The remaining flagged items were largely aligned with written brand rules the reviewer had skipped, so the gap became a useful product signal: the tool was optimized for defensible guideline enforcement, not for matching one person’s annotations.
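The headline numbers follow directly from the counts above. The short check below recomputes the noise reduction and recall; the function names are illustrative, and precision is left out because the write-up does not state the counts behind the 67% figure.

```python
def reduction(before: int, after: int) -> float:
    """Fraction of findings removed between two passes."""
    return (before - after) / before


def recall(caught: int, detectable: int) -> float:
    """Share of reviewer-noted, detectable issues that the tool caught."""
    return caught / detectable


print(f"noise reduction: {reduction(110, 31):.0%}")  # 79 of 110 removed
print(f"recall: {recall(20, 22):.0%}")               # 20 of 22 caught
```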
Two decks, 24 and 33 slides, ran end to end in production on May 18. The reviewer could log in, upload a .pptx, wait for the background pipeline, start with an executive overview, then inspect numbered pins on real slide renders. Each finding carried evidence, expected behavior, slide location, element reference, and guideline reference. If the AI key failed, deterministic checks still returned. If the review layer failed, the finding list still worked.
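The fallback behavior can be sketched as a pipeline in which each later stage is optional. This is an assumption about the shape of the code, not the real implementation: `run_pipeline` and its three callables are hypothetical stand-ins for the deterministic checks, the gated AI checks, and the review layer.

```python
from typing import Callable


def run_pipeline(
    deterministic: Callable[[], list[str]],
    ai_checks: Callable[[], list[str]],
    review: Callable[[list[str]], dict],
) -> dict:
    """Run the three stages so a failure in a later stage keeps earlier output."""
    findings = list(deterministic())  # always runs; no model dependency
    warnings = []

    try:
        findings += ai_checks()
    except Exception as exc:  # e.g. a missing or rejected API key
        warnings.append(f"ai_checks skipped: {exc}")

    overview = None
    try:
        # The review stage gets a copy, so it cannot remove findings.
        overview = review(list(findings))
    except Exception as exc:
        warnings.append(f"review skipped: {exc}")

    return {"findings": findings, "overview": overview, "warnings": warnings}
```

Passing the review stage a copy keeps the finding list additive: the overview can summarize and prioritize, but the list the reviewer inspects is never shortened by it.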
Quote
“The visual review was exactly the right direction. The Grammarly-style dots made the findings intuitive, and the app stayed clean, simple, and straightforward instead of trying to look fancy.”
— Technical evaluator, drafted for approval
Lesson
Machine-correct is not the same as reviewer-useful. One guideline color was off by only two hex values on a slide. Mathematically, that is a violation. To a human eye, it is noise. If a checker flags that kind of microscopic mismatch across a deck, it creates dozens of “correct” findings that teach the reviewer to ignore the system. The lesson I carried forward is that brand QA has to encode human tolerance, not just exact rules. Precision is not only about whether a finding is technically true. It is about whether the finding helps a human make a better decision.
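One way to encode that tolerance is a per-channel threshold on color differences. The sketch below is an assumption about how such a check could look: the function names and the default `max_channel_delta` of 4 are hypothetical, and it reads "off by two hex values" as a difference of two units in an RGB channel.

```python
def hex_to_rgb(value: str) -> tuple[int, int, int]:
    """Parse 'RRGGBB' or '#RRGGBB' into integer channels."""
    value = value.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {value!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def within_tolerance(actual: str, expected: str, max_channel_delta: int = 4) -> bool:
    """True when every RGB channel differs by at most max_channel_delta."""
    return all(
        abs(a - e) <= max_channel_delta
        for a, e in zip(hex_to_rgb(actual), hex_to_rgb(expected))
    )
```

With this in place, `within_tolerance("#1A1A1C", "1A1A1A")` passes while a clearly different color still fails. A per-channel RGB delta is a crude proxy for what the eye sees; a perceptual color-difference metric would be a stricter choice if the threshold needed tuning.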