Banal Agent Comparison Post Number Infinity
Each year I try to make sure I catch the Conversations with Tyler retrospective, in which Tyler reflects on the year’s episodes and related content. I like hearing Tyler’s own self-assessment: what stood out to him, and what he enjoyed or didn’t. This year also included a review of top picks made in 2015.
Typically I listen to the episode, note the items that stand out, and put them on a list to work through in a generally casual manner.
I was about to do the same this year, but figured I’d ask Claude Code to make duck soup of the task, since Tyler provides annotated episode transcripts. Then I figured: why not ask a suite of agents to give it their best one-shot and see where we end up?
The project is up on GitHub. Four AI coding agents were given the prompt below:
I’d like to make a static web page I can host on github that focuses on Tyler Cowen’s 2025 year in review podcast. The full transcript is here: https://conversationswithtyler.com/episodes/conversations-with-tyler-2025-retrospective/
I’d like to focus on his episodes that he praised or endorsed the most, with his comments or reason why.
I’d also like to highlight references to books or articles of note - again, only items that get some meaningful endorsement and that endorsement “why” should be clear.
I’d like this to be visually scannable as a page, with clear organization of topics (podcasts vs books vs other endorsements) and visually rich but sleek in overall design.
The Results
Here are their one-shot solutions (one page per agent):
Claude Code (Opus 4.5) · GPT Codex (5.2 High) · Gemini 3 Auto · Grok 4.1 Thinking
Why compare?
I was curious what out-of-the-box aesthetics would come from the prompt above, but this “basic summarization” task is also somewhat nuanced. There’s a fair bit of fast-moving opinion to sort through: differentiating statistical popularity (listen count) from personal favorites, comparisons to years past, not to mention the different media types discussed (interviews, books, music, film).
I think a human tasked with the above would infer all of that and do a good job of the summary.
So I’d call this an easy task that’s easy to do poorly.
Of course, at this point in the AI story it’d be pretty shocking to see a truly bad job on a task like this. The agents all did okay in my opinion, but Claude Code was the only agent that really broke the content out into specific media categories, and it even split books into fiction and non-fiction commentary. Claude’s page was too dark and had poor contrast, though, while Codex (OpenAI) had the most “original” style, which is not to say the style I liked the most. Gemini’s summary had what I’d call the most functionally concise clarity in its episode cards: guest, topic, and quote were very easy to skim.
One common flop: none of the models bothered to hyperlink their specific highlights out to the destination episodes, even though the transcript they were given includes explicit episode links for all the content referenced. This is something I’d say a “good human” would definitely do for their own summary, but so it goes.
In any case, this was a fun point-in-time exercise, albeit a somewhat banal comparison.