I Asked Claude to Grade Its Recent Performance

The AI chatbot was refreshingly honest about its jaggedness.
Image Source: Google Gemini

As I mentioned last month, I'm all in on Claude. Over the past five weeks, I've been using its Pro plan anywhere from three to four hours per day in a wide variety of capacities. In no particular order, that list includes:

  1. Making minor and intermediate website tweaks here and on Racket Publishing.
  2. Adding new products for premium Racket Publishing members.
  3. Building private custom skills. (I'm considering offering this as a service in the near future.)
  4. Evaluating different products and services to save money in 2026. (Adiós, AT&T.)
  5. Managing the fantasy basketball team I took over after a longtime friend's recent death. (I miss you, Mike.)
  6. Creating and modifying data in other applications via the Model Context Protocol (MCP).
  7. Interpreting long PDFs for my new car and health insurance plans.
  8. Analyzing a contract for a potential ghostwriting gig.
  9. Helping me up my Mac Terminal game. (This neat text-based Claude search tool supports booleans, unlike native Claude. Yeah, I'm a geek.)
  10. Rewriting others' interesting but poorly written articles, text messages, and emails. (I'll keep their names to myself.)
  11. Interpreting different websites' UI clusterfucks and indecipherable server errors. (Cue Office Space reference.)
  12. Creating artifacts for future posts, including some cool data visualizations.
  13. Brainstorming titles and excerpts of blog post drafts.
  14. Helping me produce 1099s for my contractors.
  15. Fact-checking.

At times, Claude has amazed me with its insightful observations and recommendations. Because AI confirmation bias terrifies me, my instructions explicitly tell it to contradict me when I'm even a little off, let alone completely wrong. All of my Claude projects include the following line:

I care about accuracy more than speed.

Put differently, I don't politely request skepticism from Claude; I demand it.

Limitations

As expected, Claude is hardly pitching a perfect game. It has been downright jagged, to borrow a word from Ethan Mollick's latest Substack. Among its most glaring gaffes, it has:

  1. Advised me to ignore a WordPress email. Following that advice would have cost me my publishing site's domain name. (I immediately called bullshit and averted the problem.)
  2. Failed to understand how I wanted it to ask me questions when diagnosing website issues.
  3. Written a script that deleted all the content on this site via the Ghost API, although I had explicitly warned Claude of that very possibility. (Fortunately, I had backed up all my site's data. The downtime was minimal.)
  4. Arbitrarily decided to change the contents of skill download files despite 20 successful efforts with the earlier format.
  5. Told me that I could do plenty of things that I simply couldn't—even when I provided sufficient context and background information.
  6. Routinely ignored my instructions on custom skills. For example, it has used all caps and the passive voice in its responses. My repeated reminders to avoid such crap haven't consistently corrected its behavior.

Yeah, it's a decidedly mixed bag.

Claude's Self-Evaluation

I was curious about how Claude viewed its own recent performance. Would it cop to its mistakes or pooh-pooh them? Against that backdrop, I asked it for an honest self-assessment. Here it is:


Simon Says

Props to Claude for its forthright self-assessment. Still, its error rate and the severity of some of those errors continue to baffle me. Here's my best analogy: AI is the driver of golf clubs.

As any weekend duffer knows, hitting your driver well helps immensely. When you mishit that finicky club, though, expect to spend five minutes looking for your ball in the woods or worse. You're facing a double bogey, if not a snowman.

Feedback

What say you?
