I compared GPT-4 and Claude for writing code and the gap was bigger than I thought

I spent last weekend doing a little test. I gave both models the same 5 coding tasks from my work projects. Things like refactoring a messy function and writing a SQL query from scratch. GPT-4 got 4 out of 5 right on the first try. Claude needed 2 or 3 prompts for each to get working code that didn't have bugs. The biggest difference was in memory handling. GPT caught edge cases I didn't even mention in my prompt. Has anyone else run similar tests or am I just bad at prompting Claude?

2 comments

sagea88 28d ago

I saw a similar breakdown on a coding subreddit where someone tested them on debugging legacy JavaScript... GPT caught a subtle closure issue that Claude completely missed for three tries. Memory handling seems to be where the gap really shows, especially with complex state management stuff.

reese_bell 28d ago

That "closure issue" thing actually cuts the other way if you look at the whole thread. Claude missed it three times but once it got it, it explained the fix way better and caught two other related bugs in the same function. GPT just fixed the one thing and moved on. I think you're cherry-picking one example when Claude has way better consistency on large codebases with tangled dependencies. I've seen Claude nail a React state management refactor that had GPT scratching its head for five minutes straight. Both models have blind spots, but Claude's usually come up less often in real-world code.