
AI Coding Tools: A Review

Kash Sajadi
Nov 22nd, 2025

This is a comparison of several coding agents. We conducted it prior to the release of Google's Gemini 3 and hope to update it soon to include Gemini 3.

Metrics

In this review we have considered the following metrics:

  • Vendor published SWE metrics
  • Speed
  • Cost
  • Overall Developer Sentiment
  • Features (including team and enterprise features)

We wanted to be as objective as possible in this review. Given the nature of the work performed by coding agents and the wide variation caused by task type, prompt quality, and context quality, we acknowledge that this kind of comparison is difficult, and we ask readers to keep these challenges in mind when applying the results to their own usage.

Reviewed Agents

  • Anthropic Claude Code (Sonnet 4.5)
  • OpenAI Codex (GPT 5)
  • Google Gemini (2.5 Pro)

Vendor Published SWE Metrics

Vendor-published SWE metrics provide an objective baseline for comparing how well each coding agent performs on real-world software engineering tasks. These benchmarks—most notably SWE-bench—measure a model’s ability to understand GitHub issues, navigate multi-file codebases, and produce correct patches that pass automated tests. While these results help establish comparative performance, they should be interpreted with caution: vendors often use idealized conditions, and real-world performance may vary depending on codebase complexity, prompt quality, and context size. Nevertheless, they offer a useful high-level signal of each model’s practical problem-solving capability.

| Model | Vendor | Benchmark (Resolve Rate) |
|---|---|---|
| Claude 4.5 Sonnet | Anthropic | 70.60% |
| GPT-5 | OpenAI | 65.00% |
| Gemini 2.5 Pro | Google | 53.60% |

Jimenez, Carlos E., et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv preprint arXiv:2310.06770 (2024).
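
To make the benchmark concrete, the sketch below loads the publicly available SWE-bench Verified split (the subset vendors commonly report against) and prints one task instance. It assumes the Hugging Face `datasets` package and the `princeton-nlp/SWE-bench_Verified` dataset; the field names reflect the public release as we understand it, so treat this as an illustration rather than a full evaluation harness.

```python
# Peek at a SWE-bench task instance: a real GitHub issue paired with the
# gold patch and the tests that a proposed fix must make pass.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["instance_id"])               # e.g. "astropy__astropy-12907"
print(task["repo"])                      # the repository the issue comes from
print(task["problem_statement"][:300])   # the issue text the agent must resolve
```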

Speed

Speed matters because it directly affects how “flowy” a coding agent feels in day-to-day use. While raw accuracy is important, a tool that responds quickly helps maintain momentum, especially during debugging or exploratory coding. Here we look at two simple but meaningful indicators: how fast each model produces its first token, and how many tokens per second it can sustain. These numbers don’t always reflect real-world usage — network conditions, prompt size, and model load can all influence responsiveness — but they give a good sense of which agents feel snappy, and which ones can lag when handling heavier tasks.

| Model | First Token (ms) | Tokens per Second (tps) |
|---|---|---|
| Claude 4.5 Sonnet | 35 | 37.94 |
| GPT-5 | 42 | 55.29 |
| Gemini 2.5 Pro | 25 | 38 |

References:

  • Data Studios
  • Replicate
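
A rough way to combine these two numbers into something you can feel is: total response time ≈ first-token latency + generated tokens ÷ throughput. The sketch below uses made-up values purely for illustration, not measurements from the table above.

```python
def estimated_response_time(first_token_ms: float, tokens_per_second: float, output_tokens: int) -> float:
    """Rough end-to-end latency in seconds: time to first token plus steady-state generation."""
    return first_token_ms / 1000 + output_tokens / tokens_per_second

# Illustrative only: a 500-token answer at 50 tps with a 500 ms first-token delay.
print(f"{estimated_response_time(500, 50, 500):.1f}s")  # 10.5s
```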

Cost

Cost is often where the practical differences between coding agents really show up. While performance and features matter, the price you pay per million tokens can significantly impact how comfortably you can use a model day-to-day—especially for larger repos, long debugging threads, or extended pair-programming sessions. Here we look at each model’s published input and output rates to give a sense of ongoing usage costs. These numbers don’t always reflect promotional tiers, enterprise discounts, or bundled plans, but they offer a straightforward way to compare how far your budget goes with each tool.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 4.5 Sonnet | $3 | $15 |
| GPT-5 | $1.25 | $10 |
| Gemini 2.5 Pro | $1.25 to $2.50 | $10 to $15 |
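
To see what these rates mean in practice, here is a small sketch that estimates the cost of a single agent session from the per-million-token prices above. The token counts are invented for illustration, and the Gemini figure uses the lower end of its published range.

```python
# Published per-million-token rates (USD) from the table above.
RATES = {
    "Claude 4.5 Sonnet": {"input": 3.00, "output": 15.00},
    "GPT-5": {"input": 1.25, "output": 10.00},
    "Gemini 2.5 Pro": {"input": 1.25, "output": 10.00},  # lower end of the published range
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one session from per-million-token rates."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000

# Hypothetical session: 400k tokens of repo context in, 60k tokens of patches and explanations out.
for model in RATES:
    print(f"{model}: ${session_cost(model, 400_000, 60_000):.2f}")
# Claude 4.5 Sonnet: $2.10, GPT-5: $1.10, Gemini 2.5 Pro (low end): $1.10
```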

Overall Developer Sentiment

Overall developer sentiment reflects how each coding agent feels to work with in practice—not just how it performs in benchmarks. These impressions come from day-to-day usage across Reddit, Hacker News, X, and developer communities, where people talk openly about what actually helps or frustrates them.

In this section, we summarize how developers perceive each tool across four areas: the general developer experience, how well the model understands real codebases, its accuracy in solving problems, and the richness of its features. Codex is consistently seen as highly reliable and technically precise, especially in reviews and refactoring. Claude Code tends to be the most pleasant and “human-like” to interact with, but can be inconsistent on large or complex repos. Gemini receives praise for its whole-repo comprehension and integrations, though users frequently report unstable execution and occasional looping.

These sentiment ratings aren’t scientific scores, but they capture the collective vibe of thousands of real-world interactions—how dependable, polished, and enjoyable each agent feels when you're actually trying to get work done.

| Tool | DX (Developer Experience) | Codebase Understanding | Accuracy of Problem Solving | Feature-fullness |
|---|---|---|---|---|
| OpenAI Codex | 3.5 / 5 – Feels mechanical but reliable; strong for structured debugging | 4 / 5 – Excellent long-context reasoning | 4.5 / 5 – Very high precision, especially for bug-fixing and reviews | 4 / 5 – Solid API and tool integration, fewer UX niceties |
| Claude Code | 4.5 / 5 – Very pleasant, human-like, responsive DX | 3.5 / 5 – Good repo awareness, weaker on large multi-file context | 3.5 / 5 – Fast but occasionally inconsistent or over-complicating | 4.5 / 5 – Rich interface, good CLI ops and pair-programming features |
| Google Gemini (Code Assist) | 3 / 5 – CLI rough, planning strong but execution buggy | 5 / 5 – Outstanding whole-repo comprehension and summarization | 3 / 5 – Erratic performance; sometimes loops or fails | 4.5 / 5 – Huge context window, integrations, free preview, advanced models |
| Category | Best tool (per developer sentiment) | Why |
|---|---|---|
| Code reviews | OpenAI Codex | Developers report that Codex's PR-reviewer and bug-finding capabilities are among the most accurate |
| Codebase semantic search | Google Gemini | Thanks to its 1M-token context window and integrated search, Gemini offers the most powerful full-repo comprehension |
| Security reviews | Google Gemini | Gemini Code Assist (Standard/Enterprise) includes enterprise-grade security features such as private VPC service controls and granular IAM |
| New code generation | Claude Code | Reviewers note that Claude produces the highest-quality new code with richer features |
| Refactoring | OpenAI Codex | Users favour Codex for large-scale refactoring because it produces more consistently correct results, even if slower |
| Pricing | Google Gemini | Gemini Code Assist offers a generous free tier and lower per-hour pricing for individuals, whereas Codex and Claude require paid subscriptions |

Features

While most of what distinguishes these coding agents is their output quality, a great deal of productivity gain can be attributed to their features. In this sense, Claude Code was a pioneer among CLI-based coding agents, decoupling the agent from code editors (like Cursor) and IDEs (like JetBrains' agent). Today all major coding agents work more or less the same way: a CLI-based tool with support for sub-agents, shortcuts, and MCP integrations. However, one area where there are still differences between them is the "enterprise" feature set, such as team and access management.

In this sense, all coding agents still lack sophisticated access control and rules support, but Google Gemini is in a better position because it can leverage the Google Workspace and Google Cloud account controls already in place for its coding agent. Both Claude and Codex support company accounts, but not much more than that.

As for agent guidelines and rules, Claude is slightly ahead of the others, with support for project-level and personal rule files (CLAUDE.md).
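
For readers who have not used these rule files, a project-level CLAUDE.md is just a Markdown file at the repository root (with an optional personal one under ~/.claude/) that the agent reads at the start of a session. The contents below are invented purely as an illustration of the kind of guidance teams put in it.

```markdown
# Project guidelines for Claude Code

## Build and test
- Install dependencies with `make setup`; run `make test` before proposing a patch.

## Conventions
- Match the existing code style; do not introduce new formatting tools.
- Never commit directly to main; open a branch and summarise the change in the PR description.
```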

Both Codex and Google Gemini have cloud-persisted sessions, which lets the user pick up a session on any machine and, in the case of Codex, even hand a session over from the web to the CLI. Claude still stores sessions locally, so they can only be resumed on the same machine.

Summary

Across all dimensions—accuracy, speed, cost, developer experience, and feature set—each coding agent brings something meaningfully different to the table. Codex stands out for its precision, reliability, and strength in structured tasks like reviews, refactoring, and debugging. Claude Code delivers the best overall developer experience, with a natural, collaborative feel and strong pair-programming features, even if its performance can dip on very large repos. Gemini excels in whole-repo comprehension and enterprise-grade integrations, though its execution quality can be inconsistent.

While vendor benchmarks and pricing tables provide objective comparisons, real-world performance still depends heavily on project complexity, prompt quality, and personal workflow. The choice of agent ultimately comes down to what matters most for your team—raw accuracy, human-like collaboration, powerful integrations, or predictable cost.

Overall, the landscape is maturing quickly, and developers today benefit from a diverse ecosystem of highly capable coding agents. The best results often come from matching the right tool to the right task, rather than assuming one agent is universally superior.

