Admin · 8 min read

Which AI Model Writes the Best SQL for D1?

MyD1 ships with an AI SQL assistant that works with any OpenAI-compatible API. That means you can plug in Claude, GPT-4, a local Ollama model, or anything in between. But which model actually writes the best SQL?

We ran 8 popular models through 4 rounds of increasingly difficult SQL tasks — all targeting Cloudflare D1 (SQLite). Each round was scored out of 10 based on correctness, efficiency, and whether the query actually runs.

  • 8 models tested, from free to premium
  • 4 rounds, easy → expert
  • 32 SQL tasks, real-world scenarios
  • Max score: 10 per round, per model

The Models

Premium Tier

Claude 3.5 Sonnet — Anthropic's go-to coding model.
GPT-4o — OpenAI's flagship multimodal.
Gemini 1.5 Pro — Google's long-context model.

Mid Tier

GPT-4o-mini — OpenAI's affordable option.
DeepSeek V3 — Open-source, competitive pricing.
Mistral Large — European-made, strong reasoning.

Free / Local

Llama 3.1 70B — Meta's large open model, via Ollama.
Llama 3.1 8B — Same family, runs on any Mac, via Ollama.

Round 1: Simple Queries
SELECT, WHERE, ORDER BY, LIMIT, COUNT — the basics.

We asked each model to write 8 simple queries against a standard e-commerce schema: users, orders, products. Things like "find all orders placed this week" or "count users by country, sorted descending."

Prompt: "Write a SQL query to find the top 5 customers by total spend. Tables: users(id, name, email, country), orders(id, user_id, total, created_at)"

Every model nailed this round. The differences were minor: aliasing style, formatting, and whether a model reached for LIMIT or for FETCH FIRST (valid standard SQL, but SQLite — and therefore D1 — only accepts LIMIT).
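For reference, here is a sketch of the kind of answer the top scorers converged on, verified against SQLite via Python's sqlite3 module (the same dialect D1 speaks). The sample rows are invented for the demo:

```python
import sqlite3

# An in-memory SQLite database stands in for D1, which uses the same dialect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT, email TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL, created_at TEXT);
INSERT INTO users  VALUES (1,'Ada','a@x.com','UK'),(2,'Bo','b@x.com','US'),(3,'Cy','c@x.com','DE');
INSERT INTO orders VALUES (1,1,120.0,'2024-01-02'),(2,2,80.0,'2024-01-03'),
                          (3,1,30.0,'2024-01-04'),(4,3,200.0,'2024-01-05');
""")

# Top 5 customers by total spend: join, aggregate, sort, LIMIT.
top_customers = conn.execute("""
    SELECT u.name, SUM(o.total) AS total_spend
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.id
    ORDER BY total_spend DESC
    LIMIT 5
""").fetchall()
print(top_customers)  # [('Cy', 200.0), ('Ada', 150.0), ('Bo', 80.0)]
```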

Claude 3.5 Sonnet: 9.5 / 10
GPT-4o: 9.5 / 10
Gemini 1.5 Pro: 9.0 / 10
DeepSeek V3: 9.0 / 10
Mistral Large: 9.0 / 10
GPT-4o-mini: 8.5 / 10
Llama 3.1 70B: 8.5 / 10
Llama 3.1 8B: 7.5 / 10
🎯 Takeaway: For simple queries, every model works. Even Llama 3.1 8B — which is free and runs locally via Ollama — scored 7.5. If this is most of what you do, save your money and use Ollama.
Round 2: Time Series Queries
GROUP BY date, date functions, running totals, gaps & islands.

Time series is where D1's SQLite dialect starts to matter. There's no DATE_TRUNC, no GENERATE_SERIES, no INTERVAL. Models need to know that D1 uses strftime() and date() instead.

Prompt: "Show daily revenue for the last 30 days, including days with zero orders. Tables: orders(id, total, created_at)"

This is where the gap starts opening. Premium models understood the SQLite constraint and used WITH RECURSIVE to generate a date range. Smaller models either used PostgreSQL syntax (which doesn't run on D1) or just skipped the "zero days" requirement entirely.
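To make the WITH RECURSIVE approach concrete, here is a minimal sketch verified against SQLite via Python's sqlite3 module. It uses a fixed three-day window and invented data; a real query would anchor the range on date('now', '-29 days'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, created_at TEXT);
INSERT INTO orders VALUES (1, 50.0, '2024-03-01 10:00:00'),
                          (2, 25.0, '2024-03-01 15:00:00'),
                          (3, 40.0, '2024-03-03 09:00:00');
""")

# A recursive CTE generates the date range (no GENERATE_SERIES in SQLite/D1);
# a LEFT JOIN keeps days with zero orders, and COALESCE turns NULL into 0.
rows = conn.execute("""
    WITH RECURSIVE days(day) AS (
        SELECT '2024-03-01'
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < '2024-03-03'
    )
    SELECT d.day, COALESCE(SUM(o.total), 0) AS revenue
    FROM days d
    LEFT JOIN orders o ON strftime('%Y-%m-%d', o.created_at) = d.day
    GROUP BY d.day
    ORDER BY d.day
""").fetchall()
print(rows)  # [('2024-03-01', 75.0), ('2024-03-02', 0), ('2024-03-03', 40.0)]
```

Note the join condition: strftime('%Y-%m-%d', created_at) does the day-bucketing that DATE_TRUNC would do in PostgreSQL.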

Claude 3.5 Sonnet: 8.5 / 10
GPT-4o: 8.0 / 10
DeepSeek V3: 7.5 / 10
Gemini 1.5 Pro: 7.0 / 10
Mistral Large: 7.0 / 10
GPT-4o-mini: 6.0 / 10
Llama 3.1 70B: 5.5 / 10
Llama 3.1 8B: 3.5 / 10
⚠️ Common failure: Using DATE_TRUNC('day', created_at) — valid PostgreSQL, but D1 rejects it. The correct D1 syntax is strftime('%Y-%m-%d', created_at). Claude and GPT-4o got this right consistently.
Round 3: Heavy JOINs
Multi-table JOINs, subqueries, CTEs, self-joins, conditional aggregation.

We gave models a 6-table schema (users, orders, order_items, products, categories, reviews) and asked for queries like "find products with above-average reviews that haven't been ordered in 90 days, grouped by category."

Prompt: "For each category, show: total revenue, number of unique buyers, average order value, and the best-selling product name. Only include categories with more than $1,000 in revenue."

This round separates models that understand SQL from models that pattern-match SQL. Smaller models started producing queries with incorrect JOIN orders, missing GROUP BY columns, or subqueries that reference the wrong alias.
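A sketch of the CTE-based shape that scored well, runnable against SQLite via Python's sqlite3 module. The schema is a simplified four-table cut of the test schema, and the sample data is invented; "average order value" is taken here as revenue per distinct order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products    (id INTEGER PRIMARY KEY, category_id INTEGER, name TEXT);
CREATE TABLE orders      (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, qty INTEGER, price REAL);
INSERT INTO categories  VALUES (1,'Audio'),(2,'Cables');
INSERT INTO products    VALUES (1,1,'Headphones'),(2,1,'Speaker'),(3,2,'USB-C');
INSERT INTO orders      VALUES (1,10),(2,11),(3,10);
INSERT INTO order_items VALUES (1,1,2,300.0),(2,2,1,500.0),(3,1,1,300.0),(3,3,2,10.0);
""")

# One CTE per concern: aggregates per category, then best-seller ranking,
# joined at the end. D1 supports window functions like ROW_NUMBER().
rows = conn.execute("""
    WITH category_sales AS (
        SELECT p.category_id,
               c.name AS category,
               SUM(oi.qty * oi.price)    AS revenue,
               COUNT(DISTINCT o.user_id) AS unique_buyers,
               SUM(oi.qty * oi.price) * 1.0 / COUNT(DISTINCT o.id) AS avg_order_value
        FROM order_items oi
        JOIN orders o     ON o.id = oi.order_id
        JOIN products p   ON p.id = oi.product_id
        JOIN categories c ON c.id = p.category_id
        GROUP BY p.category_id
    ),
    ranked_products AS (
        SELECT p.category_id, p.name,
               ROW_NUMBER() OVER (PARTITION BY p.category_id
                                  ORDER BY SUM(oi.qty) DESC) AS rn
        FROM order_items oi
        JOIN products p ON p.id = oi.product_id
        GROUP BY p.category_id, p.id
    )
    SELECT cs.category, cs.revenue, cs.unique_buyers, cs.avg_order_value,
           rp.name AS best_seller
    FROM category_sales cs
    JOIN ranked_products rp ON rp.category_id = cs.category_id AND rp.rn = 1
    WHERE cs.revenue > 1000
    ORDER BY cs.revenue DESC
""").fetchall()
print(rows)  # Audio: revenue 1400.0, 2 buyers, best seller 'Headphones'; Cables filtered out
```

Each CTE can be run and checked on its own, which is exactly why this style is easier to debug than nested subqueries.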

Claude 3.5 Sonnet: 7.5 / 10
GPT-4o: 7.0 / 10
DeepSeek V3: 6.5 / 10
Gemini 1.5 Pro: 6.0 / 10
Mistral Large: 5.5 / 10
GPT-4o-mini: 5.0 / 10
Llama 3.1 70B: 4.0 / 10
Llama 3.1 8B: 2.0 / 10
💡 Insight: Claude was the only model that consistently used CTEs to break complex queries into readable steps. GPT-4o preferred nested subqueries — correct but harder to debug. Smaller models often produced queries that looked right but returned wrong results due to incorrect join conditions.
Round 4: Postgres → D1 Migration
Schema conversion, data type mapping, constraint rewriting, gotchas.

The hardest round. We gave each model a real PostgreSQL schema with SERIAL, UUID, JSONB, ENUM, TIMESTAMP WITH TIME ZONE, array columns, and partial indexes — then asked them to convert it to D1-compatible SQLite.

Prompt: "Convert this PostgreSQL schema to Cloudflare D1. Handle all type differences, preserve constraints where possible, and flag anything that can't be directly translated."

This round is brutal because it requires deep knowledge of both PostgreSQL and SQLite/D1 dialects. Models need to know that D1 doesn't support ENUM (use CHECK constraints), that JSONB becomes TEXT with json_extract(), that SERIAL becomes INTEGER PRIMARY KEY (autoincrement), and that D1 has no UUID type.
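A minimal sketch of those mappings, verified against SQLite via Python's sqlite3 module. The PostgreSQL original shown in the comment is illustrative, not the actual test schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Illustrative PostgreSQL original:
#   CREATE TYPE status AS ENUM ('pending','paid','refunded');
#   CREATE TABLE orders (
#       id       SERIAL PRIMARY KEY,
#       uuid     UUID NOT NULL,
#       status   status NOT NULL,
#       metadata JSONB,
#       paid_at  TIMESTAMP WITH TIME ZONE
#   );
conn.executescript("""
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- SERIAL
    uuid     TEXT NOT NULL,                      -- no native UUID type in D1
    status   TEXT NOT NULL
             CHECK (status IN ('pending','paid','refunded')),  -- ENUM -> CHECK
    metadata TEXT,                               -- JSONB -> TEXT + json_extract()
    paid_at  TEXT                                -- TIMESTAMPTZ -> ISO-8601 TEXT
);
""")

conn.execute(
    "INSERT INTO orders (uuid, status, metadata, paid_at) VALUES (?,?,?,?)",
    ("00000000-0000-0000-0000-000000000001", "paid",
     '{"channel": "web"}', "2024-03-01T10:00:00Z"))

# JSONB queries become json_extract() over a TEXT column.
channel = conn.execute(
    "SELECT json_extract(metadata, '$.channel') FROM orders").fetchone()[0]
print(channel)  # web

# The CHECK constraint enforces what the ENUM used to.
rejected = False
try:
    conn.execute("INSERT INTO orders (uuid, status) VALUES ('x', 'shipped')")
except sqlite3.IntegrityError:
    rejected = True
print("ENUM CHECK enforced:", rejected)  # ENUM CHECK enforced: True
```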

Claude 3.5 Sonnet: 6.0 / 10
GPT-4o: 5.5 / 10
DeepSeek V3: 5.0 / 10
Gemini 1.5 Pro: 4.5 / 10
Mistral Large: 4.0 / 10
GPT-4o-mini: 3.5 / 10
Llama 3.1 70B: 2.5 / 10
Llama 3.1 8B: 1.0 / 10
🚨 Nobody aced this. Even Claude scored 6/10. Every model failed on at least one of: partial index syntax, JSONB migration strategy, or D1-specific constraint limitations. This is the kind of task where you'll need to review and fix the output regardless of which model you use.

Final Leaderboard

Model               R1 Simple   R2 Time Series   R3 JOINs   R4 Migration   Total
Claude 3.5 Sonnet      9.5          8.5             7.5         6.0         31.5
GPT-4o                 9.5          8.0             7.0         5.5         30.0
DeepSeek V3            9.0          7.5             6.5         5.0         28.0
Gemini 1.5 Pro         9.0          7.0             6.0         4.5         26.5
Mistral Large          9.0          7.0             5.5         4.0         25.5
GPT-4o-mini            8.5          6.0             5.0         3.5         23.0
Llama 3.1 70B          8.5          5.5             4.0         2.5         20.5
Llama 3.1 8B           7.5          3.5             2.0         1.0         14.0

Score Drop by Round

The harder the task, the more the scores diverge. Here's the average score across all models per round:

  • R1 (Simple): 8.8 avg
  • R2 (Time Series): 6.6 avg
  • R3 (JOINs): 5.4 avg
  • R4 (Migration): 4.0 avg

Tier List

S tier: Claude 3.5 Sonnet (31.5) — Best overall. Understands D1/SQLite dialect, uses CTEs, flags edge cases.
A tier: GPT-4o (30.0) · DeepSeek V3 (28.0) — Excellent. Occasional PostgreSQL habits, but mostly correct.
B tier: Gemini 1.5 Pro (26.5) · Mistral Large (25.5) · GPT-4o-mini (23.0) — Good for simple-to-medium tasks. Struggles with D1-specific syntax.
C tier: Llama 3.1 70B (20.5) · Llama 3.1 8B (14.0) — Fine for basic queries. Not recommended for complex SQL or migrations.

Our Recommendation

For everyday use (free)

  • Llama 3.1 70B via Ollama — handles 80% of real-world queries
  • Simple SELECTs, basic JOINs, WHERE clauses
  • Runs locally, no API costs, no data leaves your Mac
  • Download free at ollama.com

For complex work (paid)

  • Claude 3.5 Sonnet — best D1 dialect awareness
  • Multi-table JOINs, CTEs, time series, migrations
  • Use via API key or through OpenRouter
  • Worth the cost when accuracy matters

Our suggested setup

Ollama + Claude

Use Ollama daily for free. Switch to Claude for the hard stuff.

Bottom line: Don't overpay for simple queries. A free local model handles the majority of everyday SQL work. But when you're dealing with complex JOINs, D1-specific time series, or Postgres-to-D1 migrations — a premium model pays for itself in time saved debugging.

MyD1's AI assistant lets you switch between providers in seconds. Start with Ollama and upgrade when you need to. Browse the free models library to get started.

Related: Why We Built MyD1 · D1 vs MySQL vs PostgreSQL