Published January 23, 2023.

Update January 26, 2023: Someone beat me to it and wrote a slightly more advanced version of this: https://www.patterns.app/blog/2023/01/18/crunchbot-sql-analyst-gpt

The latest versions of GPT (ChatGPT, GPT-3.5) are very impressive at understanding and generating natural language. They are also good at coding, and at translating natural language to code and vice versa. This makes them a convenient tool for generating code snippets from brief descriptions, adding type annotations to existing code, and much more.

In this article, I want to put these skills to use by “giving database access” to GPT and seeing which questions it can answer about the data. I do this by having GPT generate an SQL query based on the input question, executing that query, and then having GPT interpret the results to formulate an answer. As context, I provide the database schema and column descriptions in the first prompt.
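Schematically, the whole loop fits in a few lines. Here is a minimal sketch, assuming a local SQLite database covid.db (created below) and the completions API of the 2023-era OpenAI Python library; the model choice and prompt wording are placeholders, not the exact ones from my experiments:

```python
import sqlite3

import openai  # 2023-era OpenAI library with the Completion API

# The schema description that goes into the first prompt (see "The Dataset").
SCHEMA = open("owid-covid-codebook.csv").read()

def complete(prompt: str) -> str:
    # Model and sampling parameters are placeholders.
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0
    )
    return response["choices"][0]["text"].strip()

def answer(question: str) -> str:
    # Step 1: let GPT write SQL for the question, with the schema as context.
    sql = complete(f"{SCHEMA}\n\nWrite a SQLite query that answers: {question}\nSQL:")
    # Step 2: execute the generated query against the local database.
    rows = sqlite3.connect("covid.db").execute(sql).fetchall()
    # Step 3: let GPT turn the raw rows back into a natural-language answer.
    return complete(f"Question: {question}\nSQL: {sql}\nResult: {rows}\nAnswer:")
```

Calling answer("How many people died of Covid in the United States?") runs all three steps and returns GPT's phrasing of the query result.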

The results are impressive in some ways; in others, they show the limitations of this approach. I can imagine improving the usefulness by optimizing the prompts and adding intermediary steps, but one limiting factor is the structure and quality of the data (as it so often turns out to be).

Here’s a peek at the end result:

I ask a question in natural language that can be answered by querying the dataset. GPT-3 generates the corresponding SQL correctly and parses the response from SQLite. Note that it realizes that not all locations are countries (e.g. “World”, “High income”) and picks the actual country (”United States”) correctly.

The Dataset

For the experiment, I chose a Covid-19 dataset published by Our World in Data. It contains information about cases and deaths, hospitalizations, vaccines and testing, among other data. I downloaded the CSV file (owid-covid-data.csv), imported it into a SQLite database and ran queries against it.
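The import itself takes only a few lines with pandas. A sketch, where the database file covid.db and table name covid are my choices, not prescribed by the dataset:

```python
import sqlite3

import pandas as pd

# Load OWID's CSV and write it into a local SQLite database.
df = pd.read_csv("owid-covid-data.csv")
with sqlite3.connect("covid.db") as con:
    df.to_sql("covid", con, if_exists="replace", index=False)
```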

Some columns are Covid-19 specific:

| Column | Description |
| --- | --- |
| total_cases | Total confirmed cases of COVID-19. Counts can include probable cases, where reported. |
| new_cases | New confirmed cases of COVID-19. Counts can include probable cases, where reported. In rare cases where our source reports a negative daily change due to a data correction, we set this metric to NA. |
| total_cases_per_million | Total confirmed cases of COVID-19 per 1,000,000 people. Counts can include probable cases, where reported. |

Others contain general data:

| Column | Description |
| --- | --- |
| iso_code | ISO 3166-1 alpha-3 – three-letter country codes. Note that OWID-defined regions (e.g. continents like 'Europe') contain prefix 'OWID_'. |
| continent | Continent of the geographical location |
| location | Geographical location |
| date | Date of observation |
| population | Population (latest available values). See https://github.com/owid/covid-19-data/blob/master/scripts/input/un/population_latest.csv for full list of sources |
| population_density | Number of people divided by land area, measured in square kilometers, most recent year available |

There are 67 columns in total. Each row contains data for a particular date in one location.
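To get a feel for that granularity, one row per location per date, here is a quick look at the imported table (again assuming the covid.db and covid names from above):

```python
import sqlite3

con = sqlite3.connect("covid.db")
# The three most recent rows for one location: one row per (location, date).
for row in con.execute(
    "SELECT date, location, new_cases FROM covid "
    "WHERE location = 'Germany' ORDER BY date DESC LIMIT 3"
):
    print(row)
```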

Generating SQL from Natural Language

GPT-3 is able to translate natural language to SQL queries out of the box:

GPT-3 output is highlighted in green.

However, this assumes a fictitious database and schema, which won’t work for our specific dataset. To provide the relevant knowledge, I include a description of the schema as part of the prompt. Conveniently, Our World in Data publishes such a description: owid-covid-codebook.csv.

For clarity, I only included the first few columns in this prompt which, in this case, are enough to generate the correct SQL. The full schema description contains 68 lines and is used in full in the finished program.
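Assembling the schema part of the prompt from the codebook is mechanical. A sketch, assuming the published codebook's header names ("column" and "description"; the file has further columns, which can be ignored):

```python
import csv

# Turn OWID's codebook into a schema description for the prompt.
# The header names "column" and "description" are taken from the published
# codebook file; adjust if they differ.
with open("owid-covid-codebook.csv") as f:
    lines = [f"{r['column']}: {r['description']}" for r in csv.DictReader(f)]

schema = "The SQLite table 'covid' has these columns:\n" + "\n".join(lines)
```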