Skip to main content

DuckDB

DuckDB is powerful analytics tool that allows you to query remote files and download only the data you want. You'll need to install at least DuckDB 1.1.0, which supports reading and writing geoparquet.

Installation

Install DuckDB. You'll need extensions to work with spatial data in the cloud. Run the following commands in the DuckDB command line:

  1. INSTALL SPATIAL; to install the duckdb_spatial extension.
  2. INSTALL httpfs; or INSTALL azure; to read from either Amazon S3 (httpfs) or Microsoft Azure Blob Storage (azure).

Downloading Overture Data

These SQL queries will access Overture data in the cloud and download only the attributes that you request.

Download all pizza restaurants in New York City as a GeoJSON file.

LOAD spatial; -- noqa

SET s3_region='us-west-2';

COPY( -- COPY <query> TO <output> saves the results to disk.
SELECT
id,
names.primary as name,
confidence AS confidence,
CAST(socials AS JSON) as socials, -- Ensure each attribute can be serialized to JSON
geometry -- DuckDB understands this to be a geometry type
FROM read_parquet('s3://overturemaps-us-west-2/release/2025-05-21.0/theme=places/type=place/*', filename=true, hive_partitioning=1)
WHERE categories.primary = 'pizza_restaurant'
AND bbox.xmin BETWEEN -75 AND -73 -- Only use the bbox min values
AND bbox.ymin BETWEEN 40 AND 41 -- because they are point geometries.

) TO 'nyc_pizza.geojson' WITH (FORMAT GDAL, DRIVER 'GeoJSON');

See the places schema to learn more about each attribute or the common schema concepts to better understand names.primary from names.common.


tip

To download the data as shapefile, replace the last line of any of the above queries:

) TO 'overture_data.shp' WITH (FORMAT GDAL, DRIVER 'ESRI Shapefile');

or as a geopackage:

) TO 'overture_data.gpkg' WITH (FORMAT GDAL, DRIVER 'GPKG');

The DuckDB Spatial extension supports more than 50 geospatial file formats.

You can also use DuckDB to loop over the Parquet files in a directory and convert the data to a spatial format of your choice. The spatial extension uses GDAL so you could export to flatgeobuf if that's the data format that works best for you. Here's a bash script that shows you how:

for f in *.parquet;
do duckdb -c \
"install spatial;
load spatial;
COPY (SELECT id, st_geomfromwkb(geometry) from read_parquet('$f'))
to '$f.fgb' WITH (FORMAT GDAL, DRIVER 'flatgeobuf');";
done

Regional Extracts

We can use data from Overture's division theme to define our bounds for a regional extract.

This example stores the GERS ID for the division in the division_id variable. If you know the GERS ID, you can simply set it with: SET variable division_id = '<GERS ID>'.

If you don't know the ID, you can set the variable to the result of a query for that specific division, as shown here:

INSTALL spatial; --noqa
LOAD spatial; --noqa


-- SET variable division_id = <GERS ID HERE>

-- Or, search for it:
SET variable division_id = (
SELECT
id
FROM
read_parquet('s3://overturemaps-us-west-2/release/2025-05-21.0/theme=divisions/type=division/*.parquet')
WHERE
names.primary = 'Marion County' AND subtype = 'county'
LIMIT 1
);

-- Fetch the bounds and geometry of the Region
CREATE OR REPLACE TABLE bounds AS (
SELECT
id AS division_id, names.primary, geometry, bbox
FROM
read_parquet('s3://overturemaps-us-west-2/release/2025-05-21.0/theme=divisions/type=division_area/*.parquet')
WHERE
division_id = getvariable('division_id')
);

-- Extract the bounds and geometry of the division into variables for faster table scan
SET variable xmin = (select bbox.xmin FROM bounds);
SET variable ymin = (select bbox.ymin FROM bounds);
SET variable xmax = (select bbox.xmax FROM bounds);
SET variable ymax = (select bbox.ymax FROM bounds);
SET variable boundary = (select geometry FROM bounds);

-- Create a local GeoParquet file.
COPY(
SELECT
*
FROM
read_parquet('s3://overturemaps-us-west-2/release/2025-05-21.0/theme=buildings/type=building/*.parquet')
WHERE
bbox.xmin > getvariable('xmin')
AND bbox.xmax < getvariable('xmax')
AND bbox.ymin > getvariable('ymin')
AND bbox.ymax < getvariable('ymax')
AND ST_INTERSECTS(
getvariable('boundary'),
geometry
)
) TO 'extract.parquet';

-- Convert GeoParquet to line-delimited GeoJSON (or any other GDAL format)
COPY(
SELECT
id,
subtype,
class,
height,
names.primary as name,
geometry
FROM 'extract.parquet'
) TO 'extract.geojsonseq' WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq');
note

This query is broken into discrete steps for efficiency:

  1. Using DuckDBs variables in the WHERE clause proved more performant than joining to the bounds table.
  2. Storing the result first as GeoParquet and then converting to another format is significantly faster than doing it all in one step.

Next steps