Skip to main content

Extracting Data Locally

DuckDB

DuckDB is an analytics tool that can query remote parquet files using SQL. It will only download the subset of files it needs to fulfill your queries. Note: updating to DuckDB 0.10.0 will dramatically improve query performance because this version enables faster row and row-group filtering on parquet files containing a struct. (A struct is a column containing one or more other columns in named fields and is analogous to a JSON object.)

Install DuckDB locally. You'll need extensions to work with spatial data in the cloud. Using the DuckDB CLI, do the following:

  1. INSTALL SPATIAL; to install the duckdb_spatial extension.
  2. INSTALL httpfs; or INSTALL azure; to read from either Amazon S3 (httpfs) or Microsoft Azure Blob Storage (azure).

Here is an example query that downloads all of the country boundaries from the admins theme and creates a single GeoJSON file:

LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';

CREATE OR REPLACE VIEW admins_view AS (
SELECT
*
FROM
read_parquet('s3://overturemaps-us-west-2/release/2024-05-16-beta.0/theme=admins/type=*/*', filename=true, hive_partitioning=1)
);

COPY (
SELECT
admins.id,
admins.subtype,
admins.iso_country_code_alpha_2,
admins.admin_level,
areas.area_id,
names.primary AS primary_name,
sources[1].dataset AS primary_source,
ST_GeomFromWkb(areas.area_geometry) AS geometry
FROM admins_view AS admins
INNER JOIN (
SELECT
id AS area_id,
locality_id,
geometry AS area_geometry
FROM admins_view
) AS areas ON areas.locality_id == admins.id
WHERE admins.admin_level = 1
) TO 'countries_overture.geojson'
WITH (FORMAT GDAL, DRIVER 'GeoJSON', SRS 'EPSG:4326');

Apache Sedona (Python + Spatial SQL)

Sedona is a cluster computing system for spatial data. You can get a single-node Sedona Docker image from Apache Software Foundation DockerHub or install Sedona to Databricks, AWS EMR and Snowflake using Wherobots.

To get started with the single-node docker image, run:

docker pull apache/sedona
docker run -p 8888:8888 apache/sedona:latest

A Jupyter Lab and notebook examples will then be available at localhost:8888. The following Python + Spatial SQL code reads the Places dataset and runs a spatial filter query on it.

from sedona.spark import *

config = SedonaContext.builder().config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider").getOrCreate()
sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet").load("s3a://overturemaps-us-west-2/release/2024-03-12-alpha.0/theme=places/type=place")
df.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-122.48 47.43,-122.20 47.75,-121.92 47.37,-122.48 47.43))'), geometry) = true").show()

For more examples from wherobots, check out their Overture-related Notebook examples.