Skip to main content

Sedona / Spark

As a series of GeoParquet files, Overture data is already optimized for distributed computing environments. The following example shows you how to work with Overture data in Sedona, a cluster computing system for spatial data.

For this example, you can spin up a single-node Sedona Docker image from Apache Software Foundation DockerHub. In production, Sedona can be deployed to nearly any cloud environment (Databricks, AWS EMR, etc.), or check out Wherobots to learn more about hosted Sedona environments.

Example

To get started with the single-node docker image, ensure your docker client is started, and then run:

docker pull apache/sedona
docker run -p 8888:8888 apache/sedona:latest

A Jupyter Lab and notebook examples is now available in your browser at http://localhost:8888.

Create a new Python notebook with the following first cell:

from sedona.spark import *

config = SedonaContext.builder().config(
"fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"
).getOrCreate()
sedona = SedonaContext.create(config)

After initializing your PySpark/Sedona environment, you can load theme data directly from S3. The following examples leverage Sedona's understanding of GeoParquet, so we can take full advantage of spatial queries:

places = sedona.read.format("geoparquet").load(
"s3a://overturemaps-us-west-2/release/2024-09-18.0/theme=places/type=place/")

For example, to find all of the places in Seattle, you can apply a spatial filter with the bounding box for Seattle:

places.filter("""ST_Contains(
ST_GeomFromWKT('POLYGON((-122.459681 47.734124, -122.224433 47.734124, -122.224433 47.481002, -122.459681 47.481002, -122.459681 47.734124))'),
geometry)""").limit(100).show()

Or, find all of the places within 1km of the Space Needle with the following query:

places.filter("""ST_DistanceSpheroid(
ST_GeomFromWKT('POINT(-122.3493 47.6204)'),
geometry) < 1000
""")

tip

These are just examples to show you how to interface with Overture data in Sedona. Running locally on a single-node docker image doesn't offer any performance benefits, but when deployed in a distributed cloud environment, you can operate on the entire Overture dataset.

For more examples from wherobots, check out their Overture-related Notebook examples.