A Resource For Working With Geospatial Data In Neo4j
In this post we explore some techniques for working with geospatial data in Neo4j. We will cover some basic spatial Cypher functions, spatial search, routing algorithms, and different methods of importing geospatial data into Neo4j.
I recently went through the examples in the Spatial Cypher Cheat Sheet in an episode of the Neo4j Livestream, you can watch therecordingbelow:
The first page of theSpatial Cypher Cheat Sheetintroduces Cypher and the property graph data model, the spatial types available in the Neo4j database, as well as some of the spatial functions available in Cypher. We also touch on importing geospatial data into Neo4j (from CSV and GeoJSON) as well as some of the path-finding algorithms (breadth-first search, Dijkstra’s Algorithm, and A*).
Page 2 of theSpatial Cypher Cheat Sheetcovers using Neo4j with Python. First, using the Neo4j Python driver to query data from Neo4j to build a GeoDataFrame. We then explore using the OSMNx Python package for fetching data from OpenStreetMap to load a road network into Neo4j.
You candownload the PDF version of the Spatial Cypher Cheat Sheet,or read on for the same content in the rest of this blog post.
Introduction To Geospatial Cypher Functions With Neo4j
- Intro To Cypher & The Property Graph Data Model
- Spatial Point Type
- Spatial Cypher Functions
- Data Import
- Routing With Path Finding Algorithms
- The Neo4j Python Driver
- Creating A GeoDataFrame From Data Stored In Neo4j
- Working With OpenStreetMap Data
- Loading A Road Network With OSMNx
- Analyzing and Visualizing Road Networks With Neo4j Bloom And Graph Data Science
Introduction To Geospatial Cypher Functions With Neo4j
Intro To Cypher And The Property Graph Data Model
Neo4j is a database management system (DBMS) that uses the property graph data model which is composed of nodes, relationships, and properties to model, store, and query data as a graph. Nodes can have one or more labels, relationships have a single type and direction. Key-value pair properties can be stored on nodes and relationships.
The Cypher query language is used to query data and interact with Neo4j. Cypher is a declarative query language that uses ASCII-art like syntax to define graph patterns that form the basis of most query operations. Nodes are defined with parenthesis, relationships with square brackets, and can be combined to create complex graph patterns. Common Cypher commands areMATCH
(find where the graph pattern exists),CREATE
(add data to the database using the specified graph pattern), andRETURN
(return a subset of the data as a result of a traversal through the graph.
MATCHp=(sfo:Airport{iata:"SFO"})-[:FLIGHT_TO*2]->(rsw:Airport{iata:"RSW"})RETURNp
Spatial Point Type
Neo4j supports 2D or 3D geographic (WGS84) or cartesian coordinate reference system (CRS) points. Here we create a point by specifying latitude/longitude, WGS84 is assumed when using latitude/longitude:
RETURNpoint({latitude:37.62245,longitude:-122.383989})
Point data can be stored as properties on nodes or relationships. Here we create an Airport node and set its location as a point.
CREATE(a:Airport)
SETa.iata="SFO",
a.location=point({latitude:37.62245,longitude:-122.383989})
RETURNa
Database indexes are used to speed up search performance. Here we create a database index on the location property for Airport nodes. This will help us find airports faster when searching for airports by location (radius distance or bounding box spatial search).
CREATEPOINTINDEXairportIndexFOR(a:Airport)ON(a.location)
Spatial Cypher Functions
Radius Distance Search
To find nodes close to a point or other nodes in the graph we can use thepoint.distance()
function to perform a radius distance search
MATCH(a:Airport)
WHEREpoint.distance(
a.location,
point({latitude:37.55948,longitude:-122.32544}))<20000
RETURNa
Within Bounding Box
To search for nodes within a bounding box we can use thepoint.withinBBox()
function.
MATCH(a:Airport)
WHEREpoint.withinBBox(
a.location,
point({longitude:-122.325447,latitude:37.55948}),
point({longitude:-122.314675,latitude:37.563596}))
RETURNa
See also my other blog post that goes into a bit more detail on spatial search functionality with Neo4j, including point in polygon:Spatial Search Functionality With Neo4j
Geocoding
To geocode a location description into latitude, longitude location we can use theapoc.spatial.geocode()
procedure. By default this procedure uses the Nominatim geocoding API but can be configured to use other geocoding services, such as Google Cloud.
CALLapoc.spatial.geocode('SFO Airport')YIELDlocation
---------------------------------------------------------------
{
"description":"San Francisco International Airport, 780, South Airport Boulevard, South San Francisco, San Mateo County, CAL Fire Northern Region, California, 94128, United States",
"longitude":-122.38398938548363,
"latitude":37.622451999999996,
}
Data Import
We can use Cypher to import data into Neo4j from formats such as CSV and JSON, including GeoJSON.
CSV
Using theLOAD CSV
Cypher command to create an airport routing graph.
1 - Create a constraint on the field that identies uniquesness, in this case Airport IATA code. This ensures we won’t create duplicate airports but also creates a database index to improve performance of our data import steps below.
CREATECONSTRAINTFOR(a:Airport)REQUIREa.iataISUNIQUE;
2 - Create Airport nodes, storing their location, name, IATA code, etc as node properties.
LOADCSVWITHHEADERS
FROM"https://cdn.neo4jlabs /data/flights/airports.csv"
ASrow
MERGE(a:Airport{iata:row.IATA_CODE})
ONCREATESETa.city=row.CITY,
a.name=row.AIRPORT,
a.state=row.STATE,
a.country=row.country,
a.location=
point({latitude:toFloat(row.LATITUDE),
longitude:toFloat(row.LONGITUDE)
});
3 - Create FLIGHT_TO relationships connecting airports with a connecting flight. Increment the num_flights counter variable to keep track of the number of flights between airports per year.
:auto
LOADCSVWITHHEADERS
FROM"https://cdn.neo4jlabs /data/flights/flights.csv"ASrow
CALL{
WITHrow
MATCH(origin:Airport{iata:row.ORIGIN_AIRPORT})
MATCH(dest:Airport{iata:row.DESTINATION_AIRPORT})
MERGE(origin)-[f:FLIGHT_TO]->(dest)
ONCREATESET
f.num_flights=0,f.distance=toInteger(row.DISTANCE)
ONMATCHSET
f.num_flights=f.num_flights+1
}INTRANSACTIONSOF50000ROWS;
GeoJSON
We can also store arrays of Points to represent complex geometries like lines and polygons, for example to represent land parcels.
CALLapoc.load.json('https://cdn.neo4jlabs /data/landgraph/parcels.geojson')
YIELDvalue
UNWINDvalue.featuresASfeature
CREATE(p:Parcel)SET
p.coordinates=[coordINfeature.geometry.coordinates[0]|point({latitude:coord[1],longitude:coord[0]})]
p+=feature.properties;
Routing With Path Finding Algorithms
Shortest Path
The shortestPath function performs a binary breadth-first search to find the shortest path between nodes in the graph.
MATCHp=shortestPath(
(:Airport{iata:"SFO"})-[:FLIGHT_TO*..10]->(:Airport{iata:"RSW"})
)RETURNp
ShortestWeightedPath
Often we want to consider the shortest weighted path taking into account distance, time or some other cost stored as relationship properties. Dijkstra and A* are two algorithms that take relationship (or edge) weights into account when calculating the shortest path.
Dijkstra's Algorithm
Dijkstra’s algorithm is similar to a breadth-first search, but takes into account relationship properties (distance) and prioritizes exploring low-cost routes first using a priority queue.
MATCH(origin:Airport{iata:"SFO"})
MATCH(dest:Airport{iata:"RSW"})
CALL
apoc.algo.dijkstra(
origin,
dest,
"FLIGHT_TO",
"distance"
)
YIELDpath,weight
UNWINDnodes(path)ASn
RETURN{
airport:n.iata,
lat:n.location.latitude,
lng:n.location.longitude
}ASroute
A* Algorithm
The A* algorithm adds a heuristic function to choose which paths to explore. In our case the heuristic is the distance to the final destination.
MATCH(origin:Airport{iata:"SFO"})
MATCH(dest:Airport{iata:"RSW"})
CALL
apoc.algo.aStarConfig(
origin,
dest,
"FLIGHT_TO",
{
pointPropName:"location",
weight:"distance"
}
)
YIELDweight,path
RETURNweight,path
There are additional path-finding algorithms available in Neo4j'sGraph Data Science Library.
Using Neo4j With Python For Geospatial Data
The Neo4j Python Driver
In this section we’ll use the Neo4j Python Driver to create a GeoDataFrame of our flight data. We’ll also compute weighted degree centrality so we can plot airport size relative to their “importance” in the US airline network. The Neo4j Python Driver can be installed with:
pip install neo4j
Creating A GeoDataFrame From Data Stored In Neo4j
First we import theneo4j
Python package, define our connection credentials for our Neo4j instance (here we are using a local Neo4j instance), and create adriver
instance.
importneo4j
NEO4J_URI="neo4j://localhost:7689"
NEO4J_USER="neo4j"
NEO4J_PASSWORD="letmeinnow"
NEO4J_DATABASE="neo4j"
driver=neo4j.GraphDatabase.driver(
NEO4J_URI,auth=(NEO4J_USER,NEO4J_PASSWORD)
)
Next, we define a Cypher query to fetch data from Neo4j. In addition to fetching each flight between airports we computeweighted degree centrality,a measure of each node's importance in the network, the sum of all relationship weights connected to a given node, in this case the number of flights per year for each airport.
We also return the geometry of our origin and destination airports, and the flight route as Well Known Text (WKT).POINT
for the airports andLINESTRING
for the flight route. We’ll parse this WKT when defining the geometry in our GeoDataFrame.
AIRPORT_QUERY="""
MATCH (origin:Airport)-[f:FLIGHT_TO]->(dest:Airport)
CALL {
WITH origin
MATCH (origin)-[f:FLIGHT_TO]-()
RETURN sum(f.num_flights) AS origin_centrality
}
CALL {
WITH dest
MATCH (dest)-[f:FLIGHT_TO]-()
RETURN sum(f.num_flights) AS dest_centrality
}
RETURN {
origin_wkt:
"POINT ( "+ origin.location.longitude +" "+ origin.location.latitude +" ) ",
origin_iata: origin.iata,
origin_city: origin.city,
origin_centrality: origin_centrality,
dest_centrality: dest_centrality,
dest_wkt:
"POINT ( "+ dest.location.longitude +" "+ dest.location.latitude +" ) ",
dest_iata: dest.iata,
dest_city: dest.city,
distance: f.distance,
num_flights: f.num_flights,
geometry:
"LINESTRING ( "+ origin.location.longitude +" "+ origin.location.latitude +", "
+ dest.location.longitude +" "+ dest.location.latitude +" ) "
}
"""
Next we define a Python function to execute our Cypher query and process the results into a GeoPandas GeoDataFrame. The Neo4j Python driver has a.to_df()
method which will convert a Neo4j result set to a Pandas DataFrame. Note that we parse the WKT columns into GeoSeries and convert the pandas DataFrame into a GeoPandas GeoDataFrame.
defget_airport(tx):
results=tx.run(AIRPORT_QUERY)
df=results.to_df(expand=True)
df.columns=
['origin_city','origin_wkt','dest_city','dest_wkt','origin_centrality','distance','origin_iata',
'geometry','num_flights','dest_centrality','dest_iata']
df['geometry']=geopandas.GeoSeries.from_wkt(df['geometry'])
df['origin_wkt']=geopandas.GeoSeries.from_wkt(df['origin_wkt'])
df['dest_wkt']=geopandas.GeoSeries.from_wkt(df['dest_wkt'])
gdf=geopandas.GeoDataFrame(df,geometry='geometry')
returngdf
withdriver.session(database=NEO4J_DATABASE)assession:
airport_df=session.execute_read(get_airport)
We now have a GeoDataFrame where each row is a flight route between two airports. We can plot the airport and routes, using the centrality metric to size airport nodes: more important airports should be larger.
world=geopandas.read_file(
geopandas.datasets.get_path('naturalearth_lowres')
)
ax=world[world.continent=='North America']
.plot(color='white',edgecolor='black')
flights_gdf=flights_gdf.set_geometry("origin_wkt")
flights_gdf.plot(ax=ax,markersize='origin_centrality')
flights_gdf=flights_gdf.set_geometry("geometry")
flights_gdf.plot(ax=ax,markersize=0.1,linewidth=0.01)
Working With OpenStreetMap Data
In this section we will import data from OpenStreetMap into Neo4j using the OSMNx Python package. Below is the property graph data model we will use to model the road network of Boston.
pipinstallosmnx
Loading A Road Network With OSMNx
importosmnxasox
G=ox.graph_from_place("Boston, MA, USA",network_type="drive")
fig,ax=ox.plot_graph(G)
gdf_nodes,gdf_relationships=ox.graph_to_gdfs(G)
gdf_nodes.reset_index(inplace=True)
gdf_relationships.reset_index(inplace=True)
Here is our nodes GeoDataFrame. Each row represents an intersection in the Boston road network:
Here is our relationships GeoDataFrame. Each row represents a road segment connecting two intersections.
We'll define a Cypher query to add intersection nodes from the nodes GeoDataFrame and add road segments from the relationships GeoDataFrame connecting intersection nodes.
First let's create a constraint to ensure we don't have duplicate Intersection nodes (this will also create a node index to improve lookups during import):
CREATECONSTRAINTFOR(i:Intersection)REQUIREi.osmidISUNIQUE
We can also create an index on theosmid
property of theROAD_SEGMENT
relationship to improve import performance:
CREATEINDEXFOR()-[r:ROAD_SEGMENT]-()ONr.osmid
Because our GeoDataFrames can be very large we break them up into batches to avoid sending too much data to the database at once.
node_query='''
UNWIND $rows AS row
WITH row WHERE row.osmid IS NOT NULL
MERGE (i:Intersection {osmid: row.osmid})
SET i.location =
point({latitude: row.y, longitude: row.x }),
i.ref = row.ref,
i.highway = row.highway,
i.street_count = toInteger(row.street_count)
RETURN COUNT(*) as total
'''
rels_query='''
UNWIND $rows AS road
MATCH (u:Intersection {osmid: road.u})
MATCH (v:Intersection {osmid: road.v})
MERGE (u)-[r:ROAD_SEGMENT {osmid: road.osmid}]->(v)
SET r.oneway = road.oneway,
r.lanes = road.lanes,
r.ref = road.ref,
r.name = road.name,
r.highway = road.highway,
r.max_speed = road.maxspeed,
r.length = toFloat(road.length)
RETURN COUNT(*) AS total
'''
definsert_data(tx,query,rows,batch_size=1000):
total=0
batch=0
whilebatch*batch_size<len(rows):
results=tx.run(query,parameters={
'rows':
rows[batch*batch_size:(batch+1)*batch_size]
.to_dict('records')
}).data()
print(results)
total+=results[0]['total']
batch+=1
withdriver.session()assession:
session.execute_write(insert_data,node_query,gdf_nodes.drop(columns=['geometry']))
session.execute_write(insert_data,rels_query,gdf_relationships.drop(columns=['geometry']))
In this post we introduced some of the spatial functionality natively supported by Neo4j including the point type and related Cypher functions and demonstrated how to accomplish various spatial search operations as well as a brief look at routing with graph algorithms.