
Monday, October 7, 2024

Scale and Resolution Effects on Spatial Data

What a couple of weeks it has been this semester. Hurricane Helene threatened the area during the final week of September, shifting everyone's focus to preparation and expected impacts. The storm center passed approximately 90 miles to our west. While coastal impacts were severe, we were spared the brunt inland, even keeping electricity throughout the storm.

Followed that with a preplanned trip for AARoads to Puerto Rico. Then got started on the final module for GIS Special Topics and increased my time investment in the module heading into this past weekend as newly named Tropical Storm Milton formed in the Bay of Campeche. A Category 5 hurricane as of this writing, Hurricane Milton is expected to make landfall somewhere on the west coast of Florida on Wednesday or Thursday. While wind shear is eventually expected to weaken the storm, unlike Helene, Debby, Idalia and other storms, Milton is forecast to be a major wind event for inland locations. So anxiety levels are high!

The sixth module for GIS Special Topics investigates the effects of scale on vector spatial data and resolution on raster spatial data. The lab also covers spatial data aggregation and the concept of gerrymandering using GIS.

There are multiple meanings of scale to consider for Geographic Information Systems (Zanbergen, 2004):
  • as an indication of the relationship between units on a map and units in the real world. This is typically a representative fraction, which is commonly used with USGS Quads and GIS Maps in general.
  • to indicate the extent of the area of interest. Examples include spatial areas such as neighborhoods, cities, counties and regions.
  • to express the amount of detail or resolution. The resolution of a raster spatial dataset is the cell size, such as 10 meters for the Sentinel 2 blue, green and red spectral bands. This defines the scale of the data.
Scale in the Raster Data Model is straightforward: it is represented by the resolution, or cell size. A general rule is that a real-world object needs to be at least as large as a cell in order to be recognizable.

Scale in the Vector Data Model also represents the amount of detail. While there is no single best method to express scale in vector data, a good indicator is the size of the smallest polygon or length of the shortest segment of a polyline.

When measuring the length of a complex shape, the total length depends on the smallest unit of the measuring tool. As the measurement unit decreases, the total measured length of the shape increases. More nodes and connecting segments result in longer shape lengths or area perimeters. The following images illustrate the differences in scale for the Vector Data Model.
Differing scales of Wake County, NC water flowlines
Water flowline vector data for Wake County, NC in different scales
Polygon vector data for Wake County, NC waterbodies at different scales
Waterbodies vector data for Wake County, NC in different scales

The properties of a Digital Elevation Model (DEM) depend upon the resolution used. Higher resolution provides more detail. When measuring slope, values decrease as the cell size increases and detail is lost; higher detail results in steeper slopes. This effect applies across the full range of slopes, not just in steep areas of terrain (Zanbergen, 2004).
Scatterplot showing the relationship of Resolution vs. Slope in a DEM
Quantification of Resolution vs. Slope for a DEM in lab
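To illustrate this effect outside of ArcGIS Pro, the short Python sketch below compares the mean slope computed from a synthetic elevation grid (not the lab DEM) at progressively coarser cell sizes. The block-averaging resampling and the terrain values are assumptions for demonstration only.

import numpy as np

def mean_slope_deg(dem, cell_size):
    # Slope from elevation gradients: rise/run converted to degrees.
    dzdy, dzdx = np.gradient(dem, cell_size)
    return np.degrees(np.arctan(np.hypot(dzdx, dzdy))).mean()

def coarsen(dem, factor):
    # Resample to a larger cell size by block-averaging the fine cells.
    rows, cols = dem.shape
    trimmed = dem[:rows - rows % factor, :cols - cols % factor]
    return trimmed.reshape(rows // factor, factor, cols // factor, factor).mean(axis=(1, 3))

# Synthetic terrain: a gentle planar trend plus fine-scale roughness (not the lab data).
rng = np.random.default_rng(0)
fine = np.add.outer(np.linspace(0, 30, 300), np.linspace(0, 15, 300)) + rng.normal(0, 2.0, (300, 300))

for factor, cell_size in [(1, 10), (3, 30), (9, 90)]:
    dem = fine if factor == 1 else coarsen(fine, factor)
    print(f"{cell_size:>2} m cells: mean slope {mean_slope_deg(dem, cell_size):.2f} degrees")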

The Modifiable Areal Unit Problem (MAUP) factors into deciding what scale to use for analysis of spatial data. MAUP is a complication with statistical analysis when quantifying areal data. There are two facets of MAUP.

Scale Effect
The optimal spatial scale for analysis is generally not known, as there are multiple scales at which analysis could theoretically be conducted (Manley, 2013). Results can be skewed positively or negatively depending upon the size of the aggregation units used.

Zoning Effect
The method used to create areal units. This effect is the result of how spatial data is separated, such as the grouping of smaller areal units into a smaller number of larger areal units (Dark & Bram, 2007). Changing the grouping can manipulate the results of spatial analysis.

Part 2 of the lab, which conducts linear regression analysis of poverty statistics for Florida from U.S. Census data, provides an example of MAUP. Different levels of aggregation convey different results:

Linear Regression Results based upon Congressional District
Linear Regression Results based upon Congressional District

Linear Regression Results based upon Counties
Linear Regression Results based upon Counties

Linear Regression Results based upon Zip Codes
Linear Regression Results based upon Zip Codes
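To see how aggregation alone can shift a regression, the sketch below runs the same hypothetical point data through three levels of grouping. The values, random assignment and group counts are synthetic stand-ins, not the Census data used in the lab.

import numpy as np
from scipy.stats import linregress

# Hypothetical unit-level data: an explanatory percentage vs. percent below poverty.
rng = np.random.default_rng(1)
n = 600
x = rng.uniform(0, 100, n)
y = 0.3 * x + rng.normal(0, 15, n)

def regress_at(units):
    # Aggregate the points into 'units' groups (a stand-in for zip codes,
    # counties or districts) by averaging, then regress the group means.
    groups = np.array_split(rng.permutation(n), units)
    gx = np.array([x[g].mean() for g in groups])
    gy = np.array([y[g].mean() for g in groups])
    fit = linregress(gx, gy)
    print(f"{units:>3} units: slope = {fit.slope:5.2f}, r^2 = {fit.rvalue ** 2:4.2f}")

for units in (600, 67, 27):   # fine, county-like and district-like aggregation counts
    regress_at(units)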

Gerrymandering is the purposeful manipulation of a district shape with intentional bias (Morgan & Evans, 2018) or to affect political power (Levitt, 2010). Partisan gerrymandering takes place when the political party controlling the redistricting process draws district lines to benefit itself and restrict opportunities for opposition parties. While this maneuvering aims to inordinately increase the political power of a group (Levitt, 2010), the U.S. Supreme Court has ruled that partisan-focused gerrymandering is not unconstitutional (Morgan & Evans, 2018).

GIS can measure gerrymandering through compactness in a number of ways. Compactness is the only common redistricting rule that takes into account the geometric shape of a district. A district is considered compact if it has a regular shape where constituents generally live near each other. A circular district is very compact while a linear district is not (Levitt, 2010).

Thanks to a discussion board post from our classmate Emily Jane, a method for determining compactness that I found easy to interpret is the Reock Score. Using this method, geoprocessing determines the minimum bounding circle around each polygon of a Congressional District, that is, the smallest circle that entirely encloses the district. Reock scoring uses the ratio of the district area to the area of the minimum bounding circle: R = A_D / A_MBC, where A_D is the area of the district and A_MBC is the area of the minimum bounding circle. The score ranges from 0, the least compact, to 1, optimally compact.

Example of the Minimum Bounding Circle used with the Reock Score method
An example of the Minimum Bounding Circle around a District polygon for the Reock Score method 
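As a quick check on the math, here is a minimal Python sketch applying the Reock ratio to hypothetical area values (not the lab dataset):

# Reock score: ratio of district area to the area of its minimum bounding circle.
# Hypothetical area values in square kilometers, not the lab dataset.
districts = {
    "District A": {"district_area": 4200.0, "mbc_area": 5100.0},
    "District B": {"district_area": 3800.0, "mbc_area": 21500.0},
}

for name, a in districts.items():
    reock = a["district_area"] / a["mbc_area"]   # 0 = least compact, 1 = most compact
    print(f"{name}: Reock score = {reock:.2f}")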

Proceeded with the Reock Score analysis using the Minimum Bounding Geometry tool in ArcGIS Pro, which creates a circular polygon for each record in the provided Congressional District dataset. Using the minimum bounding circle area and the area value of each district, I calculated the Reock score in a newly added field. From those scores, the worst gerrymandering 'offenders' in the dataset, those lacking district 'compactness,' were determined.

Florida District 5 - 2nd worst gerrymandering 'offender'
Florida District 5 - 2nd worst gerrymandering 'offender'

North Carolina District 2 - the worst gerrymandering 'offender'
North Carolina District 2 - the worst gerrymandering 'offender'

References

Zanbergen (2004). DEM Resolution. Vancouver Island University, Nanaimo, BC, Canada.

Manley, D. J. (2013). Scale, Aggregation, and the Modifiable Areal Unit Problem. In Handbook of Regional Science. Springer Verlag.

Dark, S. J., & Bram, D. (2007). The modifiable areal unit problem (MAUP) in physical geography. Progress in Physical Geography, 31(5), 471-479.

Morgan, J. D., & Evans, J. (2018). Aggregation of spatial entities and legislative redistricting. The Geographic Information Science & Technology Body of Knowledge, 2018(Q3).

Levitt, J. (2010). A Citizen's Guide to Redistricting. New York, NY: Brennan Center for Justice at New York University School of Law.



Sunday, September 22, 2024

Interpolation Methods - Tampa Bay Water Quality

There are numerous spatial interpolation methods used to generate surfaces in GIS. Spatial interpolation is the prediction of variables at unmeasured locations based upon sampling of the same variables at known locations, or true points. Related, spatial prediction is the estimation of variables at unsampled locations based partly on other variables and a collective set of measurements. Comprised of spatially continuous data, surfaces can represent topography, air pollution, soil moisture, air temperature, population density and more (Bolstad & Manson, 2022).

A number of factors can affect the performance of spatial interpolation methods. Some of these factors are data accuracy, temporality of the data, sampling design, sample spatial distribution, the presence of abnormal values or outliers, and the correlation of primary and secondary variables (Hu, 1995; Li & Heap, 2014).

Deciding upon the best interpolation method is not always a straightforward process. Methods often work well for a specific data set because of inherent assumptions and algorithm design for estimation. Different interpolation methods applied to the same data set may produce desired results for one study objective but not another (Hu, 1995).

Module 5 for GIS Special Topics performs interpolation analyses on Tampa Bay water quality data. Specifically, four methods are used to estimate Biochemical Oxygen Demand (BOD), in milligrams per liter, across Tampa Bay. A point feature class of BOD sample locations is provided, and the study area covers all of Tampa Bay, Old Tampa Bay and Hillsborough Bay. A statistical analysis of each result is compared to determine which derived surface best describes water quality.

The first interpolation method implemented for the Tampa Bay water quality analysis is Thiessen Polygon. This method was the easiest to interpret. It aggregates the point dataset within the study area into polygons, one per point, with each point referred to as the centroid of its polygon. Every estimated location within a Thiessen polygon (proximal zone) is closer to the associated centroid than to any other centroid in the overall analysis, and takes on that centroid's value.

The Thiessen Polygon method is optimal when there is no uniform distribution of the sample points. The method is applicable to environmental management (Wrublack et al., 2013).

Thiessen Polygon interpolation of Tampa Bay water quality
The Thiessen Polygon raster with an output cell size of 250.
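Conceptually, Thiessen interpolation is nearest-neighbor assignment: every estimation location takes the value of its closest sample point. A minimal SciPy sketch of that idea, using synthetic sample points rather than the lab BOD data:

import numpy as np
from scipy.spatial import cKDTree

# Synthetic sample points (x, y) and BOD-like values; not the lab dataset.
rng = np.random.default_rng(2)
samples = rng.uniform(0, 10_000, (25, 2))
bod = rng.uniform(1.0, 6.0, 25)

# Build a 250 m estimation grid over a square stand-in study area.
xs, ys = np.meshgrid(np.arange(0, 10_000, 250), np.arange(0, 10_000, 250))
grid = np.column_stack([xs.ravel(), ys.ravel()])

# Thiessen assignment: each grid cell takes the value of its nearest sample.
_, nearest = cKDTree(samples).query(grid)
surface = bod[nearest].reshape(xs.shape)
print(surface.shape, surface.min(), surface.max())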

Previously discussed in the Isarithmic Mapping lab in Computer Cartography, the Inverse Distance Weighting (IDW) spatial interpolation method estimates values using the values of sample points and the distances to nearby known points (Bolstad & Manson, 2022). Samples closer to a location have more weight in the predicted value than those farther away. The power parameter in the mathematical equation of the method determines the weighting, which decreases as distance increases. When the power parameter increases, a heavier weight is applied to nearby samples, which increases their influence on the estimation (Ikechukwu, 2017).

The IDW method assumes that the underlying surface is smooth. It works well with regularly spaced data, but cannot account for the spatial clustering of sample points (Li & Heap, 2014).

Tampa Bay water quality estimates from the IDW method
The IDW raster for water quality. The power parameter was 2 and output cell size of 250.
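The weighting itself is simple to express in code. Below is a minimal sketch of an IDW estimate at a single location, assuming the standard 1/d^p weight and hypothetical sample values (not the lab data):

import math

def idw_estimate(location, samples, power=2):
    # Inverse Distance Weighting estimate at one location.
    # samples: list of ((x, y), value) pairs; power: the IDW power parameter.
    num = den = 0.0
    for (x, y), value in samples:
        d = math.hypot(location[0] - x, location[1] - y)
        if d == 0:
            return value              # exactly on a sample point
        w = 1.0 / d ** power          # weight falls off with distance
        num += w * value
        den += w
    return num / den

# Hypothetical BOD samples (mg/L) around an estimation point.
samples = [((0, 0), 2.1), ((300, 100), 3.4), ((-250, 400), 4.0), ((900, -700), 5.5)]
print(round(idw_estimate((100, 100), samples, power=2), 2))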

Spline interpolation uses a mathematical function to interpolate a smooth curve along a set of sample data points with minimal curvature. Polynomial functions calculate the segments between join points. These accommodate local adjustments and define the amount of smoothing. The method is named after splines, the flexible ruler cartographers used to fit smooth curves through fixed points (Ikechukwu, 2017).

The performance of Splines improves when dense, regularly-spaced data is used (Li & Heap, 2014). The method is very suitable for estimating densely sampled heights and climatic variables (Ikechukwu, 2017).

The lab uses the Regularized and Tension options of the Spline geoprocessing tool in ArcGIS Pro. The option changes how the weight parameter behaves: higher weights with Regularized splines result in smoother surfaces, while a weight of zero with the Tension option results in a basic thin plate spline interpolation, also referenced as the basic minimum curvature technique.

Tampa Bay water quality - Regularized Spline interpolation
Estimated Tampa Bay water quality - Regularized Spline Interpolation Method

Tampa Bay water quality - Tension Spline Interpolation Method
Estimated Tampa Bay water quality - Tension Spline Interpolation Method
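For comparison outside of ArcGIS Pro, a thin plate spline surface can be sketched with SciPy's RBFInterpolator. The points, values and smoothing setting below are synthetic assumptions, not the lab data or the Spline tool's exact weight parameter:

import numpy as np
from scipy.interpolate import RBFInterpolator

# Synthetic sample locations and BOD-like values; not the lab dataset.
rng = np.random.default_rng(3)
pts = rng.uniform(0, 10_000, (30, 2))
vals = 2.0 + 0.0003 * pts[:, 0] + rng.normal(0, 0.3, 30)

# Thin plate spline fit; smoothing=0 interpolates the samples exactly.
tps = RBFInterpolator(pts, vals, kernel="thin_plate_spline", smoothing=0.0)

# Evaluate the fitted surface on a coarse grid of estimation points.
xs, ys = np.meshgrid(np.linspace(0, 10_000, 41), np.linspace(0, 10_000, 41))
surface = tps(np.column_stack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
print(surface.min().round(2), surface.max().round(2))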

References:

Bolstad, P., & Manson, S. (2022). GIS Fundamentals (7th ed.). Eider Press.

Hu, J. (1995, May). Methods of generating surfaces in environmental GIS applications. In 1995 ESRI user conference proceedings.

Li, J., & Heap, A. D. (2014). Spatial interpolation methods applied in the environmental sciences: A review. Environmental Modelling & Software, 53, 173-189.

Wrublack, S. C., Mercante, E., & Vilas Boas, M. A. (2013). Water quality parameters associated with soil use and occupation features by Thiessen polygons. Journal of Food, Agriculture & Environment, 11(2), 846-853.

Ikechukwu, M., Ebinne, E., Idorenyin, U., & Raphael, N. (2017). Accuracy Assessment and Comparative Analysis of IDW, Spline and Kriging in Spatial Interpolation of Landform (Topography): An Experimental Study. Journal of Geographic Information System, 9, 354-371. doi: 10.4236/jgis.2017.93022.

Friday, September 6, 2024

Spatial Data Quality - Road Network Completeness

Continuing the focus on Spatial Data Quality in GIS Special Topics, Module 1.3 covers the Accuracy Assessment of Roads. Road networks are widely used as the basemap for many applications. This factors into expectations for positional accuracy and completeness, which this week's lab covers.

Road networks are also used for geocoding and network routing. The usability of these applications depends upon robust attributes such as street names, address numbers and zip codes, in addition to networking aspects such as turn restrictions and one-way directions. Topologically, road networks must also be robust, with connectivity matching what is found in reality (Zanbergen, 2004).

Typically road network datasets are compiled from an array of historical sources, with digitization from aerial imagery and augmentation from GPS field data collection. One of the most comprehensive datasets in the U.S. with a long lineage is TIGER (Topologically Integrated Geographic Encoding and Referencing).

Produced by the US Census Bureau for 1:100,000 scale maps (Suyoung & O'Hara, 2009), TIGER was originally compiled to be topologically correct. That is, the data was not focused on being as positionally accurate as possible, but instead stressed connections and boundaries (Zanbergen, 2004). This resulted in legacy errors, which were carried over into succeeding updates from 2000 onward.

TIGER roads centerline data for Jackson County, Oregon
TIGER roads centerline data for Jackson County, Oregon

Covered in last week's lab, accuracy assessment of roads utilizes methods such as "ground-truthing" using GPS or surveying equipment, comparing roads with high resolution imagery, and comparing roads to existing datasets deemed to be of higher accuracy.

Last week's look at positional accuracy compared points between two datasets against reference or true points using root-mean-square error (RMSE). Additional methods include buffers, where the true line is buffered by some distance to show discrepancies and to determine whether displacements between matching features fall within an expected nominal accuracy (Suyoung & O'Hara, 2009). In other words, data located outside a buffer (specified tolerance) is deemed a substantial error.

Another method for positional accuracy is line displacement, where the displacement of various sections of a polyline is measured using Euclidean distance. Using matching algorithms, errors show the displacement of one road network from another. These displacements can be summarized (Zanbergen, 2004), or represented as a raster dataset to analyze vector geometry (Suyoung & O'Hara, 2009).

The lab assignment for Module 1.3 conducts accuracy assessment for completeness on two datasets of street centerlines for Jackson County, Oregon. The feature classes are TIGER road data from 2000 and a Streets_Centerlines feature class compiled by Jackson County GIS.

Street Centerlines Data from Jackson County, Oregon GIS
Street Centerlines data from Jackson County, Oregon GIS

Completeness is one of the aspects cited by Haklay (2010) in assessing data quality. Completeness is a measure of the lack of data, i.e. how much data is expected versus how much data is present. Zanbergen (2004) references measuring the total length of a road network and comparing it to a reference scenario, and secondly counting the number of missing elements as a count of features.

Both accuracy assessment scenarios for completeness overlay an arbitrary grid over the compared datasets to determine the total length or count within each smaller unit. A comparison between the two sets of roads based on total length can then be made.

Haklay (2010) references completeness as asking the question of how comprehensive is the coverage of real-world objects. Generalizing this as a simple measure of completeness for our analysis, the dataset with the higher total length of polylines is assumed to be more complete.

Our analysis proceeds by projecting the TIGER roads data into State Plane coordinates to match the other provided datasets. The shape length of each polyline is converted from feet to kilometers in a new field for each road feature class. Statistics for the total length of all road segments per dataset are then summarized for the initial assessment of completeness, where the dataset with more kilometers of roads is considered more complete.

The results were 10,805.82 km of roads for the County Street Centerlines feature class and 11,382.69 km for the TIGER roads feature class. With more data, the TIGER roads data is considered more complete.

Further accuracy assessment for completeness continues with a feature class of grid polygons to be used as the smaller units for comparison. Both feature classes were clipped so that all roads outside of the 297 grid cells were dropped. Geoprocessing using the Pairwise Intersect tool separates each road centerline dataset by grid. This provides a numerical summary indicating a simple factor of completeness on a smaller scale.

  • The collective length of TIGER road segments exceeds the County street centerline segment length in 162 of the 297 grid cells.
  • The collective length of County street centerline segments exceeds the TIGER road segment length in 134 of the 297 grid cells.
  • Additionally, one grid cell contained zero polylines for either centerline dataset.

Visualization of these results shows the percent difference for the length of TIGER roads centerline data as compared to the County roads centerline data. Statistics were calculated using the formula:
% difference = (total length of centerlines − total length of TIGER roads) / (total length of centerlines) × 100
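Applied per grid cell, the same formula looks like the short sketch below, with hypothetical summed lengths standing in for the lab's grid statistics:

# Percent difference in completeness per grid cell, following the formula above.
# Hypothetical summed lengths in kilometers; not the lab results.
grid_totals = {
    101: {"centerlines_km": 38.2, "tiger_km": 41.0},
    102: {"centerlines_km": 52.7, "tiger_km": 49.9},
    103: {"centerlines_km": 0.0,  "tiger_km": 12.4},
}

for cell, t in grid_totals.items():
    if t["centerlines_km"] == 0:
        print(f"Cell {cell}: no County centerlines present")
        continue
    pct = (t["centerlines_km"] - t["tiger_km"]) / t["centerlines_km"] * 100
    label = "County longer" if pct > 0 else "TIGER longer"
    print(f"Cell {cell}: {pct:+.1f}% ({label})")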
Completeness is aggregated by grid cell: cells with more kilometers of TIGER roads than County roads appear in reds and oranges, while cells where the collective length of County road polylines exceeds that of the TIGER roads data appear in shades of green.

Length comparison between County street centerline data and TIGER roads data
Map showing the geographic distribution in the differences of completeness for the two road datasets

References:

Zanbergen (2004, May). Spatial Data Management: Quality and Control. Quality of Road Networks. Vancouver Island University, Nanaimo, BC, Canada.

Suyoung & O'Hara (2009, December). International Journal of Geographical Information Science 23, 1503-1525.

Haklay (2010, August 1). Environment and Planning B: Planning and Design, 37, 682-703. 


Wednesday, August 28, 2024

Spatial Data Quality - Positional Accuracy of Road Networks

When viewing a map or working with geospatial data, it is generally assumed to be accurate. But this may not always be the case, and many factors can affect accuracy. Unaccounted bias may be present, data may have been digitized at a coarser scale than was required, errors present on a previous dataset used to update a new one could be carried over, etc. So how accurate is a map or geospatial data?

Since 1998, the National Standard for Spatial Data Accuracy (NSSDA) has been the Federal Geographic Data Committee (FGDC) metric used for estimating the positional accuracy of geospatial data in the horizontal or vertical direction. Testing uses well-defined locations to compare observed or sample data to reference or true data. Reference data might be a higher accuracy dataset, such as data at a larger scale (1:24,000 versus 1:250,000), high resolution digital imagery or field survey data.

The NSSDA methodology calculates the positional error using the coordinates of the reference or true points and the observed points of the dataset being tested. The positional error, or error difference, is simply the distance between the true coordinates and the dataset coordinates. It uses the equation
(x_t − x_d)² + (y_t − y_d)², where x_t and y_t are the true (reference) point coordinates and x_d and y_d are the sample point coordinates. The error distance is squared so that there are no negative numbers (no direction to the error).

Observed and True Coordinates and Error Distance used to calculated Positional Error
Positional Error

The error distances for all sample points are summed. That total is averaged for the mean square error. Taking the square root of the mean square error determines the Root Mean Square Error (RMSE) statistic for the data set. The RMSE is then converted using a multiplication factor of 1.7308 for horizontal accuracy and 1.9600 for vertical accuracy. This results in the 95th percentile in map units. The confidence level means that 95% of the positions in the dataset will have an error equal to or lower than the reported accuracy value with regards to true ground position.
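A minimal Python sketch of this calculation chain, using hypothetical observed and reference coordinates in feet rather than the lab points:

import math

# Hypothetical (x, y) pairs in feet: observed test points and their reference points.
observed  = [(1520103.2, 1486520.9), (1521880.4, 1487911.0), (1523004.7, 1485227.3)]
reference = [(1520098.9, 1486524.1), (1521875.0, 1487905.8), (1523009.5, 1485230.0)]

# Squared error distance per point: (x_t - x_d)^2 + (y_t - y_d)^2
sq_errors = [(xt - xd) ** 2 + (yt - yd) ** 2
             for (xd, yd), (xt, yt) in zip(observed, reference)]

mse = sum(sq_errors) / len(sq_errors)       # mean square error
rmse = math.sqrt(mse)                       # root mean square error
nssda_horizontal = rmse * 1.7308            # NSSDA horizontal accuracy at 95% confidence

print(f"RMSE = {rmse:.2f} ft, NSSDA horizontal accuracy = {nssda_horizontal:.2f} ft")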

The second lab for Special Topics in GIS partially returns me to my previous life as a cartographer and map researcher. The subject of the lab is positional accuracy of road networks, and the data provided covers a portion of Albuquerque, New Mexico. One of the projects I worked on at Universal Map was an update for the Albuquerque wall map. Back then we routinely worked with TeleAtlas data, which at the time was a substantial improvement over TIGER data, but far below today's accuracy standards.

The lab works with two feature classes for the study area: a feature class of road centerlines compiled by the city of Albuquerque and streets data from StreetMap USA, a TeleAtlas product. 6" ortho images from 2006 covering the study area represent the reference data.

The second protocol of NSSDA is to collect test points from the data set for which the accuracy needs to be determined. For this we implement a stratified random sampling design, which, while not always possible with some data, is the ideal approach:

  • Data points should not be within a distance of one tenth the length of the diagonal of the study area.
  • Partitioning the study area into four quadrants, each quadrant should have at least 20% of the sampling points.
Sampling of Test Points for the Albuquerque, NM Study Area
Six per quadrant, the sampling of 24 test points for the Albuquerque study area
Within ArcGIS Pro I created a layout of the study area and added guides across the center horizontally and vertically. Points were selected based upon suitability of the ortho imagery, i.e. the reference data. The principle is similar to selecting control points for georeferencing, which ideally uses geometrically linear features such as T-intersections.
Sample Point 20
Using a T-intersection as the reference data for sample point #20
Substantial error distance for StreetMap USA Sample Point 1
Large error distance for StreetMap USA sample point #1
Sample points with matching ID numbers were digitized for both street centerline datasets in new feature classes. A point with the corresponding ID number was digitized in a new reference feature class. Coordinate data for all points was generated using the Add XY Coordinates geoprocessing tool.

Tables for all three feature classes were exported into Microsoft Excel using the Table to Excel geoprocessing tool. Error distances were then calculated between each sample point and its associated reference point. I did this at first with one formula, but then replicated in Excel the horizontal accuracy statistic worksheet provided in the Positional Accuracy Handbook from the Minnesota Planning Land Management Information Center (LMIC).

Horizontal Accuracy Assessment for StreetMap USA data
Horizontal Accuracy Assessment for StreetMap USA data
The calculations result in the error distance squared as compiled in the last column. These values are summed and then averaged. The RMSE is the square root of the mean square error, which multiplied by 1.7308 outputs the NSSDA horizontal accuracy.

Formal accuracy reporting per the FGDC document Geospatial Positioning Accuracy Standards Part 3: National Standard for Spatial Data Accuracy on page 3-5 and the Minnesota IT Services A Methodology for Measuring and Reporting Positional Accuracy in Spatial Data web page:

Tested 12.43 (feet) horizontal accuracy at 95% confidence level for the Albuquerque Streets data set.

Tested 401.65 (feet) horizontal accuracy at 95% confidence level for the Street Map USA data set.

Positional accuracy statements as reported in metadata:

Using the National Standard for Spatial Data Accuracy, the Albuquerque Streets data set tested to 12.43 feet horizontal accuracy at 95% confidence level.

Using the National Standard for Spatial Data Accuracy, the Street Map USA data set tested to 401.65 feet horizontal accuracy at 95% confidence level.



Saturday, August 24, 2024

Spatial Data Quality - Precision and Accuracy Metrics

The first module of Special Topics in GIScience covers aspects of spatial data quality. Furthermore, the associated lab defines and contrasts the concepts of accuracy and precision in spatial data.

Quality generally represents a lack of error, where error in spatial data is the difference between a true value and an observed or predicted value. Rather than unrealistically attempting to know the exact error, an error estimate based upon sampling or another statistical approach or model can be used.

The lab for module 1 includes a point feature class of 50 waypoints collected with a Garmin GPSMAP 76 unit. We are first tasked with determining the precision of the waypoints. Precision is formally defined as a measure of the repeatability of a process. It is usually described in terms of how dispersed a set of repeat measurements are from the average measurement.

Precision reflects the variance of measurements, gauging how close data observations or collected data points are when taken for a particular phenomenon. If the same information is recorded multiple times, how close are the results to one another? Tightly packed results correlate to a high level of precision.

When shooting multiple points of the same object with a GPS unit, the coordinates should be consistent, if not identical. If internal calibrations are off, obstructions exist between the unit and open sky, or simple user error occurs, the recorded points could vary widely. This would equate to low precision.

Accuracy is a measure of error, or a difference between a true value and a represented value. Accuracy is the inverse of error, and perfect accuracy means no error at all. Expressing accuracy in simpler terms, it is the difference between the recorded location of an observation and the true point or reference location of said phenomena.

How close is the recorded data to the actual location of the phenomenon? Inaccuracies can be reported using many methods, such as a mean value, a frequency distribution or a threshold value. Positional accuracy can be measured in the x, y and z dimensions or any combination thereof. It is common to use metrics for horizontal spatial accuracy in two dimensions.

If data is numeric, such as the GPS points for Lab 1, the accuracy error can be expressed using a metric like the root mean square error (RMSE). Precision, on the other hand, is commonly measured using standard deviation or some other measure. The difference between the two is that accuracy is compared to a reference or true value while precision utilizes the average value derived from data collected.

Buffers showing the distance of collected data points for precision and accuracy
Measuring accuracy for the GPS waypoints from the true point and precision from the average waypoint based upon the mean coordinates

Using the 68th percentile, the horizontal precision was 5.62 meters and the horizontal accuracy was 6.01 meters. The average waypoint was 1.13 meters off the recorded true waypoint.
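A minimal sketch of how these two metrics could be computed, using hypothetical waypoint coordinates in meters rather than the Garmin data: precision uses distances to the average waypoint, accuracy uses distances to the true point.

import numpy as np

# Hypothetical waypoint coordinates (meters) and a surveyed true point; not the lab data.
rng = np.random.default_rng(4)
waypoints = rng.normal(loc=[500_000.0, 3_600_000.0], scale=5.0, size=(50, 2))
true_point = np.array([500_001.0, 3_600_001.0])

mean_point = waypoints.mean(axis=0)
dist_to_mean = np.linalg.norm(waypoints - mean_point, axis=1)   # precision distances
dist_to_true = np.linalg.norm(waypoints - true_point, axis=1)   # accuracy distances

precision_68 = np.percentile(dist_to_mean, 68)
accuracy_68 = np.percentile(dist_to_true, 68)
offset = np.linalg.norm(mean_point - true_point)                # bias of the average waypoint

print(f"68th percentile precision: {precision_68:.2f} m")
print(f"68th percentile accuracy:  {accuracy_68:.2f} m")
print(f"Average waypoint offset:   {offset:.2f} m")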

There are additional aspects of accuracy to consider. Temporal accuracy refers to how accurate data is in terms of its temporal representation. This is also referred to as currentness, meaning up to date. There are also scenarios where historical records are more appropriate than up-to-date information.

Thematic accuracy, or attribute accuracy, relates as to whether data contains the correct information to describe the properties of the specific data element. Misclassified data is an example of thematic inaccuracy.

There are scenarios where data can be precise but inaccurate, or imprecise but accurate. If the average of all collected or observed points falls within an acceptable threshold from the true point location, the data can be considered accurate, even if the point locations are widely placed, and therefore imprecise.

Conversely if a number of points are well clustered, but well away from the true point location, this data is considered precise but also inaccurate. This is also referred to as bias, which refers to a systematic error.

The second part of Lab 1 worked with a larger provided dataset of 200 collected points with X,Y coordinates. The RMSE was calculated using Microsoft Excel. A Cumulative Distribution Function (CDF) of the errors was also generated.

CDF showing the error distribution of collected point data
CDF showing the error distribution of collected point data

Rather than focusing on selected error metrics, the CDF gives a visual indication of the entire error distribution. The graph plots cumulative probability against error. The 68th percentile here was 3.18, which matches the location on the CDF plot where the error value on the x-axis corresponds to 68% cumulative probability.
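A quick sketch of how an empirical CDF of error distances could be plotted, with synthetic errors standing in for the 200-point dataset:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic positional errors (map units); not the lab's 200-point dataset.
errors = np.sort(np.abs(np.random.default_rng(5).normal(0, 3, 200)))
cum_prob = np.arange(1, errors.size + 1) / errors.size   # empirical cumulative probability

plt.plot(errors, cum_prob)
plt.axhline(0.68, linestyle="--")                        # mark the 68th percentile
plt.axvline(np.percentile(errors, 68), linestyle="--")
plt.xlabel("Error (map units)")
plt.ylabel("Cumulative probability")
plt.title("Empirical CDF of positional error")
plt.show()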

References:

Zanbergen. Spatial Data Management: Quality and Control. Fundamentals of Spatial Data Quality. Vancouver Island University, Nanaimo, BC, Canada.

Bolstad, P., & Manson, S. (2022). GIS Fundamentals (7th ed.). Eider Press.

Leonardo, Alex. (2024, June 10). Cumulative Distribution Function CDF. Statistics HowTo.com. https://www.statisticshowto.com/cumulative-distribution-function-cdf/


Monday, August 5, 2024

Corridor Suitability Analysis - Coronado National Forest

The final scenario for the lab of GIS Applications Module 6 is to determine a potential protected corridor linking two areas of black bear habitat in Arizona's Coronado National Forest. Data provided included the extent of the two existing areas of known black bear habitat, a DEM, a raster of land cover and a feature class of roads in the study area. Parameters for a protected corridor facilitating the safe transit of black bears included land cover away from populated areas and preferably vegetated, mid-level elevations, and locations far from roadways.

Geoprocessing flow chart for Scenario 4
Geoprocessing flow chart for Scenario 4

The initial geoprocessing in our Corridor Suitability Analysis reclassifies the DEM and landcover rasters into suitability rasters using predetermined scales. The development of a suitability raster for the roads feature class commenced with creating a multi-ring buffer feature class, and then converting the derived polygons into a suitability raster using the Reclassify tool.

Reviewing the previous scenarios on outputting buffers from a polyline, I also ran the Euclidean Distance tool on the roads feature class. The resulting output raster was then reclassified using the distance suitability values, which rank closer proximities with lower suitability. The results mirrored those using the Multi-Ring Buffer tool:
Suitability Raster for proximity to roads using the Euclidean Distance tool
The suitability raster for the distance to roads derived from the raster output from the Euclidean Distance tool.

With suitability raster files finalized for elevation, landcover and proximity to roads, we can proceed with the analysis using the Weighted Overlay tool. The objective is to generate a suitability raster on the integer scale of 1 through 10, based upon influence percentages of 60% for land cover, 20% for elevation and 20% for distance to roads.

The result shows the highest suitability scores for mid-level elevations representative of undeveloped forest land that mostly avoids roads. Low-level elevations represented by urban areas, agriculture and barren land factor into the low suitability areas:
Weighted Overlay raster of Suitability areas
Weighted Overlay raster with the values of 1-10 where lighter colors represent lower suitability scores

Utilizing the Weighted Overlay raster, a cost surface raster is generated by using the Raster Calculator geoprocessing tool. The cost surface values were obtained by inverting the suitability model so that higher habitat suitability values translated into lower travel costs:
Cost Surface Raster
Cost Surface raster where the darker colors represent higher costs
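A minimal sketch of that inversion as it might be written with arcpy map algebra; the raster names are hypothetical, and the 11 − suitability form assumes the 1-10 scale described above:

import arcpy
from arcpy.sa import Raster

arcpy.CheckOutExtension("Spatial")

# Hypothetical raster name; assumes a 1-10 weighted overlay suitability raster.
suitability = Raster("weighted_overlay_suitability")

# Invert the 1-10 scale so that the most suitable habitat has the lowest travel cost.
cost_surface = 11 - suitability
cost_surface.save("cost_surface")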

With the Cost Surface raster, the Corridor Suitability Analysis continues with the Cost Distance tool run on the two Coronado National Forest black bear habitat area feature classes. This outputs Cost Distance and Cost Distance Backlink Rasters.
Coronado N.F. Destination Raster - 1st feature class
The cost distance raster for the northeastern unit of Coronado N.F.
Coronado N.F. Destination Raster - 2nd feature class
The cost distance raster for the southwestern unit of Coronado N.F.

Together the two cost distance rasters for Coronado National Forest are the parameters for the Corridor geoprocessing tool, which generates the Least-Path Corridor raster. The threshold value for determining the best corridors was subjective, so I went with percentages used in the previous scenario, where the minimum destination cost value multiplied by 1.05 represented the optimal corridor. Chose a color scheme based upon the ColorBrewer web site.
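A sketch of how the corridor and its threshold might be scripted with arcpy; the raster names are hypothetical stand-ins for the lab layers, and the 1.05 multiplier follows the threshold described above:

import arcpy
from arcpy.sa import Corridor, Con

arcpy.CheckOutExtension("Spatial")

# Hypothetical cost distance rasters created from the two habitat areas.
corridor = Corridor("cost_dist_northeast", "cost_dist_southwest")
corridor.save("bear_corridor")

# Keep only cells within 5% of the minimum accumulated cost as the optimal corridor.
threshold = corridor.minimum * 1.05
optimal = Con(corridor <= threshold, 1)
optimal.save("bear_corridor_optimal")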

Black Bear Suitability Corridor Analysis
The Least-Path Corridor for a protected Black Bear Corridor between Coronado National Forest units



Sunday, August 4, 2024

Least-Cost Path and Corridor Analysis with GIS

The second half of Module 6 for GIS Applications conducts Least-Cost Path and Corridor Analysis on two scenarios. The first continues working with the Jackson County, Oregon datasets from Scenario 2.

There are several ways that GIS measures distance. Euclidean, the simplest, represents travel across a straight line or "as the crow flies". Manhattan distance simulates navigating along a city street grid, where travel is restricted to north-south or east-west directions. Network analysis models travel in terms of time, where travel is restricted to a road network or transit infrastructure.
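The contrast between the first two measures is easy to show in a few lines of Python, using two hypothetical points in map units:

import math

# Straight-line (Euclidean) versus grid-constrained (Manhattan) distance
# between two hypothetical points, in map units.
x1, y1 = 0, 0
x2, y2 = 300, 400

euclidean = math.hypot(x2 - x1, y2 - y1)        # "as the crow flies"
manhattan = abs(x2 - x1) + abs(y2 - y1)         # along a street grid

print(euclidean, manhattan)   # 500.0, 700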

Least-Cost Path Analysis models travel across a surface. It determines the single best course, a polyline, that has the lowest cost for a given source and destination, which are represented by points. This can be described as the routing over a landscape that is not restricted by road networks. 

The course through the landscape is modeled as a cost. More specifically each cell in a cost raster has a value which represents the cost of traveling through it.

Typical cost factors are slope and land cover. A cost surface can vary from just a single factor to a combination of them. Even if multiple factors are considered, the analysis only uses a single cost raster.

Least-Cost Path Analysis can be expanded to Corridor Analysis. Instead of resulting in a single best solution represented by a polyline, corridor analysis produces multiple solutions, representing a zone where costs are close to the least cost. The corridor width used is somewhat subjective. It is controlled by deciding what range of cost to consider. Values from a few percentage points above the lowest cost to as much as 10% above it are common.

Scenario 3 uses least-cost path analysis on an area of land in the planning for a potential pipeline. Cost factors include elevation, proximity to rivers and potential crossings of waterways. Datasets used for these cost factors include a DEM, a rivers feature class and feature classes determining the source and destination of the proposed pipeline. Analysis proceeds focusing on each cost factor individually.


Geoprocessing Flowchart for Scenario 3 - Analysis D
Geoprocessing flowchart for least-cost path analysis factoring solely on slope

Focusing first on the DEM, the raster is converted to a slope raster, and subsequently reclassified using a cost factor range of eight values. The next analysis step utilizes the Cost Distance geoprocessing tool. Using an iterative algorithm, a cost distance raster is generated that represents the accumulated cost to reach a given cell from the source location point.

A cost backlink raster is also created, which traces back how to reach a given cell from the source. This reveals the actual path utilized to obtain the lowest cost. The cell values of the backlink raster represent cardinal or intercardinal directions (NE, NW, etc.) instead of cost. The combination of the two output rasters contains every least-cost path solution from the single source to all cells within the study area.

Cost Distance Raster
Cost Distance Raster - Values represent the cost of traveling through a cell
Cost Distance Backlink Raster
Cost Distance Backlink Raster - Values correspond with compass directions

The final step of least cost path analysis obtains the least cost path from the source to one or more destinations. The result of the Cost Path geoprocessing tool, this consists of a single polyline representing the lowest accumulated cost.

Output least-cost path and DEM for a proposed pipeline in Oregon
The result of Least-Cost Path Analysis solely on slope
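The workflow up to this point might be scripted with arcpy roughly as follows. The dataset names and reclassification breaks are hypothetical placeholders, and the Cost Path tool outputs a raster path rather than a polyline:

import arcpy
from arcpy.sa import Slope, Reclassify, RemapRange, CostDistance, CostPath

arcpy.CheckOutExtension("Spatial")

# Slope from the DEM, reclassified into cost values (hypothetical break values).
slope = Slope("dem", "DEGREE")
cost = Reclassify(slope, "VALUE",
                  RemapRange([[0, 5, 1], [5, 10, 3], [10, 20, 5], [20, 90, 8]]))

# Accumulated cost and backlink rasters from the source point.
backlink_path = "cost_backlink"
cost_dist = CostDistance("source_point", cost, out_backlink_raster=backlink_path)

# Least-cost path (as a raster) from the destination back to the source.
path = CostPath("destination_point", cost_dist, backlink_path, "BEST_SINGLE")
path.save("least_cost_path")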

Continuing our analysis of a proposed pipeline in Jackson County, Oregon, we factor in river crossings as a cost factor.

Geoprocessing Flowchart for Scenario 3 - Analysis E
Geoprocessing flowchart for least-cost path analysis factoring in both slope and river crossings

Factoring river crossings into the cost analysis reduces potential crossings to five, down from the 16 when factoring in slope alone:

Least-Cost Path Analysis factoring in both river crossings and slope

Furthering our analysis, we change from factoring in river crossings to instead factor in the distance to waterways. Using a multiple ring buffer, cost factors are set high for areas within 100 meters of hydrology and moderate for areas within 500 meters. Distances beyond 500 meters from a waterway are zero, reflecting no cost.

Geoprocessing Flowchart for Scenario 3 - Analysis F
Geoprocessing flowchart for least-cost path analysis factoring slope and proximity to waterways

As the cost factor criteria for the least-cost path analysis are adjusted to better account for natural factors, the least cost path adjusts accordingly:
Least-Cost Path Analysis with the cost factors of Slope and Proximity to Waterways

The final analysis for the third scenario looks at Least-Cost Path Corridor Analysis, which includes not only the least-cost path, but also a multitude of other least-cost alternatives within a corridor determined on a case-by-case basis.
Geoprocessing Flowchart for Scenario 3 - Analysis G
Geoprocessing flowchart for least-cost path corridor analysis

The geoprocessing to develop the least-cost path corridor utilizes the previously generated cost raster factoring in the proximity to waterways and slope. Instead of using the source point as the feature source data, the Cost Distance tool is based off the destination point feature class.

Together the two cost distance rasters, one based off the "destination" feature class and the other off the "source" feature class, are input into the Corridor geoprocessing tool. This outputs the Least-Path Corridor raster, with areas of least-cost paths symbolized based upon a percentage above the minimum cost value:
Least-Cost Path Corridor Analysis for a proposed pipeline


Thursday, August 1, 2024

Suitability Modeling with GIS

Module 6 for GIS Applications includes four scenarios conducting Suitability and Least-Cost Path and Corridor analysis. Suitability modeling identifies the most suitable locations based upon a set of criteria. Corridor analysis compiles an array of all the least-cost path solutions from a single source to all cells within a study area.

For a given scenario, suitability modeling commences with identifying the criteria that define the most suitable locations. Parameters specifying such criteria could include aspects such as percent grade, distance from roads or schools, elevation, etc.

Each criterion next needs to be translated into a map, such as a DEM for elevation. The maps for each criterion are then combined in a meaningful way. Often Boolean logic is applied to the criteria maps, where suitable is assigned the value true and non-suitable is false. Boolean suitability modeling overlays the maps for all criteria and then determines where every criterion is met. The result is a map showing areas that are suitable versus not suitable.

Another evaluation system in suitability modeling uses scores or ratings. This approach expresses each criterion as a map showing a range of values from very low suitability to very high, with intervening values in between. Suitability is expressed as a dimensionless score, often by using map algebra on the associated rasters.

Scenario 1 for lab 6 analyzes a study area in Jackson County, Oregon for the establishment of a conservation area for mountain lions. Four criteria are specified. Suitable areas must have slopes exceeding 9 degrees, be covered by forest, be located within 2,500 feet of a river and lie more than 2,500 feet from highways.

Flow Chart outlining the Suitability Modeling
Flowchart outlining input data and geoprocessing steps.

Working with a raster of landcover, a DEM and polyline feature classes for rivers and highways, we implement Boolean Suitability modeling in Vector. The DEM raster is converted to a slope raster, so that it can be reclassified into a Boolean raster where slopes above 9 degrees are assigned the value of 1 (true) and those below are assigned 0 (false). The landcover raster is simply reclassified so that cells assigned to the forest land use class are true in the Boolean.

Buffers were created on the river and highway feature classes, where areas within 2,500 feet of the river are true for suitability and areas within 2,500 feet of the highway are false. Once the respective rasters are converted to polygons and the buffer feature classes clipped to the study area, a criteria union is generated using geoprocessing. Suitability is deduced based upon the Boolean values of that union feature class and selected with a SQL query to output the final suitability selection.

We repeat this process, but utilizing Boolean Suitability in Raster. Using the Euclidean Distance tool in ArcGIS Pro, buffers for the river and highway feature classes were output as raster files where suitability is assigned the value of 1 for true and 0 for false. Utilized the previously created Boolean rasters for slope and landcover.

Obtaining the suitable selection raster from the four rasters utilizes the Raster Calculator geoprocessing tool. Since the value of 1 is true for suitability in each of the four rasters, simply adding the cell values results in a range of 0 to 4, where 4 equates to fully suitable. The final output was a Boolean where 4 was reclassified as 1 and all other values were assigned NODATA.
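That map algebra step might look like the following with arcpy; the raster names are hypothetical placeholders for the four Boolean rasters:

import arcpy
from arcpy.sa import Raster, SetNull

arcpy.CheckOutExtension("Spatial")

# Hypothetical Boolean rasters (1 = suitable, 0 = not) for the four criteria.
total = (Raster("slope_bool") + Raster("forest_bool") +
         Raster("river_bool") + Raster("highway_bool"))

# Keep only cells meeting all four criteria (sum == 4); everything else becomes NoData.
suitable = SetNull(total != 4, 1)
suitable.save("boolean_suitability")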

Scenario 2 determines the percentage of a land area suitable for development in Jackson County, Oregon. The suitability criteria rank land areas comprising meadows or agricultural areas as most optimal. Additional criteria include soil type, slopes of less than 2 degrees, a 1,000 foot buffer from waterways and a location within 1,320 feet of existing roads. Input datasets consist of rasters for elevation and landcover, and feature classes for rivers, roads and soils.

Flowchart showing data input and processes to output a weighted suitability raster
Flowchart of the geoprocessing for Scenario 2

With all five criteria translated into respective maps, we proceed with combining them into a final result. However, with Scenario 2 the Weighted Overlay geoprocessing tool is implemented. This tool applies a percentage influence to each input raster corresponding to that raster's significance to the criteria. The percentages across the raster inputs must total 100 and all rasters must be integer-based.

Cell values of each raster are multiplied by their percentage influence and the results summed to generate an output raster. The first weighting evaluated for lab 6 is an equal weight scenario, where the five raster inputs have the same percentage influence (20%). The second scenario assigns a heavier weight to slope (40%) while retaining 20% influence for the land cover and soils criteria and decreasing the percentage influence of the road and river criteria to 10% each. The final comparison between the two scenarios:

Land Development Suitability Modeling - Jackson County, OR
Opted to symbolize the output rasters using a diverging color scheme from ColorBrewer.
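The weighted overlay arithmetic itself reduces to a cell-by-cell weighted sum. A tiny numpy sketch with synthetic 1-10 suitability rasters (not the lab data), using the second scenario's weights:

import numpy as np

# Tiny synthetic suitability rasters on a 1-10 integer scale; not the lab data.
rng = np.random.default_rng(6)
rasters = {name: rng.integers(1, 11, (4, 4))
           for name in ("slope", "landcover", "soils", "roads", "rivers")}

# Percentage influences from the second weighting scenario.
weights = {"slope": 0.40, "landcover": 0.20, "soils": 0.20, "roads": 0.10, "rivers": 0.10}

# Weighted overlay: multiply each raster by its influence and sum cell by cell.
overlay = sum(weights[name] * rasters[name] for name in rasters)
print(np.round(overlay).astype(int))   # integer output, as the Weighted Overlay tool produces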

Wednesday, July 24, 2024

Coastal Flooding Analysis - Storm Surge

Module 4 for GIS Applications performs analyses on coastal flooding and storm surge. Storm surge is generally associated with landfalling tropical storms and hurricanes, but it can also be attributed to extratropical storms, such as Nor'easters along the Eastern Seaboard, or powerful winter storms with low barometric pressure and tight wind gradients. Coastal flooding events can also be due to spring tide events based upon the moon's cycle.

Storm surge from Hurricane Idalia inundated Bayshore Boulevard in Tampa, FL
Storm surge inundating Bayshore Boulevard in Tampa during Hurricane Idalia on August 30, 2023.

The first lab assignment revisits Superstorm Sandy, which made landfall as a hurricane transitioning into a powerful extratropical storm along the New Jersey coastline on October 29, 2012. The second and third parts of the lab assignment use Digital Elevation Models (DEMs) to develop scenarios for a generalized storm surge.

The lab analysis on Hurricane Sandy works with LiDAR data covering a barrier island along the Atlantic Ocean between Mantoloking and Point Pleasant Beach, New Jersey. LAS files were downloaded showing the conditions before the storm's impact and afterward.

Initial work in the lab for Module 4 created DEMs by converting the two LAS files to TIN files using geoprocessing in ArcGIS Pro. The TINs were then converted to rasters with a separate geoprocessing tool that ran upwards of ten minutes.

Comparing the two raster datasets, some pronounced impacts from the hurricane turned extratropical storm were visible. Several datapoints representing structures along the beach were noticeably missing. Additionally, a wide breach was cut across the island, with several smaller breaches visible further north. It also appears that severe scouring of the sand along the coast occurred, with a wide area of lower data returns in the post-Sandy dataset.

Indicative of the large file size of LiDAR data, subtracting the raster cell values of the post-Sandy dataset from the pre-Sandy dataset took 12 minutes and 59 seconds of geoprocessing. The result is a raster with values ranging from 33.69 to -35.87. Values toward the high end reflect earlier LiDAR returns, representing the build-up of material such as sand or debris. Lower values in the change raster indicate later returns, or bare-earth returns. These correlate to areas where significant erosion may have occurred or where a structure was destroyed.

The change in the LiDAR point clouds reveals parcels where homes were destroyed or where the barrier island was breached by storm surge. The change raster quantifies the amount of change.


LiDAR before Superstorm Sandy

LiDAR showing a major breach caused by Superstorm Sandy

The difference between the two LiDAR point clouds showing the breach and associated destruction of structures

Recent aerial imagery of Mantoloking, NJ where the breach occurred

The overall impact of Hurricane Sandy on the boroughs of Mantoloking, Bay Head and Point Pleasant Beach in Ocean County, New Jersey:

The raster quantifying the rate of change between the LiDAR datasets before and after Sandy

Output raster using a Boolean

The second analysis for Module 4 utilizes a storm surge DEM for the state of New Jersey. Our task was to reclassify the raster where all cells with values of 2 meters or less constitute areas potentially submerged as a result of Hurricane Sandy. Those cells with values above 2 meters were classified as "no data."

I began the process by adding a new field to the DEM for flooded areas due to storm surge. Cells where the elevation value was equal to or less than 2 were assigned a flood value of 1, for the Boolean true. All other cells, with an elevation value above 2, were assigned 0, for false.

With the added field, I used the Reclassify geoprocessing tool to output a raster of the DEM showing potentially flooded areas versus those at higher ground. The mask was set to the feature class of the New Jersey state outline to exclude areas of the DEM outside of the state that were not needed for our analysis.
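One way this reclassification could be scripted with arcpy map algebra; the raster and mask names are hypothetical, and a simple conditional stands in for the Reclassify tool used in the lab:

import arcpy
from arcpy.sa import Con, Raster

arcpy.CheckOutExtension("Spatial")
arcpy.env.mask = "NewJersey_boundary"   # hypothetical state outline feature class

# Cells at or below 2 meters are flagged as potentially flooded (1); others as 0.
dem = Raster("nj_storm_surge_dem")      # hypothetical DEM name
flooded = Con(dem <= 2, 1, 0)
flooded.save("nj_flood_2m")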

Our analysis then focused on Cape May County in South Jersey, where we quantify the percentage of the county potentially inundated by a 2 meter storm surge. The storm surge raster was converted to a polygon and subsequently clipped to the polygon of the Cape May County boundary.

One issue encountered was that the storm surge data and county boundary were in different units of measurement. Ended up clipping the storm surge polygon with the county polygon, then comparing the output with the unclipped county boundary for the final percentage. This workaround succeeded as both outputs used the same units.

Clipped feature class of the storm surge polygon over Cape May County, NJ
2-meter storm surge data clipped to Cape May County, NJ

The third analysis for Lab 4 focuses on a potential 1 meter storm surge in Collier County, Florida. Two DEMs are provided, one derived from LiDAR data and another from the standard USGS elevation model. Commenced working with this data by reclassifying each DEM into a new Boolean raster where any elevation of 1 meter or less is considered flooded and anything above is not flooded.

Since we are only interested in storm surge related flooding, any areas shown inland that are entirely disconnected from the tidal basin are omitted from analysis. Accomplished this by using the Region Group geoprocessing tool, where all cells in a raster are reclassified by group and assigned a new ObjectID number.

The Region Group tool takes all of the cells within the hydrologic area of open waters extending into the Gulf of Mexico, and all associated bays and waterways seamlessly feeding into it, and assigns them to a single ObjectID. Similarly, the mainland of Florida is assigned an ObjectID as well. Islands, lakes, ponds, etc. that are independent of one another are also assigned unique ObjectID numbers.
Results of Region Group geoprocessing
Region Group assigns a unique ObjectID for each homogenous area of raster cells. The different colors in this sample from Naples show separate groups for each land and hydrologic feature based upon the 1 meter elevation threshold
Using the Extract by Attributes geoprocessing tool, selecting the hydrologic area comprising the entire tidal basin is straightforward once the ObjectID number is determined. With that, a new raster comprising just the water areas subject to storm surge is output and subsequently converted to a polygon. The polygon feature class was juxtaposed with a feature class of building footprints for quantitative analysis.
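A sketch of those two steps with arcpy; the flooded raster name and the group value of 42 are hypothetical placeholders for the ID identified interactively in the lab, and the where clause on the Value field is an assumption:

import arcpy
from arcpy.sa import RegionGroup, ExtractByAttributes, Raster

arcpy.CheckOutExtension("Spatial")

# Group contiguous cells of the flooded/not-flooded Boolean raster (hypothetical name).
regions = RegionGroup(Raster("collier_flood_1m"), "EIGHT")
regions.save("collier_regions")

# Keep only the region representing the open tidal basin; 42 is a placeholder
# for the group identified interactively in the lab.
tidal_basin = ExtractByAttributes(regions, "Value = 42")
tidal_basin.save("collier_tidal_basin")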

There are a variety of methods in ArcGIS Pro that can be used to determine the number of buildings impacted by a 1 meter storm surge. One such process was to Select by Location based upon the Intersect relationship. This selects records where any part of a building footprint polygon falls within the storm surge polygon. Having pre-added two fields to the buildings feature class based upon the Boolean of 1 = impacted and 0 = unaffected, I used Calculate Field to assign a value of 1 to the selected records. Repeated the process for both rasters and then proceeded with statistical calculations.

The final analysis quantified whether a building was located within the storm surge zone for the LiDAR-based DEM, the USGS-based DEM, or both. Errors of omission were calculated where a building was impacted by storm surge in the LiDAR DEM but not the USGS DEM, with that total divided by the overall number of buildings affected in the LiDAR DEM. Errors of commission were calculated from the opposite case, buildings impacted in the USGS DEM but not the LiDAR DEM, with that total also divided by the overall number of buildings affected in the LiDAR DEM. The result tabulates affected buildings by feature type:

Storm surge inundation of areas 1 meter or less in elevation based upon DEMs
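A small sketch of the omission and commission arithmetic described above, using hypothetical building counts rather than the lab results:

# Hypothetical counts of buildings flagged as flooded; not the lab results.
lidar_only = 35    # flooded in the LiDAR DEM but not the USGS DEM
usgs_only = 12     # flooded in the USGS DEM but not the LiDAR DEM
both = 410         # flooded in both DEMs

lidar_total = lidar_only + both             # all buildings affected in the LiDAR DEM

omission = lidar_only / lidar_total * 100   # missed by the USGS DEM
commission = usgs_only / lidar_total * 100  # flagged only by the USGS DEM

print(f"Errors of omission: {omission:.1f}%")
print(f"Errors of commission: {commission:.1f}%")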