DeGAUSS

Geocoding Address Information

Geocoding broadly means converting address information into map coordinates (latitude and longitude), but it can also include related tasks such as determining the census tract for an address.

Software developers often need to geocode address information for use in applications. Third parties such as Google and Microsoft provide APIs for this purpose, but these come with costs, rate limits, or both. They also require you to share your data with an external party.

If you need to geocode a large volume of addresses, or need to keep the address information private, DeGAUSS is an excellent alternative. DeGAUSS takes a two-column csv file as input (ID, address), processes it in a Docker container, and writes the geocoded information to a second csv file.

Getting Started

First, install Docker if you have not already done so. If you are unfamiliar with Docker, it is a mechanism for running software in an isolated environment, similar to a virtual machine but with less overhead and maintenance.

Once Docker is installed, open PowerShell and enter: docker run --rm hello-world

This will download a sample Docker image, run it as a container, and then remove the container once it exits (because of the --rm). Look for the text "This message shows that your installation appears to be working correctly."

DeGAUSS

Create a temporary folder on your C: drive called temp (C:\temp).

Create or download a csv file containing a few sample addresses. Name the file sample_addresses.csv and place it in the temp folder.
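If you would rather generate the sample file than type it by hand, the following is a minimal Python sketch. The header names id and address, the two sample addresses, and the helper name write_sample_addresses are assumptions made for illustration; check the DeGAUSS geocoder documentation for the exact column name it expects.

```python
import csv
from pathlib import Path

# Illustrative rows only; replace with your own IDs and addresses.
SAMPLE_ROWS = [
    {"id": "1", "address": "3333 Burnet Ave Cincinnati OH 45229"},
    {"id": "2", "address": "1600 Pennsylvania Ave NW Washington DC 20500"},
]


def write_sample_addresses(folder: str) -> Path:
    """Write a two-column csv (id, address) into folder and return its path."""
    path = Path(folder) / "sample_addresses.csv"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "address"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return path
```

Calling write_sample_addresses(r"C:\temp") would place the file in the folder created above.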

In PowerShell enter: docker run --rm -v C:\\temp:/tmp ghcr.io/degauss-org/geocoder:latest sample_addresses.csv

Docker will download the DeGAUSS geocoder image and process the addresses in your csv file. The image is about 6 GB, so the download will take a few minutes. Subsequent runs will not download the image again unless it has been updated. Note that Docker downloads images in "layers", so you will see multiple items being downloaded in PowerShell.

When the download and processing are complete, you will see "FINISHED!" in PowerShell.

There will now be a second file in your C:\temp folder with a name like "sample_addresses_geocoder_3.2.0_score_threshold_0.5.csv". Open this file and you will see your original information plus additional columns containing the geocoded data, including latitude and longitude. If the last column in the csv file contains the text "geocoded", DeGAUSS was able to geocode the address with sufficient precision. Any other text indicates the address wasn't fully geocoded.
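For larger files it can be easier to count the outcomes than to scan the last column by eye. The sketch below tallies the values in the last column of the output file, relying only on the statement above that the result text lives there. The function name summarize_results and the demo rows (including the header name geocode_result and the non-"geocoded" value) are made up for illustration and are not taken from a real run.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def summarize_results(handle: TextIO) -> Counter:
    """Count rows by the value found in the last column of the csv."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    return Counter(row[-1] for row in reader if row)


# Made-up rows standing in for a real geocoder output file.
demo = io.StringIO(
    "id,address,lat,lon,geocode_result\n"
    "1,first address,39.1,-84.5,geocoded\n"
    "2,second address,,,not_geocoded\n"
)
print(summarize_results(demo))
```

To run it against a real output file, open the file with open(path, newline="", encoding="utf-8") and pass the handle to summarize_results.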

Census Tract

A census tract is a small geographic area defined by the United States Census Bureau, generally home to between 1,200 and 8,000 people. Statistics about an area, such as median income, are often reported by census tract, so it is often helpful to know the corresponding census tract for a given address. This can be determined using DeGAUSS as well.

In PowerShell enter: docker run --rm -v C:\\temp:/tmp ghcr.io/degauss-org/census_block_group:latest sample_addresses_geocoder_3.2.0_score_threshold_0.5.csv 2020

Note that this command uses the output file of the previous step as its input file. The value "2020" is appended to the command to tell DeGAUSS to use boundaries from the 2020 census, the most recent one.
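If you want to script the two steps rather than type them, one possible approach is to build the same docker commands as argument lists in Python. This is only a sketch: degauss_command and run_pipeline are hypothetical helper names, and the name of the geocoded file is passed in explicitly because it depends on the geocoder version that was actually run.

```python
import subprocess
from typing import List


def degauss_command(folder: str, image: str, input_csv: str, *extra: str) -> List[str]:
    """Build the docker argument list used in this article for one DeGAUSS step."""
    return [
        "docker", "run", "--rm",
        "-v", f"{folder}:/tmp",                 # mount the working folder at /tmp
        f"ghcr.io/degauss-org/{image}:latest",
        input_csv,
        *extra,                                 # e.g. "2020" for census_block_group
    ]


def run_pipeline(folder: str, input_csv: str, geocoded_csv: str) -> None:
    """Run the geocoder, then pass its output file to census_block_group."""
    subprocess.run(degauss_command(folder, "geocoder", input_csv), check=True)
    subprocess.run(
        degauss_command(folder, "census_block_group", geocoded_csv, "2020"),
        check=True,  # raise if docker exits with a non-zero status
    )
```

Because the arguments are passed as a list and no shell is involved, the folder can be written with single backslashes, for example degauss_command(r"C:\temp", "geocoder", "sample_addresses.csv").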

Docker will download the DeGAUSS census_block_group image and process the information in your csv file. The image is about 5 GB, and the download and processing work the same way as in the previous step.

When the download and processing are complete, you will again see "FINISHED!" in PowerShell.

There will now be a third file in your C:\temp folder with a name like "sample_addresses_geocoder_3.2.0_score_threshold_0.5_census_block_group_0.5.0_2020.csv". Open this file and you will see your geocoded information plus additional columns containing census block group and census tract information.
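Census Bureau identifiers (GEOIDs) are nested: a block group GEOID is 12 digits made up of a 2-digit state code, a 3-digit county code, a 6-digit tract code, and a 1-digit block group number, so the tract GEOID is simply the first 11 digits. The sketch below splits such an identifier into its parts. It assumes the value has been read from the csv as text (so leading zeros survive), and the function name and the example identifier are illustrative only.

```python
from typing import Dict


def split_block_group_geoid(geoid: str) -> Dict[str, str]:
    """Split a 12-digit census block group GEOID into its components."""
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f"expected a 12-digit GEOID, got {geoid!r}")
    return {
        "state": geoid[0:2],        # state FIPS code
        "county": geoid[2:5],       # county FIPS code within the state
        "tract": geoid[5:11],       # tract code within the county
        "block_group": geoid[11],   # block group number within the tract
        "tract_geoid": geoid[:11],  # full tract identifier
    }


# Illustrative identifier, not taken from a real run.
print(split_block_group_geoid("390610032001"))
```

Spreadsheet programs often convert these identifiers to numbers and drop leading zeros, which is one reason to treat them as text.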

A Note on Docker

If you don't plan to use Docker regularly, you might want to uninstall it, or open the settings in Docker Desktop and disable the option "Start Docker Desktop when you log in", as there is some overhead to always having Docker running in the background on your PC.

Conclusion

Using DeGAUSS with Docker is a great way to privately geocode large quantities of address data. The process can be used to obtain latitude, longitude, census block group, and census tract data from a set of addresses stored in a csv file.