I recently completed a project for a client that involved some socioeconomic status (SES) data. This data is also known as Social Determinants of Health (SDH). Examples of this data are statistics related to housing, education, and jobs. A specific example would be "median household income".
Many research studies are done to explore correlations between this data and various real-world conditions.
One major source of SES data is the American Community Survey (ACS). This data is gathered by the US Census Bureau. The ACS is different than the once-a-decade census most of us are familiar with. The ACS is an annual survey of approximately 3.5 million US households regarding data on "how we live". The resulting dataset is publicly available on the US Census Bureau website.
The data I recently used was 5 year data by state. This dataset consists of a .zip file for each state (and the District of Columbia and Puerto Rico). Each .zip file contains a series of .txt files in csv (comma separated values) format.
For example, if you download the "Nebraska_Tracts_Block_Groups_Only.zip" file, you will see that it contains 284 text files. The files that begin with "e" are the actual statistics. The files that start with "m" are details about the margin of error of the various statistics.
None of the files contain column headers, so if you want know which data is in which columns, you you need to download the "2019_5yr_Summary_FileTemplates.zip" file. This file contains a series of Excel (.xlsx) files that index the "e" files. For example, if you wanted to find data for "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS", you will see this column listed in file seq58.xlsx, under column number 177. This means that if you parse file e20195ne__0058__000.txt (file #58) column number 177, you will find data on the median household income of Nebraska.
Each row in the "e" files are keyed to a location. The third column is the state, and the sixth column is the "logical record number" (LOGRECNO).
To make sure of the logical record number, you will need to download the "2019_ACS_Geography_Files.zip" file. This contains a series of files by state that map the location (logical record number) to its corresponding "Geography ID" value.
A Geography ID can be one of several things. We are using "census tracts". A census tract is a subdivision of a county, and is specifically divided up so that they each contain about 4000 people. You can find out the census tract for a given address using this tool on the US Census Bureau website.
Let's tie this all together now with an example. If I look up my address, I see that I am located in census tract "31055007314".
If I look up this census tract in the NE Geography file, I see that is is associated to logical record number 0003267:
NE - 0003267 - 14000US__31055007314__ - Census Tract 73.14, Douglas County, Nebraska (line 3268 of the spreadsheet).
If I then wanted to find the median household income for this census tract, I go to file #58, column #177 (based on the logic above), and whichever row is keyed to 0003267. I then find that the value is $108,854.
Very cool! You can browse the index files (the seqXX.xlsx files) to see all the different data elements you can find information on.
As you might imagine, a lot of studies are done to try to find correlations using this socioeconomic status data. For example, in "Neighbourhood socioeconomic status and biological ‘wear and tear’ in a nationally representative sample of US adults" Bird, Chloe E et al. find that "People who live in disadvantaged neighbourhoods experience increased physiologic dysregulation, irrespective of their own SES". The article defines specific set of variables related to education, employment, poverty, public assistance, family structure, and income that are used to derive an indicator of socioeconomic status. This statistic is referred to as "The Bird Index" in Gardner, Bret J., et al. "Incorporating a location-based socioeconomic index into a de-identified i2b2 clinical data warehouse.". This paper concludes that a lower Bird Index correlates to a higher number of emergency room visits.
In summary, The Bird Index is calculated as follows, using ACS data:
Each statistic (S1-S6) is converted to a z-score for each logical record number.
Next sum of the z-scores of the six statistics S1-S6, using the following formula, which assigns a negative value to the statistic when a higher absolute value indicates a negative effect on SES:
S_SUM = -S1 + -S2 + S3 + -S4 + -S5 + -S6 (note only S3 household income has a positive effect)
Now, again convert the resulting values to z-scores. These values are the Bird Index for each logical record number.
If permitted I'll publish the code for this to my github repository, but I'll summarize the details for now.
I used python for this work, taking advantage of the pandas library. I first created a list of all of the relevant ACS columns. For each column I then searched the index (seqXX.xlsx) files for the column's location, determining both the file number and the column number. I then extracted the data from the corresponding "e" file for every state using pandas. I stored this data in a SQL Server table, using bulk insert. Each column in this table was one of the ACS columns, and each row was a state/logical record number combination.
I wrote a TSQL stored procedure to calculate the Bird Index for each state/logical record number using the above data.
Per the US Census Bureau website, "The data collected through the American Community Survey and the 2020 Census help determine how to distribute more than $675 billion of federal spending each year."
This documents summarizes some specific use cases of ACS data related to healthcare. Hopefully this summary helps others take advantage of this important dataset.