Chapter 4 Description of Data

The data were extracted from the Stanford Open Policing Project data hub. The Open Policing Project’s website contains ninety standardized city- and state-wide stop datasets in both RDS and CSV format. From there, we uploaded six datasets to a MySQL database and queried the data locally on our personal laptops via the RMySQL package within R.

The possible variables within each dataset are listed below, along with a description and the corresponding units.

Variable | Description | Units
Stop Date | The date of the stop (year, month, day) | YYYY-MM-DD
Stop Time | The time of the stop | HH:MM:SS
Stop Location | The street intersection of the stop | Names two streets; occasionally lists a city
Driver Race | The race of the driver as given on the driver’s license | String representing the subject’s race
Driver Sex | The sex of the driver | Binary: Female/Male
Driver Age | The age of the driver | Integer representing the driver’s age
Search Conducted | Whether a search was conducted | True (search conducted) or False (no search conducted)
Contraband Found | Whether contraband was found after a search was conducted | True (contraband found) or False (no contraband found)
Citation Issued | Whether a citation was issued to the driver | True (citation issued) or False (no citation issued)
Warning Issued | Whether a warning was issued to the driver | True (warning issued) or False (no warning issued)
Frisk Performed | Whether a frisk was performed on the driver | True (frisk performed) or False (no frisk performed)
Arrest Made | Whether the driver was arrested | True (arrest made) or False (no arrest made)
Reason for Stop | Why the driver was stopped | String describing the reason for the stop
Violation | The violation when a citation or warning was issued | String representing the violation

We focused on the datasets for San Francisco, San Diego, Oakland, Raleigh, Charlotte, Durham, and San Antonio. In the table below, an “X” indicates that the dataset for that particular city includes the specified variable.

City Number.of.Observations Stop.Date Stop.Time Stop.Location Driver.Race Search.Conducted Contraband.Found Citation.Issued Warning.Issued Frisk.Performed Arrest.Made Reason.For.Stop Violation
San Francisco 905,070 X X X X X X X X X X
Oakland 133,405 X X X X X X X X X
San Diego 382,844 X X X X X X X X X X
Charlotte 1,598,453 X X X X X X X X X X
Durham 326,024 X X X X X X X X X X
Raleigh 856,400 X X X X X X X X X X
San Antonio 1,040,428 X X X X X X X X X

While many of the datasets included a multitude of the variables above, there were clear limitations. First, while the California datasets were rich in variables, the “reason for stop” was dubious at best: many reasons made no logical sense, up to four reasons could be listed for a single observation, and the “reason for stop” is not recorded uniformly across the California datasets. In contrast, the North Carolina datasets had a clean “reason for stop”; however, they do not include latitude/longitude coordinates, which limits the geographical comparisons we can make. Additionally, only a small fraction of the city and state datasets contained subject age, and this scarcity makes national comparisons involving age difficult. Lastly, only a few datasets, such as Chicago’s, contained the police officer’s identification number. While we did not investigate the effect of individual officers’ age, race, or other characteristics, it would be valuable for more datasets to include that information so that national correlations could be drawn.

4.1 Querying Datasets with RMySQL

First, import the necessary packages to work with SQL in R.

library(XML)        # htmlParse() and xpathSApply() for web scraping
library(RMySQL)     # MySQL database connections (loads DBI)
library(tidyverse)  # data wrangling; provides the pipe (%>%) and stringr's str_split()
library(knitr)      # report rendering and tables
library(rlist)      # list.which() for locating logical columns

Next, set up a connection to the database.

# Connect to database
con <- dbConnect(
  MySQL(), host = "traffic.st47s.com", user = "student",
  password = "Sagehen47", dbname = "traffic")

Note: if you are looking to modify the database, you must use a different username and password, which can be acquired from Professor Hardin.
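For reference, a write-enabled connection (referred to as con2 in Section 4.2) would be created the same way; the username and password below are placeholders rather than real credentials.

# Placeholder credentials: replace with the write-access account
# obtained from Professor Hardin
con2 <- dbConnect(
  MySQL(), host = "traffic.st47s.com", user = "write_user",
  password = "********", dbname = "traffic")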

dbGetQuery returns a data frame, so storing the result in a variable adds that data frame to the global environment.

raleigh_df <- DBI::dbGetQuery(con, "SELECT * FROM NCraleigh LIMIT 15")
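The LIMIT clause keeps this first query small; any valid SQL can be sent the same way. For instance, to count the rows in the Raleigh table rather than pull them (n_stops is just an illustrative alias):

# Count the number of recorded stops in the Raleigh table
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n_stops FROM NCraleigh")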

4.1.1 Querying Subsets of a Dataset

Certain tasks, such as fitting a logistic regression, do not require every variable in a dataset, only the variables of interest. Moreover, pulling every column for every city is a taxing operation. This is where querying is useful for selecting only the portions of the datasets deemed essential.

The logistic regression called for the driver’s race, age, and sex, the stop date, and whether a search was conducted. The query_data function takes in a database table name (city_name) and selects these variables from that table.

query_data <- function(city_name){
  # Concatenate the SQL query string for the given table
  command <- 
    paste("SELECT subject_age, subject_race, subject_sex, date, search_conducted FROM",
         city_name, sep = " ")
  return(DBI::dbGetQuery(con, command))
}

# Query each table of interest; the result is a list of data frames.
# datasets_of_interest is a character vector of database table names (e.g. "NCraleigh").
datasets <- lapply(datasets_of_interest, query_data)
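For a single city, the same query can be run directly; this assumes the Raleigh table (NCraleigh, queried earlier) contains all five selected columns.

# Pull only the logistic-regression variables for Raleigh
raleigh_subset <- query_data("NCraleigh")
head(raleigh_subset)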

4.2 Web Scraping and Uploading Datasets

This section shows the dataset upload process. The first step is to acquire all of the RDS web links from the Stanford Open Policing Project website.

The code below, written by Professor Sarkis, retrieves all of the RDS file links from the Stanford Open Policing Project website and stores the file names in a character vector.

url <- "https://openpolicing.stanford.edu/data/"
doc <- htmlParse(readLines(url), asText=TRUE)
links <- xpathSApply(doc, "//a/@href")
free(doc)

all_links <- as.character(links[grep('_.*rds', links)])

One could also get all of the dataset file names for a specific state by adding the state’s two-character abbreviation to the pattern, e.g.:

california_links <- as.character(links[grep('ca_.*rds', links)])

Another optional step is to subset certain links using base R indexing.

# subset list to get desired links
all_links <- all_links[43:48]
all_links <- all_links[-c(1, 2, 5)]
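Positional indexing is fragile because the positions shift whenever the site adds or reorders datasets. A sketch of a pattern-based alternative (the variable names here are new, not part of the original workflow) is to match the city names directly:

# Keep only the RDS links whose file names mention the cities studied here
cities_pattern <- "san_francisco|oakland|san_diego|charlotte|durham|raleigh|san_antonio"
selected_links <- as.character(links[grep('_.*rds', links)])
selected_links <- selected_links[grep(cities_pattern, selected_links)]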

Since the datasets are written to the database using the dbWriteTable function from the RMySQL library, logical data types get interpreted as character, which would be tedious to deal with in future analysis. Therefore, the convertLogicalToInt function converts columns with logical data types into doubles (0 and 1).

convertLogicalToInt function. Input: a data frame. Output: the same data frame with logical columns converted to binary (0/1) values.

convertLogicalToInt <- function(input_df){
  # Find the indices of the columns whose class is logical
  vars <- sapply(input_df, class) %>% 
    list.which(. == 'logical')

  # Recode those columns as numeric: TRUE -> 1, FALSE -> 0, NA stays NA
  input_df[, vars] <- sapply(input_df[, vars], as.numeric)
  return(input_df)
}
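A minimal check on a toy data frame (the column names are only illustrative):

# One logical column and one numeric column
toy <- data.frame(search_conducted = c(TRUE, FALSE, NA),
                  subject_age = c(25, 40, 33))
convertLogicalToInt(toy)
# search_conducted becomes 1/0 (NA stays NA); subject_age is unchanged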

Subsequently, dbWriteTable also needs a name for each table, which can be parsed out of the web links. The getDFName function returns a suitable table name.

getDFName function. Input: the name of a dataset (this could also be the name of a link from the web scrape). Output: a clean table name that includes the state and city.

getDFName <- function(input_string){
  # Split the link on underscores; the pieces contain the state and city
  tmp <- str_split(input_string, '_', simplify = TRUE)

  # The second piece is the two-letter state abbreviation
  state <- str_to_upper(tmp[,2])

  if(tmp[,4] != "2019"){
    # Two-word city name (e.g. "san" + "francisco"): paste both pieces together
    cityName <- paste(tmp[,3], tmp[,4], sep="")
    name <- paste(state, cityName, sep="")
  } else {
    # One-word city name: the fourth piece is the year, so only the third is needed
    cityName <- tmp[,3]
    name <- paste(state, cityName, sep="")
  }
  return(name)
}
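For example, applied to file names of the form used on the site at the time (the leading prefix here is hypothetical):

getDFName("yg821jf8611_nc_raleigh_2019_02_25.rds")        # returns "NCraleigh"
getDFName("yg821jf8611_ca_san_francisco_2019_02_25.rds")  # returns "CAsanfrancisco"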

The penultimate step is to read each RDS file and write it to the database using the dbWriteTable function. The results from the two previous functions serve as arguments to dbWriteTable.

uploadLinksToDatabase function. Input: a string link from the web scrape. Output: 0, indicating that the code finished executing. This function takes in an RDS link and writes the dataset to the database.

uploadLinksToDatabase <- function(input_link){
  # Read the RDS file directly from the web link
  tmpDF <- readRDS(gzcon(url(input_link)))
  # Convert logical columns to 0/1 before writing
  tmpDF <- convertLogicalToInt(tmpDF)
  # Build the database table name from the link
  name <- getDFName(input_link)

  # con2 is a connection with write privileges (see the note in Section 4.1)
  dbWriteTable(con2, name, tmpDF, overwrite = TRUE)
  return(0)
}

lapply serves as a one-line for loop. The argument all_links is a character vector of links; lapply iterates through it and calls uploadLinksToDatabase on each link.

lapply(all_links, uploadLinksToDatabase)
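Once the uploads finish, it is good practice to close the open database connections:

# Close both the read-only and the write-enabled connections
dbDisconnect(con)
dbDisconnect(con2)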