Chapter 4 Description of Data

The data was extracted from the Stanford Open Policing Project data hub. The Open Policing Project’s website contains ninety city and state-wide standardized stop datasets in both RDS and CSV format. From there, we uploaded six datasets to a mySQL database and locally queried the data on our personal laptops via RMySQL package within R.

The possible variables within each dataset are listed below with description and corresponding units.

Variable	Description	Units
Stop Date	States the year, month, day	XXXX-XX-XX
Stop Time	States the time of the stop	XX:XX:XX
Stop Location	States the street intersection of the stop	Names two streets, occasionally lists a city
Driver Race	Describes the race of driver as given on the driver’s license	String representing the subject’s race
Driver Sex	Describes the sex orientation of driver	Binary Female/Male
Driver Age	Describes the age of the driver	Integer representing the driver’s age
Search Conducted	Describes whether there was a search conducted or not	True (search conducted) or False (search not conducted)
Contraband Found	Describes whether there was contraband found after a search was conducted	True (contraband found) or False (contraband not found)
Citation Issued	Describes whether there was a citation issued to the driver	True (citation issued) or False (no citation issued)
Warning Issued	Describes whether there was a warning issued to the driver	True (warning issued) or False (no warning issued)
Frisk Performed	Describes whether there was a frisk performed on the driver	True (frisk performed) or False (no frisk performed)
Arrest Made	Describes whether the driver was arrested	True (arrest made) or False (no frisk performed)
Reason for Stop	Describes why the driver was stopped	String describing the reason for the stop
Violation	Describes the violation when a citation or warning was issued	String representing the violation

We focused on the datasets for San Francisco, San Diego, Oakland, Raleigh, Charlotte, and Durham, and San Antonio. Below, the “X” represents whether the dataset for that particular city includes the specified variable.

City	Number.of.Observations	Stop.Date	Stop.Time	Stop.Location	Driver.Race	Search.Conducted	Contraband.found	Citation.Issued	Warning.Issued	Frisk.Performed	Arrest.Made	Reason.For.Stop	Violation
San Francisco	905,070	X	X	X	X	X	X	X	X		X	X
Oakland	133,405	X	X		X	X	X	X	X		X	X
San Diego	382,844	X	X	X	X	X	X	X	X		X	X
Charlotte	1,598,453	X	X		X	X	X	X	X	X	X	X
Durham	326,024	X	X		X	X	X	X	X	X	X	X
Raleigh	856,400	X	X		X	X	X	X	X	X	X	X
San Antonio	1,040,428	X	X	X	X	X	X	X			X		X

While many of the datasets included a multitude of the variables above, there were clear limitations. First, while the California datasets were rich in variables, the “reason for stop” was dubious at best. Not only did many reasons make no logical sense, but also up to four reasons were listed for a single observation. Moreover, there is no uniformity in the “reason for stop” for all datasets in California. In contrast, the datasets in North Carolina had a clean “reason for stop”; however, these datasets do not have latitude/longitude which limits the geographical comparisons we can make. Additionally, only a small fraction of the city and state datasets contained subject age; the scarcity of information regarding subject age makes it difficult to compare nationally. Lastly, only a few datasets, such as Chicago, contained the police officer’s identification number. While we did not investigate the effect of the age, race, or other factors of individual police officers, it would be meaningful for more datasets to contain that information to draw national correlations.

4.1 Querying Datasets with RMySQL

First import the necesarry packages to work with SQL in R

library(XML)
library(RMySQL)

## Loading required package: DBI

library(tidyverse)
library(knitr)
library(rlist)

## Warning: package 'rlist' was built under R version 3.6.3

Next, setup a connection to the database.

# Connect to database
con <- dbConnect(
  MySQL(), host = "traffic.st47s.com", user = "student",
  password = "Sagehen47", dbname = "traffic")

Note: If you are looking to modify the database you must use a different username and password that can be acquired through Professor Hardin.

dbGetQuery returns a dataframe, so storing it in a variable adds that dataframe to the global environment

raleigh_df <- DBI::dbGetQuery(con, "SELECT * FROM NCraleigh LIMIT 15")

4.1.1 Querying Subsets of dataset

Certain scenarios such as running a logistic regression need not every variable from a dataset - only the variables of interest. Moreover, running a logistic regression on every city with all columns is a taxing operation. This is where querying can be useful in selecting portions of the datasets deemed essential.

The logistic regression called for the race, age, sex, and data variables. The query_data function will takes in a database table name (city_name) and will select the desired variables from that table.

query_data <- function(city_name){
  # cancenate SQL query string
  command <- 
    paste("SELECT subject_age, subject_race, subject_sex, date, search_conducted FROM",
         city_name, sep = " ")
  return(DBI::dbGetQuery(con, command))
} 

# Make a list of datasets
datasets <- list()
datasets <- lapply(datasets_of_interest, query_data)

4.2 Web Scrapping and Uploading Datasets

This section will show the dataset upload process. The first step is to acquire all the RDS weblinks from the Stanford Policing Project Website.

The code here is written by Professor Sarkis and it will get all the rds files from the stanford policing project website. Then, it stores all the file names into a character string

url <- "https://openpolicing.stanford.edu/data/"
doc <- htmlParse(readLines(url), asText=TRUE)
links <- xpathSApply(doc, "//a/@href")
free(doc)

all_links <- as.character(links[grep('_.*rds', links)])

One could also get all the dataset file names for a specific state by adding the state’s two character abbreviation e.g.

california_links <- as.character(links[grep('ca_.*rds', links)])

Another optional step is to splice certain links using base R indexing

# subset list to get desired links
all_links <- all_links[43:48]
all_links <- all_links[-c(1, 2, 5)]

Since the datasets are written to the database using the dbWriteTable function from RMySQL library, logical datatypes get interpreted as character, which will be tedious to deal with in future analysis. Therefore, the convertLogicalToInt function will convert columns that have logical datatypes into doubles (0 and 1).

convertLogicalToInt function: input: dataset output: dataset with logical values set to binary

convertLogicalToInt <- function(input_df){
  vars <- sapply(input_df, class) %>% 
  list.which(.=='logical')

  input_df[,vars] <- sapply(input_df[,vars], as.numeric)
  input_df
  return(input_df)
}

Subsequently, dbWriteTable also needs a name for the table, which can parsed out of the weblinks. The getDFName function returns a suitable datatable name.

getDFName: input: name of dataset (this could also be name of link from web scrape) output: clean dataset name that includes state and city

getDFName <- function(input_string){
  tmp <- str_split(input_string, '_', simplify = TRUE)
tmp

state <- str_to_upper(tmp[,2])

if(tmp[,4] != "2019"){
    cityName <- paste(tmp[,3], tmp[,4], sep="")
    name <- paste(state, cityName, sep="")
} else {
    cityName <- tmp[,3]
    name <- paste(state, cityName, sep="")
}
name
return(name)
}

The penultimate step is to read the RDS file and write that to the database using the dbWriteTable function. The results from the two previous functions will serve as arguments for the dbWriteTable function.

uploadLinksToDatabase: input: string link from web scrape outout: 0 - indicating that the code finished executing This function will take in a RDS link and write the dataset to the database

uploadLinksToDatabase <- function(input_link){
  tmpDF <- readRDS(gzcon(url(input_link)))
  tmpDF <- convertLogicalToInt(tmpDF)
  name<- getDFName(input_link)
  
  dbWriteTable(con2, name, tmpDF, overwrite = TRUE)
  return(0)
}

lapply serves as a ‘oneline’ for loop. The argument all_links is a character string of links, which lapply will iterate through and call the function, uploadLinksToDatabase, on each link

lapply(all_links, uploadLinksToDatabase)