Chapter 4 Description of Data
The data was extracted from the Stanford Open Policing Project data hub. The Open Policing Project’s website contains ninety city and state-wide standardized stop datasets in both RDS and CSV format. From there, we uploaded six datasets to a mySQL database and locally queried the data on our personal laptops via RMySQL package within R.
The possible variables within each dataset are listed below with description and corresponding units.
Variable | Description | Units |
---|---|---|
Stop Date | States the year, month, day | XXXX-XX-XX |
Stop Time | States the time of the stop | XX:XX:XX |
Stop Location | States the street intersection of the stop | Names two streets, occasionally lists a city |
Driver Race | Describes the race of driver as given on the driver’s license | String representing the subject’s race |
Driver Sex | Describes the sex orientation of driver | Binary Female/Male |
Driver Age | Describes the age of the driver | Integer representing the driver’s age |
Search Conducted | Describes whether there was a search conducted or not | True (search conducted) or False (search not conducted) |
Contraband Found | Describes whether there was contraband found after a search was conducted | True (contraband found) or False (contraband not found) |
Citation Issued | Describes whether there was a citation issued to the driver | True (citation issued) or False (no citation issued) |
Warning Issued | Describes whether there was a warning issued to the driver | True (warning issued) or False (no warning issued) |
Frisk Performed | Describes whether there was a frisk performed on the driver | True (frisk performed) or False (no frisk performed) |
Arrest Made | Describes whether the driver was arrested | True (arrest made) or False (no frisk performed) |
Reason for Stop | Describes why the driver was stopped | String describing the reason for the stop |
Violation | Describes the violation when a citation or warning was issued | String representing the violation |
We focused on the datasets for San Francisco, San Diego, Oakland, Raleigh, Charlotte, and Durham, and San Antonio. Below, the “X” represents whether the dataset for that particular city includes the specified variable.
City | Number.of.Observations | Stop.Date | Stop.Time | Stop.Location | Driver.Race | Search.Conducted | Contraband.found | Citation.Issued | Warning.Issued | Frisk.Performed | Arrest.Made | Reason.For.Stop | Violation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
San Francisco | 905,070 | X | X | X | X | X | X | X | X | X | X | ||
Oakland | 133,405 | X | X | X | X | X | X | X | X | X | |||
San Diego | 382,844 | X | X | X | X | X | X | X | X | X | X | ||
Charlotte | 1,598,453 | X | X | X | X | X | X | X | X | X | X | ||
Durham | 326,024 | X | X | X | X | X | X | X | X | X | X | ||
Raleigh | 856,400 | X | X | X | X | X | X | X | X | X | X | ||
San Antonio | 1,040,428 | X | X | X | X | X | X | X | X | X |
While many of the datasets included a multitude of the variables above, there were clear limitations. First, while the California datasets were rich in variables, the “reason for stop” was dubious at best. Not only did many reasons make no logical sense, but also up to four reasons were listed for a single observation. Moreover, there is no uniformity in the “reason for stop” for all datasets in California. In contrast, the datasets in North Carolina had a clean “reason for stop”; however, these datasets do not have latitude/longitude which limits the geographical comparisons we can make. Additionally, only a small fraction of the city and state datasets contained subject age; the scarcity of information regarding subject age makes it difficult to compare nationally. Lastly, only a few datasets, such as Chicago, contained the police officer’s identification number. While we did not investigate the effect of the age, race, or other factors of individual police officers, it would be meaningful for more datasets to contain that information to draw national correlations.
4.1 Querying Datasets with RMySQL
First import the necesarry packages to work with SQL in R
## Loading required package: DBI
## Warning: package 'rlist' was built under R version 3.6.3
Next, setup a connection to the database.
# Connect to database
con <- dbConnect(
MySQL(), host = "traffic.st47s.com", user = "student",
password = "Sagehen47", dbname = "traffic")
Note: If you are looking to modify the database you must use a different username and password that can be acquired through Professor Hardin.
dbGetQuery returns a dataframe, so storing it in a variable adds that dataframe to the global environment
4.1.1 Querying Subsets of dataset
Certain scenarios such as running a logistic regression need not every variable from a dataset - only the variables of interest. Moreover, running a logistic regression on every city with all columns is a taxing operation. This is where querying can be useful in selecting portions of the datasets deemed essential.
The logistic regression called for the race, age, sex, and data variables. The query_data function will takes in a database table name (city_name) and will select the desired variables from that table.
query_data <- function(city_name){
# cancenate SQL query string
command <-
paste("SELECT subject_age, subject_race, subject_sex, date, search_conducted FROM",
city_name, sep = " ")
return(DBI::dbGetQuery(con, command))
}
# Make a list of datasets
datasets <- list()
datasets <- lapply(datasets_of_interest, query_data)
4.2 Web Scrapping and Uploading Datasets
This section will show the dataset upload process. The first step is to acquire all the RDS weblinks from the Stanford Policing Project Website.
The code here is written by Professor Sarkis and it will get all the rds files from the stanford policing project website. Then, it stores all the file names into a character string
url <- "https://openpolicing.stanford.edu/data/"
doc <- htmlParse(readLines(url), asText=TRUE)
links <- xpathSApply(doc, "//a/@href")
free(doc)
all_links <- as.character(links[grep('_.*rds', links)])
One could also get all the dataset file names for a specific state by adding the state’s two character abbreviation e.g.
Another optional step is to splice certain links using base R indexing
# subset list to get desired links
all_links <- all_links[43:48]
all_links <- all_links[-c(1, 2, 5)]
Since the datasets are written to the database using the dbWriteTable function from RMySQL library, logical datatypes get interpreted as character, which will be tedious to deal with in future analysis. Therefore, the convertLogicalToInt function will convert columns that have logical datatypes into doubles (0 and 1).
convertLogicalToInt function: input: dataset output: dataset with logical values set to binary
convertLogicalToInt <- function(input_df){
vars <- sapply(input_df, class) %>%
list.which(.=='logical')
input_df[,vars] <- sapply(input_df[,vars], as.numeric)
input_df
return(input_df)
}
Subsequently, dbWriteTable also needs a name for the table, which can parsed out of the weblinks. The getDFName function returns a suitable datatable name.
getDFName: input: name of dataset (this could also be name of link from web scrape) output: clean dataset name that includes state and city
getDFName <- function(input_string){
tmp <- str_split(input_string, '_', simplify = TRUE)
tmp
state <- str_to_upper(tmp[,2])
if(tmp[,4] != "2019"){
cityName <- paste(tmp[,3], tmp[,4], sep="")
name <- paste(state, cityName, sep="")
} else {
cityName <- tmp[,3]
name <- paste(state, cityName, sep="")
}
name
return(name)
}
The penultimate step is to read the RDS file and write that to the database using the dbWriteTable function. The results from the two previous functions will serve as arguments for the dbWriteTable function.
uploadLinksToDatabase: input: string link from web scrape outout: 0 - indicating that the code finished executing This function will take in a RDS link and write the dataset to the database
uploadLinksToDatabase <- function(input_link){
tmpDF <- readRDS(gzcon(url(input_link)))
tmpDF <- convertLogicalToInt(tmpDF)
name<- getDFName(input_link)
dbWriteTable(con2, name, tmpDF, overwrite = TRUE)
return(0)
}
lapply serves as a ‘oneline’ for loop. The argument all_links is a character string of links, which lapply will iterate through and call the function, uploadLinksToDatabase, on each link