MA 415 BURG Team 9 Final Project

For our model, we chose arrest type as our binary response variable, coding ‘Marked’ arrests as 1 and ‘Unmarked’ arrests as 0. Our predictor variables are Race, Alcohol, and Gender. We then fit a generalized linear model (glm) with the binomial family, i.e. a logistic regression.
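
Written out, a logistic regression of this form can be sketched as follows. The symbols are illustrative notation rather than anything taken from the model output: $p_i$ is the probability that arrest $i$ is ‘Marked’, and each $x_{ij}$ is a 0/1 indicator for a non-reference level of Race, Alcohol, or Gender.

```latex
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij},
\qquad
p_i = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\right)}
```

Under this form, a negative $\beta_j$ lowers the log-odds of a ‘Marked’ arrest relative to the reference level, and a positive $\beta_j$ raises them.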

In the original model we fit, every level of our categorical predictors except the “Male” level of Gender has a statistically significant effect on the predicted probability of our response, which suggests a relationship between the categorical variables we chose and the probability of each arrest type.

Moreover, based on the coefficient estimates, the Race levels “Black”, “Hispanic”, “Other”, and “White”, as well as the variable “Alcohol”, have negative log-odds coefficients, which means that, relative to the reference level, they lower the probability of a “Marked” arrest. In contrast, the Gender levels “Male” and “U” and the Race level “Native American” have positive log-odds coefficients, which raise the probability of a “Marked” arrest.
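
To make the log-odds interpretation concrete, the following minimal Python sketch converts a coefficient into an odds ratio and shows how it shifts a probability. The function names, the baseline probability, and the coefficient values of -0.5 and 0.5 are hypothetical and chosen only for illustration; they are not estimates from our model.

```python
import math


def odds_ratio(coef):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coef)


def shift_probability(baseline_prob, coef):
    """Apply a log-odds coefficient to a baseline probability.

    The baseline is moved to the log-odds scale, the coefficient is
    added, and the result is mapped back to a probability.
    """
    baseline_logit = math.log(baseline_prob / (1 - baseline_prob))
    return 1 / (1 + math.exp(-(baseline_logit + coef)))


if __name__ == "__main__":
    # Hypothetical values, not taken from the fitted model.
    baseline = 0.9
    for coef in (-0.5, 0.5):
        print(
            f"coef={coef:+.1f}",
            f"odds ratio={odds_ratio(coef):.3f}",
            f"probability={shift_probability(baseline, coef):.3f}",
        )
```

A negative coefficient gives an odds ratio below 1 and a probability below the baseline; a positive coefficient does the opposite.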

We also built a confusion matrix to assess our model’s accuracy. Classifying predicted probabilities larger than 0.5 as 1 and those less than 0.5 as 0, the model’s accuracy is 0.9277944, or roughly 92.78%. This figure should be interpreted cautiously and compared against the proportions of marked and unmarked arrests, because we have to consider the possibility of overfitting in our model and the imbalance in our response variable.
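
The comparison against the class proportions can be sketched in a few lines of Python. This is a minimal illustration on made-up toy data, not our dataset; the function names are hypothetical. It thresholds predicted probabilities at 0.5, counts the confusion matrix cells, and compares accuracy with the accuracy of always predicting the majority class.

```python
def confusion_matrix(actual, predicted_prob, threshold=0.5):
    """Count true/false positives and negatives at a given threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, predicted_prob):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of observations classified correctly."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total


def majority_baseline(actual):
    """Accuracy obtained by always predicting the more common class."""
    share_of_ones = sum(actual) / len(actual)
    return max(share_of_ones, 1 - share_of_ones)


if __name__ == "__main__":
    # Toy data (assumption): 9 of 10 outcomes are 1, and the model
    # predicts a high probability for every observation.
    actual = [1] * 9 + [0]
    predicted_prob = [0.9] * 10
    cm = confusion_matrix(actual, predicted_prob)
    print(cm, accuracy(cm), majority_baseline(actual))
```

When the outcome is heavily imbalanced, a model that predicts the majority class for every observation can match the baseline without distinguishing the two classes at all, which is why accuracy alone can be misleading.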

In the model summary, we noticed that the null deviance is 984980 and the residual deviance is 979510, and that the deviance residuals range from -2.6691 to 1.2755. Our model may be overfitting because of the imbalanced outcome proportions in our dependent variable.
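
As a rough check on how much the predictors improve on an intercept-only model, the two deviances reported above can be compared directly. The sketch below uses only those two numbers; the function names are hypothetical, and the resulting ratio is a descriptive summary rather than a formal test.

```python
def deviance_reduction(null_deviance, residual_deviance):
    """Absolute drop in deviance from the null model to the fitted model."""
    return null_deviance - residual_deviance


def share_of_deviance_explained(null_deviance, residual_deviance):
    """Proportion of the null deviance removed by the predictors."""
    return 1 - residual_deviance / null_deviance


if __name__ == "__main__":
    # Values reported in the model summary.
    null_deviance = 984980
    residual_deviance = 979510
    print(deviance_reduction(null_deviance, residual_deviance))
    print(share_of_deviance_explained(null_deviance, residual_deviance))
```

The drop of 5470 is less than 1% of the null deviance, so the predictors reduce the deviance only slightly even though most of them are statistically significant.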

Based on the current results of our original model, we are exploring several avenues for improving its performance. First and foremost, our current dataset consists solely of categorical variables. Introducing relevant numerical variables could improve our model’s predictive capabilities, particularly since we are using logistic regression as our analytical framework. To achieve this, we are considering merging our existing dataset with other related datasets, a step we believe is crucial for uncovering additional potential relationships between the dependent and independent variables.

Additionally, we recognize the imbalance in our current dependent variable. To address this, we are contemplating the exploration of alternative dependent variables. Building multiple analysis models with varied dependent variables can provide a broader perspective, potentially revealing different relationships and leading to more robust conclusions. This approach not only enhances the stability of our findings but also enriches our exploratory data analysis (EDA) by fostering new insights.

By diversifying our dataset and exploring alternative dependent variables, we aim to refine our understanding of the underlying relationships, ultimately improving the overall performance and reliability of our analytical models.

We plan to polish our visualizations and tables after finalizing our model so that they best present the data in terms of our statistical modeling method. This will include highlighting the information that informed our choice of variables. For example, in the scatterplot from our exploratory data analysis we will highlight the variables race, alcohol, and gender, which seemed to have a potential relationship with arrest_type_numeric, to show that this contributed to these variables being chosen for our statistical model. We also plan to add titles to our visualizations and tables to show more clearly what is being modeled in each plot.

Additionally, we will polish our data visualizations by adding captions and annotations that briefly summarize our takeaways from each visualization. For example, in the scatterplot mentioned above we will add a caption or annotation saying that only race, alcohol, and gender appear to potentially have a relationship with arrest_type_numeric. We also plan to improve our figures using the widgets available from https://gallery.htmlwidgets.org/, particularly scatterD3, to show both colors and comments more clearly for specific points in our plots. We are also planning to use pairsD3 to show the various relationships between variables that we explored when choosing our model.

We are still trying out different EDA approaches, with the GLM being our priority. We are not yet satisfied with the results produced so far, but we believe we can reach a conclusion that everyone can agree on.