Note for future: use broom::tidy() to represent regression statistics.
Dataset 1 Jobs data: https://data.boston.gov/dataset/boston-jobs-policy-compliance-reports/resource/5ab4b4de-c970-4619-ab55-ce4338535b24
The link above points to the original dataset, which has 412,931 observations of 14 variables. It describes the workers hired for various development projects taking place in Boston. We can use this dataset to outline the relationships between a worker's area of residence, ethnicity, sex, and job. The dataset is meant to offer insight into how well project managers and individual development initiatives adhere to policy requirements, as measured by the makeup of their workforce. By gathering and sharing data related to the Residents Jobs Policy, the City aims to reduce race and gender disparities in construction projects while improving employment prospects for Boston residents.
This dataset is not one of the 3 datasets we proposed in blog post 1. We felt that those 3 datasets were not adequate for our project. The first was from Kaggle and was generated by ChatGPT, so any conclusions we reached would not be based on real data. The second was not only massive, but its variables mainly described the demographics of a mother, a father, and their child, so there wasn’t much opportunity to build a meaningful predictive model. The third was simply too small. We feel that this dataset is big enough and gives us opportunities to predict various outcomes. For example, we could potentially predict the demographics of workers based on where the development project takes place.
Original 14 variables: agency, compliance_project_name, project_address, neighborhood, developer, general_contractor_name, subcontractor, trade, period_ending, gender, person_of_color, race, boston_resident, worker_hours_this_period
Data Cleaning:
Can eliminate the following variables: person_of_color, project_address, developer, general_contractor_name, subcontractor
person_of_color: This is a binary variable, but we can look at “race” to determine whether the person is of color, so this variable is not necessary.
project_address: There is a “neighborhood” variable that is a little more general but still captures where the project in each observation is located. A specific address may be too granular to be predicted, or to be useful as a predictor.
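The column removal described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data frame `jobs` is a one-row stand-in filled with placeholder values (not real records), and the names `jobs`, `drop_vars`, and `jobs_clean` are hypothetical. In practice `jobs` would be read from the CSV downloaded from the link above.

```r
# Toy stand-in for the jobs data: one placeholder row with the 14 original
# column names. The values are made up; only the column names come from the data.
original_vars <- c(
  "agency", "compliance_project_name", "project_address", "neighborhood",
  "developer", "general_contractor_name", "subcontractor", "trade",
  "period_ending", "gender", "person_of_color", "race",
  "boston_resident", "worker_hours_this_period"
)
jobs <- as.data.frame(
  setNames(as.list(rep("placeholder", length(original_vars))), original_vars),
  stringsAsFactors = FALSE
)

# Variables we plan to eliminate during cleaning.
drop_vars <- c("person_of_color", "project_address", "developer",
               "general_contractor_name", "subcontractor")

# Keep every column whose name is not in drop_vars.
jobs_clean <- jobs[, !(names(jobs) %in% drop_vars), drop = FALSE]

# Show the columns that remain after cleaning.
names(jobs_clean)
```

Selecting columns by name rather than by position keeps the step safe if the column order in the source file changes.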
Dataset 2 Birth data: https://www.nber.org/research/data/vital-statistics-natality-birth-data
This dataset contains a total of 3,669,928 observations with 225 different factors. It provides comprehensive information about births that occurred in the United States, based on data extracted from birth certificates submitted to vital statistics offices in each state. The dataset spans multiple years, with data collection methods evolving over time.
The dataset is a valuable resource for anyone interested in understanding how maternal education, racial background, and maternal age affect various aspects of childbirth, healthcare, and demographic trends in the United States. It plays a critical role in addressing disparities and improving healthcare and support systems for expectant mothers and newborns.
Data Cleaning:
Can eliminate the following variables: mother’s age, race, education, daily smoking before pregnancy, height in inches, BMI, weight in pounds before pregnancy, weight in pounds at delivery, pre-diabetes, birth weight