Traffic and Drugs Related Violations Dataset
https://www.kaggle.com/datasets/shubamsumbria/traffic-violations-dataset/data
This dataset has 15 columns and 52,967 rows. It was originally collected to record traffic and drug violations. Because it is a CSV file, we will be able to load and clean it. The main question we hope to address is whether one race category has more traffic and drug related violations than the others. A second question is whether the outcome of a violation differs by race category, for example whether people of one race were more or less likely to be arrested. We also aim to examine whether certain violation categories more frequently resulted in a particular type of outcome. One challenge we may encounter is organizing the data, since it appears to contain repeated entries for the same violator when one person has multiple recorded violations, whether of the same or different types. An additional challenge is the missing values in some columns, which could make connections harder to see.
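As a rough illustration of the outcome questions above, the sketch below computes, for each group, the share of records ending in each outcome, and skips records where either value is missing. It is written in Python with only the standard library; the same grouping could be done in R. The column names (driver_race, violation, stop_outcome), the helper outcome_shares, and the sample rows are all assumptions made up for this example and are not taken from the actual file.

```python
import csv
import io
from collections import Counter, defaultdict

# Tiny made-up sample; the column names are assumptions, not the real header.
SAMPLE = """driver_race,violation,stop_outcome
White,Speeding,Citation
Black,Speeding,Arrest Driver
White,Equipment,Warning
Black,Speeding,Citation
,Speeding,Citation
Hispanic,Equipment,
White,Speeding,Arrest Driver
"""


def outcome_shares(rows, group_col, outcome_col):
    """Return {group: {outcome: share}} using only rows where both values are present."""
    counts = defaultdict(Counter)
    for row in rows:
        group = (row.get(group_col) or "").strip()
        outcome = (row.get(outcome_col) or "").strip()
        if not group or not outcome:
            continue  # skip records with a missing group or outcome
        counts[group][outcome] += 1
    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        shares[group] = {o: n / total for o, n in counter.items()}
    return shares


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_race = outcome_shares(rows, "driver_race", "stop_outcome")
by_violation = outcome_shares(rows, "violation", "stop_outcome")

for race, shares in sorted(by_race.items()):
    print(race, {o: round(p, 2) for o, p in sorted(shares.items())})
```

Comparing shares rather than raw counts keeps groups of different sizes comparable. If the same person does appear in several records, these shares describe violations rather than people, so the repeated entries would need to be handled first if the question is about individuals.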
People Receiving Homeless Response Services by Age, Race, Ethnicity, and Gender - Homelessness Demographic By Race
https://catalog.data.gov/dataset/people-receiving-homeless-response-services-by-age-race-ethnicity-and-gender-b667d
This dataset has 5 columns and 2,162 rows. The data was originally collected by 44 Continuums of Care (CoCs) in California, which are regional planning bodies that coordinate homelessness services. It describes the people the CoCs serve and is meant to help them learn more about the demographics of that population. The data is a CSV file, so it can be loaded into RStudio and cleaned. With this dataset we hope to address whether one race receives more homelessness services than other races across all the CoCs in California, whether there are trends in the number of people of each race receiving services (increasing, decreasing, or steady), and whether one region serves a larger homeless population than other regions. One challenge is that the dataset does not contain much information, so more complex analysis may be limited. Age, ethnicity, and gender information also appears to be available, but in separate CSV files, so we would like to know whether there is a way to combine them.
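One possible way to combine the separate files, sketched here in Python with the standard library, is to stack them into a single long table with an extra column recording which demographic dimension each row describes. The column names (year, location, race, age_group, count), the helper stack, and the sample values are hypothetical; the real headers may differ and would need to be checked first.

```python
import csv
import io

# Made-up stand-ins for two of the separate files; the real headers may differ.
RACE_CSV = """year,location,race,count
2020,CoC A,Black,120
2020,CoC A,White,200
"""
AGE_CSV = """year,location,age_group,count
2020,CoC A,Under 18,90
2020,CoC A,18-24,60
"""


def stack(sources):
    """Combine per-dimension files into one long table.

    `sources` maps a dimension name (which is also the name of that file's
    category column) to the CSV text for that file.
    """
    combined = []
    for dimension, text in sources.items():
        for row in csv.DictReader(io.StringIO(text)):
            combined.append({
                "year": int(row["year"]),
                "location": row["location"],
                "dimension": dimension,
                "category": row[dimension],
                "count": int(row["count"]),
            })
    return combined


long_table = stack({"race": RACE_CSV, "age_group": AGE_CSV})
for record in long_table:
    print(record)
```

If each file only reports totals for one dimension, as assumed here, stacking makes it easy to filter and plot every dimension from one table, but it cannot recover cross-tabulations such as race by age group.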
NCHS - Death rates and life expectancy at birth
https://catalog.data.gov/dataset/nchs-death-rates-and-life-expectancy-at-birth
This dataset has 5 columns and 1,072 rows. The five columns are Year, Race, Sex, Average Life Expectancy in Years, and Age-adjusted Death Rate. The dataset tracks U.S. mortality trends from 1900 to 2017 and highlights differences in age-adjusted death rates and life expectancy at birth by race and sex. Using it, we hope to discover any relationship between race and life expectancy. There are, of course, many factors that we cannot take into consideration, but we believe that at least some trend should be visible with R. Medical care should have improved over time, so we expect life expectancy to rise. The size of that increase may vary among races, however, because not everyone has equal access to care. More columns and rows would have been better, but it will be interesting to extract useful information from a small dataset. If this dataset does not work, we hope to find better datasets online.
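To make the idea of a visible trend concrete, the sketch below fits a least-squares line of life expectancy against year for each race and reports the slope in years of life expectancy gained per calendar year. It is a minimal Python illustration using only the standard library. The short column names, the group labels, the helper slope, and every number in the sample are made up for this example and do not come from the NCHS file.

```python
import csv
import io
from collections import defaultdict

# Made-up values for illustration only; not taken from the NCHS file.
SAMPLE = """year,race,sex,life_expectancy
1950,Group A,Both Sexes,60.0
1960,Group A,Both Sexes,63.0
1970,Group A,Both Sexes,66.0
1980,Group A,Both Sexes,
1950,Group B,Both Sexes,68.0
1960,Group B,Both Sexes,70.0
1970,Group B,Both Sexes,72.0
"""


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


series = defaultdict(list)
for row in csv.DictReader(io.StringIO(SAMPLE)):
    if not row["life_expectancy"]:
        continue  # skip rows with a missing life expectancy
    series[row["race"]].append((int(row["year"]), float(row["life_expectancy"])))

trends = {race: slope(points) for race, points in series.items()}
for race, value in sorted(trends.items()):
    print(f"{race}: {value:.2f} years gained per calendar year")
```

A steeper slope for one group would suggest that the gap between groups is changing over time. A single straight line is a simplification over a span as long as 1900 to 2017, so fitting shorter periods separately may describe the data better.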