Blog Post 1

Dataset Proposals

Author

Team 9

Published

February 28, 2025

Modified

February 28, 2025

Original Data Source:

We have chosen three datasets: Suspensions, School Support, and Harassment & Bullying. All three come from the U.S. Department of Education’s Civil Rights Data Collection (CRDC) for the 2021-2022 school year, accessible at civilrightsdata.ed.gov. The CRDC collects information from public schools and school districts across the United States to monitor civil rights compliance and ensure equal educational opportunities.

Dataset 1: Suspensions.csv

This dataset focuses on school suspensions and captures the disciplinary actions reported by schools. It includes key identifiers such as state, district, and school names, along with detailed metrics on suspension incidents, disciplinary actions for students served under IDEA (Individuals with Disabilities Education Act), and total days missed due to suspensions. The dataset has 98,010 rows and 189 columns, so it covers a large number of schools and districts in considerable detail.

The data was originally collected for federal reporting and policy-making, to identify disparities in disciplinary actions across demographics such as race, gender, and disability status. Schools are required to report disciplinary statistics to the CRDC, and these statistics are then used to inform research, educational reforms, and civil rights enforcement.

The primary questions that can be explored with this dataset include: which states or districts have the highest suspension rates, and whether certain demographic groups face disproportionate disciplinary actions.

The size and complexity of the dataset make it essential to clean and preprocess the data before drawing meaningful conclusions. Some values are missing or inconsistent: placeholder codes such as -9 stand for missing or N/A data, and some columns contain mixed data types that need to be standardized before analysis. Other potential challenges include ensuring comparability across districts and dealing with erroneous records.
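As a rough illustration of the cleaning described above, the following standard-library Python sketch treats negative placeholder codes such as -9 as missing, coerces mixed-type cells to integers, and sums a count by state. The column names `LEA_STATE` and `TOT_SUSPENSIONS` and the sample rows are made up for illustration, and the assumption that every negative value is a placeholder should be checked against the CRDC documentation.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical sample; the real file has 189 columns with different names.
SAMPLE = """LEA_STATE,SCH_NAME,TOT_SUSPENSIONS
MA,School A,12
MA,School B,-9
NY,School C, 7
NY,School D,n/a
NY,School E,3.0
"""


def to_count(raw):
    """Return a non-negative int, or None for missing/placeholder cells."""
    try:
        value = float(raw.strip())
    except ValueError:
        return None  # non-numeric text such as "n/a"
    if not math.isfinite(value) or value < 0:
        return None  # assumed placeholder code such as -9
    return int(value)


def totals_by_state(csv_text, state_col, count_col):
    """Sum valid counts per state and tally how many cells were missing."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        count = to_count(row[count_col])
        if count is None:
            missing[row[state_col]] += 1
        else:
            totals[row[state_col]] += count
    return dict(totals), dict(missing)


totals, missing = totals_by_state(SAMPLE, "LEA_STATE", "TOT_SUSPENSIONS")
print(totals)
print(missing)
```

Tracking the missing cells separately, rather than silently dropping them, makes it easier to see which states have incomplete reporting before comparing totals or rates.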

Dataset 2: School_Support.csv

This dataset contains 98,010 rows and 19 columns. It provides an overview of the staffing resources in schools in each state of the United States, including security personnel, teaching staff (both certified and uncertified), and support staff such as counselors, nurses, psychologists, and social workers. The data covers specific schools as well as justice agencies, so it spans diverse educational environments with different support structures. Some columns (such as TOT_TEACHERS_CURR_M and TOT_TEACHERS_CURR_F) contain negative values, which indicate missing data or recording errors. The resources and personnel allocated to each school show how much support is available to students and how much emphasis is placed on specialized educational environments.

Some questions that we can explore with this dataset include: How does the staffing of school security personnel vary across states and regions? What is the ratio of counselors, nurses, and psychologists to students in different schools? Does school resource allocation vary by location or regional funding level?
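One way to gauge how widespread the negative values are is to count them per column before deciding how to treat them. The sketch below does this with the Python standard library. Only the two `TOT_TEACHERS_*` column names come from the dataset description; the `SCH_NAME` column and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows; only the two TOT_TEACHERS_* column names come from the post.
SAMPLE = """SCH_NAME,TOT_TEACHERS_CURR_M,TOT_TEACHERS_CURR_F
School A,10,25
School B,-9,-9
School C,4,-5
"""


def negative_counts(csv_text, id_cols=("SCH_NAME",)):
    """Count negative cells per column, skipping identifier columns."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, raw in row.items():
            if col in id_cols:
                continue
            try:
                if float(raw) < 0:
                    counts[col] += 1
            except (TypeError, ValueError):
                pass  # leave non-numeric or absent cells for a separate check
    return dict(counts)


neg = negative_counts(SAMPLE)
print(neg)
```

A column where most cells are negative probably needs different handling from one with only a handful of flagged rows.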

Dataset 3: Harassment_and_Bullying.csv

This dataset focuses on harassment and bullying incidents reported in educational institutions, capturing various forms of misconduct and disciplinary actions. It includes key variables such as state, district, and school names, as well as student demographics. The dataset contains 98,010 rows and 159 columns, making it a large dataset with detailed information on bullying and harassment incidents in schools across different states.

The dataset comes from the CRDC, which gathers key civil rights metrics to assess access to and obstacles in educational opportunities from early childhood through grade 12. The Office for Civil Rights (OCR) uses this data from public school districts to investigate discrimination complaints, evaluate compliance with federal civil rights laws, conduct proactive reviews, and offer policy guidance and technical support to schools and districts.

Potential areas of investigation include identifying regions with the highest bullying reports, examining patterns of disciplinary actions, and evaluating disparities among student groups.

The biggest challenge for this dataset is organizing and cleaning its format. The dataset is structured in a wide format, where multiple columns represent similar variables. Also, many columns contain only 0s, which suggests that no school reported incidents in those categories.
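The two issues above, the wide layout and the all-zero columns, can be sketched in a few lines of standard-library Python. The example reshapes a wide table into long form (one row per school and variable) and then lists the variables that are zero for every school. The column names (`SCH_NAME`, `HB_RACE_M`, and so on) and the sample rows are hypothetical and do not reflect the real CRDC headers.

```python
import csv
import io

# Made-up wide-format sample; real column names will differ.
SAMPLE = """SCH_NAME,HB_RACE_M,HB_RACE_F,HB_SEX_M,HB_SEX_F
School A,2,0,0,1
School B,0,0,0,3
"""
ID_COLS = ["SCH_NAME"]


def wide_to_long(rows, id_cols):
    """Turn each measure column into its own (ids, variable, value) record."""
    long_rows = []
    for row in rows:
        for col, raw in row.items():
            if col not in id_cols:
                record = {c: row[c] for c in id_cols}
                record["variable"] = col
                record["value"] = int(raw)
                long_rows.append(record)
    return long_rows


def all_zero_columns(long_rows):
    """Return the variables whose value is 0 in every row."""
    seen = {}
    for r in long_rows:
        seen[r["variable"]] = seen.get(r["variable"], True) and r["value"] == 0
    return sorted(v for v, is_zero in seen.items() if is_zero)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
long_rows = wide_to_long(rows, ID_COLS)
zero_cols = all_zero_columns(long_rows)
print(len(long_rows), zero_cols)
```

An all-zero column can mean either that no incidents occurred or that nothing was reported, so it is worth checking the data documentation before dropping such columns.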