Data

We describe the sources of our data and the cleaning process.

Source

The data is publicly available on the CRDC website, allowing access to the original source.

Where and how to find the data

All three of our raw datasets come from the Civil Rights Data Collection (CRDC), which covers public schools and school districts across the United States and is collected by the U.S. Department of Education’s Office for Civil Rights (OCR).

The CRDC is used to monitor educational access and equity and to ensure compliance with federal civil rights laws. Our datasets come from the 2021-2022 collection, accessible at civilrightsdata.ed.gov.

Data Source

Data Set 1: Harassment and Bullying

Description:

This dataset focuses on harassment and bullying incidents reported in educational institutions, capturing various forms of misconduct and disciplinary actions. It includes key variables such as state, district, and school names, along with student demographics. The collection gathers key civil rights metrics to assess access to and obstacles in educational opportunities from early childhood through grade 12. The Office for Civil Rights (OCR) uses these data from public school districts to investigate discrimination complaints, evaluate compliance with federal civil rights laws, conduct proactive reviews, and offer policy guidance and technical support to schools and districts.

Variable	Description	Data Type
HBALLEGATIONS_SEX	Number of reported harassment/bullying allegations based on sex or gender	int64
HBALLEGATIONS_RAC	Number of reported harassment/bullying allegations based on race	int64
HBALLEGATIONS_DIS	Number of reported harassment/bullying allegations based on disability	int64
HBALLEGATIONS_REL	Number of reported harassment/bullying allegations based on religion	int64
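HBREPORTED_RAC_HI_M	Number of Hispanic male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_HI_F	Number of Hispanic female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_AM_M	Number of American Indian/Alaska Native male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_AM_F	Number of American Indian/Alaska Native female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_AS_M	Number of Asian male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_AS_F	Number of Asian female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_HP_M	Number of Native Hawaiian/Pacific Islander male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_HP_F	Number of Native Hawaiian/Pacific Islander female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_BL_M	Number of Black male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_BL_F	Number of Black female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_WH_M	Number of White male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_WH_F	Number of White female students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_TR_M	Number of Two or More Races male students reported as harassed or bullied on the basis of race	int64
HBREPORTED_RAC_TR_F	Number of Two or More Races female students reported as harassed or bullied on the basis of race	int64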

Distribution of key variables

Harassment Type

Harassment grouped by race

Data Set 2: School Support

Description:

The school support dataset contains information on the number of full-time equivalent (FTE) counselors, psychologists, and security guards at individual schools. After removing negative and missing values, we found that most schools have very few support staff, with distributions heavily skewed toward zero. While the majority of schools report no psychologists or guards, a small number of schools report larger support teams. The presence of a few schools with unusually high staff numbers indicates variation likely related to school size or reporting differences.

Variable	Description	Data Type
FTECOUNSELORS	Number of FTE school counselors	Decimal
FTESERVICES_PSY	Number of FTE psychologists	Decimal
FTESECURITY_GUA	Number of FTE security guards	Decimal

Findings about key variables

   counselors     psychologists      security_guards   
 Min.   :  0.00   Min.   :  0.0000   Min.   :  0.0000  
 1st Qu.:  0.20   1st Qu.:  0.0000   1st Qu.:  0.0000  
 Median :  1.00   Median :  0.0000   Median :  0.0000  
 Mean   :  1.35   Mean   :  0.4033   Mean   :  0.3739  
 3rd Qu.:  2.00   3rd Qu.:  0.6000   3rd Qu.:  0.0000  
 Max.   :280.00   Max.   :175.0000   Max.   :100.0000  
 NA's   :99       NA's   :3          NA's   :2468

# A tibble: 3 × 8
  variable        count missing  mean median    sd   min   max
  <chr>           <int>   <int> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 counselors      98010      99  1.35      1  2.1      0   280
2 psychologists   98010       3  0.4       0  1.21     0   175
3 security_guards 98010    2468  0.37      0  1.42     0   100

- Counselors are more consistently present across schools (median of 1 FTE).

- Psychologists and especially security guards are much less common: both have a median of 0, and security guards also have by far the most missing entries (2,468).

- The standard deviations exceed the means for all three variables, which suggests substantial variability across school sizes or resource allocations.
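A summary table of this shape could be produced along the following lines. This is a minimal base R sketch, not the project's actual code: the toy values are invented for illustration, and it assumes that `count` means the number of rows, including missing ones.

```r
# Hypothetical toy data (values invented for illustration only).
support <- data.frame(
  counselors      = c(0, 1, 2, NA, 1),
  psychologists   = c(0, 0, 0.5, 0, 0),
  security_guards = c(NA, 0, 0, NA, 1)
)

# Per-variable summary; NA values are excluded from the statistics
# but reported separately in `missing`.
summarise_column <- function(x) {
  c(count   = length(x),
    missing = sum(is.na(x)),
    mean    = mean(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE))
}

# One row per variable, one column per statistic.
summary_table <- t(sapply(support, summarise_column))
print(round(summary_table, 2))
```

Reporting `missing` next to the other statistics makes it clear how many schools each mean and median are actually based on.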

Data Set 3: School Enrollment

Reason to use this dataset: The Enrollment dataset contains detailed counts of student populations across multiple racial and ethnic groups at the school level. It breaks down enrollment into male and female students across seven racial and ethnic categories: Hispanic, American Indian/Alaska Native, Asian, Native Hawaiian/Pacific Islander, Black, White, and Two or More Races. These data are crucial for normalizing harassment report counts, because they provide the total population against which allegations are measured and so enable fair comparison across institutions of varying sizes and demographics. If we compared only raw counts of harassment reports, larger schools would naturally have more reports simply because they have more students. By dividing the number of harassment reports by total enrollment, we calculate harassment allegations per 100 students, which normalizes across schools of different sizes.
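The normalization described above can be sketched in base R as follows. The two-school data frame and its column names (`total_allegations`, `total_enrollment`) are invented for illustration and are not taken from the CRDC files.

```r
# Hypothetical toy data: two schools with the same number of allegations
# but very different enrollment.
schools <- data.frame(
  school            = c("Small School", "Large School"),
  total_allegations = c(5, 5),
  total_enrollment  = c(250, 2000)
)

# Allegations per 100 students. Schools with zero or missing enrollment
# get NA instead of Inf or NaN.
schools$allegations_per_100 <- ifelse(
  !is.na(schools$total_enrollment) & schools$total_enrollment > 0,
  100 * schools$total_allegations / schools$total_enrollment,
  NA_real_
)

print(schools)
```

Both schools report five allegations, but the rate is 2 per 100 students at the small school and 0.25 per 100 at the large one, which is the difference that raw counts hide.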

Variable	Description
ENR_HI_M	Number of Hispanic Male Students
ENR_HI_F	Number of Hispanic Female Students
ENR_AM_M	Number of American Indian/Alaska Native Male Students
ENR_AM_F	Number of American Indian/Alaska Native Female Students
ENR_AS_M	Number of Asian Male Students
ENR_AS_F	Number of Asian Female Students
ENR_HP_M	Number of Native Hawaiian/Pacific Islander Male Students
ENR_HP_F	Number of Native Hawaiian/Pacific Islander Female Students
ENR_BL_M	Number of Black Male Students
ENR_BL_F	Number of Black Female Students
ENR_WH_M	Number of White Male Students
ENR_WH_F	Number of White Female Students
ENR_TR_M	Number of Two or More Races Male Students
ENR_TR_F	Number of Two or More Races Female Students

Distribution of key variables

# A tibble: 14 × 2
   Variable     MissingRate
   <chr>              <dbl>
 1 SCH_ENR_HI_M        1.92
 2 SCH_ENR_HI_F        1.92
 3 SCH_ENR_AM_M        1.92
 4 SCH_ENR_AM_F        1.92
 5 SCH_ENR_AS_M        1.92
 6 SCH_ENR_AS_F        1.92
 7 SCH_ENR_HP_M        1.92
 8 SCH_ENR_HP_F        1.92
 9 SCH_ENR_BL_M        1.91
10 SCH_ENR_BL_F        1.92
11 SCH_ENR_WH_M        1.92
12 SCH_ENR_WH_F        1.93
13 SCH_ENR_TR_M        1.92
14 SCH_ENR_TR_F        1.92

# A tibble: 7 × 2
  Race            Total_Enrollment
  <chr>                      <dbl>
1 White                   21947054
2 Hispanic                13914099
3 Black                    7197685
4 Asian                    2611056
5 TwoOrMore                2284917
6 AmericanIndian            456405
7 PacificIslander           185273

Merge and Clean Data Process

data_clean_process.qmd

data_clean_process.R

This project involved merging and cleaning multiple datasets related to school enrollment, harassment reports, and school support services. In the original datasets, several variables contain reserve codes represented by negative values. These codes are not actual data values but placeholders for missing, suppressed, or logic-dependent responses. The specific reserve codes and their meanings are:

Reserve Code Value	Definition
-3	Skip Logic or Processing Failure
-4	Missing Optional Data
-5	Action Plan/Quick Plans
-6	Force Certified
-9	Not Applicable/Skipped
-12	Suppressed for Privacy Protections
-13	Missing DIND Skip Logic

To ensure accurate analysis, these values were converted to NA, as they do not represent valid quantitative information. Including them would lead to misleading statistics and incorrect interpretations in visualizations and summaries.
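The conversion to NA might look roughly like the sketch below. It is an illustration in base R rather than the project's actual script: the helper name `replace_reserve_codes` and the toy rows are hypothetical, and only the code values themselves come from the table above.

```r
# Reserve codes from the table above.
reserve_codes <- c(-3, -4, -5, -6, -9, -12, -13)

# Hypothetical helper: replace reserve codes with NA in every numeric column,
# leaving character columns untouched.
replace_reserve_codes <- function(df, codes = reserve_codes) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- lapply(df[numeric_cols], function(x) {
    x[x %in% codes] <- NA
    x
  })
  df
}

# Toy rows (values invented for illustration only).
raw <- data.frame(
  SCH_NAME     = c("A", "B", "C"),
  SCH_ENR_HI_M = c(120, -9, 35),
  SCH_ENR_HI_F = c(-3, 110, 0)
)

clean <- replace_reserve_codes(raw)
print(clean)
```

Matching against the explicit list of codes, rather than treating every negative number as missing, means an unexpected negative value would remain visible for inspection instead of being silently dropped.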

The overall script includes:

- Reading raw datasets from the dataset/ directory

- Replacing reserve codes (e.g., -3, -4, -6) with NA

- Removing invalid or fully-missing rows

- Renaming variables for clarity (e.g., removing SCH_ prefix)

- Merging harassment, enrollment, and support data into a single cleaned dataset

- Saving the output as .rds and .csv files
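The steps listed above can be sketched end to end in base R. This is a hedged illustration, not the contents of data_clean_process.R: the toy tables stand in for the files read from dataset/, the join key `COMBOKEY`, the `SCH_`-prefixed raw column names, the inner join, and the output file names are all assumptions, and the output is written to a temporary directory.

```r
# Toy stand-ins for the three raw tables (values invented for illustration).
# COMBOKEY is assumed here to be the shared school identifier.
harassment <- data.frame(COMBOKEY = c("001", "002", "003"),
                         SCH_HBALLEGATIONS_SEX = c(2, -9, 0),
                         SCH_HBALLEGATIONS_RAC = c(1, 3, -9))
enrollment <- data.frame(COMBOKEY = c("001", "002", "003"),
                         SCH_ENR_HI_M = c(40, 55, -3),
                         SCH_ENR_HI_F = c(42, 50, -3))
support    <- data.frame(COMBOKEY = c("001", "002", "003"),
                         SCH_FTECOUNSELORS = c(1, 2.5, 0))

reserve_codes <- c(-3, -4, -5, -6, -9, -12, -13)

clean_one <- function(df, key = "COMBOKEY") {
  value_cols <- setdiff(names(df), key)
  # Replace reserve codes with NA in every non-key column.
  for (col in value_cols) {
    df[[col]][df[[col]] %in% reserve_codes] <- NA
  }
  # Drop rows in which every non-key value is missing.
  all_missing <- rowSums(!is.na(df[value_cols])) == 0
  df <- df[!all_missing, , drop = FALSE]
  # Rename for clarity by removing the SCH_ prefix.
  names(df) <- sub("^SCH_", "", names(df))
  df
}

tables <- lapply(list(harassment, enrollment, support), clean_one)

# Inner join on the key: only schools present in all three tables are kept.
merged <- Reduce(function(x, y) merge(x, y, by = "COMBOKEY"), tables)
print(merged)

# Save in both formats (hypothetical file names, temporary directory).
out_dir <- tempdir()
saveRDS(merged, file.path(out_dir, "merged_clean.rds"))
write.csv(merged, file.path(out_dir, "merged_clean.csv"), row.names = FALSE)
```

In this toy run, school 003 drops out because its enrollment row is entirely reserve codes. Whether such schools should be dropped or kept with missing enrollment (for example with `all = TRUE` in `merge`) is a design choice that depends on the analysis.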