Data

We describe the sources of our data and the cleaning process.

This comes from the file data.qmd.

Your first steps in this project will be to find data to work on.

I recommend finding data that interests you and that you are knowledgeable about. For example, choosing a dataset about video games when you have no interest in video games would be a poor fit. I also recommend looking for data related to current events, social justice, and other areas that have real-world impact.

Initially, you will study one dataset, but later you will need to combine it with another dataset. For this reason, I recommend finding data that has date and/or location components. These types of data lend themselves to interesting visualizations and analysis, and they can be combined with other data that also has a date or location variable. Census data, weather data, and economic data are all relatively easy to combine with other data that has time/location components.

What makes a good data set?

  • Data you are interested in and care about.
  • Data where there are a lot of potential questions that you can explore.
  • A data set that isn’t completely cleaned already.
  • Multiple sources for data that you can combine.
  • Some type of time and/or location component.

Where to keep data?

Below 50 MB: in the dataset folder.

Above 50 MB: in the dataset-ignore folder, which you will have to create manually. This folder will be ignored by git, so you'll have to manually sync these files across your team.

Sharing your data

For small datasets (under 50 MB), you can use the dataset folder, which is tracked by git and pushed to GitHub. Stage and commit the files just like you would any other file.

For larger datasets, you'll need to create a new folder in the project root directory named dataset-ignore. This folder will be ignored by git (based on the .gitignore file in the project root directory), which will help you avoid issues with GitHub's file size limits. Your team will have to manually make sure the data files in dataset-ignore are synced across team members.

Your clean_data.R file in the scripts folder is where you will import the raw data that you download, clean it, and write the .rds file(s) (using write_rds) that you'll load in your analysis page. If desired, you can have multiple scripts that produce different derived datasets; just make sure to link to them on this page.

You should never use absolute paths (e.g., /Users/danielsussman/path/to/project/ or C:\MA415\Final_Project\). Instead, use the here function from the here package to avoid path problems.
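
For concreteness, here is a minimal sketch of what a clean_data.R script could look like. The file name raw_data.csv, the column names, and the cleaning steps are placeholders, not requirements; adapt them to your own raw data.

```r
# scripts/clean_data.R
# Minimal sketch only: file name, column names, and cleaning steps below
# are placeholders -- replace them with whatever your raw data requires.
library(tidyverse)
library(here)

# Read the raw file from the dataset folder using a project-relative path
raw <- read_csv(here("dataset", "raw_data.csv"))

# Example cleaning: rename variables, recode a factor, drop missing dates
clean <- raw |>
  rename(date = Date, region = Region) |>
  mutate(region = as_factor(region)) |>
  drop_na(date)

# Write the derived data set that the analysis pages will load
write_rds(clean, here("dataset", "clean_data.rds"))
```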

Clean data script

The idea behind this file is that someone coming to your website could largely replicate your analyses after running this script on the original data sets to clean them. This file might create a derivative data set that you then use for your subsequent analysis. Note that you don't need to run this script from every post/page. Instead, you can load in the results of this script, which will usually be .rds files. In your data page you'll describe how these results were created. If you have a very large data set, you might save smaller data sets that you can use for exploration purposes. To link to this file, you can use [cleaning script](/scripts/clean_data.R), which appears as cleaning script.
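
For example, a post or page can load the derived data directly rather than re-running the cleaning script. This assumes the hypothetical file name from the sketch above:

```r
library(tidyverse)
library(here)

# Load the derived data produced by scripts/clean_data.R
clean <- read_rds(here("dataset", "clean_data.rds"))
```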


Rubric: On this page

You will

  • Describe where/how to find data.
    • You must include a link to the original data source(s). Make sure to provide attribution to those who collected the data.
    • Why was the data collected/curated? Who put it together? (This is important: if you don’t know why the data was collected, then it might not be a good dataset to work with.)
  • Describe the different data files used and what each variable means.
    • If you have many variables then only describe the most relevant ones, possibly grouping together variables that are similar, and summarize the rest.
    • Use figures or tables to help explain the data. For example, showing a histogram or bar chart for a particularly important variable can provide a quick overview of the values that variable tends to take (see the first sketch after this rubric).
  • Describe any cleaning you had to do for your data.
    • You must include a link to your clean_data.R file.
    • Rename variables and recode factors to make data more clear.
    • Also, describe any additional R packages you used outside of those covered in class.
    • Describe and show code for how you combined multiple data files and any cleaning that was necessary for that (see the join sketch after this rubric).
    • Some repetition of what you do in your clean_data.R file is fine and encouraged if it helps explain what you did.
  • Organization, clarity, cleanliness of the page
    • Make sure to remove excessive warnings, use clean, easy-to-read code (without side scrolling), organize with sections, use bullets and other organization tools, etc.
    • This page should be self-contained.
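
As a sketch of the kind of quick overview figure mentioned in the rubric, a histogram of one important variable might look like the following. The data file and the column name median_income are hypothetical placeholders:

```r
library(tidyverse)
library(here)

clean <- read_rds(here("dataset", "clean_data.rds"))

# Histogram of one important (hypothetical) variable to show its distribution
ggplot(clean, aes(x = median_income)) +
  geom_histogram(bins = 30) +
  labs(x = "Median household income", y = "Count")
```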
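
And here is a sketch of combining two data files on shared date/location keys, again with hypothetical file and column names; your own keys and join type may differ:

```r
library(tidyverse)
library(here)

# Hypothetical second data file with matching date and region columns
primary   <- read_rds(here("dataset", "clean_data.rds"))
secondary <- read_csv(here("dataset", "weather.csv"))

# Keep every row of the primary data and attach matching weather records
combined <- primary |>
  left_join(secondary, by = c("date", "region")) |>
  drop_na()
```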