Good data organization is the foundation of any research project
Structure data the way that computers need it
Much of your time will be spent in this 'data wrangling' stage
In this lesson, you will learn:
Good data entry practices - formatting data tables in spreadsheets
How to avoid common formatting mistakes
Approaches for handling dates in spreadsheets
Basic quality control and data manipulation in spreadsheets
Exporting data from spreadsheets
How many people have used spreadsheets in their research?
What kind of operations do you do in spreadsheets?
Which ones do you think spreadsheets are good for?
How many people have accidentally done something that made them frustrated or sad?
We can use spreadsheets for:
Subsetting and sorting data
Problems with spreadsheets:
Generating tables for publications in a spreadsheet is not optimal
The graphical, drag and drop nature of spreadsheet programs makes it very difficult, if not impossible, to replicate your steps - When doing calculations in a spreadsheet, it's easy to accidentally apply a slightly different formula to multiple adjacent cells
Using spreadsheets for data entry and cleaning
Clean data prior to importation into a statistical analysis program
Check your data quality
Produce preliminary summary statistics
Formatting data tables in spreadsheets
Most common mistake: rely on context, notes in the margin, spatial layout of data and fields to convey information
Humans can (usually) interpret these things, but computers don't view information the same way
We have to set up our data for the computer to understand it in order to analyze it fast and efficiently
Keeping track of your analyses
Don't modify original dataset
Create a new file/tab with cleaned/analyzed data
Keep track of steps used for data cleaning/analysis
Structuring data in spreadsheets
Put all variables in columns
Put each observation in its own row
Don't combine multiple pieces of information in one cell
Leave raw data raw - don't change it!
Export cleaned data to text-based format (CSV) - anyone can use the data and it is required by most data repositories