Data organization in spreadsheets

Bianca Peterson
24 October 2017

Data Organization in Spreadsheets

This presentation is based on the Data Carpentry Spreadsheet Ecology Lesson

Introduction

  • Good data organization is the foundation of any research project
  • Structure data the way that computers need it
  • Much of your time will be spent in this 'data wrangling' stage

In this lesson, you will learn:

  • Good data entry practices - formatting data tables in spreadsheets
  • How to avoid common formatting mistakes
  • Approaches for handling dates in spreadsheets
  • Basic quality control and data manipulation in spreadsheets
  • Exporting data from spreadsheets

Questions

  • How many people have used spreadsheets in their research?
  • What kind of operations do you do in spreadsheets?
  • Which ones do you think spreadsheets are good for?
  • How many people have accidentally done something that made them frustrated or sad?

We can use spreadsheets for:

  • Data entry
  • Organizing data
  • Subsetting and sorting data
  • Statistics
  • Plotting

Problems with spreadsheets:

  • Generating tables for publications in a spreadsheet is not optimal
  • The graphical, drag and drop nature of spreadsheet programs makes it very difficult, if not impossible, to replicate your steps - When doing calculations in a spreadsheet, it's easy to accidentally apply a slightly different formula to multiple adjacent cells

Using spreadsheets for data entry and cleaning

  • Clean data prior to importation into a statistical analysis program
  • Check your data quality
  • Produce preliminary summary statistics

Formatting data tables in spreadsheets

  • Most common mistake: rely on context, notes in the margin, spatial layout of data and fields to convey information
  • Humans can (usually) interpret these things, but computers don't view information the same way
  • We have to set up our data for the computer to understand it in order to analyze it fast and efficiently

Keeping track of your analyses

  • Don't modify original dataset
  • Create a new file/tab with cleaned/analyzed data
  • Keep track of steps used for data cleaning/analysis

Structuring data in spreadsheets

  • Put all variables in columns
  • Put each observation in its own row
  • Don't combine multiple pieces of information in one cell
  • Leave raw data raw - don't change it!
  • Export cleaned data to text-based format (CSV) - anyone can use the data and it is required by most data repositories

Survey of small mammals in a desert ecosystem