Analysis Workflow in R
The idea is to break the code into four files, all stored in your project directory. These four files are to be processed in the following order.
- load.R
  - This file contains all code associated with loading the data. Usually it is a short file that reads in data from files.
- clean.R
  - This is where you do all the pre-processing of the data, such as taking care of missing values, merging data frames, and handling outliers. By the end of this file, the data should be in a clean state, ready to use. It is much better to do this here than to clean the data in the original file, because this gives you a complete record of everything done to the data.
- functions.R
  - All of the functions needed to perform the actual analysis are stored here. This file should do nothing other than define the functions you need for the analysis. (If you require your own functions for loading or cleaning the data, include them at the top of either load.R or clean.R.) In particular, functions.R should not do anything to the data. This means that you can modify this file and reload it without having to go back and repeat the loading and cleaning steps, which can take a long time to run for large data sets.
- do.R
  - Here is the code to actually do the analysis. This file uses the functions defined in functions.R to do the calculations, produce figures and tables, etc. All figures and tables that end up in your report, paper or thesis should be coded here. Never create figures and tables manually (i.e., with the mouse and menus), because then you cannot easily reproduce them.
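As a minimal sketch of how the four files might divide the work, the block below shows their contents one after another, separated by comments. The data set, the column names (`id`, `group`, `score`) and the helper `summarise_scores()` are hypothetical and stand in for whatever your project uses. A small temporary CSV file is written first only so that the example runs on its own.

```r
# Setup for this example only: write a small CSV so the sketch is runnable.
# In a real project the data file would already exist in the project directory.
data_file <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,a,10",
             "2,a,NA",
             "3,b,14",
             "4,b,16"), data_file)

# ---- load.R: read the raw data, nothing else ----
raw <- read.csv(data_file, stringsAsFactors = FALSE)

# ---- clean.R: pre-process in code, leaving the raw file untouched ----
clean <- raw[!is.na(raw$score), ]      # drop rows with a missing score
clean$group <- factor(clean$group)     # treat group as categorical

# ---- functions.R: define functions only, never touch the data ----
# Hypothetical helper: mean score per group, returned as a data frame.
summarise_scores <- function(df) {
  aggregate(score ~ group, data = df, FUN = mean)
}

# ---- do.R: run the analysis and produce tables and figures ----
score_table <- summarise_scores(clean)
print(score_table)
```

Because the cleaning happens in code rather than in the CSV itself, the raw file stays as it was delivered and every change to the data is recorded in clean.R.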
The main motivation for this setup is working with large data sets, where you don't want to have to reload the data each time you make a change to a later step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.
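The time saving comes from re-running only the files that changed. A minimal sketch of such a session follows, using base R's `source()`. Tiny stand-in versions of the four files are written to a temporary directory so that the example is self-contained; their contents (`dat`, `double_x()`, `result`) are hypothetical.

```r
# Stand-in project directory with tiny versions of the four files.
# Their contents are hypothetical; real files would hold your own code.
proj <- file.path(tempdir(), "workflow_demo")
dir.create(proj, showWarnings = FALSE)
writeLines("dat <- data.frame(x = c(1, 2, NA))", file.path(proj, "load.R"))
writeLines("dat <- dat[!is.na(dat$x), , drop = FALSE]", file.path(proj, "clean.R"))
writeLines("double_x <- function(df) df$x * 2", file.path(proj, "functions.R"))
writeLines("result <- double_x(dat)", file.path(proj, "do.R"))

# First run: process all four files in order.
for (f in c("load.R", "clean.R", "functions.R", "do.R")) {
  source(file.path(proj, f))
}

# Later, after editing only functions.R (simulated here by rewriting it),
# re-source the last two files. The loaded and cleaned data are still in
# memory, so load.R and clean.R do not need to run again.
writeLines("double_x <- function(df) df$x * 2 + 1", file.path(proj, "functions.R"))
source(file.path(proj, "functions.R"))
source(file.path(proj, "do.R"))
```

In an interactive session the same effect comes from calling `source("functions.R")` and `source("do.R")` again at the console while the cleaned data remain in the workspace.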