2 Primary Research and Writing

You have a sound study design and consensus among your team about the the study methods. Now it’s time to do the primary analyses and write up your findings. Again, the how of doing your research is outside the scope of this Handbook. In this section, I make suggestions for staying organized and working efficiently.

2.1 Set up your data science project

You have a lot of decisions to make as a researcher. Rather than the day-to-day logistics of managing a data science project—what name to give a file, where to export your data—I’d like you to focus on the intricacies of study design and data analysis. In that spirit, here I describe my data science project management system.

If you or your lab group already have a system for setting up data projects—great. If you’d like to take this system as a starting point to iterate on—fantastic, I encourage it!

2.1.1 Naming projects

Each project gets it’s own directory (aka folder), and each of those directories needs a name. I try to keep things simple and consistent, so I use this basic convention for naming projects: Setting-Exposure-Outcome. For example, a project where the setting is California, the exposure is wildfire smoke, and the outcome is preterm birth, I would name the directory: cal-wildfiresmoke-ptb.

2.1.2 Naming files

I like to keep file names short and simple. If you follow the above strategy, then information about the project is already conveyed in the directory structure itself. Also, there’s no need to write specific names for documents every project has, like the manuscript or supplementary materials. I keep these filenames simple, e.g., manuscript01 or supplementary01. With each iteration, you can simply go to the next draft number. I.e., the second draft would be manuscript02. When you ask co-authors for feedback, they can add their initials at the end of the filename (e.g., manuscript02-abc-xyz). When you’re ready to you incorporate co-author feedback, you can again simply move on to the next number (manuscript03).

2.1.3 Organizing your directory

For most data science projects in environmental health, I use this file structure to keep my code, data, and output organized:

setting-exposure-outcome        # root directory using naming convention
├── code                        # scripts to...
│   ├── 0-setup                 # ...attach packages and define global variables
│   ├── 1-data_tidying          # ...prepare and assemble datasets for analyses
│   ├── 2-exposure_assessment   # ...assess exposure to environmental hazards
│   ├── 3-analyses              # ...generate descriptive stats, fit models, etc
│   └── 4-communication         # ...generate figures, tables, other output
├── data                        # 
│   ├── raw                     # untouched datasets with original file names
│   ├── interim                 # intermediate datasets not ready for analysis
│   └── processed               # finalized datasets ready for analysis
├── output                      # 
│   ├── figures                 # save figure components here and assemble them
│   ├── memos                   # data memos - data exploration, prelim analyses
│   ├── results                 # save output from analyses here
└── └── tables                  # save compiled tables here

Download a full template for my research project file structure, including templates for the Concept Note and Project Log.

2.2 Manage your code

Writing clear and well-organized code is a fundamental data science skill. It’s likely that the way we write scientific code is going to change profoundly in the coming years, as tools like ChatGPT evolve and replace some of the work typically done by humans. What won’t change is the need to generate coherent code that you can return to as you work iteratively on your analyses. Your code also needs to be interpretable by collaborators and, if you post it to a public repository (which many journals now require), others in the field. Here are some suggestions for how to get organized at the outset of your project.

2.2.1 Tool: Style Guide

A style guide describes a system for writing consistent and readable code, including syntax, layout, and naming conventions for functions and objects. Again, the goal is to reduce the number of logistical decisions you need to make so you can focus on your research. You don’t need to invent your own system, but you should feel free to iterate on the existing ones. I use The tidyverse style guide to style my R code, and there are other options out there, including the Google R Style Guide. Other languages have their own style guides, such as the PEP 8 Style Guide for Python.

2.2.2 Tool: Version control with Git and GitHub

[Introduce version control] You already do version control with Word documents. You write drafts that you send to others, they make edits and add comments to a new version, and then you respond to those comments in another new version. When writing code, the same kind of process process can get more complicated with codebases.

This is helpful for several reasons: - Version control, so you can go back to older versions of your code if you need to. - Code sharing with your team, so that a collaborator on another machine can read and edit your code. - Open data science, so you can publish and share your code with the world to accompany your eventual paper.

Git is a widely used version control system, and GitHub is a platform for… [describe].

You can use the command line to access GitHub (here’s a tutorial), or simply download and use the GitHub Desktop app.

2.3 Write as you go

I encourage my students to work incrementally and write as they go. There are a few advantages to this. Describe the methods as you do them and while the details are fresh in your mind, rather than weeks or months after you’ve done the work. As you go, if you think of strengths or Limits

2.3.1 Keep good notes

Keep using the project log you started in the Iterative Study Design stage. Update the Meta section as needed (e.g., add new co-authors or datasets) and track decisions in the Daily Log.

2.3.2 Outline the manuscript

If you wrote a high level outline, you already have a starting point for your manuscript. You can directly turn the high level outline into a manuscript outline, add the results and discussion sections, and edit the tenses (e.g., change “we will do xyz” to “we did xyz”). As you do your analyes–maybe while waiting for a script to run–you can work on the manuscript outline.

2.3.3 Finalize your first draft

Once you’re done with your analyses… [expand]

If you haven’t already, decide on which journal you’d like to submit to This is a decision you should make with your adviser, with input from your co-authorship team. It’s not a bad idea to also choose second and third choice journals (and beyond), in the not unlikely scenario that your paper is rejected (discussed in greater depth in Submission and Peer Review section. Once you have your target journal, make sure to carefully read the journal’s Author Guide, which describes their requirements for formatting, word limits, etc.

2.3.4 Tool: Reference Manager

You’re probably using one of these already. References managers like Zotero, Mendeley, EndNote, and PaperPile (my favorite) help you organize your reading, make highlights, and take notes. Integrate your reference manager with your word processor so you can drop in-text citations, automatically generate a list of literature cited, and instantly change the citation format for different journals. The UC Berkeley Library has more information on reference managers here.

2.4 Choose a target journal

There are several factors to consider here, and it’s can be easiest to make this decision in conversation with an experienced mentor or co-author. It’s a good idea to have a list of multiple journals you’d like to submit to and an agreed-upon order, so that when you get a rejection, you can move on to the next one.

Here are some of the questions that come up while you’re thinking about where to submit your work:

Does your manuscript fit the scope of the journal? [expand]
Is the disciplinary audience the right one for you and your work? [expand]
Have they published similar work in the past? It’s a good sign if they journal has published simlar work in the past, showing that your study is in-scope. If there’s a recent publication that is highly similar to yours, however, editors may be less receptive.
What kind of formatting do they require? [expand]

Another consideration is the journal’s impact. Opinions vary widely as to how much you should pay to impact, but it is something. [metrics, what they mean, how to interpret, how much to weigh these in your decision]