Data analysis
“Statistical analysis allows us to put limits on our uncertainty, but not to prove anything.”—Douglas G. Altman (in Altman DG. Practical Statistics for Medical Research. London, UK: Chapman & Hall; 1991.)
Data Science: Courses and books
Essential
Data science: R for Data Science
Data visualization video part1 + video2 and book ggplot2: Elegant Graphics for Data Analysis (3e)
Statistics - A Full Lecture to learn Data Science (4h video)
More material
Fundamentals of Data Visualization - (Wilke)
Quick Start Guide for R Link and courses
ANALYSIS PLAN
Aim:
Promote structured, targeted data analysis for dental researchers using R.
Requirements:
Create and finalize an analysis plan before initiating data analyses with R.
Documentation:
Analysis Plan Guidelines (per study type; see next section)
Responsibilities:
Executing Researcher:
Develop the analysis plan detailing the research question and the intended analysis steps.
Ensure the plan is signed and dated by the Principal Investigator (PI).
Project Leaders:
Guide the executing researcher in establishing the analysis plan.
How to Create an Analysis Plan:
Preliminary Review:
Describe the research question based on literature exploration.
Highlight the study's relevance and what new information it will provide.
Example Research Question:
"Is there a significant link between smoking and periodontal disease in dental clinic attendees in a specific region?"
Types of Studies:
Exploratory: The analysis may be adjusted based on what is found in the data. State clearly in reports that the study is exploratory.
Hypothesis-testing: Pre-specify intended analyses, including population details, subgroups, stratifications, and tests.
Randomized Controlled Trial (RCT): The analysis plan should be strictly adhered to.
Data Cleaning:
Create a separate script for the data cleaning process, so that you keep both the raw data and the clean data, with every cleaning step documented in the script. See https://cghlewis.com/blog/data_clean_01/
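A separate cleaning script could look roughly like the sketch below. Everything in it is an assumption made for illustration: the file names, the variables, the use of -9 as a missing-value code, and the temporary folder that stands in for a project directory. The point is only that the raw file is read but never overwritten, and that each cleaning step is a documented line of code.

```r
# Hypothetical raw data standing in for an exported clinic file.
raw <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  age    = c(34, 51, 51, -9, 47),            # -9 is an assumed missing-value code
  smoker = c("yes", "No", "No", "YES ", "no"),
  stringsAsFactors = FALSE
)
raw_path   <- file.path(tempdir(), "dental_raw.csv")    # assumed location
clean_path <- file.path(tempdir(), "dental_clean.csv")  # assumed location
write.csv(raw, raw_path, row.names = FALSE)

# --- Cleaning script starts here; the raw file is only ever read ---
dat <- read.csv(raw_path, stringsAsFactors = FALSE)

# Step 1: drop exact duplicate rows
dat <- dat[!duplicated(dat), ]

# Step 2: recode the missing-value code to NA
dat$age[dat$age == -9] <- NA

# Step 3: harmonise spelling and spacing, then store as a factor
dat$smoker <- factor(tolower(trimws(dat$smoker)), levels = c("no", "yes"))

# Step 4: save the clean data under a different name
write.csv(dat, clean_path, row.names = FALSE)
```

Keeping the raw and clean files apart means any cleaning decision can be revisited by rerunning the script.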
Exploratory Analysis:
Can be part of the plan to guide the final analysis.
Document findings and decisions, making them reproducible (using R).
Formulating a Concrete Research Question:
Based on a literature review, define the precise question to be addressed.
Defining Analysis Components:
Describe primary and secondary outcomes, necessary data, and statistical techniques.
Consider population (or subgroups), groups for comparison, primary and secondary endpoints, variables (dependent and independent), potential confounders or effect modifiers, and techniques for handling missing data.
Planning the Analysis:
Detail the sequence of analyses (e.g., univariable, multivariable, sub-populations).
Consider if data criteria are met for specific statistical techniques. If unsure, consult a statistician.
Design Empty Tables:
Create them before starting data analysis to determine required analyses.
Publicize Your Plan:
Consider sharing your study protocol (Zenodo, OSF, RSU Dataverse) and analysis plan publicly for transparency.
Check Reporting Guidelines:
Enhance research quality and ensure the plan aligns with standards. See next.
DATA ANALYSIS
Objective
To gain an initial understanding of the dataset and to examine the relationship between exposure and outcome.
Data Prerequisites
Examine continuous variables for their distribution.
Identify the proportion of missing values and anomalies.
Determine Cronbach’s alpha for subscales.
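As a minimal sketch of the last point, Cronbach's alpha can be computed in base R from the standard formula: k / (k - 1) × (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and the four-item subscale below are hypothetical. Dedicated packages (for example, `alpha()` in the psych package) report additional diagnostics.

```r
# Cronbach's alpha for one subscale.
# items: data frame with one row per respondent and one column per item.
# Respondents with any missing item are dropped (an assumption of this sketch).
cronbach_alpha <- function(items) {
  items <- na.omit(as.data.frame(items))
  k <- ncol(items)
  item_var  <- sum(apply(items, 2, var))  # sum of the item variances
  total_var <- var(rowSums(items))        # variance of the total score
  (k / (k - 1)) * (1 - item_var / total_var)
}

# Hypothetical 4-item subscale scored 1-5 for six respondents
subscale <- data.frame(
  q1 = c(4, 5, 2, 3, 1, 4),
  q2 = c(4, 4, 2, 3, 2, 5),
  q3 = c(5, 5, 1, 3, 1, 4),
  q4 = c(3, 5, 2, 2, 1, 4)
)
alpha <- cronbach_alpha(subscale)
round(alpha, 2)
```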
Documentation
Employed syntax for preliminary data analysis.
Documentation of missing data percentages and outliers.
Computation of Cronbach’s alpha for new subscales.
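The preliminary syntax listed above might look like the following sketch. It uses simulated data with hypothetical variable names. Outliers are flagged with the 1.5 × IQR rule, which is one common convention and not a requirement of this guideline, and normality is checked with the Shapiro-Wilk test.

```r
set.seed(1)  # simulated data; all variable names are hypothetical
dat <- data.frame(
  smoker = factor(sample(c("never", "former", "current"), 120, replace = TRUE)),
  sex    = factor(sample(c("female", "male"), 120, replace = TRUE)),
  plaque = rlnorm(120, meanlog = 0, sdlog = 0.6)  # right-skewed score
)
dat$plaque[5] <- 40  # implausible value that the check should catch

# Frequencies and a cross-tabulation for categorical variables
freq_smoker <- table(dat$smoker, useNA = "ifany")
cross_tab   <- table(dat$smoker, dat$sex, useNA = "ifany")

# Summary statistics for a continuous variable
summary(dat$plaque)

# Flag (do not delete) outliers with the 1.5 x IQR rule
q <- unname(quantile(dat$plaque, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
dat$plaque_outlier <- dat$plaque < q[1] - fence | dat$plaque > q[2] + fence
dat[dat$plaque_outlier, c("smoker", "sex", "plaque")]

# Shapiro-Wilk normality test on the raw and the log-transformed scale
shapiro_raw <- shapiro.test(dat$plaque)
shapiro_log <- shapiro.test(log(dat$plaque))
```

Calls such as `hist(dat$plaque)` and `boxplot(plaque ~ smoker, data = dat)` give the visual counterpart of these checks.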
Study Population's Baseline Characteristics
Examine characteristics based on treatment/exposure. Distinctive results from the primary data analysis (such as odds ratios and hazard ratios) should be reported.
Roles and Responsibilities
Executing Researcher: Handle the initial data analysis, which includes examining:
Distribution normality of continuous variables.
Missing values and outliers.
Cronbach’s alpha for new subscales.
Project Leaders: Offer guidance on executing the initial analysis.
Procedure
Data Inspection
Examine the data distribution. Assess frequencies for categorical variables such as age group and smoking status. Use summary statistics to evaluate continuous variables. Visuals such as boxplots or histograms help to view distributions.
Handling Anomalies
Identify outliers in continuous variables and review combinations that seem unusual. Cross-tabulations can help detect peculiar combinations of categorical variables.
Dealing with Missing Data
Properly examine and code missing values, especially when working with multiple datasets. Consulting a statistician about handling missing data can be invaluable.
Evaluating Distribution
Check whether all pertinent variables are approximately normally distributed. If they are not, transformations might be considered, though transformed variables can be harder to interpret.
Evaluation in Dental Research
Baseline Characteristics: Review the data distribution for each exposure category, considering relevant variables in dentistry, such as smoking status related to oral health. Examine differences between groups and check residual distributions.
Correlations Between Variables: Examine correlations and interactions between variables to refine dental research models. Both can be assessed with statistical tests.
Assessing Exposure and Outcome Association: Depending on your research question, different analytical methods can be applied, including logistic regression for binary outcomes, ANCOVA or linear regression for continuous outcomes, and survival analyses for time-to-event outcomes.
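For a binary outcome, the exposure-outcome step could be sketched as below. The data are simulated, and the variable names and effect sizes are hypothetical. Confidence intervals are Wald intervals computed from the standard errors, which is an assumption of this sketch; profile-likelihood intervals are an alternative.

```r
set.seed(42)  # simulated data; names and effect sizes are hypothetical
n <- 500
dat <- data.frame(
  smoker = rbinom(n, 1, 0.3),
  age    = round(rnorm(n, 50, 10))
)
# Simulate periodontitis with higher odds among smokers and with age
lp <- -3 + 0.9 * dat$smoker + 0.04 * dat$age
dat$periodontitis <- rbinom(n, 1, plogis(lp))

# Logistic regression: exposure (smoker) adjusted for age
fit <- glm(periodontitis ~ smoker + age, data = dat, family = binomial)

# Odds ratios with Wald 95% confidence intervals
est <- coef(summary(fit))
z <- qnorm(0.975)
or_table <- data.frame(
  OR    = exp(est[, "Estimate"]),
  lower = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
  upper = exp(est[, "Estimate"] + z * est[, "Std. Error"])
)
round(or_table, 2)
```

A continuous outcome would use `lm()` with the same formula structure.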
Post-hoc & sensitivity analyses
Objective
Ensure appropriate and correct execution of post-hoc and sensitivity analyses in dental research studies.
Requirements
Conduct post-hoc and sensitivity analyses as and when necessary.
Documentation
Data Analysis Strategy: Detail the post-hoc tests and sensitivity analyses that will be undertaken.
Syntax: Record the methods and results of the post-hoc tests and sensitivity analyses.
Logbook: Maintain a record of decisions informed by the outcomes of post-hoc tests and sensitivity analyses.
Roles and Responsibilities
Executing Researcher:
Ensure post-hoc tests and sensitivity analyses are conducted appropriately.
Collaborate with supervisors to discuss findings.
Record all findings and decisions as detailed in the documentation section.
Ensure the research manuscript aptly presents results from post-hoc tests and sensitivity analyses.
Project Leaders:
Review the data analysis plan, especially focusing on post-hoc tests and sensitivity analyses.
Discuss sensitivity analysis outcomes with the executing researcher.
Procedure
Post-hoc Analyses in Dental Research:
Conduct post-hoc analyses when a statistically significant association emerges between the dependent variable and a categorical independent variable with more than two categories.
Example: If examining the association between periodontal disease and smoking status in patients with type 2 diabetes, and a significant link is identified, an analysis of variance (ANOVA) followed by the Tukey test might be used to identify which smoking categories differ from each other.
Also consider subgroup analyses to determine whether the association between two variables differs according to another factor, such as age.
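A minimal sketch of the ANOVA and Tukey step follows. It uses simulated data, and clinical attachment loss (`cal`) is a hypothetical continuous measure of periodontal disease chosen only for illustration.

```r
set.seed(7)  # simulated data; group means are hypothetical
dat <- data.frame(
  smoking = factor(rep(c("never", "former", "current"), each = 40),
                   levels = c("never", "former", "current")),
  cal = c(rnorm(40, 2.0, 0.8), rnorm(40, 2.4, 0.8), rnorm(40, 3.2, 0.8))
)

# Overall test: does mean attachment loss differ across smoking categories?
fit <- aov(cal ~ smoking, data = dat)
summary(fit)

# Post-hoc test: which pairs of categories differ?
# TukeyHSD adjusts the p-values for the multiple pairwise comparisons.
tukey <- TukeyHSD(fit, "smoking")
tukey$smoking
```

Each row of `tukey$smoking` is one pairwise comparison, giving the difference in means, its confidence limits, and the adjusted p-value.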
Sensitivity Analyses in Dental Research:
Sensitivity analyses evaluate the stability of primary results against variations in assumptions or methodologies.
Example: In a study of the association between oral health and quality of life, a sensitivity analysis might check the impact of different quality-of-life score thresholds. Another study might use a sensitivity analysis to compare results from the main analysis with those obtained after excluding participants with missing data.
It is crucial to label sensitivity analyses clearly as such in reports. They should complement, not substitute for, the primary analysis. Document decisions derived from post-hoc tests and sensitivity analyses properly.
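The threshold example can be sketched as follows: the same model is refitted under alternative cut-offs and the estimates are collected side by side. The data are simulated, and the cut-offs (55, 60, 65), variable names, and effect sizes are hypothetical.

```r
set.seed(11)  # simulated data; variable names and cut-offs are hypothetical
n <- 400
dat <- data.frame(caries = rpois(n, 3))
dat$qol <- round(70 - 3 * dat$caries + rnorm(n, 0, 10))  # higher = better

# Refit the same model under alternative definitions of "poor quality of life"
thresholds <- c(primary = 60, lower = 55, higher = 65)
sens <- sapply(thresholds, function(cut) {
  poor_qol <- as.integer(dat$qol < cut)
  fit <- glm(poor_qol ~ caries, data = dat, family = binomial)
  exp(coef(fit)[["caries"]])  # odds ratio per additional carious tooth
})
round(sens, 2)
```

If the odds ratios are similar across cut-offs, the primary result is robust to that choice. If they diverge, the report should say so.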
Data analysis documentation
Objective
To reinforce the reproducibility of dental research analyses using R.
Requirements
Comprehensive documentation of data analysis within an R script (log file) ensuring full reproducibility of pertinent data analyses.
Documentation Protocol
R Script Log File should include:
The explicit research inquiry or analysis purpose.
The source databases utilized in the analyses (like CSV or Excel files).
Complete R code executed.
Inclusion of 'README' Section: Add this to your data files or incorporate a standalone descriptive document. This aids in reproducing results during scenarios like audits, inspections, or journal reviews and ensures data file interoperability.
Roles & Responsibilities
Executing Researcher:
Thoroughly record all procedural steps during data analysis in an R script.
Engage with supervisors ensuring the analysis is transparent and reproducible.
Adhere to all documentation protocol steps.
Ensure the R script accompanies the manuscript so that peers can reproduce the analysis.
Project Leaders:
Periodically review and deliberate on the data analysis using the R script log file.
Guidelines
Ensuring easy reproducibility of data analyses via R in dental research is paramount. Adopt the practice of maintaining an R script detailing each analysis. The script should chronologically begin with the research question and conclude with a tentative or final answer.
The R script serves as a thorough record, facilitating easy retrieval and reproduction of analyses.
Always specify the data file's name and path (e.g., using ‘read.csv’ in R). This clarity helps identify related files and their storage locations.
R scripts should capture every statistical test undertaken, effectively acting as an analysis logbook. Arrange your code logically: start with variable definitions, followed by specific analyses, and group them by tables or figures (e.g., variables, then table 1 analysis, table 2, and so forth).
Pro Tip: Regularly comment on your R scripts (use # followed by text). These comments bolster your data analysis documentation and simplify result reproduction and code reuse.
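A script organised along these lines might be laid out as in the sketch below. The research question, file name, and variables are placeholders. The stand-in file is created only so the sketch runs on its own, and the closing `sessionInfo()` call, which records the R and package versions, is a suggestion rather than part of this guideline.

```r
# ==== 0. Header =====================================================
# Research question: Is smoking associated with periodontal disease?
# Analyst, date:     (fill in)
# Input file:        hypothetical; replace with the real name and path
data_file <- file.path(tempdir(), "dental_data.csv")

# For this sketch only: create a small stand-in file so the script runs
write.csv(data.frame(smoker        = c(0, 1, 1, 0, 1, 0),
                     periodontitis = c(0, 1, 0, 0, 1, 1)),
          data_file, row.names = FALSE)

# ==== 1. Load data and define variables =============================
mydata <- read.csv(data_file)
mydata$smoker <- factor(mydata$smoker, levels = c(0, 1),
                        labels = c("no", "yes"))

# ==== 2. Table 1: descriptive statistics ============================
table1 <- table(smoker = mydata$smoker,
                periodontitis = mydata$periodontitis)
table1

# ==== 3. Table 2: main analysis (add the model code here) ===========

# ==== 4. Record the software environment ============================
sessionInfo()
```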
Handling missing data
Objective
Equip dental researchers with a structured protocol and insights on addressing missing data.
Requirements
Transparent documentation of choices made related to missing data.
If data imputation occurs, its methods must be clearly outlined.
Documentation Protocol
Research Protocol: Strategies to prevent missing data during data collection.
Data Analysis Plan: Elucidate the causes of missing data, its management strategy, and subsequent missing data analyses per specific questionnaire or instrument.
Syntax: Record missing data analyses and any eventual imputation methods. Reference datasets with and without imputed values.
Logbook: Document decisions and rationale for handling missing data.
Roles & Responsibilities
Executing Researcher:
Comprehend the ramifications of neglecting missing data.
Collaborate with supervisors and experts on data handling strategies.
Strictly adhere to the documentation protocol.
Project Leaders:
Encourage the executing researcher to engage with experts and resources.
Review the research protocol and data analysis plan concerning missing data.
Ensure rigorous documentation by the executing researcher.
Research Assistant:
In the absence of guidelines, consult the executing researcher about handling missing data.
Guidelines
1. Understanding Missing Data
Missing data is prevalent in dental research. Its handling is contingent on the volume, nature, cause, and randomness of the missing data.
Ignoring missing data can result in reduced sample size, inflated standard errors, decreased statistical power, and the introduction of bias. Before analysis, always inspect missing data; never casually omit incomplete entries.
2. Addressing Missing Data Across Research Phases:
Data Preparation:
Clarify questionnaire items.
Document reasons for missing data.
Utilize digital platforms for data collection to minimize gaps.
Pre-emptively test instruments to mitigate technical issues.
Data Collection:
Regularly verify data completeness. If gaps arise, strive for completion via raw data checks or respondent engagement.
Data Processing:
Identify missing data's volume, nature, and type.
3. Techniques for Handling Missing Data:
Listwise Deletion: Omits incomplete cases, reducing sample size.
Pairwise Deletion: Excludes cases only for specific analyses.
Imputation: Statistically estimates missing values. Techniques include mean, regression, multiple imputation, etc.
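The difference between listwise and pairwise deletion can be seen in a small sketch with a correlation matrix. The eight-row dataset and its variable names are made up for illustration.

```r
# Hypothetical dataset with one missing value in each variable
dat <- data.frame(
  dmft   = c(2, 5, NA, 7, 1, 4, 6, 3),
  plaque = c(1.1, 2.3, 1.8, NA, 0.7, 1.9, 2.8, 1.5),
  age    = c(23, 45, 37, 52, NA, 41, 60, 33)
)

# Listwise deletion: only rows complete on ALL variables are used
n_listwise   <- nrow(na.omit(dat))  # 5 of 8 rows remain
cor_listwise <- cor(dat, use = "complete.obs")

# Pairwise deletion: each correlation uses all rows complete for that pair
n_pair_dmft_plaque <- sum(complete.cases(dat[, c("dmft", "plaque")]))  # 6 rows
cor_pairwise <- cor(dat, use = "pairwise.complete.obs")
```

Pairwise deletion keeps more data per estimate, but each cell of the matrix is then based on a different subset of participants. Report this when the results are presented.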
Example in R for identifying missing data:
Identifying Percentage of Missing Values for Each Variable
In a dental research dataset using R, the summary() function reports the number of missing values (NA's) for numeric and factor variables. To obtain the percentage of missing values for every variable, combine is.na() with colMeans().
# Load data
mydata <- read.csv("dental_data.csv")
# Count missing values per variable (shown as NA's in the output)
summary(mydata)
# Percentage of missing values for each variable
round(colMeans(is.na(mydata)) * 100, 1)
Executing the above yields a summary of the dataset that includes the number of missing values, followed by the percentage of missing values for each variable.
Imputing Missing Data
To handle missing data in a dental research dataset with R, the mice package is a robust solution. This package supports a variety of imputation techniques, such as mean imputation, regression imputation, and multiple imputation.
# Load data
mydata <- read.csv("dental_data.csv")
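# Impute missing values using mean imputation (numeric variables only).
# Mean imputation is deterministic, so one imputed dataset (m = 1)
# and one iteration (maxit = 1) are sufficient.
library(mice)
imp <- mice(mydata, method = "mean", m = 1, maxit = 1)
# Extract completed dataset with imputed values
mydata_imputed <- complete(imp)
The steps above replace each missing numeric value with the mean of its variable and produce a completed dataset. Mean imputation understates variability and can distort associations, so treat it as an illustration. Multiple imputation is generally preferred for final analyses.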
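Multiple imputation, which the guidelines above list among the available techniques, follows an impute, analyse, pool workflow in mice. The sketch below uses simulated data with hypothetical variable names. Predictive mean matching (`"pmm"`), five imputations, and the linear model are illustrative choices, not recommendations for any particular study.

```r
library(mice)

set.seed(123)  # simulated data; variable names are hypothetical
n <- 150
dat <- data.frame(age = rnorm(n, 50, 10), smoker = rbinom(n, 1, 0.3))
dat$cal <- 1 + 0.03 * dat$age + 0.8 * dat$smoker + rnorm(n, 0, 0.7)
dat$cal[sample(n, 30)] <- NA  # make 20% of the outcome missing
dat$age[sample(n, 15)] <- NA  # make 10% of age missing

# 1. Impute: create m = 5 completed datasets with predictive mean matching
imp <- mice(dat, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# 2. Analyse: fit the same model in each imputed dataset
fit <- with(imp, lm(cal ~ age + smoker))

# 3. Pool: combine the five sets of estimates using Rubin's rules
pooled <- summary(pool(fit))
pooled
```

The pooled standard errors reflect the uncertainty due to the missing values, which single imputation methods ignore.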
Deleting Missing Data
To remove incomplete cases from a dental research dataset in R, use the na.omit() function. It performs listwise deletion: every row containing at least one missing value is dropped.
# Load data
mydata <- read.csv("dental_data.csv")
# Delete rows with missing values
mydata_clean <- na.omit(mydata)
After executing the above, the outcome will be a dataset devoid of rows containing missing values.