Starting today, I will keep notes on what I study. 2020.3.5
Handling missing values: multiple imputation
1 Basic idea
Multiple imputation uses Monte Carlo simulation (MCMC) to fill in the missing values of the original data, producing several complete data sets. A statistical model such as linear regression (lm) or a generalized linear model (glm) is fitted to each of these data sets, the fitted models are then pooled into a single result, the imputation model is evaluated, and a complete data set is returned. In R this is mainly done with the mice() function from the mice package.
The general steps are as follows:
Data set with missing values -> MCMC imputation produces several complete data sets -> a model (glm, lm) is fitted to each imputed data set -> the models are combined (pool) -> the imputation model is evaluated (t statistics of the model coefficients) -> a complete data set is output (complete)
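The steps above can be sketched in a few lines. This is only a minimal illustration, assuming the mice package is installed; it uses the small nhanes data set that ships with mice (not the data from these notes), and the model chl ~ age + bmi is chosen purely for the example.

```r
library(mice)

# nhanes: example data shipped with mice (columns age, bmi, hyp, chl; contains NAs)
imp <- mice(nhanes, m = 5, seed = 1234, printFlag = FALSE)  # impute 5 data sets

fit <- with(imp, lm(chl ~ age + bmi))  # fit the same model in each imputed data set
pooled <- pool(fit)                    # combine the 5 fitted models
summary(pooled)                        # pooled coefficients with test statistics

completed <- complete(imp, 1)          # extract the first completed data set
sum(is.na(completed))                  # count the missing values that remain
```

Each function maps to one step of the chain: mice() imputes, with() fits, pool() combines, summary() evaluates, and complete() returns a filled-in data set.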
2 Basic call format of the mice function
In R, enter help(mice) for more information.
mice(data, m = 5, method = vector("character", length = ncol(data)), ...)
data: a matrix or data frame containing the incomplete data, with missing values coded as NA
m: the number of imputed data sets to generate; the default is 5
method: a single string, or a character vector with one entry per column of the data, giving the imputation method for each column. A single string applies the same method to every column; a vector lets different columns use different methods. The default method depends on the type of the target column and is set by the defaultMethod argument
seed: an integer that is passed as the argument to set.seed(); the default is NA
defaultMethod: a vector of four strings giving the default imputation method for each type of target column: "pmm" (predictive mean matching) for numeric data, "logreg" (logistic regression) for factors with 2 levels, "polyreg" (polytomous logistic regression) for unordered factors with more than 2 levels, and "polr" (proportional odds model) for ordered factors with more than 2 levels.
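To see how method and defaultMethod interact, the following sketch may help. It assumes the mice package is installed and uses its bundled nhanes2 data set (not the data from these notes). The choice of "norm" for the last column is only for illustration.

```r
library(mice)

# nhanes2 ships with mice: age (factor), bmi (numeric),
# hyp (factor with 2 levels), chl (numeric)

# maxit = 0 sets up the imputation without running any iterations,
# which is enough to see the default method chosen for each column
ini <- mice(nhanes2, maxit = 0, printFlag = FALSE)
ini$method

# One method per column; "" means "do not impute this column"
# (age has no missing values in nhanes2)
imp <- mice(nhanes2, m = 3,
            method = c("", "pmm", "logreg", "norm"),
            seed = 1234, printFlag = FALSE)
imp$method
```

Passing a single string such as method = "pmm" would instead apply the same method to every column that has missing values.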
3 Application example
> install.packages("DMwR")
> library(DMwR)
> data(algae)
> sum(is.na(algae))   # count the missing values
> install.packages("mice")
> library(mice)
# Create the imp object from columns 4 to 11 of the algae data set, which contain missing values; by default 5 imputed data sets are generated
> imp <- mice(algae[, 4:11], seed = 1234)
# fit sets the statistical method. with() evaluates lm() inside each imputed data set, so no data argument is given and the predictors are written out explicitly
> fit <- with(imp, lm(mxPH ~ mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla))
> pooled <- pool(fit)   # combine the results of the five analyses
> options(digits = 3)   # print results with 3 significant digits
> summary(pooled)   # show the pooled statistics
> imp$imp$Cl   # view the imputed values of variable Cl in the five imputed data sets
> imp$method   # view the imputation method used for each variable
# complete() returns any one of the five imputed data sets; here the first one is returned
> algae_complete <- complete(imp, action = 1)
> sum(is.na(algae_complete))   # check whether the first imputed data set still has missing values
# par() sets R graphics parameters; here the output is laid out as 3 rows by 3 columns
> par(mfrow = c(3, 3))
# stripplot() from the mice package plots the distribution of each variable, including the imputed values
> stripplot(imp, pch = c(1, 8), col = c("grey", "1"))
> par(mfrow = c(1, 1))   # after plotting, reset the window to a 1 by 1 layout
4 FAQ (problems I ran into while studying)
1) What does the "." in lm() stand for?
In a call such as lm(mxPH ~ ., data = algae[, 4:11]), the formula mxPH ~ . tells lm() to use mxPH as the response variable and every other column of the given data frame as an explanatory variable in the linear regression. This shortens the code when there are many variables, but the regression equation changes silently whenever the columns of the data change, a risk similar to that of select * in SQL.
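The behaviour of the dot can be checked with base R alone. The sketch below uses the built-in mtcars data set instead of algae, so it runs without extra packages; the choice of columns is arbitrary.

```r
# Built-in data set, restricted to four columns
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]

fit_dot      <- lm(mpg ~ ., data = d)              # "." = all other columns of d
fit_explicit <- lm(mpg ~ wt + hp + qsec, data = d) # same predictors written out

# Both calls estimate the same coefficients
same <- all.equal(coef(fit_dot), coef(fit_explicit))
same

# Dropping a column from the data silently changes the model behind "."
fit_fewer <- lm(mpg ~ ., data = d[, c("mpg", "wt", "hp")])
names(coef(fit_fewer))
```

The last call no longer includes qsec, even though the formula text is unchanged, which is exactly the risk described above.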
2) What does seed = 1234 do in the imputation?
seed is an integer that is passed as the argument to set.seed(), which sets the random number seed. The number in parentheses is only a label: set.seed(100) should not be read as "one hundred random numbers" but as "use the pseudo-random sequence identified by 100". A given seed always produces the same pseudo-random sequence. The main purpose of the function is to make a simulation repeatable. Code that draws random numbers normally gives a different result on every run, so when the same result has to be reproduced, call set.seed() first. Repeatable results matter a great deal when debugging a program or giving a presentation, so setting a random number seed is necessary.
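A small base R check of this behaviour, as a sketch; the seeds 100 and 101 and the use of rnorm() are arbitrary choices for the illustration.

```r
set.seed(100)
a <- rnorm(3)    # three draws after setting seed 100

set.seed(100)
b <- rnorm(3)    # same seed again, so the same three draws

identical(a, b)  # TRUE: the sequence is reproduced exactly

set.seed(101)
d <- rnorm(3)    # a different seed starts a different sequence
identical(a, d)
```

The same idea applies to mice(): calling it twice on the same data with the same seed should give the same imputed values.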