- 2019-08-01 16:32
- R language
- Data analysis
- Treatment of missing values

In simple data processing we usually work with complete data sets, but in practical problems we often encounter data with missing values, and handling such data properly is very important.

General steps for handling missing values

First, we list the general steps for dealing with missing values, to get an overview of the whole process.

* Identify the missing data;

* Examine the cause of the missing data;

* Delete the instances containing missing values, or impute the missing values with reasonable values.

Types of missing data

* Missing completely at random (MCAR)

* Missing at random (MAR)

* Not missing at random (NMAR)

Missing completely at random: if the missingness of a variable is unrelated to any other observed or unobserved variable, the data are missing completely at random.

Missing at random: if the missingness of a variable is related to other observed variables but not to the variable's own unobserved values, the data are missing at random.

Not missing at random: if the missing data fall into neither of the two types above, they are not missing at random.
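The three mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `age` and `income`, the 20% missingness rate, and the cut-offs are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 1000
age    <- rnorm(n, mean = 50, sd = 10)    # always observed (hypothetical variable)
income <- rnorm(n, mean = 100, sd = 20)   # variable that will receive missing values

# MCAR: every value has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of being missing depends only on the observed variable age
p_missing  <- plogis((age - 50) / 5)
income_mar <- ifelse(runif(n) < p_missing, NA, income)

# NMAR: the chance of being missing depends on the unobserved income itself
income_nmar <- ifelse(income > 120, NA, income)

# Proportion of missing values under each mechanism
c(mcar = mean(is.na(income_mcar)),
  mar  = mean(is.na(income_mar)),
  nmar = mean(is.na(income_nmar)))
```

Under MAR the missingness could in principle be predicted from `age`, whereas under NMAR it cannot be recovered from the observed data alone.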

Identify missing values

To handle missing values, we first need to identify which values are missing. In R, NA represents a missing value, NaN (not a number) represents an impossible value, and Inf and -Inf represent positive and negative infinity. The corresponding functions is.na(), is.nan() and is.infinite() identify missing, impossible and infinite values respectively, and each returns TRUE/FALSE. Note that is.na() also returns TRUE for NaN.

To count the missing values, we can apply sum() directly to the TRUE/FALSE result, since TRUE is counted as 1 and FALSE as 0. Impossible and infinite values can be counted in the same way.
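A minimal sketch of these functions on a made-up vector (the vector `x` is an assumption for illustration):

```r
x <- c(1, NA, 3, NaN, Inf, -Inf)

is.na(x)        # TRUE for NA and also for NaN
is.nan(x)       # TRUE only for NaN
is.infinite(x)  # TRUE for Inf and -Inf

# sum() treats TRUE as 1 and FALSE as 0, so it counts the matches
n_missing  <- sum(is.na(x))
n_nan      <- sum(is.nan(x))
n_infinite <- sum(is.infinite(x))
```

For a data frame, `colSums(is.na(df))` gives the same count per column.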

Explore missing values

Merely counting the missing values is not enough. This section gives several ways to explore them.

1. Tabular display of missing values

We can use a table to show the missing values. The md.pattern() function in the mice package generates a matrix-style table of the missing-value patterns. An example follows:

library(lattice)
library(mice)
data(sleep, package = "VIM")
md.pattern(sleep)

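The result is a matrix in which each row is one missing-data pattern (1 = observed, 0 = missing). The row names give the number of observations with that pattern, the last column gives the number of variables with missing values in the pattern, and the last row gives the number of missing values per variable.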
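To see what such a pattern table summarizes, here is a rough base-R sketch of a similar tabulation. It is not how md.pattern() is implemented; the data frame `df` and its columns are made up for illustration.

```r
# Small made-up data frame with missing values
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(NA, NA, 1, 2),
                 c = c(1, 2, 3, 4))

# Number of missing values per variable
missing_per_variable <- colSums(is.na(df))

# Encode each row as a pattern string: 1 = observed, 0 = missing
pattern <- apply(is.na(df), 1, function(r) paste(as.integer(!r), collapse = ""))

# Number of rows that share each pattern
pattern_counts <- table(pattern)

missing_per_variable
pattern_counts
```

Here the pattern "111" stands for a complete row, and "001" for a row in which only `c` is observed.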

2. Graphical display of missing values

The md.pattern() function gives a clear tabulation of the missing-value patterns, but a plot is often a clearer way to present them. The VIM package provides many visualization functions; let's look at some of them.

aggr() function

library(VIM)
aggr(sleep, prop = FALSE, numbers = TRUE)

matrixplot() function

matrixplot(sleep)

marginplot() function

marginplot(sleep[c("Gest","Dream")],pch=c(20),col=c("darkgray","red","blue"))

Treatment of missing values

1. Deletion

The simplest way to handle missing values is to delete the rows that contain them. R has two functions for this: complete.cases() and na.omit().

Deletion can be applied directly to the data, so no demonstration is given here.
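For readers who want to try the two functions, a minimal sketch on a made-up data frame (the name `df` and its contents are assumptions):

```r
df <- data.frame(x = c(1, NA, 3),
                 y = c("a", "b", NA))

# complete.cases() returns TRUE for rows without any missing value
keep <- complete.cases(df)
complete_rows <- df[keep, ]

# na.omit() drops the incomplete rows in one step
omitted <- na.omit(df)

# Number of rows removed
n_dropped <- sum(!keep)
```

Both approaches keep the same rows; na.omit() additionally records which rows were dropped in the `na.action` attribute of its result.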

2. Multiple imputation

Multiple imputation (MI) is a method for handling missing values based on repeated simulation. MI is a common choice for complex missing-value problems: it generates a set of complete data sets from a data set containing missing values. In this section we use the mice package in R to impute a data set.

The workflow of multiple imputation in the mice package is: mice() creates m imputed data sets, with() applies a statistical analysis to each of them, and pool() combines the m results.

An analysis based on the mice package usually follows this procedure:

library(mice)
imp <- mice(mydata, m)
fit <- with(imp, analysis)
pooled <- pool(fit)
summary(pooled)

Description of the procedure:

* mydata is a matrix or data frame containing missing values.

* imp is a list object containing the m imputed data sets, together with information on how the imputation was carried out. The default value of m is 5.

* analysis is an expression object that specifies the statistical analysis to apply to each of the m imputed data sets.

* fit is a list object containing the m separate analysis results.

* pooled is a list object containing the pooled (averaged) results of the m analyses.
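A concrete instance of this procedure might look like the sketch below. It assumes the mice package is installed and uses the small `nhanes` data set that ships with mice instead of the `sleep` data; the linear model and the seed value are arbitrary choices for illustration.

```r
library(mice)

# nhanes is a small example data set shipped with mice (columns: age, bmi, hyp, chl)
imp <- mice(nhanes, m = 5, seed = 1234, printFlag = FALSE)

# Fit the same linear model on each of the 5 imputed data sets
fit <- with(imp, lm(bmi ~ age + hyp))

# Combine the 5 sets of estimates into one
pooled <- pool(fit)
summary(pooled)

# Extract the first completed data set for inspection
completed <- mice::complete(imp, 1)
```

The `seed` argument only makes the random imputations reproducible; a different seed gives slightly different pooled estimates.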

3. Simple imputation

Simple imputation replaces the missing values of a variable with a single value (such as the mean, median or mode). Note that these substitutions are non-random (deterministic), which means that no random error is introduced.
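A minimal sketch of mean and median imputation in base R, on a made-up vector:

```r
x <- c(2, NA, 4, 6, NA)

# Replace missing values with the mean of the observed values
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# Replace missing values with the median of the observed values
x_median <- replace(x, is.na(x), median(x, na.rm = TRUE))
```

Because every missing value receives the same number, the variance of the imputed variable is smaller than that of the observed values.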

4. Other ways to handle missing values

R supports other approaches to handling missing values through additional packages.

Package: Description

Hmisc: Contains several functions that support simple imputation, multiple imputation and imputation using canonical variates.

mvnmle: Maximum likelihood estimation for multivariate normal data with missing values.

cat: Multiple imputation of multivariate categorical data under log-linear models.

arrayImpute, SeqKnn: Useful functions for handling missing microarray data.

longitudinalData: A collection of utility functions, including a series of functions for imputing missing values in time series.

kmi: Kaplan-Meier multiple imputation for handling missing values in survival analysis.

mix: Multiple imputation of mixed categorical and continuous data under the general location model.

pan: Multiple imputation of multivariate panel or clustered data.
