1 00:00:00,380 --> 00:00:02,970 all right so here's a little help file 2 00:00:02,970 --> 00:00:05,509 on dealing with missing values in our 3 00:00:05,509 --> 00:00:09,719 mini survey data basically the reason 4 00:00:09,719 --> 00:00:12,240 that we have any concern that there are 5 00:00:12,240 --> 00:00:15,000 missing values is that for each case for 6 00:00:15,000 --> 00:00:16,680 which there's a missing value in any 7 00:00:16,680 --> 00:00:19,230 analysis that includes that case it'll 8 00:00:19,230 --> 00:00:22,320 be omitted so things per se like 9 00:00:22,320 --> 00:00:24,090 multiple regression or something like 10 00:00:24,090 --> 00:00:27,689 that you end up losing a lot of cases if 11 00:00:27,689 --> 00:00:31,439 you have a few missing values scatter 12 00:00:31,439 --> 00:00:32,659 around a number of different variables 13 00:00:32,659 --> 00:00:34,530 so what I've done here is I've 14 00:00:34,530 --> 00:00:39,090 highlighted the cases that are the cases 15 00:00:39,090 --> 00:00:41,370 they had something missing with yellow 16 00:00:41,370 --> 00:00:43,079 highlighting and I just did that by hand 17 00:00:43,079 --> 00:00:44,579 there was no magic formula although I 18 00:00:44,579 --> 00:00:47,840 could have made some sort of conditional 19 00:00:47,840 --> 00:00:50,070 if statement sort of thing that would 20 00:00:50,070 --> 00:00:51,390 have done that but I was feeling lazy 21 00:00:51,390 --> 00:00:54,120 and there weren't that many cases what 22 00:00:54,120 --> 00:00:55,289 I've gone ahead and done is I've 23 00:00:55,289 --> 00:00:58,050 calculated the mean which for this 24 00:00:58,050 --> 00:00:59,670 variable or something missing is one 25 00:00:59,670 --> 00:01:01,559 point six five to one seven for the 26 00:01:01,559 --> 00:01:04,500 median in the mode and just to kind of 27 00:01:04,500 --> 00:01:06,479 think about these are all central 28 00:01:06,479 --> 00:01:10,830 tendencies of this variable and a lot of 29 00:01:10,830 --> 00:01:12,750 a cheap and dirty way to deal with 30 00:01:12,750 --> 00:01:14,100 missing values is substitute in the 31 00:01:14,100 --> 00:01:16,049 central tendency and then for linear 32 00:01:16,049 --> 00:01:18,960 source of variables often times the mean 33 00:01:18,960 --> 00:01:20,490 would be a good choice in this case we 34 00:01:20,490 --> 00:01:24,830 have kind of these ordered levels in 35 00:01:24,830 --> 00:01:28,979 terms of our Likert scales and in that 36 00:01:28,979 --> 00:01:30,840 case sometimes maybe the median might be 37 00:01:30,840 --> 00:01:33,270 superior one advantage of the mean is 38 00:01:33,270 --> 00:01:34,829 that if you substitute the mean it'll be 39 00:01:34,829 --> 00:01:36,390 one point six five two which is 40 00:01:36,390 --> 00:01:37,829 obviously not one of the choices and you 41 00:01:37,829 --> 00:01:39,659 can clearly see which ones were 42 00:01:39,659 --> 00:01:42,270 substituted in this case just for our 43 00:01:42,270 --> 00:01:44,520 purposes here I'm highlighting them by 44 00:01:44,520 --> 00:01:46,470 hand the other thing I'm going to do is 45 00:01:46,470 --> 00:01:49,310 I'm going to put an equation here that 46 00:01:49,310 --> 00:01:52,920 for these cases where they are being 47 00:01:52,920 --> 00:01:54,210 substituted instead of putting the value 48 00:01:54,210 --> 00:01:57,930 there I'm going to put equals to the 49 00:01:57,930 --> 00:01:59,810 median here and I'm gonna make the 50 00:01:59,810 --> 00:02:03,329 twenty nine stay in place there by 51 00:02:03,329 --> 00:02:04,890 putting a dollar sign in front of it 52 00:02:04,890 --> 00:02:06,659 make it an absolute reference and that 53 00:02:06,659 --> 00:02:09,479 means that I can copy this and paste it 54 00:02:09,479 --> 00:02:11,330 in each of these subsequent spots 55 00:02:11,330 --> 00:02:13,890 regardless of which row it's in 56 00:02:13,890 --> 00:02:15,750 it's always going to be grabbing the 57 00:02:15,750 --> 00:02:18,540 value from road 29 for that particular 58 00:02:18,540 --> 00:02:21,960 column all right so I can come here and 59 00:02:21,960 --> 00:02:24,030 I can substitute that in and here when I 60 00:02:24,030 --> 00:02:26,160 substitute it we'll see it changes in 61 00:02:26,160 --> 00:02:35,010 here and so here this one this one and 62 00:02:35,010 --> 00:02:36,390 you can see these ones here 63 00:02:36,390 --> 00:02:38,190 these questions were how much he uses 64 00:02:38,190 --> 00:02:39,720 different statistics software probably 65 00:02:39,720 --> 00:02:42,930 the best guess is actually the median 66 00:02:42,930 --> 00:02:43,980 rather than the mean in that case 67 00:02:43,980 --> 00:02:47,330 because only one person here used that 68 00:02:47,330 --> 00:02:49,230 so anyway this is probably the dominant 69 00:02:49,230 --> 00:02:50,700 category here if someone left it blank 70 00:02:50,700 --> 00:02:52,520 they probably haven't used it 71 00:02:52,520 --> 00:02:56,400 let me see let's drag this over a little 72 00:02:56,400 --> 00:02:58,320 bit and I can also I can fill these 73 00:02:58,320 --> 00:03:00,420 across it'll still work so I don't have 74 00:03:00,420 --> 00:03:02,220 to just paste paste paste I can do a 75 00:03:02,220 --> 00:03:04,830 whole row of them like that can go in 76 00:03:04,830 --> 00:03:07,500 oops I need to recopy so I can copy any 77 00:03:07,500 --> 00:03:09,690 of these paste it in here and it'll work 78 00:03:09,690 --> 00:03:13,170 and it'll keep grabbing the observation 79 00:03:13,170 --> 00:03:17,310 from the twenty-ninth cell so I need to 80 00:03:17,310 --> 00:03:21,480 copy those in there okay and so at the 81 00:03:21,480 --> 00:03:22,920 end of this I'm gonna have a data set 82 00:03:22,920 --> 00:03:26,370 that is almost all values Oh something I 83 00:03:26,370 --> 00:03:29,459 should say is that I calculated the mean 84 00:03:29,459 --> 00:03:31,260 median and mode before I started making 85 00:03:31,260 --> 00:03:33,239 these changes if you hadn't you'd get a 86 00:03:33,239 --> 00:03:35,610 circular reference warning so instead 87 00:03:35,610 --> 00:03:37,440 what you want to do is copy and paste 88 00:03:37,440 --> 00:03:40,140 the values here instead of the formulas 89 00:03:40,140 --> 00:03:42,780 for the mean median mode and that's what 90 00:03:42,780 --> 00:03:45,209 I already did and so that's why we see 91 00:03:45,209 --> 00:03:49,739 that there let's see here's a couple 92 00:03:49,739 --> 00:03:52,650 more that I hadn't highlighted so I'll 93 00:03:52,650 --> 00:03:55,769 copy this and I'll paste it there you 94 00:03:55,769 --> 00:03:57,570 can see I'm pasting the formatting at 95 00:03:57,570 --> 00:03:58,680 the same time which is kind of 96 00:03:58,680 --> 00:04:03,750 convenient boom okay hold on okay and 97 00:04:03,750 --> 00:04:06,810 then we have a case here computers at 98 00:04:06,810 --> 00:04:09,120 home awesome one listed three or more 99 00:04:09,120 --> 00:04:10,680 I'm just gonna go ahead and change that 100 00:04:10,680 --> 00:04:13,709 to a three I know it could be more than 101 00:04:13,709 --> 00:04:16,079 that but for our purposes it's uh we 102 00:04:16,079 --> 00:04:17,149 have low medium and high 103 00:04:17,149 --> 00:04:20,130 more or less okay and I'll look along 104 00:04:20,130 --> 00:04:22,079 here see if there any more missing cases 105 00:04:22,079 --> 00:04:24,720 there are a couple so I'm going to go 106 00:04:24,720 --> 00:04:27,030 ahead and copy formatting 107 00:04:27,030 --> 00:04:28,890 the rule to look for the value from the 108 00:04:28,890 --> 00:04:31,320 29 throw from here I'm gonna paste it 109 00:04:31,320 --> 00:04:33,240 into the remaining items now this was 110 00:04:33,240 --> 00:04:37,170 not very efficient yeah I wouldn't do 111 00:04:37,170 --> 00:04:39,150 this if I had a lot of a lot of cases 112 00:04:39,150 --> 00:04:42,270 but for our purposes I think this will 113 00:04:42,270 --> 00:04:44,970 work and we could talk about how to 114 00:04:44,970 --> 00:04:47,820 automate these sort of a steps later so 115 00:04:47,820 --> 00:04:50,640 now what I'm going to do is I'm going to 116 00:04:50,640 --> 00:04:57,740 copy everything in this sheet like this 117 00:04:57,740 --> 00:05:02,310 copy and then I'm gonna just shoot empty 118 00:05:02,310 --> 00:05:03,840 no it's not I'm gonna go and add yet 119 00:05:03,840 --> 00:05:08,280 another sheet I'm gonna paste special so 120 00:05:08,280 --> 00:05:09,810 right click and then click paste special 121 00:05:09,810 --> 00:05:14,090 words and then I'm gonna click on 122 00:05:14,090 --> 00:05:18,680 formats ok and then I'm gonna click on 123 00:05:18,680 --> 00:05:23,490 paste special values ok and so then I 124 00:05:23,490 --> 00:05:26,070 got my highlighting and I got my numbers 125 00:05:26,070 --> 00:05:32,400 but I didn't get my formulas which I 126 00:05:32,400 --> 00:05:34,890 appreciate so now this data set here I'm 127 00:05:34,890 --> 00:05:36,270 gonna go ahead and clear it clear this 128 00:05:36,270 --> 00:05:39,360 stuff off the bottom clear contents cuz 129 00:05:39,360 --> 00:05:43,500 I don't need it anymore and now I'm 130 00:05:43,500 --> 00:05:45,240 ready to do some analysis on this data 131 00:05:45,240 --> 00:05:46,830 yeah so this would be a good starting 132 00:05:46,830 --> 00:05:48,900 point for that so pretty much we're done 133 00:05:48,900 --> 00:05:50,880 processing the data in terms of missing 134 00:05:50,880 --> 00:05:53,040 cases and stuff gonna get ready to start 135 00:05:53,040 --> 00:05:54,120 doing the next step which would be 136 00:05:54,120 --> 00:05:56,130 constructing an index which I'll make a 137 00:05:56,130 --> 00:05:59,960 brief video about here in a second