Jordan Tanley and Jonathan Wood 2022-07-05

Introduction - Jonathan


The data in this analysis is the online news popularity dataset, which contains a set of features describing articles published over a two-year period.

The goal of this project is to predict the number of shares (how many times an article was shared over social media) an article receives. We use the number of shares as the measure of how popular an article is.

Notable Variables

While there are 61 variables in the data set, we will not use all of them for this project. The notable variables are the following:


Multiple methods will be used for this project to predict the number of shares a new article can generate, including linear regression, random forest, and boosted tree models.

Data - Jordan

In order to read in the data using a relative path, be sure to have the data file saved in your working directory.

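# load the packages used throughout the report
library(tidyverse)  # read_csv(), %>%, dplyr verbs, ggplot2, rownames_to_column()
library(caret)      # train(), trainControl(), postResample()
library(knitr)      # kable()

# read in the data
news <- read_csv("OnlineNewsPopularity/OnlineNewsPopularity.csv")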
## Rows: 39644 Columns: 61
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): url
## dbl (60): timedelta, n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_img...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# sneak peek at the dataset
news %>%
  head()
url timedelta n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length num_keywords data_channel_is_lifestyle data_channel_is_entertainment data_channel_is_bus data_channel_is_socmed data_channel_is_tech data_channel_is_world kw_min_min kw_max_min kw_avg_min kw_min_max kw_max_max kw_avg_max kw_min_avg kw_max_avg kw_avg_avg self_reference_min_shares self_reference_max_shares self_reference_avg_sharess weekday_is_monday weekday_is_tuesday weekday_is_wednesday weekday_is_thursday weekday_is_friday weekday_is_saturday weekday_is_sunday is_weekend LDA_00 LDA_01 LDA_02 LDA_03 LDA_04 global_subjectivity global_sentiment_polarity global_rate_positive_words global_rate_negative_words rate_positive_words rate_negative_words avg_positive_polarity min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares 731 12 219 0.6635945 1 0.8153846 4 2 1 0 4.680365 5 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 496 496 496.000 1 0 0 0 0 0 0 0 0.5003312 0.3782789 0.0400047 0.0412626 0.0401225 0.5216171 0.0925620 0.0456621 0.0136986 0.7692308 0.2307692 0.3786364 0.1000000 0.7 -0.3500000 -0.600 -0.2000000 0.5000000 -0.1875000 0.0000000 0.1875000 593 731 9 255 0.6047431 1 0.7919463 3 1 1 0 4.913725 4 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.7997557 0.0500467 0.0500963 0.0501007 0.0500007 0.3412458 0.1489478 0.0431373 0.0156863 0.7333333 0.2666667 0.2869146 0.0333333 0.7 -0.1187500 -0.125 -0.1000000 0.0000000 0.0000000 0.5000000 0.0000000 711 731 9 211 0.5751295 1 0.6638655 3 1 1 0 4.393365 6 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 918 918 918.000 1 0 0 0 0 0 0 0 0.2177923 0.0333345 0.0333514 0.0333335 0.6821883 0.7022222 0.3233333 0.0568720 0.0094787 0.8571429 0.1428571 0.4958333 0.1000000 1.0 -0.4666667 -0.800 -0.1333333 0.0000000 0.0000000 0.5000000 
0.0000000 1500 731 9 531 0.5037879 1 0.6656347 9 0 1 0 4.404896 7 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000 1 0 0 0 0 0 0 0 0.0285732 0.4192996 0.4946508 0.0289047 0.0285716 0.4298497 0.1007047 0.0414313 0.0207156 0.6666667 0.3333333 0.3859652 0.1363636 0.8 -0.3696970 -0.600 -0.1666667 0.0000000 0.0000000 0.5000000 0.0000000 1200 731 13 1072 0.4156456 1 0.5408895 19 19 20 0 4.682836 7 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 545 16000 3151.158 1 0 0 0 0 0 0 0 0.0286328 0.0287936 0.0285752 0.0285717 0.8854268 0.5135021 0.2810035 0.0746269 0.0121269 0.8602151 0.1397849 0.4111274 0.0333333 1.0 -0.2201923 -0.500 -0.0500000 0.4545455 0.1363636 0.0454545 0.1363636 505 731 10 370 0.5598886 1 0.6981982 2 2 0 0 4.359459 9 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8500 8500 8500.000 1 0 0 0 0 0 0 0 0.0222453 0.3067176 0.0222313 0.0222243 0.6265816 0.4374086 0.0711842 0.0297297 0.0270270 0.5238095 0.4761905 0.3506100 0.1363636 0.6 -0.1950000 -0.400 -0.1000000 0.6428571 0.2142857 0.1428571 0.2142857 855
# Creating a weekday variable (basically undoing the 7 dummy variables that came with the data) for EDA
news$weekday <- ifelse(news$weekday_is_friday == 1, "Friday",
                       ifelse(news$weekday_is_monday == 1, "Monday",
                              ifelse(news$weekday_is_tuesday == 1, "Tuesday",
                                     ifelse(news$weekday_is_wednesday == 1, "Wednesday",
                                            ifelse(news$weekday_is_thursday == 1, "Thursday",
                                                   ifelse(news$weekday_is_saturday == 1, "Saturday", "Sunday"))))))
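
The same idea of collapsing one-hot dummy columns back into a single label can be sketched without nesting, as an alternative to the chain of ifelse() calls. The toy data frame `dummies` and the helper names below are hypothetical and only mirror the `weekday_is_*` naming of the real data; this is an illustration, not the code used for the report.

```r
# Toy one-hot columns (hypothetical values, same naming pattern as the dataset)
dummies <- data.frame(
  weekday_is_monday  = c(1, 0, 0),
  weekday_is_tuesday = c(0, 0, 1),
  weekday_is_sunday  = c(0, 1, 0)
)

# Turn "weekday_is_monday" into "Monday", and so on
day_names <- sub("weekday_is_", "", names(dummies))
day_names <- paste0(toupper(substring(day_names, 1, 1)), substring(day_names, 2))

# max.col() returns, for each row, the index of the column holding the largest value,
# which is the column containing the 1 when exactly one dummy is set
weekday <- day_names[max.col(as.matrix(dummies), ties.method = "first")]
weekday
```

This relies on every row having exactly one dummy equal to 1; rows with no dummy set would silently receive the first label.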

Next, let’s subset the data so that we only look at the data channel of interest. This report looks at articles in the “Lifestyle” data channel, which is the value of params$channel.

# Subset the data to one of the parameterized data channels and drop unnecessary variables
chan <- paste0("data_channel_is_", params$channel)
chan
## [1] "data_channel_is_lifestyle"
filtered_channel <- news %>% 
                as_tibble() %>% 
                # .data[[chan]] looks up the column named in chan within the piped data
                filter(.data[[chan]] == 1) %>% 
                select(-c(url, timedelta))

# take a peek at the data
filtered_channel %>%
  head()

Summarizations - Both (at least 3 plots each)

For the numerical summaries, we can look at several aspects. Contingency tables allow us to examine frequencies of categorical variables. The first output below, for example, shows the counts for each weekday. Similarly, the fifth table shows the frequencies of the number of tokens in the article content. Another set of summary statistics to look at is the 5 Number Summary, which provides the minimum, 1st quartile, median, 3rd quartile, and maximum for a particular variable. It may also be helpful to look at the average. Together these help in assessing skewness (a mean close to the median suggests a roughly symmetric distribution, while a mean well above or below the median suggests right or left skew, respectively) and in looking for outliers (anything more than 1.5 × (Q3 - Q1) below Q1 or above Q3 is generally considered an outlier). Below, the 5 Number Summaries (plus mean) are shown for shares, number of words in the content, number of words in the content for the upper quartile of shares, number of images in the article, number of videos in the article, positive word rate, and negative word rate.
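
As a small illustration of the 5 Number Summary and the 1.5 × IQR outlier rule, here is a sketch in base R. The vector `shares_demo` is made up for this example and is not taken from the dataset; the variable names are likewise hypothetical.

```r
# Hypothetical share counts, invented for illustration only
shares_demo <- c(28, 900, 1100, 1400, 1700, 2100, 3250, 5000, 208300)

# 5 Number Summary (minimum, Q1, median, Q3, maximum) plus the mean
five_num <- quantile(shares_demo, probs = c(0, 0.25, 0.5, 0.75, 1))
avg <- mean(shares_demo)

# Outlier fences: 1.5 * IQR below Q1 and above Q3
q1 <- five_num[["25%"]]
q3 <- five_num[["75%"]]
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- shares_demo[shares_demo < lower_fence | shares_demo > upper_fence]

# A mean well above the median hints at right skew
right_skewed <- avg > five_num[["50%"]]

five_num
outliers
right_skewed
```

In this toy vector the single very large value is flagged as an outlier and pulls the mean far above the median, which is the same pattern the shares summary in this report suggests.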

# Contingency table of frequencies for days of the week, added caption for clarity
kable(table(filtered_channel$weekday), 
      col.names = c("Weekday", "Frequency"), 
      caption = "Contingency table of frequencies for days of the week")
Weekday Frequency
Friday 305
Monday 322
Saturday 182
Sunday 210
Thursday 358
Tuesday 334
Wednesday 388

Contingency table of frequencies for days of the week

# Numerical Summary of Shares, added caption for clarity
filtered_channel %>% summarise(Minimum = min(shares), 
                          Q1 = quantile(shares, prob = 0.25), 
                          Average = mean(shares), 
                          Median = median(shares), 
                          Q3 = quantile(shares, prob = 0.75), 
                          Maximum = max(shares)) %>% 
                kable(caption = "Numerical Summary of Shares")
Minimum Q1 Average Median Q3 Maximum
28 1100 3682.123 1700 3250 208300

Numerical Summary of Shares

# Numerical Summary of Number of words in the content, added caption for clarity
filtered_channel %>% summarise(Minimum = min(n_tokens_content), 
                          Q1 = quantile(n_tokens_content, prob = 0.25), 
                          Average = mean(n_tokens_content), 
                          Median = median(n_tokens_content), 
                          Q3 = quantile(n_tokens_content, prob = 0.75), 
                          Maximum = max(n_tokens_content)) %>% 
                kable(caption = "Numerical Summary of Number of words in the content")
Minimum Q1 Average Median Q3 Maximum
0 308.5 621.3273 502 795 8474

Numerical Summary of Number of words in the content

# Numerical Summary of Number of words in the content for the upper quantile of Shares, added caption for clarity
filtered_channel %>% filter(shares > quantile(shares, prob = 0.75)) %>%
                summarise(Minimum = min(n_tokens_content), 
                          Q1 = quantile(n_tokens_content, prob = 0.25), 
                          Average = mean(n_tokens_content), 
                          Median = median(n_tokens_content), 
                          Q3 = quantile(n_tokens_content, prob = 0.75), 
                          Maximum = max(n_tokens_content)) %>% 
                kable(caption = "Numerical Summary of Number of words in the content for the upper quantile of Shares")
Minimum Q1 Average Median Q3 Maximum
0 306 679.1752 505 883 8474

Numerical Summary of Number of words in the content for the upper quantile of Shares

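# Contingency table of frequencies for number of tokens in the article content, added caption for clarity
kable(table(filtered_channel$n_tokens_content), 
  col.names = c("Tokens", "Frequency"), 
  caption = "Contingency table of frequencies for number of tokens in the article content")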
Tokens Frequency
0 22
68 1
77 1
81 2
85 1
89 1
91 2
96 1
97 1
98 1
99 2
101 1
102 1
103 1
105 2
106 1
107 1
110 2
112 1
113 2
114 2
115 3
116 5
117 2
118 1
120 1
121 3
123 6
124 1
126 1
128 1
129 1
130 1
131 2
132 1
135 1
137 3
138 3
139 2
140 2
141 3
143 1
144 2
145 1
146 2
147 3
149 4
150 3
151 1
153 1
154 3
155 1
156 2
157 2
158 3
159 4
160 1
161 2
163 3
165 1
166 2
167 1
169 4
171 4
172 1
173 3
174 4
175 4
178 3
179 3
180 2
181 1
182 2
183 2
184 3
187 3
188 1
189 2
191 2
192 1
193 2
194 2
195 5
196 1
197 5
198 2
199 4
200 2
202 4
203 5
204 5
205 1
206 4
207 3
208 1
209 2
210 6
211 3
212 1
213 3
214 6
215 4
216 4
217 3
218 2
219 1
221 2
222 7
223 2
224 3
225 1
226 4
228 2
229 3
230 4
231 2
232 3
234 2
235 4
236 2
237 4
238 2
239 3
240 4
241 3
242 3
243 7
244 1
246 3
247 4
249 6
250 1
251 1
252 3
253 2
254 4
255 2
256 7
257 1
258 6
259 2
260 1
261 7
262 2
263 3
264 2
265 6
266 6
267 4
268 6
269 4
270 5
271 3
272 5
273 5
275 1
276 9
277 3
278 3
279 4
280 2
281 2
282 4
283 5
285 1
286 2
287 3
289 2
290 1
291 6
292 5
293 4
294 5
295 2
296 1
297 5
298 2
299 3
300 4
301 2
303 1
304 1
305 4
306 3
307 2
308 3
309 3
310 2
311 2
312 3
313 4
314 1
315 5
316 3
317 4
318 6
319 3
320 3
322 3
323 4
324 1
325 7
326 5
327 3
328 2
329 2
330 3
331 4
332 6
333 1
334 4
335 5
336 3
337 5
338 5
339 3
340 4
341 2
342 8
344 4
345 3
346 2
347 4
348 4
349 2
350 4
351 2
352 5
353 4
354 5
355 4
356 3
357 2
358 2
359 1
360 3
361 2
362 3
363 1
364 3
365 3
366 3
367 5
368 5
369 1
370 4
371 1
372 1
373 1
374 2
375 2
376 4
377 1
378 5
379 5
380 3
381 3
382 5
383 5
385 1
386 1
387 3
388 3
389 1
390 2
391 2
393 3
394 4
395 5
396 2
397 5
398 4
399 6
400 3
401 2
402 3
403 3
404 2
405 3
406 3
407 3
408 4
409 6
410 4
411 4
412 4
413 3
414 4
415 2
416 3
417 2
418 3
419 2
420 3
421 1
422 1
423 3
424 1
425 1
427 2
428 4
429 3
430 3
432 2
434 4
436 2
437 1
438 1
439 2
440 5
441 4
442 2
443 2
444 1
445 4
446 6
447 2
448 10
451 1
452 3
453 5
454 2
455 1
456 2
457 6
458 2
459 4
460 1
461 2
463 3
464 2
465 1
466 2
467 1
468 3
469 1
470 1
471 3
473 2
475 1
476 2
477 1
478 2
480 4
481 2
482 2
483 3
484 1
485 5
486 4
487 3
488 2
489 2
490 4
491 2
492 1
494 1
495 2
496 2
497 2
498 4
499 4
500 2
501 2
502 2
503 4
504 2
505 4
506 3
507 1
508 2
509 3
510 4
511 4
512 1
513 3
514 4
516 3
517 4
518 4
519 1
521 2
522 2
524 1
525 1
526 1
527 4
528 1
529 4
530 5
531 2
532 1
533 2
534 3
535 6
536 3
537 1
538 2
539 3
540 2
541 3
542 2
543 2
544 4
545 2
546 1
547 1
548 7
549 5
550 2
551 2
552 1
554 3
555 3
556 1
557 4
558 1
560 2
561 2
562 3
563 1
564 1
565 3
566 1
567 3
568 4
569 2
570 3
571 6
572 2
573 3
574 2
575 3
576 4
577 2
578 3
579 1
580 3
581 1
582 3
583 3
584 3
586 1
587 1
588 2
589 4
590 2
591 4
592 2
593 1
594 3
595 2
596 2
598 1
599 3
600 1
601 3
602 2
603 1
604 2
605 1
606 2
607 1
608 1
609 2
610 2
611 3
612 3
613 2
614 3
615 3
616 1
619 2
620 2
621 1
622 2
623 1
624 2
625 2
626 3
627 2
628 2
629 1
630 1
631 1
632 2
633 1
634 1
635 1
636 3
637 3
638 4
639 5
640 2
641 2
642 1
643 1
645 3
646 2
647 2
651 2
652 2
653 2
654 1
655 1
659 2
660 3
661 3
662 1
663 5
664 2
665 1
666 1
667 2
668 1
670 3
671 2
672 3
673 3
674 1
675 3
676 1
677 2
679 1
680 1
681 1
682 1
683 4
687 2
689 2
690 3
691 3
692 1
695 4
696 2
697 2
698 4
699 3
702 3
703 1
704 2
706 1
707 4
708 3
710 1
712 3
713 2
715 2
716 3
717 2
719 3
720 3
723 2
724 1
725 3
726 3
727 2
728 1
729 1
730 2
731 3
732 5
733 1
734 4
736 1
739 3
743 2
744 1
745 2
746 1
747 1
748 1
749 3
752 1
754 2
755 1
756 1
757 2
759 2
760 1
762 5
763 3
764 1
768 4
769 2
771 1
772 1
773 1
774 3
776 2
781 1
783 2
784 2
785 1
788 1
791 1
792 2
793 3
794 2
795 2
796 1
799 1
801 2
802 1
803 2
804 2
806 2
807 3
808 2
809 3
810 1
811 1
812 2
814 2
816 1
817 2
818 2
819 3
820 1
822 1
823 1
824 3
825 1
826 1
827 1
828 2
830 2
831 2
832 2
837 1
839 1
840 2
841 1
842 1
843 1
847 2
851 2
852 1
854 2
855 2
857 2
858 1
859 5
860 2
861 2
863 7
864 2
865 1
868 1
869 1
870 1
871 1
872 2
873 3
874 1
877 1
878 2
879 1
880 1
881 2
883 3
886 1
888 1
889 1
890 2
891 2
893 2
894 1
895 4
897 1
899 4
900 1
901 1
902 2
903 1
906 2
907 3
908 1
909 1
910 1
911 2
912 2
913 2
914 1
916 1
917 3
918 1
919 2
920 3
921 1
926 1
927 1
928 2
930 2
931 1
932 3
934 1
935 3
937 1
939 2
940 2
941 1
942 1
947 1
948 3
950 1
953 2
955 3
957 1
958 1
960 2
962 1
963 1
964 1
965 3
968 1
970 2
971 1
972 2
974 1
977 4
983 3
984 1
986 1
988 1
992 2
999 2
1000 1
1002 1
1005 1
1006 1
1007 2
1009 1
1010 1
1011 1
1013 1
1014 1
1015 2
1018 2
1019 1
1020 3
1021 2
1022 2
1023 1
1024 3
1026 1
1028 3
1029 1
1030 1
1034 4
1035 1
1037 1
1041 2
1044 1
1045 2
1046 1
1050 1
1053 2
1058 1
1059 1
1060 2
1064 1
1067 1
1069 1
1074 1
1075 1
1076 1
1079 1
1080 1
1084 2
1085 1
1089 1
1091 1
1092 1
1094 1
1095 1
1096 1
1099 2
1100 1
1103 2
1105 3
1106 1
1107 1
1110 1
1111 1
1117 1
1118 1
1119 1
1122 1
1127 1
1132 1
1133 2
1134 1
1135 1
1138 1
1140 2
1141 3
1144 2
1145 1
1146 1
1147 2
1148 1
1151 1
1153 1
1158 2
1159 1
1160 1
1167 1
1168 2
1172 1
1175 1
1177 1
1180 1
1181 1
1182 1
1183 2
1184 2
1185 1
1190 1
1191 2
1197 1
1198 1
1200 2
1205 1
1210 1
1215 2
1223 2
1224 1
1225 1
1226 1
1227 2
1229 1
1238 1
1239 1
1240 1
1245 1
1247 1
1248 1
1249 1
1256 1
1257 1
1266 2
1267 1
1271 1
1278 1
1279 2
1281 1
1282 1
1284 1
1285 2
1286 1
1287 1
1293 1
1294 2
1295 1
1303 1
1307 1
1311 2
1314 1
1316 3
1318 1
1322 1
1323 1
1326 1
1327 1
1331 1
1338 1
1344 1
1347 1
1348 1
1349 1
1359 1
1361 1
1365 1
1368 1
1370 1
1375 1
1376 1
1378 2
1381 1
1388 1
1393 1
1398 1
1401 1
1411 1
1416 1
1418 1
1427 1
1434 1
1435 1
1437 1
1441 1
1444 1
1445 1
1447 1
1460 1
1471 2
1477 1
1487 2
1494 1
1498 1
1506 1
1508 1
1519 1
1524 1
1525 1
1528 1
1530 1
1539 1
1542 2
1546 2
1547 1
1549 1
1556 1
1557 1
1559 1
1572 1
1578 1
1579 1
1582 1
1587 1
1601 1
1611 1
1613 1
1618 1
1643 1
1671 1
1672 2
1701 1
1727 1
1752 1
1784 2
1798 1
1809 1
1821 1
1822 1
1834 1
1866 1
1889 1
1901 1
1906 1
1939 2
1963 1
1971 1
1976 1
1987 1
2034 1
2055 1
2061 1
2071 1
2119 1
2147 1
2168 1
2182 1
2199 1
2301 1
2392 1
2405 1
2448 1
2454 1
2509 1
2542 1
2571 1
2781 1
2873 1
2907 1
3007 1
3083 1
3233 1
3727 1
4089 1
4130 1
7004 1
7034 1
7185 1
7413 1
7764 1
8474 1

Contingency table of frequencies for number of tokens in the article content

# Summarizing the number of images in the article
filtered_channel %>% 
  summarise(Minimum = min(num_imgs), 
      Q1 = quantile(num_imgs, prob = 0.25), 
      Average = mean(num_imgs), 
      Median = median(num_imgs), 
      Q3 = quantile(num_imgs, prob = 0.75), 
      Maximum = max(num_imgs)) %>% 
  kable(caption = "Numerical summary of number of images in an article")
Minimum Q1 Average Median Q3 Maximum
0 1 4.904717 1 8 111

Numerical summary of number of images in an article

# Summarizing the number of videos in the article
filtered_channel %>% 
  summarise(Minimum = min(num_videos), 
      Q1 = quantile(num_videos, prob = 0.25), 
      Average = mean(num_videos), 
      Median = median(num_videos), 
      Q3 = quantile(num_videos, prob = 0.75), 
      Maximum = max(num_videos)) %>% 
  kable(caption = "Numerical summary of number of videos in an article")
Minimum Q1 Average Median Q3 Maximum
0 0 0.4749881 0 0 50

Numerical summary of number of videos in an article

# Summarizing the rate of positive words in the article
filtered_channel %>% 
  summarise(Minimum = min(rate_positive_words), 
      Q1 = quantile(rate_positive_words, prob = 0.25), 
      Average = mean(rate_positive_words), 
      Median = median(rate_positive_words), 
      Q3 = quantile(rate_positive_words, prob = 0.75), 
      Maximum = max(rate_positive_words)) %>% 
  kable(caption = "Numerical Summary of the rate of positive words in an article")
Minimum Q1 Average Median Q3 Maximum
0 0.6624941 0.7226337 0.7377049 0.8125 1

Numerical Summary of the rate of positive words in an article

# Summarizing the rate of negative words in the article
filtered_channel %>% 
  summarise(Minimum = min(rate_negative_words), 
      Q1 = quantile(rate_negative_words, prob = 0.25), 
      Average = mean(rate_negative_words), 
      Median = median(rate_negative_words), 
      Q3 = quantile(rate_negative_words, prob = 0.75), 
      Maximum = max(rate_negative_words)) %>% 
  kable(caption = "Numerical Summary of the rate of negative words in an article")
Minimum Q1 Average Median Q3 Maximum
0 0.1836735 0.2668851 0.2580645 0.3333333 1

Numerical Summary of the rate of negative words in an article

The graphical summaries show the trends in the data more dramatically, including skewness and outliers. The boxplot below gives a visual representation of the 5 Number Summary for shares, split up by weekday. Boxplots make it even easier to spot outliers (look for the dots separated from the main boxplot). Next, we can examine several scatterplots. Scatterplots allow us to look at one numerical variable vs another to see if there is any correlation between them. Look out for any plots that have most of the points along a diagonal line! There are five scatterplots below, investigating shares vs number of words in the content, number of words in the title, rate of positive words, rate of negative words, and global sentiment polarity. Finally, a histogram can show the overall distribution of a numerical variable, including skewness. The histogram below shows the distribution of the shares variable. Look for a left or right tail to signify skewness, and look out for multiple peaks to signify a multi-modal variable.
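
To put a number on what a scatterplot shows, the correlation can be computed directly. The sketch below uses made-up vectors (`x`, `y_linear`, `y_curved` are hypothetical) to show how points on a straight line and points on a curve behave under the Pearson and Spearman correlations available in base R's cor().

```r
# Made-up data: one perfectly linear relationship and one curved but increasing one
x <- 1:50
y_linear <- 2 * x + 3
y_curved <- exp(x / 10)

# Pearson correlation measures linear association
pearson_linear <- cor(x, y_linear)
pearson_curved <- cor(x, y_curved)

# Spearman correlation works on ranks, so any strictly increasing trend scores 1
spearman_curved <- cor(x, y_curved, method = "spearman")

c(pearson_linear = pearson_linear,
  pearson_curved = pearson_curved,
  spearman_curved = spearman_curved)
```

For heavily skewed variables such as shares, a rank-based correlation may be a more forgiving companion to the scatterplots than the Pearson correlation.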

# Boxplot of Shares for Each Weekday, colored gray with classic theme, added labels and title
ggplot(filtered_channel, aes(x = weekday, y = shares)) + 
          geom_boxplot(fill = "grey") + 
          labs(x = "Weekday", title = "Boxplot of Shares for Each Weekday", y = "Shares") + 
          theme_classic()

# Scatterplot of Number of words in the content vs Shares, colored gray with classic theme, added labels and title
ggplot(filtered_channel, aes(x = n_tokens_content, y = shares)) + 
          geom_point(color = "grey") +
          labs(x = "Number of words in the content", y = "Shares", 
               title = "Scatterplot of Number of words in the content vs Shares") +
          theme_classic()

# Scatterplot of Number of words in the title vs Shares, colored gray with classic theme, added labels and title
ggplot(filtered_channel, aes(x = n_tokens_title, y = shares)) + 
          geom_point(color = "grey") +
          labs(x = "Number of words in the title", y = "Shares", 
               title = "Scatterplot of Number of words in the title vs Shares") +
          theme_classic()

# Histogram of shares, with labels and title (classic theme assumed, as in the plots above)
ggplot(filtered_channel, aes(x = shares)) +
  geom_histogram(color = "grey", binwidth = 2000) +
  labs(x = "Shares", 
       title = "Histogram of number of shares") +
  theme_classic()

# Scatterplot of rate of positive words vs shares
ggplot(filtered_channel, aes(x = rate_positive_words, y = shares)) +
  geom_point(color = "grey") +
  labs(x = "rate of positive words in an article", y = "Shares", 
       title = "Scatterplot of rate of positive words in an article vs shares") +
  theme_classic()

# Scatterplot of rate of negative words vs shares
ggplot(filtered_channel, aes(x = rate_negative_words, y = shares)) +
  geom_point(color = "grey") +
  labs(x = "rate of negative words in an article", y = "Shares", 
       title = "Scatterplot of rate of negative words in an article vs shares") +
  theme_classic()

# Scatterplot of global sentiment polarity vs shares
ggplot(filtered_channel, aes(x = global_sentiment_polarity, y = shares)) +
  geom_point(color = "grey") +
  labs(x = "global sentiment polarity in an article", y = "Shares", 
       title = "Scatterplot of global sentiment polarity in an article vs shares") +
  theme_classic()

# drop the weekday variable created for EDA (will get in the way for our models if we don't drop it)
filtered_channel <- subset(filtered_channel, select = -c(weekday))


Splitting the Data

First, let’s split up the data into a testing set and a training set using the proportions: 70% training and 30% testing.

# Split the data into a training and test set (70/30 split)
# indices
train <- sample(1:nrow(filtered_channel), size = nrow(filtered_channel)*.70)
test <- setdiff(1:nrow(filtered_channel), train)

# training and testing subsets
Training <- filtered_channel[train, ]
Testing <- filtered_channel[test, ]
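
Because sample() draws different rows on every run, the split above changes each time the report is knit. A minimal sketch of a reproducible 70/30 split follows; the seed value, the toy data frame `demo`, and the index names are assumptions made for this example and are not part of the original analysis.

```r
set.seed(2022)  # arbitrary seed (an assumption); any fixed value makes the split repeatable

# Toy data standing in for filtered_channel
demo <- data.frame(id = 1:100, shares = rpois(100, lambda = 1500))

n <- nrow(demo)
train_idx <- sample(seq_len(n), size = round(0.70 * n))  # 70% of the row indices
test_idx  <- setdiff(seq_len(n), train_idx)              # the remaining 30%

demo_train <- demo[train_idx, ]
demo_test  <- demo[test_idx, ]

c(train = nrow(demo_train), test = nrow(demo_test))
```

Using distinct names such as `train_idx` also avoids reusing the name `train`, which is the name of the caret function called later on.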

Linear Models

Linear regression models allow us to look at relationships between one response variable and several explanatory variables. A model can also include interaction terms and even higher order terms. The general form for a linear model is $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + E_i$, where each $x_{ij}$ is the value of the j-th predictor variable for the i-th observation, $E_i$ is the error term, and the “…” can include more predictors, interactions and/or higher order terms. Since our goal is to predict shares, we will fit these models on the subset of the data set aside for training, and then later test them on the subset set aside for testing.
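
As a small illustration of main effects and interaction terms, the sketch below fits a linear model to simulated data with lm() and shows how a formula of the form `y ~ .^2` expands. All names and coefficient values here (`d`, `x1`, `x2`, the true betas) are invented for the example.

```r
set.seed(42)  # arbitrary seed for a repeatable simulation

# Simulated data with a known structure: two main effects and one interaction
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + 0.5 * x1 * x2 + rnorm(n, sd = 0.1)
d  <- data.frame(y = y, x1 = x1, x2 = x2)

# x1 * x2 is shorthand for x1 + x2 + x1:x2
fit <- lm(y ~ x1 * x2, data = d)
round(coef(fit), 2)

# y ~ .^2 expands to all main effects plus all pairwise interactions
term_labels <- attr(terms(y ~ .^2, data = d), "term.labels")
term_labels
```

With p predictors, `.^2` produces p + p(p - 1)/2 terms. For the 58 predictors reported in the model output of this report that is 1711 terms, more than the 1469 training rows, so a fit of that form cannot estimate every coefficient and its predictions should be read with caution.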

Linear Model #1: - Jordan

# linear model on training dataset with 5-fold cv
fit1 <- train(shares ~ . , data = Training, method = "lm",
              preProcess = c("center", "scale"), 
              trControl = trainControl(method = "cv", number = 5))

Linear Model #2: - Jonathan

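# linear model with all main effects and pairwise interactions, 5-fold cv
lm_fit <- train(
  shares ~ .^2,
  data = Training,
  method = "lm",
  preProcess = c("center", "scale"), 
  trControl = trainControl(method = "cv", number = 5)
)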

Random Forest - Jordan

Random forest is a tree-based method for fitting predictive models that averages predictions across many trees, each fit to a bootstrap sample of the data. One may choose a tree-based method for its prediction accuracy, because predictors do not need to be scaled, because it makes no distributional assumptions, and because variable selection is built in. Random forest, in particular, considers only a random subset of m predictors at each split (m = p / 3 is a common choice for regression). This addresses the issue with bagging where one strong predictor is chosen for the first split in nearly every bootstrapped tree, which makes the trees highly correlated.
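
The m = p / 3 rule and the tuning grid used for mtry can be worked through with a few lines of base R. The value p = 58 is the number of predictors reported in the model output; everything else (the seed and the variable names) is an assumption made for illustration.

```r
p <- 58  # number of predictors reported in the caret output

# Rule of thumb for regression forests: consider about p / 3 predictors per split
m_default <- floor(p / 3)

# What "a random subset of m predictors" looks like for one split
set.seed(123)  # arbitrary seed
candidates <- sort(sample(seq_len(p), m_default))

# The tuning grid in this report is built from ncol(Training), which also counts
# the response column, so it runs from 1 up to round((p + 1) / 3)
mtry_grid <- 1:round((p + 1) / 3)

m_default
candidates
range(mtry_grid)
```

Cross-validation then picks the mtry value in that grid with the smallest RMSE, rather than relying on the rule of thumb alone.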

# random forest model on training dataset with 5-fold cv
ranfor <- train(shares ~ ., data = Training, method = "rf", preProcess = c("center", "scale"),
                trControl = trainControl(method = "cv", number = 5), 
                tuneGrid = expand.grid(mtry = c(1:round(ncol(Training)/3))))
## Random Forest 
## 1469 samples
##   58 predictor
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1176, 1175, 1175, 1174, 1176 
## Resampling results across tuning parameters:
##   mtry  RMSE      Rsquared     MAE     
##    1    8647.401  0.013806063  3339.815
##    2    8670.942  0.016487776  3416.647
##    3    8746.959  0.012640866  3453.693
##    4    8793.486  0.010810707  3490.870
##    5    8820.633  0.010464841  3515.555
##    6    8890.127  0.010299153  3534.856
##    7    8954.947  0.006461325  3577.440
##    8    8950.874  0.009611146  3564.044
##    9    9050.297  0.007999605  3592.140
##   10    9062.713  0.008020117  3603.272
##   11    9140.680  0.008393644  3629.264
##   12    9135.058  0.006962249  3631.516
##   13    9208.666  0.006922996  3654.880
##   14    9254.574  0.006711465  3659.443
##   15    9318.889  0.007400555  3685.630
##   16    9326.816  0.006721992  3683.519
##   17    9394.413  0.006586585  3708.637
##   18    9412.480  0.005168434  3712.670
##   19    9464.970  0.006418984  3694.991
##   20    9507.476  0.005502456  3729.256
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 1.

Boosted Tree - Jonathan

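Boosted trees are an ensemble method that, like bagging, combines many tree models. Unlike bagging, the trees are grown sequentially: each new tree is fit to the errors (residuals) of the trees built so far, and its predictions are added to the ensemble after being scaled down by a shrinkage factor. The number of trees, the interaction depth, the shrinkage, and the minimum number of observations in a node are the tuning parameters.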
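
The sequential idea can be sketched in base R with one-split trees (stumps) on simulated data. This is a simplified illustration of boosting for squared error, not the gbm implementation used in this report; the helper functions `fit_stump` and `predict_stump` and all data are hypothetical.

```r
set.seed(1)  # arbitrary seed

# Simulated data: a noisy nonlinear signal
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.2)

# Fit a one-split regression tree (stump) to the residuals r
fit_stump <- function(x, r) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbouring x values
  sse  <- sapply(cuts, function(cc) {
    left  <- r[x <= cc]
    right <- r[x > cc]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

shrinkage <- 0.1   # same role as the shrinkage tuning parameter
n_trees   <- 100   # same role as n.trees
pred      <- rep(mean(y), n)  # start from the overall mean
rmse      <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  res   <- y - pred                    # errors of the current ensemble
  stump <- fit_stump(x, res)           # new tree targets those errors
  pred  <- pred + shrinkage * predict_stump(stump, x)
  rmse[m] <- sqrt(mean((y - pred)^2))  # training RMSE after m trees
}

round(rmse[c(1, 10, 50, 100)], 3)
```

The training RMSE can only go down as trees are added, which is why the number of trees is chosen by cross-validation on held-out folds rather than by training error.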

# tuning grid for the boosted tree
tune_grid <- expand.grid(
  n.trees = c(5, 10, 50, 100),
  interaction.depth = c(1, 2, 3, 4),
  shrinkage = 0.1,
  n.minobsinnode = 10
)

# boosted tree model on training dataset with 5-fold cv
bt_fit <- train(
  shares ~ .,
  data = Training,
  method = "gbm",
  preProcess = c("center", "scale"), 
  tuneGrid = tune_grid,
  trControl = trainControl(method = "cv", number = 5)
)
## Stochastic Gradient Boosting 
## 1469 samples
##   58 predictor
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1174, 1175, 1175, 1176, 1176 
## Resampling results across tuning parameters:
##   interaction.depth  n.trees  RMSE      Rsquared     MAE     
##   1                    5      8165.421  0.005096493  3371.056
##   1                   10      8196.132  0.007551899  3362.414
##   1                   50      8381.475  0.008373412  3407.874
##   1                  100      8440.339  0.008902974  3452.295
##   2                    5      8148.357  0.008164890  3399.161
##   2                   10      8201.933  0.007609891  3389.915
##   2                   50      8405.176  0.012781046  3439.845
##   2                  100      8562.265  0.010497550  3536.516
##   3                    5      8165.285  0.008663343  3377.172
##   3                   10      8199.797  0.011038976  3378.280
##   3                   50      8383.000  0.014013506  3465.067
##   3                  100      8534.508  0.014561547  3554.871
##   4                    5      8149.778  0.004361652  3364.159
##   4                   10      8190.586  0.010058920  3378.623
##   4                   50      8460.894  0.017205681  3559.072
##   4                  100      8519.830  0.015968110  3627.882
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## Tuning parameter 'n.minobsinnode' was held constant at a value
##  of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 5, interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

Comparison - Jordan

Finally, let’s compare our four models: 2 linear models, 1 random forest model, and 1 boosted tree model.

# random forest prediction on the testing set and its performance
predRF <- predict(ranfor, newdata = Testing)
RF <- postResample(predRF, Testing$shares)

# linear model 1 prediction on the testing set and its performance
predlm1 <- predict(fit1, newdata = Testing)
LM <- postResample(predlm1, Testing$shares)

# linear model 2 prediction on the testing set and its performance
predlm2 <- predict(lm_fit, newdata = Testing)
LM2 <- postResample(predlm2, Testing$shares)

# boosted tree prediction on the testing set and its performance
predbt <- predict(bt_fit, newdata = Testing)
BT <- postResample(predbt, Testing$shares)

# combine each of the performance stats for the models and add a column with the model names
dat <- data.frame(rbind(t(data.frame(LM)), t(data.frame(RF)), t(data.frame(LM2)), t(data.frame(BT))))
df <- as_tibble(rownames_to_column(dat, "models"))

# find the model with the lowest RMSE
best <- df %>% filter(RMSE == min(RMSE)) %>% select(models)

# print "The Best fitting model according to RMSE is [insert model name for lowest RMSE here]"
paste("The Best fitting model according to RMSE is", best$models, sep = " ")
## [1] "The Best fitting model according to RMSE is RF"
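
For reference, the three statistics that postResample() reports for a regression model (RMSE, Rsquared, MAE) can be sketched in base R. The definitions below follow caret's documented behaviour as I understand it, with Rsquared taken as the squared correlation between predictions and observations; the helper names and the toy vectors are hypothetical.

```r
# Hand-rolled versions of the three metrics (illustrative helpers, not caret code)
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae  <- function(pred, obs) mean(abs(pred - obs))
rsq  <- function(pred, obs) cor(pred, obs)^2

# Toy observed values and predictions from two made-up models
obs    <- c(1, 2, 3, 6)
pred_a <- c(1, 2, 3, 4)
pred_b <- c(2, 2, 2, 2.5)

results <- data.frame(
  models = c("A", "B"),
  RMSE   = c(rmse(pred_a, obs), rmse(pred_b, obs)),
  MAE    = c(mae(pred_a, obs), mae(pred_b, obs))
)

# Same selection rule as above: keep the model with the smallest RMSE
best_demo <- results$models[which.min(results$RMSE)]
results
best_demo
```

RMSE penalizes large misses more heavily than MAE, so with a response as skewed as shares the two metrics can rank models differently.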