Building Connections with R + LinkedIn Data

Functionality Document: Data Processing and Analysis Workflow in R

This document outlines the functionality of a data processing and analysis workflow implemented in R.

The workflow comprises multiple steps, from data importation and cleaning through transformation, reshaping, and exporting the final results.

Overview:
The script is designed to demonstrate a series of data wrangling steps using R’s `dplyr` and `tidyr` packages, which are part of the tidyverse.

These steps are crucial for preparing raw data for further analysis and reporting.

Step 1: Setup and Data Import

Objective: Initialize the R environment, load the necessary libraries, and create a synthetic dataset to simulate real-world data processing.

Functionality:

  • Check that the required packages (`dplyr`, `tidyr`) are available and install any that are missing (a sketch of this check follows the list).
  • Generate a synthetic dataset of user interactions (likes, comments, and shares) across different post types and dates.
  • Display the initial state of the data to provide a snapshot for verification.
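
The transcript below simply loads the tidyverse rather than checking first; a minimal sketch of the check-and-install step described above might look like this (it assumes a CRAN mirror is configured):

    # Install dplyr and tidyr only if they are missing, then load them
    required <- c("dplyr", "tidyr")
    to_install <- required[!vapply(required, requireNamespace, logical(1), quietly = TRUE)]
    if (length(to_install) > 0) install.packages(to_install)
    library(dplyr)
    library(tidyr)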

Step 2: Data Cleaning

Objective: Inspect the dataset for any anomalies or missing values and provide a summary of the data.

Functionality:

  • Run `summary()` on the dataset to check for NA values and obtain a statistical summary of each column (an explicit NA count is sketched after this list).
  • Print the summary to provide insights into the range, median, mean, and missing values across different columns.
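
`summary()` reports NA counts per column only when NAs are present, so an explicit count is often clearer. A minimal check, assuming the data frame is named `data` as in the transcript:

    # Count missing values in each column (all zeros are expected for this synthetic data)
    colSums(is.na(data))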

Step 3: Data Transformation

Objective: Select relevant columns and aggregate data to prepare for efficient data analysis and merging.

Functionality:

  • Select only the columns needed for further analysis to simplify the dataset.
  • Aggregate data by `user_id` to summarize likes, comments, and shares, giving one row per user for subsequent joins (an equivalent formulation with `across()` is sketched after this list).
  • Display the transformed data to verify correct aggregation and selection.
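
The transcript spells out each `sum()` inside `summarise()`; an equivalent formulation with dplyr's `across()` avoids the repetition. A sketch, assuming the `data_selected` object from the transcript:

    # Sum likes, comments, and shares per user in a single across() call
    data_selected_summary <- data_selected %>%
      group_by(user_id) %>%
      summarise(across(c(likes, comments, shares), sum, .names = "total_{.col}"),
                .groups = "drop")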

Step 4: Data Reshaping

Objective: Reshape the data to facilitate different types of analysis, converting from long to wide format and back as needed.

Functionality:

  • Use `spread()` to transform the data into a wide format with one likes column per post type, which improves readability for certain types of analysis.
  • Convert the wide format back to long format using `gather()`, which suits analytical approaches that require long data.
  • Print both reshaped datasets to verify the transformations (the modern `pivot_wider()`/`pivot_longer()` equivalents are sketched after this list).
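
Note that `spread()` and `gather()` still work but are superseded in current tidyr by `pivot_wider()` and `pivot_longer()`. The same reshape could be written as follows (a sketch, using the `data` object from the transcript):

    # Modern tidyr equivalents of spread() and gather()
    data_wide <- pivot_wider(data, names_from = post_type, values_from = likes)
    data_long <- pivot_longer(data_wide, cols = c(Image, Video, Text),
                              names_to = "post_metric", values_to = "metric_value")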

Step 5: Joining Data

Objective: Merge the original dataset with the summarized data to enrich it with aggregate metrics.

Functionality:

  • Perform a left join on `user_id` to combine the original data with the aggregated data, so that every record from the original data is preserved and supplemented with the summarized metrics (a variant that asserts the join cardinality is sketched after this list).
  • Display the joined data to validate the merge operation.
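
Because the summary table has exactly one row per `user_id`, this is a many-to-one join. dplyr 1.1.0 and later can assert that expectation directly (a sketch; the `relationship` argument does not exist in older dplyr versions):

    # Assert the expected many-to-one cardinality and that no rows were dropped
    data_joined <- left_join(data, data_selected_summary,
                             by = "user_id", relationship = "many-to-one")
    stopifnot(nrow(data_joined) == nrow(data))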

Step 6: Data Export

Objective: Output the final cleaned and processed data to a CSV file for use in further analysis or reporting.

Functionality:

  • Export the processed data to a CSV file so it is ready for subsequent analysis or external reporting (a readr-based alternative is sketched after this list).
  • Print the final dataset to provide a preview of the exported data.
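
The transcript uses base `write.csv()` with `row.names = FALSE`; readr's `write_csv()`, also part of the tidyverse, has the same effect with fewer arguments (a sketch, assuming readr is installed):

    # write_csv never writes row names, so no row.names argument is needed
    readr::write_csv(data_joined, "cleaned_social_media_data.csv")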

Conclusion:
This workflow provides a comprehensive approach to importing, cleaning, transforming, and exporting data. It shows how R's tidyverse can be used to maintain data integrity and prepare data for analytical projects or reports. Each step helps ensure the data is accurate, relevant, and ready for insights.

> # Step 1: Setup and Data Import
> # Load necessary libraries (the tidyverse already includes dplyr and tidyr)
> library(tidyverse)
> 
> # Create a synthetic dataset
> set.seed(0)
> data <- data.frame(
+   user_id = sample(1:100, 20, replace = TRUE),
+   post_type = sample(c("Image", "Video", "Text"), 20, replace = TRUE),
+   likes = sample(1:300, 20, replace = TRUE),
+   comments = sample(0:100, 20, replace = TRUE),
+   shares = sample(0:50, 20, replace = TRUE),
+   date_posted = seq(as.Date("2021-01-01"), by="day", length.out=20)
+ )
> 
> print("Original Data:")
[1] "Original Data:"
> print(head(data))
  user_id post_type likes comments shares date_posted
1      14     Image    70       69     30  2021-01-01
2      68     Image   121       74     37  2021-01-02
3      39     Image    40       80     16  2021-01-03
4       1     Video   172       99      8  2021-01-04
5      34     Image    25       12     38  2021-01-05
6      87     Image   248       39     22  2021-01-06
> 
> # Step 2: Data Cleaning
> # Check for NA values and inspect the data
> print("Data Summary:")
[1] "Data Summary:"
> print(summary(data))
    user_id       post_type             likes           comments         shares       date_posted        
 Min.   : 1.00   Length:20          Min.   : 14.00   Min.   :12.00   Min.   : 0.00   Min.   :2021-01-01  
 1st Qu.:30.75   Class :character   1st Qu.: 43.75   1st Qu.:25.75   1st Qu.:21.00   1st Qu.:2021-01-05  
 Median :56.50   Mode  :character   Median :145.00   Median :45.50   Median :30.50   Median :2021-01-10  
 Mean   :53.35                      Mean   :142.50   Mean   :51.20   Mean   :29.55   Mean   :2021-01-10  
 3rd Qu.:79.75                      3rd Qu.:212.00   3rd Qu.:80.75   3rd Qu.:38.75   3rd Qu.:2021-01-15  
 Max.   :97.00                      Max.   :298.00   Max.   :99.00   Max.   :50.00   Max.   :2021-01-20  
> 
> # Step 3: Data Transformation
> # Select columns for further processing
> data_selected <- select(data, user_id, post_type, likes, comments, shares)
> print("Selected Data:")
[1] "Selected Data:"
> print(head(data_selected))
  user_id post_type likes comments shares
1      14     Image    70       69     30
2      68     Image   121       74     37
3      39     Image    40       80     16
4       1     Video   172       99      8
5      34     Image    25       12     38
6      87     Image   248       39     22
> 
> # Summarize data to ensure unique user_id before join
> data_selected_summary <- data_selected %>%
+   group_by(user_id) %>%
+   summarise(
+     total_likes = sum(likes),
+     total_comments = sum(comments),
+     total_shares = sum(shares),
+     .groups = 'drop'  # This option is used to drop grouping after summarisation
+   )
> print("Summarized Data for Join:")
[1] "Summarized Data for Join:"
> print(head(data_selected_summary))
# A tibble: 6 x 4
  user_id total_likes total_comments total_shares
    <int>       <int>          <int>        <int>
1       1         172             99            8
2       7         230             47           28
3      14         109            116           79
4      21          45             21           31
5      34          25             12           38
6      39          40             80           16
> 
> # Step 4: Data Reshaping
> # Convert data to wide format based on post_type
> data_wide <- spread(data, key = post_type, value = likes)
> print("Data in Wide Format:")
[1] "Data in Wide Format:"
> print(head(data_wide))
  user_id comments shares date_posted Image Text Video
1       1       99      8  2021-01-04    NA   NA   172
2       7       47     28  2021-01-17    NA   NA   230
3      14       47     49  2021-01-08    NA   NA    39
4      14       69     30  2021-01-01    70   NA    NA
5      21       21     31  2021-01-14    NA   NA    45
6      34       12     38  2021-01-05    25   NA    NA
> 
> # Convert back to long format
> data_long <- gather(data_wide, key = 'post_metric', value = 'metric_value', Image, Video, Text)
> print("Data in Long Format:")
[1] "Data in Long Format:"
> print(head(data_long))
  user_id comments shares date_posted post_metric metric_value
1       1       99      8  2021-01-04       Image           NA
2       7       47     28  2021-01-17       Image           NA
3      14       47     49  2021-01-08       Image           NA
4      14       69     30  2021-01-01       Image           70
5      21       21     31  2021-01-14       Image           NA
6      34       12     38  2021-01-05       Image           25
> 
> # Step 5: Joining Data
> # Join the original data with the per-user summary (a many-to-one join on user_id)
> data_joined <- left_join(data, data_selected_summary, by = "user_id")
> print("Joined Data:")
[1] "Joined Data:"
> print(head(data_joined))
  user_id post_type likes comments shares date_posted total_likes total_comments total_shares
1      14     Image    70       69     30  2021-01-01         109            116           79
2      68     Image   121       74     37  2021-01-02         121             74           37
3      39     Image    40       80     16  2021-01-03          40             80           16
4       1     Video   172       99      8  2021-01-04         172             99            8
5      34     Image    25       12     38  2021-01-05          25             12           38
6      87     Image   248       39     22  2021-01-06         248             39           22
> 
> # Step 6: Data Export
> # Write the cleaned and joined data to a new CSV file
> write.csv(data_joined, "cleaned_social_media_data.csv", row.names = FALSE)
> 
> # Print the final cleaned and joined data
> print("Final Cleaned and Joined Data:")
[1] "Final Cleaned and Joined Data:"
> print(head(data_joined))
  user_id post_type likes comments shares date_posted total_likes total_comments total_shares
1      14     Image    70       69     30  2021-01-01         109            116           79
2      68     Image   121       74     37  2021-01-02         121             74           37
3      39     Image    40       80     16  2021-01-03          40             80           16
4       1     Video   172       99      8  2021-01-04         172             99            8
5      34     Image    25       12     38  2021-01-05          25             12           38
6      87     Image   248       39     22  2021-01-06         248             39           22
>