Use “select” and “group_by” to filter variables.

Part 2.

Previous article Part 1.

Use select and group_by to filter variables, then create a variable with cool and useful reviewers (or your favorite ones).

I filter the Italian reviews and look for the reviewers who have assessed the most Italian restaurants. This is an aspect we will analyze further in the future, but here I assume that reviewers with more reviews are more reliable than those who write only a few (or just one).

I create a new data frame with each reviewer and their number of reviews. In my data frame, the column names differ from those in the DataCamp analysis, since user-name does not exist in my dataset. I use reviewer_name instead.

library(readr)
library(dplyr)
italian <- read_csv("italian.csv")
View(italian)
number_reviews_italian <- italian %>%
  select(reviewer_review_count, reviewer_name) %>%  # select() keeps only the columns I want
  group_by(reviewer_name) %>%                       # group_by() makes one group per reviewer
  summarise(total_reviews = n())                    # n() counts the rows in each group
number_reviews_italian

The result is a table with one row per reviewer and the total number of reviews each of them has in this data set.
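To see what group_by() and summarise() with n() do, here is a minimal sketch on a made-up data frame. The name toy_reviews and all of its values are invented for illustration and do not come from italian.csv. Because n() counts the rows in each group, total_reviews is the number of reviews a reviewer has in the data, not the value of any existing column.

```r
library(dplyr)

# Toy data: invented names, one row per review (not from italian.csv)
toy_reviews <- data.frame(
  reviewer_name = c("Anna", "Ben", "Anna", "Carla", "Anna", "Ben"),
  stars = c(5, 4, 3, 5, 4, 2),
  stringsAsFactors = FALSE
)

# One row per reviewer, with the number of rows (reviews) in each group
toy_counts <- toy_reviews %>%
  group_by(reviewer_name) %>%
  summarise(total_reviews = n())

# count() is a dplyr shortcut for the same group_by() + summarise(n()) pattern
toy_counts2 <- toy_reviews %>%
  count(reviewer_name, name = "total_reviews")

toy_counts
```

Both toy_counts and toy_counts2 should hold the same counts, so count() may be a shorter way to write the same step.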

I recommend trying different column names to explore other solutions (and ideas). For example, the following code groups the reviews by reviewer name, because I find that IDs are not helpful.

# Group by reviewer names
number_reviews_italian2 <- italian %>% 
  select(reviewer_name, reviewer_review_count) %>%
  group_by(reviewer_name) %>% 
  summarise(total_reviews = n())
number_reviews_italian2
# Now print the table of total_reviews
table(number_reviews_italian$total_reviews)
# Print the average number of reviews per user.
mean(number_reviews_italian$total_reviews)
[1] 1.716171


summary(number_reviews_italian)

At this point, we can create new variables. When we read reviews, we want to know the truth, and if possible we want to hear it from someone who tells it nicely. I have noticed that the data set provides two interesting columns: reviewer_cool and reviewer_useful. I select these variables, together with the names of the businesses reviewed, the names of the cities, and the comments (text). Then I keep only the reviewers with at least 1000 in both reviewer_cool and reviewer_useful, and the restaurants with 4 stars or more. This way, we obtain a selection of reviews written by the coolest reviewers about the best Italian restaurants. I call it “top_cool”.

top_cool <- italian %>%
  select(reviewer_name, reviewer_cool, reviewer_useful, business_stars, stars, business_name, business_city, text) %>%
  filter(reviewer_cool >= 1000, reviewer_useful >= 1000, business_stars >= 4)
top_cool
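When filter() receives several conditions separated by commas, a row is kept only if all of them are true. The sketch below shows this on a made-up data frame; toy_reviewers and its numbers are invented for illustration and only reuse the column names from the code above.

```r
library(dplyr)

# Toy data: invented reviewers and scores (not from italian.csv)
toy_reviewers <- data.frame(
  reviewer_name = c("Anna", "Ben", "Carla", "Dino"),
  reviewer_cool = c(1500, 2000, 300, 1200),
  reviewer_useful = c(1200, 400, 5000, 1000),
  business_stars = c(4.5, 5, 4, 3.5),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND:
# a row must pass all three tests to be kept
toy_top <- toy_reviewers %>%
  filter(reviewer_cool >= 1000, reviewer_useful >= 1000, business_stars >= 4)

toy_top$reviewer_name
```

In this toy example only Anna passes all three conditions: Ben fails on reviewer_useful, Carla on reviewer_cool, and Dino on business_stars.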

I write this table to a new CSV file to use later.

# write.csv() is part of base R, so no extra package is needed
write.csv(top_cool, "top_cool.csv")
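By default, write.csv() also writes the row names as an extra first column. If that column is not wanted when the file is read back later, one option is row.names = FALSE. The sketch below uses an invented data frame (toy) and a temporary file, so it does not touch top_cool.csv.

```r
# Toy data: invented values, only to show the round trip
toy <- data.frame(
  reviewer_name = c("Anna", "Ben"),
  stars = c(5, 4),
  stringsAsFactors = FALSE
)

# Write to a temporary file without the row-name column
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Read the file back and check which columns it contains
toy_back <- read.csv(path, stringsAsFactors = FALSE)
names(toy_back)
```

Whether to drop the row names is a matter of taste; keeping the default is fine as long as the extra column is expected when the file is loaded again.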

Part 3. In the next step we search for a good restaurant in Phoenix.
