Useful Reviews with R

Yelp datasets are available for download on the company website.

Dataset agreement

The datasets can be used for personal, educational, and academic purposes and for data analysis. I am using R to analyze the reviews and look for patterns that might be useful. Since these files contain many observations, I have decided to focus on specific questions, for example how to find a good Italian restaurant in Phoenix.

The following analysis was taken and adapted from a free course available on Datacamp.

The Datacamp analysis considers Indian restaurant reviews and uses the “Yelp star reviews for restaurants data set”. Since the course uses pre-loaded datasets to simulate the analysis on their website, I have downloaded the CSV datasets from Data.World to reproduce the same code in RStudio.

I have made a few changes to the code, then I have adapted the same analysis to a different category: Italian restaurants. Being Italian, I am familiar with Italian food, so I have decided to explore these restaurants, but you can do the same with Chinese, Mexican, or Japanese cuisine, or with any other business you choose from the business_categories list.

I have split this analysis into several parts.

Image 1: Tagliatelle al ragù, author Mirna Rossi


Part 1.

Merge files, filter Italian restaurants, create a map using latitude and longitude.

I start with the same code Datacamp used.

In RStudio, I set the working directory, then I clean the environment.

Session > Set Working Directory > Choose Directory

# I load the packages I need
library(readr)   # read_csv()
library(dplyr)   # full_join() and %>%
library(mapview) # mapview()

# Now I load and explore the three CSV files I will use:
businesses <- read_csv("businesses.csv")

reviews <- read_csv("reviews.csv")

users <- read_csv("users.csv")

I combine the three files, but I use full_join instead of inner_join from the dplyr package: my dataset has a few non-matching values, and inner_join would drop the rows that have no match in the other table.

# combine reviews with users (the join uses the columns the two tables share)
ru2  <- full_join(reviews, users)
# combine ru2 with the businesses data set
rub2 <- full_join(ru2, businesses)
# I check the rub2 data set
glimpse(rub2)
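
To see why the join type matters, here is a minimal sketch on two made-up data frames. The names toy_reviews and toy_users, their columns, and their values are hypothetical and are not taken from the Yelp files.

```r
library(dplyr)

# Hypothetical toy data, not taken from the Yelp files
toy_reviews <- data.frame(
  user_id = c("u1", "u2", "u3"),
  stars = c(5, 3, 4),
  stringsAsFactors = FALSE
)
toy_users <- data.frame(
  user_id = c("u1", "u2", "u4"),
  user_name = c("Ann", "Bob", "Dee"),
  stringsAsFactors = FALSE
)

# inner_join keeps only the user_id values found in both tables
inner <- inner_join(toy_reviews, toy_users, by = "user_id")

# full_join keeps every row from both tables and fills the gaps with NA
full <- full_join(toy_reviews, toy_users, by = "user_id")

nrow(inner)
nrow(full)
```

In this sketch u3 has a review but no user record, and u4 has a user record but no review, so both appear only in the full_join result, with NA in the columns that could not be matched.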

I have merged them and I have a big data set with many business categories.

This step is key, because I might decide to pull out the precise categories I am interested in.

In my case, I filter out all the reviews that are not about Italian restaurants, using grepl() and subset().

# grepl() searches for matches in a string or character vector and returns TRUE where the specified pattern is found and FALSE otherwise. I want to create a logical (TRUE/FALSE) column that identifies Italian and non-Italian restaurants.

rub2$is_italian <- grepl("Italian", rub2$business_categories)

Select only reviews for Italian restaurants.

subset() creates subsets of data frames, vectors, or matrices. I filter out all non-Italian reviews and assign the remaining (Italian) reviews to the data frame italian.

italian <- subset(rub2, is_italian == TRUE)

Then I explore it.

# I change the name of the first column (it will be helpful later)
names(italian)[1] <- "ID"
colnames(italian)
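
The grepl() and subset() steps above can be tried on a small made-up data frame before running them on the full data. This is only a sketch: the business names and the "; " separator inside business_categories are assumptions, not values from the Yelp files.

```r
# Hypothetical toy data; names and category strings are made up
toy <- data.frame(
  business_name = c("Trattoria A", "Curry B", "Pizzeria C"),
  business_categories = c("Italian; Restaurants", "Indian; Restaurants", "Pizza; Italian"),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row: TRUE where "Italian" appears
toy$is_italian <- grepl("Italian", toy$business_categories)

# subset() keeps only the rows where the condition is TRUE
toy_italian <- subset(toy, is_italian)

toy_italian$business_name
```

Since grepl() already returns a logical vector, comparing it with TRUE is not needed. The match is case-sensitive by default, so "italian" in lower case would not be found unless ignore.case = TRUE is passed.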

Save the italian dataset to a CSV file

write.csv(italian, "italian.csv")
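
One detail worth knowing about write.csv(): by default it also writes the row names as an extra first column, which shows up again when the file is read back. The sketch below illustrates this on a made-up data frame written to temporary files; the toy data and the file names are hypothetical.

```r
# Hypothetical toy data written to temporary files
toy <- data.frame(ID = c("a", "b"), stars = c(4, 5), stringsAsFactors = FALSE)

with_rownames <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

# Default: the row names are written as an unnamed first column
write.csv(toy, with_rownames)

# row.names = FALSE writes only the data columns
write.csv(toy, without_rownames, row.names = FALSE)

cols_default <- names(read.csv(with_rownames))
cols_clean <- names(read.csv(without_rownames))

length(cols_default)
length(cols_clean)
```

Whether to keep the extra column depends on what the later steps expect, so this is a design choice rather than a correction.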

Create a map of Italian restaurants in Arizona

Image 2. Map of Italian restaurants from the dataset

You can obtain this map using the following code:

italian <- read_csv("italian.csv")
# plot each restaurant using its longitude and latitude columns
mapview(italian, xcol = "business_longitude", ycol = "business_latitude", crs = 4269, grid = FALSE)

Next article – part 2: use select and group_by to filter variables.
