Useful Reviews with R

Yelp datasets are available for download on the company website.

Dataset agreement

They can be used for personal, educational, and academic purposes and for data analysis. I am using R to analyze reviews and find a pattern that might be helpful for various purposes. Since these files contain many observations, I have decided to focus on some specific aspects, for example how to find a good Italian restaurant in Phoenix.

The following analysis was taken and adapted from a free course available on DataCamp.

The DataCamp analysis considers Indian restaurant reviews and uses the “Yelp star reviews for restaurants data set”. Since the course uses pre-loaded datasets to simulate the analysis on its website, I have downloaded the csv datasets from Data.World to reproduce the same code in RStudio.

I have made a few changes to the code, then adapted the same analysis to a different category: Italian restaurants. Being Italian, I am familiar with Italian food, so I decided to explore it, but you can do the same with Chinese, Mexican, or Japanese cuisine, or with any other business you choose from the business_categories list.

I have split this analysis into several parts.

Image 1: Tagliatelle al ragù, author Mirna Rossi

Analysis

Part 1.

Merge files, filter Italian restaurants, create a map using latitude and longitude.

I start with the same code DataCamp used.

In RStudio, I set the working directory, then I clean the environment.

Session > Set Working Directory > Choose Directory

rm(list=ls())
# I load the packages I need
library(readr)
library(dplyr)
# Now I load and explore the three csv files I will use:
businesses <- read_csv("businesses.csv")
summary(businesses)
View(businesses)

reviews <- read_csv("reviews.csv")
summary(reviews)
View(reviews)

users <- read_csv("users.csv")
summary(users)
View(users)

I combine the three files, but I use full_join instead of inner_join from the dplyr package: my data set has a few values with no match, and inner_join would drop those rows.

# combine the reviews with the users data set
ru2  <- full_join(reviews, users)
# combine ru2 with the businesses data set
rub2 <- full_join(ru2, businesses)
# I check the rub2 data set
summary(rub2)
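
To show what the choice of join changes, here is a small sketch on two made-up tables. The column names (user_id, stars, user_name) are my assumptions for illustration and not the real Yelp columns. inner_join keeps only the rows whose key appears in both tables, while full_join keeps every row and fills the gaps with NA.

```r
library(dplyr)

# Toy data: hypothetical columns, not the real Yelp schema
toy_reviews <- tibble(user_id = c("u1", "u2", "u3"), stars = c(5, 3, 4))
toy_users   <- tibble(user_id = c("u1", "u2", "u4"),
                      user_name = c("Ann", "Bob", "Dan"))

# inner_join keeps only the keys found in both tables (u1 and u2)
inner <- inner_join(toy_reviews, toy_users, by = "user_id")

# full_join keeps all keys (u1, u2, u3, u4) and fills missing values with NA
full <- full_join(toy_reviews, toy_users, by = "user_id")

nrow(inner)
nrow(full)
# count the rows that came from toy_reviews without a matching user
sum(is.na(full$user_name))
```

Naming the key with by = also avoids relying on dplyr guessing the shared columns, which can be worth doing once you know which columns link the real files.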

After merging them, I have a big data set with many business categories.

This step is key, because I can now pull out the precise categories I am interested in.

In my case, I filter out all the reviews that are not about Italian restaurants, using grepl() and subset().

# grepl() searches for matches in strings or string vectors. It returns TRUE
# when the specified pattern is found, otherwise it returns FALSE.
# I use it to create a logical (TRUE/FALSE) column that identifies
# Italian and non-Italian restaurants.

rub2$is_italian <- grepl("Italian", rub2$business_categories)
rub2$is_italian
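
Two details of grepl() are easy to miss, so here is a minimal sketch on a made-up vector of category strings (the values are invented, not taken from the data set). By default the match is case-sensitive, and a missing value counts as no match.

```r
# Toy category strings, invented for illustration
categories <- c("Italian; Pizza", "Indian", "italian restaurants", NA)

# Default behaviour: case-sensitive, and NA gives FALSE
strict <- grepl("Italian", categories)

# ignore.case = TRUE also catches the lowercase spelling
relaxed <- grepl("Italian", categories, ignore.case = TRUE)

strict
relaxed
# a logical vector can be summed to count the matches
sum(relaxed)
```

Whether the relaxed version is needed depends on how consistently the categories are spelled in the real file, which is worth checking before filtering.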

Select only reviews for Italian restaurants.

subset() creates subsets from data frames, vectors, or matrices. I filter out all non-Italian reviews and assign the remaining (Italian) reviews to the data frame italian.

italian <- subset(rub2, is_italian)

Then I explore it:

italian
str(italian)
summary(italian)
# I rename the first column (it will be helpful later)
names(italian)[1] <- "ID"
colnames(italian)

Save the italian data set to a csv file

# write.csv() is part of base R, so no extra package is needed
write.csv(italian, "italian.csv")
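
One detail to keep in mind when the file is read back in: by default write.csv() also writes the row names as an extra, unnamed first column. The sketch below uses a made-up two-row data frame and temporary files (all names here are hypothetical) to compare the default with row.names = FALSE.

```r
# Toy data frame, invented for illustration
toy <- data.frame(ID = c("b1", "b2"), stars = c(4, 5))

path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

# Default: the row names are written as an extra first column
write.csv(toy, path_default)
# row.names = FALSE writes only the actual columns
write.csv(toy, path_clean, row.names = FALSE)

back_default <- read.csv(path_default)
back_clean   <- read.csv(path_clean)

ncol(back_default)
names(back_clean)
```

If the first column of italian.csv looks unfamiliar after reading it back, this extra column is the likely reason.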

Create a map of Italian restaurants in Arizona

Image 2. Map of Italian restaurants from the dataset

You can obtain this map using the following code:

library(tidyverse)
library(sf)
library(mapview)
library(readr)
italian <- read_csv("italian.csv")
View(italian)
italian %>% 
  glimpse()
mapview(italian, xcol = "business_longitude", ycol = "business_latitude", crs = 4269, grid = FALSE)
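
An alternative, sketched here, is to convert the table to an sf object before mapping. This makes it explicit what happens to rows without coordinates, since st_as_sf() does not accept missing values in the coordinate columns. The three toy restaurants and the business_name column are invented for illustration; the coordinate column names and crs = 4269 follow the code above.

```r
library(dplyr)
library(sf)

# Toy data: invented restaurants, one of them without a longitude
toy <- tibble(
  business_name      = c("Trattoria A", "Pizzeria B", "Osteria C"),
  business_longitude = c(-112.07, -111.93, NA),
  business_latitude  = c(33.45, 33.42, 33.50)
)

# Drop rows with missing coordinates, then build point geometries
toy_sf <- toy %>%
  filter(!is.na(business_longitude), !is.na(business_latitude)) %>%
  st_as_sf(coords = c("business_longitude", "business_latitude"), crs = 4269)

nrow(toy_sf)
# mapview::mapview(toy_sf) would then draw these points interactively
```

Since the merged data comes from a full_join, some rows may well lack coordinates, so a filter like this can be a useful safety check.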

Next article – part 2: use select and group_by to filter variables.
