Yelp datasets are available for download on the company website.
They can be used for personal, educational, and academic purposes and for data analysis. I am using R to analyze reviews and find a pattern that might be helpful for various purposes. Since these files contain many observations, I have decided to focus on some specific aspects, for example how to find a good Italian restaurant in Phoenix.
The following analysis was taken and adapted from a free course available on Datacamp.
The Datacamp analysis considers Indian restaurant reviews and uses the “Yelp star reviews for restaurants data set”. Since the course uses pre-loaded datasets to simulate the analysis on their website, I have downloaded the CSV datasets from Data.World to reproduce the same code in RStudio.
I have applied a few changes to the code, then I have adapted the same analysis to a different category: Italian restaurants. Being Italian, I am familiar with Italian food, so I have decided to explore these restaurants, but you can do the same with Chinese, Mexican, or Japanese cuisine, or with any other business type you choose from the business_categories list.
I have split this analysis into several parts. This is the first one:
Merge files, filter Italian restaurants, create a map using latitude and longitude.
I start with the same code Datacamp used.
In RStudio, I set the working directory, then I clear the environment.
Session > Set Working Directory > Choose Directory
rm(list = ls())
# I load the packages I need
library(readr)
library(dplyr)
# Now I load and explore the three csv files I will use:
businesses <- read_csv("businesses.csv")
summary(businesses)
View(businesses)
reviews <- read_csv("reviews.csv")
summary(reviews)
View(reviews)
users <- read_csv("users.csv")
summary(users)
View(users)
I combine the three files, but I use full_join instead of inner_join from the dplyr package: my dataset has a few non-matching values, and inner_join would drop the rows that have no match in the other table.
ru2 <- full_join(reviews, users)
# combine ru2 with the businesses data set
rub2 <- full_join(ru2, businesses)
# I check the rub2 data set
summary(rub2)
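To see why the join type matters, here is a minimal sketch on two tiny made-up tables. The data and the column names user_id, stars, and user_name are invented for illustration and are not necessarily the real Yelp columns. inner_join keeps only the rows that have a match in both tables, while full_join keeps every row and fills the gaps with NA.

```r
library(dplyr)

# Toy data, invented for illustration only
reviews_toy <- data.frame(user_id = c(1, 2, 3), stars = c(5, 3, 4))
users_toy   <- data.frame(user_id = c(1, 2, 4),
                          user_name = c("Ann", "Bob", "Dan"))

# inner_join keeps only the user_id values present in both tables (1 and 2)
inner <- inner_join(reviews_toy, users_toy, by = "user_id")

# full_join keeps all user_id values (1, 2, 3, 4); cells without a match become NA
full <- full_join(reviews_toy, users_toy, by = "user_id")

nrow(inner)  # 2
nrow(full)   # 4
```

Passing by explicitly is optional, but it avoids relying on dplyr to pick the key columns from the names the two tables share.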
I have merged them and I have a big data set with many business categories.
This step is key, because I might decide to pull out the precise categories I am interested in.
In my case, I filter out all the reviews that are not about Italian restaurants, using grepl() and subset().
grepl() searches for matches in strings or string vectors: it returns TRUE where the specified pattern is found and FALSE otherwise. I use it to create a logical (TRUE/FALSE) column that flags Italian and non-Italian restaurants.
rub2$is_italian <- grepl("Italian", rub2$business_categories)
rub2$is_italian
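A small sketch of how grepl() behaves, using made-up category strings rather than the real business_categories values. Note that the match is case-sensitive by default and that missing values come back as FALSE.

```r
# Toy category strings, invented for illustration only
categories <- c("Italian; Pizza", "Indian", "Restaurants; italian", NA)

# Case-sensitive by default: the lower-case "italian" is not matched
strict <- grepl("Italian", categories)

# With ignore.case = TRUE both spellings are matched
loose <- grepl("Italian", categories, ignore.case = TRUE)

strict  # TRUE FALSE FALSE FALSE
loose   # TRUE FALSE  TRUE FALSE
```

If the category text in your data is not consistently capitalised, ignore.case = TRUE may be worth trying.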
Select only reviews for Italian restaurants.
subset() creates subsets from data frames, vectors, or matrices. I filter out all non-Italian reviews and assign the remaining (Italian) reviews to the data frame italian.
italian <- subset(rub2, is_italian == TRUE)
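As a hedged illustration, the same filtering step can be written with base R or with dplyr. The toy data frame and its business_name column are invented for this sketch. Because is_italian is already logical, it can be used directly as the condition.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  business_name = c("Trattoria A", "Curry B", "Pizzeria C"),
  is_italian    = c(TRUE, FALSE, TRUE)
)

# Base R: keep the rows where is_italian is TRUE
italian_base <- subset(toy, is_italian)

# dplyr equivalent of the same filter
italian_dplyr <- filter(toy, is_italian)

italian_base$business_name   # "Trattoria A" "Pizzeria C"
```

Both versions return the same two rows, so the choice is mostly a matter of style.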
and explore it
italian
str(italian)
summary(italian)
# I change the name of the first column only (it will be helpful later)
names(italian)[1] <- "ID"
colnames(italian)
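Renaming a single column by position can be sketched on a toy data frame. The column names X1 and stars are hypothetical. Indexing names() with [1] changes only the first name and leaves the others untouched.

```r
# Toy data frame, invented for illustration only
toy <- data.frame(X1 = 1:2, stars = c(4, 5))

# Replace only the first column name
names(toy)[1] <- "ID"

names(toy)  # "ID" "stars"
```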
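Save the italian dataset as a CSV file
# write.csv() is part of base R, so no extra package is needed
# row.names = FALSE avoids writing an extra column of row numbers
write.csv(italian, "italian.csv", row.names = FALSE)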
Create a map of Italian restaurants in Arizona
You can obtain this map using the following code
library(tidyverse)
library(sf)
library(mapview)
library(readr)
italian <- read_csv("italian.csv")
View(italian)
italian %>% glimpse()
mapview(italian, xcol = "business_longitude", ycol = "business_latitude", crs = 4269, grid = FALSE)
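An alternative sketch, not part of the Datacamp code: the coordinates can first be turned into an sf point object with st_as_sf(), which makes the coordinate reference system explicit. The two rows below are invented points near Phoenix, and business_name is a hypothetical column. Only the longitude and latitude column names and the CRS code 4269 come from the code above.

```r
library(sf)

# Toy data, invented for illustration only
toy <- data.frame(
  business_name      = c("Trattoria A", "Pizzeria C"),
  business_longitude = c(-112.07, -111.93),
  business_latitude  = c(33.45, 33.42)
)

# Longitude comes first (x), latitude second (y)
pts <- st_as_sf(
  toy,
  coords = c("business_longitude", "business_latitude"),
  crs = 4269
)

st_crs(pts)$epsg  # 4269
```

The resulting object can then be passed to mapview(pts) in place of the plain data frame.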
Next article – part 2: use select and group_by to filter variables.