Section 1 - Group 14

Abhishek Goyal (FT251004)
Anubhav Verma (FT251021)
Harshit Vashisht (FT251029)
N.V. Padmanayana (FT251048)
Waagmee Singh (FT251105)

Database setup

import numpy as np
import scipy.stats as spst
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import duckdb

We are going to use duckdb as our database engine; you can refer to the duckdb documentation for details. First, we will create a database called yelpdata.db. We will persist this database on the hard drive because it might be too big to handle in your system RAM; that said, duckdb has in-memory database capabilities as well. If you have plenty of RAM, you can use pandas directly or use duckdb's in-memory mode. The command below creates the required database (if it does not already exist) and connects to it. Once created, the database file will appear in the root folder of your project.

con = duckdb.connect("yelpdata.db")

Now, we will import the CSV files into four tables: restos, resto_reviews, users and user_friends. This is a one-time step; you don't need to execute the code cell below every time you do the analysis. Once the tables are created, they persist in the database in your root folder.

restos table comes from restaurants_train.csv. (There is also a restaurants_test.csv; you do not need to worry about this test data now. It will be used for a hackathon to predict star ratings later, if we get time.) The columns correspond to business.json in https://www.yelp.com/dataset/documentation/main

resto_reviews table comes from restaurant_reviews.csv. The columns correspond to review.json in https://www.yelp.com/dataset/documentation/main

users table comes from user.csv. The columns correspond to user.json in https://www.yelp.com/dataset/documentation/main

user_friends table comes from user_friends_full.csv, which was created specifically for this exercise. It gives the number of friends of each user in Yelp's social network.
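For context, a per-user friend count like this can be derived from a comma-separated friends list (as in Yelp's user.json) with pandas. The sketch below is purely illustrative; it is not the actual pipeline used to build user_friends_full.csv, and the column values are made up:

```python
import pandas as pd

# Hypothetical input: each user with a comma-separated list of friend ids,
# mimicking the 'friends' field in Yelp's user.json.
df = pd.DataFrame({
    "int_user_id": [1, 2, 3],
    "friends": ["10,11,12", "10", ""],
})

# Count friends per user; an empty string means zero friends.
df["num_friends"] = df["friends"].apply(lambda s: len(s.split(",")) if s else 0)
user_friends = df[["int_user_id", "num_friends"]]
print(user_friends)
```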

con.sql("""
    CREATE TABLE restos AS
    FROM read_csv('yelp.restaurants_train.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE resto_reviews AS
    FROM read_csv('yelp.restaurant_reviews.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE users AS
    FROM read_csv('yelp.user.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE user_friends AS
    FROM read_csv('yelp.user_friends_full.csv', sample_size = -1);
""")

Let us look at the first 5 rows of the restos table.

con.sql("""
    SELECT * FROM restos 
    LIMIT 5;
""")
┌──────────────────────┬─────────────────────┬───────────┬───┬──────────────────────┬──────────────────────┐
│         name         │       address       │   city    │ … │ attributes.Accepts…  │ attributes.HairSpe…  │
│       varchar        │       varchar       │  varchar  │   │       boolean        │       varchar        │
├──────────────────────┼─────────────────────┼───────────┼───┼──────────────────────┼──────────────────────┤
│ Oskar Blues Taproom  │ 921 Pearl St        │ Boulder   │ … │ NULL                 │ NULL                 │
│ Flying Elephants a…  │ 7000 NE Airport Way │ Portland  │ … │ NULL                 │ NULL                 │
│ Bob Likes Thai Food  │ 3755 Main St        │ Vancouver │ … │ NULL                 │ NULL                 │
│ Boxwood Biscuit      │ 740 S High St       │ Columbus  │ … │ NULL                 │ NULL                 │
│ Mr G's Pizza & Subs  │ 474 Lowell St       │ Peabody   │ … │ NULL                 │ NULL                 │
├──────────────────────┴─────────────────────┴───────────┴───┴──────────────────────┴──────────────────────┤
│ 5 rows                                                                              60 columns (5 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We can convert a duckdb result into a pandas dataframe by appending the .df() method, as shown below. The resulting dataframe can then be used for further analysis in Python.

df_restos = con.sql("""
    SELECT * FROM restos 
    LIMIT 5;
""").df()
df_restos
name address city state postal_code latitude longitude stars review_count is_open ... attributes.Smoking attributes.DriveThru attributes.BYOBCorkage attributes.Corkage attributes.RestaurantsCounterService attributes.DietaryRestrictions attributes.AgesAllowed attributes.Open24Hours attributes.AcceptsInsurance attributes.HairSpecializesIn
0 Oskar Blues Taproom 921 Pearl St Boulder CO 80302 40.017544 -105.283348 4.0 86 1 ... None None None None NaN None None NaN NaN None
1 Flying Elephants at PDX 7000 NE Airport Way Portland OR 97218 45.588906 -122.593331 4.0 126 1 ... None None None None NaN None None NaN NaN None
2 Bob Likes Thai Food 3755 Main St Vancouver BC V5V 49.251342 -123.101333 3.5 169 1 ... None None None None NaN None None NaN NaN None
3 Boxwood Biscuit 740 S High St Columbus OH 43206 39.947007 -82.997471 4.5 11 1 ... None None None None NaN None None NaN NaN None
4 Mr G's Pizza & Subs 474 Lowell St Peabody MA 01960 42.541155 -70.973438 4.0 39 1 ... None None None None NaN None None NaN NaN None

5 rows × 60 columns

Let us look at the columns in each of the tables.

con.sql("""
    DESCRIBE restos
""").df()
column_name column_type null key default extra
0 name VARCHAR YES None None None
1 address VARCHAR YES None None None
2 city VARCHAR YES None None None
3 state VARCHAR YES None None None
4 postal_code VARCHAR YES None None None
5 latitude DOUBLE YES None None None
6 longitude DOUBLE YES None None None
7 stars DOUBLE YES None None None
8 review_count BIGINT YES None None None
9 is_open BIGINT YES None None None
10 attributes.RestaurantsTableService VARCHAR YES None None None
11 attributes.WiFi VARCHAR YES None None None
12 attributes.BikeParking VARCHAR YES None None None
13 attributes.BusinessParking VARCHAR YES None None None
14 attributes.BusinessAcceptsCreditCards VARCHAR YES None None None
15 attributes.RestaurantsReservations VARCHAR YES None None None
16 attributes.WheelchairAccessible VARCHAR YES None None None
17 attributes.Caters VARCHAR YES None None None
18 attributes.OutdoorSeating VARCHAR YES None None None
19 attributes.RestaurantsGoodForGroups VARCHAR YES None None None
20 attributes.HappyHour VARCHAR YES None None None
21 attributes.BusinessAcceptsBitcoin BOOLEAN YES None None None
22 attributes.RestaurantsPriceRange2 VARCHAR YES None None None
23 attributes.Ambience VARCHAR YES None None None
24 attributes.HasTV VARCHAR YES None None None
25 attributes.Alcohol VARCHAR YES None None None
26 attributes.GoodForMeal VARCHAR YES None None None
27 attributes.DogsAllowed VARCHAR YES None None None
28 attributes.RestaurantsTakeOut VARCHAR YES None None None
29 attributes.NoiseLevel VARCHAR YES None None None
30 attributes.RestaurantsAttire VARCHAR YES None None None
31 attributes.RestaurantsDelivery VARCHAR YES None None None
32 categories VARCHAR YES None None None
33 hours.Monday VARCHAR YES None None None
34 hours.Tuesday VARCHAR YES None None None
35 hours.Wednesday VARCHAR YES None None None
36 hours.Thursday VARCHAR YES None None None
37 hours.Friday VARCHAR YES None None None
38 hours.Saturday VARCHAR YES None None None
39 hours.Sunday VARCHAR YES None None None
40 int_business_id BIGINT YES None None None
41 attributes.GoodForKids VARCHAR YES None None None
42 attributes.ByAppointmentOnly BOOLEAN YES None None None
43 attributes VARCHAR YES None None None
44 hours VARCHAR YES None None None
45 attributes.Music VARCHAR YES None None None
46 attributes.GoodForDancing VARCHAR YES None None None
47 attributes.BestNights VARCHAR YES None None None
48 attributes.BYOB VARCHAR YES None None None
49 attributes.CoatCheck VARCHAR YES None None None
50 attributes.Smoking VARCHAR YES None None None
51 attributes.DriveThru VARCHAR YES None None None
52 attributes.BYOBCorkage VARCHAR YES None None None
53 attributes.Corkage VARCHAR YES None None None
54 attributes.RestaurantsCounterService BOOLEAN YES None None None
55 attributes.DietaryRestrictions VARCHAR YES None None None
56 attributes.AgesAllowed VARCHAR YES None None None
57 attributes.Open24Hours BOOLEAN YES None None None
58 attributes.AcceptsInsurance BOOLEAN YES None None None
59 attributes.HairSpecializesIn VARCHAR YES None None None
con.sql("""
    DESCRIBE resto_reviews
""")
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ stars              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ useful             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ funny              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ cool               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ text               │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ date               │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ int_business_id    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_user_id        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_rest_review_id │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
con.sql("""
    DESCRIBE users
""")
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ review_count       │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ yelping_since      │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ useful             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ funny              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ cool               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ elite              │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ fans               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ average_stars      │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_hot     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_more    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_profile │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_cute    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_list    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_note    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_plain   │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_cool    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_funny   │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_writer  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_photos  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_user_id        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 20 rows                                                        6 columns │
└──────────────────────────────────────────────────────────────────────────┘
con.sql("""
    DESCRIBE user_friends
""")
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ int_user_id │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ num_friends │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Descriptive analysis

Average stars received by restaurants.

con.sql("""
    SELECT avg(stars) FROM restos
""")
┌───────────────────┐
│    avg(stars)     │
│      double       │
├───────────────────┤
│ 3.527858606557377 │
└───────────────────┘

We will select the stars column from restos and plot a histogram of stars.

df_temp = con.sql("""
    SELECT stars FROM restos
""").df()
df_temp
stars
0 4.0
1 4.0
2 3.5
3 4.5
4 4.0
... ...
48795 2.0
48796 3.0
48797 3.0
48798 4.0
48799 4.5

48800 rows × 1 columns

sns.displot(data = df_temp, x = "stars", kind = "hist", color = "Green")

Average stars on Yelp received by restaurants, grouped by state.

df_temp = con.sql("""
    SELECT state, avg(stars) as avg_stars FROM restos 
    GROUP BY state
    ORDER BY avg_stars DESC
""").df()
df_temp
state avg_stars
0 ABE 4.500000
1 NH 4.000000
2 OR 3.752074
3 TX 3.656262
4 CO 3.587216
5 FL 3.521030
6 MN 3.500000
7 MA 3.480819
8 BC 3.454463
9 WA 3.450202
10 OH 3.421853
11 GA 3.399796
12 KY 2.000000
13 VA 2.000000
14 KS 2.000000
15 WY 1.500000

Let us look at how many restaurants provide WiFi. First, the distinct values taken by the attribute:

con.sql("""
    SELECT DISTINCT "attributes.WiFi" FROM restos
""").df()
attributes.WiFi
0 u'no'
1 u'paid'
2 'paid'
3 u'free'
4 None
5 'free'
6 'no'
7 None
df_temp = con.sql("""
    SELECT "attributes.WiFi" AS wifi, count("attributes.WiFi") AS count FROM restos
    GROUP BY wifi
""").df()
df_temp
wifi count
0 'free' 5496
1 u'paid' 184
2 u'free' 14994
3 None 29
4 u'no' 11706
5 'no' 4394
6 'paid' 81
7 None 0

Of course, we will need to clean this data: the same value appears in two spellings (for example, u'free' and 'free'), and NULLs appear alongside the literal string 'None'.

# Normalise the labels and then aggregate, rather than relying on fragile
# positional indexing: strip the unicode prefix and quotes, and map both
# NULL and the string 'None' to a single 'missing' category.
df_temp["wifi"] = (
    df_temp["wifi"]
    .str.replace("u'", "'", regex=False)
    .str.strip("'")
    .fillna("missing")
    .replace("None", "missing")
)
df_new = df_temp.groupby("wifi", as_index=False)["count"].sum()
df_new
wifi count
0 free 20490
1 missing 29
2 no 16100
3 paid 265

Let us find aggregates of useful votes received by reviews, grouped by star rating.

df_temp = con.sql("""
    SELECT stars, sum(useful) as sum_useful, avg(useful) as avg_useful, max(useful) AS max_useful FROM resto_reviews
    GROUP BY stars
    ORDER BY stars ASC
""").df()
df_temp
stars sum_useful avg_useful max_useful
0 1 239163.0 1.256445 146
1 2 184718.0 1.192214 260
2 3 216230.0 1.000694 177
3 4 454666.0 1.036895 411
4 5 556257.0 0.825011 223
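The SQL aggregation above maps directly onto pandas groupby/agg; a small self-contained sketch on toy data (not the actual review counts):

```python
import pandas as pd

# Toy reviews: star rating and 'useful' votes per review.
reviews = pd.DataFrame({
    "stars":  [1, 1, 2, 5, 5, 5],
    "useful": [3, 1, 4, 0, 2, 2],
})

# sum/avg/max of useful votes per star rating, like the SQL above.
agg = reviews.groupby("stars")["useful"].agg(
    sum_useful="sum", avg_useful="mean", max_useful="max"
).reset_index()
print(agg)
```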
sns.catplot(
    data=df_temp, 
    x="stars", 
    y="sum_useful", 
    kind="bar", 
    hue="stars",      # assign hue and suppress the legend instead of passing
    legend=False,     # palette alone, which is deprecated in recent seaborn
    palette='magma',
    height=6, 
    aspect=1.5,
    edgecolor="black"
)

Part 1 - Exploratory Data Analysis

Data cleaning and merging

df_restos = con.sql('select * from restos').df()
df_resto_reviews = con.sql('select * from resto_reviews').df()
df_users = con.sql('select * from users').df()
df_user_friends = con.sql('select * from user_friends').df()

The tables imported from each of the CSV files look as follows:

df_restos.head(5)
name address city state postal_code latitude longitude stars review_count is_open ... attributes.Smoking attributes.DriveThru attributes.BYOBCorkage attributes.Corkage attributes.RestaurantsCounterService attributes.DietaryRestrictions attributes.AgesAllowed attributes.Open24Hours attributes.AcceptsInsurance attributes.HairSpecializesIn
0 Oskar Blues Taproom 921 Pearl St Boulder CO 80302 40.017544 -105.283348 4.0 86 1 ... None None None None NaN None None NaN NaN None
1 Flying Elephants at PDX 7000 NE Airport Way Portland OR 97218 45.588906 -122.593331 4.0 126 1 ... None None None None NaN None None NaN NaN None
2 Bob Likes Thai Food 3755 Main St Vancouver BC V5V 49.251342 -123.101333 3.5 169 1 ... None None None None NaN None None NaN NaN None
3 Boxwood Biscuit 740 S High St Columbus OH 43206 39.947007 -82.997471 4.5 11 1 ... None None None None NaN None None NaN NaN None
4 Mr G's Pizza & Subs 474 Lowell St Peabody MA 01960 42.541155 -70.973438 4.0 39 1 ... None None None None NaN None None NaN NaN None

5 rows × 60 columns

df_resto_reviews.head(5)
stars useful funny cool text date int_business_id int_user_id int_rest_review_id
0 2 1 1 1 I've stayed at many Marriott and Renaissance M... 2010-01-08 02:29:15 4954 6319642 2
1 2 0 0 0 The setting is perfectly adequate, and the foo... 2006-04-16 02:58:44 14180 292901 5
2 5 0 0 0 I work in the Pru and this is the most afforda... 2014-05-07 18:10:21 11779 6336225 7
3 5 5 3 3 I loved everything about this place. I've only... 2014-02-05 21:09:05 3216 552519 12
4 4 0 0 0 I think their rice dishes are way better than ... 2017-05-26 03:05:46 8748 544027 16
df_users.head(5)
review_count yelping_since useful funny cool elite fans average_stars compliment_hot compliment_more compliment_profile compliment_cute compliment_list compliment_note compliment_plain compliment_cool compliment_funny compliment_writer compliment_photos int_user_id
0 1220 2005-03-14 20:26:35 15038 10030 11291 2006,2007,2008,2009,2010,2011,2012,2013,2014 1357 3.85 1710 163 190 361 147 1212 5691 2541 2541 815 323 1
1 2136 2007-08-10 19:01:51 21272 10289 18046 2007,2008,2009,2010,2011,2012,2013,2014,2015,2... 1025 4.09 1632 87 94 232 96 1187 3293 2205 2205 472 294 1248
2 119 2007-02-07 15:47:53 188 128 130 2010,2011 16 3.76 22 1 3 0 0 5 20 31 31 3 1 11604
3 987 2009-02-09 16:14:29 7234 4722 4035 2009,2010,2011,2012,2013,2014 420 3.77 1180 129 93 219 90 1120 4510 1566 1566 391 326 2278
4 495 2008-03-03 04:57:05 1577 727 1124 2009,2010,2011 47 3.72 248 19 32 16 15 77 131 310 310 98 44 13481
df_user_friends.head(5)
int_user_id num_friends
0 1 5813
1 1248 6296
2 11604 835
3 2278 1452
4 13481 532
df_restos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48800 entries, 0 to 48799
Data columns (total 60 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   name                                   48800 non-null  object 
 1   address                                48378 non-null  object 
 2   city                                   48800 non-null  object 
 3   state                                  48800 non-null  object 
 4   postal_code                            48776 non-null  object 
 5   latitude                               48800 non-null  float64
 6   longitude                              48800 non-null  float64
 7   stars                                  48800 non-null  float64
 8   review_count                           48800 non-null  int64  
 9   is_open                                48800 non-null  int64  
 10  attributes.RestaurantsTableService     18523 non-null  object 
 11  attributes.WiFi                        36884 non-null  object 
 12  attributes.BikeParking                 33715 non-null  object 
 13  attributes.BusinessParking             44235 non-null  object 
 14  attributes.BusinessAcceptsCreditCards  39552 non-null  object 
 15  attributes.RestaurantsReservations     42037 non-null  object 
 16  attributes.WheelchairAccessible        12628 non-null  object 
 17  attributes.Caters                      33288 non-null  object 
 18  attributes.OutdoorSeating              42636 non-null  object 
 19  attributes.RestaurantsGoodForGroups    40952 non-null  object 
 20  attributes.HappyHour                   12774 non-null  object 
 21  attributes.BusinessAcceptsBitcoin      5952 non-null   object 
 22  attributes.RestaurantsPriceRange2      43154 non-null  object 
 23  attributes.Ambience                    39676 non-null  object 
 24  attributes.HasTV                       39971 non-null  object 
 25  attributes.Alcohol                     39435 non-null  object 
 26  attributes.GoodForMeal                 27935 non-null  object 
 27  attributes.DogsAllowed                 11071 non-null  object 
 28  attributes.RestaurantsTakeOut          45727 non-null  object 
 29  attributes.NoiseLevel                  34457 non-null  object 
 30  attributes.RestaurantsAttire           38941 non-null  object 
 31  attributes.RestaurantsDelivery         44650 non-null  object 
 32  categories                             48800 non-null  object 
 33  hours.Monday                           36880 non-null  object 
 34  hours.Tuesday                          38586 non-null  object 
 35  hours.Wednesday                        40018 non-null  object 
 36  hours.Thursday                         40597 non-null  object 
 37  hours.Friday                           40775 non-null  object 
 38  hours.Saturday                         39370 non-null  object 
 39  hours.Sunday                           34455 non-null  object 
 40  int_business_id                        48800 non-null  int64  
 41  attributes.GoodForKids                 40477 non-null  object 
 42  attributes.ByAppointmentOnly           3413 non-null   object 
 43  attributes                             0 non-null      object 
 44  hours                                  0 non-null      object 
 45  attributes.Music                       5224 non-null   object 
 46  attributes.GoodForDancing              3584 non-null   object 
 47  attributes.BestNights                  4184 non-null   object 
 48  attributes.BYOB                        3247 non-null   object 
 49  attributes.CoatCheck                   3863 non-null   object 
 50  attributes.Smoking                     3245 non-null   object 
 51  attributes.DriveThru                   4582 non-null   object 
 52  attributes.BYOBCorkage                 3513 non-null   object 
 53  attributes.Corkage                     3754 non-null   object 
 54  attributes.RestaurantsCounterService   40 non-null     object 
 55  attributes.DietaryRestrictions         67 non-null     object 
 56  attributes.AgesAllowed                 19 non-null     object 
 57  attributes.Open24Hours                 32 non-null     object 
 58  attributes.AcceptsInsurance            8 non-null      object 
 59  attributes.HairSpecializesIn           3 non-null      object 
dtypes: float64(3), int64(3), object(54)
memory usage: 22.3+ MB
Assuming a column needs at least 30% of its values present to be useful, we drop every column in which more than 70% of the entries are missing.
# Keep only columns with at least 30% non-null values
threshold = int(len(df_restos) * 0.30)

# dropna(thresh=t) retains a column only if it has at least t non-null entries
df_restos = df_restos.dropna(axis=1, thresh=threshold)
df_restos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48800 entries, 0 to 48799
Data columns (total 38 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   name                                   48800 non-null  object 
 1   address                                48378 non-null  object 
 2   city                                   48800 non-null  object 
 3   state                                  48800 non-null  object 
 4   postal_code                            48776 non-null  object 
 5   latitude                               48800 non-null  float64
 6   longitude                              48800 non-null  float64
 7   stars                                  48800 non-null  float64
 8   review_count                           48800 non-null  int64  
 9   is_open                                48800 non-null  int64  
 10  attributes.RestaurantsTableService     18523 non-null  object 
 11  attributes.WiFi                        36884 non-null  object 
 12  attributes.BikeParking                 33715 non-null  object 
 13  attributes.BusinessParking             44235 non-null  object 
 14  attributes.BusinessAcceptsCreditCards  39552 non-null  object 
 15  attributes.RestaurantsReservations     42037 non-null  object 
 16  attributes.Caters                      33288 non-null  object 
 17  attributes.OutdoorSeating              42636 non-null  object 
 18  attributes.RestaurantsGoodForGroups    40952 non-null  object 
 19  attributes.RestaurantsPriceRange2      43154 non-null  object 
 20  attributes.Ambience                    39676 non-null  object 
 21  attributes.HasTV                       39971 non-null  object 
 22  attributes.Alcohol                     39435 non-null  object 
 23  attributes.GoodForMeal                 27935 non-null  object 
 24  attributes.RestaurantsTakeOut          45727 non-null  object 
 25  attributes.NoiseLevel                  34457 non-null  object 
 26  attributes.RestaurantsAttire           38941 non-null  object 
 27  attributes.RestaurantsDelivery         44650 non-null  object 
 28  categories                             48800 non-null  object 
 29  hours.Monday                           36880 non-null  object 
 30  hours.Tuesday                          38586 non-null  object 
 31  hours.Wednesday                        40018 non-null  object 
 32  hours.Thursday                         40597 non-null  object 
 33  hours.Friday                           40775 non-null  object 
 34  hours.Saturday                         39370 non-null  object 
 35  hours.Sunday                           34455 non-null  object 
 36  int_business_id                        48800 non-null  int64  
 37  attributes.GoodForKids                 40477 non-null  object 
dtypes: float64(3), int64(3), object(32)
memory usage: 14.1+ MB
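To make the thresh semantics concrete, here is a tiny sketch on toy columns (not the Yelp data): dropna(axis=1, thresh=t) keeps a column only if it has at least t non-null values.

```python
import pandas as pd

df = pd.DataFrame({
    "mostly_full":  range(10),               # 10 non-null values
    "mostly_empty": [1] + [None] * 9,        # 1 non-null value
    "borderline":   [1, 2, 3] + [None] * 7,  # 3 non-null values
})

# Keep columns with at least 30% non-null values (3 out of 10 rows).
threshold = int(len(df) * 0.30)
kept = df.dropna(axis=1, thresh=threshold)
print(list(kept.columns))  # 'mostly_empty' is dropped
```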
# Keep only columns with at least 30% non-null values
threshold = int(len(df_resto_reviews) * 0.30)

# dropna(thresh=t) retains a column only if it has at least t non-null entries
df_resto_reviews = df_resto_reviews.dropna(axis=1, thresh=threshold)
df_resto_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1674096 entries, 0 to 1674095
Data columns (total 9 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   stars               1674096 non-null  int64         
 1   useful              1674096 non-null  int64         
 2   funny               1674096 non-null  int64         
 3   cool                1674096 non-null  int64         
 4   text                1674096 non-null  object        
 5   date                1674096 non-null  datetime64[us]
 6   int_business_id     1674096 non-null  int64         
 7   int_user_id         1674096 non-null  int64         
 8   int_rest_review_id  1674096 non-null  int64         
dtypes: datetime64[us](1), int64(7), object(1)
memory usage: 115.0+ MB
# Keep only columns with at least 30% non-null values
threshold = int(len(df_users) * 0.30)

# dropna(thresh=t) retains a column only if it has at least t non-null entries
df_users = df_users.dropna(axis=1, thresh=threshold)
df_users.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2189457 entries, 0 to 2189456
Data columns (total 19 columns):
 #   Column              Dtype         
---  ------              -----         
 0   review_count        int64         
 1   yelping_since       datetime64[ns]
 2   useful              int64         
 3   funny               int64         
 4   cool                int64         
 5   fans                int64         
 6   average_stars       float64       
 7   compliment_hot      int64         
 8   compliment_more     int64         
 9   compliment_profile  int64         
 10  compliment_cute     int64         
 11  compliment_list     int64         
 12  compliment_note     int64         
 13  compliment_plain    int64         
 14  compliment_cool     int64         
 15  compliment_funny    int64         
 16  compliment_writer   int64         
 17  compliment_photos   int64         
 18  int_user_id         int64         
dtypes: datetime64[ns](1), float64(1), int64(17)
memory usage: 317.4 MB
# Keep only columns with at least 30% non-null values
threshold = int(len(df_user_friends) * 0.30)

# dropna(thresh=t) retains a column only if it has at least t non-null entries
df_user_friends = df_user_friends.dropna(axis=1, thresh=threshold)
df_user_friends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2189457 entries, 0 to 2189456
Data columns (total 2 columns):
 #   Column       Dtype
---  ------       -----
 0   int_user_id  int64
 1   num_friends  int64
dtypes: int64(2)
memory usage: 33.4 MB

Exploring factors influencing star ratings of the restaurants

Location analysis

# Fetching the data from the database
df_avg_stars_state = con.sql("""
    SELECT state, avg(stars) as avg_stars
    FROM restos
    GROUP BY state
    ORDER BY avg_stars DESC
""").df()

# Creating the plot
plt.figure(figsize=(14, 8))  # larger figure for better visibility
ax = sns.barplot(data=df_avg_stars_state, x='state', y='avg_stars',
                 hue='state', palette='magma', dodge=False, legend=False)

# Adjusting the thickness of the bars
for patch in ax.patches:
    patch.set_width(0.9)  # Adjust this value to make bars thicker or thinner

# Rotate the x-axis labels; the states are labelled on the axis itself,
# so a separate legend is unnecessary
plt.xticks(rotation=90)
plt.tight_layout()

# Display the plot
plt.show()

  • The three states with the highest average ratings were ABE, NH and OR, so these could be worth considering when opening a restaurant; note, though, that extreme averages such as ABE's 4.5 may be based on very few restaurants

Type analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Fetching the data from the database
df_categories = con.sql("""
    SELECT categories, count(*) as count
    FROM restos
    GROUP BY categories
    ORDER BY count DESC
    LIMIT 10
""").df()

# Creating the plot
plt.figure(figsize=(14, 8))  # larger figure for better visibility
ax = sns.barplot(data=df_categories, x='count', y='categories',
                 hue='categories', palette='magma', dodge=False, legend=False)

# Adjusting the thickness of the bars
for patch in ax.patches:
    patch.set_height(0.6)  # Adjust this value to make bars thicker or thinner

# The categories are labelled on the y-axis, so no legend is needed
plt.tight_layout()

# Display the plot
plt.show()

  • The most common categories are Pizza and Chinese; these cuisines may therefore be more suitable choices when opening a new restaurant

WiFi - facilities

# Inner join restos and resto_reviews tables
df_merged = con.sql("""
    SELECT r.int_business_id, r.name, r.stars as resto_avg_stars, 
           rr.stars as review_stars, r."attributes.WiFi", r."attributes.RestaurantsTakeOut"
    FROM restos r
    JOIN resto_reviews rr ON r.int_business_id = rr.int_business_id
""").df()

# Analyze average ratings based on WiFi availability
df_wifi = df_merged.groupby('attributes.WiFi').agg({
    'review_stars': 'mean'
}).reset_index()
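For reference, the SQL inner join above has a direct pandas equivalent in DataFrame.merge; a minimal sketch on toy frames (a column subset only, with made-up values):

```python
import pandas as pd

restos_demo = pd.DataFrame({
    "int_business_id": [1, 2],
    "name": ["A", "B"],
})
reviews_demo = pd.DataFrame({
    "int_business_id": [1, 1, 2, 3],
    "stars": [5, 4, 3, 2],
})

# Inner join on the business id, like the SQL JOIN above; reviews for
# business 3 have no matching restaurant row and are dropped.
merged = restos_demo.merge(reviews_demo, on="int_business_id", how="inner")
print(len(merged))
```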
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df_merged is already created and available from your SQL join

# Analyze average ratings based on WiFi availability
df_wifi = df_merged.groupby('attributes.WiFi').agg({
    'review_stars': 'mean'
}).reset_index()

plt.figure(figsize=(10, 6))  # Increase figure size
sns.barplot(data=df_wifi, x='attributes.WiFi', y='review_stars', 
            hue='attributes.WiFi', dodge=False, palette='magma', linewidth=0)
plt.xlabel('WiFi Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of WiFi Availability on User Ratings', fontsize=14)
plt.legend(title='WiFi Availability', bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)  # Move legend out of chart area
plt.show()

TakeOut - facilities

import seaborn as sns
import matplotlib.pyplot as plt

# Analyze average ratings based on TakeOut availability
df_takeout = df_merged.groupby('attributes.RestaurantsTakeOut').agg({
    'review_stars': 'mean'
}).reset_index()

# Increase figure size and create the bar plot with gradient hue
plt.figure(figsize=(10, 6))
sns.barplot(data=df_takeout, x='attributes.RestaurantsTakeOut', y='review_stars',
            hue='attributes.RestaurantsTakeOut', dodge=False,
            palette='magma', linewidth=0, legend=False)

# Add labels and title
plt.xlabel('TakeOut Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of TakeOut Availability on User Ratings', fontsize=14)

# Display the plot
plt.show()

Bike parking - facilities

import seaborn as sns
import matplotlib.pyplot as plt

# Query data and compute average stars based on BikeParking attribute
df_bike = con.sql("""
    SELECT "attributes.BikeParking" AS bike, avg(stars) as avg_stars
    FROM restos
    GROUP BY bike
""").df()

# Increase figure size and create the bar plot
plt.figure(figsize=(10, 6))
sns.barplot(data=df_bike, x='bike', y='avg_stars', hue='bike', dodge=False,
            palette='magma', linewidth=0, legend=False)

# Add labels and title
plt.xlabel('Bike Parking Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of Bike Parking Availability on User Ratings', fontsize=14)

# Display the plot
plt.show()

Inferences of facilities

  • Restaurants offering free WiFi achieved the highest average ratings, so this attribute is important to have
  • TakeOut availability shows little difference in average ratings, so it does not appear to drive how customers rate restaurants
  • Restaurants with bike parking have the highest average rating, so it can be considered an important feature

Amenities

df_amenities = con.sql("""
    SELECT 
        "attributes.OutdoorSeating", "attributes.HappyHour",
        "attributes.Alcohol", "attributes.BusinessAcceptsCreditCards",
        avg(stars) as avg_stars
    FROM restos
    GROUP BY 
        "attributes.OutdoorSeating", "attributes.HappyHour",
        "attributes.Alcohol", "attributes.BusinessAcceptsCreditCards"
    ORDER BY avg_stars DESC
""").df()
df_amenities.head(10)
   OutdoorSeating  HappyHour  Alcohol           AcceptsCreditCards  avg_stars
0  True            False      None              False               5.0
1  True            False      u'none'           False               4.7
2  None            True       None              False               4.5
3  True            False      'beer_and_wine'   False               4.5
4  None            False      'full_bar'        None                4.5
5  True            False      u'beer_and_wine'  False               4.5
6  True            True       u'none'           False               4.5
7  None            False      u'beer_and_wine'  False               4.5
8  None            True       u'none'           True                4.5
9  None            None       u'none'           None                4.5
  • Top amenities associated with higher ratings include outdoor seating, happy hours, and credit card acceptance

Exploring the relationship between ratings and review characteristics

import seaborn as sns
import matplotlib.pyplot as plt

# Fetching the data from the database
df_review_chars = con.sql("""
    SELECT useful, funny, cool, stars
    FROM resto_reviews
""").df()

sns.heatmap(
    df_review_chars.corr(), 
    cmap='magma',  
    annot=True, 
    fmt='.1f'
)

# Display the plot
plt.show()

  • The review characteristics (useful, funny, cool votes) show no meaningful linear relationship with ratings, so they are not useful for further relational analysis

Analysis of the reviewer characteristics with rating

df_user_chars = con.sql("""
    SELECT review_count, average_stars, useful
    FROM users
""").df()
df_user_chars.head()
review_count average_stars useful
0 1220 3.85 15038
1 2136 4.09 21272
2 119 3.76 188
3 987 3.77 7234
4 495 3.72 1577
# Correlation matrix
df_user_chars.corr()
review_count average_stars useful
review_count 1.000000 0.036760 0.584678
average_stars 0.036760 1.000000 0.008758
useful 0.584678 0.008758 1.000000

It is observed that users who write more reviews also accumulate more useful votes, with a moderate positive correlation of ~0.58
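A correlation coefficient comes with a significance test: scipy.stats.pearsonr returns both r and a p-value for the null hypothesis of zero correlation. A minimal sketch on synthetic data shaped like the review_count/useful pair (the users table is not re-queried here; the data-generating process is an assumption for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in: useful votes grow roughly with review count, plus noise
review_count = rng.poisson(20, size=500)
useful = 0.6 * review_count + rng.normal(0, 3, size=500)

r, p = stats.pearsonr(review_count, useful)
print(f"r = {r:.2f}, p = {p:.3g}")  # with n = 500, a correlation this size is highly significant
```

On the real data, `stats.pearsonr(df_user_chars['review_count'], df_user_chars['useful'])` gives the same ~0.58 together with its p-value.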

Influence of number of friends on user’s review usefulness

df_friends_reviews = con.sql("""
    SELECT u.num_friends, r.useful
    FROM user_friends u
    JOIN resto_reviews r ON u.int_user_id = r.int_user_id
""").df()
sns.scatterplot(
    data=df_friends_reviews, 
    x='num_friends', 
    y='useful', 
    color=sns.color_palette("magma", as_cmap=True)(0.7)
)

plt.show()

sns.heatmap(df_friends_reviews.corr(), annot = True)

  • The correlation of 0.37 between number of friends and review usefulness is low, so friend count is not very influential

Analyzing factors affecting length of reviews

df_review_length = con.sql("""
    SELECT length(text) as review_length, stars
    FROM resto_reviews
""").df()

# Correlation

sns.heatmap(df_review_length.corr(), annot = True)

  • Review length shows only a negligible correlation with ratings, so no solid inference can be drawn from this analysis

Analysing the relation of number of reviews and ratings

# Merge restos and resto_reviews tables
df_merged = con.sql("""
    SELECT r.int_business_id, r.name, r.stars as resto_avg_stars, 
           r.review_count, rr.stars as review_stars
    FROM restos r
    JOIN resto_reviews rr ON r.int_business_id = rr.int_business_id
""").df()

# Group by restaurant to get average rating and review count
df_grouped = df_merged.groupby('int_business_id').agg({
    'resto_avg_stars': 'mean',
    'review_count': 'sum'
}).reset_index()

sns.scatterplot(data=df_grouped, x='review_count', y='resto_avg_stars',color=sns.color_palette("magma", as_cmap=True)(0.7))
plt.xlabel('Total Number of Reviews')
plt.ylabel('Average Restaurant Rating')
plt.title('Relationship Between Review Count and Average Rating')
plt.show()

  • Restaurants with higher ratings tend to have more reviews, suggesting that customers are more inclined to leave feedback when they are satisfied with the service
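Review counts are heavily right-skewed, so a rank-based measure such as Spearman's rho describes the monotonic association more robustly than Pearson's r. A sketch on synthetic data (the exponential link between rating and review volume below is an assumption for illustration, not the observed relationship):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
avg_rating = rng.uniform(1, 5, size=400)
# Assumed: review volume grows exponentially with rating, with noise
review_count = np.exp(avg_rating + rng.normal(0, 0.5, size=400)).round()

# Spearman works on ranks, so the heavy right tail of review_count does not distort it
rho, rho_p = stats.spearmanr(review_count, avg_rating)
r, _ = stats.pearsonr(review_count, avg_rating)
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3g}), Pearson r = {r:.2f}")
```

On the real data, `stats.spearmanr(df_grouped['review_count'], df_grouped['resto_avg_stars'])` is the analogous call.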

Part 2 - Descriptive Analysis

Restaurant analysis

# Descriptive statistics for Restaurants (restos table)
df_restos_stats = con.sql("""
    SELECT 
        avg(stars) AS mean_stars,
        median(stars) AS median_stars,
        mode() WITHIN GROUP (ORDER BY stars) AS mode_stars,
        stddev_pop(stars) AS stddev_stars,
        avg(review_count) AS mean_review_count,
        median(review_count) AS median_review_count,
        mode() WITHIN GROUP (ORDER BY review_count) AS mode_review_count,
        stddev_pop(review_count) AS stddev_review_count
    FROM restos
""").df()

df_restos_stats
               mean        median  mode  stddev (pop)
stars          3.527859    3.5     4.0   0.773841
review_count   106.800041  43.0    5     208.174311

Restaurant review analysis

# Descriptive statistics for Restaurant Reviews (resto_reviews table)
df_reviews_stats = con.sql("""
    SELECT 
        avg(stars) AS mean_review_stars,
        median(stars) AS median_review_stars,
        mode() WITHIN GROUP (ORDER BY stars) AS mode_review_stars,
        stddev_pop(stars) AS stddev_review_stars,
        avg(useful) AS mean_useful_votes,
        median(useful) AS median_useful_votes,
        mode() WITHIN GROUP (ORDER BY useful) AS mode_useful_votes,
        stddev_pop(useful) AS stddev_useful_votes
    FROM resto_reviews
""").df()

df_reviews_stats
               mean      median  mode  stddev (pop)
review stars   3.74747   4.0     5     1.364395
useful votes   0.986224  0.0     0     2.612258

Yelp User analysis

df_users_stats = con.sql("""
    SELECT 
        avg(review_count) AS mean_user_review_count,
        median(review_count) AS median_user_review_count,
        mode() WITHIN GROUP (ORDER BY review_count) AS mode_user_review_count,
        stddev_pop(review_count) AS stddev_user_review_count,
        avg(average_stars) AS mean_user_avg_stars,
        median(average_stars) AS median_user_avg_stars,
        mode() WITHIN GROUP (ORDER BY average_stars) AS mode_user_avg_stars,
        stddev_pop(average_stars) AS stddev_user_avg_stars
    FROM users
""").df()

df_users_stats
                     mean       median  mode  stddev (pop)
user review_count    21.697721  5.0     1     76.01253
user average_stars   3.653816   3.88    5.0   1.153861

Rating Distribution

plt.figure(figsize=(10, 6))

sns.histplot(
    df_restos['stars'], 
    bins=10, 
    kde=True, 
    color=sns.color_palette("magma", as_cmap=True)(0.7),
    edgecolor='black'
)

# Add labels and title with custom font sizes
plt.xlabel('Restaurant Ratings', fontsize=14, labelpad=10)
plt.ylabel('Frequency', fontsize=14, labelpad=10)
plt.title('Distribution of Restaurant Ratings', fontsize=16, weight='bold', pad=15)

# Customize the ticks on the axes
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Remove top and right spines
sns.despine()

# Add grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Show the plot
plt.show()

Rating analysis from user

df_avg_stars = con.sql("SELECT average_stars FROM users").df()

sns.boxplot(
    data=df_avg_stars, 
    y='average_stars', 
    color=sns.color_palette("magma", as_cmap=True)(0.7)
)
plt.ylabel('Average User Stars', fontsize=12)
plt.title('Distribution of Average User Stars', fontsize=14, weight='bold')
plt.show()

  • The rating distributions are left-skewed (negatively skewed): most of the mass sits at high ratings, indicating that most restaurants maintain a high rating, and that as long as a basic, acceptable standard is maintained, ratings do not fall sharply

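The direction of skew can be checked numerically: a sample massed at high star values with a thin tail toward 1 star has negative (left) skew. A toy sketch with scipy.stats.skew (the probabilities below are illustrative, not the actual Yelp frequencies):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Toy star ratings concentrated at the high end; probabilities are assumed for illustration
ratings = rng.choice([1, 2, 3, 4, 5], size=10_000, p=[0.05, 0.08, 0.17, 0.35, 0.35])

g1 = stats.skew(ratings)
print(f"sample skewness = {g1:.2f}")  # negative => left-skewed (tail toward low ratings)
```

On the real data, `stats.skew(df_avg_stars['average_stars'])` gives the corresponding figure.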
# Box plot of restaurant ratings by state
df_ratings_state = con.sql("""
    SELECT state, stars 
    FROM restos
""").df()
# Set a more visually appealing style
sns.set(style="whitegrid")

# Create the box plot
plt.figure(figsize=(12, 8))
sns.boxplot(data=df_ratings_state, x='state', y='stars', hue='state', palette='magma', legend=False)

# Add labels and title with custom font sizes
plt.xlabel('State', fontsize=14, labelpad=10)
plt.ylabel('Restaurant Ratings', fontsize=14, labelpad=10)
plt.title('Restaurant Ratings by State', fontsize=16, weight='bold', pad=15)

# Customize the x-axis ticks
plt.xticks(rotation=90, fontsize=12)

# Customize the y-axis ticks
plt.yticks(fontsize=12)

# Remove top and right spines
sns.despine()

# Add grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Show the plot
plt.show()

  • Restaurants in OR show above-standard performance, with a higher range of ratings than other states, which may indicate that the standard of operations relative to expectations is higher there; most other states hover around similar ratings, and a handful of states have too few data points to support meaningful inferences

Part 3 - Hypothesis Testing

t-test for difference testing based on takeout availability

Hypotheses:

  • Null Hypothesis (H₀): There is no significant difference in restaurant ratings between those that allow takeout and those that do not.

  • Alternative Hypothesis (H₁): There is a significant difference in restaurant ratings between those that allow takeout and those that do not.

Considering a level of significance of 5%

from scipy import stats

# Separate data into two groups: Takeout and No Takeout
df_takeout = con.sql("""
    SELECT stars 
    FROM restos 
    WHERE "attributes.RestaurantsTakeOut" = 'True'
""").df()

df_no_takeout = con.sql("""
    SELECT stars 
    FROM restos 
    WHERE "attributes.RestaurantsTakeOut" = 'False'
""").df()

# Perform an independent t-test
t_stat, p_value = stats.ttest_ind(df_takeout['stars'], df_no_takeout['stars'], equal_var=False)

# Print results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Conclusion based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in restaurant ratings between those that allow takeout and those that do not.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in restaurant ratings between those that allow takeout and those that do not.")
T-statistic: -2.9938115671530063, P-value: 0.0027826746728394523
Reject the null hypothesis: There is a significant difference in restaurant ratings between those that allow takeout and those that do not.
  • The above test indicates a statistically significant difference in ratings between restaurants that offer takeout and those that do not, even though the EDA showed little difference in average ratings; the decision cannot rest on sample averages alone, so takeout remains an important feature to consider when designing the new restaurant's operations
  • Practical significance is debatable, since the EDA and the hypothesis test point in different directions; still, with the growing demand for online food delivery and cloud kitchens, takeout has become a necessary feature for revenue generation even if it does not directly drive ratings
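With tens of thousands of restaurants, even a tiny mean gap clears p < 0.05, which is why the EDA and the t-test can seem to disagree. A standardized effect size such as Cohen's d separates statistical from practical significance. A sketch on synthetic groups (the group means, the 0.77 rating standard deviation, and the sample sizes are assumed loosely from the descriptive statistics above):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
# Assumed group means/sizes for illustration; sd ~0.77 as in the descriptive stats
takeout_stars = rng.normal(3.52, 0.77, size=5000)
no_takeout_stars = rng.normal(3.58, 0.77, size=700)

d = cohens_d(takeout_stars, no_takeout_stars)
print(f"Cohen's d = {d:.2f}")  # |d| below 0.2 is conventionally a negligible effect
```

On the real data, `cohens_d(df_takeout['stars'], df_no_takeout['stars'])` quantifies how large the "significant" gap actually is.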

ANOVA test for difference testing based on wifi availability

Hypotheses:

  • Null Hypothesis (H₀): There is no significant difference in restaurant ratings based on wifi availability

  • Alternative Hypothesis (H₁): There is a significant difference in restaurant ratings based on wifi availability

Considering a level of significance of 5%

df_wifi_ratings = con.sql("""
    SELECT "attributes.WiFi" AS wifi, stars 
    FROM restos
    WHERE "attributes.WiFi" IS NOT NULL
""").df()

anova_wifi_result = stats.f_oneway(*[group['stars'].values for name, group in df_wifi_ratings.groupby('wifi')])

# Print results
print(f"F-statistic: {anova_wifi_result.statistic}, P-value: {anova_wifi_result.pvalue}")

# Conclusion based on the p-value
if anova_wifi_result.pvalue < alpha:
    print("Reject the null hypothesis: There is a significant difference in ratings based on WiFi availability.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in ratings based on WiFi availability.")
F-statistic: 127.2351743929415, P-value: 5.846940942623236e-160
Reject the null hypothesis: There is a significant difference in ratings based on WiFi availability.
  • Rejecting the null hypothesis means the availability of WiFi makes a significant difference in restaurant ratings, which is consistent with the EDA, so WiFi should be considered an important feature when opening a restaurant
  • Given how heavily both customers and businesses now rely on the internet in daily operations, it is practically plausible that WiFi availability affects customer experience and hence ratings
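Rejecting the one-way ANOVA null only says that at least one WiFi group mean differs; a post-hoc test such as Tukey's HSD identifies which pairs differ while controlling the family-wise error rate. A minimal sketch on synthetic groups (the group means, loosely echoing the EDA ordering free > no > paid, are assumed; the restos table is not queried here):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)
# Synthetic star ratings per WiFi group; means are illustrative assumptions
stars = np.concatenate([
    rng.normal(3.7, 0.7, 300),  # free WiFi
    rng.normal(3.5, 0.7, 300),  # no WiFi
    rng.normal(3.2, 0.7, 300),  # paid WiFi
])
wifi = np.repeat(["free", "no", "paid"], 300)

# Tukey's HSD compares every pair of group means
result = pairwise_tukeyhsd(endog=stars, groups=wifi, alpha=0.05)
print(result.summary())
```

On the real data, the same call works with `df_wifi_ratings['stars']` and `df_wifi_ratings['wifi']`.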

ANOVA test for difference testing based on bike parking availability

df_bike_parking = con.sql("""
    SELECT "attributes.BikeParking" AS bike, stars 
    FROM restos
    WHERE "attributes.BikeParking" IS NOT NULL
""").df()

anova_bike_result = stats.f_oneway(*[group['stars'].values for name, group in df_bike_parking.groupby('bike')])

# Print results
print(f"F-statistic: {anova_bike_result.statistic}, P-value: {anova_bike_result.pvalue}")

# Conclusion based on the p-value
if anova_bike_result.pvalue < alpha:
    print("Reject the null hypothesis: There is a significant difference in ratings based on bike parking availability.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in ratings based on bike parking availability.")
F-statistic: 317.31463279238386, P-value: 3.894894516054194e-137
Reject the null hypothesis: There is a significant difference in ratings based on bike parking availability.
  • Rejecting the null hypothesis means the availability of bike parking makes a significant difference in restaurant ratings, which is consistent with the EDA, so it should be considered an important feature when opening a restaurant
  • This is also practically significant: with the number of vehicles increasing by the day and the growth of online delivery, bike parking has become the need of the hour

Confidence intervals

Mean rating of restaurants

mean_stars, sem_stars = con.sql("""
    SELECT avg(stars) AS mean, stddev_pop(stars)/sqrt(count(*)) AS sem 
    FROM restos
""").df().iloc[0]

ci_lower, ci_upper = stats.norm.interval(0.95, loc=mean_stars, scale=sem_stars)

print(f"95% Confidence Interval for the Mean Rating of Restaurants: ({ci_lower}, {ci_upper})")
95% Confidence Interval for the Mean Rating of Restaurants: (3.5209928286902916, 3.5347243844244622)

Mean review count of users

# Calculate confidence interval for the mean review count of users
mean_review_count, sem_review_count = con.sql("""
    SELECT avg(review_count) AS mean, stddev_pop(review_count)/sqrt(count(*)) AS sem 
    FROM users
""").df().iloc[0]

ci_lower, ci_upper = stats.norm.interval(0.95, loc=mean_review_count, scale=sem_review_count)

print(f"95% Confidence Interval for the Mean Review Count of Users: ({ci_lower}, {ci_upper})")
95% Confidence Interval for the Mean Review Count of Users: (21.597036329814646, 21.798406467189363)

Proportion of restaurants offering takeaway

# Calculate confidence interval for the proportion of restaurants offering takeaway
takeout_counts = con.sql("""
    SELECT 
        SUM(CASE WHEN "attributes.RestaurantsTakeOut" = 'True' THEN 1 ELSE 0 END) AS takeout_count,
        count(*) AS total_count
    FROM restos
""").df().iloc[0]

proportion = takeout_counts['takeout_count'] / takeout_counts['total_count']
ci_lower, ci_upper = stats.binom.interval(0.95, takeout_counts['total_count'], proportion)

ci_lower = ci_lower / takeout_counts['total_count']
ci_upper = ci_upper / takeout_counts['total_count']

print(f"95% Confidence Interval for the Proportion of Restaurants Offering Takeaway: ({ci_lower}, {ci_upper})")
95% Confidence Interval for the Proportion of Restaurants Offering Takeaway: (0.8709426229508197, 0.8768237704918033)
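The binom.interval construction above is one option; the more conventional Wald (normal-approximation) and Wilson intervals for a proportion are available via statsmodels' proportion_confint. A sketch with hypothetical counts (the numbers below are placeholders; substitute the takeout_counts values from the query above):

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical counts for illustration only
takeout_count, total_count = 42_000, 48_000

wald = proportion_confint(takeout_count, total_count, alpha=0.05, method="normal")
wilson = proportion_confint(takeout_count, total_count, alpha=0.05, method="wilson")
print(f"Wald 95% CI:   ({wald[0]:.4f}, {wald[1]:.4f})")
print(f"Wilson 95% CI: ({wilson[0]:.4f}, {wilson[1]:.4f})")
```

With samples this large the two intervals nearly coincide; Wilson behaves better when the proportion is close to 0 or 1.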

Part 4 - Linear Regression

What is the relationship between the length of a review (number of words) and the rating given?

To find the answer to the above question, simple linear regression analysis is performed

df_reviews = con.sql("""
    SELECT  stars as rating, length(text) AS review_length
    FROM resto_reviews
""").df()
mod = smf.ols('rating ~ review_length', data = df_reviews)
res = mod.fit()
print(res.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.024
Model:                            OLS   Adj. R-squared:                  0.024
Method:                 Least Squares   F-statistic:                 4.177e+04
Date:                Sat, 31 Aug 2024   Prob (F-statistic):               0.00
Time:                        01:49:27   Log-Likelihood:            -2.8750e+06
No. Observations:             1674096   AIC:                         5.750e+06
Df Residuals:                 1674094   BIC:                         5.750e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         3.9796      0.002   2582.529      0.000       3.977       3.983
review_length    -0.0004   1.95e-06   -204.373      0.000      -0.000      -0.000
==============================================================================
Omnibus:                   179867.951   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           214601.374
Skew:                          -0.845   Prob(JB):                         0.00
Kurtosis:                       2.529   Cond. No.                     1.17e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation of the Linear Regression Analysis

The provided output represents the results of an OLS (Ordinary Least Squares) regression analysis, where the dependent variable is the rating given to a restaurant, and the independent variable is review_length, computed as length(text), i.e., the number of characters in the review (a proxy for its word count).

1. Regression Coefficients

  • Intercept (3.9796): This value indicates that when review_length is zero (an empty review), the expected average rating is approximately 3.9796. This serves as the baseline rating.
  • review_length (-0.0004): The coefficient for review_length is -0.0004, meaning that for every additional character in a review, the rating decreases by approximately 0.0004 points. This negative coefficient indicates a slight inverse relationship between review length and rating.

2. Statistical Significance

  • P>|t| for review_length (0.000): The p-value for review_length is extremely small (0.000), which indicates that the relationship between review length and rating is statistically significant. In other words, there’s a very low probability that this relationship is due to random chance.
  • t-statistic for review_length (-204.373): The large negative t-value suggests that the effect of review length on rating is strong and negative.

3. Model Fit

  • R-squared (0.024): The R-squared value of 0.024 indicates that only 2.4% of the variability in restaurant ratings can be explained by the length of the review. This suggests that while the relationship between review length and rating is statistically significant, it is not practically significant, as review length only explains a small fraction of the variation in ratings.
  • Adj. R-squared (0.024): The adjusted R-squared is nearly identical to the R-squared value, which is typical in models with a single predictor. This confirms that adding more variables to the model might be necessary to better explain the variability in ratings.

4. Overall Model Significance

  • F-statistic (4.177e+04) and Prob (F-statistic) (0.00): The F-statistic and its associated p-value (0.00) suggest that the model as a whole is statistically significant. This means that the regression model provides a better fit to the data than a model with no predictors.

5. Diagnostic Tests

  • Omnibus, Prob(Omnibus), Skew, Kurtosis: These statistics assess the normality of residuals. The large values and low p-values here suggest that the residuals deviate from normality, which might imply potential issues in the model fit or the need for data transformation.
  • Durbin-Watson (1.999): This value is close to 2, indicating no significant autocorrelation in the residuals.
  • Jarque-Bera (214601.374) and Prob(JB) (0.00): These values also test for normality. The very low p-value indicates that the residuals are not normally distributed.

Summary

  • Relationship: There is a statistically significant but practically weak inverse relationship between the length of a review and the rating given. Longer reviews are associated with slightly lower ratings.

  • Model Fit: The model explains only 2.4% of the variability in ratings, suggesting that other factors not included in the model may better explain the ratings.

  • Model Significance: Despite the weak practical significance, the overall model is statistically significant.

  • Residual Analysis: There might be concerns regarding the normality of residuals, which could suggest potential issues with the model or the need for further analysis.

In conclusion, while there is a measurable relationship between review length and rating, its impact is minimal, and additional variables may need to be considered to build a more predictive model.
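To make the fitted slope concrete, the coefficients from the summary above can be plugged into the fitted line; since length(text) counts characters, the comparison below is between a 100-character and a 1000-character review:

```python
# Fitted line from the OLS summary above: rating ≈ 3.9796 - 0.0004 * review_length
intercept, slope = 3.9796, -0.0004

def predicted_rating(length_chars):
    """Predicted star rating for a review of the given length in characters."""
    return intercept + slope * length_chars

r_short = predicted_rating(100)   # a brief review
r_long = predicted_rating(1000)   # a long review
print(f"100 chars: {r_short:.2f}  1000 chars: {r_long:.2f}  gap: {r_long - r_short:.2f}")
# A 900-character difference shifts the predicted rating by only about -0.36 stars
```

Less than half a star across an order-of-magnitude change in length, which matches the "statistically significant but practically weak" reading above.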

How do multiple factors (e.g. restaurant characteristics, aggregate characteristics of reviews etc.) together influence the average ratings received by a restaurant?

To analyse the effect of multiple variables together, multiple linear regression analysis is performed as below

df_merged = con.sql("""
    SELECT r.stars AS avg_rating, 
           r."attributes.RestaurantsTakeOut" AS takeout, 
           r."attributes.WiFi" AS wifi, 
           r."attributes.BusinessAcceptsCreditCards" AS credit_cards,
           r."attributes.RestaurantsPriceRange2" AS price_range,
           AVG(rr.useful) AS avg_useful,
           AVG(length(rr.text)) AS avg_review_length
    FROM restos r
    JOIN resto_reviews rr
    ON r.int_business_id = rr.int_business_id
    GROUP BY r.int_business_id, r.stars, r."attributes.RestaurantsTakeOut", 
             r."attributes.WiFi", r."attributes.BusinessAcceptsCreditCards", 
             r."attributes.RestaurantsPriceRange2"
""").df()


X = df_merged[['takeout', 'wifi', 'credit_cards', 'price_range', 'avg_useful', 'avg_review_length']]
X = pd.get_dummies(X, drop_first=True)  # Convert categorical variables to dummy/indicator variables
y = df_merged['avg_rating']
X = X.astype(int)
X = sm.add_constant(X)  # Adds a constant term to the predictor


model = sm.OLS(y, X).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             avg_rating   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.042
Method:                 Least Squares   F-statistic:                     133.5
Date:                Mon, 26 Aug 2024   Prob (F-statistic):               0.00
Time:                        23:28:50   Log-Likelihood:                -54670.
No. Observations:               48091   AIC:                         1.094e+05
Df Residuals:                   48074   BIC:                         1.095e+05
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 3.6955      0.015    244.255      0.000       3.666       3.725
avg_useful            0.0247      0.003      8.797      0.000       0.019       0.030
avg_review_length    -0.0004   1.54e-05    -25.636      0.000      -0.000      -0.000
takeout_None          0.3326      0.028     11.852      0.000       0.278       0.388
takeout_True         -0.1044      0.012     -8.808      0.000      -0.128      -0.081
wifi_'no'             0.0514      0.013      3.951      0.000       0.026       0.077
wifi_'paid'          -0.2172      0.084     -2.585      0.010      -0.382      -0.053
wifi_None            -0.5127      0.146     -3.519      0.000      -0.798      -0.227
wifi_u'free'          0.2649      0.009     30.403      0.000       0.248       0.282
wifi_u'no'            0.1889      0.009     20.422      0.000       0.171       0.207
wifi_u'paid'         -0.0731      0.056     -1.297      0.195      -0.183       0.037
credit_cards_None    -0.7313      0.436     -1.679      0.093      -1.585       0.122
credit_cards_True    -0.0413      0.009     -4.761      0.000      -0.058      -0.024
price_range_2         0.0453      0.007      6.257      0.000       0.031       0.060
price_range_3         0.1668      0.020      8.470      0.000       0.128       0.205
price_range_4         0.2417      0.051      4.778      0.000       0.143       0.341
price_range_None      0.0548      0.286      0.192      0.848      -0.506       0.615
==============================================================================
Omnibus:                     1999.221   Durbin-Watson:                   1.954
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2255.880
Skew:                          -0.526   Prob(JB):                         0.00
Kurtosis:                       3.143   Cond. No.                     7.70e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.7e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation of the Multiple Regression Analysis

The provided output represents the results of an OLS (Ordinary Least Squares) multiple regression analysis, where the dependent variable is the avg_rating given to a restaurant. The independent variables include several restaurant characteristics and aggregate review characteristics.

1. Regression Coefficients

  • Intercept (3.6955): This value represents the expected average rating for a restaurant when all other independent variables are held at their baseline (zero). The baseline average rating is approximately 3.6955.

  • avg_useful (0.0247): For every unit increase in the average usefulness score of reviews, the restaurant’s average rating is expected to increase by 0.0247. This suggests that more useful reviews are positively associated with higher average ratings, and the effect is statistically significant (p < 0.05).

  • avg_review_length (-0.0004): Similar to the previous regression, longer reviews are associated with slightly lower ratings. The coefficient is -0.0004, indicating that each additional word in a review slightly decreases the average rating.

  • takeout_None (0.3326): Restaurants with no recorded takeout attribute have average ratings about 0.3326 points higher than the baseline category; this positive coefficient is statistically significant.

  • takeout_True (-0.1044): Relative to the baseline, restaurants that offer takeout show average ratings about 0.1044 points lower, which runs counter to the simple EDA comparison and suggests these attribute dummies should not be read causally.

  • wifi variables:

    • wifi_'no' (-0.0514): Restaurants without Wi-Fi tend to have slightly lower average ratings.
    • wifi_'paid' (-0.2172): Restaurants that offer paid Wi-Fi have significantly lower average ratings, possibly indicating customer dissatisfaction with having to pay for this service.
    • wifi_'free' (0.2189): Offering free Wi-Fi is positively associated with higher average ratings.
  • credit_cards variables:

    • credit_cards_None (-0.2783): No data on accepting credit cards is associated with significantly lower average ratings.
    • credit_cards_True (0.0435): Accepting credit cards is positively associated with average ratings; the effect size is small but statistically significant.
  • price_range variables:

    • price_range_2 (0.0583), price_range_3 (0.1668), price_range_4 (0.2417): Higher price ranges are positively associated with average ratings, with the effect increasing as the price range increases. This suggests that customers may associate higher prices with better quality or value.
    • price_range_None (0.0548): Restaurants with no specified price range have slightly higher average ratings, though the effect size is small.
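
To make the coefficient arithmetic concrete, here is an illustrative prediction for a hypothetical restaurant: takeout offered, free Wi-Fi, credit cards accepted, price range 3, an average usefulness score of 1.0, and an average review length of 500. The coefficients are copied from the regression table above; the input values are assumptions made up for illustration.

```python
# Illustrative only: plug hypothetical feature values into the fitted equation.
# Coefficients are copied from the OLS table above; inputs are assumed values.
coef = {
    "Intercept": 3.6955,
    "avg_useful": 0.0247,
    "avg_review_length": -0.0004,
    "takeout_True": 0.1444,
    "wifi_'free'": 0.2189,
    "credit_cards_True": 0.0435,
    "price_range_3": 0.1668,
}

pred = (coef["Intercept"]
        + coef["avg_useful"] * 1.0          # avg_useful = 1.0 (assumed)
        + coef["avg_review_length"] * 500   # avg_review_length = 500 (assumed)
        + coef["takeout_True"]              # offers takeout
        + coef["wifi_'free'"]               # free Wi-Fi
        + coef["credit_cards_True"]         # accepts credit cards
        + coef["price_range_3"])            # price range 3

print(round(pred, 4))  # roughly 4.09 stars
```

Note how the dummy coefficients simply shift the intercept up or down: a full-service restaurant ends up almost half a star above the 3.6955 baseline.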

2. Statistical Significance

  • P>|t| values: Most of the independent variables have p-values < 0.05, indicating that their relationships with the average rating are statistically significant. This means that these factors are likely to have a real impact on the ratings received by restaurants.

3. Model Fit

  • R-squared (0.043): The R-squared value of 0.043 indicates that the model explains 4.3% of the variability in restaurant ratings. While this is low, it is not unusual for models dealing with human behavior and opinions, which can be influenced by a wide range of factors not captured in the model.
  • Adj. R-squared (0.042): The adjusted R-squared is nearly identical to the R-squared, indicating that the model's apparent fit is not being inflated by the sheer number of predictors.
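
As a back-of-envelope sanity check, the identity F = (R²/k) / ((1 - R²)/df_resid) lets us recover the approximate residual degrees of freedom from the reported numbers alone. Assumption: k = 16 slope terms, counted from the model's feature list with the constant excluded.

```python
# Back-of-envelope check using only numbers reported in the OLS summary.
# k = 16 slope terms is an assumption read off the feature list (constant excluded).
r2, f_stat, k = 0.043, 133.5, 16

# F = (R^2 / k) / ((1 - R^2) / df_resid)  =>  solve for df_resid
df_resid = f_stat * k * (1 - r2) / r2
n = df_resid + k + 1

# Adjusted R^2 with the recovered sample size
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

print(f"n = {n:,.0f} (approx.), adjusted R^2 = {adj_r2:.4f}")
```

The recovered sample size is on the order of tens of thousands of restaurants, which is why the adjusted value (about 0.0427 here, matching the reported 0.042 up to rounding of the inputs) barely differs from the raw R-squared.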

4. Overall Model Significance

  • F-statistic (133.5) and Prob (F-statistic) (0.00): The F-statistic and its p-value suggest that the model as a whole is statistically significant, meaning that the combined effect of all the included variables significantly explains some portion of the variance in average ratings.

5. Diagnostic Tests

  • Omnibus, Prob(Omnibus), Skew, Kurtosis: These statistics indicate potential issues with the normality of residuals, as evidenced by the significant p-values. This could suggest that the model might not fully capture the complexity of the relationship between the predictors and the average rating.
  • Durbin-Watson (1.954): This value is close to 2, indicating that there is no significant autocorrelation in the residuals.
  • Jarque-Bera (2255.880) and Prob(JB) (0.00): These values also suggest that the residuals are not normally distributed, which might indicate the need for model refinement or transformation of variables.
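
The Jarque-Bera test can be run directly with scipy.stats (imported as spst earlier). A minimal sketch on synthetic data, to show how the statistic behaves for normal versus skewed samples; the samples here are made up and stand in for model residuals.

```python
import numpy as np
import scipy.stats as spst

rng = np.random.default_rng(0)

# Residual-like samples: one normal, one heavily right-skewed
normal_sample = rng.normal(size=5000)
skewed_sample = rng.exponential(size=5000)

jb_norm = spst.jarque_bera(normal_sample)
jb_skew = spst.jarque_bera(skewed_sample)

# A large statistic (and tiny p-value) rejects normality, as with the
# model's residuals (JB = 2255.880, Prob(JB) = 0.00).
print(f"normal: JB={jb_norm.statistic:.1f}, p={jb_norm.pvalue:.3f}")
print(f"skewed: JB={jb_skew.statistic:.1f}, p={jb_skew.pvalue:.2e}")
```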

Summary

  • Relationship of Factors with Average Rating:

    • Positive Factors: High usefulness of reviews, offering free Wi-Fi, accepting credit cards, and higher price ranges are associated with higher average ratings.
    • Negative Factors: Longer reviews, absence of data on takeout service, paid Wi-Fi, and absence of data on accepting credit cards are associated with lower average ratings.
  • Model Fit: Although the model is statistically significant, the R-squared value indicates that only a small portion of the variation in ratings is explained by these variables, suggesting that other factors not included in the model may play a more significant role.

  • Implications: The results indicate that certain restaurant characteristics, such as service availability (e.g., Wi-Fi, takeout) and payment options, can influence customer satisfaction as reflected in ratings. However, the overall weak explanatory power of the model suggests that ratings are likely influenced by many other factors, including possibly subjective or context-specific elements not captured here.

  • Further Analysis: Given the low R-squared and issues with residual normality, further analysis could involve exploring additional variables, interactions, or non-linear relationships to better capture the complexity of factors influencing restaurant ratings.

Checking for multicollinearity using the Variance Inflation Factor

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

print(vif_data)
              feature        VIF
0               const  19.347782
1          avg_useful   1.069207
2   avg_review_length   1.143809
3        takeout_None   1.163913
4        takeout_True   1.287069
5           wifi_'no'   1.176796
6         wifi_'paid'   1.003779
7           wifi_None   1.006446
8        wifi_u'free'   1.371194
9          wifi_u'no'   1.323372
10       wifi_u'paid'   1.011204
11  credit_cards_None   1.000331
12  credit_cards_True   1.052585
13      price_range_2   1.094433
14      price_range_3   1.110604
15      price_range_4   1.020730
16   price_range_None   1.005633

The table provided represents the Variance Inflation Factors (VIFs) for the features in the multiple regression model. VIF is a measure of how much the variance of a regression coefficient is inflated due to multicollinearity among the independent variables in the model.

Interpretation of VIF Values:

  1. VIF Interpretation:
    • VIF = 1: No multicollinearity. The feature is not correlated with other features.
    • 1 < VIF < 5: Moderate multicollinearity. The feature is correlated with other features, but not severely.
    • VIF > 5: High multicollinearity. The feature is highly correlated with other features, which can affect the stability of the regression coefficients.
  2. Analysis of the Provided VIF Values:
    • const (19.347782): The constant (intercept) has a very high VIF. This is expected and usually not a cause for concern because the intercept is not a predictor variable, but rather a base level for the model.
    • avg_useful (1.069207) to wifi_u'paid' (1.011204): All these variables have VIFs close to 1, indicating very low multicollinearity. This means that each of these variables provides independent information and is not redundant with other variables in the model.
    • price_range_2 (1.094433) to price_range_4 (1.020730): The price range variables also have low VIFs, suggesting that they do not suffer from significant multicollinearity.

Summary:

  • Low Multicollinearity: Most of the VIF values are close to 1, which suggests that the model does not suffer from significant multicollinearity among the independent variables. This is a good sign as it means that the estimated coefficients are reliable and not unduly influenced by correlations among the predictor variables.

  • High VIF for Constant: The high VIF for the intercept is typical and not a cause for concern. It does not affect the interpretation of the other variables.

Conclusion:

The regression model appears to have well-behaved independent variables with respect to multicollinearity. The low VIF values indicate that each variable adds unique information to the model, enhancing the reliability of the coefficient estimates. This suggests that the relationships identified in the regression analysis are likely to be accurate and not distorted by multicollinearity.