Section 1 - Group 14

Abhishek Goyal (FT251004)
Anubhav Verma (FT251021)
Harshit Vashisht (FT251029)
N.V. Padmanayana (FT251048)
Waagmee Singh (FT251105)

Setup of database

import numpy as np
import scipy.stats as spst
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import duckdb

We are going to use duckdb as our database engine. You can refer to the documentation here: Documentation. First, we will create a database called yelp.db in duckdb. We will persist this db on the harddrive because it might be too big to handle in your system RAM. That said, duckdb has in-RAM database capabilities as well. Of course, if you have a lot of RAM, you can use Pandas directly or use the in-RAM capabilities of duckdb. Once we execute the command below, we will have created the required db and connected to it as well. Once we have created the db, it will appear in the root folder of your project.

con = duckdb.connect("yelpdata.db")

Now, we will import the csv files into four different tables namely restos, resto_reviews, users and user_friends. This is a onetime execution. You don’t need to execute the codecell below everytime you do the analysis. Once the tables are created, they will persist on the db in your root folder.

restos table comes from restuarants_train.csv (There is also a restuarants_test.csv. You do not need to worry about this test data now. This will be used to conduct a hackathon to predict star ratings later if we get time. ). The columns correspond to business.json in https://www.yelp.com/dataset/documentation/main

resto_reviews table comes from restaurant_reviews.csv. The columns correspond to review.json in https://www.yelp.com/dataset/documentation/main

users table comes from user.csv. The columns correspond to user.json in https://www.yelp.com/dataset/documentation/main

user_friends table comes from user_friends_full.csv. It was created by me. It shows the number of friends of each user in the social network of yelp.

con.sql("""
    CREATE TABLE restos AS
    FROM read_csv('yelp.restaurants_train.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE resto_reviews AS
    FROM read_csv('yelp.restaurant_reviews.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE users AS
    FROM read_csv('yelp.user.csv', sample_size = -1);
""")

con.sql("""
    CREATE TABLE user_friends AS
    FROM read_csv('yelp.user_friends_full.csv', sample_size = -1);
""")

Let us look at the first 5 rows of the restos table.

con.sql("""
    SELECT * FROM restos 
    LIMIT 5;
""")

┌──────────────────────┬─────────────────────┬───────────┬───┬──────────────────────┬──────────────────────┐
│         name         │       address       │   city    │ … │ attributes.Accepts…  │ attributes.HairSpe…  │
│       varchar        │       varchar       │  varchar  │   │       boolean        │       varchar        │
├──────────────────────┼─────────────────────┼───────────┼───┼──────────────────────┼──────────────────────┤
│ Oskar Blues Taproom  │ 921 Pearl St        │ Boulder   │ … │ NULL                 │ NULL                 │
│ Flying Elephants a…  │ 7000 NE Airport Way │ Portland  │ … │ NULL                 │ NULL                 │
│ Bob Likes Thai Food  │ 3755 Main St        │ Vancouver │ … │ NULL                 │ NULL                 │
│ Boxwood Biscuit      │ 740 S High St       │ Columbus  │ … │ NULL                 │ NULL                 │
│ Mr G's Pizza & Subs  │ 474 Lowell St       │ Peabody   │ … │ NULL                 │ NULL                 │
├──────────────────────┴─────────────────────┴───────────┴───┴──────────────────────┴──────────────────────┤
│ 5 rows                                                                              60 columns (5 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

We can convert the results from duckdb into a pandas dataframe by appending .df() function as given below. This pandas dataframe can be used for further analysis in python.

df_restos = con.sql("""
    SELECT * FROM restos 
    LIMIT 5;
""").df()
df_restos

	name	address	city	state	postal_code	latitude	longitude	stars	review_count	is_open	...	attributes.Smoking	attributes.DriveThru	attributes.BYOBCorkage	attributes.Corkage	attributes.RestaurantsCounterService	attributes.DietaryRestrictions	attributes.AgesAllowed	attributes.Open24Hours	attributes.AcceptsInsurance	attributes.HairSpecializesIn
0	Oskar Blues Taproom	921 Pearl St	Boulder	CO	80302	40.017544	-105.283348	4.0	86	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
1	Flying Elephants at PDX	7000 NE Airport Way	Portland	OR	97218	45.588906	-122.593331	4.0	126	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
2	Bob Likes Thai Food	3755 Main St	Vancouver	BC	V5V	49.251342	-123.101333	3.5	169	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
3	Boxwood Biscuit	740 S High St	Columbus	OH	43206	39.947007	-82.997471	4.5	11	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
4	Mr G's Pizza & Subs	474 Lowell St	Peabody	MA	01960	42.541155	-70.973438	4.0	39	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None

5 rows × 60 columns

Let us look at the columns in each of the tables.

con.sql("""
    DESCRIBE restos
""").df()

	column_name	column_type	null	key	default	extra
0	name	VARCHAR	YES	None	None	None
1	address	VARCHAR	YES	None	None	None
2	city	VARCHAR	YES	None	None	None
3	state	VARCHAR	YES	None	None	None
4	postal_code	VARCHAR	YES	None	None	None
5	latitude	DOUBLE	YES	None	None	None
6	longitude	DOUBLE	YES	None	None	None
7	stars	DOUBLE	YES	None	None	None
8	review_count	BIGINT	YES	None	None	None
9	is_open	BIGINT	YES	None	None	None
10	attributes.RestaurantsTableService	VARCHAR	YES	None	None	None
11	attributes.WiFi	VARCHAR	YES	None	None	None
12	attributes.BikeParking	VARCHAR	YES	None	None	None
13	attributes.BusinessParking	VARCHAR	YES	None	None	None
14	attributes.BusinessAcceptsCreditCards	VARCHAR	YES	None	None	None
15	attributes.RestaurantsReservations	VARCHAR	YES	None	None	None
16	attributes.WheelchairAccessible	VARCHAR	YES	None	None	None
17	attributes.Caters	VARCHAR	YES	None	None	None
18	attributes.OutdoorSeating	VARCHAR	YES	None	None	None
19	attributes.RestaurantsGoodForGroups	VARCHAR	YES	None	None	None
20	attributes.HappyHour	VARCHAR	YES	None	None	None
21	attributes.BusinessAcceptsBitcoin	BOOLEAN	YES	None	None	None
22	attributes.RestaurantsPriceRange2	VARCHAR	YES	None	None	None
23	attributes.Ambience	VARCHAR	YES	None	None	None
24	attributes.HasTV	VARCHAR	YES	None	None	None
25	attributes.Alcohol	VARCHAR	YES	None	None	None
26	attributes.GoodForMeal	VARCHAR	YES	None	None	None
27	attributes.DogsAllowed	VARCHAR	YES	None	None	None
28	attributes.RestaurantsTakeOut	VARCHAR	YES	None	None	None
29	attributes.NoiseLevel	VARCHAR	YES	None	None	None
30	attributes.RestaurantsAttire	VARCHAR	YES	None	None	None
31	attributes.RestaurantsDelivery	VARCHAR	YES	None	None	None
32	categories	VARCHAR	YES	None	None	None
33	hours.Monday	VARCHAR	YES	None	None	None
34	hours.Tuesday	VARCHAR	YES	None	None	None
35	hours.Wednesday	VARCHAR	YES	None	None	None
36	hours.Thursday	VARCHAR	YES	None	None	None
37	hours.Friday	VARCHAR	YES	None	None	None
38	hours.Saturday	VARCHAR	YES	None	None	None
39	hours.Sunday	VARCHAR	YES	None	None	None
40	int_business_id	BIGINT	YES	None	None	None
41	attributes.GoodForKids	VARCHAR	YES	None	None	None
42	attributes.ByAppointmentOnly	BOOLEAN	YES	None	None	None
43	attributes	VARCHAR	YES	None	None	None
44	hours	VARCHAR	YES	None	None	None
45	attributes.Music	VARCHAR	YES	None	None	None
46	attributes.GoodForDancing	VARCHAR	YES	None	None	None
47	attributes.BestNights	VARCHAR	YES	None	None	None
48	attributes.BYOB	VARCHAR	YES	None	None	None
49	attributes.CoatCheck	VARCHAR	YES	None	None	None
50	attributes.Smoking	VARCHAR	YES	None	None	None
51	attributes.DriveThru	VARCHAR	YES	None	None	None
52	attributes.BYOBCorkage	VARCHAR	YES	None	None	None
53	attributes.Corkage	VARCHAR	YES	None	None	None
54	attributes.RestaurantsCounterService	BOOLEAN	YES	None	None	None
55	attributes.DietaryRestrictions	VARCHAR	YES	None	None	None
56	attributes.AgesAllowed	VARCHAR	YES	None	None	None
57	attributes.Open24Hours	BOOLEAN	YES	None	None	None
58	attributes.AcceptsInsurance	BOOLEAN	YES	None	None	None
59	attributes.HairSpecializesIn	VARCHAR	YES	None	None	None

con.sql("""
    DESCRIBE resto_reviews
""")

┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ stars              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ useful             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ funny              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ cool               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ text               │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ date               │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ int_business_id    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_user_id        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_rest_review_id │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

con.sql("""
    DESCRIBE users
""")

┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│    column_name     │ column_type │  null   │   key   │ default │  extra  │
│      varchar       │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ review_count       │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ yelping_since      │ TIMESTAMP   │ YES     │ NULL    │ NULL    │ NULL    │
│ useful             │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ funny              │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ cool               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ elite              │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ fans               │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ average_stars      │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_hot     │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_more    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_profile │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_cute    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_list    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_note    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_plain   │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_cool    │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_funny   │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_writer  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ compliment_photos  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ int_user_id        │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 20 rows                                                        6 columns │
└──────────────────────────────────────────────────────────────────────────┘

con.sql("""
    DESCRIBE user_friends
""")

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ int_user_id │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ num_friends │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Descriptive analysis

Average stars received by restaurants.

con.sql("""
    SELECT avg(stars) FROM restos
""")

┌───────────────────┐
│    avg(stars)     │
│      double       │
├───────────────────┤
│ 3.527858606557377 │
└───────────────────┘

We will select the stars column from restos and plot a histogram of stars.

df_temp = con.sql("""
    SELECT stars FROM restos
""").df()
df_temp

	stars
0	4.0
1	4.0
2	3.5
3	4.5
4	4.0
...	...
48795	2.0
48796	3.0
48797	3.0
48798	4.0
48799	4.5

48800 rows × 1 columns

sns.displot(data = df_temp, x = "stars", kind = "hist", color = "Green")

Average stars on yelp received by restaurants grouped by state.

df_temp = con.sql("""
    SELECT state, avg(stars) as avg_stars FROM restos 
    GROUP BY state
    ORDER BY avg_stars DESC
""").df()
df_temp

	state	avg_stars
0	ABE	4.500000
1	NH	4.000000
2	OR	3.752074
3	TX	3.656262
4	CO	3.587216
5	FL	3.521030
6	MN	3.500000
7	MA	3.480819
8	BC	3.454463
9	WA	3.450202
10	OH	3.421853
11	GA	3.399796
12	KY	2.000000
13	VA	2.000000
14	KS	2.000000
15	WY	1.500000

Let us look at how many restaurants provide WiFi.

con.sql("""
    SELECT DISTINCT "attributes.WiFi" FROM restos
""").df()

	attributes.WiFi
0	u'no'
1	u'paid'
2	'paid'
3	u'free'
4	None
5	'free'
6	'no'
7	None

df_temp = con.sql("""
    SELECT "attributes.WiFi" AS wifi, count("attributes.WiFi") AS count FROM restos
    GROUP BY wifi
""").df()
df_temp

	wifi	count
0	'free'	5496
1	u'paid'	184
2	u'free'	14994
3	None	29
4	u'no'	11706
5	'no'	4394
6	'paid'	81
7	None	0

Of course we will need to correct the data because, as you can see, the values of the variable are not clean. There seems to be repetitions of the same thing mentioned in two different ways.

df_new = pd.DataFrame({
    "wifi": [
        "None",
        "Free",
        "No",
        "Paid"
    ],
    "count": [
        (df_temp["count"][0] + df_temp["count"][7]),
        (df_temp["count"][1] + df_temp["count"][3]),
        (df_temp["count"][2] + df_temp["count"][5]),
        (df_temp["count"][4] + df_temp["count"][6])
    ]
})
df_new

	wifi	count
0	None	5496
1	Free	213
2	No	19388
3	Paid	11787

Let us find aggregates of useful votes recevied by reviews grouped by star ratings.

df_temp = con.sql("""
    SELECT stars, sum(useful) as sum_useful, avg(useful) as avg_useful, max(useful) AS max_useful FROM resto_reviews
    GROUP BY stars
    ORDER BY stars ASC
""").df()
df_temp

	stars	sum_useful	avg_useful	max_useful
0	1	239163.0	1.256445	146
1	2	184718.0	1.192214	260
2	3	216230.0	1.000694	177
3	4	454666.0	1.036895	411
4	5	556257.0	0.825011	223

sns.catplot(
    data=df_temp, 
    x="stars", 
    y="sum_useful", 
    kind="bar", 
    palette='magma',
    height=6, 
    aspect=1.5,
    edgecolor="black"
)

C:\Users\A\AppData\Local\Temp\ipykernel_15224\656945628.py:1: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.catplot(

Part 1 - Exploratory Data Analysis

Data cleaning and merging

df_restos = con.sql('select * from restos').df()
df_resto_reviews = con.sql('select * from resto_reviews').df()
df_users = con.sql('select * from users').df()
df_user_friends = con.sql('select * from user_friends').df()

The tables for each of the CSV files look as below

df_restos.head(5)

	name	address	city	state	postal_code	latitude	longitude	stars	review_count	is_open	...	attributes.Smoking	attributes.DriveThru	attributes.BYOBCorkage	attributes.Corkage	attributes.RestaurantsCounterService	attributes.DietaryRestrictions	attributes.AgesAllowed	attributes.Open24Hours	attributes.AcceptsInsurance	attributes.HairSpecializesIn
0	Oskar Blues Taproom	921 Pearl St	Boulder	CO	80302	40.017544	-105.283348	4.0	86	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
1	Flying Elephants at PDX	7000 NE Airport Way	Portland	OR	97218	45.588906	-122.593331	4.0	126	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
2	Bob Likes Thai Food	3755 Main St	Vancouver	BC	V5V	49.251342	-123.101333	3.5	169	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
3	Boxwood Biscuit	740 S High St	Columbus	OH	43206	39.947007	-82.997471	4.5	11	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None
4	Mr G's Pizza & Subs	474 Lowell St	Peabody	MA	01960	42.541155	-70.973438	4.0	39	1	...	None	None	None	None	NaN	None	None	NaN	NaN	None

5 rows × 60 columns

df_resto_reviews.head(5)

	stars	useful	funny	cool	text	date	int_business_id	int_user_id	int_rest_review_id
0	2	1	1	1	I've stayed at many Marriott and Renaissance M...	2010-01-08 02:29:15	4954	6319642	2
1	2	0	0	0	The setting is perfectly adequate, and the foo...	2006-04-16 02:58:44	14180	292901	5
2	5	0	0	0	I work in the Pru and this is the most afforda...	2014-05-07 18:10:21	11779	6336225	7
3	5	5	3	3	I loved everything about this place. I've only...	2014-02-05 21:09:05	3216	552519	12
4	4	0	0	0	I think their rice dishes are way better than ...	2017-05-26 03:05:46	8748	544027	16

df_users.head(5)

	review_count	yelping_since	useful	funny	cool	elite	fans	average_stars	compliment_hot	compliment_more	compliment_profile	compliment_cute	compliment_list	compliment_note	compliment_plain	compliment_cool	compliment_funny	compliment_writer	compliment_photos	int_user_id
0	1220	2005-03-14 20:26:35	15038	10030	11291	2006,2007,2008,2009,2010,2011,2012,2013,2014	1357	3.85	1710	163	190	361	147	1212	5691	2541	2541	815	323	1
1	2136	2007-08-10 19:01:51	21272	10289	18046	2007,2008,2009,2010,2011,2012,2013,2014,2015,2...	1025	4.09	1632	87	94	232	96	1187	3293	2205	2205	472	294	1248
2	119	2007-02-07 15:47:53	188	128	130	2010,2011	16	3.76	22	1	3	0	0	5	20	31	31	3	1	11604
3	987	2009-02-09 16:14:29	7234	4722	4035	2009,2010,2011,2012,2013,2014	420	3.77	1180	129	93	219	90	1120	4510	1566	1566	391	326	2278
4	495	2008-03-03 04:57:05	1577	727	1124	2009,2010,2011	47	3.72	248	19	32	16	15	77	131	310	310	98	44	13481

df_user_friends.head(5)

	int_user_id	num_friends
0	1	5813
1	1248	6296
2	11604	835
3	2278	1452
4	13481	532

df_restos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48800 entries, 0 to 48799
Data columns (total 60 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   name                                   48800 non-null  object 
 1   address                                48378 non-null  object 
 2   city                                   48800 non-null  object 
 3   state                                  48800 non-null  object 
 4   postal_code                            48776 non-null  object 
 5   latitude                               48800 non-null  float64
 6   longitude                              48800 non-null  float64
 7   stars                                  48800 non-null  float64
 8   review_count                           48800 non-null  int64  
 9   is_open                                48800 non-null  int64  
 10  attributes.RestaurantsTableService     18523 non-null  object 
 11  attributes.WiFi                        36884 non-null  object 
 12  attributes.BikeParking                 33715 non-null  object 
 13  attributes.BusinessParking             44235 non-null  object 
 14  attributes.BusinessAcceptsCreditCards  39552 non-null  object 
 15  attributes.RestaurantsReservations     42037 non-null  object 
 16  attributes.WheelchairAccessible        12628 non-null  object 
 17  attributes.Caters                      33288 non-null  object 
 18  attributes.OutdoorSeating              42636 non-null  object 
 19  attributes.RestaurantsGoodForGroups    40952 non-null  object 
 20  attributes.HappyHour                   12774 non-null  object 
 21  attributes.BusinessAcceptsBitcoin      5952 non-null   object 
 22  attributes.RestaurantsPriceRange2      43154 non-null  object 
 23  attributes.Ambience                    39676 non-null  object 
 24  attributes.HasTV                       39971 non-null  object 
 25  attributes.Alcohol                     39435 non-null  object 
 26  attributes.GoodForMeal                 27935 non-null  object 
 27  attributes.DogsAllowed                 11071 non-null  object 
 28  attributes.RestaurantsTakeOut          45727 non-null  object 
 29  attributes.NoiseLevel                  34457 non-null  object 
 30  attributes.RestaurantsAttire           38941 non-null  object 
 31  attributes.RestaurantsDelivery         44650 non-null  object 
 32  categories                             48800 non-null  object 
 33  hours.Monday                           36880 non-null  object 
 34  hours.Tuesday                          38586 non-null  object 
 35  hours.Wednesday                        40018 non-null  object 
 36  hours.Thursday                         40597 non-null  object 
 37  hours.Friday                           40775 non-null  object 
 38  hours.Saturday                         39370 non-null  object 
 39  hours.Sunday                           34455 non-null  object 
 40  int_business_id                        48800 non-null  int64  
 41  attributes.GoodForKids                 40477 non-null  object 
 42  attributes.ByAppointmentOnly           3413 non-null   object 
 43  attributes                             0 non-null      object 
 44  hours                                  0 non-null      object 
 45  attributes.Music                       5224 non-null   object 
 46  attributes.GoodForDancing              3584 non-null   object 
 47  attributes.BestNights                  4184 non-null   object 
 48  attributes.BYOB                        3247 non-null   object 
 49  attributes.CoatCheck                   3863 non-null   object 
 50  attributes.Smoking                     3245 non-null   object 
 51  attributes.DriveThru                   4582 non-null   object 
 52  attributes.BYOBCorkage                 3513 non-null   object 
 53  attributes.Corkage                     3754 non-null   object 
 54  attributes.RestaurantsCounterService   40 non-null     object 
 55  attributes.DietaryRestrictions         67 non-null     object 
 56  attributes.AgesAllowed                 19 non-null     object 
 57  attributes.Open24Hours                 32 non-null     object 
 58  attributes.AcceptsInsurance            8 non-null      object 
 59  attributes.HairSpecializesIn           3 non-null      object 
dtypes: float64(3), int64(3), object(54)
memory usage: 22.3+ MB

Assuming 30% of total datapoints to be the threshold above which, a column is of no use due to lack of data

# Calculate the threshold for dropping columns
threshold = len(df_restos) * 0.30

# Drop columns with more than 30% null values
df_restos = df_restos.dropna(axis=1, thresh=threshold)

df_restos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48800 entries, 0 to 48799
Data columns (total 38 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   name                                   48800 non-null  object 
 1   address                                48378 non-null  object 
 2   city                                   48800 non-null  object 
 3   state                                  48800 non-null  object 
 4   postal_code                            48776 non-null  object 
 5   latitude                               48800 non-null  float64
 6   longitude                              48800 non-null  float64
 7   stars                                  48800 non-null  float64
 8   review_count                           48800 non-null  int64  
 9   is_open                                48800 non-null  int64  
 10  attributes.RestaurantsTableService     18523 non-null  object 
 11  attributes.WiFi                        36884 non-null  object 
 12  attributes.BikeParking                 33715 non-null  object 
 13  attributes.BusinessParking             44235 non-null  object 
 14  attributes.BusinessAcceptsCreditCards  39552 non-null  object 
 15  attributes.RestaurantsReservations     42037 non-null  object 
 16  attributes.Caters                      33288 non-null  object 
 17  attributes.OutdoorSeating              42636 non-null  object 
 18  attributes.RestaurantsGoodForGroups    40952 non-null  object 
 19  attributes.RestaurantsPriceRange2      43154 non-null  object 
 20  attributes.Ambience                    39676 non-null  object 
 21  attributes.HasTV                       39971 non-null  object 
 22  attributes.Alcohol                     39435 non-null  object 
 23  attributes.GoodForMeal                 27935 non-null  object 
 24  attributes.RestaurantsTakeOut          45727 non-null  object 
 25  attributes.NoiseLevel                  34457 non-null  object 
 26  attributes.RestaurantsAttire           38941 non-null  object 
 27  attributes.RestaurantsDelivery         44650 non-null  object 
 28  categories                             48800 non-null  object 
 29  hours.Monday                           36880 non-null  object 
 30  hours.Tuesday                          38586 non-null  object 
 31  hours.Wednesday                        40018 non-null  object 
 32  hours.Thursday                         40597 non-null  object 
 33  hours.Friday                           40775 non-null  object 
 34  hours.Saturday                         39370 non-null  object 
 35  hours.Sunday                           34455 non-null  object 
 36  int_business_id                        48800 non-null  int64  
 37  attributes.GoodForKids                 40477 non-null  object 
dtypes: float64(3), int64(3), object(32)
memory usage: 14.1+ MB

# Calculate the threshold for dropping columns
threshold = len(df_resto_reviews) * 0.30

# Drop columns with more than 30% null values
df_resto_reviews = df_resto_reviews.dropna(axis=1, thresh=threshold)
df_resto_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1674096 entries, 0 to 1674095
Data columns (total 9 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   stars               1674096 non-null  int64         
 1   useful              1674096 non-null  int64         
 2   funny               1674096 non-null  int64         
 3   cool                1674096 non-null  int64         
 4   text                1674096 non-null  object        
 5   date                1674096 non-null  datetime64[us]
 6   int_business_id     1674096 non-null  int64         
 7   int_user_id         1674096 non-null  int64         
 8   int_rest_review_id  1674096 non-null  int64         
dtypes: datetime64[us](1), int64(7), object(1)
memory usage: 115.0+ MB

# Calculate the threshold for dropping columns
threshold = len(df_users) * 0.30

# Drop columns with more than 30% null values
df_users = df_users.dropna(axis=1, thresh=threshold)
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2189457 entries, 0 to 2189456
Data columns (total 19 columns):
 #   Column              Dtype         
---  ------              -----         
 0   review_count        int64         
 1   yelping_since       datetime64[ns]
 2   useful              int64         
 3   funny               int64         
 4   cool                int64         
 5   fans                int64         
 6   average_stars       float64       
 7   compliment_hot      int64         
 8   compliment_more     int64         
 9   compliment_profile  int64         
 10  compliment_cute     int64         
 11  compliment_list     int64         
 12  compliment_note     int64         
 13  compliment_plain    int64         
 14  compliment_cool     int64         
 15  compliment_funny    int64         
 16  compliment_writer   int64         
 17  compliment_photos   int64         
 18  int_user_id         int64         
dtypes: datetime64[ns](1), float64(1), int64(17)
memory usage: 317.4 MB

# Calculate the threshold for dropping columns
threshold = len(df_user_friends) * 0.30

# Drop columns with more than 30% null values
df_user_friends = df_user_friends.dropna(axis=1, thresh=threshold)
df_user_friends.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2189457 entries, 0 to 2189456
Data columns (total 2 columns):
 #   Column       Dtype
---  ------       -----
 0   int_user_id  int64
 1   num_friends  int64
dtypes: int64(2)
memory usage: 33.4 MB

Exploring factors influencing star ratings of the restaurants

Location analysis

# Fetching the data from the database
df_avg_stars_state = con.sql("""
    SELECT state, avg(stars) as avg_stars
    FROM restos
    GROUP BY state
    ORDER BY avg_stars DESC
""").df()

# Creating the plot
plt.figure(figsize=(14, 8))  # Adjusted figure size for better visibility
ax = sns.barplot(data=df_avg_stars_state, x='state', y='avg_stars', hue='state', palette='magma', dodge=False)

# Adjusting the thickness of the bars
for patch in ax.patches:
    patch.set_width(0.9)  # Adjust this value to make bars thicker or thinner

# Rotating x-axis labels
plt.xticks(rotation=90)

# Moving the legend outside the chart area
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), title='State')

# Adjust layout to make space for the legend
plt.tight_layout()

# Display the plot
plt.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

The top three locations with the highest average ratings were ABE, NIH and OR, hence, could be the best locations to consider while opening up a restaurant

Type analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Fetching the data from the database
df_categories = con.sql("""
    SELECT categories, count(*) as count
    FROM restos
    GROUP BY categories
    ORDER BY count DESC
    LIMIT 10
""").df()

# Creating the plot
plt.figure(figsize=(14, 8))  # Adjusted figure size for better visibility
ax = sns.barplot(data=df_categories, x='count', y='categories', hue='categories', palette='magma', dodge=False)

# Adjusting the thickness of the bars
for patch in ax.patches:
    patch.set_height(0.6)  # Adjust this value to make bars thicker or thinner

# Moving the legend outside the chart area
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), title='Categories')

# Adjust layout to make space for the legend
plt.tight_layout()

# Display the plot
plt.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

The most popular categories are Pizza and Chinese, hence, these can be considered to be more suitable for opening up a profitable restaurant

wifi - facilities

# Inner join restos and resto_reviews tables
df_merged = con.sql("""
    SELECT r.int_business_id, r.name, r.stars as resto_avg_stars, 
           rr.stars as review_stars, r."attributes.WiFi", r."attributes.RestaurantsTakeOut"
    FROM restos r
    JOIN resto_reviews rr ON r.int_business_id = rr.int_business_id
""").df()

# Analyze average ratings based on WiFi availability
df_wifi = df_merged.groupby('attributes.WiFi').agg({
    'review_stars': 'mean'
}).reset_index()

import seaborn as sns
import matplotlib.pyplot as plt

# Assuming df_merged is already created and available from your SQL join

# Analyze average ratings based on WiFi availability
df_wifi = df_merged.groupby('attributes.WiFi').agg({
    'review_stars': 'mean'
}).reset_index()

plt.figure(figsize=(10, 6))  # Increase figure size
sns.barplot(data=df_wifi, x='attributes.WiFi', y='review_stars', 
            hue='attributes.WiFi', dodge=False, palette='magma', linewidth=0)
plt.xlabel('WiFi Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of WiFi Availability on User Ratings', fontsize=14)
plt.legend(title='WiFi Availability', bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)  # Move legend out of chart area
plt.show()

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

TakeOut - facilities

import seaborn as sns
import matplotlib.pyplot as plt

# Analyze average ratings based on TakeOut availability
df_takeout = df_merged.groupby('attributes.RestaurantsTakeOut').agg({
    'review_stars': 'mean'
}).reset_index()

# Increase figure size and create the bar plot with gradient hue
plt.figure(figsize=(10, 6))
sns.barplot(data=df_takeout, x='attributes.RestaurantsTakeOut', y='review_stars',
            hue='attributes.RestaurantsTakeOut', dodge=False, 
            palette=sns.color_palette("magma", len(df_takeout)), linewidth=0)

# Add labels and title
plt.xlabel('TakeOut Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of TakeOut Availability on User Ratings', fontsize=14)

# Move legend outside the chart area
plt.legend(title='TakeOut Availability', bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

# Display the plot
plt.show

No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

Bike parking - facilities

import seaborn as sns
import matplotlib.pyplot as plt

# Query data and compute average stars based on BikeParking attribute
df_bike = con.sql("""
    SELECT "attributes.BikeParking" AS bike, avg(stars) as avg_stars
    FROM restos
    GROUP BY bike
""").df()

# Increase figure size and create the bar plot with thicker bars
plt.figure(figsize=(10, 6))
sns.barplot(data=df_bike, x='bike', y='avg_stars', dodge=False, 
            palette=sns.color_palette("magma", len(df_bike)), linewidth=0)

# Add labels and title
plt.xlabel('Bike Parking Availability', fontsize=12)
plt.ylabel('Average User Rating', fontsize=12)
plt.title('Impact of Bike Parking Availability on User Ratings', fontsize=14)

# Remove legend and move it outside the chart area
plt.legend().remove()

# Display the plot
plt.show()

C:\Users\A\AppData\Local\Temp\ipykernel_15224\4155413326.py:13: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(data=df_bike, x='bike', y='avg_stars', dodge=False,
C:\Users\A\AppData\Local\Temp\ipykernel_15224\4155413326.py:13: UserWarning: The palette list has more values (4) than needed (3), which may not be intended.
  sns.barplot(data=df_bike, x='bike', y='avg_stars', dodge=False,
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.

Inferences of facilities

wifi availability free of cost achieved the highest average ratings, hence is important to have this attribute
TakeOut does not seem to be an important feature for the customers to rate the restaurants highly, hence are not that useful
Average rating for restaurants with bike parking availability was observed to be the highest, hence considered an important feature

Amneties

df_amenities = con.sql("""
    SELECT 
        "attributes.OutdoorSeating", "attributes.HappyHour",
        "attributes.Alcohol", "attributes.BusinessAcceptsCreditCards",
        avg(stars) as avg_stars
    FROM restos
    GROUP BY 
        "attributes.OutdoorSeating", "attributes.HappyHour",
        "attributes.Alcohol", "attributes.BusinessAcceptsCreditCards"
    ORDER BY avg_stars DESC
""").df()
df_amenities.head(10)

	attributes.OutdoorSeating	attributes.HappyHour	attributes.Alcohol	attributes.BusinessAcceptsCreditCards	avg_stars
0	True	False	None	False	5.0
1	True	False	u'none'	False	4.7
2	None	True	None	False	4.5
3	True	False	'beer_and_wine'	False	4.5
4	None	False	'full_bar'	None	4.5
5	True	False	u'beer_and_wine'	False	4.5
6	True	True	u'none'	False	4.5
7	None	False	u'beer_and_wine'	False	4.5
8	None	True	u'none'	True	4.5
9	None	None	u'none'	None	4.5

Top amneties relating to higher ratings include Outdoor Seating, Happy hours. Credit Card acceptance

Exploring the relationship between ratings and review characteristics

import seaborn as sns
import matplotlib.pyplot as plt

# Fetching the data from the database
df_review_chars = con.sql("""
    SELECT useful, funny, cool, stars
    FROM resto_reviews
""").df()

sns.heatmap(
    df_review_chars.corr(), 
    cmap='magma',  
    annot=True, 
    fmt='.1f'
)

# Display the plot
plt.show()

The review characteristics and ratings did not reveal any significant linear relationship and hence is not useful for any relational study

Analysis of the reviewer characteristics with rating

df_user_chars = con.sql("""
    SELECT review_count, average_stars, useful
    FROM users
""").df()

df_user_chars.head()

	review_count	average_stars	useful
0	1220	3.85	15038
1	2136	4.09	21272
2	119	3.76	188
3	987	3.77	7234
4	495	3.72	1577

# Correlation matrix
df_user_chars.corr()

	review_count	average_stars	useful
review_count	1.000000	0.036760	0.584678
average_stars	0.036760	1.000000	0.008758
useful	0.584678	0.008758	1.000000

It is observed that with the increase in number of reviews, more people find it useful due to a significant correlation factor of ~0.58

Influence of number of friends on user’s review usefulness

df_friends_reviews = con.sql("""
    SELECT u.num_friends, r.useful
    FROM user_friends u
    JOIN resto_reviews r ON u.int_user_id = r.int_user_id
""").df()

sns.scatterplot(
    data=df_friends_reviews, 
    x='num_friends', 
    y='useful', 
    color=sns.color_palette("magma", as_cmap=True)(0.7)
)

plt.show()

sns.heatmap(df_friends_reviews.corr(), annot = True)

There is a low correlation of 0.37 between number of friends and usefulness, hence not much influencial

Analyzing factors affecting length of reviews

df_review_length = con.sql("""
    SELECT length(text) as review_length, stars
    FROM resto_reviews
""").df()

# Correlation

sns.heatmap(df_review_length.corr(), annot = True)

A negligible correlation is observed between review length and ratings, hence there is no solid inference from this analysis

Analysing the relation of number of reviews and ratings

# Merge restos and resto_reviews tables
df_merged = con.sql("""
    SELECT r.int_business_id, r.name, r.stars as resto_avg_stars, 
           r.review_count, rr.stars as review_stars
    FROM restos r
    JOIN resto_reviews rr ON r.int_business_id = rr.int_business_id
""").df()

# Group by restaurant to get average rating and review count
df_grouped = df_merged.groupby('int_business_id').agg({
    'resto_avg_stars': 'mean',
    'review_count': 'sum'
}).reset_index()

sns.scatterplot(data=df_grouped, x='review_count', y='resto_avg_stars',color=sns.color_palette("magma", as_cmap=True)(0.7))
plt.xlabel('Total Number of Reviews')
plt.ylabel('Average Restaurant Rating')
plt.title('Relationship Between Review Count and Average Rating')
plt.show()

For higher ratings, there are higher number of reviews observed, hence it can be deduced that more number of people are interested to give a good feedback when they are satisfied with the services

Part 2 - Descriptive Analysis

Restaurant analysis

# Descriptive statistics for Restaurants (restos table)
df_restos_stats = con.sql("""
    SELECT 
        avg(stars) AS mean_stars,
        median(stars) AS median_stars,
        mode() WITHIN GROUP (ORDER BY stars) AS mode_stars,
        stddev_pop(stars) AS stddev_stars,
        avg(review_count) AS mean_review_count,
        median(review_count) AS median_review_count,
        mode() WITHIN GROUP (ORDER BY review_count) AS mode_review_count,
        stddev_pop(review_count) AS stddev_review_count
    FROM restos
""").df()

df_restos_stats

	mean_stars	median_stars	mode_stars	stddev_stars	mean_review_count	median_review_count	mode_review_count	stddev_review_count
0	3.527859	3.5	4.0	0.773841	106.800041	43.0	5	208.174311

Restaurant review analysis

# Descriptive statistics for Restaurant Reviews (resto_reviews table)
df_reviews_stats = con.sql("""
    SELECT 
        avg(stars) AS mean_review_stars,
        median(stars) AS median_review_stars,
        mode() WITHIN GROUP (ORDER BY stars) AS mode_review_stars,
        stddev_pop(stars) AS stddev_review_stars,
        avg(useful) AS mean_useful_votes,
        median(useful) AS median_useful_votes,
        mode() WITHIN GROUP (ORDER BY useful) AS mode_useful_votes,
        stddev_pop(useful) AS stddev_useful_votes
    FROM resto_reviews
""").df()

df_reviews_stats

	mean_review_stars	median_review_stars	mode_review_stars	stddev_review_stars	mean_useful_votes	median_useful_votes	mode_useful_votes	stddev_useful_votes
0	3.74747	4.0	5	1.364395	0.986224	0.0	0	2.612258

Yelp User analysis

df_users_stats = con.sql("""
    SELECT 
        avg(review_count) AS mean_user_review_count,
        median(review_count) AS median_user_review_count,
        mode() WITHIN GROUP (ORDER BY review_count) AS mode_user_review_count,
        stddev_pop(review_count) AS stddev_user_review_count,
        avg(average_stars) AS mean_user_avg_stars,
        median(average_stars) AS median_user_avg_stars,
        mode() WITHIN GROUP (ORDER BY average_stars) AS mode_user_avg_stars,
        stddev_pop(average_stars) AS stddev_user_avg_stars
    FROM users
""").df()

df_users_stats

	mean_user_review_count	median_user_review_count	mode_user_review_count	stddev_user_review_count	mean_user_avg_stars	median_user_avg_stars	mode_user_avg_stars	stddev_user_avg_stars
0	21.697721	5.0	1	76.01253	3.653816	3.88	5.0	1.153861

Rating Distribution

plt.figure(figsize=(10, 6))

sns.histplot(
    df_restos['stars'], 
    bins=10, 
    kde=True, 
    color=sns.color_palette("magma", as_cmap=True)(0.7),
    edgecolor='black'
)

# Add labels and title with custom font sizes
plt.xlabel('Restaurant Ratings', fontsize=14, labelpad=10)
plt.ylabel('Frequency', fontsize=14, labelpad=10)
plt.title('Distribution of Restaurant Ratings', fontsize=16, weight='bold', pad=15)

# Customize the ticks on the axes
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Remove top and right spines
sns.despine()

# Add grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Show the plot
plt.show()

Rating analysis from user

df_avg_stars = con.sql("SELECT average_stars FROM users").df()

sns.boxplot(
    data=df_avg_stars, 
    y='average_stars', 
    color=sns.color_palette("magma", as_cmap=True)(0.7)
)
plt.ylabel('Average User Stars', fontsize=12)
plt.title('Distribution of Average User Stars', fontsize=14, weight='bold')
plt.show()

The rating distribution has right skewness, which indicates that most of the restaurants have a high rating stating the fact that as long as a basic and acceptable standard is maintained, the ratings do not fall

# Box plot of restaurant ratings by state
df_ratings_state = con.sql("""
    SELECT state, stars 
    FROM restos
""").df()

# Set a more visually appealing style
sns.set(style="whitegrid")

# Create the box plot
plt.figure(figsize=(12, 8))
sns.boxplot(data=df_ratings_state, x='state', y='stars', palette='magma')

# Add labels and title with custom font sizes
plt.xlabel('State', fontsize=14, labelpad=10)
plt.ylabel('Restaurant Ratings', fontsize=14, labelpad=10)
plt.title('Restaurant Ratings by State', fontsize=16, weight='bold', pad=15)

# Customize the x-axis ticks
plt.xticks(rotation=90, fontsize=12)

# Customize the y-axis ticks
plt.yticks(fontsize=12)

# Remove top and right spines
sns.despine()

# Add grid lines
plt.grid(True, which='both', linestyle='--', linewidth=0.5)

# Show the plot
plt.show()

C:\Users\A\AppData\Local\Temp\ipykernel_15224\2670015677.py:6: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=df_ratings_state, x='state', y='stars', palette='magma')

the restaurants in OR shows more than standard performance, hence showing the range of ratings to be higher than other states, that might infer that the standard of operations relative to expectations are higher in that state, other states hover around the same ratings with a handful of states having extremely low data points to infer meaningful insights

Part - 3 - hypothesis testing

Hypotheses:

Null Hypothesis (H₀): There is no significant difference in restaurant ratings between those that allow takeout and those that do not.
Alternative Hypothesis (H₁): There is a significant difference in restaurant ratings between those that allow takeout and those that do not.

Considering a level of significance of 5%

from scipy import stats

# Separate data into two groups: Takeout and No Takeout
df_takeout = con.sql("""
    SELECT stars 
    FROM restos 
    WHERE "attributes.RestaurantsTakeOut" = 'True'
""").df()

df_no_takeout = con.sql("""
    SELECT stars 
    FROM restos 
    WHERE "attributes.RestaurantsTakeOut" = 'False'
""").df()

# Perform an independent t-test
t_stat, p_value = stats.ttest_ind(df_takeout['stars'], df_no_takeout['stars'], equal_var=False)

# Print results
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Conclusion based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in restaurant ratings between those that allow takeout and those that do not.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in restaurant ratings between those that allow takeout and those that do not.")

T-statistic: -2.9938115671530063, P-value: 0.0027826746728394523
Reject the null hypothesis: There is a significant difference in restaurant ratings between those that allow takeout and those that do not.

The above study indicates that there is a perceivable difference between Takeout services offered and not offered, but EDA showed indifference, that means that this feature can not be solely dependent on average factors observed for the sample, hence takeouts are an important feature and needs to be considered while designing the operations of the new restaurant
Practical significance is debatable since the EDA and hypothesis testing show two different sides, with the increasing demand for online food deliveries and cloud kitchen, it has become a necessary feature for revenue generation but does not directly connect with the requirement of a takeout

ANOVA test for difference testing based on wifi availability

Hypotheses:

Null Hypothesis (H₀): There is no significant difference in restaurant ratings based on wifi availability
Alternative Hypothesis (H₁): There is a significant difference in restaurant ratings based on wifi availability

Considering a level of significance of 5%

df_wifi_ratings = con.sql("""
    SELECT "attributes.WiFi" AS wifi, stars 
    FROM restos
    WHERE "attributes.WiFi" IS NOT NULL
""").df()

anova_wifi_result = stats.f_oneway(*[group['stars'].values for name, group in df_wifi_ratings.groupby('wifi')])

# Print results
print(f"F-statistic: {anova_wifi_result.statistic}, P-value: {anova_wifi_result.pvalue}")

# Conclusion based on the p-value
if anova_wifi_result.pvalue < alpha:
    print("Reject the null hypothesis: There is a significant difference in ratings based on WiFi availability.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in ratings based on WiFi availability.")

F-statistic: 127.2351743929415, P-value: 5.846940942623236e-160
Reject the null hypothesis: There is a significant difference in ratings based on WiFi availability.

The rejection of null hypothesis means the availability of wifi make a significant difference in the ratings of the restaurant, which can be validated by the outcomes of the EDA, hence should be considered to be an important feature for integration when opening up a restaurant
With the current reliance on internet in terms of business or daily life operations both on customer and business end, it is practically possible that the customer experience and the ratings might be affected by the availability of wifi

df_bike_parking = con.sql("""
    SELECT "attributes.BikeParking" AS bike, stars 
    FROM restos
    WHERE "attributes.WiFi" IS NOT NULL
""").df()

anova_bike_result = stats.f_oneway(*[group['stars'].values for name, group in df_bike_parking.groupby('bike')])

# Print results
print(f"F-statistic: {anova_bike_result.statistic}, P-value: {anova_bike_result.pvalue}")

# Conclusion based on the p-value
if anova_bike_result.pvalue < alpha:
    print("Reject the null hypothesis: There is a significant difference in ratings based on bike parking availability.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in ratings based on bike parking availability.")

F-statistic: 317.31463279238386, P-value: 3.894894516054194e-137
Reject the null hypothesis: There is a significant difference in ratings based on bike parking availability.

The rejection of null hypothesis means the availability of bike parking make a significant difference in the ratings of the restaurant, which can be validated by the outcomes of the EDA, hence should be considered to be an important feature for integration when opening up a restaurant
Yes, this is practically significant as with increasing nummber of vehicles by the day and the increase in online delivery mechanisms, bike parking has become the need of the hour and hence has practical significance

Confidence intervals

Mean rating of restaurants

mean_stars, sem_stars = con.sql("""
    SELECT avg(stars) AS mean, stddev_pop(stars)/sqrt(count(*)) AS sem 
    FROM restos
""").df().iloc[0]

ci_lower, ci_upper = stats.norm.interval(0.95, loc=mean_stars, scale=sem_stars)

print(f"95% Confidence Interval for the Mean Rating of Restaurants: ({ci_lower}, {ci_upper})")

95% Confidence Interval for the Mean Rating of Restaurants: (3.5209928286902916, 3.5347243844244622)

Mean review count of users

# Calculate confidence interval for the mean review count of users
mean_review_count, sem_review_count = con.sql("""
    SELECT avg(review_count) AS mean, stddev_pop(review_count)/sqrt(count(*)) AS sem 
    FROM users
""").df().iloc[0]

ci_lower, ci_upper = stats.norm.interval(0.95, loc=mean_review_count, scale=sem_review_count)

print(f"95% Confidence Interval for the Mean Review Count of Users: ({ci_lower}, {ci_upper})")

95% Confidence Interval for the Mean Review Count of Users: (21.597036329814646, 21.798406467189363)

Proportion of restaurants offering takeaway

# Calculate confidence interval for the proportion of restaurants offering takeaway
takeout_counts = con.sql("""
    SELECT 
        SUM(CASE WHEN "attributes.RestaurantsTakeOut" = 'True' THEN 1 ELSE 0 END) AS takeout_count,
        count(*) AS total_count
    FROM restos
""").df().iloc[0]

proportion = takeout_counts['takeout_count'] / takeout_counts['total_count']
ci_lower, ci_upper = stats.binom.interval(0.95, takeout_counts['total_count'], proportion)

ci_lower = ci_lower / takeout_counts['total_count']
ci_upper = ci_upper / takeout_counts['total_count']

print(f"95% Confidence Interval for the Proportion of Restaurants Offering Takeaway: ({ci_lower}, {ci_upper})")

95% Confidence Interval for the Proportion of Restaurants Offering Takeaway: (0.8709426229508197, 0.8768237704918033)

Part 4 - Linear Regression

What is the relationship between the length of a review (number of words) and the rating given?

To find the answer to the above question, simple linear regression analysis is performed

df_reviews = con.sql("""
    SELECT  stars as rating, length(text) AS review_length
    FROM resto_reviews
""").df()
mod = smf.ols('rating ~ review_length', data = df_reviews)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.024
Model:                            OLS   Adj. R-squared:                  0.024
Method:                 Least Squares   F-statistic:                 4.177e+04
Date:                Sat, 31 Aug 2024   Prob (F-statistic):               0.00
Time:                        01:49:27   Log-Likelihood:            -2.8750e+06
No. Observations:             1674096   AIC:                         5.750e+06
Df Residuals:                 1674094   BIC:                         5.750e+06
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         3.9796      0.002   2582.529      0.000       3.977       3.983
review_length    -0.0004   1.95e-06   -204.373      0.000      -0.000      -0.000
==============================================================================
Omnibus:                   179867.951   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           214601.374
Skew:                          -0.845   Prob(JB):                         0.00
Kurtosis:                       2.529   Cond. No.                     1.17e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation of the Linear Regression Analysis

The provided output represents the results of an OLS (Ordinary Least Squares) regression analysis, where the dependent variable is the rating given to a restaurant, and the independent variable is review_length, which presumably measures the number of words in a review.

1. Regression Coefficients

Intercept (3.9796): This value indicates that when the review_length is zero (i.e., when a review has no words), the expected average rating is approximately 3.9796. This serves as the baseline rating.
review_length (-0.0004): The coefficient for review_length is -0.0004, meaning that for every additional word in a review, the rating decreases by approximately 0.0004 points. This negative coefficient suggests a slight inverse relationship between the length of a review and the rating.

2. Statistical Significance

P>|t| for review_length (0.000): The p-value for review_length is extremely small (0.000), which indicates that the relationship between review length and rating is statistically significant. In other words, there’s a very low probability that this relationship is due to random chance.
t-statistic for review_length (-204.373): The large negative t-value suggests that the effect of review length on rating is strong and negative.

3. Model Fit

R-squared (0.024): The R-squared value of 0.024 indicates that only 2.4% of the variability in restaurant ratings can be explained by the length of the review. This suggests that while the relationship between review length and rating is statistically significant, it is not practically significant, as review length only explains a small fraction of the variation in ratings.
Adj. R-squared (0.024): The adjusted R-squared is nearly identical to the R-squared value, which is typical in models with a single predictor. This confirms that adding more variables to the model might be necessary to better explain the variability in ratings.

4. Overall Model Significance

F-statistic (4.177e+04) and Prob (F-statistic) (0.00): The F-statistic and its associated p-value (0.00) suggest that the model as a whole is statistically significant. This means that the regression model provides a better fit to the data than a model with no predictors.

5. Diagnostic Tests

Omnibus, Prob(Omnibus), Skew, Kurtosis: These statistics assess the normality of residuals. The large values and low p-values here suggest that the residuals deviate from normality, which might imply potential issues in the model fit or the need for data transformation.
Durbin-Watson (1.999): This value is close to 2, indicating no significant autocorrelation in the residuals.
Jarque-Bera (214601.374) and Prob(JB) (0.00): These values also test for normality. The very low p-value indicates that the residuals are not normally distributed.

Summary

Relationship: There is a statistically significant but practically weak inverse relationship between the length of a review and the rating given. Longer reviews are associated with slightly lower ratings.
Model Fit: The model explains only 2.4% of the variability in ratings, suggesting that other factors not included in the model may better explain the ratings.
Model Significance: Despite the weak practical significance, the overall model is statistically significant.
Residual Analysis: There might be concerns regarding the normality of residuals, which could suggest potential issues with the model or the need for further analysis.

In conclusion, while there is a measurable relationship between review length and rating, its impact is minimal, and additional variables may need to be considered to build a more predictive model.

How do multiple factors (e.g. restaurant characteristics, aggregate characteristics of reviews etc.) together influence the average ratings received by a restuarant?

To analyse the effect of multiple variables together, multiple linear regression analysis is performed as below

df_merged = con.sql("""
    SELECT r.stars AS avg_rating, 
           r."attributes.RestaurantsTakeOut" AS takeout, 
           r."attributes.WiFi" AS wifi, 
           r."attributes.BusinessAcceptsCreditCards" AS credit_cards,
           r."attributes.RestaurantsPriceRange2" AS price_range,
           AVG(rr.useful) AS avg_useful,
           AVG(length(rr.text)) AS avg_review_length
    FROM restos r
    JOIN resto_reviews rr
    ON r.int_business_id = rr.int_business_id
    GROUP BY r.int_business_id, r.stars, r."attributes.RestaurantsTakeOut", 
             r."attributes.WiFi", r."attributes.BusinessAcceptsCreditCards", 
             r."attributes.RestaurantsPriceRange2"
""").df()


X = df_merged[['takeout', 'wifi', 'credit_cards', 'price_range', 'avg_useful', 'avg_review_length']]
X = pd.get_dummies(X, drop_first=True)  # Convert categorical variables to dummy/indicator variables
y = df_merged['avg_rating']
X = X.astype(int)
X = sm.add_constant(X)  # Adds a constant term to the predictor


model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             avg_rating   R-squared:                       0.043
Model:                            OLS   Adj. R-squared:                  0.042
Method:                 Least Squares   F-statistic:                     133.5
Date:                Mon, 26 Aug 2024   Prob (F-statistic):               0.00
Time:                        23:28:50   Log-Likelihood:                -54670.
No. Observations:               48091   AIC:                         1.094e+05
Df Residuals:                   48074   BIC:                         1.095e+05
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 3.6955      0.015    244.255      0.000       3.666       3.725
avg_useful            0.0247      0.003      8.797      0.000       0.019       0.030
avg_review_length    -0.0004   1.54e-05    -25.636      0.000      -0.000      -0.000
takeout_None          0.3326      0.028     11.852      0.000       0.278       0.388
takeout_True         -0.1044      0.012     -8.808      0.000      -0.128      -0.081
wifi_'no'             0.0514      0.013      3.951      0.000       0.026       0.077
wifi_'paid'          -0.2172      0.084     -2.585      0.010      -0.382      -0.053
wifi_None            -0.5127      0.146     -3.519      0.000      -0.798      -0.227
wifi_u'free'          0.2649      0.009     30.403      0.000       0.248       0.282
wifi_u'no'            0.1889      0.009     20.422      0.000       0.171       0.207
wifi_u'paid'         -0.0731      0.056     -1.297      0.195      -0.183       0.037
credit_cards_None    -0.7313      0.436     -1.679      0.093      -1.585       0.122
credit_cards_True    -0.0413      0.009     -4.761      0.000      -0.058      -0.024
price_range_2         0.0453      0.007      6.257      0.000       0.031       0.060
price_range_3         0.1668      0.020      8.470      0.000       0.128       0.205
price_range_4         0.2417      0.051      4.778      0.000       0.143       0.341
price_range_None      0.0548      0.286      0.192      0.848      -0.506       0.615
==============================================================================
Omnibus:                     1999.221   Durbin-Watson:                   1.954
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2255.880
Skew:                          -0.526   Prob(JB):                         0.00
Kurtosis:                       3.143   Cond. No.                     7.70e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.7e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretation of the Multiple Regression Analysis

The provided output represents the results of an OLS (Ordinary Least Squares) multiple regression analysis, where the dependent variable is the avg_rating given to a restaurant. The independent variables include several restaurant characteristics and aggregate review characteristics.

1. Regression Coefficients

Intercept (3.6955): This value represents the expected average rating for a restaurant when all other independent variables are held at their baseline (zero). The baseline average rating is approximately 3.6955.
avg_useful (0.0247): For every unit increase in the average usefulness score of reviews, the restaurant’s average rating is expected to increase by 0.0247. This suggests that more useful reviews are positively associated with higher average ratings, and the effect is statistically significant (p < 0.05).
avg_review_length (-0.0004): Similar to the previous regression, longer reviews are associated with slightly lower ratings. The coefficient is -0.0004, indicating that each additional word in a review slightly decreases the average rating.
takeout_None (-0.3326): Restaurants that do not have any data on takeout tend to have average ratings that are 0.3326 points lower than those that do offer takeout. This negative relationship is statistically significant.
takeout_True (0.1444): Conversely, offering takeout is associated with a 0.1444 point increase in average ratings, reinforcing the positive impact of this service on customer satisfaction.
wifi variables:
- wifi_‘no’ (-0.0514): Restaurants without Wi-Fi tend to have slightly lower average ratings.
- wifi_‘paid’ (-0.2172): Restaurants that offer paid Wi-Fi have significantly lower average ratings, possibly indicating customer dissatisfaction with having to pay for this service.
- wifi_‘free’ (0.2189): Offering free Wi-Fi is positively associated with higher average ratings.
credit_cards variables:
- credit_cards_None (-0.2783): No data on accepting credit cards is associated with significantly lower average ratings.
- credit_cards_True (0.0435): Accepting credit cards positively influences average ratings, though the effect size is small but statistically significant.
price_range variables:
- price_range_2 (0.0583), price_range_3 (0.1668), price_range_4 (0.2417): Higher price ranges are positively associated with average ratings, with the effect increasing as the price range increases. This suggests that customers may associate higher prices with better quality or value.
- price_range_None (0.0548): Restaurants with no specified price range have slightly higher average ratings, though the effect size is small.

2. Statistical Significance

P>|t| values: Most of the independent variables have p-values < 0.05, indicating that their relationships with the average rating are statistically significant. This means that these factors are likely to have a real impact on the ratings received by restaurants.

3. Model Fit

R-squared (0.043): The R-squared value of 0.043 indicates that the model explains 4.3% of the variability in restaurant ratings. While this is low, it is not unusual for models dealing with human behavior and opinions, which can be influenced by a wide range of factors not captured in the model.
Adj. R-squared (0.042): The adjusted R-squared is nearly identical to the R-squared value, suggesting that the model does not suffer from overfitting.

4. Overall Model Significance

F-statistic (133.5) and Prob (F-statistic) (0.00): The F-statistic and its p-value suggest that the model as a whole is statistically significant, meaning that the combined effect of all the included variables significantly explains some portion of the variance in average ratings.

5. Diagnostic Tests

Omnibus, Prob(Omnibus), Skew, Kurtosis: These statistics indicate potential issues with the normality of residuals, as evidenced by the significant p-values. This could suggest that the model might not fully capture the complexity of the relationship between the predictors and the average rating.
Durbin-Watson (1.954): This value is close to 2, indicating that there is no significant autocorrelation in the residuals.
Jarque-Bera (2255.880) and Prob(JB) (0.00): These values also suggest that the residuals are not normally distributed, which might indicate the need for model refinement or transformation of variables.

Summary

Relationship of Factors with Average Rating:
- Positive Factors: High usefulness of reviews, offering free Wi-Fi, accepting credit cards, and higher price ranges are associated with higher average ratings.
- Negative Factors: Longer reviews, absence of data on takeout service, paid Wi-Fi, and absence of data on accepting credit cards are associated with lower average ratings.
Model Fit: Although the model is statistically significant, the R-squared value indicates that only a small portion of the variation in ratings is explained by these variables, suggesting that other factors not included in the model may play a more significant role.
Implications: The results indicate that certain restaurant characteristics, such as service availability (e.g., Wi-Fi, takeout) and payment options, can influence customer satisfaction as reflected in ratings. However, the overall weak explanatory power of the model suggests that ratings are likely influenced by many other factors, including possibly subjective or context-specific elements not captured here.
Further Analysis: Given the low R-squared and issues with residual normality, further analysis could involve exploring additional variables, interactions, or non-linear relationships to better capture the complexity of factors influencing restaurant ratings.

Checking for multicolinearity using Variance Inflation Factor

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

print(vif_data)

              feature        VIF
0               const  19.347782
1          avg_useful   1.069207
2   avg_review_length   1.143809
3        takeout_None   1.163913
4        takeout_True   1.287069
5           wifi_'no'   1.176796
6         wifi_'paid'   1.003779
7           wifi_None   1.006446
8        wifi_u'free'   1.371194
9          wifi_u'no'   1.323372
10       wifi_u'paid'   1.011204
11  credit_cards_None   1.000331
12  credit_cards_True   1.052585
13      price_range_2   1.094433
14      price_range_3   1.110604
15      price_range_4   1.020730
16   price_range_None   1.005633

The table provided represents the Variance Inflation Factors (VIFs) for the features in the multiple regression model. VIF is a measure of how much the variance of a regression coefficient is inflated due to multicollinearity among the independent variables in the model.

Interpretation of VIF Values:

VIF Interpretation:
- VIF = 1: No multicollinearity. The feature is not correlated with other features.
- 1 < VIF < 5: Moderate multicollinearity. The feature is correlated with other features, but not severely.
- VIF > 5: High multicollinearity. The feature is highly correlated with other features, which can affect the stability of the regression coefficients.
Analysis of the Provided VIF Values:
- const (19.347782): The constant (intercept) has a very high VIF. This is expected and usually not a cause for concern because the intercept is not a predictor variable, but rather a base level for the model.
- avg_useful (1.069207) to wifi_u’paid’ (1.011204): All these variables have VIFs close to 1, indicating very low multicollinearity. This means that each of these variables provides independent information and is not redundant with other variables in the model.
- price_range_2 (1.094433) to price_range_4 (1.020730): The price range variables also have low VIFs, suggesting that they do not suffer from significant multicollinearity.

Summary:

Low Multicollinearity: Most of the VIF values are close to 1, which suggests that the model does not suffer from significant multicollinearity among the independent variables. This is a good sign as it means that the estimated coefficients are reliable and not unduly influenced by correlations among the predictor variables.
High VIF for Constant: The high VIF for the intercept is typical and not a cause for concern. It does not affect the interpretation of the other variables.

Conclusion:

The regression model appears to have well-behaved independent variables with respect to multicollinearity. The low VIF values indicate that each variable adds unique information to the model, enhancing the reliability of the coefficient estimates. This suggests that the relationships identified in the regression analysis are likely to be accurate and not distorted by multicollinearity.