Airbnb Seattle Insights
Diving into the insights of Seattle Airbnb
This blog post draws out key insights from the Airbnb Seattle dataset using the CRISP-DM approach.
Introduction
Plenty of analyses of the Airbnb Seattle data already exist. They answer questions such as which host has the most listings, how bookings vary by month, and how price relates to neighborhood and amenities.
Every business has questions that need to be addressed. The easiest way to answer them is to collect related data and apply the right data-mining techniques. As Daniel Keys Moran put it, “You can have data without information, but you cannot have information without data.”
Answering any question starts with understanding the business or the context behind it, and that is exactly where the CRISP-DM methodology comes in. The CRISP-DM phases used in this post are as follows:
- Business Understanding: This phase is about understanding the intended outcome of the project. It is more crucial than any other phase because the analysis may drive major decisions.
- Data Understanding: This phase starts with collecting data for the question domain. It includes understanding what each dimension in the data means, describing the dataset to see how the data is spread (i.e. exploring it with appropriate graphs and tables), and verifying data quality, e.g. checking for missing values and errors in column data.
- Data Preparation: This phase starts with merging the datasets if there are multiple data sources. It continues with aggregating columns, handling missing values according to column type (using the mode or mean), dropping columns that are not useful for the analysis, and creating dummy columns for categorical data.
- Modelling: This phase covers choosing the modelling technique, listing the techniques and assumptions considered and, most importantly, designing a way to test the results obtained.
Business questions:
Let's begin the CRISP-DM procedure with the business questions that need to be answered.
- Which month is the busiest for Airbnb in Seattle?
- How do prices vary over the year?
- Which neighborhoods are the most and least expensive over the course of the year?
- Which amenities, i.e. which features of a home, play a crucial role in setting the price?
Selection of Datasets.
Link to the dataset: https://www.kaggle.com/airbnb/seattle/data
The Airbnb Seattle dataset consists of three files in a zipped folder.
- Listings: This dataset includes a description of each listing and its review scores on various factors.
- Reviews: This dataset includes a unique id for each reviewer and detailed comments.
- Calendar: This dataset includes the listing id, price and availability for various dates.
Preparing and Understanding the datasets.
- Listings:
Understanding the listings dataset includes finding the number of missing values in each column, obtaining descriptive statistics and understanding the correlation between the columns.
license 3818
square_feet 3721
monthly_price 2301
security_deposit 1952
weekly_price 1809
Above are the columns with the highest number of missing values. Out of 3818 entries, license, square_feet, monthly_price and security_deposit are missing in more than 50% of rows, and weekly_price in about 47% (1809 of 3818). These columns are therefore dropped from the data frame.
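The counting and dropping step can be sketched roughly as below. This is a minimal illustration on a made-up toy frame, not the code used for the post; the helper name drop_sparse_columns and its threshold parameter are hypothetical. Note that weekly_price (1809 of 3818 missing) sits just under a strict 50% cutoff, so a slightly lower threshold would be needed to drop it as well.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isnull().mean()
    sparse_cols = missing_share[missing_share > threshold].index
    return df.drop(columns=sparse_cols)


# Toy frame standing in for listings.csv (all values are made up).
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "license": [np.nan, np.nan, np.nan, np.nan],
    "square_feet": [np.nan, 800.0, np.nan, np.nan],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
})

# Number of missing values per column, largest first.
missing_counts = toy_listings.isnull().sum().sort_values(ascending=False)
cleaned = drop_sparse_columns(toy_listings)

print(missing_counts)
print(list(cleaned.columns))
```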
The number of listings per neighborhood.
Capitol Hill, Central Area and Downtown are the three neighborhoods with the most listings. A plausible reading is that these areas attract the most visitors, which is why they have the most listings.
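A count like this can be obtained with value_counts. The sketch below uses a made-up toy frame, and the column name neighbourhood_group_cleansed is an assumption about how the neighborhood is stored in the listings file.

```python
import pandas as pd

# Toy frame standing in for listings.csv (all values are made up).
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "neighbourhood_group_cleansed": [
        "Capitol Hill", "Downtown", "Capitol Hill",
        "Central Area", "Capitol Hill", "Downtown",
    ],
})

# value_counts sorts by frequency, most common neighborhood first.
listings_per_neighbourhood = toy_listings["neighbourhood_group_cleansed"].value_counts()
top_three = listings_per_neighbourhood.head(3)

print(top_three)
```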
Correlation among the dimensions in Listings.
The heatmap shows that the review score columns are highly correlated with one another. The number of guests, bedrooms, bathrooms and beds are also related to one another.
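The matrix behind such a heatmap can be computed with DataFrame.corr. This is a small sketch on made-up numbers; the column names accommodates, bedrooms, beds and review_scores_rating are assumptions about the listings file, and the plotting call itself is left out.

```python
import pandas as pd

# Toy frame standing in for the numeric part of listings.csv (made-up values).
toy_listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 8],
    "bedrooms": [1, 2, 3, 4],
    "beds": [1, 2, 4, 5],
    "review_scores_rating": [95, 90, 97, 92],
    "name": ["a", "b", "c", "d"],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
numeric = toy_listings.select_dtypes(include="number")
corr_matrix = numeric.corr()

print(corr_matrix.round(2))
```

The resulting square matrix is what a heatmap function would take as input.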
Calendar dataset.
Out of 1393570 records in the calendar dataset, the price column has 459028 null values. These must be handled by replacing the missing values with the mean of the other prices in the same neighborhood.
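One way to sketch this imputation is to join the calendar rows to their listing's neighborhood and fill gaps with the group mean. The example below uses made-up data and assumes that price has already been converted to a number and that the neighborhood lives in a listings column named neighbourhood_group_cleansed; price_filled is a hypothetical output column.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for listings.csv and calendar.csv (made-up values).
toy_listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood_group_cleansed": ["Downtown", "Downtown", "Ballard"],
})
toy_calendar = pd.DataFrame({
    "listing_id": [1, 2, 3, 3],
    "date": ["2016-01-04", "2016-01-04", "2016-01-04", "2016-01-05"],
    "price": [100.0, np.nan, 80.0, np.nan],
})

# Attach each calendar row to the neighborhood of its listing.
merged = toy_calendar.merge(
    toy_listings.rename(columns={"id": "listing_id"}),
    on="listing_id",
    how="left",
)

# Mean of the known prices per neighborhood, aligned row by row.
neighbourhood_mean = (
    merged.groupby("neighbourhood_group_cleansed")["price"].transform("mean")
)
merged["price_filled"] = merged["price"].fillna(neighbourhood_mean)

print(merged[["listing_id", "date", "price", "price_filled"]])
```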
The next step in the CRISP-DM process is to prepare these datasets through data wrangling: handling the missing values and creating dummy columns for the categorical variables.
To perform the data cleaning, we create a function that returns the final data frames, ready for analysis.
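A cleaning function along these lines might look like the sketch below. It is not the function used for the post: clean_listings is a hypothetical name, the toy values are made up, and it assumes price arrives as a string such as "$1,250.00" and that room_type is one of the categorical columns.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric prices, filled gaps and dummy columns."""
    out = df.copy()
    # Strip "$" and thousands separators, e.g. "$1,250.00" -> 1250.0.
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Fill numeric gaps with the column mean.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    # Fill categorical gaps with the most frequent value, then one-hot encode.
    categorical_cols = ["room_type"]
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return pd.get_dummies(out, columns=categorical_cols)


# Toy frame standing in for listings.csv (all values are made up).
toy_listings = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", "$100.00", "$65.00"],
    "bedrooms": [1.0, np.nan, 3.0, 2.0],
    "room_type": ["Entire home/apt", None, "Private room", "Entire home/apt"],
})

cleaned = clean_listings(toy_listings)
print(cleaned)
```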
The dataset is finally ready for analysis. Let's begin with the distribution of price across the listings.
The prices are right-skewed (positively skewed): most listings are priced at the lower end, and a long tail of expensive listings pulls the mean up. Here the mean is around $120.
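One way to check the direction of the skew is to compare the mean with the median: in a right-skewed distribution, the long tail of expensive listings pulls the mean above the median, so most values lie below the mean. The prices below are made up for illustration.

```python
import pandas as pd

# Made-up nightly prices with one expensive outlier in the right tail.
prices = pd.Series([60, 75, 80, 90, 100, 110, 150, 400], dtype=float)

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive for a right-skewed sample
share_below_mean = (prices < mean_price).mean()

print(f"mean={mean_price:.2f} median={median_price:.2f}")
print(f"skewness={skewness:.2f} share below mean={share_below_mean:.2f}")
```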
It’s time to answer our first question: when is the busiest time to visit Seattle?
Looking closely at the graph, April to July, i.e. the summer season, is the busier period: the number of available listings is low then, and it gradually picks up after July.
It is also important to look at the price variation, since prices are generally expected to be high during the summer.
As expected, prices start to rise in May, peak in mid-June, and then decline until November. An upward trend appears again as the Christmas dates approach.
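Both monthly views can be derived from the calendar data in one grouping step. The sketch below uses made-up rows and assumes that available is stored as "t"/"f" and that price is already numeric; a low share of available listings is read as a sign of a busy month, following the reasoning above.

```python
import pandas as pd

# Toy frame standing in for calendar.csv (all values are made up).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 2, 1, 2, 1, 2],
    "date": ["2016-01-10", "2016-01-10", "2016-07-10",
             "2016-07-10", "2016-11-10", "2016-11-10"],
    "available": ["t", "t", "f", "t", "t", "t"],
    "price": [90.0, 110.0, None, 150.0, 95.0, 105.0],
})

toy_calendar["date"] = pd.to_datetime(toy_calendar["date"])
toy_calendar["month"] = toy_calendar["date"].dt.month
toy_calendar["is_available"] = toy_calendar["available"].eq("t")

# Share of available listing-days and mean price for each month.
monthly = toy_calendar.groupby("month").agg(
    available_share=("is_available", "mean"),
    mean_price=("price", "mean"),
)

# Fewest available listings is taken as the busiest month.
busiest_month = monthly["available_share"].idxmin()
priciest_month = monthly["mean_price"].idxmax()

print(monthly)
print(busiest_month, priciest_month)
```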
The next question is which neighborhoods are the most expensive and which are the most affordable.
Downtown has the highest average price, which is not surprising. The other finding is that, irrespective of the neighborhood, prices increase during the summer.
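A comparison like this can be sketched with a groupby for the overall ranking and a pivot table for the month-by-month view. The data is made up, and the frame is assumed to already combine calendar prices with each listing's neighborhood.

```python
import pandas as pd

# Toy frame: calendar prices already joined to neighborhoods (made-up values).
toy_prices = pd.DataFrame({
    "neighbourhood": ["Downtown", "Downtown", "Queen Anne",
                      "Queen Anne", "Delridge", "Delridge"],
    "month": [1, 7, 1, 7, 1, 7],
    "price": [180.0, 200.0, 150.0, 170.0, 70.0, 90.0],
})

# Average price per neighborhood, most expensive first.
mean_by_neighbourhood = (
    toy_prices.groupby("neighbourhood")["price"].mean().sort_values(ascending=False)
)
most_expensive = mean_by_neighbourhood.index[0]
least_expensive = mean_by_neighbourhood.index[-1]

# Neighborhood x month table to compare seasons within each neighborhood.
by_month = toy_prices.pivot_table(
    index="neighbourhood", columns="month", values="price", aggfunc="mean"
)

print(mean_by_neighbourhood)
print(by_month)
```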
The most important question comes next: how are the prices of listings set, and which amenities or other factors, apart from the neighborhood, determine the price?
The graph shows that the number of bedrooms, the number of bathrooms, the month and some of the review scores are among the key features driving the price.
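The post does not say which model produced the graph, so the sketch below is only a stand-in: it fits an ordinary least-squares model on standardized features and ranks them by the absolute size of their coefficients. The data is made up so that price depends on bedrooms and bathrooms but not on month.

```python
import numpy as np
import pandas as pd

# Made-up data: price = 40 + 50 * bedrooms + 20 * bathrooms, no month effect.
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 1, 2],
    "bathrooms": [1, 1, 2, 2, 1, 2],
    "month": [1, 7, 3, 9, 12, 6],
    "price": [110.0, 160.0, 230.0, 280.0, 110.0, 180.0],
})
features = ["bedrooms", "bathrooms", "month"]

# Standardize features so coefficient sizes are comparable across columns.
X = toy[features]
X_std = (X - X.mean()) / X.std(ddof=0)

# Ordinary least squares with an intercept column appended.
design = np.column_stack([X_std.to_numpy(), np.ones(len(toy))])
coefs, *_ = np.linalg.lstsq(design, toy["price"].to_numpy(), rcond=None)

# Rank features by absolute standardized coefficient (intercept excluded).
importance = pd.Series(np.abs(coefs[:-1]), index=features).sort_values(ascending=False)

print(importance)
```

On real listings, a tree-based model or regularized regression would be a common alternative; the ranking idea stays the same.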
Overall, summers are busy in Seattle in terms of bookings, and prices are at the upper end. Downtown, Queen Anne and Cascade are among the most expensive neighborhoods to stay in.
References:
- https://www.kaggle.com/airbnb/seattle/data
- https://classroom.udacity.com/nanodegrees/nd025/parts/5c671264-6d88-412d-bb3a-0c2e07a8b915/modules/fcbd01aa-a9f5-4218-b559-1dfbf7dbafe9/lessons/93967fc6-1c57-407a-888e-2a8e676eb994/concepts/d0f0c9ed-424d-4360-aa59-811b52c54304
- https://www.kaggle.com/aleksandradeis/airbnb-seattle-reservation-prices-analysis
- https://medium.com/@tsakunelsonz/top-5-airbnb-analytical-questions-answered-aa13101a6009