Assignment 3
Due date: 23:59 20/11/24
Part 1 - Little Project on Seattle Airbnb Listings (50 pts)
For the first part of the assignment, you will continue with several analysis tasks using the Seattle Airbnb Open Data. You will work on the listings dataset, which includes a full description and the average review scores for each Airbnb listing.
1. Data Preprocessing (10 pts)
Read in the listings.csv file as a pandas DataFrame.
Keep only the id, latitude, longitude, amenities, and price columns.
Remove all rows containing missing values.
Replace the $ and , characters in the price column with an empty string and convert the column to a numeric format. Hint: use .astype(float).
Sort the DataFrame by price from highest to lowest.
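The preprocessing steps above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the four rows and the extra name column are invented stand-ins for listings.csv, and the real file will need its own path.

```python
import io
import pandas as pd

# Toy stand-in for listings.csv; the rows and the extra "name" column are made up.
raw = io.StringIO(
    'id,name,latitude,longitude,amenities,price\n'
    '1,Loft,47.61,-122.33,"{TV,Wifi}","$1,250.00"\n'
    '2,Room,47.62,-122.35,"{Kitchen}",$85.00\n'
    '3,Cabin,47.65,,"{Heating}",$120.00\n'
    '4,Studio,47.60,-122.30,"{Wifi,Kitchen}",$99.00\n'
)
df = pd.read_csv(raw)

# Keep only the five columns of interest, then drop rows with missing values
# (row 3 has no longitude and is removed).
df = df[["id", "latitude", "longitude", "amenities", "price"]].dropna()

# Strip "$" and "," as literal characters (regex=False), then convert to float.
clean_price = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
df = df.assign(price=clean_price)

# Sort from the most expensive listing to the cheapest.
df = df.sort_values("price", ascending=False).reset_index(drop=True)
print(df)
```

Passing regex=False matters here because $ is a special character in regular expressions.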
...
2. Data Exploration (20 pts)
2.1. Grouping (5 pts)
Create a new column category by slicing the price column into 5 quantiles (A, B, C, D, E), which correspond to the top 20%, 20-40%, 40-60%, 60-80%, and 80-100%, with Category A being the most expensive and E the cheapest.
Hint: use the qcut() function; see the examples here.
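As a rough illustration of the hint, the sketch below applies qcut() to ten invented prices. Note that qcut() orders its bins from lowest to highest, so the labels are listed from E to A to make A the most expensive group. The prices are made up, and ties at the quantile edges (which can make qcut() raise an error on real data) are not handled here.

```python
import pandas as pd

# Ten made-up prices, already sorted from highest to lowest.
prices = pd.DataFrame(
    {"price": [500.0, 320.0, 250.0, 180.0, 150.0, 120.0, 99.0, 85.0, 60.0, 40.0]}
)

# q=5 splits the values into five equal-sized quantile bins.
# Bins run from cheapest to most expensive, so the labels start at E.
prices["category"] = pd.qcut(
    prices["price"], q=5, labels=["E", "D", "C", "B", "A"]
)
print(prices)
```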
...
2.2. Data Visualization (10 pts)
Create a map displaying the locations of the listings in Category A and Category E with different colors.
Identify a location where Category A listings cluster on the map.
...
2.3. Aggregation (5 pts)
Aggregate the values in amenities by the category column created above.
Hint: the shape of your aggregated DataFrame should be 5 x 2.
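One possible reading of this step is to join all amenities strings within each category into a single text, which yields one row per category. The sketch below assumes that interpretation; the seven rows are invented, and joining with a space is a choice made for illustration, not something the assignment prescribes.

```python
import pandas as pd

# Made-up listings that already carry a category label.
df = pd.DataFrame(
    {
        "category": ["A", "A", "B", "C", "D", "E", "E"],
        "amenities": [
            "{TV,Wifi}", "{Pool}", "{Wifi}", "{Kitchen}",
            "{Heating}", "{Wifi}", "{Washer}",
        ],
    }
)

# Concatenate the amenities strings of every listing in the same category.
agg = df.groupby("category")["amenities"].agg(" ".join).reset_index()
print(agg.shape)
print(agg)
```

With five categories and the two columns category and amenities, the result has the 5 x 2 shape mentioned in the hint.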
...
3. Compute TF-IDF based on the categories (20 pts)
Compute TF-IDF on the cleaned amenities column (excluding stop words).
Display the top 20 words for listings in Category A.
Display the top 20 words for listings in Category E.
Provide a one-sentence observation from the results.
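To make the idea concrete, here is a from-scratch sketch that treats the aggregated text of each category as one document and scores each word by term frequency times log(N / document frequency). It is an assumption-laden toy: the three documents, the tiny stop-word set, and the helper names tokenize, tf_idf, and top_words are all hypothetical, and library implementations typically use a smoothed and normalized variant, so their scores will differ.

```python
import math
import re
from collections import Counter

# Made-up aggregated amenities text per category (one "document" each).
docs = {
    "A": "pool hot tub wifi gym pool",
    "C": "wifi kitchen tv",
    "E": "wifi heating kitchen wifi",
}
STOP_WORDS = {"and", "the", "in", "of"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf(documents):
    """Return {doc_name: {word: score}} using tf * log(N / df)."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores[name] = {
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
    return scores


def top_words(scores, name, k=20):
    """Return the k highest-scoring (word, score) pairs for one document."""
    return sorted(scores[name].items(), key=lambda kv: kv[1], reverse=True)[:k]


scores = tf_idf(docs)
print(top_words(scores, "A"))
print(top_words(scores, "E"))
```

In this toy data, wifi appears in every document, so its score is zero everywhere, while words unique to one category rise to the top.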
...
Part 2 - Reading (50 pts)
For the second part of the assignment, you will read this paper: An Empirical Study on the Names of Points of Interest and Their Changes with Geographic Distance.
Please describe how TF-IDF is adapted to the geographic context in this paper. Explain the meaning of different variables in Equation 2. (10 pts)
You might have noticed that both Lab 7 and this paper attempt to develop a spatial/geographic version of TF-IDF. However, they differ in (1) what textual clues are used in the computation, and (2) what kind of geographic regions are the subject of the study. Explain these differences. (20 pts)
Write a short summary (no more than 250 words) on why the authors think place names have significant meanings. (20 pts)
Submission
The assignment includes two parts. Please submit both parts together as a single Jupyter Notebook file.