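Full text download link: http://tecdat.cn/?p=29480
Author: Xingsheng Yang
We were recently asked by a client to write a research report on Lianjia rental data analysis, including some graphical and statistical output.
1 Scrape the publicly available rental listings from Lianjia with Python;
2 Analyze the rental listings, focusing on the features related to rent, and build models to predict rent.
Task / Goal
Using the public rental listings on the Shanghai Lianjia website, focus the data analysis and mining on monthly rent.
Shanghai rental data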
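The data comes from Lianjia.com. The csv file contains the name, lease type, number of bedrooms, price, longitude, latitude, balcony, deposit, apartment, description, tour availability, transportation, private bathroom, furniture, new listing, size, orientation, floor level, elevator, parking, and amenities.
Attributes:
Name: listing name
Type: sublet or entire rental (entire)
Bed: number of bedrooms
Price
Longitude / latitude: coordinates
Balcony, deposit (whether there is a deposit policy), apartment, description, tour availability, near transportation, private bathroom, furniture
New listing: NO-0, YES-1
Size: square meters
Orientation: direction the windows face; south-1, southeast-2, east-3, north-4, southwest-5, west-6, northwest-7, northeast-8, unknown-0
Level: floor level of the listing; basement-0, low (1-15)-1, middle (15-25)-2, high (>25)-3
Parking: no parking-0, extra charge-1, free parking-2
Amenities: number of amenities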
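The integer codes listed above are easier to read in tables and plots once they are mapped back to labels. The following is a minimal sketch of such a decoding step. The column names `orientation`, `level`, and `parking` are assumptions for illustration, since the actual column names of the csv file are not shown in the article.

```python
import pandas as pd

# Code tables copied from the attribute list above.
ORIENTATION = {0: "unknown", 1: "south", 2: "southeast", 3: "east", 4: "north",
               5: "southwest", 6: "west", 7: "northwest", 8: "northeast"}
LEVEL = {0: "basement", 1: "low", 2: "middle", 3: "high"}
PARKING = {0: "no parking", 1: "extra charge", 2: "free parking"}


def decode_codes(frame):
    """Return a copy of frame with readable label columns added.

    The column names used here are hypothetical.
    """
    out = frame.copy()
    out["orientation_label"] = out["orientation"].map(ORIENTATION)
    out["level_label"] = out["level"].map(LEVEL)
    out["parking_label"] = out["parking"].map(PARKING)
    return out


# Tiny made-up sample, not real listings.
sample = pd.DataFrame({"orientation": [1, 8, 0], "level": [0, 2, 3], "parking": [2, 1, 0]})
decoded = decode_codes(sample)
```

Keeping the original integer columns alongside the labels means the models can still use the numeric codes while the plots use the readable names.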
import pandas as pd
import numpy as np
import geopandas
df = pd.read_csv('lighai.csv', sep =',', encoding='utf_8_sig', header=None)
df.head()
Data preprocessing
ETL processing: clean the DataFrame.
df_clean.head()
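The cleaning code itself is not shown in the article. As a rough sketch of what such a step might look like, the function below removes duplicate rows, coerces the price to a number, drops unusable rows, and adds the `log_price` column that the modeling section relies on. The function name `clean_listings` and the sample data are hypothetical; the real pipeline may differ.

```python
import numpy as np
import pandas as pd


def clean_listings(raw):
    """Hypothetical cleaning step: deduplicate, fix the price type, add log_price."""
    df = raw.drop_duplicates().copy()
    # Non-numeric prices become NaN and are dropped afterwards.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df = df.dropna(subset=["price"])
    # A rent of zero or less is treated as a data error.
    df = df[df["price"] > 0].copy()
    df["log_price"] = np.log(df["price"])
    return df.reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "name": ["a", "a", "b", "c", "d"],
    "price": ["3500", "3500", "n/a", "0", "12000"],
})
cleaned = clean_listings(raw)
```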
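Exploratory analysis - data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 6))
sns.histplot(df_clean.price, bins=500, kde=True)  # histogram of monthly rent with a KDE overlay
plt.xscale('log')  # show the price axis on a log scale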
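Rent is typically right-skewed, which is why the price is viewed on a log scale here and modeled as `log_price` later. The sketch below illustrates the effect on synthetic log-normal data (not the Lianjia data): the skewness of the raw values is large, while the skewness after taking the logarithm is close to zero.

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed "rents"; the parameters are arbitrary assumptions.
rng = np.random.default_rng(7)
price = pd.Series(rng.lognormal(mean=8.5, sigma=0.6, size=2000))

raw_skew = price.skew()          # strongly positive for right-skewed data
log_skew = np.log(price).skew()  # near zero once the long right tail is compressed

print(f"skewness of price: {raw_skew:.2f}, skewness of log(price): {log_skew:.2f}")
```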
Reading geographic data
plt.figure(figsize=(12, 12))
sns.heatmap(df_clean.corr(), square=True, annot=True, fmt = '.2f', cmap = 'vla
Model building
Try to predict the price from the features.
y = df_clean.log_price
X = df_clean.iloc[:, 1:].drop(['price', 'log_price'], axis=1)
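The later snippets fit models on `X_train` and `y_train`, but the split itself is not shown. A minimal sketch of how it could be done with scikit-learn follows. The toy frame, its column names, the 80/20 ratio, and the random seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_clean: first column is the listing name, then features and targets.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "name": [f"listing_{i}" for i in range(100)],
    "size": rng.uniform(20, 150, size=100),
    "bed": rng.integers(1, 5, size=100),
})
toy["price"] = 60 * toy["size"] + 500 * toy["bed"] + rng.normal(0, 200, size=100)
toy["log_price"] = np.log(toy["price"])

# Same selection pattern as above: skip the name column, drop both target columns.
y = toy.log_price
X = toy.iloc[:, 1:].drop(["price", "log_price"], axis=1)

# Hold out 20% of the rows for evaluation (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Dropping both `price` and `log_price` from `X` matters: leaving either in the feature matrix would leak the target into the model.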
Ridge regression model
from sklearn.linear_model import Ridge
ridge = Ridge()
alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1, 2, 3, 5, 10]  # candidate regularization strengths
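One common way to choose among the candidate `alphas` is a cross-validated grid search. The article does not show how the search was run, so the sketch below is only one plausible setup: `GridSearchCV` with 5 folds and mean squared error as the score, applied to synthetic data. The names `X_demo`, `y_demo`, and `ridge_cv` are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the rental features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + rng.normal(scale=0.1, size=200)

alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1, 2, 3, 5, 10]

# Try every alpha with 5-fold cross-validation and keep the one with the lowest MSE.
ridge_cv = GridSearchCV(Ridge(), {"alpha": alphas}, scoring="neg_mean_squared_error", cv=5)
ridge_cv.fit(X_demo, y_demo)

best_alpha = ridge_cv.best_params_["alpha"]
cv_rmse = np.sqrt(-ridge_cv.best_score_)  # the score is negated MSE, so flip the sign
print(best_alpha, cv_rmse)
```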
Lasso regression
coef.sort_values(ascending=False).plot(kind = 'barh')
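The `coef` object plotted above is presumably a pandas Series of Lasso coefficients indexed by feature name, although its construction is not shown. A sketch under that assumption, on synthetic data with made-up feature names and an arbitrary `alpha`, shows how Lasso shrinks the coefficients of irrelevant features to exactly zero:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only "size" and "bed" actually drive the target.
rng = np.random.default_rng(1)
features = ["size", "bed", "elevator", "noise_a", "noise_b"]
X_demo = pd.DataFrame(rng.normal(size=(300, 5)), columns=features)
y_demo = 2.0 * X_demo["size"] + 0.5 * X_demo["bed"] + rng.normal(scale=0.1, size=300)

# alpha=0.1 is an assumed value; in practice it would be tuned by cross-validation.
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Pair each coefficient with its feature name, then rank them for plotting.
coef = pd.Series(lasso.coef_, index=X_demo.columns)
ranked = coef.sort_values(ascending=False)
print(ranked)
```

Features whose coefficient is zero have effectively been removed from the model, which is what makes the bar chart useful as a rough feature-selection view.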
Random forest
rf_cv.fit(X_train, y_train)
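The definition of `rf_cv` is not included in the article. One plausible construction, assuming it is a grid search over a `RandomForestRegressor`, is sketched below on synthetic data. The parameter grid, fold count, and scoring choice are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic nonlinear target standing in for log rent.
rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(300, 4))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.05, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Small, assumed grid; a real search would likely cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
rf_cv = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
rf_cv.fit(X_train, y_train)

# R^2 of the best model on the held-out rows.
test_r2 = rf_cv.best_estimator_.score(X_test, y_test)
print(rf_cv.best_params_, test_r2)
```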
XGBoost
xgb_model.loc[30:,['test-rmse-mean', 'train-rmse-mean']].plot();
xgb_cv.fit(X_train, y_train)
Keras neural network
model.add(Dense(1, kernel_initializer='normal'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='Adam')
model.summary()
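Only the output layer and the compile step appear above. For context, a complete small regression network could look like the sketch below. The hidden layer sizes, activations, epoch count, and the synthetic data are assumptions and are not taken from the article; the import path assumes Keras as bundled with TensorFlow 2.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

# Synthetic features and target standing in for the rental data.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(200, 6)).astype("float32")
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1]).astype("float32")

model = Sequential()
model.add(Input(shape=(X_demo.shape[1],)))
model.add(Dense(32, kernel_initializer="random_normal", activation="relu"))  # assumed size
model.add(Dense(16, kernel_initializer="random_normal", activation="relu"))  # assumed size
model.add(Dense(1, kernel_initializer="random_normal"))  # single linear output for regression

# Compile model
model.compile(loss="mean_squared_error", optimizer="adam")

# A few epochs are enough to show the workflow; real training would run longer.
history = model.fit(X_demo, y_demo, epochs=5, batch_size=32, verbose=0)
predictions = model.predict(X_demo, verbose=0)
```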
K-means clustering of the data
kmeanModel = KMeans(n_clusters=k).fit(X)
inertias.append(kmeanModel.inertia_)
plt.plot(K, inertias, 'bx-')
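The lines above are fragments of an elbow-method loop: fit k-means for a range of `k`, record the inertia (within-cluster sum of squares), and look for the point where the curve flattens. A self-contained sketch of the full loop follows. The range of `k`, the `n_init` and `random_state` settings, and the synthetic three-cluster data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs, so the elbow should appear at k=3.
rng = np.random.default_rng(5)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X_demo = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])

K = range(1, 8)  # assumed candidate range
inertias = []
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(kmeanModel.inertia_)

# How much the inertia falls with each additional cluster; the drop shrinks after the elbow.
drops = -np.diff(inertias)
print(list(zip(K, inertias)))
```

Plotting `inertias` against `K`, as in the article's `plt.plot(K, inertias, 'bx-')`, makes the elbow visible as the point after which extra clusters bring little improvement.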
gpd.plot(figsize=(12,10), alpha=0.3)
scatter_map = plt.scatter(data=df_clean, x='lon', y='lat', c='label', alpha=0.3, cmap='tab10', s=2)
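The scatter plot colors each listing by a `label` column, which is presumably the cluster assignment from k-means. How that column was created is not shown. The sketch below assumes the clustering is done on the coordinates alone, using two synthetic groups of points with made-up coordinates, and stores the result in `label`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two synthetic groups of listings around made-up coordinates.
rng = np.random.default_rng(6)
listings = pd.DataFrame({
    "lon": np.concatenate([rng.normal(121.45, 0.01, 100), rng.normal(121.60, 0.01, 100)]),
    "lat": np.concatenate([rng.normal(31.22, 0.01, 100), rng.normal(31.10, 0.01, 100)]),
})

# Cluster on longitude and latitude only (an assumption) and keep the assignment per row.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(listings[["lon", "lat"]])
listings["label"] = kmeans.labels_

print(listings.groupby("label")[["lon", "lat"]].mean())
```

With the `label` column in place, a call such as `plt.scatter(data=listings, x='lon', y='lat', c='label', cmap='tab10', s=2)` colors the points by cluster in the same way as the snippet above.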
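This article is excerpted from "Python ridge regression, Lasso, random forest, XGBoost, Keras neural network, and k-means clustering: geographic visualization analysis of Lianjia rental data".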