Nanjing Second-Hand Housing Price Analysis

    Getting the Data

    Scrape the housing price data and store it in MongoDB.

    The code was written rather casually, so bear with it. (April 11, 2019)

    import time
    import requests
    from bs4 import BeautifulSoup
    
    from pymongo import MongoClient
    
    # Connect to MongoDB with the default host and port
    client = MongoClient()
    collection = client.lianjia.secondHouse
    
    prefix = "https://nj.lianjia.com"
    # URL path of the second-hand listings for each district
    areas = ["/ershoufang/gulou/","/ershoufang/jianye/","/ershoufang/qinhuai/",
     "/ershoufang/xuanwu/","/ershoufang/yuhuatai/","/ershoufang/qixia/","/ershoufang/jiangning/",
     "/ershoufang/pukou/","/ershoufang/liuhe/","/ershoufang/lishui/","/ershoufang/gaochun/",
    ]
    # District names in Chinese, in the same order as `areas`
    areaChineses = ['鼓楼', '建邺', '秦淮', '玄武', '雨花台', '栖霞', '江宁', '浦口', '六合', '溧水', '高淳']
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    city='南京'
    Aindex = 0  # running count of listings inserted so far
    for area, areaCN in zip(areas, areaChineses):
        # Walk pages 1 to 100 of each district
        for index in range(1, 101):
            # The first page has no suffix; later pages are "pg2", "pg3", ...
            page = "" if index == 1 else ("pg" + str(index))
            pageUrl = prefix + area + page
            r = requests.get(pageUrl, headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            houses = soup.find_all('li', class_='clear LOGCLICKDATA')
            try:
                for house in houses:
                    titleLink = house.find('div', 'title').find('a')
                    title = titleLink.text.strip()
                    url = titleLink['href'].strip()
                    # houseInfo fields are separated by '|'
                    basicInfo = house.find('div', 'houseInfo').text.strip().split('|')
                    address = basicInfo[0].strip()
                    houseType = basicInfo[1].strip()
                    size = float(basicInfo[2].strip()[:-2])  # drop the 2-character unit suffix
                    # followInfo fields are separated by '/'
                    followInfo = house.find('div', 'followInfo').text.split("/")
                    followers = int(followInfo[0].split("人")[0].strip())
                    hasSeen = int(followInfo[1].split("次")[0][2:].strip())
                    price = float(house.find('div', 'totalPrice').find("span").text.strip())
                    meanPrice = float(house.find('div', 'unitPrice').find("span").text.split("元")[0][2:].strip())
                    data = dict(
                        city=city,
                        area=areaCN,
                        title=title,
                        url=url,
                        address=address,
                        houseType=houseType,
                        size=size,
                        followers=followers,
                        hasSeen=hasSeen,
                        price=price,
                        meanPrice=meanPrice,
                    )
                    collection.insert_one(data)
                    Aindex += 1
    
            except Exception as e:
                # A parsing or insert error skips the rest of the current page
                print("Failed to insert listing number", Aindex, ":", e)
            finally:
                # Pause between pages to avoid hammering the site
                time.sleep(10)
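
The fixed-offset slicing in the scraper (`[2:]`, `[:-2]`) breaks as soon as the page wording changes by a character. A more tolerant alternative is to pull the first number out of each string with a regular expression. The following is only a sketch: the helper names `first_number` and `parse_follow_info` are hypothetical, and the sample strings are assumptions inferred from the slicing logic, not text captured from the live site.

```python
import re


def first_number(text):
    """Return the first integer or decimal found in `text` as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError("no number found in %r" % text)
    return float(match.group())


def parse_follow_info(text):
    """Split a followInfo string on '/' and return (followers, hasSeen)."""
    parts = text.split("/")
    return int(first_number(parts[0])), int(first_number(parts[1]))


# Assumed sample strings (hypothetical, shaped to match what the scraper slices)
sample_follow = "12人关注 / 共3次带看 / 5天以前发布"
sample_unit = "单价35678元/平米"
sample_size = " 89.5平米 "

followers, has_seen = parse_follow_info(sample_follow)
mean_price = first_number(sample_unit)
size = first_number(sample_size)
```

With this approach a missing number raises a `ValueError` that names the offending string, which is easier to debug than an `int()` failure on a mis-sliced fragment.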
    

    Export the data from MongoDB to a local CSV file

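    from pymongo import MongoClient
    import pandas as pd
    
    # Connect to MongoDB and load every document into a DataFrame
    client = MongoClient()
    collection = client.lianjia.secondHouse
    df = pd.DataFrame(list(collection.find()))
    df.to_csv('lianjia.csv', index=False)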
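
The export writes MongoDB's `_id` field into the CSV, which then has to be deleted after reading the file back. As a rough sketch of an alternative, the helper below drops `_id` while writing. The name `export_documents` and its `drop` parameter are hypothetical, and it uses only the standard library `csv` module so it can be tried without a database.

```python
import csv


def export_documents(docs, path, drop=("_id",)):
    """Write an iterable of dict-like documents to a CSV file.

    Keys listed in `drop` (by default MongoDB's `_id`) are left out.
    Returns the number of rows written.
    """
    rows = [{k: v for k, v in doc.items() if k not in drop} for doc in docs]
    if not rows:
        return 0
    # Sort the union of keys so the column order is deterministic
    fieldnames = sorted({key for row in rows for key in row})
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Any iterable of documents works as `docs`, including a pymongo cursor. With pymongo, the field can also be excluded at query time through a projection: `collection.find({}, {"_id": 0})`.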
    

    Read the data from the local file

    The data is clean: every column is fully populated, with no missing values.

    import pandas as pd
    df = pd.read_csv("./lianjia.csv")
    del df["_id"]
    df.info()  # info() prints the summary itself and returns None
    

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 29300 entries, 0 to 29299
    Data columns (total 11 columns):
    address      29300 non-null object
    area         29300 non-null object
    city         29300 non-null object
    followers    29300 non-null int64
    hasSeen      29300 non-null int64
    houseType    29300 non-null object
    meanPrice    29300 non-null float64
    price        29300 non-null float64
    size         29300 non-null float64
    title        29300 non-null object
    url          29300 non-null object
    dtypes: float64(3), int64(2), object(6)
    memory usage: 2.5+ MB

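    Data Analysis

    Run a basic analysis of the data to find the explicit relationships between fields.

    Layout Analysis

    Popularity Analysis

    Measured by number of listings, the most popular layout is 2室1厅 (2 bedrooms, 1 living room), followed by 3室2厅 and 2室2厅.
    The least popular are 2室0厅, 3室0厅, 4室1厅, and 5室3厅.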

    <iframe width="100%" height="500" frameborder="0" scrolling="no" src="//plot.ly/~Jansora/5.embed"></iframe>
    

    Python code

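    # Imports assumed here (the original snippet omits them); with plotly 4+
    # the online plotting module lives in chart_studio, in 3.x it was plotly.plotly
    import chart_studio.plotly as py
    import plotly.graph_objs as go
    
    # Count the listings for each house type
    x = []
    y = []
    for label, _df in df.groupby(by='houseType'):
        x.append(label)
        y.append(_df.shape[0])
    py.iplot([go.Bar(x=x,y=y)], filename='linajia-houseType')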
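
The bar chart shows the counts but not the ranking itself. A minimal sketch of how the most and least common layouts could be read off directly, using `collections.Counter` on a small hypothetical sample (the values below are made up for illustration and are not taken from the 29,300-row dataset):

```python
from collections import Counter

# Hypothetical sample of houseType values, for illustration only
house_types = ["2室1厅", "3室2厅", "2室1厅", "2室2厅", "3室2厅", "2室1厅", "5室3厅"]

counts = Counter(house_types)
ranked = counts.most_common()  # (layout, count) pairs, highest count first
top = ranked[0]                # most common layout in the sample
rarest = ranked[-1]            # one of the least common layouts
```

On the real DataFrame, `df['houseType'].value_counts()` returns the same kind of ranking in one call, already sorted in descending order.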
    
