Summary: This article walks through how to scrape Taobao product information (name, shop, sales volume, location, etc.) with Python, and how to store the results as a CSV file via an automated script, helping developers acquire e-commerce data efficiently.
In e-commerce data analysis, quickly acquiring product information and storing it in a structured form is a core requirement. Taobao is the largest e-commerce platform in China, and its product data (name, shop, sales volume, location, etc.) is valuable for market research and competitor analysis. This article explains how to scrape Taobao product information automatically with Python and save the data to a CSV file, covering the full workflow from environment setup to data cleaning.
First, install the required third-party libraries:

```bash
pip install requests beautifulsoup4 pandas selenium
```
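Selenium additionally needs a matching browser driver; recent Selenium releases (4.6+) can download ChromeDriver automatically. A quick sanity check that everything installed correctly:

```python
# Verify that all four dependencies import cleanly
import requests
import bs4
import pandas
import selenium

print(requests.__version__, bs4.__version__, pandas.__version__, selenium.__version__)
```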
Before scraping, check the site's robots.txt file and avoid any paths it disallows, and use time.sleep() to space requests out (an interval of 2–5 seconds is recommended) so your IP is less likely to be banned; a sketch of a robots.txt-aware fetch appears after the static-parsing example below. A Taobao product list page typically contains the following elements:
<div class="title">或<a class="J_ClickStat">。<div class="shop">或<a class="shop-name">。<div class="sale-num">或<span class="sold">。<div class="price">和<div class="location">。示例代码(静态页面解析):
```python
import requests
from bs4 import BeautifulSoup

def fetch_taobao_page(keyword):
    """Fetch the search-results HTML for a keyword."""
    url = f"https://s.taobao.com/search?q={keyword}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    else:
        print("Request failed")
        return None

def parse_commodity_info(html):
    """Extract name, shop, sales, and location from each result item."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("div", class_="item J_MouserOnverReq")
    data = []
    for item in items:
        name = item.find("div", class_="title").get_text(strip=True)
        shop = item.find("div", class_="shop").get_text(strip=True)
        sales = item.find("div", class_="sale-num").get_text(strip=True)
        location = item.find("div", class_="location").get_text(strip=True)
        data.append([name, shop, sales, location])
    return data
```
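As promised above, here is a minimal sketch of a robots.txt-aware fetch with a randomized delay. The polite_get helper and the 2–5 second window are illustrative choices, not part of any official API:

```python
import random
import time
from urllib import robotparser

import requests

# Load Taobao's robots.txt once; the actual rules may change over time
rp = robotparser.RobotFileParser()
rp.set_url("https://s.taobao.com/robots.txt")
rp.read()

def polite_get(url, headers, min_delay=2.0, max_delay=5.0):
    """Fetch url only if robots.txt allows it, after a randomized pause."""
    if not rp.can_fetch(headers.get("User-Agent", "*"), url):
        print(f"Blocked by robots.txt: {url}")
        return None
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=headers, timeout=10)
```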
If the page loads its data dynamically via JavaScript, use Selenium to drive a real browser instead:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def fetch_with_selenium(keyword):
    """Render the search page in Chrome and extract each item's fields."""
    driver = webdriver.Chrome()
    url = f"https://s.taobao.com/search?q={keyword}"
    driver.get(url)
    time.sleep(3)  # wait for the page to load
    items = driver.find_elements(By.CSS_SELECTOR, ".item.J_MouserOnverReq")
    data = []
    for item in items:
        name = item.find_element(By.CSS_SELECTOR, ".title").text
        shop = item.find_element(By.CSS_SELECTOR, ".shop").text
        sales = item.find_element(By.CSS_SELECTOR, ".sale-num").text
        location = item.find_element(By.CSS_SELECTOR, ".location").text
        data.append([name, shop, sales, location])
    driver.quit()
    return data
```
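A fixed time.sleep(3) either wastes time or fires too early. As a sketch, Selenium's explicit waits can block until the result items actually appear (fetch_with_explicit_wait is a hypothetical variant of the function above):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_with_explicit_wait(keyword, timeout=10):
    """Wait for result items to be present instead of sleeping a fixed interval."""
    driver = webdriver.Chrome()
    try:
        driver.get(f"https://s.taobao.com/search?q={keyword}")
        # Block until at least one item is in the DOM, or raise TimeoutException
        WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".item.J_MouserOnverReq"))
        )
        return driver.page_source  # can be parsed with BeautifulSoup as above
    finally:
        driver.quit()
```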
```python
import pandas as pd

def save_to_csv(data, filename="taobao_commodities.csv"):
    """Write the scraped rows to a CSV file."""
    df = pd.DataFrame(data, columns=["Product Name", "Shop Name", "Sales", "Location"])
    df.to_csv(filename, index=False, encoding="utf_8_sig")  # utf_8_sig keeps Chinese text readable in Excel
    print(f"Data saved to {filename}")

# Usage example: search for "手机" (mobile phones)
html = fetch_taobao_page("手机")
if html:
    data = parse_commodity_info(html)
    save_to_csv(data)
```
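When scraping page by page, it can also help to append each batch rather than rewrite the whole file. A minimal sketch, assuming the same column layout (append_to_csv is a hypothetical helper):

```python
import os

import pandas as pd

def append_to_csv(data, filename="taobao_commodities.csv"):
    """Append rows to an existing CSV, writing the header only when the file is first created."""
    df = pd.DataFrame(data, columns=["Product Name", "Shop Name", "Sales", "Location"])
    df.to_csv(filename, mode="a", header=not os.path.exists(filename),
              index=False, encoding="utf_8_sig")
```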
Once the raw data is in hand, several cleaning and hardening steps are worth applying (sketches of the cleaning and proxy steps appear below):

- Remove duplicate products with df.drop_duplicates().
- Fill missing fields with df.fillna("未知") ("unknown").
- Normalize sales figures, e.g. converting a "万" (ten-thousand) suffix with int(sales.replace("万", "0000")).
- Use requests.Session() together with proxy IPs (e.g. proxies={"http": "ip:port"}) to reduce the chance of being blocked.
- Speed up crawling with concurrent.futures or aiohttp.
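A minimal cleaning sketch tying these steps together. Note that the literal replace("万", "0000") trick above breaks on decimals like "1.5万", so normalize_sales (a hypothetical helper) multiplies instead:

```python
import re

import pandas as pd

def normalize_sales(sales):
    """Convert sales strings like '3000+' or '1.5万+' to integers (input formats assumed)."""
    s = str(sales).strip().rstrip("+")
    if s.endswith("万"):
        return int(float(s[:-1]) * 10000)  # '1.5万' -> 15000
    digits = re.sub(r"\D", "", s)
    return int(digits) if digits else 0

def clean_dataframe(df):
    """Deduplicate rows, fill gaps, and normalize the Sales column."""
    df = df.drop_duplicates()
    df = df.fillna("未知")  # "unknown"
    df["Sales"] = df["Sales"].map(normalize_sales)
    return df
```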
Finally, a complete script combining pagination, parsing, and saving:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def fetch_taobao_data(keyword, max_pages=3):
    """Scrape up to max_pages of search results (44 items per page)."""
    all_data = []
    for page in range(1, max_pages + 1):
        url = f"https://s.taobao.com/search?q={keyword}&s={(page - 1) * 44}"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            items = soup.find_all("div", class_="item J_MouserOnverReq")
            for item in items:
                try:
                    name = item.find("div", class_="title").get_text(strip=True)
                    shop = item.find("div", class_="shop").get_text(strip=True)
                    sales = item.find("div", class_="sale-num").get_text(strip=True)
                    location = item.find("div", class_="location").get_text(strip=True)
                    all_data.append([name, shop, sales, location])
                except Exception as e:
                    print(f"Parse error: {e}")
            time.sleep(2)  # polite delay between pages
        else:
            print(f"Request for page {page} failed")
    return all_data

def main():
    keyword = input("Enter a search keyword (e.g. 手机): ")
    data = fetch_taobao_data(keyword)
    if data:
        save_to_csv(data, f"{keyword}_commodities.csv")  # reuses save_to_csv defined above
    else:
        print("No data retrieved")

if __name__ == "__main__":
    main()
```
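For the proxy rotation mentioned earlier, here is a sketch built on requests.Session; the PROXIES addresses are placeholders you would replace with a real proxy pool:

```python
import requests

# Placeholder proxy pool -- substitute real proxy addresses
PROXIES = [
    {"http": "http://ip1:port", "https": "http://ip1:port"},
    {"http": "http://ip2:port", "https": "http://ip2:port"},
]

def fetch_via_proxy(url, headers):
    """Try each proxy in turn; return the page text on the first success."""
    session = requests.Session()
    session.headers.update(headers)
    for proxy in PROXIES:
        try:
            resp = session.get(url, proxies=proxy, timeout=10)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            continue  # this proxy failed; try the next one
    return None
```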
Scraping Taobao product information with Python and storing it as CSV enables automated collection and analysis of e-commerce data. Developers should mind both the legality and the stability of the implementation, combining anti-scraping countermeasures with data-cleaning techniques to build an efficient, reliable data pipeline. Future work could explore official APIs (such as the Taobao Open Platform) or distributed crawling frameworks (such as Scrapy) to further improve data-acquisition efficiency.