Finding Datasets
2026-01-06
I want to study the purchasing behavior of users on global e-commerce platforms. Please help me find several relevant public datasets, explain what fields they contain, and return real, accessible links. Also help me design an analysis plan, e.g., popular-product statistics, purchase-frequency distribution, regional-difference analysis, and so on.
Thinking Process
**Defining the Scope of Inquiry**
I've zeroed in on the core task: dissecting global e-commerce user purchasing patterns. My focus now is on pinpointing suitable public datasets. I'm prioritizing datasets that provide comprehensive fields – specifically, I'm making notes of the types of data that would be most insightful, such as transaction amounts, timestamps, and user demographics.
**Refining Data Requirements**
I'm now prioritizing datasets that provide detailed customer demographics alongside the purchase data. This would allow for segmentation and comparisons across different user groups. Thinking of adding fields to track items added to baskets before purchase, and I'm currently looking for data about shipping and delivery times.
**Gathering Dataset Information**
I'm deep in the process of confirming the availability and specific fields of potential datasets. I've successfully found the UCI "Online Retail" dataset and the Olist Brazilian e-commerce dataset, both of which are readily accessible. I'm now also scrutinizing the "eCommerce Events History in Cosmetics Shop" dataset on Kaggle, confirming its suitability for user funnel analysis.
**Verifying Dataset Properties**
I'm presently examining the data schemas of the identified datasets to formulate a practical plan for analysis. I am reviewing the structure of the "eCommerce Events History in Cosmetics Shop" dataset, focusing on events related to user interactions. I'm also going through the Olist dataset to map out the connections between tables.
**Confirming Data Structures**
I'm now diving into the detailed structure of each dataset. I'm taking notes on how tables are linked in the Olist dataset. Also, examining fields like `event_type` and `user_session` in the "eCommerce Events History in Cosmetics Shop" dataset to strategize a user funnel analysis.
**Defining Dataset Selection**
I've homed in on three promising datasets for e-commerce analysis. "Online Retail II" from UCI and the Brazilian E-Commerce dataset on Kaggle seem like strong starting points, offering transactional data ideal for initial investigations. My focus is now on confirming their data quality and suitability for RFM and basket analysis.
**Confirming Data Structures**
Now I'm digging into the detailed schemas. I've got the data types and key fields outlined for each dataset. My primary focus is on cross-referencing this with the intended analytical techniques, ensuring alignment for a complete analysis. The plan is to present dataset structures clearly and build a Python implementation.
**Refining Analytical Scope**
I'm now fully immersed in constructing the response. The focus is to precisely define the analytical methods for each dataset, ensuring they align with the business questions posed. I've sketched a structure for clear presentation, covering data details, analysis schemes (RFM, Market Basket, etc.), and Python implementation. The geographical scopes are accounted for, and I will be working on phrasing and syntax.
**Outlining the Response Strategy**
I've formulated a solid structure for the response. It will begin with an introduction that acknowledges the request's value. Then, it will detail the three chosen datasets with their direct links. A comprehensive analytical framework, grouped by "Business Question" and "Analytical Method," will follow. I'll conclude with a Python RFM snippet, as requested. The geographical scope nuances are considered, and I will be sure to comply with all constraints, avoiding requests for permission and providing actionable information.
Studying global e-commerce user behavior is a very worthwhile topic. To fully cover the analytical needs of "transactional", "relational", and "behavioral" data, I have selected three classic, fully public datasets for you.
Below are the dataset details, field descriptions, and the corresponding in-depth analysis plans.
### I. Selected Public Datasets (with Real Links)
These three datasets represent three different dimensions of e-commerce analysis: **pure order transaction data**, **complex end-to-end business-process data**, and **user click-stream behavior data**.
#### 1. UCI Online Retail II (the classic transaction dataset)
This is a transnational dataset containing all transactions of a UK-based online retailer between 2009 and 2011. Although most customers are in the UK, it includes customers from many countries, making it well suited to cross-country purchasing-difference analysis.
* **Source**: UCI Machine Learning Repository
* **Key fields**:
    * `InvoiceNo`: invoice number (a leading 'C' indicates a cancelled order)
    * `StockCode`: product code
    * `Description`: product description
    * `Quantity`: quantity
    * `InvoiceDate`: invoice date and time
    * `UnitPrice`: unit price
    * `CustomerID`: customer ID
    * `Country`: customer's country
* **Typical use cases**: RFM analysis, market-basket analysis, customer retention.
* **Access link**: [UCI Online Retail II Dataset](https://archive.ics.uci.edu/dataset/502/online+retail+ii)
#### 2. Brazilian E-Commerce Public Dataset by Olist (the most complete end-to-end business data)
This is real (anonymized) data released by Olist, the largest department store in Brazilian marketplaces. It is not a single table but multiple linked tables covering the full pipeline from order creation through payment and logistics to reviews.
* **Source**: Kaggle
* **Key fields (spread across multiple CSV files)**:
    * `order_id`: order ID
    * `customer_id`: customer ID
    * `order_status`: order status (delivered, shipped, canceled, etc.)
    * `payment_type`: payment method (credit card, boleto, voucher)
    * `product_category_name`: product category
    * `review_score`: customer rating (1-5)
    * `customer_city/state`: customer location
    * `geolocation`: latitude/longitude data
* **Typical use cases**: logistics-efficiency analysis, geographic heat maps, payment-preference analysis, NLP sentiment analysis of reviews.
* **Access link**: [Kaggle Olist Dataset](https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce)
#### 3. eCommerce Behavior Data from Multi Category Store (massive click-stream data)
This is the dataset closest to true "big data" scale, with tens of millions of records. It captures fine-grained user behavior — **views**, **add-to-cart events (cart)**, and **purchases** — rather than just the final orders.
* **Source**: Kaggle (Rees46 Marketing Platform)
* **Key fields**:
    * `event_time`: event timestamp
    * `event_type`: event type (view, cart, remove_from_cart, purchase)
    * `product_id`: product ID
    * `category_code`: hierarchical category (e.g. electronics.smartphone)
    * `brand`: brand
    * `price`: price
    * `user_id`: user ID
    * `user_session`: session ID (used to trace a single visit's path)
* **Typical use cases**: conversion-funnel analysis, recommender-system validation, churn prediction.
* **Access link**: [Kaggle eCommerce Behavior Data](https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store)
---
### II. Analysis Plan for Global E-Commerce User Behavior
Based on the datasets above, here is an analysis framework that runs from basic to advanced. Python (`pandas`, `matplotlib`, `seaborn`) or BI tools (Tableau, Power BI) are recommended for implementation.
#### Plan A: Customer Value Segmentation (RFM Model)
This is the core of e-commerce analytics: score each customer on **Recency (time since last purchase)**, **Frequency (number of purchases)**, and **Monetary (amount spent)**.
* **Segmentation logic**:
    1. **High-value customers**: low R (bought recently), high F, high M -> prioritize retention with VIP service.
    2. **Customers to maintain**: low R, low F, high M -> strong spending power but infrequent; push new products to stimulate purchases.
    3. **At-risk customers**: high R (long since last purchase), high F, high M -> formerly loyal customers; send win-back coupons.
* **Applicable datasets**: UCI Online Retail II, Olist
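The R/F/M computation above can be sketched in a few lines of `pandas`. This is a minimal example on invented toy data shaped like the UCI Online Retail II schema (the column names follow that dataset; the values are made up):

```python
import pandas as pd

# Toy orders in the shape of UCI Online Retail II (values are invented)
orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceNo": ["A1", "A2", "A3", "B1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-01", "2011-11-15", "2011-10-01",
        "2011-06-01", "2011-12-05", "2011-12-08"]),
    "TotalPrice": [100.0, 50.0, 30.0, 500.0, 20.0, 25.0],
})

# Reference date: one day after the last transaction in the data
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)

# One row per customer: days since last order, distinct orders, total spend
rfm = orders.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
print(rfm)
```

From here, each column can be binned into scores (e.g. quintiles via `pd.qcut`) and the score triples mapped to the segments listed above.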
#### Plan B: Product Association / Market Basket Analysis
Analyze "bought together" behavior patterns to surface latent demand.
* **Core algorithms**: Apriori or FP-Growth.
* **Key metrics**:
    * **Support**: the probability that products A and B appear in the same basket.
    * **Confidence**: of the customers who bought A, the fraction who also bought B.
    * **Lift**: whether buying A raises the probability of buying B (Lift > 1 indicates positive association).
* **Business application**: e.g., if you discover that "people who buy a coffee machine usually also buy descaler", bundle descaler as a recommendation on the coffee-machine page.
* **Applicable datasets**: UCI Online Retail II, eCommerce Behavior Data
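The three metrics can be computed directly, without an association-rules library, for a single A -> B rule. A pure-Python sketch on invented baskets (for real data you would first group the UCI rows by `InvoiceNo` to form one item set per invoice):

```python
from itertools import combinations
from collections import Counter

# Invented baskets: each set is one invoice's items
baskets = [
    {"coffee machine", "descaler"},
    {"coffee machine", "descaler", "mug"},
    {"coffee machine", "mug"},
    {"descaler"},
    {"mug"},
]
n = len(baskets)

# Count how often each item and each item pair occurs across baskets
item_count = Counter()
pair_count = Counter()
for b in baskets:
    item_count.update(b)
    pair_count.update(combinations(sorted(b), 2))

def rule_metrics(a, b):
    """Support, confidence, and lift for the rule a -> b."""
    pair = tuple(sorted((a, b)))
    support = pair_count[pair] / n
    confidence = pair_count[pair] / item_count[a]
    lift = confidence / (item_count[b] / n)
    return support, confidence, lift

s, c, l = rule_metrics("coffee machine", "descaler")
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

For full Apriori/FP-Growth mining over many items, a library such as `mlxtend` is the usual choice; the hand-rolled version above is only meant to make the three definitions concrete.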
#### Plan C: Conversion Funnel and Drop-off Analysis
Analyze the conversion rate at each step from browsing to payment.
* **Steps**:
    1. Count total views (View).
    2. Count add-to-cart events (Cart) -> **add-to-cart rate (Cart/View)**.
    3. Count purchases (Purchase) -> **purchase conversion rate (Purchase/Cart)**.
* **Insight**: a high add-to-cart rate combined with a low final purchase rate may indicate a cumbersome checkout, high shipping fees, or a final price that exceeds expectations.
* **Applicable dataset**: eCommerce Behavior Data
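The three funnel steps reduce to counting event types. A minimal sketch on invented events, using the column names of the Kaggle behavior dataset:

```python
import pandas as pd

# Invented click-stream events shaped like the Kaggle behavior dataset
events = pd.DataFrame({
    "user_session": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "event_type": ["view", "cart", "purchase", "view", "cart", "view"],
})

counts = events["event_type"].value_counts()
views = int(counts.get("view", 0))
carts = int(counts.get("cart", 0))
purchases = int(counts.get("purchase", 0))

cart_rate = carts / views          # step 2: Cart / View
purchase_rate = purchases / carts  # step 3: Purchase / Cart
print(f"views={views} carts={carts} purchases={purchases}")
print(f"add-to-cart rate={cart_rate:.0%}, purchase conversion={purchase_rate:.0%}")
```

On the real dataset you would typically deduplicate by `user_session` first, so the funnel counts sessions rather than raw events.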
#### Plan D: Geographic and Temporal Preference Differences
* **Time analysis**: find the peak ordering hours within a day (hourly) and the peak days within a week (weekly).
    * *Insight*: many weekday-lunchtime orders suggest office workers shopping on the clock; weekend-evening orders suggest leisure shopping.
* **Regional analysis**: use Olist's latitude/longitude data or UCI's Country field.
    * *Insight*: tolerance for delivery time by region (Olist includes estimated vs. actual delivery dates), and differences in average order value across countries.
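The hourly/weekly breakdown is a one-line `groupby` once the timestamps are parsed. A sketch on invented UCI-style timestamps:

```python
import pandas as pd

# Invented order timestamps (UCI-style column names)
orders = pd.DataFrame({
    "InvoiceNo": ["A", "B", "C", "D", "E"],
    "InvoiceDate": pd.to_datetime([
        "2011-12-05 12:10", "2011-12-05 12:45", "2011-12-06 12:30",
        "2011-12-06 20:15", "2011-12-10 20:40"]),
})

# Extract hour-of-day and weekday from the timestamp
orders["hour"] = orders["InvoiceDate"].dt.hour
orders["weekday"] = orders["InvoiceDate"].dt.day_name()

# Distinct orders per hour of day; idxmax gives the peak hour
by_hour = orders.groupby("hour")["InvoiceNo"].nunique()
print(by_hour)
print("peak hour:", by_hour.idxmax())
```

The same pattern with `orders["weekday"]` as the grouping key gives the weekday profile, and with `Country` (or Olist's `customer_state`) it gives the regional breakdown.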
---
### III. Implementation Example (Python)
To get you started quickly, here is a Python framework based on the UCI Online Retail dataset that cleans the data and computes **average order value (AOV)** and the **repeat-purchase rate**.
```python
import pandas as pd
# Assumes the data is already loaded
# df = pd.read_excel('Online Retail.xlsx')

# 1. Data cleaning
# Drop cancelled orders (InvoiceNo contains 'C'); .copy() avoids SettingWithCopyWarning
df_clean = df[~df['InvoiceNo'].astype(str).str.contains('C')].copy()
# Drop rows with Quantity or UnitPrice <= 0
df_clean = df_clean[(df_clean['Quantity'] > 0) & (df_clean['UnitPrice'] > 0)]
# Line total per row
df_clean['TotalPrice'] = df_clean['Quantity'] * df_clean['UnitPrice']

# 2. Best-selling products (Top 10 by quantity)
top_products = df_clean.groupby('Description')['Quantity'].sum().sort_values(ascending=False).head(10)
print("--- Top 10 products by quantity sold ---")
print(top_products)

# 3. Regional differences (AOV by country)
# AOV = total revenue / number of orders
country_stats = df_clean.groupby('Country').agg({
    'TotalPrice': 'sum',
    'InvoiceNo': 'nunique'
})
country_stats['AOV'] = country_stats['TotalPrice'] / country_stats['InvoiceNo']
print("\n--- AOV by country (top 5) ---")
print(country_stats.sort_values('AOV', ascending=False).head(5))

# 4. Simple repeat-purchase rate
# Distinct orders per customer
customer_counts = df_clean.groupby('CustomerID')['InvoiceNo'].nunique()
# Share of customers with more than one order
repeat_rate = (customer_counts > 1).sum() / customer_counts.count()
print(f"\n--- Repeat-purchase rate: {repeat_rate:.2%} ---")
```
These resources and plans are enough for an in-depth study of e-commerce behavior. Start with the **UCI dataset**: it is the most lightweight and requires no complex table joins.