bigquery-bigframes
BigFrames Development Standards
- Avoid `.to_pandas()`: You MUST NOT use `.to_pandas()` to download the entire dataset into memory, as this downloads all data to the client's memory, bypassing BigQuery's distributed computation and risking Out of Memory (OOM) errors. There are some exceptions:
  - An error message explicitly requests you to use `to_pandas()`.
  - You are going to visualize the data, and the visualization library does not accept BigFrames DataFrame/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`.
- Avoid `read_gbq()` for SQL: Do not write SQL queries and execute them with `read_gbq()`, so that you maintain the pandas-like DataFrame abstraction and allow lazy execution. Use BigFrames DataFrame/Series methods instead.
- Use BigFrames ML package for Machine Learning Tasks: Do not use Scikit-learn or other ML libraries with BigFrames DataFrames, because standard Scikit-learn models require bringing data into local client memory, whereas `bigframes.ml` delegates training directly to BigQuery's scalable ML engine. Import your tools/classes from `bigframes.ml`.
- Stay in the Cloud: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
- Accessors over UDFs/Lambdas:
  - Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
  - Do not use lambdas with `Series.map()` or `DataFrame.apply()`.
- Schema Verification: Do not assume the schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()`.
- Visualization: A BigFrames DataFrame mostly works directly with Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't work, try using the `plot` accessor. If that didn't work either, you MUST sample or aggregate your data to make it small enough before calling `.to_pandas()`.
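The standards above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `download_small`, `summarize_penguins`, and the `max_rows` threshold are hypothetical names introduced here, and the public `penguins` table and its column names are assumptions. The second function needs the `bigframes` package and BigQuery credentials, so it is defined but not called.

```python
def download_small(frame, max_rows=10_000):
    """Download a BigFrames object only after confirming it is small.

    `frame` is expected to expose `.shape` and `.to_pandas()`.
    `max_rows` is an arbitrary illustrative threshold, not a BigFrames setting.
    """
    n_rows = frame.shape[0]
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows is too many to download; sample or aggregate first"
        )
    return frame.to_pandas()


def summarize_penguins():
    """Illustrative only: requires bigframes and BigQuery credentials."""
    import bigframes.pandas as bpd

    # Load by table ID rather than by SQL string (table name is an assumption).
    df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

    # Verify the schema instead of assuming it.
    print(df.dtypes)
    print(df.peek(5))

    # Built-in string accessor instead of a lambda passed to Series.map().
    df["species"] = df["species"].str.lower()

    # Aggregate in BigQuery so that only one row per species is downloaded.
    summary = df.groupby("species")["body_mass_g"].mean()
    return download_small(summary)
```

The guard in `download_small` keeps the `.to_pandas()` call behind an explicit size check, so a forgotten aggregation raises an error instead of pulling the whole table into client memory.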
Model Development
- Unlike Scikit-learn: BigFrames' `predict()` method always returns a DataFrame containing both predictions and features (not just a series of predictions).
- No `random_state`: Do not pass a `random_state` argument when instantiating BigFrames ML models, because this parameter is not supported in the BigFrames ML package.
- Automatic Scaling: Do not use `OneHotEncoder` or `StandardScaler` unless explicitly requested (handled automatically).
- Hyperparameter Tuning: You must write custom loops (BigFrames lacks `GridSearchCV` and `RandomizedSearchCV`).
- ARIMA Plus (Forecasting):
  - Import from `bigframes.ml.forecasting`.
  - Sort data chronologically and split around a timepoint before training.
  - Prediction horizon must be less than or equal to training horizon.
- PCA: BigFrames' PCA class lacks a simple `transform()` method. Use `predict()` instead.
- Model Persistence: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.
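Since the list above calls for custom tuning loops, one possible shape for such a loop is sketched below. `grid_search` and `tune_and_persist` are hypothetical helpers introduced for illustration. The choice of `LinearRegression`, its `l2_reg` parameter, and the `r2_score` metric column are assumptions about a typical regression setup rather than anything the guidelines prescribe. `tune_and_persist` needs `bigframes` and BigQuery credentials, so it is defined but not called.

```python
from itertools import product


def grid_search(make_model, param_grid, fit_and_score):
    """Try every parameter combination and return (best_params, best_score).

    `make_model` builds a model from keyword arguments, and `fit_and_score`
    trains it and returns a number where higher is better.
    """
    best = None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        score = fit_and_score(make_model(**params))
        if best is None or score > best[1]:
            best = (params, score)
    return best


def tune_and_persist(X_train, y_train, X_test, y_test, model_name):
    """Illustrative only: requires bigframes and BigQuery credentials."""
    import bigframes.pandas as bpd
    from bigframes.ml.linear_model import LinearRegression

    def fit_and_score(model):
        model.fit(X_train, y_train)
        metrics = model.score(X_test, y_test)
        # The metrics frame has a single row, so downloading it is cheap.
        return float(metrics["r2_score"].to_pandas().iloc[0])

    # No random_state and no manual scaling or encoding, per the list above.
    best_params, _ = grid_search(
        LinearRegression, {"l2_reg": [0.0, 0.1, 1.0]}, fit_and_score
    )
    best_model = LinearRegression(**best_params)
    best_model.fit(X_train, y_train)

    # predict() returns a DataFrame holding the features and the predictions.
    predictions = best_model.predict(X_test)

    # Persist to BigQuery, then load the saved model back by name.
    best_model.to_gbq(model_name, replace=True)
    reloaded = bpd.read_gbq_model(model_name)
    return reloaded, predictions
```

Keeping the search loop separate from the BigFrames calls means the same `grid_search` helper can be reused with any `bigframes.ml` estimator by swapping the factory and the scoring function.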