bigquery-bigframes

BigFrames Development Standards

Avoid `.to_pandas()`: You MUST NOT use `.to_pandas()` to download the entire dataset, because this pulls all data into the client's memory, bypasses BigQuery's distributed computation, and risks Out of Memory (OOM) errors. There are two exceptions:
- An error message explicitly requests that you use `to_pandas()`.
- You are going to visualize the data, and the visualization library does not accept BigFrames DataFrame/Series instances. In this case, reduce the amount of data you are going to download before calling `.to_pandas()`.
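
The second exception can be made harder to misuse with a small guard. The following is a minimal sketch, not part of the original guidance: `to_pandas_if_small` and its `max_rows` default are hypothetical names and values, and the helper only assumes the frame exposes `.shape` and `.to_pandas()`. On a BigFrames DataFrame, reading `.shape` may itself run a row-count query.

```python
def to_pandas_if_small(df, max_rows=10_000):
    """Download `df` to local memory only if it has at most `max_rows` rows.

    `df` is expected to be a BigFrames DataFrame that has already been
    sampled or aggregated. The 10_000 default is an arbitrary illustration.
    """
    n_rows = df.shape[0]
    if n_rows > max_rows:
        raise ValueError(
            f"Refusing to download {n_rows} rows (limit {max_rows}); "
            "aggregate or sample the data first."
        )
    return df.to_pandas()


# Hypothetical usage: aggregate in BigQuery first, then download the summary.
# summary = df.groupby("region")[["sales"]].sum()
# local = to_pandas_if_small(summary, max_rows=500)
```
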
Avoid `read_gbq()` for SQL: Do not write SQL queries and execute them with `read_gbq()`. Use BigFrames DataFrame/Series methods instead; this maintains the pandas-like DataFrame abstraction and allows lazy execution.
Use BigFrames ML package for Machine Learning Tasks: Do not use Scikit-learn or other ML libraries with BigFrames DataFrames, because standard Scikit-learn models require bringing data into local client memory, whereas `bigframes.ml` delegates training directly to BigQuery's scalable ML engine. Import your tools/classes from `bigframes.ml`.
Stay in the Cloud: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
Accessors over UDFs/Lambdas:
- Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs.
- Do not use lambdas with `Series.map()` or `DataFrame.apply()`.
Schema Verification: Do not assume the schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()` to inspect sample rows.
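
One way to make the `.dtypes` check explicit is to compare it against the schema you expect. This is a minimal sketch under stated assumptions: `check_schema` is a hypothetical helper, it relies only on `df.dtypes` supporting `.items()` (true of a pandas Series of dtypes, which is what a pandas-like frame returns), and the dtype strings in the usage comment are illustrative rather than taken from a real table.

```python
def check_schema(df, expected):
    """Return a list of mismatches between `df.dtypes` and `expected`.

    `expected` maps column names to dtype names as strings.
    An empty list means every expected column is present with that dtype.
    """
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    problems = []
    for col, want in expected.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != want:
            problems.append(f"{col}: expected {want}, got {actual[col]}")
    return problems


# Hypothetical usage after loading or transforming a frame:
# problems = check_schema(df, {"user_id": "Int64", "amount": "Float64"})
# display(df.peek(5))  # then eyeball a few rows
```
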
Visualization: BigFrames DataFrames mostly work directly with Matplotlib, Seaborn, and other plotting libraries. If a direct call does not work, try the `plot` accessor. If that does not work either, you MUST sample or aggregate your data to make it small enough before calling `to_pandas()`.

Model Development

Unlike Scikit-learn: BigFrames' `predict()` method always returns a DataFrame containing both predictions and features (not just a series of predictions).
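
Because `predict()` returns a whole DataFrame, code ported from Scikit-learn usually needs one extra step to pull out the prediction column. The sketch below is illustrative only: `prediction_column` is a hypothetical helper, and the `predicted_<label>` column name follows the BigQuery ML naming convention, which is an assumption here. Inspect the returned columns before relying on it.

```python
def prediction_column(model, features, label):
    """Run `model.predict` and return only the prediction column.

    Assumes the output frame names its prediction column
    `predicted_<label>` (an assumption; verify against the actual output).
    """
    scored = model.predict(features)
    return scored[f"predicted_{label}"]


# Hypothetical usage with a fitted bigframes.ml model and label "price":
# y_pred = prediction_column(model, X_test, "price")
```
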
No `random_state`: Do not pass a `random_state` argument when instantiating BigFrames ML models, because this parameter is not supported in the BigFrames ML package.
Automatic Scaling: Do not use `OneHotEncoder` or `StandardScaler` unless explicitly requested; encoding and scaling are handled automatically.
Hyperparameter Tuning: You must write custom loops (BigFrames lacks `GridSearchCV` or `RandomizedSearchCV`).
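
A custom tuning loop can be as simple as an exhaustive search over a parameter grid. The following is a minimal sketch, not a BigFrames API: `grid_search`, `make_model`, and `fit_and_score` are hypothetical names. The caller supplies how a model is built and how it is fitted and scored, so the loop itself stays independent of any particular estimator.

```python
from itertools import product


def grid_search(make_model, param_grid, fit_and_score):
    """Try every parameter combination and return the best one.

    make_model:    callable accepting the parameters as keyword arguments.
    param_grid:    dict mapping parameter name -> list of candidate values.
    fit_and_score: callable that fits the model and returns a number,
                   where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(make_model(**params))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With BigFrames, `make_model` would typically be an estimator class imported from `bigframes.ml`, and `fit_and_score` would fit on the training split and reduce the evaluation output to a single number. Each combination trains a separate model in BigQuery, so keep the grid small.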
ARIMA Plus (Forecasting):
- Import from `bigframes.ml.forecasting`.
- Sort data chronologically and split around a timepoint before training.
- Prediction horizon must be less than or equal to training horizon.
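
The split and horizon rules above can be sketched with two small helpers. Both names (`split_at_timepoint`, `check_horizon`) are hypothetical, and the split relies only on pandas-style `sort_values` and boolean indexing, which is assumed to be available on the frame passed in.

```python
def split_at_timepoint(df, time_col, cutoff):
    """Sort chronologically, then split into rows before `cutoff`
    (training) and rows at or after `cutoff` (evaluation)."""
    ordered = df.sort_values(time_col)
    train = ordered[ordered[time_col] < cutoff]
    test = ordered[ordered[time_col] >= cutoff]
    return train, test


def check_horizon(prediction_horizon, training_horizon):
    """Reject a prediction horizon longer than the training horizon."""
    if prediction_horizon > training_horizon:
        raise ValueError(
            f"prediction horizon {prediction_horizon} exceeds "
            f"training horizon {training_horizon}"
        )
    return prediction_horizon
```

Unlike a random train/test split, a time-based split keeps every evaluation row later than every training row, so the model is never evaluated on points it could only have known from the future.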
PCA: BigFrames' PCA class lacks a simple `transform()` method. Use `predict()` instead.
Model Persistence: To persist a model, use `model.to_gbq()`. To load a persisted model, use `bpd.read_gbq_model()`.
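
A save-then-reload round trip might look like the sketch below. `persist_and_reload` and its `read_model` parameter are hypothetical and exist only so the flow can be exercised without a BigQuery connection. The `replace=True` argument to `to_gbq()` is an assumption about its signature, so check the BigFrames reference before relying on it.

```python
def persist_and_reload(model, model_id, read_model=None):
    """Save `model` to BigQuery under `model_id`, then load it back.

    `model_id` is a fully qualified name such as "project.dataset.model".
    `read_model` defaults to `bpd.read_gbq_model`; it is injectable so the
    function can be tried out with stand-ins.
    """
    if read_model is None:
        # Imported lazily so the module loads without BigFrames installed.
        import bigframes.pandas as bpd
        read_model = bpd.read_gbq_model
    # Assumption: `replace=True` overwrites an existing model of that name.
    model.to_gbq(model_id, replace=True)
    return read_model(model_id)
```
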
