bigquery-bigframes

BigFrames Development Standards

  • Avoid .to_pandas(): You MUST NOT use .to_pandas() to download an entire dataset, because doing so pulls all the data into the client's memory, bypasses BigQuery's distributed computation, and risks Out of Memory (OOM) errors. There are two exceptions:
    • An error message explicitly asks you to use to_pandas().
    • You are going to visualize the data and the visualization library does not accept BigFrames DataFrame/Series instances. In this case, reduce the amount of data you are going to download before calling .to_pandas().
  • Avoid read_gbq() for SQL: Do not write SQL queries and execute them with read_gbq(). Use BigFrames DataFrame/Series methods instead; this preserves the pandas-like DataFrame abstraction and allows lazy execution.
  • Use the BigFrames ML package for machine learning tasks: Do not use scikit-learn or other ML libraries with BigFrames DataFrames. Standard scikit-learn models require bringing the data into local client memory, whereas bigframes.ml delegates training directly to BigQuery's scalable ML engine. Import your tools and classes from bigframes.ml.
  • Stay in the Cloud: Perform data cleaning, transformation, and analysis via BigFrames methods to leverage BigQuery's scale.
  • Accessors over UDFs/Lambdas:
    • Prefer built-in accessors (e.g., df.col.str.*, df.col.dt.*) over remote UDFs.
    • Do not use lambdas with Series.map() or DataFrame.apply().
  • Schema Verification: Do not assume the schema of intermediate outputs. Check .dtypes after loading, and inspect a few rows using display() with .head() or .peek().
  • Visualization: BigFrames DataFrames mostly work directly with Matplotlib, Seaborn, and other plotting libraries. If a direct attempt fails, try the plot accessor. If that also fails, you MUST sample or aggregate the data until it is small enough before calling to_pandas().
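
As a rough illustration of the to_pandas() and visualization rules above, the sketch below wraps the download in a size guard. The helper name to_pandas_if_small and the 10,000-row threshold are hypothetical choices and not part of BigFrames; the helper only assumes that the object exposes shape and to_pandas(), as BigFrames DataFrames do.

```python
def to_pandas_if_small(df, max_rows=10_000):
    """Download `df` to local memory only if it has at most `max_rows` rows.

    Hypothetical helper: `df` is assumed to be a BigFrames DataFrame, but any
    object with `.shape` and `.to_pandas()` works. The threshold is arbitrary.
    """
    n_rows = df.shape[0]  # Read the row count before deciding to download.
    if n_rows > max_rows:
        raise ValueError(
            f"{n_rows} rows exceeds the limit of {max_rows}; "
            "sample or aggregate before calling to_pandas()."
        )
    return df.to_pandas()
```

A typical use would be to aggregate or sample first and pass only the reduced result to this helper, so that the plotting library receives a small local frame.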
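
The accessor and schema-verification guidelines can be sketched as two small helpers. This is a minimal sketch under stated assumptions: the column names name and signup_ts and both function names are hypothetical, and the code relies only on the pandas-style assign(), .str, .dt, and .dtypes interfaces that BigFrames mirrors.

```python
def add_clean_columns(df):
    """Derive columns with built-in accessors rather than lambdas or remote UDFs.

    Assumes `df` has a string column "name" and a timestamp column
    "signup_ts"; both column names are hypothetical.
    """
    return df.assign(
        name_clean=df["name"].str.strip().str.lower(),  # .str accessor
        signup_year=df["signup_ts"].dt.year,  # .dt accessor
    )


def missing_columns(df, expected_columns):
    """Return the expected columns that do not appear in `df.dtypes`."""
    present = set(df.dtypes.index)
    return [col for col in expected_columns if col not in present]
```

Calling missing_columns() on an intermediate result before using it is one way to avoid assuming its schema; an empty list means every expected column is present.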

Model Development

  • Unlike scikit-learn: BigFrames' predict() method always returns a DataFrame containing both the predictions and the features, not just a Series of predictions.
  • No random_state: Do not pass a random_state argument when instantiating BigFrames ML models; the BigFrames ML package does not support this parameter.
  • Automatic Scaling: Do not use OneHotEncoder or StandardScaler unless explicitly requested; encoding and scaling are handled automatically.
  • Hyperparameter Tuning: You must write custom loops, because BigFrames lacks GridSearchCV and RandomizedSearchCV.
  • ARIMA Plus (Forecasting):
    • Import from bigframes.ml.forecasting.
    • Sort the data chronologically and split it around a timepoint before training.
    • The prediction horizon must be less than or equal to the training horizon.
  • PCA: BigFrames' PCA class lacks a simple transform() method. Use predict() instead.
  • Model Persistence: To persist a model, use model.to_gbq(). To load a persisted model, use bpd.read_gbq_model().
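
The custom tuning loop mentioned above could look roughly like the following. This is a generic exhaustive grid loop, not a BigFrames API: the names grid_search, param_grid, and fit_and_score are hypothetical, and the caller is assumed to supply a function that builds, fits, and scores a model and returns a single number where higher is better.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Try every parameter combination and return (best_params, best_score).

    `param_grid` maps parameter names to lists of candidate values.
    `fit_and_score` is a caller-supplied function that accepts the parameters
    as keyword arguments and returns a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        if score > best_score:  # Strict comparison keeps the first of any ties.
            best_params, best_score = params, score
    return best_params, best_score
```

With BigFrames, fit_and_score would typically construct a bigframes.ml estimator from the keyword arguments (without random_state), fit it on the training split, and reduce the evaluation result to one number. Each combination triggers its own training run, so small grids are advisable.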
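
The ARIMA Plus and persistence guidelines might be combined as in the sketch below. The helper names, default horizons, and parameter names are hypothetical, and the sketch assumes that "training horizon" corresponds to the horizon argument of the ARIMAPlus constructor. The BigFrames import is deferred so that the two pure helpers can be used without it.

```python
def check_horizon(prediction_horizon, training_horizon):
    """Return `prediction_horizon` if it does not exceed `training_horizon`."""
    if prediction_horizon > training_horizon:
        raise ValueError(
            f"prediction horizon {prediction_horizon} exceeds "
            f"training horizon {training_horizon}"
        )
    return prediction_horizon


def split_at_timepoint(df, time_col, cutoff):
    """Sort chronologically, then split into (train, test) around `cutoff`."""
    ordered = df.sort_values(time_col)
    train = ordered[ordered[time_col] < cutoff]
    test = ordered[ordered[time_col] >= cutoff]
    return train, test


def train_forecast_and_save(df, time_col, value_col, cutoff, model_name,
                            training_horizon=30, prediction_horizon=7):
    """Fit ARIMA Plus on rows before `cutoff`, forecast, and persist the model."""
    # Deferred import: only this function needs BigFrames and BigQuery access.
    from bigframes.ml.forecasting import ARIMAPlus

    horizon = check_horizon(prediction_horizon, training_horizon)
    train, test = split_at_timepoint(df, time_col, cutoff)
    model = ARIMAPlus(horizon=training_horizon)
    model.fit(train[[time_col]], train[[value_col]])
    forecast = model.predict(horizon=horizon)
    # Persist to BigQuery; reload later with bpd.read_gbq_model(model_name).
    model.to_gbq(model_name, replace=True)
    return forecast, test
```

Validating the horizon before training avoids paying for a training job whose forecast request would then be rejected.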