Write, push, run, publish, and manage Kaggle Benchmark tasks using the kaggle CLI and the kaggle-benchmarks Python SDK. Use when the user wants to create or push a benchmark task (optionally with attached Kaggle datasets), run benchmarks against LLM models, check task/run status, stream or fetch execution logs, download results and source notebooks, publish a task to make it public, or troubleshoot benchmark workflows.
npx skill4agent add kaggle/kaggle-skills write-kaggle-benchmarks
kaggle benchmarks (alias: kaggle b)
├── auth — Fetch Model Proxy credentials
├── init — Fetch credentials + setup local dev environment
└── tasks (alias: t) — Manage benchmark tasks
    ├── push — Upload a task from a .py file
    ├── run — Run a task against model(s)
    ├── list — List your benchmark tasks
    ├── status — Show task details and per-model run status
    ├── download — Download completed run outputs (and optionally source notebooks)
    ├── log (logs) — Show execution logs for run(s) (streams live for RUNNING runs)
    ├── publish — Make a task public (publishes the backing notebook by default)
    ├── models — List available benchmark models
    └── delete — Delete a task (not yet supported by server)
# Full setup: credentials + .env + example_task.py + kaggle_benchmarks_reference.md
kaggle b init -y
# Credentials only (refresh MODEL_PROXY_* in .env)
kaggle b auth -y
Flags: --env-file <FILE>, --example-file <FILE>
Environment variables: MODEL_PROXY_URL, MODEL_PROXY_API_KEY, MODEL_PROXY_EXPIRY_TIME, LLM_DEFAULT, LLM_DEFAULT_EVAL, LLMS_AVAILABLE
init, list, init, .env, example_task.py, kaggle_benchmarks_reference.md, MODEL_PROXY_*, python task.py, kaggle b t run
kaggle b init -y # first-time setup
kaggle b auth -y # creds-only refresh (no scaffolding)
kaggle_benchmarks as kbench, @kbench.task(...), .run(kbench.llm), .evaluate(...), # %%
# %%
import kaggle_benchmarks as kbench
# %%
@kbench.task(name="my-test-task")
def my_test_task(llm):
    response = llm.prompt("What is 2 + 2?")
    kbench.assertions.assert_in("4", response, expectation="Should contain 4")

my_test_task.run(kbench.llm)
task.run(llm=kbench.llms["google/gemini-3.5-flash"]), task.run(llm=kbench.llm), LLM_DEFAULT, LLM_DEFAULT, LLMS_AVAILABLE, MODEL_PROXY_*
kaggle b init -y # ensure .env is current
python task.py # run the task directly
ls -1 *.run.json # confirm a run file was produced
Running python task.py should produce a *.run.json file.
kaggle b t push my-task -f task.py --wait
kaggle b t push my-task -f task.py -d owner/dataset1 -d owner/dataset2 # attach datasets
Flags: --wait [TIMEOUT], --poll-interval <SECONDS>, -d (--kaggle-dataset)
# Interactive picker
kaggle b t run my-task
# Specific model
kaggle b t run my-task -m google/gemini-3.5-flash
# Multiple models (repeat -m, do NOT space-separate)
kaggle b t run my-task -m google/gemini-3.5-flash -m anthropic/claude-haiku-4-5
# Wait for completion
kaggle b t run my-task -m google/gemini-3.5-flash --wait
List available benchmark models with kaggle b t models.
kaggle b t status my-task
kaggle b t status my-task -m google/gemini-3.5-flash
Errors: (shown in status output)
kaggle b t download my-task # all terminal runs
kaggle b t download my-task -o ./results # custom directory
kaggle b t download my-task -m google/gemini-3.5-flash
kaggle b t download my-task -s # also fetch source notebooks
kaggle b t download my-task -f # force re-download (overwrite)
Output layout: <output>/<task>/<version>/<model>/<run_id>/...
Flags: --force (-f), --include-source (-s)
Notebook files: __notebook__.ipynb, __notebook_source__.ipynb
kaggle b t log my-task # logs for every run of the task
kaggle b t log my-task -m google/gemini-3.5-flash # filter to one model
kaggle b t log my-task -m model-a -m model-b # multiple models, sequential
Run states: RUNNING, COMPLETED, ERRORED, QUEUED
(No logs available — server returned 404)
kaggle b t publish my-task # publish task + backing notebook (default)
kaggle b t publish my-task --no-publish-backing-notebook # publish task only, keep notebook private
Flag: --no-publish-backing-notebook
# Push → run → download (run one command at a time, confirm between)
kaggle b t push my-task -f task.py --wait
kaggle b t run my-task -m google/gemini-3.5-flash --wait
kaggle b t download my-task -o ./results
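The push, run, download sequence above can be sketched in Python for anyone who wants to script it while still confirming each step. This is only a sketch: the helper names workflow_commands and run_stepwise are hypothetical and are not part of the kaggle CLI or the kaggle-benchmarks SDK. It uses only the commands and flags shown above and assumes the kaggle executable is on PATH.

```python
import subprocess


def workflow_commands(task, source_file, model, out_dir):
    """Build the push, run, and download commands in the order shown above."""
    return [
        ["kaggle", "b", "t", "push", task, "-f", source_file, "--wait"],
        ["kaggle", "b", "t", "run", task, "-m", model, "--wait"],
        ["kaggle", "b", "t", "download", task, "-o", out_dir],
    ]


def run_stepwise(commands, confirm, execute=subprocess.run):
    """Run commands one at a time, asking confirm(cmd) before each one.

    Stops at the first declined command and returns the commands that ran.
    check=True makes subprocess.run raise if a command exits non-zero,
    so a failed push never leads into a run.
    """
    done = []
    for cmd in commands:
        if not confirm(cmd):
            break
        execute(cmd, check=True)
        done.append(cmd)
    return done
```

One way to use it is to pass a confirm callable that prompts on the terminal, for example lambda cmd: input("Run " + " ".join(cmd) + "? [y/N] ").lower() == "y". Keeping execute injectable lets the sequencing be exercised without calling the real CLI.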
# List tasks, filtered
kaggle b t list --name-regex "^math" --status errored
# Debug an errored run: pull logs first, then download source notebook
kaggle b t log my-task -m google/gemini-3.5-flash
kaggle b t download my-task -m google/gemini-3.5-flash -s -f
.run(), .run(), @task, .run.json, task_fn.run(kbench.llm), .evaluate(...)
MODEL_PROXY_API_KEY, python task.py, kaggle b auth -y, kaggle b init -y, init, auth, dotenv
@task, kaggle b t push <SLUG> -f file.py, <SLUG>, @kbench.task(name=...), My Task, my-task, my_task, my-task
@default, google/gemini-3.5-flash@default, @, -, owner/model
delete: Delete is not supported by the server yet.
Repeat -m and -d (--kaggle-dataset) once per value: -m a -m b, not -m a b.
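The repeated-flag rule (-m a -m b rather than -m a b) is easy to get wrong when commands are assembled programmatically. A minimal sketch, assuming the argument list is later handed to something like subprocess.run; the helper names repeat_flag, run_command, and push_command are hypothetical and do not come from the kaggle CLI or SDK.

```python
def repeat_flag(flag, values):
    """Expand values into repeated flag/value pairs, e.g. -m a -m b."""
    args = []
    for value in values:
        args += [flag, value]
    return args


def run_command(task, models):
    """Arguments for `kaggle b t run` with one -m per model."""
    return ["kaggle", "b", "t", "run", task] + repeat_flag("-m", models)


def push_command(task, source_file, datasets=()):
    """Arguments for `kaggle b t push` with one -d per attached dataset."""
    return ["kaggle", "b", "t", "push", task, "-f", source_file] + repeat_flag("-d", datasets)
```

Building the command as a list rather than a single string also avoids shell quoting problems with model and dataset names.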