Generate comprehensive issue reports from HyperPod clusters (EKS and Slurm) by collecting diagnostic logs and configurations for troubleshooting and AWS Support cases. Use when users need to collect diagnostics from HyperPod cluster nodes, generate issue reports for AWS Support, investigate node failures or performance problems, document cluster state, or create diagnostic snapshots. Triggers on requests involving issue reports, diagnostic collection, support case preparation, or cluster troubleshooting that requires gathering logs and system information from multiple nodes.
npx skill4agent add awslabs/agent-plugins hyperpod-issue-report

scripts/hyperpod_issue_report.py

sagemaker:DescribeCluster
sagemaker:ListClusterNodes
ssm:StartSession
s3:PutObject
s3:GetObject
eks:DescribeCluster

s3:GetObject
s3:PutObject

arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123
s3://bucket/prefix
s3://hyperpod-diagnostics-<account-id>-<region>

aws sts get-caller-identity
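
The example cluster ARN and the `s3://hyperpod-diagnostics-<account-id>-<region>` pattern share the same account ID and region fields, so one can be derived from the other. The following is a minimal Python sketch, assuming the ARN always has the six colon-separated fields seen in the example. `parse_cluster_arn` and `default_bucket` are hypothetical helper names used only for illustration; they are not taken from `hyperpod_issue_report.py`.

```python
from typing import NamedTuple


class ClusterArn(NamedTuple):
    region: str
    account_id: str
    cluster_id: str


def parse_cluster_arn(arn: str) -> ClusterArn:
    """Split a SageMaker cluster ARN into region, account ID, and cluster ID.

    Assumes the layout arn:<partition>:sagemaker:<region>:<account>:cluster/<id>.
    """
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"not a SageMaker ARN: {arn!r}")
    resource_type, _, cluster_id = parts[5].partition("/")
    if resource_type != "cluster" or not cluster_id:
        raise ValueError(f"not a cluster ARN: {arn!r}")
    return ClusterArn(region=parts[3], account_id=parts[4], cluster_id=cluster_id)


def default_bucket(account_id: str, region: str) -> str:
    """Fill in the s3://hyperpod-diagnostics-<account-id>-<region> pattern."""
    return f"s3://hyperpod-diagnostics-{account_id}-{region}"


if __name__ == "__main__":
    arn = parse_cluster_arn("arn:aws:sagemaker:us-west-2:123456789012:cluster/abc123")
    print(default_bucket(arn.account_id, arn.region))
```

When only a cluster name is available rather than an ARN, the account ID would instead come from `aws sts get-caller-identity`.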
aws sagemaker describe-cluster --cluster-name <name-or-arn> --region <region>
aws s3 mb s3://<bucket-name> --region <region>

Orchestrator.Eks
which kubectl
aws eks update-kubeconfig --name <eks-cluster-name> --region <region>

uv run scripts/hyperpod_issue_report.py \
  --cluster <cluster-name-or-arn> \
  --region <region> \
  --s3-path s3://<bucket>[/prefix]

--help
--instance-groups
--nodes
--command
--max-workers
--debug

--instance-groups
--nodes

i-*
hyperpod-i-*
ip-*
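
The invocation above can also be assembled programmatically, for instance when an agent fills in the placeholders. This is a rough Python sketch that uses only the three arguments shown in the command (`--cluster`, `--region`, `--s3-path`). It leaves out the optional flags because their value formats are not given here. The function names are hypothetical, and treating `i-*`, `hyperpod-i-*`, and `ip-*` as glob patterns for acceptable node identifiers is an assumption.

```python
import fnmatch
import shlex

# Patterns listed in the text; what each one denotes is not stated there.
NODE_PATTERNS = ("i-*", "hyperpod-i-*", "ip-*")


def build_report_command(cluster: str, region: str, s3_path: str) -> list[str]:
    """Return the argv for the documented `uv run` invocation."""
    prefix = "s3://"
    if not s3_path.startswith(prefix) or len(s3_path) == len(prefix):
        raise ValueError(f"expected s3://<bucket>[/prefix], got {s3_path!r}")
    return [
        "uv", "run", "scripts/hyperpod_issue_report.py",
        "--cluster", cluster,
        "--region", region,
        "--s3-path", s3_path,
    ]


def matches_node_pattern(node: str) -> bool:
    """Check a node identifier against the listed patterns (case-sensitive)."""
    return any(fnmatch.fnmatchcase(node, pattern) for pattern in NODE_PATTERNS)


if __name__ == "__main__":
    argv = build_report_command("my-cluster", "us-west-2", "s3://my-bucket/reports")
    # shlex.join quotes each argument so the line is safe to paste into a shell.
    print(shlex.join(argv))
```

Returning a list rather than a single string lets the result be passed to `subprocess.run` without shell quoting concerns. `my-cluster` and `my-bucket` are placeholder values.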