Page Deprecated
The information on this page may be out of date and no longer accurate.
Slurm Native Profiling
Job resource usage can be determined after the job completes by checking the following sacct columns:
- MaxRSS - peak memory usage.
- TotalCPU - total CPU time used; compare it against Elapsed x AllocCPUS to estimate CPU efficiency.
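As an illustration of the Elapsed x AllocCPUS ≈ TotalCPU check, the sketch below converts Slurm time strings to seconds and reports the ratio as a percentage. Both helper names are invented for this example; in practice you would paste the values from sacct output.

```shell
# Hypothetical helpers for checking CPU efficiency from sacct columns.
to_seconds() {
    # Convert a Slurm time string ([DD-]HH:MM:SS or MM:SS) to seconds.
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4)      print $1*86400 + $2*3600 + $3*60 + $4
        else if (NF == 3) print $1*3600 + $2*60 + $3
        else              print $1*60 + $2
    }'
}

cpu_efficiency() {
    # usage: cpu_efficiency <Elapsed> <AllocCPUS> <TotalCPU>
    local elapsed total
    elapsed=$(to_seconds "$1")
    total=$(to_seconds "$3")
    awk -v e="$elapsed" -v n="$2" -v t="$total" \
        'BEGIN { printf "%.0f%%\n", 100 * t / (e * n) }'
}

# 2 CPU-hours of work on 4 allocated CPUs over 1 elapsed hour:
cpu_efficiency 01:00:00 4 02:00:00   # prints 50%
```

An efficiency well below 100% suggests the job could run with fewer CPUs, or that it spends much of its time waiting rather than computing.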
However, if you want to examine resource usage over the run-time of your job, add the line #SBATCH --profile=task to your script. That will cause profile data to be recorded every 30 seconds throughout the job. For jobs that take much less or much more than a day to run, we recommend adjusting the sampling interval accordingly: for example, when profiling a job of less than an hour it is fine to sample every second by adding #SBATCH --acctg-freq=1, while for a week-long job the rate should be reduced to once every 5 minutes with #SBATCH --acctg-freq=300.
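A minimal batch script using these directives might look like the sketch below; the job name, resource requests, and executable are placeholders, not a recommended configuration.

```shell
#!/bin/bash -e
#SBATCH --job-name=profile_demo   # hypothetical job name
#SBATCH --time=00:30:00           # well under an hour, so frequent sampling is fine
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --profile=task            # record task-level profile data
#SBATCH --acctg-freq=10           # sample every 10 seconds instead of the 30 s default

# Launch the work via srun so the profiling plugin can observe it
# (and, on GPU nodes, record GPU usage as well).
srun ./my_program                 # hypothetical executable
```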
On completion of your job, collate the data into an HDF5 file using sh5util -j <jobid>. This collects the results from the nodes where your job ran and writes them into an HDF5 file named job_<jobid>.h5. You can plot the contents of this file with the command nn_profile_plot job_<jobid>.h5, which will generate a file named job_<jobid>_profile.png.
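The two post-processing steps together look like the following, using the made-up job ID 1234567 as a placeholder:

```shell
# After the job finishes, collate per-node profile data into one HDF5 file.
sh5util -j 1234567                # writes job_1234567.h5 in the current directory

# Plot the recorded traces from the profile.
nn_profile_plot job_1234567.h5    # writes job_1234567_profile.png
```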
Alternatively you could use one of the following scripts.
Any GPU usage will also be recorded in the profile, so long as the process was executed via srun.