Useful links and batch job examples for the latest versions of commonly used software on various computers. See the MIGRATION table at the end, and note that Cedar and Plato have the same setup except for the allocation. Note that the name has changed from Compute Canada to the Alliance (alliancecan), so please use this address for support:

support@tech.alliancecan.ca

Link to the Compute Canada (CCDB) login: https://ccdb.computecanada.ca/security/login

Main storage for the acl-jerzy group is on Cedar; you can access it from your home directory by issuing:

cd /project/rrg-jerzy-ab/jerzygroup

(It is recommended to create a symbolic link: ln -s /project/rrg-jerzy-ab/jerzygroup jerzygroup, or:

ln -s /project/6004094/jerzygroup/ jerzygroup, if the above does not work.)

Please see the information on permissions in the file /jerzygroup/backup_records/permisionsetup

and under ACL on: https://docs.alliancecan.ca/wiki/Sharing_data

Use the quota command to check used space and the number of files.

Do not store the wfc files produced by QE. Either delete them (rm *.wfc*) or exclude them when backing up, for example:

rsync -rv --exclude '*.wfc*' * szpunarb@cedar.computecanada.ca:/home/szpunarb/dirname
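Before running the real transfer, you can preview what would be copied by adding rsync's --dry-run (-n) flag, e.g. with the same placeholder destination as above:

rsync -rvn --exclude '*.wfc*' * szpunarb@cedar.computecanada.ca:/home/szpunarb/dirname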

Rarely used directories can be archived, but please add a README file describing the content.

To "zip" (archive and compress) a directory, the command is:

tar -zcvf directory_name.tar.gz directory_name/

This tells tar to create (c) a gzip-compressed (z) archive file (f) named directory_name.tar.gz from directory_name/, with verbose output (v).
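To unpack such an archive later, the matching extraction command is:

tar -zxvf directory_name.tar.gz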

 

RAC allocation is limited, and the priority of the default allocation may become higher than that of the RAC allocation. If your jobs do not start on the currently used allocation, check your usage level for all accessible allocations and compare them using the commands below (here the default and RAC allocations are compared):

 

sshare -l -A def-szpunarb_cpu -a --format=Account,User,EffectvUsage,LevelFS

Account                   User  EffectvUsage    LevelFS

def-szpunarb_cpu                     0.000239   4.174797

 

sshare -l -A rrg-szpunarb-ad_cpu -a --format=Account,User,EffectvUsage,LevelFS

Account                    User  EffectvUsage    LevelFS

rrg-szpunarb-ad_cpu                  0.000063   1.508949

As you can see, the LevelFS for the default allocation is 4.174797, while it is only 1.508949 for the rrg-* accounting group, because of heavier usage of the rrg-* allocation (a higher LevelFS means a higher scheduling priority).

See more (from Ali):
https://docs.alliancecan.ca/wiki/Frequently_Asked_Questions#Why_are_my_jobs_taking_so_long_to_start.3F

 

Resources:

ComputeCanada:  https://docs.alliancecan.ca/wiki/Technical_documentation

Good training: https://computecanada.github.io/2019-01-21-sfu/

WestGrid web including training materials: https://westgrid.ca

Getting-started videos: http://bit.ly/2sxGO33

Compute Canada YouTube channel: http://bit.ly/2ws0JDC

 

Software / batch job examples on Cedar, Graham, and Beluga (40 cores/node) / Narval (64 cores/node):

QE
- Cedar: QE6.4.1gcc#, pbs-QE6.3-cedar
- Graham: pbs-QE6.3-graham
- Beluga: pbs-QE6.3-beluga
- QE job arrays run with %1 only: sbatch --array=1-108%1 jobname (see pbs-graham-gcc-intelforarrayQE-7.2.1)

QE (pw, ph), verified 2020
- Cedar: jobpw_ph_cedar_QE6.1, jobcomplete_cedar_intel_QE6.2.2, jobcomplete_cedar_gcc_QE6.2.2 (not working); for primitive unit cells: jobcomplete_cedar_serial_cdar_gcc_QE_6.2.2
- Graham: pbs.ph-graham-gccQE-6.4.1.1
- Beluga: jobcomplete_beluga_gcc_QE_6.2.2

EPW
- Cedar: QE6.4.1; run1QE6.3, run1QE2.2; run2QE6.3, run2QE2.2
- Graham: QE6.4.1; run1QE6.3, run1QE2.2; run2QE6.3, run2QE2.2
- Beluga: run-beluga, run2-beluga

ShengBTE (tested with QE6.2.2)
- Cedar: Cedar_BTE complete_current.job, Cedar_BTE array.job
- All clusters: ShengBTEpython3.8
- Beluga: Beluga_BTE complete_current.job

almaBTE
- Cedar: Job-almabte-VCAbuilder-cedar, Job-almabte-Kappa-Tsweep-cedar
- Graham: Job-almabte-VCAbuilder-graham, Job-almabte-kappa-Tsweep-graham

LAMMPS*
- Cedar: pbs-LAMMPS2019-cedar, pbs-LAMMPS-cedar
- Beluga: pbs-lmps-beluga

GULP

WIEN2K
- Cedar: WIEN2k-job-SLURM-scf-CEDAR
- Graham: WIEN2k-job1-SLURM-Graham, job1-GRAHAM-WIEN2k18.2_prev

Phono3py
- Phonopy/phonons, Phono3pyinst_example
- Phonopy_finite temperatures, Phono3pyinst_example, Phonopy_phonons

VASP (licensed)
- VASP_alliancecan.ca

VASP (Jaya)
- Cedar: vasp_cedar_job, job-VASP-cedar
- Graham: job-VASP-graham

Priority/share
- [jaya@cedar5 ~]$ sshare -U

Notes
- Belugamodulesenv.pdf

*Note that this web page is managed on a PC running Windows; therefore, after downloading the above batch job examples to a Unix system, if a file is in DOS format, convert it to Unix format, e.g. with the command: dos2unix batch_job

* The name of the executable may differ from one version to another.

If you take a whole node, you can request all of its memory by using --mem=0 (see the examples prepared by Jaya for Graham).
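A minimal sketch of the relevant SLURM directives for such a whole-node request (the account name is a placeholder; adjust the core count to the cluster):

#SBATCH --account=def-someuser      # placeholder; use your own allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=48        # full Cedar node; 40 cores on Beluga, 64 on Narval
#SBATCH --mem=0                     # request all available memory on the node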

 

# To check the current QE versions, do:

module -r spider 'quantumespresso*'

Versions, with expanded info on the current one below:

        quantumespresso/6.1

        quantumespresso/6.2

        quantumespresso/6.3

        quantumespresso/6.4

module spider quantumespresso/6.4.1

      nixpkgs/16.09  gcc/7.3.0  openmpi/3.1.2 (example in table)

      nixpkgs/16.09  gcc/8.3.0  openmpi/4.0.1

      nixpkgs/16.09  intel/2018.3  openmpi/3.1.2

      nixpkgs/16.09  intel/2019.3  openmpi/4.0.1
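For example, to use the first (GCC) combination listed above, load the toolchain modules and then Quantum ESPRESSO, in the same way as shown for Plato further down:

module load nixpkgs/16.09 gcc/7.3.0 openmpi/3.1.2
module load quantumespresso/6.4.1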

 

To use the GCC build (the module for which EPW works OK), do:

module load gcc/5.4.0 (a more recent Intel build on Beluga works too; see the notes above)
module load quantumespresso/6.3 (version 6.4.1 lists resistivity)

To use the more common Intel build: module load quantumespresso/6.3. Version 6.4 works OK on Beluga with the new Intel build on one node.

For detailed information about a specific "quantumespresso" module (including how to load the modules) use the module's full name.

For example:

     $ module spider quantumespresso/6.4.1

So use

     $ module load quantumespresso/6.4.1

to set up the runtime environment.

 

Example of finding the path to an executable:

which thirdorder_espresso.py

/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx/MPI/intel2016.4/openmpi2.1/shengbte/1.1.1/bin/thirdorder_espresso.py

 

Fix for an error in diagonalization (from Olivier):

1)     Davidson (the default) uses serial diagonalization, as checked on Beluga, with this change in the batch job:

srun pw.x -ndiag 1 < DISP.un_sc.in > DISP.un_sc.in.out

2)     Switch to the conjugate-gradient method and play with the mixing (in the &electrons namelist of the input file):

  diagonalization = 'cg'
  mixing_ndim = 8 (the default; it could be lowered, e.g. to 4)
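A minimal sketch of the corresponding &ELECTRONS namelist (the conv_thr value is illustrative only):

&ELECTRONS
   diagonalization = 'cg'     ! conjugate gradient instead of the default Davidson
   mixing_ndim     = 4        ! lowered from the default of 8
   conv_thr        = 1.0d-8   ! illustrative convergence threshold
/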

 

The most recent LAMMPS (August 2019 release) is installed on Cedar, Graham, and Beluga.

Unfortunately, the package LATTE did not work. I will have to find a way to install it separately.

To load the module, use:
module load nixpkgs/16.09  intel/2018.3  openmpi/3.1.2 lammps-omp/20190807

 

As for the scripts, the same ones from the documentation should work:
https://docs.computecanada.ca/wiki/LAMMPS

Note that the name of the executable is: lmp_icc_openmpi
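A minimal batch-job sketch for this module, along the lines of the scripts in the documentation above (account, core count, walltime, and input file name are placeholders):

#!/bin/bash
#SBATCH --account=def-someuser        # placeholder allocation
#SBATCH --ntasks=32
#SBATCH --mem-per-cpu=2G
#SBATCH --time=0-12:00
module load nixpkgs/16.09 intel/2018.3 openmpi/3.1.2 lammps-omp/20190807
srun lmp_icc_openmpi -in in.lammps    # 'lmp' works as well (symbolic link, see below)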

 

The following packages are included in this version:

$ cat ${EBROOTLAMMPS}/list-packages.txt | grep -a "Installed YES:"

Installed YES: package ASPHERE
Installed YES: package BODY
…..

 

The following packages are not supported in this version:

$ cat ${EBROOTLAMMPS}/list-packages.txt | grep -a "Installed  NO:"

Installed  NO: package GPU
Installed  NO: package KOKKOS

……

To figure out the name of the executable for LAMMPS, do the following (example for lammps-omp/20170811):

module load lammps-omp/20170811

ls ${EBROOTLAMMPS}/bin/

lmp  lmp_icc_openmpi

 

From this output, the executable is: lmp_icc_openmpi and lmp is a symbolic link to the executable. We have done the same for each version installed: there is a symbolic link to each executable and it is called lmp. It means that no matter which module you pick, lmp will work as executable for that module.
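If in doubt, you can confirm where the link points with:

ls -l ${EBROOTLAMMPS}/bin/lmp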

For detailed information about a specific "lammps" module (including how to load the modules) use the module's full name.  For example:  $ module spider lammps/20170331

 

Check available modules:

module avail

To find other possible module matches execute:

 $ module -r spider '.*lammps.*'

See also: slurm-directives on plato and cedar.

 

Details on the performance of a running job can be found via the cluster portal; log in to see them, e.g. on Narval:

https://portail.narval.calculquebec.ca/

 

Info on a running job (and updating its time limit):

scontrol show jobid 13448796

scontrol update jobid=446153 timelimit=10-00:00:00

 

To cancel a job:

scancel 13448796
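scancel can also cancel all of your own jobs at once with the user filter:

scancel -u $USER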

 

Details about the efficiency of a completed job can be found via seff:

 seff 40220208
Job ID: 40220208
Cluster: cedar
User/Group: szpunarb/szpunarb
State: TIMEOUT (exit code 0)
Nodes: 6
Cores per node: 48
CPU Utilized: 186-22:22:53
CPU Efficiency: 16.23% of 1152-01:12:00 core-walltime
Job Wall-clock time: 4-00:00:15
Memory Utilized: 5.54 GB
Memory Efficiency: 0.49% of 1.10 TB

 

A less useful command for info on a running job:

sacct -j 39677871
   Account      User JobID                      Start                 End  AllocCPUS    Elapsed                      AllocTRES    CPUTime     AveRSS     MaxRSS MaxRSSTask MaxRSSNode        NodeList ExitCode                State
---------- --------- ------------ ------------------- ------------------- ---------- ---------- ------------------------------ ---------- ---------- ---------- ---------- ---------- --------------- -------- --------------------
rrg-szpun+  szpunarb 39677871     2022-07-18T04:15:16 2022-07-20T04:15:24         48 2-00:00:08 billing=48,cpu=48,mem=187.50G+ 96-00:06:24                                                     cdr2183      0:0              TIMEOUT
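The output is easier to read if narrowed to a few fields with --format (using field names from the listing above), for example:

sacct -j 39677871 --format=JobID,Elapsed,MaxRSS,State,ExitCode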

 

https://www.rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/

https://researchcomputing.princeton.edu/support/knowledge-base/memory

https://westgrid.github.io/manitobaSummerSchool2018/4-materials.html

 https://www.westgrid.ca/support/training

Info on new servers:

https://docs.computecanada.ca/wiki/Compute_Canada_Documentation

https://docs.computecanada.ca/wiki/Getting_Started

https://docs.computecanada.ca/wiki/Running_jobs

https://docs.computecanada.ca/wiki/Available_software

 https://docs.computecanada.ca/wiki/Project_layout

 https://docs.computecanada.ca/wiki/Sharing_data

 https://docs.computecanada.ca/wiki/Utiliser_des_modules/en

Note Ali's presentation and comments about LAMMPS:

In case you are interested in how the program scales: this week I gave a presentation (online webinar) with a few slides about how the time spent computing interactions between particles (in LAMMPS) scales with the number of processors and the number of particles. For a given system, the more processors you add, the more the communication between processors increases, which kills the performance of the program. I used one type of potential, but the idea is the same, since for almost all potentials the program spends more than 80% of its time computing the pair interactions. Even for a system with 1,000,000 particles, the efficiency drops beyond 16 or 32 cores. The results may differ a little if the shape of the simulation box is different.

You can see the slides here:

https://www.westgrid.ca/events/introduction_classical_molecular_dynamics_simulations

 

Resources on local servers:

Globus and other: https://www.usask.ca/ict/services/research-technologies/advanced-computing/plato/running-jobs.php

WestGrid link: https://www.westgrid.ca/support/quickstart

Calcul Québec: https://wiki.calculquebec.ca/w/Accueil?setlang=en

https://wiki.calculquebec.ca/w/Ex%C3%A9cuter_une_t%C3%A2che/en

Plato:

 

https://wiki.usask.ca/display/ARC/Plato+technical+specifications

https://wiki.usask.ca/display/ARC/Quantum+ESPRESSO+on+Plato

https://wiki.usask.ca/display/ARC/LAMMPS+on+Plato

to search:

https://wiki.usask.ca/display/ARC/Advanced+Research+Computing

Note:

avx1 nodes have 16 processors, with 310000M of memory per node available for use;

avx2 nodes have 40 processors, with 190000M of memory per node available for use.

See: https://wiki.usask.ca/pages/viewpage.action?pageId=1620607096

 

Memory limit on Plato: the physical memory on each node is 16 × 1.94 GB = 31.04 GB.

Increase memory by using more nodes.
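A minimal sketch of the directives for spreading a job over two avx2 nodes to get more total memory (core count taken from the avx2 description above):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40   # avx2 nodes have 40 processors
#SBATCH --mem=0                # all usable memory on each node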

More info on Plato according to Juan:

 

Time limits (and priorities) are as follows:

long: 504 hours (lowest priority)

short: 84 hours (normal priority)

rush: 14 hours (highest priority)

 

Jobs are simply sorted according to walltime: if you request less than 14 hours, the job goes to rush and gets a boost in priority.

The 'R' and 'S' prefixes stand for Researchers and Students. Students' jobs also have higher priority, i.e. the time limits are the same, but Srush has higher priority than Rrush.

 

Note: QE from Compute Canada is only fully compatible with avx2 nodes.

Batch jobs*: Note that users who run VASP on Plato have a separate access account for all software and are required to add, as the second line (see the VASP batch job): #SBATCH --account=hpc_p_szpunar
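A sketch of the first lines of such a Plato VASP batch job (the walltime is only an example; anything under 14 hours lands in the rush queue, as noted above):

#!/bin/bash
#SBATCH --account=hpc_p_szpunar   # separate access account for VASP users on Plato
#SBATCH --time=13:00:00           # under 14 h, so the job gets rush priority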

Software / batch job examples on Plato, Guillimin, Grex, Bugaboo, and Jasper:

Third on avx2
- Plato: QE6.4.1plato.third

QE
- Plato: Plato_QE.job, Plato_QE6.4.1.job#
- Guillimin: guillimin_QE.job
- Grex: Grex_QE.job
- Bugaboo: BugabooDefault_QE.job, Bugaboo_QE.job
- Jasper: Jasper_QE.job

QE (pw, ph)
- Plato: Platojob2nodesQE6.2.2complete, pbs-ph-plato_espresso-6.1.1plato

EPW
- Plato: Plato_EPW.job, Plato_EPWQE2.2
- Guillimin: guillimin_EPW.job
- Grex: Grex_EPW.job
- Bugaboo: Bugaboo_EPW.job
- Jasper: Jasper_EPW.job

BTE
- Plato: Plato_BTE_Sowingavx2.job, Plato_BTE_Reapingavx2.job, Plato_BTEavx2.job, Plato_BTE complete_current.job, Plato_BTE_Sowing_prev.job, Plato_BTE_Reaping_prev.job, Plato_BTE_prev.job
- Grex: Grex_BTE_Sowing.job, Grex_BTE_Reaping.job, Grex_BTE.job
- Bugaboo: Bugaboo_BTE_sowing.job, Bugaboo_BTE_Reaping.job, Bugaboo_BTE.job
- Jasper: Jasper_BTE_Sowing.job, Jasper_BTE_Reaping.job, Jasper_BTE.job, Jasper_BTE_Array.job

almaBTE
- Plato: Job-almabte-VCAbuilder-plato, Job-almabte-kappa_Tsweep-plato

LAMMPS
- Plato: Plato LAMMPS2019.job, LAMMPS2019Latte.job, Plato_LAMMPS2017.job, Plato_LAMMPS2015.job
- Guillimin: guillimin_LAMMPS.job
- Grex: Grex_LAMMPS.job
- Bugaboo: Bugaboo_LAMMPS.job
- Jasper: Jasper_LAMMPS.job

GULP
- Plato: Plato_GULP.job

VASP
- Plato: VASP_plato_Jaya

WIEN2K
- Grex: Grex_job_wien2k.dos1

STATUS (UN = your username)
- Plato: qstat -u UN
- Guillimin, Grex, Bugaboo, Jasper: showq -u UN

*As noted above, this web page is managed on a Windows PC, so batch job examples downloaded to a Unix system may be in DOS format and should be converted with, e.g., dos2unix batch_job.

# versions of QE on plato:

   Versions:

        quantumespresso/6.0

        quantumespresso/6.1

        quantumespresso/6.2.2

        quantumespresso/6.3

        quantumespresso/6.4.1

[...]

 

$ module spider quantumespresso/6.4.1

 

[...]

You will need to load all module(s) on any one of the lines below before the

"quantumespresso/6.4.1" module is available to load.

 

      nixpkgs/16.09  gcc/7.3.0  openmpi/3.1.2

      nixpkgs/16.09  intel/2019.3  openmpi/4.0.1

[...]

 

$ module load gcc/7.3.0

$ module load openmpi/3.1.2

$ module load quantumespresso/6.4.1

 

 

 

 

 

MIGRATION

UNIVERSITY          | CONSORTIA       | CLUSTER NAME | DEFUNDING DATE
U. Guelph           | Compute Ontario | Mako         | 7/31/2017
U. Waterloo         | Compute Ontario | Saw          | 7/31/2017
U. Toronto          | Compute Ontario | TCS          | 9/27/2017
U. Alberta          | WestGrid        | Jasper       | 9/30/2017
U. Alberta          | WestGrid        | Hungabee     | 9/30/2017
U. Toronto          | Compute Ontario | GPC          | 12/31/2017
Dalhousie U.        | ACENET          | Glooscap     | 4/18/2018
Memorial U.         | ACENET          | Placentia    | 4/18/2018
St. Mary's U.       | ACENET          | Mahone       | 4/18/2018
U. New Brunswick    | ACENET          | Fundy        | 4/18/2018
McMaster U.         | Compute Ontario | Requin       | 4/18/2018
Queen's U.          | Compute Ontario | CAC          | 4/18/2018
U. Guelph           | Compute Ontario | Global_b     | 4/18/2018
U. Guelph           | Compute Ontario | Redfin       | 4/18/2018
U. Waterloo         | Compute Ontario | Orca         | 4/18/2018
Western U.          | Compute Ontario | Monk         | 4/18/2018
Western U.          | Compute Ontario | Global_a     | 4/18/2018
Western U.          | Compute Ontario | Global_c     | 4/18/2018
Western U.          | Compute Ontario | Kraken       | 4/18/2018
Simon Fraser U.     | WestGrid        | Bugaboo      | 4/18/2018
U. British Columbia | WestGrid        | Orcinus      | 4/18/2018
U. Calgary          | WestGrid        | Parallel     | 4/18/2018
U. Manitoba         | WestGrid        | Grex         | 4/18/2018
Concordia U.        | Calcul Quebec   | Psi          | 12/31/2018
McGill U.           | Calcul Quebec   | Guillimin    | 12/31/2018
U. Laval            | Calcul Quebec   | Colosse      | 12/31/2018
U. Montreal         | Calcul Quebec   | Cottos       | 12/31/2018
U. Montreal         | Calcul Quebec   | Briarée      | 12/31/2018
U. Montreal         | Calcul Quebec   | Hadès        | 12/31/2018
U. Sherbrooke       | Calcul Quebec   | MS2          | 12/31/2018
U. Sherbrooke       | Calcul Quebec   | MP2          | 12/31/2018