HACC-mlcluster Getting Started

Welcome to HACC-mlcluster!

Getting started:

Before you continue, please do the following:

  1. Fill in the HACC user survey form.
  2. Sign the external user agreement forms.
  3. Familiarize yourself with the Linux command line.

To request access to HACC, please contact the xilinx-center PI, Prof. Deming Chen. Once you have completed the checklist above, the system admin will provide you with a HACC account.

NOTE: HACC cluster accounts are different from UIUC’s Active Directory accounts. Your login ID may be the same, but HACC passwords are managed within the cluster only. Furthermore, HACC and HACC-mlcluster accounts are separate and may have different usernames and passwords.

Your first login:

When you log into HACC-mlcluster for the first time, make sure to do the following:

  1. Update your password from the provided default (see the example after this list).
  2. Ensure that a home directory has been created for you. If you do not have a home directory, contact the system admin.
  3. Configure VNC settings for GUI-based development (if desired). See guide here.
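
For example, the first two steps can be done right after logging in. This is a minimal sketch; the host name is a placeholder for the login node address the admin gives you.

    # Log in with the default credentials provided by the admin
    ssh <username>@<login-node>

    # Replace the default password with your own
    passwd

    # Confirm that your home directory exists
    ls -ld ~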

 

Submitting Jobs:

There are two types of jobs that can be submitted: interactive and non-interactive (batch).

For interactive jobs, see here.

For batch jobs, the typical flow will be as follows:

  1. Upload a directory containing all necessary files and executables. Consider this the container’s working directory.
  2. This directory should include a bash script that runs your application appropriately (a sketch follows this list).
  3. All paths should be relative to this directory, and all files should live here or in a subdirectory. Generated files that do not meet this requirement will not persist after the job terminates, and input files that do not meet it will not be visible to the application.
  4. If you want to use a large dataset in your application, check the /nfs/datasets directory first. If your dataset is not there, contact the system admins. Datasets are accessible to your application at /datasets/<dataset> within the container; your application must reference that path, not /nfs/datasets/<dataset>.
  5. Ensure all files have the appropriate permissions before starting the job. Execute permissions are typically lost when files are copied, and missing permissions are a common cause of failing jobs.
  6. See here for tips on how to run the job submission script.
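
To illustrate steps 1 through 5, here is a hypothetical job directory; every name in this sketch (my_job, run.sh, my_app, imagenet) is a placeholder.

    my_job/
      run.sh        # bash entry script the job executes
      my_app        # your executable
      inputs/       # input files, referenced by relative path
      outputs/      # generated files you want to persist

The entry script might look like:

    #!/bin/bash
    set -e
    # All paths are relative to my_job/, the container's working directory.
    # Mounted datasets appear at /datasets/<dataset>, not /nfs/datasets/<dataset>.
    ./my_app --input inputs/config.txt \
             --output outputs/result.txt \
             --dataset /datasets/imagenet

Before submitting, restore execute permissions, which are often lost when files are copied:

    chmod +x my_job/run.sh my_job/my_app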

Be nice:

Please respect the time and constraints of other users; this is a shared research cluster with limited resources. Do not attempt to consume all resources by submitting many jobs at once, and do not run multiple large compilations or builds on the shared development node.

What you need to know:

HACC-mlcluster supports both FPGA and GPU development. We provide several AMD Instinct MI210 GPUs, along with VCK5000 and Alveo U55C FPGA cards.

There are several machines with varied resources; you can check their specifications and accessibility here. The cluster is containerized and orchestrated with Kubernetes, and jobs are scheduled via Kueue. We provide instructions for submitting jobs here.
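
To make the Kueue flow concrete, here is a minimal, hedged sketch of one way a batch Job can be submitted with kubectl. The queue name, image, command, and resource request are all placeholders; it only illustrates the generic Kueue convention of labeling a suspended Job with kueue.x-k8s.io/queue-name. The job-submission instructions linked above are authoritative for this cluster.

    # Hypothetical batch submission via Kueue; all names are placeholders.
    kubectl apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: my-batch-job
      labels:
        kueue.x-k8s.io/queue-name: default-queue  # placeholder LocalQueue
    spec:
      suspend: true  # Kueue un-suspends the Job once it is admitted
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: my-registry/my-image:latest  # placeholder image
            command: ["./run.sh"]
            resources:
              limits:
                amd.com/gpu: "1"  # example: request one AMD GPU
    EOF

    # Check on the job
    kubectl get jobs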

The cluster status page will be periodically updated to reflect changes in node status, availability, and resources. For more information on our tools and software versions, please check this page.

General Notes

  • We do not support custom OSes or kernels.
  • You will not have root access.
  • You cannot run jobs on bare metal.

FPGA Notes

  • We only support Vitis-based FPGA flows. We do not currently support Vivado flows.
  • You may develop your hardware in C/C++, OpenCL, or RTL.
  • You must work with the installed FPGA shells.
  • We do not support custom FPGA images or custom shells.
  • No JTAG debugging is available.
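
As a rough sketch of the Vitis flow, compiling and linking a C/C++ kernel against an installed shell looks like this; the kernel and file names are placeholders, and <platform> stands for the platform string of the shell installed on the target U55C or VCK5000 node.

    # Compile a C/C++ kernel into a Xilinx object file (.xo)
    v++ -c -t hw --platform <platform> -k my_kernel \
        -o my_kernel.xo my_kernel.cpp

    # Link the kernel against the installed shell to produce an .xclbin
    v++ -l -t hw --platform <platform> \
        -o my_kernel.xclbin my_kernel.xo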

GPU Notes

  • We support the PyTorch Python library.
  • We use AMD GPUs with ROCm.
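
A quick sanity check that the ROCm stack and PyTorch can see the GPUs (ROCm builds of PyTorch expose the device through the familiar torch.cuda API):

    # List the AMD GPUs visible on the node
    rocm-smi

    # Confirm PyTorch can see them
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"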

How to use the cluster:

HACC has been designed to enable resource sharing. You will need to submit “jobs” to the scheduler in order to run your tasks; you cannot execute your jobs directly on a compute node. You can perform two types of operations on the cluster:

  1. Development – Compiling, synthesizing, and generating bitstreams for FPGA accelerators and/or developing PyTorch code for GPUs.
    We have one shared development node for these activities. This node is NOT governed by the job scheduler right now; users are free to SSH directly into the node and compile their projects. As the number of users increases, we may need to put the development node behind the scheduler, in which case you will need to submit Vitis compilations as jobs as well.
  2. Compute
    1. Running accelerated kernels on FPGAs
      Once the project is built, users should have a host executable and a Xilinx .xclbin file (partial bitstream). Users may then submit a job to the scheduler and request time on an accelerated compute node (a sketch follows at the end of this section).
    2. Running GPU jobs
      We support both interactive, Jupyter-based flows and batch job flows. If you’d like to use the interactive Jupyter flow, please check here.
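
As a sketch of the FPGA compute flow described in 2.1, the job’s entry script typically just launches the host program with the .xclbin. Both names below are placeholders from your own Vitis build, and the exact interface depends on how your host code parses its arguments.

    #!/bin/bash
    set -e
    # Typical XRT host programs take the .xclbin path as an argument
    ./host my_kernel.xclbin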