For those interested in the kernel language, you can look at the referenced paper here: https://tianshilei.me/wp-content/uploads/llvm-hpc-2023.pdf

cuda code ompx equivalent

(left is some CUDA code, and right is the SIMT OpenMP equivalent)

They seem to have implemented equivalents of most of what you’d like to see when you come from the CUDA/HIP world: similar 3D thread indexing (with things like ompx_thread_id_x or ompx::thread_id(ompx::DIM_X)), shared/local memory, explicit allocation/memcpy (omp_target_alloc and omp_target_memcpy), synchronization barriers (ompx_sync_thread_block and ompx_sync_warp), warp/wavefront intrinsics etc…

What’s impressive and very encouraging is that ompx achieves performance that is very similar to CUDA on the A100 and sometimes even better performance than HIP on the MI250:

performance comparison plots