Architecture and System Technology for Scalable Computing
For future scalable computing systems, we explore not only processor,
memory, and I/O subsystem architectures but also system hardware and software
together. In particular, we aim to prototype our cool concepts with various
implementation technologies and demonstrate them on commercial systems,
not just evaluate them with simulators. We have aimed to impact
industry research and development with our proposals, and our proposals have been
evaluated and/or adopted by big tech companies. Our recent research focuses
on exploring cool memory and networking technologies, such as Compute eXpress
Link (CXL) based memory modules and various SmartNICs, to make large-scale
computer systems more efficient in terms of performance and energy consumption.
The recent technologies we have been exploring:
CXL Memory Expander
Modern computing systems have demanded ever-higher memory bandwidth. However, the current parallel
interface (e.g., DDR4) can increase neither the number of (memory) channels
nor the bit rate of the channels without paying high package and power costs.
In contrast, the high-speed serial interface (HSI) can offer much higher
bandwidth for the same number of pins and lower power consumption for the
same bandwidth than the parallel interface. This allows us to integrate more
channels under a pin and/or package power constraint but at the cost of
longer latency for memory accesses and higher static energy consumption in
particular for idle channels. In our work, we first provided a deep understanding
of the recent HSI (e.g., PCIe), which exhibits very distinct characteristics from past
serial interfaces in terms of bit rate, latency, energy per bit transfer, and
static power consumption. To overcome the limitation of using only parallel
(e.g., DDR) or serial (e.g., PCIe) interfaces, we proposed a hybrid memory
channel architecture, Alloy (the top illustration), consisting of low-latency
parallel and high-bandwidth serial channels. At the architecture level, this
proposal is the same as the current CXL memory expander, in which a CPU is
connected to both DDR memory and a PCIe-based CXL (Compute eXpress Link) memory
expander (the middle illustration). CXL is a high-speed
interconnect technology that enables data transfer between CPUs, GPUs, and
other devices. CXL-based memory expander technology builds on CXL.mem, the
protocol in the CXL specification that allows a system's memory capacity to be
expanded beyond what is physically installed in it.
Traditionally, memory expansion has been achieved by adding more DIMM (dual
in-line memory module) slots or increasing the capacity of existing DIMMs.
However, CXL.mem allows for the use of external memory devices, such as
high-speed DRAM or persistent memory, to expand a system's memory capacity.
This is achieved by connecting these external devices to the system via the
CXL interconnect. A CXL.mem device uses the CXL.io protocol for discovery and
configuration, while memory reads and writes are carried by the CXL.mem protocol
itself. The system sees the external memory devices as if
they were part of the local memory, allowing for seamless integration and
efficient access to the additional memory capacity. One of the key advantages
of CXL.mem is its high bandwidth and low latency. This allows for faster
access to data stored in the external memory devices, which can be
particularly beneficial in data-intensive applications such as machine
learning, data analytics, and high-performance computing. CXL.mem is still a
developing technology, and it is not yet widely available. However, it has
the potential to greatly increase memory capacity and performance in a wide
range of applications.
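To make the "seen as local memory" point concrete, here is a minimal sketch of how an application on Linux could place data on a CXL memory expander, assuming the expander is exposed as a CPU-less NUMA node; the node ID (1) and buffer size are illustrative assumptions, not tied to any particular product:

/* Illustrative sketch: allocate a buffer on a CXL memory expander that
 * Linux exposes as a CPU-less NUMA node. The node ID below is an assumption;
 * check `numactl --hardware` on the actual system. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    const int cxl_node = 1;          /* assumed node ID of the CXL expander */
    const size_t size  = 1UL << 30;  /* 1 GiB, illustrative */

    /* Ordinary load/store-accessible memory, just homed on the CXL node. */
    char *buf = numa_alloc_onnode(size, cxl_node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, size);            /* regular stores; no special API on the data path */
    printf("placed %zu bytes on NUMA node %d\n", size, cxl_node);

    numa_free(buf, size);
    return 0;
}

Because CXL.mem accesses are ordinary cacheable loads and stores, no driver call appears on the data path; the only decision software makes is where the pages are homed.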
Memory Channel Network (MCN) based Near-DRAM Processing
A Research Prototype based on
IBM’s ConTutto and POWER8 System
An Industry Prototype by Samsung
The physical memory
capacity of servers is expected to increase drastically with deployment of
the forthcoming non-volatile memory technologies. This is a welcome
improvement for emerging data-intensive applications. For such servers to be
cost-effective, nonetheless, compute throughput and memory bandwidth must
increase commensurately with the increase in memory capacity, without
compromising application readiness. Tackling this challenge,
we developed the Memory Channel Network (MCN) architecture. Specifically, we first
proposed an MCN DIMM, an extension of a buffered DIMM in which a small but
capable processor, called an MCN processor, is integrated with a buffer device on
the DIMM for near-memory processing. Second, we implemented device drivers to
give the host and MCN processors in a server an illusion that they are
independent heterogeneous nodes connected through an Ethernet link. These
allow the host and MCN processors in a server to run a given data-intensive
application together based on popular distributed computing frameworks such
as MPI and Spark without any change in the host processor hardware and its
application software, while offering the benefits of high-bandwidth and
low-latency communications between the host and the MCN processors over
memory channels. As such, MCN can serve as an application-transparent
framework which can seamlessly unify near-memory processing within a server
and distributed computing across such servers for data-intensive applications.
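To illustrate what "without any change in the application software" means in practice, below is a generic MPI reduction sketch; it is not code from the MCN work, just the kind of unmodified distributed program MCN targets, which could run the same way across Ethernet-connected servers or across the host and MCN processors that the MCN drivers expose as ordinary network endpoints:

/* Generic MPI sketch: each rank sums a local slice of data and the result is
 * reduced at rank 0. Nothing here is MCN-specific; the MCN drivers make the
 * host and MCN processors look like ordinary nodes to such a program. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank works on its own slice of a (synthetic) data set. */
    const long n = 1L << 20;
    long local_sum = 0;
    for (long i = 0; i < n; i++)
        local_sum += (i + rank) % 7;

    long global_sum = 0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %ld\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}

Which ranks land on the host processor and which on MCN processors is decided by the MPI launcher's host list, not by the application.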
Our evaluation shows that a server with 8 MCN DIMMs offers 4.56X higher
throughput and consumes 47.5% less energy than a cluster with 9 conventional
nodes connected through Ethernet links, as it facilitates up to 8.17X
higher aggregate DRAM bandwidth utilization. Then, we demonstrated the
feasibility of MCN with an IBM POWER8 system and an experimental buffered DIMM.
We were the first
research group to propose DIMM-based near-DRAM computing (NDA: Near-DRAM
Acceleration architecture leveraging commodity DRAM devices and standard
memory modules, HPCA 2015), and this MCN-based near-DRAM processing work
became the foundation for Samsung’s AxDIMM; Samsung and Meta jointly
evaluated its efficacy for accelerating the Deep Learning Recommendation Model
(DLRM). The AxDIMM has been used as a
popular platform to explore various near-DRAM processing concepts until
CXL-based memory modules were introduced to industry researchers.
Processing In Memory (PIM)
An Industry Prototype by Samsung
Integration with Commercial Systems
The performance of
modern computing systems is bottlenecked by the off-chip communication
bandwidth and its energy cost. Integrating accelerators within DRAM
(i.e., PIM) can mitigate these bottlenecks and additionally expose the
accelerators to the higher internal bandwidth of DRAM. However, such an
integration is challenging, as it should not disturb the highly optimized
DRAM core circuitry. To enable the integration, this work proposed
integrating a SIMD acceleration unit with each DRAM bank (the top illustration).
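To make the bank-level SIMD idea concrete, the following is a purely functional sketch written for illustration; it is not the proposed hardware's interface, and the bank count, slice size, and bank_simd_axpy function are hypothetical. Data is striped across banks, and each bank's SIMD unit applies the same element-wise operation to its local slice:

/* Functional sketch of bank-level SIMD PIM: an element-wise y = a*x + y is
 * striped across banks, and each bank's SIMD unit processes only its local
 * slice. In hardware the per-bank loops run concurrently inside the DRAM;
 * here they are plain C loops, purely to illustrate the data partitioning. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_BANKS      16    /* illustrative bank count */
#define ELEMS_PER_BANK 1024  /* illustrative slice size per bank */

/* Models one bank's SIMD unit: it touches only data resident in that bank. */
static void bank_simd_axpy(float a, float *x_slice, float *y_slice, int n) {
    for (int i = 0; i < n; i++)
        y_slice[i] += a * x_slice[i];
}

int main(void) {
    int total = NUM_BANKS * ELEMS_PER_BANK;
    float *x = malloc(total * sizeof *x);
    float *y = malloc(total * sizeof *y);
    for (int i = 0; i < total; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Each bank's unit sees only its own slice; no data crosses the
     * off-chip channel while the operation runs. */
    for (int b = 0; b < NUM_BANKS; b++)
        bank_simd_axpy(3.0f, x + b * ELEMS_PER_BANK,
                             y + b * ELEMS_PER_BANK, ELEMS_PER_BANK);

    printf("y[0] = %.1f (expected 5.0)\n", y[0]);
    free(x); free(y);
    return 0;
}

Because each bank's unit operates on bank-local data, the bandwidth available to the operation scales with the number of banks rather than with the off-chip channel width.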
Later, Professor Kim
worked with talented Samsung engineers and researchers to further develop an
industry version of an advanced PIM architecture similar to what he proposed
(i.e., integration of SIMD units with DRAM banks). For commercial success,
he aimed to develop a PIM architecture that does not demand any change to (1)
the existing processor architectures and (2) the existing DRAM interface, so
that the PIM device can be a drop-in replacement for a standard DRAM device;
this development philosophy was inspired by the success of the SSD, which was
initially developed as a drop-in replacement for the HDD although it later
began to use an interface specialized for SSDs.
The PIM architecture was implemented in HBM2 (HBM-PIM) (the middle
illustration), and later HBM-PIM was integrated with a Xilinx U280 FPGA (the bottom
illustration), an AMD MI-50 GPU, and an AMD MI-100 GPU. This PIM architecture
became the foundation of a JEDEC HBM3-PIM standard later.
News
[Award] `Making sense of using a SmartNIC to reduce
datacenter tax from SLO and TCO perspectives' received a Best Paper Runner-up
Award from IISWC. Congratulations to Jinghan, Jiaqi, Yan, and Jerry!
[Honor] `Drowsy Caches: Simple Techniques for Reducing
Leakage Energy' published at ISCA in 2002 is selected as one of the ISCA@50
25-Year Retrospective papers.
Dong Kai (Edward) Wang successfully completed his Ph.D. defense and joins the
University of Illinois, Urbana-Champaign as a Teaching Assistant Professor!
The center contract is signed today! I will co-lead the SRC
JUMP 2.0 Intelligent Memory and Storage Center (PRISM: PRocessing In
Storage and Memory), starting January 2023.
Youjie Li successfully completed his Ph.D. defense and joins Meta!
Yifan Yuan successfully completed his Ph.D. defense and joins Intel Research
as a Sr. Research Scientist!
[Honor] Professor Kim’s work on PIM hardware architecture
and software stack, “Hardware Architecture and Software Stack for PIM Based on Commercial
DRAM Technology” was recognized as an Honorable Mention by IEEE Micro Top Picks.
[Award] Professor Kim’s 2003 IEEE/ACM International
Symposium on Microarchitecture (MICRO) paper received the SIGMICRO Test of Time Award!
[Honor] Professor Kim is named the W.J.
‘Jerry’ Sanders III – Advanced Micro Devices, Inc. Endowed Chair
[Industry’s first HBM-based PIM Solution] Professor Kim’s work on PIM hardware architecture and software stack, “Hardware Architecture and Software Stack for PIM
Based on Commercial DRAM Technology” was accepted to ISCA 2021!
[Keynote Talk] Professor Kim gave a keynote speech on his PIM
work at HPCA!
[Industry’s first HBM-based PIM Chip] Professor Kim’s paper on the industry’s first HBM Processing In Memory
(PIM) chip will be presented at ISSCC! [Samsung’s
Press Release, YouTube Press Release, ZDNet]
[Honor] Professor Kim was elevated to ACM Fellow! [ACM Press Release]
[Honor] `BabelFish: Fusing Address Translations for Containers', which appeared in ISCA 2020, was selected as one of
the IEEE Micro Top Picks 2021 papers.