Title of the session: “Silicon-Proven Multiprocessor SoC Design Strategies”
Organizer: Gerhard Fettweis, TU Dresden, Germany
List of Special Session presentations:
1. Challenges and Solutions for Future-Proof Embedded Vision and AI
Processors, Yankin Tanurhan, Tom Michiels, Eino Jacobs, Joep Boonstra,
Chuck Pilkington, Pierre Pauli, Synopsys, USA
The field of embedded vision continues to grow in importance across a wide range of
surveillance, mobile, consumer, automotive, industrial and IoT applications. In addition,
the fusion of vision and other sensors – e.g. audio, radar and lidar – has demanded more
compute power and more general-purpose parallel processing. This has driven the
continued development of parallel vector-oriented DSPs to address classical vision
algorithms and new fusion applications. In addition, in the last decade, deep-learning
based solutions for embedded vision have emerged as a key application of the growing
class of artificial intelligence-based solutions. Convolutional Neural Networks (CNN)
and Recurrent Neural Networks (RNN) have been key classes of deep-learning solutions
in the field of embedded vision.
Embedding large-scale deep learning vision applications at the edge remains
challenging today, especially in automotive applications where the computing
requirements for ADAS and autonomous driving can range between 10 and 100 TOPS,
with power targets below 10 W. These computational loads, combined with
high-dimensional input data and the large algorithmic diversity of modern vision tasks,
imply significant memory and data bandwidth requirements.
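As a rough illustration of what the figures quoted above imply, the following sketch converts the 10 to 100 TOPS range at a 10 W budget into efficiency numbers. It assumes the whole power budget is spent on compute, which is a simplification, and the function name is purely illustrative.

```python
def efficiency(tops: float, watts: float) -> tuple[float, float]:
    """Return (TOPS per watt, picojoules per operation)."""
    tops_per_watt = tops / watts
    # 1 TOPS/W is 1e12 operations per joule, i.e. 1 pJ per operation.
    pj_per_op = 1.0 / tops_per_watt
    return tops_per_watt, pj_per_op


POWER_BUDGET_W = 10.0  # upper bound quoted in the text

for tops in (10.0, 100.0):  # range quoted in the text
    tpw, pj = efficiency(tops, POWER_BUDGET_W)
    print(f"{tops:.0f} TOPS at {POWER_BUDGET_W:.0f} W: {tpw:.0f} TOPS/W, {pj:.1f} pJ/op")
```

Under this assumption the upper end of the range leaves about a tenth of a picojoule per operation, including memory traffic, which is why bandwidth matters as much as raw compute.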
We will introduce a standards-based programming model for the EV6 vision CPU. The
Khronos OpenVX standard is used as the top-level programming model for high-level
dataflow and task-level parallelism. Our OpenVX runtime allows for the optimized
mapping of OpenVX kernels to parallel processing resources, while automatically
generating optimized data communication and synchronization. The kernels of the
OpenVX graph are expressed using the OpenCL C kernel language. The EV6 cores
were specifically designed to ensure efficient support of the OpenCL C kernel language.
As a result, our OpenCL C compiler supports the automatic vectorization of these
kernels, fully exploiting small-grain data-level parallelism.
Machine-learning based Deep Neural Networks (DNN) have emerged as a key
complementary new class of technology alongside classical vision DSPs. While most
of the early applications of DNN have been on general-purpose GPUs or straightforward
extensions of existing vector processors, there has been increased interest in more
specialized DNN engines in order to provide high performance with low power, low
area and low bandwidth. In addition, due to the rapid rate of innovation in deep
learning, such solutions must also remain flexible and future-proof.
We will present our scalable third-generation neural network acceleration engine
designed to address low-power and low-cost requirements for embedded systems, while
offering high flexibility and the support of state-of-the-art CNN and RNN graphs. This
engine offers class-leading MACs/cycle at the cost and power of a hardwired solution
but is fully programmable via automated CNN graph mapping tools. It is scalable to
three configuration sizes supporting 880, 1760 and 3520 MACs/cycle.
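To relate the three configuration sizes to peak throughput, a minimal sketch is given here. The clock frequency is a hypothetical value chosen only for illustration (the abstract does not state one), and counting one MAC as two operations is a common convention rather than something taken from the abstract.

```python
CONFIGS_MACS_PER_CYCLE = (880, 1760, 3520)  # configuration sizes from the text
ASSUMED_CLOCK_HZ = 1.0e9  # hypothetical clock frequency, not from the abstract
OPS_PER_MAC = 2           # assumed convention: one multiply plus one add


def peak_tops(macs_per_cycle: int, clock_hz: float) -> float:
    """Peak throughput in tera-operations per second at full MAC utilization."""
    return macs_per_cycle * OPS_PER_MAC * clock_hz / 1e12


for macs in CONFIGS_MACS_PER_CYCLE:
    print(f"{macs} MACs/cycle: {peak_tops(macs, ASSUMED_CLOCK_HZ):.2f} peak TOPS")
```

These are upper bounds; sustained throughput depends on how well a given graph keeps the MAC array busy.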
We will additionally describe the EV6 automatic neural network graph mapping tools
that compile CNN and RNN graphs captured in well-known frameworks such as
TensorFlow and Caffe, as well as many others via the recent ONNX (Open Neural
Network Exchange) format. This fully automated mapping tool generates the optimized
executable on the scalable CNN accelerator engine and the Vision CPU’s vector DSP
processing units. We also introduce CNN graph optimization techniques that can be
used to further exploit the available computing and memory resources. We illustrate
tradeoffs in performance, power, bandwidth and accuracy using well-known CNN
2. Energy efficient computing from Exascale to MicroWatts: A Parallel Ultra-
Low Power RISC-V experience, Luca Benini, ETHZ, Switzerland
The end of Moore's Law and the Cambrian explosion of Machine Learning applications
and related workloads across the full range of the computing continuum are creating a
fantastic window of opportunity for innovation in digital circuits and systems. The
open RISC-V ISA has a major catalytic potential, as it unlocks the possibility to revive,
from the ground up, cross-layer approaches for achieving energy efficiency.
Furthermore, the free and open source philosophy at the root of the RISC-V movement
has created the premises for a fertile open innovation ecosystem. The Parallel Ultra-
Low Power (PULP) open platform, based on the RISC-V ISA and originated by the
combined effort of the University of Bologna and ETH Zurich, was created to foster
open innovation in computing and to achieve energy efficiency and proportionality from
high-performance processors and massively parallel accelerators to microWatt
near-sensor processing engines. The talk will give a view on the key milestones, the insights
gained, and the vision and challenges that lie ahead.
3. The KAVUAKA-Hearing-Aid Processor low-power Chip-Design, Holger Blume, Guillermo Payá Vayá, Lukas Gerlach
Leibniz Universität Hannover, Germany
The power consumption of digital hearing aids is very limited. At the same time, there
is a trend for more computing power to make future hearing aids more intelligent.
Future hearing aids should be able to detect, localize and filter out target speakers in
complex acoustic environments to further increase the speech intelligibility of the
individual hearing aid user. Computationally complex algorithms are required for this
task. In order to maintain an acceptable battery life, the hearing aid processing
architecture has to be highly optimized for ultra-low power consumption and high
The integration of application specific instruction set processors (ASIPs) in hearing aids
enables a variety of architectural customizations to meet the stringent power
consumption constraints and processing performance requirements. We present the
application specific KAVUAKA hearing aid processor, which was customized and
optimized using state-of-the-art hearing aid algorithms like sound source localization,
noise reduction, beamforming and dynamic compression. Specialized and complex
instructions were added to the basic instruction set. Noteworthy extensions are a
multiply-accumulate (MAC) unit for real- and complex-valued numbers,
architectures for power reduction during register accesses, and a low-latency audio
interface. During the development phase of KAVUAKA, many design space
exploration techniques, including netlist simulations based on the target 40 nm ASIC
technology, were applied. Functional verification was carried out using in-circuit
emulation with dummy hearing aids.
The final integrated hearing aid ASIC is a system-on-chip (SoC) with four KAVUAKA
processor cores and 10 co-processors on one die. Each of these processors and
co-processors includes individual customizations and hardware features, and the data path
width varies between 24, 32, 48 and 64 bits. The processors are organized in two
clusters, which share memories, an audio interface, co-processors and serial interfaces.
Each component can be activated separately or simultaneously to increase computing
power or minimize power dissipation. This feature enables power consumption
measurements for different hearing aid system configurations. The measured average
power consumption is less than 1 mW per core. The area is less than 1 mm² per core.
4. 5G Scalable Machines, Gerhard Fettweis, Emil Matus, Stefan Damjancevic, Robert Wittig, et al., TU
With the advent of 5G we face a new implementation challenge. Not
only does the data rate increase once again by an order of magnitude, but the end-to-end
latency of a communication interaction is an additional bottleneck. Hence, the main
focus has been put on delivering the extreme: an end-to-end experience of 5 ms
latency and a minimum of 1 Gb/s data rate. This seems to be the requirement for
the vast market of 5G smartphones, where data rate and latency are stringent
requirements, e.g. for augmented and virtual reality.
However, applications of 5G communications technology require different “pods” of
optimization within the latency-rate plane. Applications cover 3 orders of magnitude of
latency, starting at a 2 s requirement for connecting sensor nodes or for video delivery
and extending all the way to interactive robotics. They also cover 5 orders of magnitude
of data rate requirements, starting from 10 kb/s voice and reaching up to 1 Gb/s for virtual
Building one chip that does it all is certainly a tough challenge. Also, very large
markets exist which require only a small window of latency-rate parameters. For example,
for factory automation we can expect 10-100 Mb/s at 5 ms latency to address the needs.
Hence the question: does it make sense to build a solution for the high-end only, or can
we build scalable solutions that match the individual application needs?
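The trade-off behind this question can be sketched by treating applications and solutions as points in the latency-rate plane. The requirement figures below are taken from the text (for factory automation, the upper end of the quoted rate range is used); the two design points and all identifiers are hypothetical and serve only as an illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A point in the latency-rate plane."""
    name: str
    latency_s: float  # end-to-end latency (required maximum or delivered)
    rate_bps: float   # data rate (required minimum or delivered)


def covers(solution: Point, requirement: Point) -> bool:
    """True if the solution meets both the latency and the rate requirement."""
    return (solution.latency_s <= requirement.latency_s
            and solution.rate_bps >= requirement.rate_bps)


def decades(low: float, high: float) -> float:
    """Number of orders of magnitude between two positive values."""
    return math.log10(high / low)


# Requirements quoted in the text.
ar_vr = Point("AR/VR smartphone", 5e-3, 1e9)
factory = Point("factory automation", 5e-3, 100e6)

# Hypothetical design points, for illustration only.
high_end = Point("high-end only", 5e-3, 1e9)
scaled = Point("scaled-down", 5e-3, 100e6)

print(f"data rate span, 10 kb/s to 1 Gb/s: {decades(10e3, 1e9):.0f} decades")
for sol in (high_end, scaled):
    for req in (ar_vr, factory):
        headroom = sol.rate_bps / req.rate_bps
        print(f"{sol.name} vs {req.name}: covers={covers(sol, req)}, "
              f"rate headroom x{headroom:g}")
```

In this toy model the high-end point covers factory automation but with a tenfold rate surplus, while the scaled-down point fits factory automation exactly and cannot serve AR/VR.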
In this talk we want to introduce a scalable hardware/firmware solution to this problem.
The idea is for the firmware to be written once and remapped
automatically by a run-time scheduler onto different hardware elements as needed. The
hardware can be configured at 2 levels of hierarchy. At the node level we propose a new
memory interface which allows multiple processors as well as accelerators to
efficiently share common memory sub-blocks, so that energy-expensive copying of
memory is not required. At the MPSoC level we propose a tiled method where an
“intelligent” DMA and preprocessor is integrated into the router access.
This 2-level approach can be tiled at both levels of hierarchy into a configuration
addressing exactly the required range of the latency-rate plane without redesigning
hardware or firmware components.
Title of the session: "Innovations in IC Design in FDSOI CMOS"
Organizer: Andrei Vladimirescu; Institut Superieur d’Electronique de Paris (ISEP); University of California
List of Special Session presentations:
1. New design of analog and mixed cells using back-gate cross-coupled structure, Gilles Jacquemod,
Zhaopeng Wei, Yves Leduc, Emeric de Foucauld Polytech Lab, EA UNS 7498, Biot, France; CEA-LETI, Grenoble, France
Following Moore's Law, the bulk MOS transistor is reaching its limits: Sub-threshold Slope (SS),
Drain Induced Barrier Lowering (DIBL), Threshold Voltage (VTh) and VDD scaling slowing down,
more power dissipation, less speed gain, less accuracy, variability and reliability issues. Moreover,
while the digital blocks continue to shrink, analog cells hardly shrink at all. For example, due to the
Short Channel Effect or noise, we have to implement longer transistors, especially for analog cells.
Thanks to the dependence of the threshold voltage of Ultra-Thin Body and Box (UTBB) FDSOI
transistors on back-gate biasing, we realized new topologies for digital and analog cells.
In this paper, we present a new inverter topology to realize a voltage controlled ring oscillator
(VCRO) and a PLL using FDSOI technology. The access to UTBB transistor Back-Gates (BG) offers
an extended control of the threshold voltages of the transistors, opening new opportunities for
improved performance. This new complementary structure is based on a pair of back-gate cross-coupled
inverters offering a fully symmetrical operation of complementary signals, and offers two other
advantages that are very important for ring oscillator realization. The first one concerns the duty cycle, which
has to be close to 50%. Secondly, this topology enables a VCRO with an even number of inverters.
This latter feature makes it easy to build a quadrature VCO (QVCO): four identical outputs (same
amplitude and same frequency) but with different phases (0°, 90°, 180° and 270°).
We have also adapted this concept of back-gate control to a current mirror implemented in the
charge pump of a PLL. As shown in Figure 1.a, the design principle is as follows: the back-gate of
each of the two transistors PA and PB is connected to the drain of the other instead of the source. As VDS of
PB increases, the threshold voltage of PA decreases, so a smaller bias voltage is needed to get the same
input current. At the same time, a smaller bias voltage (i.e. a larger threshold voltage for PB) can
reduce the output current, so as to compensate the short channel effect with appropriately sized
transistors. This circuit is implemented in 28 nm FDSOI technology and the layout, with a size of
30 × 40 µm², is given in Figure 1.b. We have implemented versions both with and without back-gate control on
the same die.
Figure 1. Current mirror with back-gate control
Both simulations and measurements of the VCRO and of the current source will be presented at the
conference; they validate our topologies.
2. A Distributed Body-Bias Strategy for Asynchronous Circuits, L. Fesquet, Y. Decoudu, A. R. Iga Jadue, T. Ferreira de Paiva Leite,
M. Diallo, R. Possamai Bastos, K. Morin-Allory, Univ. Grenoble Alpes, CNRS, Grenoble INP, TIMA, Grenoble, France; Laurent.Fesquet@univ-
The fast-evolving pace of mobile electronic devices has made it mandatory to reduce power
consumption without compromising circuit performance or robustness. Asynchronous circuits
have proven to be an excellent solution for designing the robust and energy-efficient circuits
required for the Internet-of-Things and mobile applications. Their local synchronization
mechanisms, based on handshake protocols, make them perfectly suitable for exploiting dynamic
power management techniques, such as Adaptive Body Biasing (ABB) in FD-SOI technologies.
Indeed, the circuit activity is simply detected by using the already existing handshake signals, enabling
the application of different ABB strategies with almost no modification to the original asynchronous
circuit. As the synchronization mechanisms are local to small logic blocks, the ABB strategy is able to
target from small to large body bias domains. To manage such a technique, a dedicated
analog standard cell has been designed to body bias small regions. Depending on the body bias
domain granularity, an appropriate number of these specific cells is inserted exactly as logic standard
cells during the back-end operations. Additionally, the robustness of asynchronous circuits makes it
possible to change the transistor threshold voltage on-the-fly, a requirement for applying ABB schemes
without complex power management issues.
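The idea of deriving body-bias decisions from existing handshake signals can be illustrated with a small behavioural model. This is only a sketch of one conceivable strategy, not the authors' circuit: it assumes a four-phase request/acknowledge protocol, a single bias domain, and two symbolic bias states, and all names are hypothetical.

```python
from enum import Enum


class Bias(Enum):
    FORWARD = "forward body bias (lower threshold voltage, faster)"
    NOMINAL = "nominal bias (higher threshold voltage, lower leakage)"


def handshake_active(req: bool, ack: bool) -> bool:
    """Assumed four-phase protocol: the channel is idle only when both are low."""
    return req or ack


def select_bias(req: bool, ack: bool) -> Bias:
    """Hypothetical policy: boost the domain while a handshake is in flight."""
    return Bias.FORWARD if handshake_active(req, ack) else Bias.NOMINAL


# One complete four-phase handshake: idle, req up, ack up, req down, ack down.
TRACE = [(False, False), (True, False), (True, True), (False, True), (False, False)]

for req, ack in TRACE:
    print(f"req={int(req)} ack={int(ack)} -> {select_bias(req, ack).name}")
```

Because the activity detector uses only signals the asynchronous block already exposes, a policy like this would not require changes to the logic itself, which is the property the abstract highlights.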
3. Benefits of an FD-SOI Feature: Optimal Power Budget of Wireless Links
through Phase Noise Tuning, Y. Deval, A. Cathelin, R. Guillaume, A. Ait Ihda, H. Lapuyade and F. Rivet, IMS, Univ. Bordeaux, CNRS, Bordeaux INP, Bordeaux, France; STMicroelectronics, Crolles, France; CNES, Toulouse, France
When it comes to wireless communications, several parameters have a major influence on the power
budget of a given link. Among them is the receiver oscillator phase noise. Indeed, the latter is
responsible for the parasitic down-conversion of unwanted adjacent channels into the
baseband frequency range. As a consequence, these
unwanted signals act as blockers or high noise perturbations from the wanted signal point of view. To
suppress parasitic signal down-conversion to baseband, standards usually impose
stringent phase noise behavior. This signal purity related characteristic, however, is largely
overestimated for most real-life and/or real-time wireless links.
This overestimation leads both to overdesign and to needless system power overconsumption.
On the other hand, fully depleted silicon on insulator (FD-SOI) technologies offer a unique feature:
threshold voltage tuning. This feature is effective over a larger voltage range than in
classical CMOS, and it also exhibits a better voltage-to-threshold conversion slope. A
thorough study of the relation between the threshold voltage of transistors and the phase noise of harmonic
oscillators will therefore be presented and discussed in this paper. This relationship paves the way to dynamic
control of oscillator phase noise behavior under the constraint of an optimal wireless link, namely a
perfectly suited power consumption budget. Hence the power consumption of an overall complex
system, such as a satellite for instance, can be tuned to the absolute minimum value required for a
given link efficiency. This approach, in return, improves the system reliability and increases its
lifetime. Consequently, it optimizes the use cases of a given complex satellite and can yield an
economically affordable ratio of payload cost to return on investment.
Measurement, simulation and theoretical results will support the discussion.
4. Does FDSOI bring added value for new connectivity challenges?, Didier Belot, CEA-LETI, ST-Microelectronics
Connectivity functions are everywhere, linking all other electronic functions: from
the sensors and actuators to the processors and microcontrollers, from the sensor nodes to the
gateways, from the gateways to the cells, from the cells to the data centers, and all over the world.
Inside each of these units, connectivity links the computers to the memories, the cores of multicore
processors in high-performance computing applications, and the peripheral devices to the central computing
units. Connectivity functions can be differentiated by range and by nature. By
nature, such functions are wireless (in radio frequency, mmW, THz bands, or visible light) or
wireline (in copper or optical fiber). By range, they can be sorted according to
distance: the ultra-short range covers µm to cm distances, the short range is under 100 m, and the
long range covers distances over 100 m. In this talk we will focus on wireless functions and will
explore the capability of FDSOI technologies to answer these connectivity challenges for the next
5-10 years of the communication roadmap.
A Special Session for IFIP/IEEE VLSI-SoC takes the form of a session that includes several
presentations (3 or 4) all of them targeting a hot topic. The special session has an Organizer
that must submit a document describing the structure of the session.
Structure of a Special Session proposal
The document submitted must have the following structure:
Title of the session
Organizer and affiliation
List of Special Session presentations (3 or 4), with the following information for each of them:
Title of the presentation
Short Abstract (maximum of half a page).
Submission of a Special Session and evaluation process
A Special Session proposal must be submitted through the Conference
website, using the following form:
The submitted special sessions will be evaluated by the Organizing Committee, including
the Program Chairs and Special Session Chairs.
The contributors in a Special Session are not obliged to produce a paper to be published in
the Conference proceedings.
If the authors of a presentation proposed in a Special Session wish to publish a
paper in the Conference Proceedings, they should also submit the paper to the
corresponding regular track. The paper will be evaluated by the Technical Program
If the paper is accepted, it will be presented in the corresponding Special Session if this
session is accepted, or in a Regular Session otherwise. In both cases, the paper will appear
in the Proceedings and it will be evaluated as a candidate for a book chapter publication or
the Conference Best Paper Award.