Synchronizing multiple cores on a single chip or in a system requires atomic operations and hardware that enforces these operations.
The motivation for atomic operations comes from the need to synchronize two or more entities that share a resource. A simple example is two threads in a multithreaded, RTOS-based system, both of which occasionally need to send a message to a UART peripheral that they share.
Ideally, one thread gains control of the UART and holds that control until its entire message has been sent. Otherwise, if another thread decides to send a message before the first is complete, the two threads might send alternating characters through the UART, resulting in unintelligible gibberish coming out of the console port.
To solve this problem, we can define a resource, as simple as a single memory location, that contains a data word that indicates whether the UART is busy or available. Let’s say that a 0 in this location indicates the UART is free, and a 1 means that it is in use.
When thread 1 has a message to send, it reads the lock word. It sees that it is 0, so it writes a 1 to this location. Any other thread that wants to send a message will find this lock or semaphore contains a 1 and will wait until it is clear before it tries to grab the resource for its own use.
A problem arises if the process of checking and setting the lock is interrupted.
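Consider the following case: Task 1 reads the lock word and sees 0, but is preempted before it can write a 1. Task 2 then reads the lock word, also sees 0, writes a 1 and starts sending. When Task 1 resumes, it acts on the stale value it read earlier and writes a 1 as well.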
Now both Task 1 and Task 2 are certain that they have unique control of the UART resource and will send characters to it, interleaved, as they see it become available on a character by character basis.
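The failure described above can be replayed by hand on a host machine. The following is only a minimal sketch: the names uart_lock, lock_read, lock_claim_if_free and replay_race are hypothetical, and the "preemption" is simulated by simply ordering the calls the way an unlucky scheduler might.

```c
#include <stdbool.h>

/* Hypothetical lock word: 0 = UART free, 1 = UART in use. */
static volatile int uart_lock = 0;

/* Step 1 of the naive scheme: read the lock word. */
static int lock_read(void)
{
    return uart_lock;
}

/* Step 2 of the naive scheme: claim the UART if the value read was 0.
   Nothing ties this step to step 1, which is exactly the problem. */
static bool lock_claim_if_free(int observed)
{
    if (observed != 0) {
        return false;
    }
    uart_lock = 1;
    return true;
}

/* Replays the bad interleaving and returns how many tasks
   believe they own the UART afterwards. */
static int replay_race(void)
{
    uart_lock = 0;

    int seen_by_task1 = lock_read();   /* Task 1 reads 0 ...            */
                                       /* ... and is preempted here.    */
    int seen_by_task2 = lock_read();   /* Task 2 also reads 0           */
    bool task2_owns = lock_claim_if_free(seen_by_task2);

    /* Task 1 resumes and acts on its stale reading. */
    bool task1_owns = lock_claim_if_free(seen_by_task1);

    return (int)task1_owns + (int)task2_owns;
}
```

In this replay both tasks end up believing they own the UART, because the read and the write are two separate steps with a gap between them.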
What is needed to fix this problem is an atomic operation to ensure that the complete transaction of checking and setting the lock word is completed before any other agent, even one with higher priority, can interrupt the ongoing critical operation.
An atomic operation is simply one that is completed in one uninterrupted sequence, even if it is a complex sequence of events.
Support for atomic operations is built into language standards such as C11 and beyond. However, the actual mechanisms used to implement this at the machine instruction level and hardware level must be present in the Instruction Set Architecture (ISA) of a CPU and the system hardware implementation to ensure correct operation.
In the case described above, Task 1 must complete all of the steps of testing and setting the lock before any other task or entity can have access to the lock word in memory.
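As a rough illustration of what this looks like at the language level, the sketch below uses the C11 <stdatomic.h> compare-and-exchange operation to test and set the lock word in one indivisible step. The names uart_lock, uart_try_acquire and uart_release are assumptions made for this example, not part of any RTOS API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word for the shared UART: 0 = free, 1 = in use.
   Static atomic objects are zero-initialized, so the UART starts free. */
static atomic_int uart_lock;

/* Try to take the UART. The compare-and-exchange checks for 0 and
   writes 1 as a single atomic operation, so no other task can slip
   in between the test and the set. Returns true on success. */
static bool uart_try_acquire(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&uart_lock, &expected, 1);
}

/* Give the UART back so another task can claim it. */
static void uart_release(void)
{
    atomic_store(&uart_lock, 0);
}
```

A task would call uart_try_acquire() before sending, retry or block if it returns false, and call uart_release() once its whole message is out. How the compiler implements the operation depends on the target ISA.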
In a single core system, this can generally be ensured by simply disabling all interrupts before the lock word is read, and re-enabling them once the lock is written. In this way, no interrupts can occur, and therefore nothing can cause the kernel to preempt Task 1 in the middle of its test-and-set operation.
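A minimal sketch of this single-core approach is shown below. The functions irq_save_and_disable and irq_restore are hypothetical stand-ins for whatever the target provides (on a Cortex-M part they would typically be built on CMSIS intrinsics such as __get_PRIMASK, __disable_irq and __set_PRIMASK). Here they only toggle a simulated flag so the example can run on a host.

```c
#include <stdbool.h>

/* Simulated interrupt-enable state so the sketch runs on a host. */
static bool irq_enabled = true;

/* Hypothetical port layer: disable interrupts and return the
   previous state so that nested critical sections restore correctly. */
static bool irq_save_and_disable(void)
{
    bool was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

/* Hypothetical port layer: restore the state saved above. */
static void irq_restore(bool was_enabled)
{
    irq_enabled = was_enabled;
}

/* Lock word for the shared UART: 0 = free, 1 = in use. */
static volatile int uart_lock;

/* Test and set the lock with interrupts masked, so the kernel
   cannot preempt the task between the read and the write. */
static bool uart_try_acquire(void)
{
    bool saved = irq_save_and_disable();   /* no preemption from here ... */
    bool got_it = (uart_lock == 0);
    if (got_it) {
        uart_lock = 1;
    }
    irq_restore(saved);                    /* ... to here */
    return got_it;
}

static void uart_release(void)
{
    uart_lock = 0;
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling interrupts, is a design choice in this sketch that keeps the function safe to call from code that already runs with interrupts masked.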
Of course, in a single core system with other bus masters, such as DMA controllers and peripheral devices that can write memory directly, this guarantee could be violated by one of those bus masters writing the lock value at just the wrong time. This can be prevented by ensuring that the memory where lock variables are stored is not accessible to any other bus master in the system.
Now we look at the case where you have multiple computation elements on the same chip, like a multicore chip, where a lock in memory is used to ensure that threads running on the different cores also can share resources without clobbering each other.
In this case we cannot prevent the different bus masters from accessing the lock location. This is because the lock location is what each of the cores must be able to read and set to ensure correct sharing of common resources like the UART described above.
Now we must ensure that one core can complete its read-modify-write sequence once started, before any other core or bus master may have access to the memory location where the lock is being set.
In Version 8.1 and later of the Arm architecture (Armv8.1-A), new atomic instructions were added for this purpose, and I will focus this example on them. One such instruction is LDADD, together with its variants. This instruction reads a value from memory, adds the value from one of the core's registers, and writes the result back to memory as a single atomic operation, returning the value originally read in a destination register. No other core or bus master can access the location until the entire operation is complete.
In this way, the system can guarantee that no other bus master can modify the value in memory in such a way that both masters think they have ownership of the shared resource.
After this instruction completes, the code can check the value that was read to verify that the resource was in fact available before the operation started, and therefore that this core is now its sole owner.
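The same read-add-write-then-check pattern can be sketched in portable C11 with atomic_fetch_add. When GCC or Clang targets Armv8.1-A or later (for example with -march=armv8.1-a), this call is typically compiled to a member of the LDADD family, although that mapping is a compiler decision and an assumption of this sketch. The names uart_lock, uart_try_acquire and uart_release are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word for the shared UART: 0 = free, nonzero = claimed.
   Static atomic objects are zero-initialized, so the UART starts free. */
static atomic_int uart_lock;

/* Atomically add 1 to the lock word and inspect the value that was
   in memory before the add. Only the caller that saw 0 owns the UART. */
static bool uart_try_acquire(void)
{
    int previous = atomic_fetch_add(&uart_lock, 1);
    if (previous == 0) {
        return true;            /* resource was free: we are the owner */
    }
    /* Someone else already holds it: undo our increment and back off. */
    atomic_fetch_sub(&uart_lock, 1);
    return false;
}

/* The owner removes its own increment to free the UART. */
static void uart_release(void)
{
    atomic_fetch_sub(&uart_lock, 1);
}
```

Because a losing contender briefly raises the count before undoing it, an attempt made at that moment can fail even though the owner has just released the lock. Ownership is still never granted to two callers at once, since only a caller that observed 0 is told it owns the UART.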
Real world implications
The good news is that this is all taken care of in the lower levels of system code if you use an RTOS or operating system to manage your threads, in either a single-core or multi-core environment. It is useful, however, to understand that the underlying instruction set and memory hardware must be designed to support these lockout mechanisms for this all to work. If these mechanisms are not designed correctly, or are misused by manipulating registers directly, it is possible for multiple cores to inadvertently gain simultaneous control of resources that are intended to be held exclusively while in use. Debugging these kinds of situations requires advanced multicore debugging capability, where the code running on multiple cores in a system can be observed and controlled.
Debugging multicore synchronization
A multicore debugger can help find synchronization issues by showing the programs running on multiple cores or threads. The ability to selectively stop and start cores based on breakpoints hit on another core is ideal for tracking down problems with this kind of mechanism.
In Figure 1, we can see an NXP i.MX 8 with four CPUs in IAR Embedded Workbench. All cores can be started and stopped individually.
Figure 1: Debugger control for each Core independently. (Source: IAR Systems)
Figure 2 shows the use of multiple breakpoints in code running on different CPUs, combined with the use of a mutex (example provided by Arm): _mutex_acquire() and _mutex_release() set and clear the flag that blocks access to the object used in the prime number calculations.
Figure 2: Use of mutexes and breakpoints in individual cores. (Source: IAR Systems)
One of the most common mistakes is the misuse, or lack of use, of the Cross Trigger Interface (CTI). For Arm, the CoreSight CTI is connected to each core through a Cross Trigger Matrix (CTM). The CTI enables the debug logic, ETM trace unit and PMU to interact with each other and with other CoreSight components, which makes it possible to stop and reset each core independently. Resorting to a “homemade” CTI workaround, with cores controlled and halted manually, perhaps using macros written on the fly, is a nearly impossible task. This should be handled by default by a good debugger, both on the probe side (CTI interface signals) and on the software debug side. Figure 3 shows a use case with full control of the CTI.
Figure 3: Full control with the Cross Trigger Interface (CTI). (Source: IAR Systems)
Once everything comes together, a debugger with multi-core support can control cores in asymmetric and symmetric scenarios, and even in a combination of the two. Figure 4 shows an NXP i.MX 8 device with 4x Cortex-A53 and 1x Cortex-M4 running. The MCU and MPU cores can be halted, monitored and controlled independently. While all four Cortex-A53 cores, or just one of them, are running from the master session, it is possible to set breakpoints on the Cortex-M4 partner side and focus on that application, which might be running the security monitor of the complete device.
Figure 4: Multi-core session running on an NXP i.MX 8 device with 4x Cortex-A53 and 1 x Cortex-M4. (Source: IAR Systems)
Parallelism and concurrency in an application are aimed at using the available cores more efficiently. This comes, however, at the price of added complexity in the application and in how the source code is split into smaller pieces that run as efficiently as possible.
Synchronizing multiple cores on a single chip or in a system requires atomic operations and hardware that enforces these operations. When this HW/SW combination is first being developed, a full-function debugger that supports multi-core debugging and observation can be critical in finding problems with such a system. It is hard to imagine achieving the same control by scattering print statements all over the code and keeping everything in perfect synchronization. Every developer deserves a debugging solution that can handle multiple cores and give full control over all threads. IAR Embedded Workbench, with its debugger capabilities, provides just such a tool, which can be invaluable in developing and debugging these complex systems.
This article was originally published on Embedded.
Aaron Bauch is a Senior Field Application Engineer at IAR Systems working with customers in the Eastern United States and Canada. Aaron has worked with embedded systems and software for companies including Intel, Analog Devices and Digital Equipment Corporation. His designs cover a broad range of applications including medical instrumentation, navigation and banking systems. Aaron has also taught a number of college level courses including Embedded System Design as a professor at Southern NH University. Mr. Bauch holds a Bachelor’s degree in Electrical Engineering from The Cooper Union and a Master’s degree in Electrical Engineering from Columbia University, both in New York, NY.