multicore 101: migrating embedded apps to multicore with linux
DESCRIPTION
Joint presentation with Ian Forsyth of Freescale Semiconductor (2008)
TRANSCRIPT
Multicore 101: Migrating Embedded Applications to a Multicore Environment with Linux
Presented by MontaVista Software and Freescale Semiconductor
Ian Forsyth, Senior Enablement Architect, Freescale Semiconductor
Brad Dixon, Director of Product Management, MontaVista Software
Attend Vision for more in-depth multicore sessions www.mvista.com/Vision
►The Challenge in Migrating Applications
• The “Net Effect”
• Changing networking topology
• The multicore challenge
►Proposed Multicore Solutions
• Combined hardware/software
• Virtualization and hypervisor
►The Pathway to Migrating Your Applications
• Contain – Exploit – Analyze – Optimize
• Use the right tools
►Learn more and evaluate multicore solutions
• Evaluate MontaVista TestDrive: Freescale + MontaVista Linux
Agenda
Multicore 101
The “Net Effect”
Network Admission Control
Service Provider Routers
Storage Networks
Unified Threat Management
IMS Controller
Integrated Services Routers
Access Point Aggregation
Serving Node Router (GSN)
Metro Carrier Edge Router
Access Gateway
TelePresence
Wireless
IP Services
Enterprise
Converged Networking
SSL, IPSec, Firewall
Networking trends drive the need for more performance
Multicore 101
The Changing Networking Topology
►Layer 4-7 (application) processing in the network is now common
►Increasing integration in datacom deployments
►Both are driving demand for higher computational capability from hardware vendors
Multicore 101
Why Multicore in Embedded Networks?
►Demand for differentiating features
►Advanced services are implemented in software running on general purpose CPUs
►Frequency scaling of CPU cores is no longer viable, primarily due to power
►Multicore processors are viewed as the most viable approach
[Chart: Power versus Performance Requirement for 1xCPU and nxCPU, with the Device Hot-spot Power Limit marked]
Multicore 101
The Multicore Challenge – It’s All About the Software
►Multicore silicon devices have raced ahead of the embedded software market’s ability to support them
►Millions of lines of single-threaded legacy code will need to be rewritten in a parallel fashion in order to utilize multicore devices
►This creates a paradigm shift in how developers must think about and implement future programs
►There are no automated or “quick-fix” approaches for this software migration and paradigm shift – significant programmer effort is required
►Tools and support – simulators, compilers, OS, virtualization packages, performance profilers, debuggers, example applications and training will all be key to the widespread adoption of multicore solutions
[Diagram: Single-threaded Legacy Software versus Multicore Software, each running on Power Architecture™ cores with D-Cache, I-Cache, and L2 Cache]
Multicore 101
Multicore Tools and Solutions
Market-specific multicore stacks, apps, libraries. Support green field.
Software Pyramid
Support for standard and OS-dependent programming models, often leveraging multiprocessor.
Base multicore infrastructure: Operating System, boot standards.
First-rate tools: debuggers, performance and trace analyzers, simulators, compilers.
SMP/AMP OS’s
Advanced Debug
Libraries
Early Code Partitioning Hardware & Software Hypervisor
Stacks N/W Accel
Multicore 101
Hypervisor
Optimized High-Speed Drivers
Applications
Freescale QorIQ™ Silicon
Performance Model
QorIQ™ Solution Platforms
Simulation to Hardware: Same Software
Freescale-supplied
Functional Model
API
IDE (compiler / debugger / build tools)
Simics
Virtualized Development
Environment
Hypervisor
Optimized High-Speed Drivers
Applications
Multicore 101
Hybrid Functional/Performance Simulator
[Diagram: an API spans the Functional Model and the Performance Model, built from CPU, RAM, ROM, I/O, Bus, Ethernet, and Hardware Acceleration blocks. Along the Simulated Time axis: high-speed Functional Mode Simulation, Periodic Checkpoints, and Performance Mode Simulation.]
Multicore 101
A Hybrid Model: Functional, Performance
Virtualization for Reduced Cycle Time
Single Simulation Environment
Core, SOC, Boards, Systems
MPC8360/MPC8641D, MPC8548/MPC8572, Multicore Platform/ …
e200, e300, e500, e600, …
Freescale, with Virtutech and MontaVista, provides a multicore development platform that accelerates software development before and after silicon availability
Provides programmer's view of the SoC
Deterministic
Non-invasive
Control of time
Systematic control of validation and error
Control of cores
Control of configuration
Force and detect race conditions
Optimized solutions
Products and Systems
Multicore 101
MPC8641/40D Dual Core Block Diagram
►Dual e600 PowerPC cores @ 1.25/1.0 GHz
• 1MB L2 Cache w/ECC per core
• 36-bit physical addressing
►System Unit
• 64b DDR/DDR2 w/ECC
• 4x 10/100/1000 Ethernet Controllers
►High-speed Interfaces
• 1x/4x SRIO (2.5GB/s) and x1/x2/x4/x8 PCI-Express (4GB/s)
• OR two x1/x2/x4/x8 PCI-Express (8GB/s)
►Pin and Software compatible to MC8641D
►Max Power (Watts)
• 31.0 W @ 1.25 GHz
• 21.0 W @ 1.00 GHz
►Production Availability
• 0 to 105C – Now
• -40 to 105C – Q408
►MontaVista commercial support
• Professional Edition 5.0
• Carrier Grade Edition 5.0
Multicore 101
QorIQ™ P4080 Multicore
Features
• Eight e500mc cores
• CoreNet™ scales to 32 cores
• PCI Express® 2.0, 10GbE
• PME 2.0, SEC 4.0
• Data path acceleration
• Trust/secure boot
• Hypervisor
• Standardized debug
• Virtualization with real applications
• High-performance SoC
• Advanced technology
• Tier one partnerships
• Outstanding ecosystem
• MontaVista Linux support
►Innovative Multicore Micro-architecture for unprecedented computing efficiency, performance and scalability.
• On-chip coherency fabric
• Back-side cache per CPU core
• On-demand application acceleration
►Multicore Simulation Environment for accurate, fast code development and debugging.
• Fully tap the capabilities of the multicore platform
• Debug software not hardware
• Dynamic, real-time debug with non-intrusive capture
►45-nm Process Technology for industry-leading power-to-performance solution.
• Provides highest instructions-per-cycle (IPC) and frequency for given Milliwatt/area
It’s a smarter approach to multicore. Freescale’s Multicore Platform
Multicore 101
Datapath Acceleration Architecture
Congestion Mgmt
Parse
Classify
Steer
Policing
Stash Context Enqueue
Manage Work Q
QMan BMan
FMan
QorIQ™ P4 Platform DPAA
Datapath Acceleration Architecture simultaneously enables a lower complexity software environment as well as very high networking performance
Cores
Accelerators
Network Interfaces
Multicore 101
Multicore Operating Systems
►Wide variation of customer use-cases
• Multiple operating systems utilized across cores on a single device
• Proprietary, 3rd party and Open Source multicore operating systems
• Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP), often running concurrently
• Often no OS, or engineered light OS, used on forwarding/data plane cores
►Leverage Power Architecture™ technology’s 3rd party OS ecosystem
• Freescale embedded Hypervisor
• Freescale boot standards, including u-boot
• Leverage open boot protocol and API standards (e.g. Power.org™)
• Freescale Light Weight Executive (LWE) for run to completion data plane processing
• Demonstrate performance and provide reference example for customers
[Diagram: eight Power Architecture™ cores split between a Forwarding/Data Plane (Services on the Light Weight Executive) and a Control Plane (MontaVista Linux®), with partitions labeled SMP, AMP, AMP]
Multicore 101
Light Weight Executive Summary
►The LWE provides a set of services and abstractions to an application
►Focus is on run-to-completion model
►Freescale provides example applications to demonstrate the use of the LWE
►The LWE helps Freescale customers and partners develop functionality using cores as highly optimized accelerators
[Diagram: an Application on the Light Weight Executive interacting with software on other cores, e.g. running Linux®]
Multicore 101
Hypervisor Contrasts
Freescale Hypervisor Implementation
Requirement: isolation, performance
Implications: no more than one OS per core; OS has direct control of high-speed peripherals
Traditional Hypervisor Implementation
Requirement: solves problem of under-utilized CPUs, plus isolation
Implications: more than one OS per core, complexity, performance implications
QorIQ™ P4080 hypervisor hardware assists in meeting both requirement sets
Multicore 101
Natural Virtualization via QorIQ™ P4080 Datapath
[Diagram: two Power Architecture™ cores, each with its own portal into the P4080 Datapath, which connects to the Network Interface]
Cores can access the same network interface with no SW synchronization because cores have their own portals
►Datapath decouples cores and peripherals – allows N cores to share M peripherals
►Accessed by “Portals” that are per-core
►Allows direct and efficient access by cores to many high-speed peripherals
Multicore 101
Solution
Hypervisor
Drivers
Applications
Freescale QorIQ™ Silicon
Example Apps
Stacks
IPC
High Level IPC
LWE
Hypervisor
Drivers
Applications
Freescale QorIQ™ Silicon
Partition Mgmt.
Stacks
IPC
High Level IPC
MontaVista
Linux
Freescale
3rd Party and/or Customer
Solution = Freescale software + ecosystem software + customer software
Multicore 101
Market Analysis
Source: Embedded Systems Design Survey
“Developers overwhelmingly voted for the chip's software-development tools as the most important thing when evaluating a new embedded processor.”
“The most valuable feature of a chip isn't even the chip itself. Compilers and debuggers trump MIPS and megahertz.”
- Jim Turley, ESD
Multicore 101
Migrating to Multicore: What is the pathway?
►Contain
►Exploit
►Analyze
►Optimize
Multicore 101
Containment
Goal: Migrate application codebase to multicore platform without disruption
►Risk – concurrent execution will expose latent race conditions and synchronization issues
►Technique – utilize Linux's processor and interrupt affinity APIs to contain your application's threads and processes to a single core
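A minimal sketch of the containment technique, assuming a Linux target with Python available. It uses the standard-library call os.sched_setaffinity, which wraps the Linux sched_setaffinity system call. The helper name contain_to_single_core is hypothetical and used only for illustration; it is not part of any MontaVista or Freescale API.

```python
import os


def contain_to_single_core(pid=0):
    """Pin `pid` (0 means the caller) to one of its currently permitted CPUs.

    Hypothetical helper for illustration; returns the CPU that was chosen.
    """
    allowed = os.sched_getaffinity(pid)   # CPUs the task may currently run on
    cpu = min(allowed)                    # pick the lowest-numbered permitted CPU
    os.sched_setaffinity(pid, {cpu})      # restrict the task to that CPU only
    return cpu


if __name__ == "__main__":
    chosen = contain_to_single_core()
    print("contained to CPU", chosen, "->", os.sched_getaffinity(0))
```

On Linux the affinity mask is a per-thread attribute that newly created threads and child processes inherit, so a call like this is best made early, before workers are started. An unmodified binary can be contained the same way from the shell with taskset, for example taskset -c 0 followed by the command.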
Multicore 101
Containment
[Diagram: Your App, Housekeeping Utilities]
Multicore 101
Containment
[Diagram: Housekeeping Utilities, Your App]
Multicore 101
Containment
Benefits:
►Delay exposing latent concurrency defects
►Easily gain an efficiency boost by exploiting available cores
►I/D/L2 cache efficiency by minimizing scheduler bounces
[Diagram: Housekeeping Utilities, Your App]
Multicore 101
The designer can explicitly control which CPUs are permitted to handle particular threads and interrupts
Migration with Containment
Shown on Freescale 8641D multicore processor
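The interrupt side of this control can be sketched as follows, assuming a Linux kernel that exposes /proc/irq/<N>/smp_affinity and a system small enough that a single hexadecimal word covers all CPUs. Both function names are hypothetical, any IRQ number passed in is a placeholder for one read from /proc/interrupts on the actual board, and writing the file requires root.

```python
def cpu_mask_hex(cpus):
    """Return the hexadecimal CPU bitmask (bit n set means CPU n is allowed)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask for `irq` to <proc_root>/irq/<irq>/smp_affinity (needs root)."""
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(cpu_mask_hex(cpus))
    return path


if __name__ == "__main__":
    # Show the masks for "CPU 1 only" and for "CPUs 0 and 1".
    print(cpu_mask_hex([1]), cpu_mask_hex([0, 1]))
```

On a dual-core part, steering a network interrupt to one core and pinning the application to the other is one way to keep interrupt handling from evicting the application's cache lines; whether that helps is something to measure rather than assume.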
Multicore 101
►Why SMP?
►Linux's long march to multicore
►On virtualization
A Quick Sidebar…
Multicore 101
►Multicore CPUs can permit a number of processing scenarios
►SMP maximizes run-time flexibility to match CPU to the needs of the moment
►SMP ends up playing a role in many system architectures
►Combined with a hypervisor, SMP does not exclude any other design options
Why SMP?
Multicore 101
Linux’s Long March to Multicore
►Linux has been MC ready for years
►Kernel, drivers, protocol stacks, and apps are ready
►As core count scales the focus shifts to exploiting MC at the application layer
Multicore 101
►Difficulties applying virtualization to telecom/datacom
• The isolation vs. latency trade-off
• Hardware contention
• I/O devices
►Hardware support minimizes virtualization overhead
On Virtualization…
Multicore 101
►SMP is the natural way for Linux to exploit multicore processors.
►Hypervisors can permit new flexibilities
►New hardware features are making hypervisor based architectures more efficient to use
Sidebar Summary
Multicore 101
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
►Analyze
►Optimize
Migrating to Multicore: What is the Pathway?
Multicore 101
Goal: Identify code that will benefit from multicore execution and modify code to exploit available cores
Exploit
Multicore 101
Objective: scale efficiently across multiple cores so that more client work can be handled rapidly
►Key question is how to map client requests (or packets) to workers quickly and obtain speed-up from multicore
Application Architectures to Exploit MC
Multicore 101
►Each request requires a small amount of work
►Requests are largely independent of each other
►Requires read-only access to a moderate amount of state
►Small amount of state may travel with the request
►Must be able to manage overload effectively
Application Characteristics
Multicore 101
►Some anti-patterns
• Non-concurrent
• Process/Thread per client
• Spawn process/thread per request
• HPC message passing such as MPI
Application Characteristics
Multicore 101
For telecom/datacom applications an event driven architecture is ideal to facilitate multicore migration
Application Characteristics
Multicore 101
►Similar to that used by memcached & Apache
►Dispatcher can handle overload, monitoring, etc.
►Multicore awareness only for central services
►Pluggable Dispatcher is feasible if planned correctly
►Managing global, per service, per session, and per request state is the battleground for scalability
Sample Application Architecture
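The dispatcher-and-workers shape described on this slide can be sketched roughly as below. This is an illustrative structure only, not the architecture of memcached or Apache, and not code from the presentation: the class name Dispatcher, its parameters, and the load-shedding policy (reject when the bounded queue is full) are all assumptions.

```python
import queue
import threading


class Dispatcher:
    """Hand independent requests to a fixed pool of workers through a bounded queue."""

    def __init__(self, handler, workers=4, max_pending=64):
        self._handler = handler
        self._pending = queue.Queue(maxsize=max_pending)  # bound = overload limit
        self.results = queue.Queue()
        self._threads = [
            threading.Thread(target=self._work, daemon=True) for _ in range(workers)
        ]
        for t in self._threads:
            t.start()

    def submit(self, request):
        """Queue a request; return False (shed load) when the queue is full."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False

    def _work(self):
        while True:
            request = self._pending.get()
            if request is None:           # sentinel: stop this worker
                break
            self.results.put(self._handler(request))

    def shutdown(self):
        """Let queued requests finish, then stop and join every worker."""
        for _ in self._threads:
            self._pending.put(None)
        for t in self._threads:
            t.join()
```

Only the dispatcher knows about the worker pool; handlers see one request at a time, which mirrors the point that multicore awareness is confined to central services. In standard CPython the threads show the structure only: CPU-bound handlers would need worker processes, or a language with native threads, to gain real multicore speed-up.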
Multicore 101
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
►Optimize
Migrating to Multicore: What is the Pathway?
Multicore 101
Goal: Understand MC performance bottlenecks and diagnose unexpected faults
►Benchmark first... the bottlenecks may not be where you think they are
Analyze
Multicore 101
Profiling
• Can be used for far more than CPU cycles per function or line
• e500mc core has a rich set of performance attributes it can monitor
• MontaVista DevRocket can use oprofile to collect and correlate this data to your code
Runtime Monitoring
• “top” in SMP mode will give you a broad overview of CPU stats
Tracing
• Fine grained CPU-aware tracing
Analysis Tools
Multicore 101
MontaVista DevRocket Analysis Tools
Multicore 101
Per process & thread information
• Time in nanoseconds
• Time consumed since process start.
See: /proc/<PID>/tasks/<TID>/msa for per-thread information
# cat /proc/1845/msa
State: Interruptible
Now: 2287392468035
ONCPU_USER 1473381312
ONCPU_SYS 3110032766
INTERRUPTIBLE 1183737626438
UNINTERRUPTIBLE 1011435
INTERRUPTED 546291
ACTIVEQUEUE 2217218048
EXPIREDQUEUE 0
STOPPED 0
ZOMBIE 0
SLP_POLL 0
SLP_PAGING 0
SLP_FUTEX 0
CGE5 Only: Microstate Accounting
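As a rough sketch of how the microstate figures above might be post-processed, the following parses text in the format shown on the slide and reports each counter as a share of the accounted time. The function names are hypothetical, the sample is an excerpt of the slide's output, and the parser assumes every counter line is a name followed by one integer.

```python
def parse_msa(text):
    """Parse microstate-accounting text into (state, now, {counter: nanoseconds})."""
    state, now, counters = None, None, {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue                      # skip blank or unexpected lines
        key, value = parts
        if key == "State:":
            state = value
        elif key == "Now:":
            now = int(value)
        else:
            counters[key] = int(value)
    return state, now, counters


def time_shares(counters):
    """Return each counter as a fraction of the total accounted time."""
    total = sum(counters.values())
    return {k: v / total for k, v in counters.items()} if total else {}


SAMPLE = """\
State: Interruptible
Now: 2287392468035
ONCPU_USER 1473381312
ONCPU_SYS 3110032766
INTERRUPTIBLE 1183737626438
ACTIVEQUEUE 2217218048
"""

if __name__ == "__main__":
    state, now, counters = parse_msa(SAMPLE)
    for name, share in sorted(time_shares(counters).items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {share:7.3%}")
```

Sampling the file twice and subtracting the counters would give the time spent in each state over an interval rather than since process start.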
Multicore 101
Debug process, thread, and kernel context
Debug “Multi-Anything”
DevRocket IDE
Multicore 101
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
• Use available profiling, tracing, and performance monitoring tools and APIs
►Optimize
Migrating to Multicore: What is the Pathway?
Multicore 101
Goal: Get the most from the available MC performance
►Focus attention on areas where Amdahl's law indicates the most benefit can occur!
►Leverage data parallelization for CPU bound computations
►Utilize interrupt and process/thread affinity to tune the system
Optimize
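For reference, Amdahl's law bounds the speed-up from n cores at 1 / ((1 - p) + p / n), where p is the fraction of run time that can execute in parallel. A small sketch follows; the function name and the sample fractions are illustrative choices, not figures from the presentation.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speed-up when `parallel_fraction` of run time scales across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


if __name__ == "__main__":
    for p in (0.50, 0.90, 0.99):
        print(p, [round(amdahl_speedup(p, n), 2) for n in (2, 8, 32)])
```

Even with 90% of the work parallelized, eight cores yield a bound of roughly 4.7x, which is why shrinking the serial portion that profiling reveals often matters more than adding cores.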
Multicore 101
►Contain
• Migrate to multicore but contain code to a single core
►Exploit
• Use an event driven architecture to add explicit functional parallelism
►Analyze
• Use available profiling, tracing, and performance monitoring tools and APIs
►Optimize
• Specialize cores as needed. Explore other MC optimizations
Migrating to Multicore: What is the Pathway?
Multicore 101
►Carrier Grade Edition 4.0
• 8572
• 8641D, 8640D
►Carrier Grade Edition 5.0
• 8641D, 8640D
►Professional Edition 4.0
• 8641D, 8640D
►Professional Edition 5.0
• 8572
• 8641D, 8640D
Freescale P4080 operating today on the Virtutech Simics simulator in advance of hardware availability
MontaVista offers comprehensive support of Freescale Power Architecture processors today
MontaVista Support for Freescale Multicore
Multicore 101
Two Ways to Learn More About Multicore
MontaVista TestDrive: Evaluate Freescale multicore and MontaVista Linux for free, visit: www.mvista.com/freescale/eval
MontaVista Vision: October 1-3, 2008, San Francisco, CA. Where embedded Linux gets real. For more information on in-depth multicore sessions, visit: www.mvista.com/vision