Revenue-generating routers require a technology platform that provides significant packet processing power (even under worst-case traffic conditions) while offering flexibility at lower cost. The rapid advancement of FPGA technology has made it possible to design entire routers and switch blades based on FPGAs. Today's Platform FPGAs provide a complete platform for packet processing, classification and policing, traffic management, backplane communication, etc.
Network processing
Network processors are highly optimized off-the-shelf devices that process network data traffic, offering faster time to market and greater flexibility than the incumbent ASICs. They extract, classify and filter incoming bit-streams, determine destination ports, and forward data packets to the switch matrix with optional traffic management functions.
Figure 1: Line card functional block diagram.
To achieve the performance required for packet processing, several vendors approach the problem by breaking the functions (shown in Figure 1) into:
- Classification co-processor - assigns a packet to a flow.
- Policing engine - ensures that a flow does not use more bandwidth than allocated in its SLA (service level agreement). Policing is typically performed at the edge of a QoS (quality of service) network, and non-compliant packets are either dropped or marked for later action.
- Traffic manager - enforces SLAs for that flow. Generally, packets from different flows with varied SLAs are reordered and often dropped. Packets within a flow are never reordered.
Traffic management, which includes traffic shaping, queuing and scheduling, is the most bandwidth-intensive and critical function in the network processing flow. Traffic shaping helps manage congestion and deal with the bursty nature of network traffic. Queuing and scheduling engines determine the departure time and ordering of packets. They create hierarchical queues to aggregate flows into classes and classes into ports. Each level of the hierarchy can use a different queuing algorithm to prioritize the various flows.
Typically, traffic managers are standalone chips that perform shaping, queuing and scheduling based on the set of governing policies determined by the classifier. They provide fine-grain QoS and maintain SLAs. An external processor may be required only to set up or tear down flows, not on a per-packet/cell basis.
Every system differs in traffic management protocols, memory management, payloads, interfaces, etc. Typical policing algorithms include leaky bucket and token bucket. Congestion management algorithms include random early detect (RED) and weighted RED (WRED). Scheduling algorithms include priority queuing (PQ), fair queuing (FQ) and weighted FQ (WFQ).
Off-the-shelf NPUs (network processing units) rarely meet the performance requirements, and typical OC-48c traffic managers for packet over SONET/SDH require separate traffic managers in the ingress and egress paths or a full-duplex (5Gbps) traffic manager. They also rarely support all the required algorithms. Also, investment in ASICs is cost-prohibitive.
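As an illustration of the token bucket policing named above, the following is a minimal software sketch, not a description of any vendor's policing engine. The class name `TokenBucket`, its parameters `rate_bps` and `burst_bits`, and the choice to start with a full bucket are all assumptions made for this example; a hardware policer would typically use fixed-point counters rather than floats.

```python
class TokenBucket:
    """Single-rate token bucket policer (illustrative sketch only)."""

    def __init__(self, rate_bps: float, burst_bits: float) -> None:
        self.rate_bps = rate_bps      # assumed SLA rate, in bits per second
        self.burst_bits = burst_bits  # assumed bucket depth (burst allowance), in bits
        self.tokens = burst_bits      # assumption: the bucket starts full
        self.last_time = 0.0          # arrival time of the previous packet, in seconds

    def conforms(self, now: float, packet_bits: int) -> bool:
        """Return True if the packet is within the flow's allocation."""
        # Refill tokens for the time elapsed, capped at the bucket depth.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        self.last_time = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits  # compliant: spend tokens
            return True
        return False                    # non-compliant: caller drops or marks
```

A caller would invoke `conforms(arrival_time, packet_length_in_bits)` for each packet of a flow and either drop or mark the packet when it returns `False`, matching the two actions described for non-compliant packets.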
Enabling traffic management and backplanes
Traffic management demands high performance, flexibility and support for multiple queuing and scheduling algorithms and protocols, memory types and interfaces. Platform FPGA devices offer the following features that provide key advantages for traffic management:
High-speed interfacing
- Up to 24 embedded MGTs (multi-gigabit transceivers) enable high-speed serial links (up to 10.3125Gbps) with improved noise immunity, lower power, reduced signal count and reduced board complexity.
- These devices also support 17 single-ended and 6 differential standards, required for schedulers using:
· HSTL for high-speed interfacing to framers and memories
· SSTL for interfacing to framers, memories and ASSPs
· PECL for clock inputs/outputs
· LVDS/CML for blade or backplane communication
· PCI for interfacing to CPU chipsets
· LVCMOS/LVTTL for almost everything else
- A large number of package types and a high IO pin count (maximum of 1200) provide the throughput required for interfacing.
- Every pin on the FPGA provides digitally-controlled impedance (DCI), which simplifies board layout by eliminating hundreds or thousands of off-chip terminating resistors. This allows for fewer layers and shorter traces on the PCB, leading to higher system reliability.
DCMs (digital clock managers) and clock distribution trees - The traffic manager interfaces to several external devices and must handle multiple clock domains at different frequencies. DCMs compensate for signal skew due to clock distribution delays and board layout constraints. A DCM and clock tree is typically used for each external high-speed interface. The 12 DCMs provide phase shifting and frequency synthesis, suited to systems with multiple clock domains and critical timing requirements. DCMs support clock outputs above 400MHz to enable leading-edge interfaces such as RapidIO and SPI-4. Being digital, the DCM is impervious to system temperature and voltage variations. The DCM offers a zero-delay clock buffer with precise 50/50 duty cycle generation. Phase control is accurate to within 1 percent of the clock period, which is critical for setup and hold time alignment. The DCM allows precise frequency generation from 24 to 420MHz.
BlockRAM - More than 10Mbit of embedded BlockRAM is ideal for storing frequently accessed objects, thus accelerating performance. The embedded memory enables many applications, such as memory cache, storage for statistics and scratch pad, storage of bitmaps for transmit schedules and buffer management operations, clock domain crossing, and elastic buffers for intra-chip communication.
Multipliers - Traffic managers require intense arithmetic operations for packet scheduling computation. Scheduling involves multiply operations between integers and floating point numbers (Tsi(t+) <= Tsi(t) + Lpkt/ri). Typical algorithms require 18-bit multiply operations at 100MHz performance. Platform FPGAs offer up to 556 18x18 multipliers per device, running at over 300MHz.
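The timestamp update quoted above can be illustrated with a small software sketch. It reads Lpkt as the packet length and ri as the rate allocated to flow i; this is an interpretation, since the article does not define the symbols. The scheduler below is a simplified finish-time scheduler in the style of self-clocked fair queuing, not the author's design, and the names `FinishTimeScheduler`, `enqueue` and `dequeue` are hypothetical.

```python
import heapq
from itertools import count


class FinishTimeScheduler:
    """Serve packets in order of per-flow finish timestamps (sketch only)."""

    def __init__(self, rates):
        self.rates = dict(rates)                    # flow id -> assumed rate r_i (bits/s)
        self.last_finish = {f: 0.0 for f in self.rates}
        self.virtual_time = 0.0                     # finish tag of the last packet served
        self._heap = []
        self._seq = count()                         # tie-breaker: arrival order

    def enqueue(self, flow, length_bits):
        # Timestamp update: new tag = previous tag + L_pkt / r_i,
        # never starting earlier than the current virtual time.
        start = max(self.virtual_time, self.last_finish[flow])
        finish = start + length_bits / self.rates[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), flow, length_bits))

    def dequeue(self):
        # The packet with the smallest finish tag departs next.
        finish, _, flow, length_bits = heapq.heappop(self._heap)
        self.virtual_time = finish
        return flow, length_bits


sched = FinishTimeScheduler({"a": 2000.0, "b": 1000.0})
for _ in range(4):
    sched.enqueue("a", 1000)
for _ in range(2):
    sched.enqueue("b", 1000)
order = [sched.dequeue()[0] for _ in range(6)]
```

With flow "a" allocated twice the rate of flow "b", `order` comes out as a, a, b, a, a, b: the flows share the link roughly 2:1, and packets within a flow keep their arrival order. The division by ri in `enqueue` is the kind of per-packet arithmetic that the embedded multipliers would accelerate in a hardware scheduler.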
The multipliers and logic allow the design of custom hardware accelerator cores for functions such as encryption, checksum calculation and DSP.
Large amount of high-performance programmable logic (up to 10 million gates) and routing resources - The scheduler performs a large number of complex operations at very high speed, and its operands are maintained in registers. Since a scheduling decision must be made in each cycle, deep pipelines are employed, which cause data hazards and hence serious inefficiencies. A large number of on-chip flip-flops are needed to satisfy these design objectives. The FPGA offers logic performance in excess of 300MHz, with ample internal routing resources for the numerous, wide communication paths and for storing linked lists.
PowerPC processor, CoreConnect and tools - Today's Platform FPGAs embed up to four 300MHz (420 D-MIPS) IBM PowerPC cores to assist in functions such as statistics monitoring, control and exception handling. Solutions include the IBM CoreConnect bus for access to peripherals and a smooth hardware and software design environment through the System Generator for PowerPC, GNU compiler and software debugger tool chain, WindRiver VxWorks, etc. Debug tools such as ChipScope Pro are also available.
Conclusion
Brute force alone cannot meet the design objectives of modern packet switching platforms. The new level of performance and features of Platform FPGAs provide a powerful platform for building revenue-generating routers and switches.
Author information
Amit Dhir is a Senior Manager in the strategic solutions marketing group at Xilinx. He has a BSEE from Purdue University, MSEE from San Jose State University, and is working on his MBA with the University of California at Berkeley's Haas School of Business. He may be reached at Dhir@xilinx.com.