Five Challenges in VoIP - BZU PAGES: Find Presentations, Reports, Student's Assignments and Daily Discussion; Bahauddin Zakariya University Multan

**Waqas Ahmed** · #1 23-09-2008, 10:52 AM

CHAPTER NO. 5

CHALLENGES IN VOIP

Voice over Internet Protocol (VoIP) is the convergence of traditional voice onto the data network. Other real-time traffic, such as uncompressed video and streaming audio, is also converging onto data networks. VoIP is complex because it involves components of both the data and the voice worlds. Historically, these worlds have used two different networks, two different support organizations, and two different philosophies. The voice network has always been separate from the data network because voice and data applications have different characteristics. The traditional voice network is circuit switched. Interactive voice traffic is sensitive to delay and jitter but can tolerate some packet loss. The voice philosophy was to ensure the “five nines” of reliability (99.999%), because a lack of communication might be life threatening (e.g., the inability to place a “911” call for help). The data network, on the other hand, is packet switched. Data traffic is less sensitive to delay and jitter but cannot tolerate loss, so the main emphasis is on reliable data transmission over unreliable media, regardless of delay. Bandwidth in the data world is largely shared, so congestion and delay are often present for multimedia applications like voice.
The factors that affect the quality of data transmission are different from the factors that affect the quality of voice transmission. For example, data is generally not affected by delay, whereas voice transmissions are degraded by even small amounts of extra delay and cannot be retransmitted. Conversely, a tiny amount of packet loss does not affect voice quality at the receiver’s ear, but even a small loss of data can corrupt an entire file or computer application. So in some cases, introducing VoIP to a high-performing data network can yield very poor voice quality. Therefore, implementing VoIP requires attention to many factors, including:
• Delay
• Jitter
• Packet loss
• Packet mis-order
• Available bandwidth
• Packet prioritization
• Network design
• Endpoint audio characteristics (sound card, microphone, earpiece, etc.)
• Duplex
• Transcoding
• Echo
• Silence suppression
• Codec selection
• Router data-switch setup
• Reliability
• Scalability
• Manageability
• WAN protocols
• QoS policy
• Encryption/Decryption
5.1 VOICE CODING
Over time, it became obvious that digital coding was more immune to noise corruption on long-distance connections, and the world’s communications systems converted to a digital transmission format called Pulse Code Modulation or PCM. PCM converts voice into digital form by sampling the voice signals 8000 times per second and converting each sample into a code. Standard telephone PCM uses 8 bits for the code and thus consumes 64,000 bps per call. Another telephone voice standard called Adaptive Differential PCM or ADPCM codes voice into 4-bit values and so consumes only 32,000 bps. ADPCM is often used on long-distance connections. In traditional telephony applications, PCM or ADPCM is used on synchronous digital channels, which means that there is a constant stream of bits generated at the specified rate, whether there is conversation or not. There are, in fact, hundreds of brief silent periods in the average call, and each of them wastes bandwidth and money. On standard telephone connections, there is no alternative to this waste.
In packet voice applications, speech is transported as “data” packets, and these packets are generated only when there is actual speech to transport. The elimination of wasted bandwidth during periods of silence will, by itself, reduce the effective bandwidth required for speech transport by approximately one-third.
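
The bit rates quoted above follow directly from the sampling rate and the code width. The short sketch below reproduces that arithmetic. The function names are illustrative only, and the one-third saving is simply the approximate figure quoted in the text, not a measured value.

```python
SAMPLE_RATE_HZ = 8000  # telephone sampling rate given in the text


def bit_rate_bps(bits_per_sample: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Constant bit rate of a synchronous channel: samples/s x bits/sample."""
    return sample_rate_hz * bits_per_sample


def with_silence_suppression(rate_bps: float, saving: float = 1 / 3) -> float:
    """Effective rate when silent periods are not transmitted.

    The default saving of one-third is the approximate figure from the text.
    """
    return rate_bps * (1 - saving)


pcm_bps = bit_rate_bps(8)    # standard telephone PCM, 8-bit codes
adpcm_bps = bit_rate_bps(4)  # ADPCM, 4-bit codes

print("PCM:", pcm_bps, "bps")
print("ADPCM:", adpcm_bps, "bps")
print("PCM with silence suppression:", round(with_silence_suppression(pcm_bps)), "bps")
```
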
5.1.1 UNDERSTANDING CODECS
A codec (“compressor/decompressor” or “coder/decoder”) is the hardware or software that samples analog sound and converts it to digital bits, which it outputs at a predetermined data rate. The codec often performs compression as well, to save bandwidth. There are dozens of available codecs, each with its own characteristics. Codecs have odd-looking names that correspond to the name of the ITU standard that describes their operation. For example, the codecs named G.711u and G.711a convert from analog to digital and back with relatively high quality. As with most things digital, higher quality implies more bits, so these two codecs use more bandwidth than lower-speed codecs. Lower-speed codecs, such as G.726, G.729, and those in the G.723.1 family, consume less network bandwidth. However, low-speed codecs impair the quality of the audio much more than high-speed codecs, because they compress the digital transmission with lossy compression.

FIGURE 5-1. COMMON CODECS USED IN VOIP. FOR EACH CODEC, THE CODEC’S DATA RATE IS SHOWN, AS WELL AS THE TIME NEEDED BY THE CODEC TO DO THE ANALOG-TO-DIGITAL AND DIGITAL-TO-ANALOG CONVERSIONS.

With lossy compression, fewer bits are sent, so the receiving side does its best to approximate what the original audio sounded like, but the result is not a high-fidelity recreation.
Figure 5-1 describes some common VoIP codecs. The middle column in the table shows the rate at which the codec generates its output. The “Packetization Delay” column refers to the delay a codec introduces as it converts from analog to digital and back. This fixed amount of delay can affect the quality of the call as perceived by the listeners.
A typical speech coder consists of two modules: an analysis module and a synthesis module. The analysis module extracts the time-varying excitation waveform and the time-varying filter parameters from the speech waveform. The synthesis module recreates the perceptually best match to the original speech waveform. Examples of multipulse and stochastic coders are MPLPC (multipulse linear predictive coding) and CELP (code-excited linear prediction). Codecs use sophisticated techniques for coding and compression. Packet loss concealment (PLC) is an additional feature available with the G.711u and G.711a codecs.
5.1.2 CODEC SELECTION
The G.711u and G.711a codecs convert from analog to digital and back with high quality and no compression, but they take a fair amount of bandwidth. G.711 PCM was designed around several fundamental signal characteristics. To capture the proper degree of resolution, the voice signal is sampled at double the highest frequency of the voice band, or 8000 times per second. Thus, PCM takes a sample every 0.125 ms (1 second / 8000 = 0.000125 seconds), and with 8 bits per sample the overall bandwidth required is 8000 * 8, or 64000 bps. When G.711 was invented, modern digital signal processing (DSP) technology was not available. New compression algorithms now make it possible to provide intelligible voice communications with reduced bandwidth consumption.
The lower-speed codecs, G.726-32, G.729, and those in the G.723.1 family, consume less network bandwidth. Low-speed codecs impair the quality, however, because they compress the signal with lossy compression. Fewer bits are sent, so the receiving side does its best to approximate what the original signal sounded like. Using less bandwidth is good, since you can run more concurrent calls over the same links, but the compression reduces clarity, introduces delay, and makes the voice quality very sensitive to lost data.
In the table of codec defaults shown in Figure 5-2, some of the most commonly used VoIP codecs are listed with their default values. The “Packetization Delay” column refers to the delay a codec introduces as it converts a signal from analog to digital. Packetization delay is included in the MOS estimate (the mean opinion score, a subjective measure of call quality), as is the “jitter buffer delay,” the delay introduced by buffering to reduce inter-arrival delay variations.

FIGURE 5-2. CODEC DEFAULTS.

The “Combined Bandwidth” column shows that the real bandwidth consumption of VoIP calls is higher than it first appears. The G.729 codec, for example, has a data payload rate of 8 kbps. But its actual bandwidth usage is higher than this: when sent at 20 ms intervals, the payload size is 20 bytes per datagram. To this add the 40 bytes of IP, UDP, and RTP headers (yes, the headers are bigger than the payload) and any additional layer 2 headers. For example, Ethernet adds 18 more bytes. And because there are two concurrent G.729 RTP flows (one in each direction), double the bandwidth consumption you’ve calculated so far. It’s worth observing in the table that both G.723.1 codecs result in calls of only “Acceptable” quality at best. Their theoretical maximum MOS is below the 4.0 value needed to be considered “Good.”
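
The G.729 arithmetic described above can be worked through in a few lines. This is a minimal sketch: the function and parameter names are illustrative, and the header sizes (40 bytes of IP/UDP/RTP headers, 18 bytes of Ethernet framing) are the ones quoted in the text. Other link layers would need different values.

```python
def payload_bytes(codec_rate_bps: float, packet_interval_ms: float) -> float:
    """Voice payload carried in one datagram."""
    return codec_rate_bps * packet_interval_ms / 1000 / 8


def call_bandwidth_bps(codec_rate_bps: float,
                       packet_interval_ms: float,
                       header_bytes: int = 40,   # IP + UDP + RTP headers
                       layer2_bytes: int = 18,   # Ethernet, per the text
                       directions: int = 2) -> float:
    """Bandwidth of one call, counting headers and both RTP flows."""
    packets_per_second = 1000 / packet_interval_ms
    frame_bytes = payload_bytes(codec_rate_bps, packet_interval_ms) + header_bytes + layer2_bytes
    return frame_bytes * 8 * packets_per_second * directions


g729_payload = payload_bytes(8000, 20)                    # bytes per datagram
g729_one_way = call_bandwidth_bps(8000, 20, directions=1)
g729_two_way = call_bandwidth_bps(8000, 20)

print(g729_payload, g729_one_way, g729_two_way)
```

Under these assumptions a nominal 8 kbps codec occupies 31.2 kbps in each direction on Ethernet, which is why the combined figure is so much larger than the codec rate.
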
5.1.3 PACKET LOSS CONCEALMENT
Packet loss concealment (PLC) is an additional option in the G.711u and G.711a codecs. PLC techniques reduce or mask the effects of data loss during a VoIP conversation. When PLC is enabled, the quality of the conversation is assumed to improve; this improvement is factored into the MOS estimate if any data is lost. PLC makes the codec itself more expensive to manufacture, but does not otherwise add delay or have other bad side effects.
5.2 BANDWIDTH
In a converged voice and data network, you must decide how much bandwidth to give each service. These decisions are based on careful consideration of your priorities and the bandwidth you can afford. If you allocate too little bandwidth for voice service, quality may be unacceptable. Therefore, bandwidth for voice services and associated signaling must take priority over that of best-effort Internet traffic. If a network were to use the same prevailing encoding (codec) scheme as the current PSTN, bandwidth requirements for VoIP networks would tend to be larger than those of a circuit-switched voice network of similar capacity.

FIGURE 5-3. THE RIGHTMOST COLUMN OF THIS TABLE SHOWS THE REAL BANDWIDTH CONSUMED FOR EACH CODEC IN A TWO-WAY VOIP TELEPHONE CONVERSATION.

The reason for the larger requirement is the overhead in the protocols used to deliver the voice service. Typically, you would need speeds of OC-12c/STM-4 and higher to support thousands of call sessions. However, VoIP networks that employ compression and silence suppression could actually use less bandwidth than a similar circuit-switched network, because of the greater granularity in bandwidth usage. Allocations of network bandwidth are based on projected numbers of calls at peak hours. Any over-subscription of voice bandwidth can cause a reduction in voice quality. Also, you must set aside adequate bandwidth for signaling to ensure that calls complete and to reduce service interruptions. The formula to calculate RTP bearer voice bandwidth usage for a given number of phone calls is relatively straightforward:
Bits per sec = packets per sec x packet size (bytes) x number of calls x 8 bits per byte
where packets per sec = 1,000 ms / packet creation interval (ms)
Note that this number is a raw measure of IP traffic and does not take into account the overhead used by the transporting media (links between the routers) and data-link layer protocols. Add this raw IP value to that of the overhead to determine the link speeds needed to support this number of calls. Note that this value represents only the bearer (voice) content.
Signaling bandwidth requirements vary depending on the rate at which calls are generated and the signaling protocol used. A general guideline for the maximum bandwidth that an IP signaling protocol needs is roughly three percent of all bearer traffic. In the example below, the signaling bandwidth required if all 2,000 calls were initiated in one second would be approximately 4.8 Mbps (3 percent of 160 Mbps). Adding bearer and signaling, the total bandwidth needed to support two thousand G.711-encoded calls would be an approximate maximum of 164.8 Mbps. This bandwidth requirement is a theoretical maximum for this specific case. If the parameters change, such as call initiation rate, voice encoding method, packet creation rate, employment of compression, and silence suppression, the bandwidth requirements would change as well.
Example: 2,000 full-duplex G.711-encoded voice channels with a packet creation interval of 20 ms and a packet size of 200 bytes (40 bytes of IP/UDP/RTP headers + 160-byte payload)
50 packets per second = 1,000 ms / 20 ms
160 Mbps = 50 x 200 x 2,000 x 8
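
The bearer and signaling figures in this example can be reproduced as follows. This is only a sketch of the formula as stated in the text. The function names are illustrative, and the 3 percent signaling share is the rough guideline quoted above rather than a property of any particular signaling protocol.

```python
def bearer_bandwidth_bps(packet_interval_ms: float, packet_bytes: int, calls: int) -> float:
    """Raw IP bearer bandwidth: packets/s x bytes/packet x calls x 8 bits/byte."""
    packets_per_second = 1000 / packet_interval_ms
    return packets_per_second * packet_bytes * calls * 8


def signaling_bandwidth_bps(bearer_bps: float, fraction: float = 0.03) -> float:
    """Worst-case signaling estimate as a fraction of bearer traffic."""
    return bearer_bps * fraction


bearer = bearer_bandwidth_bps(packet_interval_ms=20, packet_bytes=200, calls=2000)
signaling = signaling_bandwidth_bps(bearer)
total = bearer + signaling

print(f"bearer    {bearer / 1e6:.1f} Mbps")
print(f"signaling {signaling / 1e6:.1f} Mbps")
print(f"total     {total / 1e6:.1f} Mbps")
```

Link-layer overhead is deliberately left out, as in the formula itself, so the result is a raw IP figure.
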
5.3 PACKET LOSS/LOST DATA
Network packet loss occurs when packets are sent but not received at the final destination because of some network problem. To ensure good-quality voice in a VoIP network, packet loss should be less than 0.2% between endpoints. Several factors make packet loss requirements somewhat variable:
• Packet loss requirements are tighter for tones (other than DTMF) than for voice. The ear is less able to detect packet loss during speech (variable pitch) than during a tone (consistent pitch).
• Packet loss requirements are tighter for short, continuous packet loss than for random packet loss over time. Losing ten contiguous packets is worse than losing ten packets evenly spaced over an hour.
• Packet loss may be more noticeable for larger voice payloads than for smaller ones, because more voice is lost in a larger payload.
Remember that too much delay can cause dropped packets, and it may appear the network is losing packets when in fact they have been discarded intentionally.
VoIP datagrams are sent using RTP, the Real-time Transport Protocol. Although every RTP datagram contains a sequence number to help applications detect data loss and datagrams received out of order, there isn’t enough time to retransmit lost or out-of-order datagrams. To measure data loss, each side keeps track of how many bytes of data it sent. The sender tells the receiver how many bytes it sent, and the receiver compares that value to the amount received to determine lost data. But it is “bursts of loss” that degrade quality most significantly.
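
As an illustration of how a receiver could use sequence numbers to spot loss and bursts, here is a minimal sketch. It is not how any particular RTP stack does it: it counts datagrams rather than bytes, it ignores the 16-bit wraparound of real RTP sequence numbers, and it treats any run of more than one consecutive missing datagram as a burst. All names are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_loss_runs(received_seq: Iterable[int], first: int, last: int) -> List[Tuple[int, int]]:
    """Return (start_seq, length) for each run of consecutive missing datagrams."""
    got = set(received_seq)
    runs: List[Tuple[int, int]] = []
    run_start = None
    for seq in range(first, last + 1):
        if seq not in got:
            if run_start is None:
                run_start = seq          # a new run of losses begins here
        elif run_start is not None:
            runs.append((run_start, seq - run_start))
            run_start = None
    if run_start is not None:            # losses continued to the end
        runs.append((run_start, last + 1 - run_start))
    return runs


# Datagrams 3, 6, 7 and 8 never arrive; arrival order does not matter here.
runs = find_loss_runs([1, 2, 5, 4, 9, 10], first=1, last=10)
lost = sum(length for _, length in runs)
loss_percent = 100 * lost / 10
bursts = [run for run in runs if run[1] > 1]

print(runs, loss_percent, bursts)
```
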

FIGURE 5-4. THE EFFECT ON THE MOS OF 5% RANDOMIZED PACKET LOSS ON FOUR CODECS, AS DELAY INCREASES.

A burst is generally considered to be more than one consecutive lost datagram. Human listeners don’t readily notice lower quality if the loss of datagrams is randomly distributed, with just a few dropped at a time. This type of loss pattern has some effect, as Figures 5-4 and 5-5 show, but the quality decline mostly stems from a combination of loss and delay.
At 5% random packet loss (Figure 5-4), the MOS starts at around 4 for the G.711 codec with PLC and declines as the delay increases. Contrast this with 5% bursty packet loss in Figure 5-5, where the MOS starts at around 3.5 for the same codec. The effect of bursty loss is even greater on the other codecs with high compression. For example, G.729 starts with a MOS of around 3.4 for 5% random packet loss. However, with 5% bursty packet loss, G.729 drops to a MOS below 2.

FIGURE 5-5. THE EFFECT ON THE MOS OF 5% BURSTY PACKET LOSS ON FOUR CODECS, AS DELAY INCREASES.

Two primary reasons explain why RTP datagrams might be lost in a data network:
• There is too much traffic, so datagrams are discarded when there is congestion.
• There is too much delay variation (jitter), so datagrams are discarded because they arrive at the listener’s jitter buffer too late or too early.

5.3.1 LOST PACKET COMPENSATION
In current IP networks, all voice frames are treated like data. Under peak loads and congestion, voice frames will be dropped equally with data frames. The data frames, however, are not time sensitive and dropped packets can be appropriately corrected through the process of retransmission. Lost voice packets, however, cannot be dealt with in this manner. Some schemes used by Voice over Packet software to address the problem of lost frames are:
1. Interpolate for lost speech packets by replaying the last packet received during the interval when the lost packet was supposed to be played out. This scheme is a simple method that fills the time between non-contiguous speech frames. It works well when the incidence of lost frames is infrequent.
2. Send redundant information at the expense of bandwidth utilization by sending the nth packet of voice information along with the (n+1)th packet. This method has the advantage of being able to exactly correct for the lost packet. However, this approach uses more bandwidth and also creates greater delay.
3. A hybrid approach uses a much lower bandwidth voice coder to provide redundant information carried along in the (n+1)th packet. This reduces the problem of the extra bandwidth required, but fails to solve the problem of delay.
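
The first scheme (replaying the last packet received) is simple enough to sketch. The code below is an illustration only, with hypothetical names. It assumes frames are keyed by sequence number, and it plays silence (represented by None) when a loss occurs before any frame has arrived.

```python
from typing import Dict, List, Optional


def play_out(received: Dict[int, bytes], first_seq: int, last_seq: int) -> List[Optional[bytes]]:
    """Build the playout sequence, filling each gap with the last good frame."""
    playout: List[Optional[bytes]] = []
    last_good: Optional[bytes] = None
    for seq in range(first_seq, last_seq + 1):
        frame = received.get(seq)
        if frame is None:
            frame = last_good        # lost: replay the previous frame
        else:
            last_good = frame        # arrived: remember it for later gaps
        playout.append(frame)
    return playout


# Frame 3 was lost in the network, so frame 2 is played twice.
frames = {1: b"A", 2: b"B", 4: b"D"}
print(play_out(frames, 1, 4))
```

Replaying works for an isolated loss, but a burst would repeat the same frame several times in a row, which is consistent with the remark that the scheme suits infrequent losses.
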
5.4 VOICE TRAFFIC PRIORITY
Because voice is an interactive, real-time, delay-sensitive service, voice and video traffic must be prioritized over other traffic types. This issue is discussed at greater length in the QoS chapter.
5.5 DELAY OR LATENCY
When designing networks that transport voice over packet, frame, or cell infrastructures, it is important to understand and account for the delay components in the network. Correctly accounting for all potential delays ensures that overall network performance is acceptable.
5.5.1 Standards For Delay Limits
The International Telecommunication Union (ITU) considers network delay for voice applications in Recommendation G.114. This recommendation defines three bands of one-way delay, as shown in Figure 5-6.

FIGURE 5-6. DELAY SPECIFICATIONS

Note: These recommendations are for connections with echo adequately controlled, which implies that echo cancellers are used. Echo cancellers are required when one-way delay exceeds 25 ms (G.131). These recommendations are oriented for national telecom administrations, and therefore are more stringent than would normally be applied in private voice networks. When the location and business needs of end users are well-known to the network designer, more delay may prove acceptable. For private networks 200 ms of delay is a reasonable goal and 250 ms a limit, but all networks should be engineered such that the maximum expected voice connection delay is known and minimized.
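
A delay budget can be checked against these limits with a small helper. In the sketch below, the 200 ms goal, the 250 ms limit, and the 25 ms echo-canceller threshold come from the note above. The 150 ms and 400 ms band edges are the values commonly quoted from G.114 and are an assumption here, so they should be verified against the Recommendation before being relied on. All function names are hypothetical.

```python
def g114_band(one_way_delay_ms: float) -> str:
    """Classify one-way delay into three bands (band edges are assumed, see lead-in)."""
    if one_way_delay_ms < 0:
        raise ValueError("delay cannot be negative")
    if one_way_delay_ms <= 150:
        return "acceptable for most user applications"
    if one_way_delay_ms <= 400:
        return "acceptable if the impact on the application is understood"
    return "unacceptable for general network planning"


def private_network_status(one_way_delay_ms: float) -> str:
    """Compare against the private-network goal (200 ms) and limit (250 ms)."""
    if one_way_delay_ms <= 200:
        return "within goal"
    if one_way_delay_ms <= 250:
        return "within limit"
    return "over limit"


def echo_canceller_required(one_way_delay_ms: float) -> bool:
    """Echo cancellers are required above 25 ms one-way delay (G.131)."""
    return one_way_delay_ms > 25


for delay_ms in (20, 120, 230, 450):
    print(delay_ms, g114_band(delay_ms), private_network_status(delay_ms),
          echo_canceller_required(delay_ms))
```
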
5.5.2 Sources of Delay
There are two distinct types of delay: fixed and variable. Fixed delay components add directly to the overall delay on the connection.
Variable delays arise from queuing delays in the egress (outgoing) trunk buffers on the serial port connected to the WAN. These buffers create variable delays, called jitter, across the network. Variable delays are handled by the de-jitter buffer at the receiving router/gateway. The de-jitter buffer is described in a later section, De-jitter Delay.
Coder (Processing) Delay
Also called processing delay, coder delay is the time taken by the digital signal processor (DSP) to compress a block of PCM samples. Because different coders work in different ways, this delay varies with the voice coder used and the processor speed. Best- and worst-case coder delays are shown in Figure 5-7:

FIGURE 5-7. BEST AND WORST CASE PROCESSING DELAY

Algorithmic Delay
The compression algorithm relies on known voice characteristics to correctly process sample block N, so it must have some knowledge of what is in block N+1 to accurately reproduce sample block N. This look-ahead, which is really an additional delay, is called algorithmic delay and effectively increases the length of the compression block. For G.729, the net effect is a 5 ms addition to the overall delay on the link: the total time required to process a block of information is 10 ms, with a 5 ms constant overhead factor.
• Algorithmic delay for G.726 coders is 0 ms.
• Algorithmic delay for G.729 coders is 5 ms.
• Algorithmic delay for G.723.1 coders is 7.5 ms.

LUMPED CODER DELAY = WORST CASE COMPRESSION TIME PER BLOCK + (DECOMPRESSION TIME PER BLOCK X NO. OF BLOCKS IN FRAME) + ALGORITHMIC DELAY

The lumped coder delay for G.729 that we'll use for the remainder of this document is:
Worst case compression time per block: 10 ms
Decompression time per block x 3 blocks: 3 ms
Algorithmic delay: 5 ms
Total: 18 ms
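
The lumped coder delay is a simple sum, shown here as a sketch with illustrative names. The 1 ms decompression time per block is inferred from the "3 blocks: 3 ms" line above rather than stated directly.

```python
def lumped_coder_delay_ms(compression_ms_per_block: float,
                          decompression_ms_per_block: float,
                          blocks_per_frame: int,
                          algorithmic_delay_ms: float) -> float:
    """Worst-case compression + decompression of every block + look-ahead."""
    return (compression_ms_per_block
            + decompression_ms_per_block * blocks_per_frame
            + algorithmic_delay_ms)


# G.729 figures from the text: 10 ms block, 3 blocks per frame, 5 ms look-ahead.
g729_delay = lumped_coder_delay_ms(
    compression_ms_per_block=10,
    decompression_ms_per_block=1,   # inferred: 3 ms for 3 blocks
    blocks_per_frame=3,
    algorithmic_delay_ms=5,
)
print(g729_delay, "ms")
```
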
Packetization Delay
Packetization delay is the time taken to fill a packet payload with encoded/compressed speech. This delay is a function of the sample block size required by the vocoder and the number of blocks placed in a single frame. Packetization delay may also be called accumulation delay, as the voice samples accumulate in a buffer before being released. As a general rule you should strive for a packetization delay of no more than 30 ms.

FIGURE 5-8. COMMON PACKETIZATION

You have to balance the packetization delay against the CPU load. The lower the delay, the higher the frame rate, and the higher the load on the CPU. On some older platforms, 20 ms payloads may strain the main CPU.
Pipelining Delay in the Packetization Process
Though each voice sample experiences both algorithmic delay and packetization delay, in reality the processes overlap, and there is a net benefit from this pipelining.
Serialization Delay
Serialization delay is the fixed time required to clock a voice or data frame onto the network interface. It depends directly on the frame size and the line speed.

FIGURE 5-9. SERIALIZATION DELAY IN MSEC FOR DIFFERENT FRAME SIZES

Reading from the table, on a 64 Kbps line, a CS-ACELP voice frame with a length of 38 bytes (37 + 1 flag) has a serialization delay of 4.75 ms.
Note: The serialization delay for a 53-byte ATM cell (T1: 0.275 ms, E1: 0.207 ms) is negligible due to the high line speed and small cell size.
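
The values quoted from the serialization table can be recomputed from frame size and line speed. The sketch below assumes the usual line rates of 1.544 Mbps for T1 and 2.048 Mbps for E1. The function name is illustrative.

```python
def serialization_delay_ms(frame_bytes: int, line_rate_bps: float) -> float:
    """Time to clock one frame onto the line, in milliseconds."""
    return frame_bytes * 8 * 1000 / line_rate_bps


voice_64k = serialization_delay_ms(38, 64_000)      # CS-ACELP frame, 37 bytes + 1 flag
atm_t1 = serialization_delay_ms(53, 1_544_000)      # ATM cell on T1 (assumed 1.544 Mbps)
atm_e1 = serialization_delay_ms(53, 2_048_000)      # ATM cell on E1 (assumed 2.048 Mbps)

print(f"{voice_64k:.2f} ms, {atm_t1:.3f} ms, {atm_e1:.3f} ms")
```
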
Queuing/Buffering Delay
After the compressed voice payload is built, a header is added and the frame is queued for transmission on the network connection. Because voice should have absolute priority in the router/gateway, a voice frame must only wait for either a data frame already playing out, or for other voice frames ahead of it. Essentially the voice frame is waiting for the serialization delay of any preceding frames in the output queue. Queuing delay is a variable delay and is dependent on the trunk speed and the state of the queue. Clearly there are random elements associated with the queuing delay.
For example, assume we are on a 64 Kbps line and are queued behind one data frame (48 bytes) and one voice frame (42 bytes). Because how much of the 48-byte frame has already played out is random, we can assume that, on average, half of the data frame has been played out. Using the data from the serialization table, the data frame component is then 6 ms * 0.5 = 3 ms. Adding the time for the voice frame ahead in the queue (5.25 ms) gives a total queuing delay of 8.25 ms.
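
The same worked example, expressed as code. This is a sketch of the averaging argument only: it assumes that exactly one frame is already being transmitted and that, on average, half of it remains. The names are illustrative.

```python
from typing import Iterable


def serialization_delay_ms(frame_bytes: int, line_rate_bps: float) -> float:
    """Time to clock one frame onto the line, in milliseconds."""
    return frame_bytes * 8 * 1000 / line_rate_bps


def expected_queuing_delay_ms(in_progress_bytes: int,
                              waiting_frames_bytes: Iterable[int],
                              line_rate_bps: float) -> float:
    """Average wait: half of the frame in progress plus all frames queued ahead."""
    in_progress = 0.5 * serialization_delay_ms(in_progress_bytes, line_rate_bps)
    waiting = sum(serialization_delay_ms(size, line_rate_bps) for size in waiting_frames_bytes)
    return in_progress + waiting


# 48-byte data frame partly sent, one 42-byte voice frame ahead, 64 Kbps line.
delay = expected_queuing_delay_ms(48, [42], 64_000)
print(delay, "ms")
```
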
Network Switching Delay
The public frame relay or ATM network interconnecting the endpoint locations is the source of the largest delays for voice connections. These delays are also the most difficult to quantify. If wide-area connectivity is provided by a private network, it is possible to identify the individual components of delay. In general, the fixed components are from propagation delays on the trunks within the network, and variable delays are from queuing delays clocking frames into and out of intermediate switches. To estimate propagation delay, a popular estimate of 10 microseconds/mile or 6 microseconds/km (G.114) is widely used, although intermediate multiplexing equipment, backhauling, microwave links, and other factors found in carrier networks create many exceptions.
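
For a first estimate, the propagation rule of thumb can be applied directly to the route length. The sketch below uses the two constants quoted above. The 4,000 km route is a made-up example, and the result ignores the carrier-network exceptions the text mentions.

```python
US_PER_KM = 6      # microseconds per kilometre (G.114 estimate quoted in the text)
US_PER_MILE = 10   # microseconds per mile


def propagation_delay_ms(distance: float, us_per_unit: float = US_PER_KM) -> float:
    """One-way propagation delay in milliseconds for a given route length."""
    return distance * us_per_unit / 1000


print(propagation_delay_ms(4000), "ms for a hypothetical 4,000 km route")
print(propagation_delay_ms(2500, US_PER_MILE), "ms for a hypothetical 2,500 mile route")
```
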
5.6 ECHO
Two problems that result from high end-to-end delay in a voice network are echo and talker overlap. Echo becomes a problem when the round-trip delay is more than 50 milliseconds. In circuit-switched systems, echo is caused by signal reflections generated by the hybrid that converts between a 4-wire circuit (separate transmit and receive pairs) and a 2-wire circuit (a single pair shared by transmit and receive). When signals pass from the 4-wire to the 2-wire circuit, some of the energy in the 4-wire circuit is reflected back towards the speaker.

FIGURE 5-10. ECHO

Echo is not always bad. A small amount of echo, called sidetone, is positive: it reinforces that your voice is being carried towards the conversation partner. As long as the round-trip delay is less than 50 ms and the echo is not too loud, it is acceptable. However, as the delay between your voice and the reflected signal increases, the echo becomes intrusive. In circuit-switched networks, the round-trip delay of echo is less than 50 ms because these networks are configured to cancel out any echo with a delay above 45 to 50 ms, depending on the network. Talker overlap (the problem of one caller stepping on the other talker's speech) becomes significant if the one-way delay becomes greater than 250 milliseconds. The end-to-end delay budget is therefore the major constraint and driving requirement for reducing delay through a packet network.
Echo canceller processing can be divided into two processes, storing and comparing/filtering:
Storing: In order to identify a reflected signal, the echo canceller must first store the incoming signal. Therefore, all of the voice traffic transiting the IP network is stored in a First In First Out (FIFO) buffer.

FIGURE 5-11. PRINCIPLE OF ECHO CANCELLATION

The size of the buffer is determined by the expected echo path delay. The longer the expected echo path delay, the larger the required buffer. There are two types of echo cancellation: near end and far end. Near-end echo cancellation is so called because the echo is cancelled nearest to the echo source. It is possible to perform echo cancellation at the far end, but the echo path will be much longer, normally requiring more processing power.
Comparing and Filtering: The echo canceller compares the signal similarity and power level coming from the hybrid point to the previously stored signals passed from the packet network. The comparison is made using a model of the hybrid circuit, which is created by an adaptive algorithm inside the echo canceller. However, identifying echo is a difficult task for the adaptive filter because of “double-talk” issues. Double-talk occurs when both sides of a conversation attempt to speak simultaneously. The near-end speaker may confuse the echo identification process, thus limiting the effectiveness of the echo canceller. Also, the echo canceller must not clip the beginning of the double-talk session.
To alleviate the double-talk issue, a double-talk detector (DTD) is implemented. The double-talk detector works in most cases, but when it fails to detect double-talk, some echo is still noticeable. The echo canceller also needs to identify and remove background noise. Once an echo is identified, the echo canceller subtracts the echo from the returning signal. Echo cancellation can be very resource intensive. The amount of processing power necessary to accurately compare and filter out echo can be high. This need for high processing power can be detrimental to the vendor's overall solution. Since processing power is a finite resource, as more resources are diverted to echo cancellation, fewer resources are available for processing additional voice channels. In summary, high processing power requirements for echo cancellation mean more power consumption, higher cost, and lower port density.
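
The store, compare, and subtract steps can be illustrated with a toy adaptive filter. The text does not say which adaptive algorithm a real echo canceller uses. The sketch below substitutes a normalized LMS (NLMS) filter as a stand-in, with no double-talk detector and no background-noise handling. The echo path (60% of the signal reflected 3 samples later) and all names and parameter values are hypothetical.

```python
import random


def nlms_echo_canceller(far_end, returned, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of far_end from the returned signal."""
    weights = [0.0] * taps   # adaptive model of the echo path (the hybrid)
    history = [0.0] * taps   # FIFO of the most recent far-end samples
    residual = []
    for x, d in zip(far_end, returned):
        history = [x] + history[:-1]                         # storing
        estimate = sum(w * h for w, h in zip(weights, history))
        error = d - estimate                                 # subtract estimated echo
        energy = sum(h * h for h in history) + eps
        weights = [w + step * error * h / energy             # adapt the model
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual


random.seed(1)
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
# Hypothetical echo path: the hybrid reflects 60% of the signal 3 samples later.
echo = [0.0] * 3 + [0.6 * s for s in far_end[:-3]]
residual = nlms_echo_canceller(far_end, echo)

echo_energy = sum(e * e for e in echo[-200:])
residual_energy = sum(r * r for r in residual[-200:])
print("residual/echo energy over the last 200 samples:", residual_energy / echo_energy)
```

The number of taps plays the role of the FIFO buffer size: a longer echo path needs more taps, and each extra tap adds work per sample, which is one way to see why the text calls echo cancellation resource intensive.
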
5.7 JITTER<o:p></o:p>
Jitter is the measure of time between when a packet is expected to arrive to when it actually arrives. In other words, with a constant packet transmission rate of every 20 ms, every packet would be expected to arrive at the destination exactly every 20 ms. This situation is not always the case. For example, Figure 5-12 shows packet one (P1) and packet three (P3) arriving when expected, but packet two (P2) arriving 12 ms later than expected and packet four (P4) arriving 5 ms late.<o:p></o:p>

FIGURE 5-12: EXAMPLE JITTER

Jitter, also called delay variation, indicates the differences in arrival times among all datagrams sent during a VoIP call. When a datagram is sent, the sender gives it a timestamp, which is placed in the RTP header. When it is received, the receiver adds another timestamp. These two timestamps are used to calculate the packet’s transit time. If the transit times for datagrams within the same call are different, the call contains jitter. In a video application, jitter manifests itself as a flickering image, while in a telephone call, its effect may be similar to the effect of lost data: some words may be missing or garbled.
The greatest culprit of jitter is queuing variation caused by dynamic changes in network traffic loads. Another cause is that packets might sometimes take a different equal-cost link that is not physically (or electrically) the same length as the other links. The amount of jitter in a call depends on the degree of difference between the datagrams’ transit times. If the transit time for all datagrams is the same (no matter how long it took for the datagrams to arrive), the call contains no jitter. If the transit times differ slightly, the call contains some jitter. Once jitter exceeds 50 ms, the MOS declines, indicating poor call quality. Jitter provides a short-term measurement of network congestion and can show the effects of queuing within the network. IP phones send voice datagrams at a constant rate based on the codec’s default datagram size.
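As a rough illustration of how transit-time differences turn into a single jitter figure, the sketch below uses the smoothed interarrival jitter estimator defined in RFC 3550 (the RTP specification), where each new difference moves the running value by 1/16. The function name is hypothetical, times are in milliseconds rather than RTP timestamp units, and the 50 ms base network delay is an assumption chosen only to reproduce the P1 to P4 example above.

```python
def interarrival_jitter(send_times_ms, recv_times_ms):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1."""
    jitter = 0.0
    prev_transit = None
    for sent, received in zip(send_times_ms, recv_times_ms):
        transit = received - sent
        if prev_transit is not None:
            # Difference in transit time between consecutive packets
            d = abs(transit - prev_transit)
            # Move the running estimate 1/16 of the way toward d
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Packets sent every 20 ms; P2 arrives 12 ms late and P4 arrives 5 ms late.
sent = [0, 20, 40, 60]
received = [50, 82, 90, 115]   # assumed 50 ms base delay
estimate = interarrival_jitter(sent, received)
```

Because only differences between transit times are used, a constant offset between the sender’s and receiver’s clocks cancels out, so the two clocks do not need to be synchronized.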
5.8 TRANSCODING
Transcoding is the conversion of a voice signal from one encoding to another, including conversion from analog to digital or digital to analog (with or without compression and decompression). If calls are routed using multiple voice coders, as in the case of call coverage on an intermediary system back to a centralized voice mail system, the calls may experience multiple transcodings (including the one in and out of the voice mailbox). Each transcoding episode results in some degradation of voice quality.
5.9 SILENCE SUPPRESSION AND VOICE ACTIVITY DETECTION

FIGURE 5-13: VAD

VAD BEHAVIOUR
Voice Activity Detection (VAD) monitors the received signal for voice activity. When no activity is detected for the configured period of time, the software informs the Packet Voice Protocol. This prevents the encoder output from being transported across the network when there is silence, resulting in additional bandwidth savings. The software also measures the idle noise characteristics of the telephony interface and reports them to the Packet Voice Protocol, which relays them to the remote end so that noise can be generated there when no voice is present. Aggressive VADs cause voice clipping and can result in poor voice quality, but the use of VAD can greatly conserve bandwidth and is therefore a very important detail to consider when planning network bandwidth, especially in the WAN (Wide Area Network).
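The behaviour described above can be sketched with a very simple energy-based detector. Real VADs are considerably more elaborate; the frame-energy test, the threshold value, and the names below are assumptions made only for illustration. The `hangover_frames` parameter plays the role of the “configured period of time”: suppression starts only once that many consecutive quiet frames have been seen.

```python
def vad_decisions(frames, threshold=0.01, hangover_frames=3):
    """Return one boolean per frame: True = transmit, False = suppress.

    frames          : list of non-empty lists of samples (one list per frame)
    threshold       : mean-square energy at or above which a frame is voice
    hangover_frames : consecutive quiet frames required before suppressing
    """
    decisions = []
    quiet_run = hangover_frames   # start in the suppressed (silent) state
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            quiet_run = 0         # voice activity resets the quiet counter
        else:
            quiet_run += 1
        decisions.append(quiet_run < hangover_frames)
    return decisions


# Demo: 160-sample frames, two loud frames surrounded by near silence.
loud = [0.5] * 160
quiet = [0.001] * 160
decisions = vad_decisions([quiet, loud, loud, quiet, quiet, quiet, quiet, loud])
sent_fraction = sum(decisions) / len(decisions)
```

A smaller `hangover_frames` or a higher `threshold` saves more bandwidth but makes the detector more aggressive, which is the trade-off against clipped speech noted above.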
5.10 NETWORK PACKET MIS-ORDER
For voice over IP, network packet mis-order is very much like packet loss. If a packet arrives out of order, it is generally discarded, as it makes no sense to play it out of order. Specifically, packets are discarded when they arrive later than the jitter buffer can hold them. Mis-order can occur when networks send individual packets over different routes. Planned events such as load balancing, or unplanned events such as re-routing due to congestion or other transient difficulties, can cause packet mis-order. Packets traversing the network over different routes may arrive at their destination out of order. Network latency over multiple yet unequal routing paths can also force packet mis-order.
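The rule that a packet is dropped only when it arrives later than the jitter buffer can hold it can be shown with a toy model. This is a simplified sketch, not how any particular product implements its buffer: it counts buffer depth in packets rather than milliseconds, ignores sequence-number wraparound, and uses hypothetical names throughout.

```python
import heapq


def play_out(arrivals, buffer_depth=2):
    """Reorder packets by sequence number; drop those that arrive too late.

    arrivals     : sequence numbers in the order the packets arrive
    buffer_depth : how many packets the buffer may hold before it must play
    Returns (played, discarded) lists of sequence numbers.
    """
    buffered = []         # min-heap of sequence numbers waiting to be played
    next_to_play = None   # lowest sequence number that can still be played
    played, discarded = [], []
    for seq in arrivals:
        if next_to_play is not None and seq < next_to_play:
            discarded.append(seq)   # its playout slot has already passed
            continue
        heapq.heappush(buffered, seq)
        while len(buffered) > buffer_depth:
            oldest = heapq.heappop(buffered)
            played.append(oldest)
            next_to_play = oldest + 1
    while buffered:                 # end of call: drain what is left
        played.append(heapq.heappop(buffered))
    return played, discarded


# Slight mis-order is absorbed by the buffer; packet 5 below arrives too late.
reordered = play_out([1, 3, 2, 4, 7, 5, 6, 8])
too_late = play_out([1, 2, 3, 4, 6, 7, 8, 5])
```

A deeper buffer tolerates more mis-order but adds delay to every packet, so the buffer size is a compromise between loss and latency.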
5.11 RELIABILITY
Although network failures are rare, planning for them is essential. Failover strategies are desirable for cases when network devices malfunction or links are broken. An important strategy is to deploy redundant links between network devices and/or to deploy redundant equipment. To ensure continued service, plan carefully for how media gateways and media gateway controllers can make use of the redundant schemes.
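To get a feel for what redundancy buys, the short calculation below estimates the combined availability of redundant components and converts an availability figure into downtime per year. It assumes that failures are independent and that failover is instantaneous, which real networks only approximate; the function names are hypothetical.

```python
def parallel_availability(single, copies=2):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes the components fail independently of each other.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Two independent links that are each available 99% of the time
two_links = parallel_availability(0.99, copies=2)
# Downtime allowed by the "five nines" target
five_nines_downtime = downtime_minutes_per_year(0.99999)
```

Under these assumptions, two 99% links together reach roughly 99.99%, and the “five nines” target leaves only a little over five minutes of downtime per year.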
5.12 SCALABILITY
The ability to grow and scale to meet new requirements is critical when deploying voice services, whether on traditional voice networks or converged VoIP infrastructures. As a company grows, the deployed infrastructure must support that growth without having to be replaced. By leveraging common components across product families, Foundry infrastructures allow customers to retain their initial investment, because replacement components are not required. Common components also help reduce total cost of ownership (TCO) by allowing common sparing for the infrastructure.
5.13 SECURITY
Security, especially in a converged voice and data network, is a high priority. You need to protect the voice communications devices from unauthorized access and malicious attack. It is conceivable that such attacks would either cripple or completely disable voice services.
5.14 PC CONSIDERATIONS USING IP SOFT PHONE
IP SoftPhone is software on a PC that simulates a telephone. The “perceived” audio/voice quality at the PC endpoint is a function of at least four factors:
1. Transducer Quality: The selection of speaker and microphone or headset has an impact on the reproduction of the sound.
2. Sound Card Quality: There are several parameters that affect sound card quality. The most important is whether or not the sound card supports full-duplex operation.
3. End-to-End Delay: A PC can be a major component of delay in a conversation. PC delay consists of the jitter buffer and sound system delays, and is also affected by the number of other processes running and the speed of the processor.
4. Speech Breakup: Speech breakup may be the result of a number of factors:
• Network jitter in excess of the jitter buffer size
• Loss of packets (due to excessive delay, etc.)
• Aggressiveness of silence suppression: In an effort to reduce network load, silence suppression is used to eliminate the transmission of silence. However, some silence suppression algorithms may clip speech and have an effect on perceived audio quality.
• Performance bottleneck in the PC: Lower-speed PCs (or PCs with slow hard drives) may have adverse interactions with sound playback and recording.
5.15 NETWORK DESIGN RECOMMENDATIONS
In the early days of networking, network designers used hubs to attach servers and workstations, and routers to segment the network into manageable pieces. Because of the high cost of router interfaces and the inherent limitations of shared-media hubs, network design was generally well done. In recent years, with the rise of switches to segment networks, designers could hide a number of faults in their networks and still get good performance. As a result, network design has suffered. VoIP will place new demands on the network. Suboptimal designs will not be able to cope with these demands. Even with switches installed, a company must pay attention to industry “best practices” in order to have a properly functioning voice network. Because users will not tolerate poor voice quality, administrators must implement a sound network before beginning VoIP pilots or deployments.
Best Practices
Industry best practices dictate that a network be designed with the following factors in mind:
Reliability/Redundancy, Scalability, Manageability, Bandwidth
Voice mandates the following additional considerations when designing a network:
Delay, Jitter, Loss, Duplex