ECE 358
1: Introduction
1.1 What is the Internet?
Millions of connected computing devices.
- Hosts = End systems.
- Running network apps.
Protocols define format, order of messages sent and received among network entities, and actions taken on message transmission, receipt.
1.2 Network Structure
Network Edge: Hosts (clients and servers). Servers are often in data centers.
Access Networks, Physical Media: Wired, wireless communication links.
Network Core: Interconnected routers, a network of networks.
Wireless Access Networks
Shared wireless access network connects end systems to router via a base station (access point).
Wireless LANs: Within building (100ft), 802.11 b/g (WiFi), 11 / 54 Mbps transmission rate.
Wide-Area Wireless Access: Provided by telco (cellular) operator. 1-10 Mbps transmission rate.
Hosts break application messages into smaller chunks known as packets, of length L bits. Packets are transmitted across the network at a transmission rate R. The transmission delay is then \frac{L}{R}.
- Bit: propagates between transmitter / receiver pairs.
- Physical Link: What lies between the transmitter and receiver.
- Guided Media: Signals propagate in solid media (copper, fiber, coax).
- Coaxial Cable: Two concentric copper conductors. Bidirectional. Broadband (multiple channels on cable).
- Fiber Optic Cable: Glass fiber carrying light pulses, each pulse is a bit. High speed point-to-point transmission. Low error rate because repeaters are spaced far apart, and they are immune to electromagnetic noise.
- Unguided Media: Signals propagate freely (radio).
- Radio: Signal carried in electromagnetic spectrum. Propagation environment effects (reflection, obstruction, interference).
- Twisted Pair (TP): Two insulated copper wires.
1.3 Network Core
Mesh of interconnected routers.
Packet Switching
Hosts break application-layer messages into packets. Packets are forwarded from one router to the next, across links on a path from source to destination.
- Store and forward: Entire packet must arrive at router before it can be transmitted to the next link.
- End-end delay: \frac{2L}{R} for a path with one router (two links), assuming zero propagation delay.
- Queuing and Loss: If the arrival rate to the link exceeds the transmission rate of the link, packets can be dropped if memory (buffer) fills up.
The two key network core functions are routing and forwarding.
Circuit Switching
- Instead of breaking messages into packets, divide the end-end resources and assign them to “calls” between source and destination.
- Circuit segment is idle if it is not used in a call.
- Dedicated resources allow for circuit-like (guaranteed) performance.
- Commonly used in traditional telephone networks.
Packet Switching vs. Circuit Switching
Packet switching allows for more users to use the network.
- Great for “bursty” data. There is no call setup, and resources are shared.
- Excessive congestion is possible which could lead to packet delay and loss. There is a need for protocols (reliable data transfer, congestion control).
- Bandwidth guarantees are needed for audio / video applications.
1.4 Loss and Delay in Networks
Four sources of packet delays.
- Nodal Processing: Checking bit errors, determining the output link. Usually < 1ms.
- Queueing Delay: Time waiting at output link for transmission. Depends on the level of congestion at the router.
- Transmission Delay: \frac{L}{R}, where L is the packet length and R is the link bandwidth.
- Propagation Delay: \frac{d}{s} where d is the length of the physical link and s is the propagation speed in the medium.
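As a quick check of the formulas above, the following Python sketch plugs assumed example values into the four delay components; the packet size, link rate, distance, and per-node delays are illustrative, not from the course.

```python
# Sketch: the four sources of packet delay for assumed example values.
L = 8000            # packet length, bits
R = 10e6            # link transmission rate, bits/sec (10 Mbps, assumed)
d = 2_000_000       # physical link length, metres (assumed)
s = 2e8             # propagation speed in the medium, m/s (typical)
d_proc = 0.5e-3     # nodal processing delay, sec (< 1 ms per the notes)
d_queue = 1e-3      # queueing delay, sec (depends on congestion; assumed)

d_trans = L / R     # transmission delay = L / R
d_prop = d / s      # propagation delay = d / s

total = d_proc + d_queue + d_trans + d_prop
print(f"transmission {d_trans*1e3:.3f} ms, propagation {d_prop*1e3:.3f} ms, "
      f"total nodal delay {total*1e3:.3f} ms")
```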
Throughput: Rate at which bits are transferred between the sender and receiver.
Internet protocol stack
Application: Running the network applications that you want (HTTP, FTP, SMTP). Implemented using software only.
Transport: Responsible for process-to-process communication (TCP, UDP). Mainly implemented using software.
Network: Responsible for routing from source to destination. Implemented using software and hardware.
Link: Transferring data between physically connected (adjacent) nodes (e.g. Computer -> Router, Router -> Router …). Hardware.
Physical: How you are transferring physical bits on the wire. Not covered in this course.
The ISO/OSI reference model includes two extra layers that the Internet stack omits.
Presentation: Applications to interpret meaning of data (e.g. encryption, compression, etc.).
Session: Synchronization, checkpointing.
If you want these features, it must be implemented in the application layer.
Encapsulation
Switch: Link and physical layers. Router: Network, link, and physical layers.
Each layer adds its own header to the data passed down from the layer above.
Transport: Segment
Network Layer: Datagram
Link Layer: Frame
When a switch or router passes data through, the headers are popped off up to its highest layer (decapsulation) and then added back (encapsulation).
Network Security
The Internet was not designed with security in mind. How can bad guys attack, how can we defend, and how can we design architectures that are immune to attacks?
In real life: Hackers hack your system and then you fix the error and catch up.
Malware: any harmful software.
Virus: Self-replicating infection that requires user interaction (e.g. opening an attachment).
Worm: Self-replicating infection that exploits a weakness in an application you are already running (no interaction required).
Spyware: Record keystrokes, web sites, etc.
Infected hosts can be enrolled in botnet to help in future attacks.
Examples of Attacks
Denial of Service (DoS): Attackers infect local computers with a worm so that they all send messages to the target at the same time. The target will be busy servicing these requests and cannot serve legitimate users.
Packet “sniffing”: Passive program that is on the link which listens to and records packets.
IP spoofing: Send a message with a false source address.
Routers: Implement the bottom 3 layers (network, data link, physical).
Switches: Implement the bottom 2 layers (data link, physical).
2: Application Layer
2.1: Principles of network applications
Creating a network app
Applications are written to run only on end systems (not on network-core devices). It does not matter whether they are servers or clients. We also only care about what we want to do at the application layer and then push to the transport layer, which will take care of transmitting the data. This allows for fast development.
Client-Server: Server is always on with a static IP. Client will communicate with servers (intermittent). Client IP addresses will be dynamic.
P2P: Arbitrary end systems directly communicate. Peers request service from other peers and provide service in return (self scalability). Management is more complex because there is no central server that controls all peers. Security is also a big issue.
Process Communication
Process: program running within a host.
- Within the same host, two processes communicate using inter-process communication defined by the OS.
- Processes in different hosts communicate by exchanging messages.
Client process: Process that initiates communication.
Server process: Process that waits to be contacted.
Sockets
Process sends / receives messages to / from its socket.
Socket: The gate between the application layer and the transport layer. Sockets are similar to doors. Sending process shoves message out of door and relies on the other side of the door to deliver message to the socket at the receiving process.
Example: Chrome tabs. How do we know which tab to put data? There is an identifier for each socket to route the data to the correct tab (You will always talk to port 80 of a server for HTTP requests but you will use a random port).
Sockets consists of two parts (for now).
- Address of host.
- Port number.
The combination of the two parts creates a unique identifier for the socket. Each process has its own unique socket (there are over 65k ports).
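A minimal Python sketch of the (host address, port) idea: a “server” socket binds to a fixed port while the client’s socket gets a random ephemeral port from the OS. The loopback address and port 9999 are assumptions for local testing.

```python
# Sketch: a UDP socket is identified by (host IP address, port number).
import socket

# "Server" process: bind a socket to a well-known (address, port) pair.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 9999))          # fixed port, like port 80 for HTTP

# "Client" process: the OS assigns it a random ephemeral port.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", 9999))

data, (src_ip, src_port) = server.recvfrom(2048)
print(f"received {data!r} from {src_ip}:{src_port}")  # src_port is the client's random port

client.close()
server.close()
```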
Application Layer Protocol Defines …
Rules for when and how processes send and respond to messages.
Types of messages exchanged: Request, response.
Message syntax: What does each field mean and how fields are delineated.
Message semantics: What do you want the information in the fields to mean.
Examples of open protocols are HTTP and DNS which are defined in RFCs (Request For Comments is a formal document from the Internet Engineering Task Force). An example of a proprietary protocol is Skype.
What Transport Services Does an Application Need?
Services not provided by the transport layer must be implemented at the application level.
- Data Integrity. Some applications require 100% data transfer whereas other applications (e.g. audio) can tolerate some loss.
- Timing. Time sensitive applications (e.g. Internet telephone, interactive games) require low delay to be effective.
- Throughput (How fast the data can be transferred). Some applications require a minimum amount to be effective.
- Security. Encryption.
| Application | Data Loss | Throughput | Time Sensitive |
| --- | --- | --- | --- |
| File Transfer | No | Elastic | No |
| E-Mail | No | Elastic | No |
| Web Documents | No | Elastic | No |
| Real-Time AV | Tolerant | Minimum requirements | Yes, 100’s ms |
| Stored AV | Tolerant | Minimum requirements | Yes, few seconds |
| Interactive Games | Tolerant | Minimum requirements | Yes, 100’s ms |
| Text Messaging | No | Elastic | Yes and no |
Internet transport protocol services
What is offered by the transport layer? Which protocol should I use? There are two main protocols …
- TCP (Transmission Control Protocol)
- Reliable transport between sending and receiving process. Everything will get there without errors and in order.
- Flow control: Sender won’t overwhelm receiver.
- Congestion control: Throttle sender when network is overloaded.
- Does not provide: Timing, minimum throughput guarantee, security.
- Connection-oriented: Setup required between client and server processes.
- UDP (User Datagram Protocol)
- Unreliable data transfer: No data integrity or reliability between sending and receiving process.
- Does not provide: Reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup.
Why bother with UDP?
- UDP is faster than TCP. There is no connection setup.
- There is less header overhead (a smaller header).
| Application | Application-Layer Protocol | Underlying Transport Protocol |
| --- | --- | --- |
| Email | SMTP | TCP |
| Remote Terminal Access | Telnet | TCP |
| Web | HTTP | TCP |
| File Transfer | FTP | TCP |
| Streaming Multimedia | HTTP, RTP | TCP or UDP (usually UDP) |
| Internet Telephone | SIP, RTP | TCP or UDP |
Securing TCP
TCP and UDP have no encryption. Plaintext passwords are sent into sockets.
If you want security, you can use SSL (Secure Sockets Layer), which provides encryption for TCP connections. SSL is used at the application layer (you need to use an SSL library such as OpenSSL).
2.2: Web and HTTP
A web page consists of objects (HTML, JPEG, Java applet, audio file, …).
- Web page consists of base HTML-file which includes several referenced objects.
- Each object is addressable by a URL (Uniform Resource Locator).
HTTP (hypertext transfer protocol): We have a client / server model. The client sends requests to the server and the server replies with an HTTP response.
- HTTP uses TCP. The client initiates TCP connection (creates socket) to server, port 80. (A random port is used on the client side).
- Server accepts TCP connection from client.
- HTTP messages (application-layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server).
- TCP connection closed.
HTTP is “stateless”. The server maintains no information about past client requests.
Protocols that maintain “state” are complex. Past history must be maintained and if the client / server crashes, their views of the “state” may be inconsistent.
Types of HTTP Connections
- Non-Persistent HTTP
- At most one object sent over TCP connection.
- Downloading multiple objects requires multiple connections.
- Server is free more often to serve other users.
- Persistent HTTP
- Multiple objects sent over a single TCP connection.
- Avoid excessive overhead when requesting multiple objects.
Response Time
RTT (Round Trip Time): Time it takes a small packet to travel from client to server and back.
HTTP Response Time
- One RTT to initiate TCP
- One RTT for HTTP request and first few bytes of HTTP response in return.
- File transmission time (transmission delay).
Non-Persistent HTTP Response Time = 2 RTT + transmission time (per object).
Persistent HTTP Response Time (n objects, no pipelining) = RTT + n(RTT + transmission time).
For non-persistent HTTP, we require 2 RTT per object. There is OS overhead for each TCP connection. Browsers often open parallel TCP connections to fetch referenced objects.
For persistent HTTP, subsequent HTTP messages between client / server are sent over the same connection, so the client sends a request as soon as it encounters a referenced object. There is as little as one RTT per referenced object.
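A small sketch that plugs assumed numbers into the two response-time formulas above (the RTT, per-object transmission time, and object count are made up), treating n as the total number of objects fetched.

```python
# Sketch: total response time for fetching n objects using the formulas above.
RTT = 0.1        # round trip time, sec (assumed)
t_trans = 0.01   # transmission time per object, sec (assumed)
n = 10           # total number of objects fetched (assumed)

# Non-persistent: every object costs 2 RTT + transmission time (new TCP connection each time).
non_persistent = n * (2 * RTT + t_trans)

# Persistent, no pipelining: one RTT to set up the connection,
# then RTT + transmission time per object.
persistent = RTT + n * (RTT + t_trans)

print(f"non-persistent: {non_persistent:.2f} s, persistent: {persistent:.2f} s")
```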
HTTP Request
- Request Line (method URL version).
- Header Lines (header_field_name value).
HTTP/1.0
- GET (request data).
- POST (send data).
- HEAD (leave requested object out of response).
HTTP/1.1
- PUT (uploads file in entity body to path specified in URL).
- DELETE (delete files specified in URL).
HTTP Response
- Status Line (protocol status_code status_phrase).
- Header Lines (content_length).
- Data Requested (e.g. HTML file).
Cookies
Used to identify users. The ID is stored on the client side and is sent in the header of subsequent messages.
- Cookie header line in HTTP response from server.
- Cookie header line in next HTTP request from client.
- Cookie file kept client-side.
- Back-end database for server.
Cookies are used to keep state.
Web Caches (Proxy Server)
Satisfy client requests without involving the origin server.
All requests are sent through a proxy server instead, which caches requests and responses. If the request is already cached, it does not have to travel all the way to the origin server.
- Typically the cache is installed by ISP.
Example:
Average object size: 100\ K bits.
Average request rate: 15\ /\ sec.
RTT from institutional router to origin server: 2\ sec
Access link rate: 1.54\ Mbps
The average data rate to browsers is 15/sec \cdot 100\ Kb = 1.5\ Mbps.
LAN utilization: \frac{1.5\ Mbps}{1\ Gbps} = 0.15\%
Access link utilization: \frac{1.5\ Mbps}{1.54\ Mbps} = 97.4\%
Total Delay = Internet Delay + Access Delay + LAN Delay
Access Delay: \frac{100\ Kb}{(1 - 0.974) \cdot 1.54\ Mbps} \approx 2.5\ sec. (New requests can only use whatever is left of the access link.)
LAN Delay: \frac{100\ Kb}{(1 - 0.0015) \cdot 1\ Gbps} \approx 0.0001\ sec
Total Delay \approx 2 + 2.5 + 0.0001 \approx 4.5\ sec
Solutions:
Get a faster access link. If the access link rate is increased to 154\ Mbps, the access link delay drops to about 0.0007\ sec, so the total delay is only about 2 sec.
Caching. If the cache hit rate is 0.4, 40% of requests are satisfied at the cache. The data rate to browsers over the access link is 0.6 \cdot 1.50\ Mbps = 0.9\ Mbps, so its utilization becomes 0.58. Total Delay = 0.6 \cdot (delay from origin servers) + 0.4 \cdot (delay when satisfied by the cache) = 0.6(2.15) \approx 1.29\ sec.
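The cache example above can be reproduced with a short sketch, using the same access-delay approximation L / ((1 - utilization) · R) from the notes; the 0.4 hit rate is the assumed value from the example.

```python
# Sketch reproducing the caching example above.
obj_size = 100e3          # bits per object
req_rate = 15             # requests / sec
internet_delay = 2.0      # sec, institutional router -> origin server
access_rate = 1.54e6      # bps access link
lan_rate = 1e9            # bps LAN

def access_delay(rate_bps, load_bps):
    util = load_bps / rate_bps
    return obj_size / ((1 - util) * rate_bps)   # approximation used in the notes

load = obj_size * req_rate                      # 1.5 Mbps offered load
no_cache = internet_delay + access_delay(access_rate, load) + access_delay(lan_rate, load)

hit_rate = 0.4                                  # assumed cache hit rate from the example
miss_load = (1 - hit_rate) * load               # only misses cross the access link
miss_delay = internet_delay + access_delay(access_rate, miss_load) + access_delay(lan_rate, load)
hit_delay = access_delay(lan_rate, load)        # hits are served over the LAN only
with_cache = (1 - hit_rate) * miss_delay + hit_rate * hit_delay

print(f"total delay without cache: {no_cache:.2f} s, with cache: {with_cache:.2f} s")
```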
Conditional GET
- If the cache has an up-to-date version, the origin server does not resend the object.
- The cache sends a GET with the If-Modified-Since header.
- Response code 304 Not Modified if the cached version is still current.
- Response code 200 OK carries the updated object.
2.3: DNS
Domain Name System.
Internet uses 32-bit IP address to identify hosts connected to the internet. The public cannot remember the real address so we use easy to remember names.
- Distributed Database: Implemented in hierarchy of many name servers
- Main function is to resolve names into addresses (IP address).
For complex objects, we push complexity to the edge (host systems).
DNS Services
- Hostname to IP address translation.
- Host aliasing.
- Mail server aliasing.
- Load distribution.
- Many IP addresses correspond to the same name.
Why Not Centralize DNS?
Not scalable.
- Single point of failure.
- Traffic volume.
- Distant centralized database.
- Maintenance.
Hierarchy
- Bottom Tier: amazon.com, umass.edu.
- Middle Tier: com DNS servers, edu DNS servers.
- Top Tier: Root DNS servers.
If the client wants the IP for www.amazon.com.
- Queries root domain server to find com DNS server.
- Queries .com DNS server to get amazon.com DNS server.
- Queries amazon.com DNS server to get IP address for www.amazon.com.
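In practice the whole hierarchy is hidden behind the OS resolver. A minimal sketch using Python’s standard library (requires network access; the hostname is just an example):

```python
# Sketch: resolving a hostname to IP addresses via the OS resolver,
# which in turn queries the local DNS server / hierarchy.
import socket

host = "www.amazon.com"
# Each result tuple is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
results = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
addresses = sorted({sockaddr[0] for *_, sockaddr in results})
print(host, "->", addresses)
```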
TLD Servers
- Responsible for com, org, etc.
- Network Solutions maintains servers for .com TLD
- Educause for .edu TLD
Authoritative DNS Servers
- Organization’s own DNS server(s), providing authoritative hostname to IP mappings for organization.
Local DNS Name Server
- Each ISP has one, acts as a proxy and forwards query to hierarchy.
If the host at cis.poly.edu wants IP address for gaia.cs.umass.edu.
Iterated Query:
- The contacted server replies with the name of the next server to contact.
- “I don’t know this name, but ask this server.”
Recursive Query:
- Puts the burden of name resolution on the contacted name server.
- Can cause heavy load at the upper levels of the hierarchy.
Once (any) name server learns the mapping, it caches mapping.
- Cache entries timeout after some time (TTL).
- TLD servers typically cached in local name servers.
- Cached entries may be out-of-date until everyone’s TTLs expire.
DNS Records
Distributed database storing resource records (RR).
RR Format: (Name, Value, Type, TTL).
- type = A.
- name is hostname.
- value is IP address.
- (www.amazon.com, 220.180.90.10, A, TTL).
- type = NS.
- name is domain (e.g. foo.com).
- value is hostname of authoritative name server for this domain.
- (uwaterloo.ca, dns.uwaterloo.ca, NS, TTL).
- type = CNAME.
- name is alias name for some “canonical” (the real) name.
- value is canonical name.
- (www.ibm.com, servereast.backup2.ibm.com, CNAME, TTL).
- type = MX.
- value is the name of the mail server associated with name.
DNS Protocol, Messages
Query and reply messages have the same message format.
Message Header:
Chapter 2 slides have a better visualization of the format.
- Identification (2 bytes): 16-bit number for the query; the reply uses the same number.
- Flags (2 bytes):
- Query or reply.
- Recursion desired.
- Recursion available.
- Reply is authoritative.
- Questions: Name, type fields for a query.
- Answers: RRs in response to a query.
- Authority: Records for authoritative servers.
- Additional Info: Additional “helpful” info that may be used.
Attacking DNS
- DDoS Attacks.
- Bombard root servers with traffic.
- Not successful to date.
- Traffic filtering (not all traffic provided to root servers).
- Local DNS servers cache IPs of TLD servers, allowing the root servers to be bypassed.
- Bombard TLD servers.
- Potentially more dangerous.
- Redirect Attacks.
- Man-in-middle.
- Attacker intercepts query and returns a modified response. They can redirect you to their own websites.
- DNS poisoning.
- Inserts an incorrect record to the DNS server.
- Exploit DNS for DDoS.
- Use already existing DNS to attack.
- Send a DNS query and claim it is from the victim.
- The victim is busy processing all of the responses.
- Requires amplification (difficult to launch because you need to query many servers at the same time).
3: Transport Layer
3.1: Transport-Layer Services
- Provides logical communication between a process on the source host and a process on the destination host.
- Transport protocols run in end systems.
- Source side: breaks app messages into segments, passes into network layer.
- Receiving side: Reassembles segments into messages, passes to the application layer.
- More than one transport protocol available to apps.
Transport Layer vs. Network Layer
- Network layer: Logical communication between hosts.
- Transport layer: Logical communication between processes.
- Relies on enhanced network layer services.
Household Analogy
12 kids in Ann’s house sending letters to 12 kids in Bill’s house.
- hosts: houses.
- processes: kids.
- app messages: letters in envelopes.
- transport protocol: Ann and Bill, who demux to in-house siblings.
- network protocol: postal service.
Internet Transport-Layer Protocols
- Reliable, in-order delivery (TCP).
- Congestion control (about the network itself).
- Flow control (sender will not overflow the receiver).
- Connection setup.
- Unreliable, unordered delivery (UDP).
- No-frills extension of best-effort IP.
- Services not available are delay guarantees and bandwidth guarantees.
3.2: Multiplexing / Demultiplexing
Multiplexing at sender: Handle data from multiple sockets, add transport header (later used for demultiplexing).
Demultiplexing at receiver: Use header info to deliver received segments to the correct socket.
How Demultiplexing Works
- Host receives IP datagrams.
- Datagram has a source IP address, destination IP address.
- Datagram carries one transport-layer segment.
- Segment has source, destination port number.
- Host uses IP addresses and port numbers to direct segment to appropriate socket.
Connectionless Demultiplexing
- When creating a datagram to send into UDP socket, we must specify the destination IP address and port number.
- When a host receives a UDP segment, it checks the destination port number and directs the segment to the proper socket.
- IP datagrams that have the same destination port number but different source IP addresses or source port numbers will be directed to the same socket at the destination.
Connection-oriented Demux
- TCP socket identified by a 4 tuple: (source IP address, source port number, destination IP address, destination port number).
- Web servers have different sockets for each connecting client.
- Non-persistent HTTP will have a different socket for each request. Multiple segments will be sent to the host at port 80 but will be demultiplexed to different sockets.
3.3 Connectionless Transport: UDP
When we want the speed to be as fast as possible and we are loss tolerant.
- “No frills”, “bare bones” Internet transport protocol.
- “Best effort” service, UDP segments may be lost or delivered out-of-order to app.
- Connectionless meaning that there is no handshaking between UDP sender and receiver. Each UDP segment is handled independently of others.
- UDP is used in streaming multimedia apps, DNS, SNMP (Simple Network Management Protocol).
- Reliable transfer over UDP can be achieved by adding reliability at the application layer such as application-specific error recovery.
UDP adds a segment header of 64 bits. The first 32 are used for the source port number and the destination port number. The next 16 are used for the length in bytes of the segment (including header), and the final 16 are used for checksum.
UDP Checksum
We want to detect “errors” (e.g. flipped bits) in the transmitted segment.
The sender treats the segment’s contents as a sequence of 16-bit integers. The checksum is the addition (one’s complement sum) of segment contents. The receiver then computes their own checksum and compares. (When adding numbers, a carryout from the msb gets re-added to the result).
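A sketch of the textbook Internet checksum (one’s complement sum of 16-bit words with carry wrap-around); note that a real UDP implementation also sums a pseudo-header, which is omitted here.

```python
# Sketch of the Internet checksum: one's complement sum of 16-bit words,
# with carries wrapped back in, then the result is complemented.

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:                 # pad odd-length data with a zero byte
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        word = (data[i] << 8) | data[i + 1]
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # wrap the carry back in
    return ~total & 0xFFFF

segment = b"ECE 358 "                 # even-length example payload
cksum = internet_checksum(segment)
# Receiver check: the checksum of (data + transmitted checksum) should be 0.
ok = internet_checksum(segment + cksum.to_bytes(2, "big")) == 0
print(hex(cksum), "verifies:", ok)
```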
3.4 Principles of Reliable Data Transfer
- When we have a reliable channel, we do not have to worry about reliable data transfer.
- Characteristics of unreliable channels will determine the complexity of reliable data transfer protocol (rdt).
rdt_send() is called from the application layer whenever we have data to send; it calls udt_send() to transfer the packet over the unreliable channel to the receiver. The packet is received by rdt_rcv(), which passes it to the protocol at the transport layer. After some processing, it is passed up to the application layer through deliver_data().
RDT 1.0: Reliable transfer over a reliable channel.
Underlying channel is perfectly reliable. There are no bit errors and no loss of packets.
We have separate Finite State Machines (FSMs) for sender, receiver. The sender sends data into the underlying channel and the receiver reads data from the underlying channel.
RDT 2.0: Channel with Bit Errors
Underlying channel may flip bits in packet which is detected by the checksum.
- ACKs (acknowledgements): The receiver explicitly tells sender that the packet was received ok.
- NAKs (negative acknowledgements): The receiver explicitly tells sender that the packet has errors.
- Sender retransmits packet on receipt of NAK.
- The sender needs to add a checksum to the data and will wait for ACK or NAK.
- There is a fatal flaw because if ACK / NAK is corrupted, the sender does not know what happened at the receiver.
RDT 2.1
- The sender retransmits the current packet if ACK / NAK is corrupted and adds a sequence number to each packet.
- We only need 2 sequence numbers (0, 1) so that we can distinguish between consecutive messages.
- The receiver discards duplicate packets.
Stop and Wait protocol: sender sends 1 packet and then waits for receiver response.
RDT 2.2: NAK-free
- We can have less state transitions if we only use ACKs.
- The receiver can send a packet number as well.
- Duplicate ACK at sender results in the same action as NAK: retransmit the current packet.
RDT 3.0: Channels with Errors and Loss
Now the underlying channel can also lose packets (data, ACKs).
- Checksum, sequence number, ACKs, retransmission will be of help … but it is not enough.
The sender will wait a “reasonable” amount of time for ACK. It will retransmit if no ACK is received in time.
- If a packet or ACK is delayed (not lost), the retransmission would be a duplicate, but we already handle this with sequence numbers.
- This approach works but the performance is bad.
- Example: 1 Gbps link, 15 ms propagation delay, 8000 bit packet.
D_{trans} = \frac{L}{R} = \frac{8000\ bits}{10^9\ bits/sec} = 8\ \mu s.
U_{sender}: utilization, the fraction of the time the sender is busy sending.
U_{sender} = \frac{L/R}{RTT + L/R} = \frac{0.008}{30.008} = 0.00027.
- Throughput is just the utilization multiplied by the link rate.
- The solution is to transfer more packets!
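The utilization numbers above can be reproduced with a few lines; the window size W in the pipelined case is an assumed value to show how much pipelining helps.

```python
# Sketch: sender utilization for stop-and-wait vs. a pipelined window of W packets,
# using the example numbers above (1 Gbps link, 15 ms one-way propagation, 8000-bit packets).

L = 8000            # bits per packet
R = 1e9             # bits per second
RTT = 2 * 15e-3     # seconds

d_trans = L / R                                   # 8 microseconds
u_stop_and_wait = d_trans / (RTT + d_trans)       # ~0.00027

W = 1000                                          # window size (assumed, for illustration)
u_pipelined = min(1.0, W * d_trans / (RTT + d_trans))

print(f"stop-and-wait utilization: {u_stop_and_wait:.5f}")
print(f"pipelined (W={W}) utilization: {u_pipelined:.3f}")
print(f"stop-and-wait throughput: {u_stop_and_wait * R / 1e3:.0f} kbps")
```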
Pipelined Protocols
Sender allows multiple “in-flight” packets.
- The range of sequence numbers must be increased.
- There is buffering at the sender and / or receiver.
Go-Back-N
- The sender can have up to N unACKed packets in a pipeline.
- The receiver only sends a cumulative ACK (doesn’t ACK packet if there is a gap).
- Sender has a timer for the oldest unACKed packet. When the timer expires, it retransmits all unACKed packets.
- Any out of order packet (at the receiver) or duplicate ACK (at the sender) will be discarded.
Selective Repeat
- The sender can have up to N unACKed packets in the pipeline.
- The receiver sends individual ACK for each packet.
- Sender maintains a timer for each unACKed packet and only transmits that specific packet when the timer expires.
Sender
- Data from above: If next available sequence number in window, send packet.
- Timeout(n): Resent packet n, restart timer.
- ACK(n) (n \in [send\_base, send\_base + N - 1]): Mark packet n as received, if n is the smallest unACKed packet, advance the window base to the next unACKed sequence number.
Receiver
- Packet n \in [rcv\_base, rcv\_base + N - 1]: Send ACK(n). If n is out of order we need to buffer. If n is in order, we can deliver until the next not received packet.
- Packet n \in [rcv\_base - N, rcv\_base - 1]: ACK(n) (We need to acknowledge duplicate packets to move the sender’s window).
Selective Repeat Dilemma: As the windows slide, the receiver may accidentally accept old (retransmitted) data as new, because the receiver knows nothing about the sender’s window. We can fix this by setting the window size to at most half of the sequence number space.
No Errors: \rho = \min\left(1, \frac{W \cdot L/C}{t_T}\right), where W is the window size, L/C is the transmission time of one packet, and t_T is the time from the start of a packet’s transmission until its ACK returns.
Automatic Repeat Request (ARQ) Protocols
Applied at the transport and link layers.
- Stop-and-Wait Protocol.
- Go-Back-N.
- Selective Repeat.
3.5 Connection-Oriented Transport: TCP
- Connection-Oriented: Handshaking (exchanging control messages), initializes sender and receiver state before data exchange.
- Point-To-Point: One sender, one receiver.
- Reliable, In-Order Byte Stream: No “message boundaries”, byte by byte.
- Full Duplex Data: Bidirectional. MSS: Maximum Segment Size.
- Pipelined: TCP congestion and flow control set window size.
- Flow Controlled: Sender will not overwhelm receiver.
TCP Segment Structure
- 16 bits for source port number, 16 for destination port number.
- 32 bits for sequence number (for the bytes).
- Acknowledgement number (for the bytes).
- Header length, unused bits, and flags (URG: urgent data, ACK: ACK # valid, PSH: push data now; RST, SYN, FIN: connection management).
- Receive window for the number of bytes the receiver is willing to accept.
- 16 bits for checksum, URG data pointer for the last byte of urgent data (16 bits).
- Options (variable length).
- Application data (variable length).
Default header length is 20 bytes compared to UDP 8 bytes.
Acknowledgements include the sequence number of the next byte expected from the other side (cumulative ACK). The TCP spec doesn’t say how to handle out-of-order segments, it is up to the implementer. There are two choices, discard or buffer. Buffering is used in practice.
TCP Round Trip Time, Timeout
How to set TCP timeout value?
It should be longer than the RTT, but the RTT varies. If it is too short, we will have unnecessary retransmissions. If it is too long, we will have slow reactions to segment loss.
We estimate RTT by sampling from segment transmissions. SampleRTT is the measured time from segment transmission to ACK receipt (ignoring retransmissions). We use an exponential weighted moving average, EstimatedRTT = (1 - \alpha)\cdot EstimatedRTT + \alpha \cdot SampleRTT. The typical value is \alpha = 0.125.
Is this estimate enough? We also want a safety margin: a large variation in EstimatedRTT should result in a larger safety margin, so we estimate the RTT deviation and add four times it. DevRTT = (1 - \beta)\cdot DevRTT + \beta \cdot |SampleRTT - EstimatedRTT|. The typical value is \beta = 0.25.
The timeout interval is then TimeoutInterval = EstimatedRTT + 4 \cdot DevRTT.
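A short sketch of the timeout computation above; the initial values and the SampleRTT trace are made up, and show how a single large sample inflates both EstimatedRTT and DevRTT.

```python
# Sketch of the TCP timeout estimation (EWMA of SampleRTT plus 4x deviation).
alpha, beta = 0.125, 0.25
estimated_rtt = 0.100      # initial estimate, sec (assumed)
dev_rtt = 0.010            # initial deviation, sec (assumed)

for sample_rtt in [0.105, 0.120, 0.095, 0.300, 0.110]:   # measured RTT samples (made up)
    estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
    dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
    timeout = estimated_rtt + 4 * dev_rtt
    print(f"sample={sample_rtt:.3f}  EstimatedRTT={estimated_rtt:.3f}  "
          f"DevRTT={dev_rtt:.3f}  TimeoutInterval={timeout:.3f}")
```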
TCP Reliable Data Transfer
TCP creates an RDT service on top of IP’s unreliable service. It uses pipelined segments, cumulative ACKs, and a single retransmission timer (similar to Go-Back-N). Retransmissions are triggered by timeouts or duplicate ACKs.
Let’s initially consider a simplified TCP sender with no duplicate ACKs and ignoring flow / congestion control.
TCP Sender Events:
Three major events in sender.
- Data received from application layer.
- Create segment with sequence number.
- Sequence number is byte-stream number of the first data byte in the segment.
- Start timer if not already running.
- Timeout.
- Retransmit segment that caused timeout.
- Restart timer.
- ACK received.
- If ACK acknowledges previously unACKed segments, we update and then start timer if there are still unACKed segments.
TCP ACK Generation
| Event at Receiver | TCP Receiver Action |
| --- | --- |
| Arrival of in-order segment with expected sequence number. All data up to the expected sequence number already ACKed. | Delayed ACK. Wait up to 500 ms for the next segment. If no next segment, send ACK. |
| Arrival of in-order segment with expected sequence number. One other segment has an ACK pending. | Immediately send a single cumulative ACK, ACKing both in-order segments. |
| Arrival of out-of-order segment (higher than expected sequence number). Gap detected. | Immediately send a duplicate ACK, indicating the sequence number of the next expected byte. |
| Arrival of segment that partially or completely fills the gap. | Immediately send an ACK, provided that the segment starts at the lower end of the gap. |
TCP Fast Retransmit
If you receive 3 duplicate ACKs, it is an indicator that a packet has been lost. The sender will retransmit immediately instead of waiting for the timeout.
TCP Flow Control
Receiver controls sender, so that the sender won’t overflow receiver’s buffer by transmitting too much, too fast.
The application may remove data from TCP socket buffers more slowly than the TCP receiver is delivering it (i.e. than the sender is sending).
- Receiver “advertises” free buffer space by including rwnd value in the TCP header of receiver-to-sender segments.
- RcvBuffer size is set via socket options (typical default is 4096 bytes).
- Many operating systems auto adjust RcvBuffer.
- Sender limits the amount of unACKed (in-flight) data to the receiver’s rwnd value.
- This guarantees that the buffer will not overflow.
Connection Management
Before exchanging data, the sender and receiver “handshake” (agree to establish connection, agree on connection parameters).
3.6 Principles of Congestion Control
Too many sources sending too much data too fast for the network to handle.
- Different from flow control.
- Manifests as lost packets (buffer overflow at routers) and long delays (queueing in router buffers).
Causes / Costs of Congestion
- Infinite buffer, 2 senders and 2 receivers use the same link with an output link capacity of R.
- Maximum per-connection throughput: \frac{R}{2}.
- Large delay as the arrival rate \lambda_{in} approaches \frac{R}{2}.
- Finite buffer. Input and output rates are the same.
- Transport-layer input includes retransmissions: \lambda_{in}^\prime \ge \lambda_{in} (offered load).
- Multiple routers.
- One connection may take up all of the routers’ time.
- More routers involved in delivering data means more wasted network efforts. This is the reason for sudden drop on throughput.
- When packet is dropped, any “upstream” transmission capacity is wasted!
Approaches to Congestion Control
Two broad approaches.
- End-End Congestion Control.
- No explicit feedback from network.
- Congestion inferred from end-system observed loss, delay.
- Approach taken by TCP.
- Network-Assisted Congestion Control.
- Routers provide feedback to end systems.
- Single bit indicating congestion.
- Explicit rate for sender to send at.
3.7 TCP Congestion Control
Additive increase, multiplicative decrease (AIMD).
- Sender increases transmission rate (window size), probing for usable bandwidth, until loss occurs.
- Increase cwnd by 1 MSS (maximum segment size) every RTT until loss is detected.
- Cut cwnd in half after loss.
- Recall that the sender limits transmission.
- LastByteSent - LastByteAcked \le \min(cwnd, rwnd).
- TCP sending rate is roughly sending cwnd bytes, then waiting an RTT for ACKs.
- \frac{cwnd}{RTT} bytes / sec.
TCP Slow Start
- Initially cwnd is 1 MSS and is doubled for every RTT.
- This is done by incrementing cwnd for every ACK received.
- Initial rate is slow but ramps up exponentially fast.
Detecting, Reacting to Loss
- Loss indicated by timeout.
- cwnd set to 1 MSS.
- Window then grows exponentially (as in slow start) to the threshold, then grows linearly.
- Loss indicated by 3 duplicate ACKs (TCP RENO).
- Duplicate ACKs indicate that the network is capable of delivering some segments.
- cwnd is cut in half and then the grows linearly.
- TCP TAHOE always sets cwnd to 1 MSS (on timeout or 3 duplicate ACKs).
Switching from Slow Start to CA (Congestion Avoidance)
When should the exponential increase switch to linear?
When cwnd gets to \frac{1}{2} of its value before timeout.
Implemented with a variable ssthresh. On a loss event, ssthresh is set to \frac{1}{2} of cwnd before loss event.
TCP Throughput
- Average TCP throughput as a function of window size, RTT?
- Ignore slow start, assume we always have data to send.
- Throughput fluctuates between \frac{W}{2\,RTT} (just after a loss, when the window is halved) and \frac{W}{RTT} (just before a loss).
- Average window size is \frac{3W}{4}.
- Average throughput is \frac{3W}{4RTT} bytes / sec.
Example: For an average throughput of 10 Gbps with a 100 ms RTT and 1500-byte segments, set \frac{3W}{4\,RTT} = 10\ Gbps and solve for W; the number of segments in the window is W / 1500.
Throughput in terms of the segment loss probability L: Throughput = \frac{1.22 \cdot MSS}{RTT \cdot \sqrt L}, where MSS is the maximum segment size.
To achieve a 10 Gbps throughput, we need a loss rate of L = 2\cdot10^{-10}, which is a very small loss rate.
We need a new version of TCP for high-speed links. Changing TCP is hard because many implementations are in hardware, which is difficult to change.
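A small sketch that inverts the throughput formula above to get the required loss rate, and solves the 10 Gbps window-size example using the notes’ average-throughput expression 3W / (4·RTT).

```python
# Sketch: loss rate and window size required for a target throughput,
# using the two formulas above. Values match the 10 Gbps example.
from math import sqrt

MSS = 1500 * 8        # bits per segment
RTT = 0.100           # sec
target = 10e9         # 10 Gbps

# Invert Throughput = 1.22 * MSS / (RTT * sqrt(L))  ->  L = (1.22 * MSS / (RTT * target))^2
loss_rate = (1.22 * MSS / (RTT * target)) ** 2
print(f"required loss rate: {loss_rate:.2e}")       # on the order of 2e-10

# Window size needed if average throughput = (3/4) * W / RTT
W_bits = (4 / 3) * target * RTT
print(f"window: {W_bits / 8 / 1500:.0f} segments")
```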
TCP Fairness
Fairness goal: If K TCP sessions share the same bottleneck link of bandwidth R, each should have average rate of \frac{R}{K}.
Two competing sessions will have additive increase which gives slope of 1, as throughput increases. Multiplicative decrease decreases throughput proportionally.
Multimedia apps often do not use TCP because they do not want the rate throttled by congestion control. They send audio / video at constant rate, tolerate packet loss.
This is fair over each connection, but you can cheat the system by having multiple TCP connections.
4: Network Layer
Unlike the application and transport layers, the network layer is implemented in every host and router.
Recall: The transport layer encapsulates the data by breaking it into segments and adding headers.
The network layer further encapsulates segments by adding its own header.
Two Key Functions
- Forwarding: Move packets from router’s input to appropriate router output.
- Routing: Determining route taken by packets from source to destination.
Some network architectures require connection setup. Before datagrams flow, two end hosts and intervening routers establish virtual connection.
Recall: Network layer connection service is between two hosts (may also involve intervening routers in case of VCs). Transport layer connection service is between two processes.
Network Service Model
- Example services for individual datagrams: guaranteed delivery; guaranteed delivery with less than 40 ms delay.
ATM: Asynchronous Transfer Mode
Datagram network provides network-layer connectionless service.
Datagrams include the IP address of the receiver. Virtual circuit cells include a virtual circuit number.
ATM, VC forwarding table includes some info specific to the call between sender and receiver.
VC Implementation
- Path from source to destination.
- VC numbers, one number for each link along path.
- VC number can be changed on each link. The new number comes from forwarding table.
- Entries in forwarding tables in routers along path.
VC numbers change in order to have unique numbers between each router’s connections.
VC: Signaling Protocols
- Used to setup, maintain, teardown VC.
- Used in ATM, frame-relay, X.25
- Not used in today’s Internet.
Datagram Networks
Connectionless.
- No call setup at network layer.
- Routers have no state, they do not know what connection each datagram belongs to.
- Packets are forwarding using only destination host address.
Datagram Forwarding Table
- The destination IP address in the arriving packet’s header is compared to the forwarding table. There are too many IP addresses, so rather than listing individual destination addresses, we list ranges.
What happens when the ranges don’t divide up nicely?
Longest Prefix Match: Match the IP address to an entry in the forwarding table based on the longest prefix that matches.
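A sketch of longest prefix match over a tiny forwarding table; the prefixes and output ports below are made-up examples.

```python
# Sketch: longest prefix match against a small, made-up forwarding table.
import ipaddress

forwarding_table = [
    (ipaddress.ip_network("200.23.16.0/21"), 0),
    (ipaddress.ip_network("200.23.24.0/24"), 1),
    (ipaddress.ip_network("200.23.24.0/21"), 2),
    (ipaddress.ip_network("0.0.0.0/0"), 3),       # default route
]

def lookup(dest: str) -> int:
    addr = ipaddress.ip_address(dest)
    # Among all matching entries, pick the one with the longest prefix.
    matches = [(net.prefixlen, port) for net, port in forwarding_table if addr in net]
    return max(matches)[1]

print(lookup("200.23.24.10"))   # matches /21 and /24 -> the /24 wins -> port 1
print(lookup("8.8.8.8"))        # only the default route matches -> port 3
```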
Datagram or VC Network?
Datagram (Internet)
- Data exchange among computers. “Elastic” service, no strict timing requirements.
- Many different link types with different characteristics.
- End systems are smart so we push complexity.
VC (ATM)
- Evolved from telephone.
- Strict timing requirements, reliability requirements.
- “Dumb” end systems and complexity is pushed to network.
4.3 Router
Input ports, output ports, high-speed switching fabric, routing processor.
- Running routing algorithms / protocols (RIP, OSPF, BGP).
- Routing processor.
- Update forwarding table as a result which is pushed to the input ports.
- Forward datagrams from incoming to outgoing link.
Every router has three layers (network, link, physical).
Input port processing: Line termination -> Link layer protocol -> Lookup, forwarding, queueing.
- Physical layer: bit-level reception.
- Data link layer: e.g. Ethernet.
- Decentralized switching: given the datagram destination, look up the output port using the forwarding table in input port memory.
Switching Fabrics
- Switching rate: rate at which packets can be transferred from inputs to outputs.
- N inputs, switching rate N times line rate is desirable to avoid buffering.
- Three types of switching fabrics (memory, bus, crossbar).
Switching via Memory
First generation routers.
- Traditional computers with switching under direct control of CPU.
- Packet copied to system’s memory.
- Speed limited to memory bandwidth (only 2 bus crossings per datagram).
- One packet maximum on the bus at a time.
Switching via Bus
- Datagram from input port memory to output port memory via a shared bus.
- Bus contention: Switching speed is limited by bus bandwidth.
- 32 Gbps bus, Cisco 5600, sufficient speed for access and enterprise routers.
We can still only send one packet on the bus at a time.
Switching via Interconnection Network
- Overcome bus bandwidth limitations.
- Banyan networks, crossbar, or other interconnected nets initially developed to connect processors in multiprocessor.
- Advanced design, fragmenting data into fixed length cells, switch cells through the fabric.
- Cisco 12000 switches 60 Gbps through the interconnection network.
Output Ports
- Buffering required if the fabric has a faster rate than the output line speed. Datagram can be lost due to congestion, lack of buffers.
- Scheduling datagrams, priority scheduling, who gets best performance, network neutrality.
How Much Buffering?
- RFC 3439 rule of thumb is that the average buffering should be equal to “typical” RTT multiplied by the link capacity C.
- This (RTT times link capacity) is the bandwidth-delay product, roughly the maximum amount of data that can be in flight on the link.
Recent recommendation says that with N flows, buffering should be equal to \frac{RTT \cdot C}{\sqrt N}.
Head-of-the-Line (HOL) Blocking: Queued datagram at the front of the queue prevents others in the queue from moving forward.
4.4 IP: Internet Protocol
In the network layer, we have routing protocols, IP protocol, and ICMP protocol (error reporting and router signalling).
IP Fragmentation, Reassembly
- Each network link has a maximum transfer size (MTU).
- Different link types, different MTUs.
- Large IP Datagram is divided within the network and reassembled only at the final destination.
- IP header bits are used to identify and order related fragments.
- Every fragment belonging to a datagram will have the same 16-bit id. Offset is measured in chunks of 8 bytes.
IPV4 Addressing
- IP Address: 32-bit identifier for host, router interface. (X.X.X.X), X \in [0, 256).
- Interface: Connection between host / router and physical link.
- Routers typically have multiple interfaces.
- Host typically has one or two interfaces (e.g. wired Ethernet, wireless 802.11).
Subnets
- IP Address.
- Subnet part, high order bits.
- Host part, low order bits.
- Devices that have the same subnet part can physically reach each other without the need for a router.
- To determine the subnets, detach each interface from its host or router and find the disjoint networks.
- Subnets used to be classified into A, B, and C where the subnet has 8, 16, and 24 bits respectively.
CIDR (Classless InterDomain Routing)
- Subnet portion of an address has an arbitrary length.
- Address format is a.b.c.d/x, where x is the number of bits in the subnet portion of address.
- In every subnet, we need to reserve two special IP addresses, when the host part is all 0s or all 1s.
- All 0s, identifier, acts as a wildcard for clients who do not have an IP address yet.
- All 1s, broadcast.
IP Address: How To Get One?
- Hard-coded by the system admin in a file.
- DHCP (Dynamic Host Configuration Protocol), dynamically get address from a server.
DHCP Overview
- Host broadcasts DHCP discover message (optional).
- DHCP server responds with DHCP offer message (optional).
- Host requests an IP address: DHCP request message.
- DHCP server sends address, DHCP ack.
- Client sends to port 67, server sends to port 68.
DHCP can return more than just the allocated IP address on subnet.
- Address of first-hop router for client.
- Name and IP address of DNS server.
- Network mask (indicating network versus host portion of address).
By doing bitwise & between a host IP address and subnet mask, you will get the subnet portion of the host IP address.
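A sketch of that bitwise AND, plus the same computation via Python’s ipaddress module; the host address and prefix length are assumed example values.

```python
# Sketch: extract the subnet part of an address with a bitwise AND against the subnet mask.
import ipaddress

host = "192.168.5.130"   # example host address (assumed)
prefix_len = 26          # a.b.c.d/26 -> 26-bit subnet part (assumed)

mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
host_int = int(ipaddress.ip_address(host))
subnet_int = host_int & mask                       # bitwise AND keeps the subnet bits
print(ipaddress.ip_address(subnet_int))            # 192.168.5.128

# Equivalent, letting the library do the work:
net = ipaddress.ip_network(f"{host}/{prefix_len}", strict=False)
print(net.network_address, net.broadcast_address)  # all-0s and all-1s host parts
```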
How does network get subnet part of IP address? It gets an allocated portion of its provider ISP’s address space.
Hierarchical Addressing
Hierarchical addressing allows for efficient advertisement of routing information. Longest prefix match is used.
How does an ISP get a block of addresses? ICANN (Internet Corporation for Assigned Names and Numbers) allocates addresses, manages DNS, assigns domain names, and resolves disputes.
NAT (Network Address Translation)
A household with one router has only one public IP address. There may be multiple devices that need to connect, so the router translates between local IP addresses and the outside world.
All datagrams leaving the network have same single source NAT IP address.
Motivation: local network uses just one IP address as far as the outside world is concerned.
- Range of addresses is not needed from ISP.
- Local network addresses can be changed without notifying the outside world.
- ISP can be changed without changing addresses of devices in local network.
- Devices inside local network are not explicitly addressable which is a security plus.
Implementation: a NAT router must.
- Outgoing datagrams: replace (source IP address, port number) of every outgoing datagram to (NAT IP address, new port number).
- Remember (in NAT translation table) every (source IP address, port number) to (NAT IP address, new port number) translation pair.
- Incoming datagrams: replace (NAT IP address, new port number) in the destination fields of every incoming datagram with the corresponding (source IP address, port number) stored in the NAT table.
NAT has a 16-bit port number field which allows for 60000 simultaneous connections with a single LAN-side address!
NAT is controversial.
- Routers should only process up to layer 3.
- Violates end-to-end argument.
- NAT possibility must be taken into account by application designers, e.g. P2P applications.
- Address shortage should instead be solved by IPv6.
NAT Traversal Problem
- Client wants to connect to server with address X.
- Server address local to LAN.
- Only one externally visible NATed address.
- Solution 1: Statically configure NAT to forward incoming connection requests at given port to server.
- Solution 2: Universal Plug and Play (UPnP) Internet Gateway Device (IGD) Protocol. Allows a NATed host to:
- Learn public IP address.
- Add / remove port mappings with lease times.
- Solution 3 (used in Skype).
- NATed client establishes a connection to a relay.
- External client connects to relay.
- Relay bridges packets between connections.
ICMP (Internet Control Message Protocol)
- Used by hosts and routers to communicate network level information such as error reporting.
- Network-layer is “above” IP, ICMP messages are carried in IP datagrams.
- ICMP Message: type, code plus first 8 bytes of IP datagram causing error.
Traceroute and ICMP
We want to know which routers a packet passes through and the delay to each.
- Source sends a series of UDP segments to dest.
- First set has TTL = 1
- Second set has TTL = 2, etc.
- Unlikely port number is used.
- When the nth set of datagrams arrives at the nth router:
- Router discards datagrams and sends source ICMP messages (type 11, code 0).
- ICMP messages include name of router and IP address.
- When ICMP messages arrive, source records RTTs.
Stopping Criteria:
- UDP segment eventually arrives at destination host.
- Destination returns ICMP “port unreachable” message (type 3, code 3).
- Source stops.
IPv6
- Initial motivation: 32-bit address space is soon to be completely allocated (assigned in 2013).
- Additionally, header format helps to speed processing / forwarding and header changes to facilitate QoS.
- Fixed-length 40 byte header.
- 128-bit addressing, enough for every grain of sand to have an IP address.
- No fragmentation allowed.
- Priority: identify priority among datagrams in flow.
- Flow Label: identify datagram in same “flow” (concept of flow is not well defined).
- Next Header: identify upper layer protocol for data.
Changes from IPv4
- Checksum: removed entirely to reduce processing time at each hop. We already have error detection at the transport and link layers.
- Options: allowed, but outside of header, indicated by “Next Header” field.
- ICMPv6: new version of ICMP.
- Additional message types such as “Packet Too Big”.
- Multicast group management functions.
Transition from IPv4 to IPv6
- Not all routers can be upgraded simultaneously. How will networks operate with mixed IPv4 and IPv6 routers?
- Tunnelling: IPv6 datagram carried as payload in IPv4 datagram among IPv4 routers.
IPv6 Adoption
- US National Institute of Standards and Technology (NIST) estimate (2013):
- 3% of industry IP routers.
- 11% of US government routers.
- It takes a long time to update, because we would need to replace every router on the Internet.
4.5 Routing Algorithms
Routing algorithm determines end-end path through the network. Forwarding table determines local forwarding at this router.
Routing Algorithm Classification
Global: All routers have complete topology and cost information (link state algorithms).
Decentralized: Each router knows only its physically-connected neighbours. There is an iterative process of computation (distance vector algorithms).
Static: Routes change slowly over time.
Dynamic: Periodic updates in response to link cost changes.
Link-State Routing Algorithm
Dijkstra’s algorithm is used to compute least cost paths from one node (source) to all other nodes. Tracing predecessor nodes gives a forwarding table for that node.
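A sketch of the link-state computation on a small example topology (the node names and link costs below are illustrative), producing least costs and the next hops that would populate the forwarding table.

```python
# Sketch: Dijkstra's algorithm on a small made-up topology.
import heapq

graph = {            # undirected link costs (example values)
    "u": {"v": 2, "w": 5, "x": 1},
    "v": {"u": 2, "w": 3, "x": 2},
    "w": {"u": 5, "v": 3, "x": 3, "y": 1, "z": 5},
    "x": {"u": 1, "v": 2, "w": 3, "y": 1},
    "y": {"w": 1, "x": 1, "z": 2},
    "z": {"w": 5, "y": 2},
}

def dijkstra(source):
    dist = {source: 0}
    prev = {}
    pq = [(0, source)]
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist.get(node, float("inf")):
            continue                              # stale queue entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(pq, (nd, nbr))
    return dist, prev

def next_hop(prev, source, dest):
    # Walk predecessors back from dest until the node just after the source.
    while prev[dest] != source:
        dest = prev[dest]
    return dest

dist, prev = dijkstra("u")
for dest in sorted(dist):
    if dest != "u":
        print(f"to {dest}: cost {dist[dest]}, next hop {next_hop(prev, 'u', dest)}")
```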
Link-State Problems
Oscillations are possible, e.g. when the link cost is equal to the amount of carried traffic.
Distance Vector Routing Algorithm
Bellman-Ford equation is used to compute least cost paths at each node. The node v that achieves the minimum cost for y is the next-hop in the forwarding table.
d_x(y) = \min_v\{c(x, v) + d_v(y)\}
Each node x knows the cost to each neighbour v and maintains its neighbours’ distance vectors. For each neighbour v, x maintains \mathbf{D}_v = [D_v(y), y \in N]. x also maintains a distance vector for itself.
From time-to-time, each node sends its own distance vector estimate to its neighbours. When x receives a new distance vector estimate from a neighbour, it updates its own distance vector using the Bellman-Ford equation. Under minor, natural conditions, the estimate D_x(y) will converge to the actual least cost d_x(y).
D_x(y) = \min_v\{c(x, v) + D_v(y)\}, \text{ for each node } y \in N
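A sketch of a single distance-vector update at a node x with two neighbours; the link costs and the neighbours’ distance vectors are made-up values.

```python
# Sketch: one Bellman-Ford update at node x:
# D_x(y) = min over neighbours v of { c(x, v) + D_v(y) }.

nodes = ["x", "y", "z"]
cost = {"y": 2, "z": 7}                 # c(x, v): link costs from x to its neighbours

# Distance vectors most recently received from each neighbour: D_v(dest)
neighbour_dv = {
    "y": {"x": 2, "y": 0, "z": 1},
    "z": {"x": 7, "y": 1, "z": 0},
}

def update_dv():
    dv = {}
    for dest in nodes:
        if dest == "x":
            dv[dest] = 0
        else:
            dv[dest] = min(cost[v] + neighbour_dv[v][dest] for v in cost)
    return dv

print(update_dv())   # expected: {'x': 0, 'y': 2, 'z': 3}  (x reaches z through y)
```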
- Iterative, Asynchronous: Local iterations are caused by.
- Local link cost change.
- Distance vector update message from neighbour.
- Distributed: Each node notifies neighbours only when its distance vector changes.
Distance Vector Problems
When link cost changes, the node needs to update routing information, and recalculate the distance vector. If the distance vector changes, it must notify neighbours.
Bad news travels slow. There can be a routing loop between two nodes if a cost update reverses the minimum cost path. In order to broadcast the distance vector to its neighbours, nodes will keep passing information back to each other and updating distance vectors. Counting to infinity problem.
We use poisoned reverse to solve the problem of routing loops: if Z routes through Y to get to X, Z tells Y that its distance is D_z(x) = \infty. This does not completely solve the problem, since loops involving more than two nodes would require poisoning along the entire path.
Link-State vs. Distance Vector
- Message Complexity.
- LS: with n nodes, E links, O(nE) messages sent.
- DV: Exchange between neighbours only.
- Speed of Convergence.
- LS: O(n^2) algorithm which requires O(nE) messages. There may be oscillations.
- DV: Convergence time varies. There may be routing loops and count-to-infinity problems.
- Robustness (router malfunction).
- LS: Nodes can advertise incorrect link costs. Each node only computes its own table.
- DV: Nodes can only advertise incorrect path costs. Each node’s table is used by others, so errors will propagate through the network.
Hierarchical Routing
In real life, all routers are not identical, and the network is not flat.
With 600 million destinations, storing all destinations in routing tables would not be scalable. Routing table exchange would swamp links.
The internet is a network of networks, and each network admin may want to control routing in its own network.
- Aggregate routers into regions, autonomous systems (AS).
- Routers in AS run same routing protocol (intra-AS routing protocol).
- Routers in different AS can run different protocols / algorithms.
- Gateway routers are at the edge of their own AS and have links to routers in other ASs.
- The forwarding table is updated by both Intra-AS and Inter-AS routing algorithms.
Inter-AS Tasks
Example: Single AS.
- Suppose router in AS1 receives a datagram destined outside of AS1. The router should forward packet to a gateway router, but which one?
- AS1 must learn which destinations are reachable through AS2, and which are reachable through AS3. AS1 must propagate this reachability information to all routers in AS1.
Example: Multiple ASes.
- Now suppose AS1 learns from the inter-AS protocol that subnet x is reachable from both AS2 and AS3.
- To configure the forwarding table, router 1d must determine which gateway it should forward packets to.
- This is also the job of inter-AS routing protocol.
Hot Potato Routing: We want to get rid of our package as fast as possible, so we choose the gateway that has the lowest local cost.
4.6 Routing in the Internet
Intra-AS Routing
Interior Gateway Protocols (IGP).
- RIP: Routing Information Protocol.
- OSPF: Open Shortest Path First.
- IGRP: Interior Gateway Routing Protocol (Cisco proprietary).
RIP is a distance vector algorithm except that the distance is measured by the number of hops (edge cost of 1), with a maximum of 15 hops.
- DVs are exchanged every 30s in response messages (advertisements).
- Up to 25 destination subnets in each advertisement.
If no advertisement is heard after 180s, the neighbour / link is declared dead.
- Routes via neighbour are invalidated.
- New advertisements are sent to neighbours.
- Neighbours in turn send out new advertisements (if tables are changed).
- We use poison reverse to prevent ping-pong loops.
- Infinite distance is 16 hops.
RIP routing tables are managed by application-level process called route-d (daemon).
- Routers need more than the bottom 3 layers to run RIP.
- Advertisements are sent in UDP packets, periodically repeated.
OSPF (Open Shortest Path First)
- “Open”: Publicly available.
- Uses link state algorithm.
- LS packet dissemination.
- Topology map at each node.
- Route computation using Dijkstra’s algorithm.
- OSPF advertisement carries one entry per neighbour.
- Advertisements are flooded to the entire AS.
- Carried in OSPF messages directly over IP rather than TCP or UDP.
- IS-IS Routing Protocol: Nearly identical to OSPF.
- Security: All messages are authenticated to prevent malicious intrusion.
- Multiple same-cost paths are allowed. Allows for load balancing.
- For each link, multiple cost metrics for different TOS.
- Unicast and multicast support. Unicast is 1-to-1, multicast is 1-to-many.
- Allows for hierarchical OSPF in large domains.
Hierarchical OSPF
- Two-level hierarchy: Local area, backbone.
- Link-state advertisements only done in areas.
- Each node has detailed area topology, only knows direction to networks in other areas.
- Area border routers: “Summarize” distances to nets in own area, and advertise to other area border routers.
- Backbone routers: Run OSPF limited to backbone.
- Boundary routers: Connect to other ASs.
Inter-AS Routing (BGP)
Border Gateway Protocol. De facto inter-domain routing protocol, the glue that holds the Internet together.
- eBGP (external): Obtains subnet reachability from neighbouring ASs.
- iBGP (internal): Propagate reachability information to all AS-internal routers.
- Determines good routes to other networks based on reachability information and policy.
BGP Messages
- Exchanged between peers over TCP connection.
- OPEN: Opens TCP connection to peer and authenticates sender.
- UPDATE: Advertise new path or withdraw old.
- KEEPALIVE: Keep connection alive, ACKs OPEN requests.
- NOTIFICATION: Reports errors in previous messages, also used to close connections.
Path Attributes and BGP Routers
- Advertised prefix includes BGP attributes.
- AS-PATH: Contains ASs through which prefix advertisement has passed.
- NEXT-HOP: Indicates specific internal-AS router to next-hop AS.
- Gateway router receiving route advertisement uses import policy to accept / decline.
BGP Route Selection
We only continue to the next step if there are still ties.
- Local preference value attribute, policy decision.
- Shortest AS-PATH.
- Closest NEXT-HOP (OSPF).
- Additional criteria.
BGP Routing Policy
- Customers will not advertise worse paths for themselves.
- Providers will not advertise links to competitors because they only want to route to / from its customers.
Why Intra-AS and Inter-AS Routing?
- Policy: Inter-AS admins want to control how their traffic is routed; intra-AS is under a single admin, so no policy decisions are needed.
- Scale: Hierarchical routing saves table size and reduces update traffic.
- Performance: Intra will focus on performance, but inter may be dominated by policy.
Entry in Forwarding Table
- Router becomes aware of prefix.
- BGP message contains “routes”. Route is a prefix and attributes, AS-PATH, NEXT-HOP, …
- Router may receive multiple routes for the same prefix. We must select one route using the routine above.
- Use selected route’s NEXT-HOP attribute.
- Router determines output port for prefix.
- Use BGP route selection to find the best inter-AS route.
- Use OSPF to find the best intra-AS route.
- Router identifies the output port for that best route.
- Router enters prefix-port in forwarding table.
4.7 Broadcast Routing
- A simple approach is to transmit N copies to the N destinations using unicast routing. N-way-unicast.
- Uncontrolled flooding: When a node receives a broadcast, it sends a copy to all neighbours (except the one that sent the packet).
- Cycles: Copies of the packet can cycle endlessly (e.g., one copy clockwise and one counterclockwise around a loop).
- Broadcast Storm: When a node is connected to more than two other nodes, it will create and forward multiple copies of the broadcast packet, endless multiplication of broadcast packets.
Controlled Flooding
Node only broadcasts packet if it has not broadcasted the same packet before.
- Sequence-number-controlled Flooding: The source node puts its address and a broadcast sequence number into the broadcast packet (sketched after this list).
- Each node maintains a list of the source address and sequence number of each broadcast packet it has already received.
- When a node receives a broadcast packet, if it has already been seen, then it is dropped. If not, the packet is duplicated and sent to all relevant neighbours.
- Reverse Path Forwarding (RPF): Only forward packet if it arrived on the shortest path between the node and source.
- RPF does not require that a router know the complete shortest path back to the source; it only needs to know which neighbour is its next hop on its own shortest path to the source.
- Spanning Tree: Construct spanning tree and then only forward / make copies along it. Creation is done by having nodes in the network send unicast join messages to the center node and tracking the path until a node already belonging to the spanning tree.
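A minimal sketch of sequence-number-controlled flooding, assuming packets are simple dictionaries and links are just labels; the real protocol state of course lives inside routers.

```python
def make_flood_handler():
    """Each node remembers the (source address, sequence number) pairs of
    broadcast packets it has already forwarded."""
    seen = set()

    def handle(packet, arrival_link, all_links):
        """Return the links the packet should be copied onto (possibly none)."""
        key = (packet["src"], packet["seq"])
        if key in seen:
            return []                                   # duplicate: drop it
        seen.add(key)
        # forward a copy on every link except the one the packet arrived on
        return [link for link in all_links if link != arrival_link]

    return handle

handle = make_flood_handler()
links = ["to_B", "to_C", "to_D"]
print(handle({"src": "A", "seq": 7}, "to_B", links))    # ['to_C', 'to_D']
print(handle({"src": "A", "seq": 7}, "to_C", links))    # [] -- already seen, dropped
```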
5: Link Layer
Responsible for transferring datagram from one node to physically adjacent node over a link.
- Host and routers: nodes.
- Communication channels that connect adjacent nodes across communication path: links.
- Layer-2 packet: frame, encapsulates datagram.
Link Layer Services
- Framing, link access.
- Encapsulates datagram into frame, adding header and trailer.
- Channel access if shared medium.
- “MAC” address used in frame headers to identify source, destination. This is different from IP address!
- Reliable delivery between adjacent nodes.
- Chapter 3. (Go-Back-N, Selective Repeat, …).
- Seldom used on low bit-error links such as fiber.
- Wireless links have higher error rates.
- Flow Control.
- Pacing between adjacent sending and receiving nodes.
- Error Detection.
- Errors caused by signal attenuation, noise.
- Receiver detects presence of errors and signals sender for retransmission or drops frame.
- Error Correction.
- Receiver identifies and corrects bit errors without resorting to retransmission.
- Half-Duplex and Full-Duplex.
- With half duplex, nodes at both ends of the link can transmit, but not at the same time.
Where is Link Layer?
- In each and every host.
- Link layer implemented in “adaptor”, also known as NIC (network interface card), or on a chip.
- Ethernet card, 802.11 card. Ethernet chipset.
- Implements link, physical layer.
- Attaches to the host’s system bus.
- Combination of hardware, software, firmware.
5.2 Error Detection and Correction
- EDC = Error Detection and Correction bits.
- D = Data protected by error checking, may include header fields.
- Error detection is not 100% reliable.
- Protocol may miss some errors, but rarely.
- Larger EDC field yields better detection and correction.
Parity Checking
Parity bits can detect, and in the two-dimensional scheme also correct, single-bit errors.
- Single bit parity: store the parity of the d data bits.
- Two-dimensional bit parity: Store the parity of rows and columns for the d data bits (grouped into words of j bits). A single-bit error is located at the intersection of the row and column whose parity checks fail (see the sketch below).
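A small sketch of two-dimensional even parity, assuming the received block already carries a parity bit at the end of every row and a parity row at the bottom; a single flipped data bit shows up as exactly one failing row and one failing column.

```python
def parity(bits):
    return sum(bits) % 2   # 0 if even parity holds

def locate_single_error(block):
    """Return the (row, col) of a single flipped data bit, or None if all
    row and column parities check out (or the damage is not a single data bit)."""
    bad_rows = [i for i, row in enumerate(block) if parity(row) != 0]
    bad_cols = [j for j, col in enumerate(zip(*block)) if parity(col) != 0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]    # flip this bit to correct the error
    return None

# Two 3-bit data words, with even parity appended to each row and column.
received = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],    # column-parity row
]
received[1][2] ^= 1                   # inject a single bit error
print(locate_single_error(received))  # (1, 2)
```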
Internet Checksum (Review)
Cyclic Redundancy Check
- More powerful error-detection coding.
- View data bits, D, as a binary number.
- Choose r+1 bit pattern (generator), G.
- Goal: choose r CRC bits, R, such that …
- <D, R> exactly divisible by G \mod 2.
- Receiver knows G, divides <D, R> by G. If non-zero remainder, error detected!
- Can detect all burst errors less than r+1 bits.
- Widely used in practice.
Example: D = 101110, G = 1001.
|G| = 4 = r + 1, so r = 3. Shift D left by r bits: 101110000. Modulo-2 division of 101110000 by 1001 leaves remainder R = 011, so the transmitted bits are <D, R> = 101110011.
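The remainder can be computed with XOR-based long division; a minimal sketch using bit strings (a convenience for illustration, not how NIC hardware does it):

```python
def crc_remainder(data_bits, generator):
    """Modulo-2 long division of D shifted left by r bits; the remainder is
    the r-bit field R that gets appended to the data."""
    r = len(generator) - 1
    dividend = list(map(int, data_bits + "0" * r))   # D shifted left by r
    g = list(map(int, generator))
    for i in range(len(data_bits)):                  # only data positions can lead a step
        if dividend[i] == 1:                         # "subtract" (XOR) G aligned here
            for j, bit in enumerate(g):
                dividend[i + j] ^= bit
    return "".join(map(str, dividend[-r:]))

R = crc_remainder("101110", "1001")
print(R)   # '011', so <D, R> = 101110011 is transmitted
# The receiver divides the received 101110011 by 1001; a zero remainder
# means no error was detected.
```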
5.3 Multiple Access Protocols
- Point-to-Point.
- PPP for dial-up access.
- Link between Ethernet switch, host.
- Broadcast.
- Old-fashioned Ethernet.
- Upstream HFC.
- 802.11 wireless LAN.
Multiple Access Protocol
Distributed algorithm that can determine how nodes share channel, determine when node can transmit.
- Single shared broadcast channel.
- Two or more simultaneous transmissions cause a collision: a node receives two or more signals at the same time.
Ideal Protocol
Given a broadcast channel rate R bps.
- When one node wants to transmit, it can send at rate R.
- When M nodes want to transmit, each can send at average rate \frac{R}{M}.
Fully decentralized.
- No special node to coordinate transmissions.
- No synchronization of clocks, slots.
Simple.
MAC Protocol Taxonomy
Three broad classes.
- Channel Partitioning.
- Random Access.
- Taking Turns.
Channel Partitioning
TDMA: Time Division Multiple Access.
- Access to channel in rounds.
- Each station gets a fixed length slot in each round. Unused slots go idle.
- Advantages: Fair, collision free.
- Disadvantages: Waste of bandwidth, unnecessary waiting.
FDMA: Frequency Division Multiple Access.
- Channel spectrum divided into frequency bands.
- Each station assigned fixed frequency band. Unused bands go idle.
- Similar advantages and disadvantages as TDMA.
CDMA: Code Division Multiple Access.
- Each user is allocated a unique code to transmit.
Random Access Protocols
- When node has a packet to send, it transmits at full channel data rate R.
- A Random Access MAC protocol specifies how to detect collisions and how to recover from collisions.
- Slotted ALOHA, ALOHA, CSMA, CSMA/CD, CSMA/CA.
Slotted ALOHA
Collisions can be detected in physical medium based on superposition.
Assumptions.
- All frames are the same size.
- Time divided into equal size slots.
- Nodes start to transmit only at slot beginning.
- Nodes are synchronized.
- If 2 or more nodes transmit in slot, all nodes detect collision.
Operation.
- When a node obtains a fresh frame, transmit in next slot.
- If there is no collision, the node can send a new frame in the next slot.
- If there is a collision, we retransmit frame in each subsequent slot with probability p until success.
- Each node may pick a different retransmission probability.
Pros.
- Single active node can continuously transmit at full rate.
- Highly decentralized, only slots in nodes need to be in sync.
- Simple.
Cons.
- Collisions, wasting slots.
- Idle slots.
- Nodes may be able to detect a collision in less time than it takes to transmit a packet; even so, they must continue to transmit for the entire time slot.
- Clock synchronization.
The efficiency is the long-run fraction of successful slots.
- Probability that a given node has success in a slot is p(1-p)^{N-1}.
- Probability that any node has a success is Np(1-p)^{N-1}.
- Max efficiency is attained at the p^* that maximizes Np(1-p)^{N-1} (derivation sketched below).
- As N \to \infty, the max efficiency is \frac{1}{e} \approx 0.37.
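To fill in the maximization step (the standard derivation): setting the derivative to zero,
\frac{d}{dp}\left[Np(1-p)^{N-1}\right] = N(1-p)^{N-2}\left[(1-p) - (N-1)p\right] = 0 \;\Rightarrow\; p^* = \frac{1}{N}
and substituting back,
Np^*(1-p^*)^{N-1} = \left(1 - \frac{1}{N}\right)^{N-1} \to \frac{1}{e} \approx 0.37 \quad \text{as } N \to \infty.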
Pure (Unslotted) ALOHA.
No synchronization between nodes.
- When a frame first arrives, we immediately transmit.
- Collision probability increases, frame sent at t_0 collides with any other frames sent in [t_0 - 1, t_0 + 1].
- Calculating the efficiency will lead to \frac{1}{2e}, which is worse than slotted ALOHA.
CSMA (Carrier Sense Multiple Access)
Listen before transmitting.
- If channel sensed idle, transmit the entire frame.
- If channel sensed busy, defer transmission.
- Collisions can still occur, mainly due to propagation delays.
- Entire packet transmission time wasted.
- Distance and propagation delay play a role in determining collision probability.
CSMA/CD (Collision Detection)
- Colliding transmissions are aborted, reducing channel wastage.
- Easy in wired LANs, measure signal strengths, compare transmitted, received signals.
- Difficult in wireless LANs, received signal strength overwhelmed by local transmission strength.
- NIC (Network Interface Card) receives a datagram from the network layer and encapsulates it into a frame.
- If NIC senses channel is busy, waits until idle, then transmits.
- If NIC transmits entire frame without detecting another transmission, then we are done.
- If NIC detects another transmission, abort and send a jam signal.
After aborting, the NIC enters binary (exponential) backoff.
- Maintain a collision counter; after the mth collision, pick K uniformly at random from \{0, 1, …, 2^{m} - 1\}. The NIC waits K \cdot 512 bit times, then returns to step 2 (sensing). The counter is reset upon successful transmission (see the sketch after this list).
- Longer backoff interval with more collisions.
- Medium sensing is done for 96 bit times.
- Jamming signal length is 48 bits.
- The maximum number of collisions is set to 10; any more will generate an error.
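A minimal sketch of the backoff computation, using the 512-bit-time unit and the 10-collision limit from the notes; channel sensing and frame handling are omitted.

```python
import random

MAX_COLLISIONS = 10        # per the notes: more than 10 collisions is an error

def backoff_wait_bit_times(m):
    """Binary exponential backoff after the m-th collision: pick K uniformly
    from {0, 1, ..., 2^m - 1} and wait K * 512 bit times before sensing again."""
    if m > MAX_COLLISIONS:
        raise RuntimeError("too many collisions: give up on this frame")
    k = random.randrange(2 ** m)
    return k * 512

# The possible wait grows with the collision count: after the 1st collision K is
# 0 or 1; after the 3rd collision K is drawn from {0, ..., 7}.
print(backoff_wait_bit_times(1))   # 0 or 512
print(backoff_wait_bit_times(3))   # one of 0, 512, ..., 3584
```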
Assume that t_{prop} is the maximum propagation delay between two nodes in the LAN, and t_{trans} is the time to transmit a maximum-size frame.
\text{efficiency} = \frac{1}{1 + 5\frac{t_{prop}}{t_{trans}}}
Efficiency \to 1 as t_{prop} \to 0 or t_{trans} \to \infty.
No need to memorize efficiency formulas, just know which protocol is better and why.
Better performance than ALOHA, simple, cheap, decentralized!
Taking Turns
We look for the best mix of channel partitioning and random access.
- Polling.
- Master node invites slave nodes to transmit in turn by polling.
- Typically used with dumb slave devices.
- No collisions and no empty slots.
- Concerns with polling overhead, latency, and single point of failure.
- Token Passing.
- Control token passed from one node to next sequentially.
- Token message.
- Concerns with token overhead, latency, and single point of failure (token).
Cable Access Network
An example of how we can use multiple access.
- Two channels, upstream and downstream.
- Downstream has only one sender (CMTS / IP), so we do not need multiple access.
- Upstream requires multiple access since we have multiple senders.
- DOCSIS: Data Over Cable Service Interface Spec.
- FDM over upstream, downstream frequency channels.
- TDM upstream, some slots assigned, some have contention.
- Downstream MAP frame assigns upstream slots.
- Requests for upstream slots (and data) are transmitted using random access (binary backoff) in selected contention slots.
- Visualized as FDM split up into time slots similar to TDM.
5.4 LANs
MAC Addresses and ARP
Analogy: MAC address like SSN, IP address like postal address.
- Used locally to get frame from one interface to another physically-connected interface (same network, in IP-addressing).
- 48 bit MAC address (for most LANs) burned in NIC ROM, also sometimes software settable.
- Hexadecimal notation.
LAN Addresses
- Each adapter on LAN has a unique LAN address.
- MAC address allocation administered by IEEE.
- Flat address space allows for portability: a LAN card can move from one LAN to another and keep its MAC address.
ARP (Address Resolution Protocol)
How to determine interface’s MAC address, knowing its IP address.
ARP Table: Each IP node (host, router) on LAN has table.
- IP to MAC address mappings for some LAN nodes.
- TTL time after which address mapping will be forgotten (typically 20 min).
Consider A wants to send a datagram to B but B’s MAC address is not in A’s ARP table.
- A broadcasts ARP query packet, containing B’s IP address.
- Destination MAC address set to FF-FF-FF-FF-FF-FF (broadcast).
- All nodes on LAN receive ARP query.
- B receives ARP packet, replies to A with its MAC address (unicast to A’s MAC address).
- A caches IP to MAC address pair in its ARP table until the information becomes old.
- ARP is “plug-and-play”. Nodes create their ARP tables without intervention from network administrator.
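A toy ARP cache with the TTL behaviour described above (roughly a 20-minute expiry); the class and method names are illustrative, not an OS API.

```python
import time

ARP_TTL_SECONDS = 20 * 60            # mappings are forgotten after ~20 minutes

class ArpTable:
    def __init__(self):
        self._entries = {}           # ip -> (mac, expiry time)

    def learn(self, ip, mac):
        """Cache a mapping, e.g. from an ARP reply."""
        self._entries[ip] = (mac, time.time() + ARP_TTL_SECONDS)

    def lookup(self, ip):
        """Return the MAC address for ip, or None if unknown or expired
        (None means an ARP query must be broadcast on the LAN)."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, expiry = entry
        if time.time() > expiry:
            del self._entries[ip]
            return None
        return mac

table = ArpTable()
table.learn("192.168.1.7", "1A-23-F9-CD-06-9B")
print(table.lookup("192.168.1.7"))   # 1A-23-F9-CD-06-9B
print(table.lookup("192.168.1.8"))   # None -> broadcast an ARP query for it
```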
Example: Sending a datagram from A to B via R. Assume that A knows B’s IP address, IP address of first hop router R (DHCP), and R’s MAC address (ARP).
- A creates IP datagram with IP source A, destination B.
- A creates link-layer frame with R’s MAC address as destination; the frame contains the A-to-B IP datagram.
- Frame sent from A to R; upon arrival, the datagram is removed and passed up to IP.
- R forwards the datagram with IP source A, destination B.
- R creates link-layer frame with B’s MAC address as destination; the frame contains the A-to-B IP datagram.
Ethernet
“Dominant” wired LAN technology. Family of protocols.
- First widely used LAN technology.
- Simpler and cheaper than token LANs and ATM.
Physical Topology
- Bus, popular through the mid 90s. All nodes are in the same collision domain.
- Star, prevails today. Active switch is in the center, and each “spoke” runs a separate Ethernet protocol (nodes do not collide with each other).
Frame Structure
Sending adapter encapsulates IP datagram (or other network layer protocol packet) in Ethernet frame.
- Preamble. 7 bytes with pattern 10101010 followed by one byte with pattern 10101011 (wake up bytes). Used to synchronize receiver to the clock of the sender.
- Addresses, 6 byte source, destination MAC addresses.
- Type, indicates higher layer protocol (mostly IP).
- CRC, cyclic redundancy check, checked at the receiver.
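To make the field layout concrete, a sketch that assembles the address, type, payload, and CRC fields above; the preamble is left to the physical layer, zlib.crc32 stands in for the frame check sequence (real FCS byte ordering is glossed over), and the payload is a placeholder.

```python
import struct, zlib

def ethernet_frame(dst_mac, src_mac, ethertype, payload):
    """6-byte destination and source MAC addresses, 2-byte type, payload, 4-byte CRC."""
    def mac_bytes(mac):
        return bytes(int(part, 16) for part in mac.split("-"))
    header = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", ethertype)
    body = header + payload
    fcs = struct.pack("!I", zlib.crc32(body))        # stand-in CRC over the frame
    return body + fcs

frame = ethernet_frame("FF-FF-FF-FF-FF-FF",          # broadcast destination
                       "1A-23-F9-CD-06-9B",
                       0x0800,                       # type: IPv4 datagram
                       b"...IP datagram...")
print(len(frame), frame[:14].hex("-"))               # 14-byte header, then payload + CRC
```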
Ethernet is unreliable, connectionless.
- Connectionless, no handshaking between sending and receiving NICs.
- Unreliable, receiving NIC doesn’t send ACKs or NACKs to sending NIC.
- Data in dropped frames recovered only if initial sender uses higher layer rdt, otherwise dropped data is lost.
- Ethernet’s MAC protocol, unslotted CSMA/CD with binary backoff.
There are many different Ethernet standards. They all share a common MAC protocol and frame format, but offer different speeds and different physical layer media.
Ethernet Switches
Implement the bottom two layers (physical and link).
- Link-layer device, takes an active role.
- Store, forward, Ethernet frames.
- Examines incoming frame’s destination MAC address and selectively forwards the frame to one or more outgoing links; when the frame is to be forwarded onto a segment, uses CSMA/CD to access the segment.
- Transparent. Hosts are unaware of the presence of switches.
- Plug-and-play, self-learning. Switches do not have to be configured.
Switches allow for multiple simultaneous transmissions.
- Hosts have dedicated, direct connection to switch.
- Switches buffer packets.
- Ethernet protocol used on each incoming link, but no collisions. Full duplex.
- Each link is in its own collision domain.
- The switch can transmit A-to-A^\prime and B-to-B^\prime simultaneously, without collisions.
Switch Forwarding Table
How does the switch know A^\prime is reachable via interface 4, etc.
- Each switch has a switch table, similar to routing table but for MAC address.
- (MAC address of host, interface to reach host, timestamp).
How are entries created and maintained in the switch table? By self-learning: when a frame arrives, the switch records the sender’s MAC address and incoming interface (with a timestamp); frames for unknown destinations are flooded on all interfaces except the one they arrived on (sketched below).
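A minimal sketch of a self-learning switch table with flooding for unknown destinations; the ageing timeout is an illustrative value, not a standard constant.

```python
import time

AGEING_TIME = 300                    # seconds before a learned entry is considered stale

class LearningSwitch:
    def __init__(self, ports):
        self.ports = ports
        self.table = {}              # MAC address -> (port, timestamp)

    def handle_frame(self, src_mac, dst_mac, in_port):
        """Learn the sender's port, then forward: to the learned port if the
        destination is known and fresh, otherwise flood all other ports."""
        self.table[src_mac] = (in_port, time.time())           # learn / refresh
        entry = self.table.get(dst_mac)
        if entry and time.time() - entry[1] < AGEING_TIME:
            port, _ = entry
            return [port] if port != in_port else []           # same segment: drop
        return [p for p in self.ports if p != in_port]         # unknown: flood

sw = LearningSwitch(ports=[1, 2, 3, 4])
print(sw.handle_frame("AA", "BB", in_port=1))   # [2, 3, 4]  (BB unknown: flood)
print(sw.handle_frame("BB", "AA", in_port=4))   # [1]        (AA was learned on port 1)
```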
Interconnecting Switches
Switches can be connected in order to create a hierarchy.
- Very similar to other hierarchies that we have studied.
- Self learning (flooding) is still used, exactly the same as in the single-switch case.
Switches vs. Routers
Both are store-and-forward.
- Routers: Network-layer devices.
- Advantages: Hierarchical addressing, traffic isolation. We can send information within subnets without worrying about information leaking to the outside.
- Disadvantages: Large processing time, needs IP configuration.
- Switches: Link-layer devices.
- Advantages: Fast plug-and-play.
- Disadvantages: Broadcast storm if loops exist, large ARP table for large network.
Both have forwarding tables.
- Routers: Compute tables using routing algorithms, IP addresses.
- Switches: Learn forwarding table using flooding, learning, MAC addresses.
A router has multiple IP addresses. There is one for every interface.
VLANS
- Consider a CS user who moves offices to ECE but still wants to connect to the CS switch.
- Single broadcast domain.
- All layer-2 broadcast traffic (ARP, DHCP, etc.) must cross entire LAN.
- Security / privacy, efficiency issue.
We want to achieve some sort of traffic isolation. Switch(es) supporting VLAN capabilities can be configured to define multiple virtual LANS over single physical LAN infrastructure.
Port-based VLAN
Switch ports are grouped (by switch management software) so that a single physical switch operates as multiple virtual switches.
- Traffic Isolation: Frames to / from ports 1-8 can only reach ports 1-8. VLAN can also be based on MAC addresses of endpoints, rather than switch ports.
- Dynamic Membership: Ports can be dynamically assigned among VLANs.
- Forwarding between VLANs: Done via routing (just as with separate switches).
- In practice, vendors sell combined switches plus routers.
- Trunk Port: Carries frames between VLANs defined over multiple physical switches.
- Frames forwarded within VLAN between switches can’t be vanilla 802.1 frames. They must carry VLAN ID info.
- The 802.1q protocol adds / removes the additional header fields for frames forwarded between trunk ports (tag format sketched below).
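A small sketch of the 802.1q tag itself: 4 extra bytes inserted after the two MAC addresses, holding the 0x8100 tag protocol identifier and a 12-bit VLAN ID (plus a 3-bit priority). The frame here is a placeholder byte string, not a full Ethernet implementation.

```python
import struct

def add_vlan_tag(frame, vlan_id, priority=0):
    """Insert an 802.1q tag after the 12 address bytes."""
    assert 0 <= vlan_id < 4096                   # VLAN ID is a 12-bit field
    tci = (priority << 13) | vlan_id             # 3-bit priority, 1-bit DEI (0), 12-bit VID
    tag = struct.pack("!HH", 0x8100, tci)
    return frame[:12] + tag + frame[12:]

def strip_vlan_tag(frame):
    """Remove the tag, e.g. before delivering on an untagged (access) port."""
    vlan_id = struct.unpack("!H", frame[14:16])[0] & 0x0FFF
    return vlan_id, frame[:12] + frame[16:]

untagged = bytes(12) + b"\x08\x00" + b"payload"   # placeholder addresses + type + payload
tagged = add_vlan_tag(untagged, vlan_id=16)
print(strip_vlan_tag(tagged)[0])                  # 16
```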
5.5 Link Virtualization MPLS
Multiprotocol Label Switching.
- Initial goal is high-speed IP forwarding using fixed length label (instead of IP address).
- Fast lookup using fixed length identifier rather than longest prefix matching.
- Borrowing ideas from Virtual Circuit (VC) approach.
- But IP datagram still keeps IP address!
- We add in a MPLS header, which contains a 20 bit label.
MPLS Capable Routers
Label-switched router.
- Forwards packets to the outgoing interface based only on the label value; does not examine the IP address.
- MPLS forwarding table distinct from IP forwarding table.
- Flexibility: MPLS forwarding decision can differ from those of IP.
- Can use destination and source address to route flows to the same destination differently (traffic engineering).
- Re-route flows quickly if link fails using pre-computed backup paths (useful for VoIP).
MPLS Signaling
- Modify OSPF (Open Shortest Path First), IS-IS link-state flooding protocols to carry info used by MPLS routing.
- Link bandwidth, amount of “reserved” link bandwidth.
- Entry MPLS router uses RSVP-TE signalling protocol to set up MPLS forwarding at downstream routers.
5.6 Data Center Networking
- 10s to 100s of thousands of hosts, often closely coupled, in close proximity.
- Challenges.
- Multiple applications, each serving massive numbers of clients.
- Managing / balancing load, avoiding processing, networking, data bottlenecks.
- Load Balancing: Application-layer routing.
- Receives external client requests.
- Directs workload within data center.
- Returns result to external client (hiding data center internals from client).
- Firewall.
- Redundant connections must be used in order to reduce the chance of failure.
- Expenses for maintaining data centers are on the order of millions per week.
5.7 Web Request Example
A student attaches a laptop to a campus network, requests / receives www.google.com.
- Connecting laptop needs to get its own IP address, address of first-hop router, address of DNS server. DHCP.
- DHCP request encapsulated in UDP, encapsulated in IP, encapsulated in 802.3 Ethernet.
- IP source (0.0.0.0), IP destination (255.255.255.255).
- Need the MAC address of the source (physically set by NIC manufacturer). Destination MAC address is broadcast address (FFFF…)
- Switch learns the <MAC, Interface> mapping for the new client.
- Ethernet frame broadcast on LAN, received at router running DHCP server.
- Ethernet demuxed to IP, IP demuxed to UDP, UDP demuxed to DHCP.
- DHCP server formulates DHCP ACK containing client’s IP address, IP address of first-hop router for client, name & IP address of DNS server.
- Encapsulation at DHCP server, frame forwarded (switch learning) through LAN, demultiplexing at client.
- DHCP client receives DHCP ACK reply.
Client now has IP address, knows name and address of DNS server, and the IP address of its first-hop router.
- Before sending HTTP request, need the IP address of www.google.com. Communicate with DNS.
- We need the MAC address of the first-hop router: use ARP. The ARP query is broadcast on the LAN, and the router sends an ARP reply with its MAC address.
- Client now knows MAC address of first hop router, so can now send frame containing DNS query.
- IP datagram containing DNS query forwarded via LAN switch from client to first-hop router.
- IP datagram forwarded from campus network into comcast network, router (tables created by RIP, OSPF, IS-IS and / or BGP routing protocols) to DNS server.
- Demuxed to DNS server.
- DNS server replies to client with IP address of www.google.com.
- To send a HTTP request, client first opens a TCP socket to web server.
- TCP SYN segment (3-way handshake) inter-domain routed to web server.
- Web server responds with TCP SYNACK.
- TCP connection established!
- HTTP request sent into TCP socket.
- IP datagram containing HTTP request routed to google.com.
- Web server responds with HTTP reply containing web page.
- IP datagram containing HTTP reply routed back to client.