Distributed Systems Notes
DISTRIBUTED SYSTEMS
1.1 Distributed System
A distributed system is a collection of independent computers that appears to its
users as a single coherent system.
A distributed system is one in which components located at networked computers
communicate and coordinate their actions only by passing messages.
Characteristics of distributed System:
1. Programs are executed concurrently
2. There is no global time
3. Components can fail independently (isolation, crash)
By running a distributed system software the computers are enabled to:
Coordinate their activities
Share resources: hardware, software, data.
1.2 Examples of Distributed Systems
The Internet:
Collection of computer networks
Enables programs to communicate over arbitrary distance
Makes available services
– mail, file transfer, documents, telephony, ...
Communication via message passing according to Internet protocols
(IP, UDP, TCP, ICMP, SMTP, FTP, ...)
A backbone is a network link with high transmission capacity, employing
satellite connections, fibre-optic cables and other circuits.
Infrastructure: backbones, routing, naming
Extensible (new services, new protocols)
Open communication channels (security!)
Technology applicable to other distributed systems
Multimedia services such as music, radio, TV and video conferencing are available
on the Internet.
Figure 1.1 A typical portion of the Internet
Intranets
– a single authority
– protected access
• by a firewall, or
• by total isolation
– may be worldwide
– typical services:
• infrastructure services: file service, name service
• application services
Mobile and ubiquitous Computing
Mobile: computing devices are carried around.
Ubiquitous: small computing devices are embedded all over the place.
- Portable devices
– laptops
– handheld devices: personal digital assistants (PDAs), mobile phones,
pagers, video cameras and digital cameras
– wearable devices: smart watches with functionality similar to a PDA
– devices embedded in appliances: washing machines, hi-fi systems,
cars and refrigerators
Difference between mobile and ubiquitous computing:
Ubiquitous computing benefits users even while they remain in a single
environment such as a home or hospital.
Mobile computing has advantages even when users move between environments
and make use of different devices, such as laptops and printers, along the way.
Figure 1.3 Portable and handheld devices in a distributed system
1.3 Resource Sharing and the Web
• Hardware resources (reduce costs)
• Data resources (shared usage of information)
• Service resources
– search engines
– computer-supported cooperative working
• Service vs. server (a node or a process)
(in Finnish: palvelu = service, palvelin = server, palvelija = servant)
Figure 1.4 Web servers and web browsers
Advantages of Distributed Systems
Performance: very often a collection of processors can provide higher performance (and better
price/performance ratio) than a centralized computer.
Distribution: many applications involve, by their nature, spatially separated machines (banking,
commercial, automotive system).
Reliability (fault tolerance): if some of the machines crash, the system can survive.
Incremental growth: as requirements on processing power grow, new machines can be added
incrementally.
Sharing of data/resources: shared data is essential to many applications (banking, computer
supported cooperative work, reservation systems); other resources can be also
shared (e.g. expensive printers).
Communication: facilitates human-to-human communication.
Disadvantages of Distributed Systems
Difficulties of developing distributed software:
What should operating systems, programming languages and applications look like?
Networking problems:
Several problems are created by the network infrastructure, which have to be dealt with: loss of
messages, overloading.
Security problems: Sharing generates the problem of data security.
1.4 Challenges
1. Heterogeneity
2. Openness
3. Security
4. Scalability
5. Failure handling
6. Concurrency
7. Transparency
Heterogeneity appears at several levels:
Network (Ethernet, token ring, ISDN,...)
Computing hardware (data representation)
Operating systems (different APIs to protocols)
Programming languages (data structures, APIs)
Applications by different developers (data exchange standards)
Middleware:
Software layer which abstracts from the above providing a uniform computational model
(CORBA, Java RMI, ODBC,Web Services...)
Openness
The degree to which a computer system can be extended and re-implemented.
IEEE = Institute of Electrical and Electronic Engineers
e.g., IEEE 802.11 WLAN, IEEE 802.3 Ethernet
W3C = World Wide Web Consortium
e.g., HTML Recommendations
Security
The resources are accessible to authorized users and used in the way they are intended.
Confidentiality
Protection against disclosure to unauthorized individuals.
E.g. ACLs (access control lists) provide authorized access to information.
Integrity
Protection against alteration or corruption.
E.g. changing the account number or amount value in a money order.
Availability
Protection against interference targeting access to the resources.
E.g. denial of service (DoS, DDoS) attacks.
Non-repudiation
Proof of sending / receiving information.
E.g. digital signatures.
Security Mechanism:
Encryption
E.g. Blowfish, RSA
Authentication
E.g. password, public key authentication
Authorization
E.g. access control lists
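The authorization mechanism above can be illustrated with a minimal sketch. The object names, principals and the `authorize` helper are all hypothetical, chosen only for illustration; a real system would store ACLs in protected storage and authenticate the principal first.

```python
# Illustrative sketch (not a real security library): an access control list
# maps each object to the set of principals allowed to perform each operation.
acl = {
    "mailbox:alice": {"read": {"alice"}, "write": {"alice"}},
    "webpage:home":  {"read": {"alice", "bob"}, "write": {"bob"}},
}

def authorize(principal: str, operation: str, obj: str) -> bool:
    """Grant access only if the principal holds the right for this operation."""
    return principal in acl.get(obj, {}).get(operation, set())

print(authorize("alice", "read", "mailbox:alice"))  # True
print(authorize("bob", "read", "mailbox:alice"))    # False
```

The server would run such a check for every invocation, rejecting those that fail.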
Scalability
The system should work efficiently at many different scales, ranging from a small
intranet to the Internet.
It should remain effective when there is a significant increase in the number of
resources and the number of users.
Challenges of designing scalable distributed systems:
Cost of physical resources
The cost should increase at most linearly with system size.
Performance loss
For example, in hierarchically structured data, the search performance loss
due to data growth should not be worse than O(log n), where n is the size of the data.
Preventing software resources running out:
E.g. the numbers used to represent Internet addresses (32-bit IPv4 ->
128-bit IPv6), or Y2K-like problems.
Avoiding performance bottlenecks:
Use decentralized algorithms (e.g. replacing a centralized name service with the decentralized DNS).
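The O(log n) requirement above can be demonstrated with a small sketch: looking a name up in sorted (hierarchically ordered) data by binary search costs only about log2(n) comparisons, so search stays fast as the data grows. The host names here are invented for the example.

```python
# Sketch: binary search over sorted data is O(log n), so performance loss
# stays acceptable even as the data set grows large.
import bisect

names = sorted(f"host{i:06d}" for i in range(100_000))

def lookup(name: str) -> bool:
    i = bisect.bisect_left(names, name)   # ~17 comparisons for n = 100,000
    return i < len(names) and names[i] == name

print(lookup("host000042"))  # True
```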
Failure handling
Failure: an offered service no longer complies with its specification
Fault: cause of a failure (e.g. failure of a component)
Fault tolerance: no failure despite faults
Fault tolerance mechanisms:
Fault detection – checksums, heartbeats, …
Fault masking – retransmission of corrupt messages, redundancy, …
Fault toleration – exception handling, timeouts, …
Fault recovery – rollback mechanisms, …
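The first two mechanisms can be sketched together: a checksum detects a corrupted frame (fault detection), and dropping the frame so it is retransmitted masks the corruption as an omission. This is an illustrative sketch using CRC32; real protocols differ in framing and checksum choice.

```python
# Sketch: a CRC32 checksum appended to each frame lets the receiver detect
# corruption; the corrupt frame is dropped and retransmitted, converting an
# arbitrary failure into an omission failure.
import zlib

def make_frame(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(frame: bytes):
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != crc:
        return None           # corrupt: drop and await retransmission
    return payload

good = make_frame(b"charge!")
bad = b"x" + good[1:]         # simulate corruption of one byte in transit
print(receive(good))  # b'charge!'
print(receive(bad))   # None
```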
Redundancy:
Services can remain available in the presence of failures through the use of redundant components.
Examples:
– two different routes between hosts on the Internet;
– in the Domain Name System, every name table is replicated on at least two servers;
– data replicated on several servers.
Concurrency
Shared access to resources must be possible.
(i.e.) It becomes a problem when two or more parties access the same resource at the same time.
Transparency
To hide from the user and the application programmer the separation/distribution of
components, so that the system is perceived as a whole rather than as a collection of
independent components.
ISO Reference Model for Open Distributed Processing (ODP) identifies the following
forms of transparency: access, location, concurrency, replication, failure, mobility,
performance and scaling transparency.
Software Architecture:
The same, looking at two distributed nodes:
1.5 System Models
Systems that are intended for use in real-world environments should be designed to function
correctly in the widest possible range of circumstances and in the face of many possible
difficulties and threats.
Resources in a distributed system are shared between users. They are normally encapsulated
within one of the computers and can be accessed from other computers by communication.
• Each resource is managed by a program, the resource manager; it offers a communication
interface enabling the resource to be accessed by its users.
• Resource managers can be in general modeled as processes.
If the system is designed according to an object oriented methodology, resources are
encapsulated in objects.
Difficulties and threats for distributed systems
–Widely varying modes of use.
–Wide range of system environments
–Internal problems: non-synchronized clocks, conflicting data updates, many modes of hardware
and software failure involving the individual components of a system.
Different types of Models
Architectural Models:
An architectural model defines the way in which the components of system
interact with one another and the way in which they are mapped onto an
underlying network of computers
Fundamental models
Fundamental models help to reveal key problems for the designers of
distributed systems. Their purpose is to specify the design issues,
difficulties and threats that must be resolved in order to develop distributed
systems that fulfil their tasks correctly, reliably and securely. The
fundamental models provide abstract views of just those characteristics of
distributed systems that affect the dependability characteristics:
correctness, reliability and security.
1.6 Architectural models
The architecture of a system is its structure in terms of separately specified components.
The architectural design of a building has similar aspects: it determines not only its
appearance but also its general structure, and an architectural style (gothic, neo-classical,
modern) provides a consistent frame of reference for the design.
An architectural model of a distributed system first simplifies and abstracts the functions
of the individual components of a distributed system and then it considers:
o the placement of the components across a network of computers, seeking to
define useful patterns for the distribution of data and workload;
o the interrelationships between the components, that is, their functional roles and
the patterns of communication between them.
An initial simplification is achieved by classifying processes as server processes, client
processes and peer processes
This classification of processes identifies the responsibilities of each and hence helps us
to assess their workloads and to determine the impact of failures in each of them.
The results of this analysis can then be used to specify the placement of the processes in
a manner that meets performance and reliability goals for the resulting system.
Some more dynamic systems can be built as variations on the client-server model:
o The possibility of moving code from one process to another allows a process to
delegate tasks to another process: for example, clients can download code from
servers and run it locally. Objects and the code that accesses them can be moved
to reduce access delays and minimize communication traffic.
o Some distributed systems are designed to enable computers and other mobile
devices to be added or removed seamlessly, allowing them to discover the
available services and to offer their services to others.
Software layers
The term software architecture referred originally to the structuring of software as layers
or modules in a single computer and more recently in terms of services offered and
requested between processes located in the same or different computers.
A server is a process that accepts requests from other processes. A distributed service can
be provided by one or more server processes, interacting with each other and with client
processes in order to maintain a consistent system-wide view of the service's resources.
For example, a network time service is implemented on the Internet based on the
Network Time Protocol (NTP) by server processes running on hosts throughout the
Internet that supply the current time to any client that requests it and adjust their version
of the current time as a result of interactions with each other.
The layered organisation (top to bottom):

  Applications, services
  Middleware
  Operating system              }
  Computer and network hardware }  Platform
The figure introduces the important terms platform and middleware, which we define as follows:
Platform:
The lowest-level hardware and software layers are often referred to as a platform for
distributed systems and applications. These low-level layers provide services to the layers above
them, which are implemented independently in each computer, bringing the system's
programming interface up to a level that facilitates communication and coordination between
processes.
Example: Intel x86/Windows, Sun SPARC/SunOS, Intel x86/Solaris, PowerPC/MacOS, Intel
x86/Linux.
Middleware :
Middleware is a layer of software whose purpose is to mask heterogeneity and to provide a
convenient programming model to application programmers. Middleware is represented
by processes or objects in a set of computers that interact with each other to implement
communication and resource sharing support for distributed applications.
Middleware is concerned with providing useful building blocks for the construction of
software components that can work with one another in a distributed system.
Middleware can also provide services for use by application programs. They are
infrastructural services, tightly bound to the distributed programming model that the
middleware provides.
For example, CORBA offers a variety of services that provide applications with facilities
including naming, security, transactions, persistent storage and event notification.
Limitations of middleware:
Many distributed applications rely entirely on the services provided by the available
middleware to support their needs for communication and data sharing.
For example, an application that is suited to the client-server model, such as a database of
names and addresses, can rely on middleware that provides only remote method
invocation.
Much has been achieved in simplifying the programming of distributed systems through the
development of middleware support, but some aspects of the dependability of systems
require support at the application level.
System Architectures
The main types of architectural model are
Client-server model
Peer to Peer
Client-server model
The system is structured as a set of processes, called servers, that offer services to the users,
called clients.
• The client-server model is usually based on a simple request/reply protocol,
implemented with send/receive primitives or using remote procedure calls (RPC) or
remote method invocation (RMI):
- the client sends a request (invocation) message to the server asking for some service;
- the server does the work and returns a result (e.g. the data requested) or an error
code if the work could not be performed.
A server can itself request services from other servers; thus, in this new relation, the
server itself acts like a client.
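The request/reply protocol described above can be sketched in a few lines. The operation names and the `client_invoke`/`server_handle` helpers are invented for illustration; a real system would send the request and reply as messages over a network rather than as a local call.

```python
# Minimal sketch of the client-server request/reply protocol: the client
# sends a request message; the server does the work and returns a result,
# or an error code if the work could not be performed.
def server_handle(request: dict) -> dict:
    ops = {"upper": str.upper, "length": len}
    op = ops.get(request["op"])
    if op is None:
        return {"error": "unknown operation"}   # error code on failure
    return {"result": op(request["arg"])}

def client_invoke(op: str, arg):
    request = {"op": op, "arg": arg}            # the request (invocation) message
    return server_handle(request)               # the reply message

print(client_invoke("upper", "hello"))  # {'result': 'HELLO'}
print(client_invoke("sqrt", 2))         # {'error': 'unknown operation'}
```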
Clients invoke individual servers
Peer-to-Peer
All processes (objects) play similar role.
• Processes (objects) interact without particular distinction between clients and servers.
• The pattern of communication depends on the particular application.
• A large number of data objects are shared; any individual computer holds only a small part of
the application database.
• Processing and communication loads for access to objects are distributed across many
computers and access links.
• This is the most general and flexible model.
Some problems with client-server:
• Centralisation of service leads to poor scaling.
- Limitations:
the capacity of the server
the bandwidth of the network connecting the server
Peer-to-peer tries to solve some of the above.
Problems with peer-to-peer:
• High complexity, due to the need to
- place individual objects cleverly and retrieve them efficiently;
- maintain a potentially large number of replicas.
Variations of the Basic Models
Client-server and peer-to-peer can be considered as basic models.
• Several variations have been proposed, with considering factors such as:
- multiple servers and caches
- mobile code and mobile agents
- low-cost computers at the users' side
- mobile devices
Services provided by multiple servers
Services may be implemented as several server processes in separated host computers interacting
as necessary to provide a service to client processes.
The servers may partition the set of objects on which the service is based and distribute them
between themselves, or they may maintain replicated copies of them on several hosts.
Proxy Server
A proxy server provides copies (replications) of resources which are managed by other servers.
Proxy servers are typically used as caches for web resources. They maintain a cache of recently
visited web pages or other resources.
When a request is issued by a client, the proxy server is first checked, if the requested object
(information item) is available there.
Proxy servers can be located at each client, or can be shared by several clients.
The purpose is to increase performance and availability, by avoiding frequent accesses to
remote servers.
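The caching behaviour described above can be sketched as follows. The `ProxyServer` class and `origin_fetch` callback are illustrative names; real proxies also handle expiry and validation of cached entries, which this sketch omits.

```python
# Sketch of a caching proxy: requests are answered from the local cache
# when possible, avoiding repeated accesses to remote servers.
class ProxyServer:
    def __init__(self, origin_fetch):
        self.cache = {}
        self.origin_fetch = origin_fetch   # function that contacts the origin server
        self.misses = 0

    def get(self, url: str):
        if url not in self.cache:          # not cached: fetch from the remote server
            self.misses += 1
            self.cache[url] = self.origin_fetch(url)
        return self.cache[url]             # cached: served locally

proxy = ProxyServer(lambda url: f"<page for {url}>")
proxy.get("http://example.org/")           # miss: fetched from origin
proxy.get("http://example.org/")           # hit: served from the cache
print(proxy.misses)  # 1
```

Only one request reached the origin server; the second was absorbed by the cache, which is the performance and availability gain the text describes.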
Mobile Code
Mobile code: code that is sent from one computer to another and run at the destination.
Advantage: remote invocations are replaced by local ones.
Typical example: Java applets.
Mobile Agents
Mobile agent: a running program that travels from one computer to another carrying out a task on
someone's behalf.
• A mobile agent is a complete program, code + data that can work (relatively) independently.
• The mobile agent can invoke local resources/data.
Typical tasks:
• Collect information
• Install/maintain software on computers
• Compare prices from various vendors by visiting their sites.
Network Computers
Network computers do not store locally operating system or application code. All code is loaded
from the servers and run locally on the network computer.
Advantages:
• The network computer can be simpler, with limited capacity; it does not need even a local hard
disk (if there exists one it is used to cache data or code).
• Users can log in from any computer.
• No user effort for software management/ administration.
Thin Clients
The thin client is a further step, beyond the network computer:
• Thin clients do not download code (operating system or application) from the server to run it
locally. All code is run on the server, in parallel for several clients.
• The thin client only runs the user interface!
Advantages:
• All those of network computers but the computer at the user side is even simpler (cheaper).
Strong servers are needed!
Mobile Devices
Mobile devices are hardware, computing components that move (together with their software)
between physical locations.
• This is opposed to software agents, which are software components that migrate.
• Both clients and servers can be mobile (clients more frequently).
Particular problems/issues:
• Mobility transparency: clients should not be aware if the server moves (e.g., the server keeps its
Internet address even if it moves between networks).
• Problems due to variable connectivity and bandwidth.
• The device has to explore its environment:
- Spontaneous interoperation: associations between devices (e.g. clients and servers) are
dynamically created and destroyed.
- Context awareness: available services are dependent on the physical environment in which the
device is situated.
Design requirements for distributed architectures
Performance issues
Use of caching and replication
Dependability issues
Performance issues
Responsiveness
–Users of interactive applications require a fast and consistent response to interaction.
Throughput
–The rate at which computational work is done.
Quality of service
–The ability to meet the deadlines of users' needs.
Balancing computer loads
–In some cases load balancing may involve moving partially-completed work as the loads
on hosts change.
Use of caching and replication
The performance issues often appear to be major obstacles to the successful deployment of
DS, but much progress has been made in the design of systems that overcome them by the
use of data replication and caching.
Dependability issues
The dependability of computer systems as correctness, security and fault
tolerance. Fault tolerance: reliability is achieved through redundancy.
–Security: the architectural impact of the requirement for security concerns the need to locate
sensitive data and other resources only in computers that can be effectively secured against
attack.
1.7 Fundamental Models
Interaction model
Failure model
Security model
Interaction model
Performance of communication channels
Computer clocks and timing events
Two variants of the interaction model
Agreement in pepperland
Event ordering
Performance of communication channels
Communication performance is often a limiting characteristic.
The delay between the sending of a message by one process and its receipt by
another is referred to as latency.
Bandwidth: the total amount of information that can be transmitted over a channel in a given time.
Jitter is the variation in the time taken to deliver a series of messages.
Computer clocks and timing events
It is impossible to maintain a single global notion of time.
There are several approaches to correcting the times on computer clocks, e.g. using radio time signals from GPS.
Two variants of the interaction model
–Synchronous distributed system
–The time to execute each step of a process has known lower and upper bounds.
–Each message transmitted over a channel is received within a known bounded time.
–Each process has a local clock whose drift rate from real time has a known bound.
–Asynchronous distributed system
–No bound on process execution speeds.
–No bound on message transmission delays.
–No bound on clock drift rates.
Agreement in pepperland
•The pepperland divisions need to agree on which of them will lead the charge against
the Blue Meanies, and when the charge will take place.
•In asynchronous pepperland, the messengers are very variable in their speed.
•The divisions know some useful constraints: every message takes at least min
minutes and at most max minutes to arrive.
•The leading division sends the message 'charge!', waits for min minutes, and then
charges.
•The other division's charge is guaranteed to be after the leading division's, but no
more than (max - min) minutes after it.
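The timing argument above is simple arithmetic, and checking it as a sketch makes the bound explicit. The delay values are invented for the example; only the min/max bounds matter.

```python
# The pepperland bound as arithmetic: the leader charges min minutes after
# sending; the follower charges on receipt, i.e. between min and max minutes
# after the send. The gap between the charges is therefore at most (max - min).
MIN, MAX = 5, 12                  # assumed message delay bounds, in minutes

leader_charge = MIN               # leader waits min minutes, then charges
earliest_follower = MIN           # fastest possible delivery of 'charge!'
latest_follower = MAX             # slowest possible delivery

assert earliest_follower >= leader_charge            # never before the leader
assert latest_follower - leader_charge == MAX - MIN  # worst-case gap
print(latest_follower - leader_charge)  # 7
```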
Event ordering
–In many cases, we are interested in knowing whether an event (sending or
receiving a message) at one process occurred before, after or concurrently with
another event at another process. The execution of a system can be described in
terms of events and their ordering despite the lack of accurate clocks.
Example
send
receive
send
receive
m1 m2
2
1
3
4X
Y
Z
Physical
time
A
m3
receive receive
send
receive receive receive
t1 t2 t3
receive
receive
m2
m1
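One standard way to order events without accurate clocks, consistent with the description above, is a Lamport logical clock. This is a sketch of the technique; the class and variable names are illustrative.

```python
# Sketch of Lamport logical clocks: each process keeps a counter, ticks it
# on every local event or send, and on receipt jumps past the timestamp
# carried by the message. Send events are thereby ordered before the
# corresponding receive events.
class Process:
    def __init__(self):
        self.clock = 0

    def event(self):                 # local event or message send
        self.clock += 1
        return self.clock

    def receive(self, msg_timestamp):
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock

x, y = Process(), Process()
t_send = x.event()                   # X sends m1 at logical time 1
t_recv = y.receive(t_send)           # Y receives m1 at logical time 2
assert t_send < t_recv               # send happened-before receive
print(t_send, t_recv)  # 1 2
```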
Failure model
Omission failures
A processor or communication channel fails to perform actions it is supposed to
do. This means that the particular action is not performed!
• We do not have an omission fault if:
- an action is delayed (regardless of how long) but finally executed;
- an action is executed with an erroneous result.
In synchronous systems, omission faults can be detected by timeouts.
• If we can be sure that messages arrive, a timeout indicates that the sending process
has crashed. Such a system has fail-stop behaviour.
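The timeout-based detection above can be sketched in a few lines. The heartbeat scheme and the `suspect_crashed` helper are illustrative; the key assumption, as the text says, is a synchronous system with a known delivery bound.

```python
# Sketch: in a synchronous system, a heartbeat missing for longer than the
# known delivery bound lets the observer conclude (fail-stop) that the
# sending process has crashed.
def suspect_crashed(last_heartbeat: float, now: float, bound: float) -> bool:
    """True if the silence has exceeded the known message-delivery bound."""
    return now - last_heartbeat > bound

print(suspect_crashed(last_heartbeat=100.0, now=102.0, bound=5.0))  # False
print(suspect_crashed(last_heartbeat=100.0, now=110.0, bound=5.0))  # True
```

In an asynchronous system no such bound exists, so this conclusion cannot safely be drawn.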
Arbitrary failures
–This is the most general and worst possible fault semantics.
–Intended processing steps or communications are omitted or/and unintended
ones are executed.
–Results may not come at all or may come but carry wrong values.
(Figure: process p performs send, placing message m in its outgoing message buffer;
m travels over the communication channel into process q's incoming message buffer,
from which q performs receive.)
Timing failures
o Timing faults can occur in synchronous distributed systems, where time limits are set to process
execution, communications, and clock drifts.
o A timing fault occurs if any of these time limits is exceeded.
Masking failures
Each component in a distributed system is generally constructed from a collection of other components. It is possible to construct reliable services from components that exhibit failures.
A service masks a failure either by hiding it altogether or by converting it into a more acceptable
type of failure.
E.g. checksums are used to mask corrupted messages, effectively converting an arbitrary
failure into an omission failure.
Reliability of one-to-one communication
A basic communication channel can exhibit omission failures. It is possible to use it
to build a communication service that masks some of those failures.
The term reliable communication is defined in terms of validity and integrity as follows:
validity: any message in the outgoing message buffer is eventually delivered to the
incoming message buffer;
integrity: the message received is identical to one sent, and no messages are delivered
twice.
The threats to integrity come from two independent sources:
• Any protocol that retransmits messages but does not reject a message that arrives
twice. Protocols can attach sequence numbers to messages so as to detect those that
are delivered twice.
• Malicious users that may inject spurious messages, replay old messages or tamper
with messages. Security measures can be taken to maintain the integrity property in the
face of such attacks.
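The sequence-number scheme from the first point can be sketched directly. The `Receiver` class is an illustrative name; a real protocol would also handle sequence-number wraparound and per-sender numbering.

```python
# Sketch: the receiver remembers which sequence numbers it has already
# delivered and rejects duplicates, so retransmissions do not violate the
# integrity property (no message delivered twice).
class Receiver:
    def __init__(self):
        self.delivered = set()

    def on_message(self, seq: int, payload: str):
        if seq in self.delivered:
            return None              # duplicate from a retransmission: reject
        self.delivered.add(seq)
        return payload               # delivered exactly once

r = Receiver()
print(r.on_message(1, "hello"))   # 'hello'
print(r.on_message(1, "hello"))   # None (retransmitted duplicate rejected)
```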
Security model
The security of a distributed system can be achieved by securing the processes and the
channels used for their interactions and by protecting the objects that they encapsulate
against unauthorized access
Protection is described in terms of objects, although the concepts apply equally well to
resources of all types.
(Figure: a client process, acting for a principal (user), sends an invocation across the
network to a server process, acting for a principal (server); the server holds the object,
checks access rights, and returns a result.)
This shows a server that manages a collection of objects on behalf of some users. The users
can run client programs that send invocations to the server to perform operations on the objects.
The server carries out the operation specified in each invocation and sends the result to the client.
Objects are intended to be used in different ways by different users. For example, some
objects may hold a user's private data, such as their mailbox, and other objects may hold shared data
such as web pages. To support this, access rights specify who is allowed to perform the
operations of an object (for example, who is allowed to read or to write its state).
Access rights must be specified with reference to the authority that issues, and benefits from,
each invocation. Such an authority is called a principal. A principal may be a user or a process.
The server is responsible for verifying the identity of the principal behind each invocation and
checking that they have sufficient access rights to perform the requested operation on the
particular object invoked, rejecting those that do not. The client may check the identity of the
principal behind the server to ensure that the result comes from the required server.
Securing processes and their interactions
Processes interact by sending messages. The messages are exposed to attack because the
network and the communication service that they use is open, to enable any pair of processes to
interact. Servers and peer processes expose their interfaces, enabling invocations to be sent to
them by any other process.
Distributed systems are often deployed and used in tasks that are likely to be subject to
external attacks by hostile users. This is especially true for applications that handle financial
transactions, confidential or classified information or any other information whose secrecy or
integrity is crucial. Integrity is threatened by security violations as well as by communication
failures.
The enemy
To model security threats, we postulate an enemy that is capable of sending any message to
any process and reading or copying any message between a pair of processes, as shown in the
following figure.
Such attacks can be made simply by using a computer connected to a network to run a program
that reads network messages addressed to other computers on the network, or a program that
generates messages making false requests to services, purporting to come from authorized users.
The attack may come from a computer that is legitimately connected to the network or from one
that is connected in an unauthorized manner.
Threats to processes:
A process that is designed to handle incoming requests may receive a message from any other
process in the distributed system, and it cannot necessarily determine the identity of the sender.
Communication protocols such as IP do include the address of the source computer in each
message, but it is not difficult for an enemy to generate a message with a forged source address.
This lack of reliable knowledge of the source of a message is a threat to the correct functioning
of both servers and clients.
Servers:
Since a server can receive invocations from many different clients, it cannot necessarily determine
the identity of the principal behind any particular invocation. Even if a server requires the inclusion
of the principal's identity in each invocation, an enemy might generate an invocation with a false
identity.
Without reliable knowledge of the sender's identity, a server cannot tell whether to
perform the operation or to reject it. For example, a mail server would not know whether the user
behind an invocation that requests a mail item from a particular mailbox is allowed to do so or
whether it was a request from an enemy.
Clients:
When a client receives the result of an invocation from a server, it cannot necessarily tell
whether the source of the result message is from the intended server or from an enemy, perhaps
'spoofing' the mail server. Thus the client could receive a result that was unrelated to the original
invocation, such as a false mail item (one that is not in the user's mailbox).
Threats to communication channels:
An enemy can copy, alter or inject messages as they travel across the network and its
intervening gateways. Such attacks present a threat to the privacy and integrity of information as
it travels over the network and to the integrity of the system.
For example, a result message containing a user's mail item might be revealed to another
user or it might be altered to say something quite different. Another form of attack is the attempt
to save copies of messages and to replay them at a later time, making it possible to reuse the
same message over and over again.
For example, someone could benefit by resending an invocation message requesting a
transfer of a sum of money from one bank account to another. All these threats can be defeated
by the use of secure channels, which are described below and are based on cryptography and
authentication.
Defeating security threats :
Cryptography and shared secrets: Suppose that a pair of processes (for example, a
particular client and a particular server) share a secret; that is, they both know the secret but no
other process in the distributed system knows it. Then, if a message exchanged by that pair of
processes includes information that proves the sender's knowledge of the shared secret, the
recipient knows for sure that the sender was the other process in the pair.
Cryptography is the science of keeping messages secure, and encryption is the process of
scrambling a message in such a way as to hide its contents.
Modern cryptography is based on encryption algorithms that use secret keys (large numbers that
are difficult to guess) to transform data in a manner that can only be reversed with knowledge of
the corresponding decryption key.
Authentication:
The use of shared secrets and encryption provides the basis for the authentication of
messages proving the identities supplied by their senders. The basic authentication technique is
to include in a message an encrypted portion that contains enough of the contents of the message
to guarantee its authenticity.
The authentication portion of a request to a file server to read part of a file, for example,
might include a representation of the requesting principal's identity, the identity of the file and the
date and time of the request, all encrypted with a secret key shared between the file server and
the requesting process. The server would decrypt this and check that it corresponds to the
unencrypted details specified in the request.
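The file-server request described above can be sketched using a message authentication code in place of the encrypted portion. The shared key, field layout and choice of HMAC-SHA-256 are illustrative assumptions, not part of the notes.

```python
import hmac, hashlib, time

SHARED_KEY = b"key-known-only-to-client-and-file-server"  # hypothetical shared secret

def make_request(principal: str, file_id: str, now: float) -> dict:
    # The authentication portion covers the principal's identity, the
    # file identity and the date and time of the request.
    fields = f"{principal}|{file_id}|{now:.0f}".encode()
    tag = hmac.new(SHARED_KEY, fields, hashlib.sha256).hexdigest()
    return {"principal": principal, "file": file_id, "time": now, "auth": tag}

def server_verify(req: dict) -> bool:
    # The server recomputes the tag and checks that it corresponds to
    # the unencrypted details specified in the request.
    fields = f"{req['principal']}|{req['file']}|{req['time']:.0f}".encode()
    expected = hmac.new(SHARED_KEY, fields, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, req["auth"])

req = make_request("alice", "/home/alice/notes.txt", time.time())
assert server_verify(req)
req["file"] = "/home/bob/secret.txt"  # tampering is detected
assert not server_verify(req)
```

Including the time of the request in the authenticated fields is also the hook used by secure channels to reject replayed messages.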
Secure channels: Encryption and authentication are used to build secure channels as a
service layer on top of existing communication services. A secure channel is a communication
channel connecting a pair of processes, each of which acts on behalf of a principal, as shown in
the following figure.
A secure channel has the following properties:
• Each of the processes knows reliably the identity of the principal on whose behalf
the other process is executing. Therefore if a client and server communicate via a
secure channel, the server knows the identity of the principal behind the
invocations and can check their access rights before performing an operation. This
enables the server to protect its objects correctly and allows the client to be sure
that it is receiving results from a bona fide server.
• A secure channel ensures the privacy and integrity (protection against tampering)
of the data transmitted across it.
• Each message includes a physical or logical time stamp to prevent messages from
being replayed or reordered.
Other possible threats from an enemy:
Denial of service:
This is a form of attack in which the enemy interferes with the activities of authorized
users by making excessive and pointless invocations on services or message transmissions in a
network, resulting in overloading of physical resources (network bandwidth, server processing
capacity).
Such attacks are usually made with the intention of delaying or preventing actions by
other users. For example, the operation of electronic door locks in a building might be disabled
by an attack that saturates the computer controlling the electronic locks with invalid requests.
Mobile code:
Mobile code raises new and interesting security problems for any process that receives
and executes program code from elsewhere, such as the email attachment mentioned earlier. A
Trojan horse is a program purporting to fulfill an innocent purpose but in fact including code that
accesses or modifies resources that are legitimately available to the host process but not to the
originator of the code.
The methods by which such attacks might be carried out are many and varied, and the
host environment must be very carefully constructed in order to avoid them. Many of these issues
have been addressed in Java and other mobile code systems, but the recent history of this topic
has included the exposure of some embarrassing weaknesses. This illustrates well the need for
rigorous analysis in the design of all secure systems.
The uses of security models:
The use of security techniques such as encryption and access control incurs substantial
processing and management costs. The security model outlined above provides the basis for the
analysis and design of secure systems in which these costs are kept to a minimum, but threats to a
distributed system arise at many points, and a careful analysis of the threats that might arise from
all possible sources in the system's network environment, physical environment and human
environment is needed.
This analysis involves the construction of a threat model listing all the forms of attack to
which the system is exposed and an evaluation of the risks and consequences of each. The
effectiveness and the cost of the security techniques needed can then be balanced against the
threats.
1.8 Networking and Internetworking
The networks used in distributed systems are built from a variety of transmission media,
including wire, cable, fiber and wireless channels; hardware devices, including routers, switches,
bridges. hubs, repeaters and network interfaces; and software components, including protocol
stacks, communication handlers and drivers.
The resulting functionality and performance available to distributed system and
application programs is affected by all of these. The computers and other devices that use the
network for communication purposes are referred to as hosts. The term node is used to refer to
any computer or switching device attached to a network.
The Internet is a single communication subsystem providing communication between all
of the hosts that are connected to it. The Internet is constructed from many subnets.A subnet is a
set of interconnected nodes, all of which employ the same technology to communicate amongst
themselves.
The Internet's infrastructure includes an architecture and hardware and software
components that effectively integrate diverse subnets into a single data communication service.
The design of a communication subsystem is strongly influenced by the characteristics of the
operating systems used in the computers of which the distributed system is composed as well as
the networks that interconnect them.
Networking issues for distributed systems
Performance:
The network performance parameters that affect the speed with which individual
messages can be transferred between two interconnected computers are latency and data transfer
rate.
Latency is the delay that occurs after a send operation is executed before data starts to
become available at the destination. It can be measured as the time required to transfer an empty
message.
Data transfer rate is the speed at which data can be transferred between two computers
in the network once transmission has begun, usually quoted in bits per second. Following from
these definitions, the time required for a network to transfer a message containing length bits
between two computers is:
Message transmission time = latency + length/data transfer rate
The above equation is valid for messages whose length does not exceed a maximum that
is determined by the underlying network technology. Longer messages have to be segmented and
the transmission time is the sum of the times for the segments.
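The transmission-time formula, together with the segmentation rule for messages longer than the maximum, can be expressed directly. The function name and parameters below are made up for illustration.

```python
def transmission_time(length_bits, latency_s, rate_bps, mtu_bits=None):
    # Message transmission time = latency + length / data transfer rate.
    # Messages longer than the maximum (mtu_bits) are segmented and the
    # transmission time is the sum of the times for the segments.
    if mtu_bits is None or length_bits <= mtu_bits:
        return latency_s + length_bits / rate_bps
    full, rest = divmod(length_bits, mtu_bits)
    total = full * (latency_s + mtu_bits / rate_bps)
    if rest:
        total += latency_s + rest / rate_bps
    return total

# A 10 kbit message over a 100 Mbps link with 5 ms latency:
t = transmission_time(10_000, 0.005, 100e6)   # 0.005 + 0.0001 = 5.1 ms
```

Note how for small messages the latency term (5 ms) dominates the transfer term (0.1 ms), which is the point made in the paragraph above.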
The transfer rate of a network is determined primarily by its physical characteristics,
whereas the latency is determined primarily by software overheads, routing delays and a load-
dependent statistical element arising from conflicting demands for access to transmission
channels.
Many of the messages transferred between processes in distributed systems are small in
size; latency is therefore often of equal or greater significance than transfer rate in determining
performance.
The total system bandwidth of a network is a measure of throughput: the total volume of
traffic that can be transferred across the network in a given time. The performance of networks
weakens in conditions of overload when there are too many messages in the network at the same
time.
Scalability:
Computer networks are a necessary part of the infrastructure of modern societies. The
potential future size of the Internet is comparable with the population of the planet. It is realistic
to expect it to include several billion nodes and hundreds of millions of active hosts.
These figures indicate the huge changes in size and load that the Internet must handle.
The network technologies on which it is based were not designed to cope with even the Internet's
current scale; but they have performed remarkably well. Some substantial changes to the
addressing and routing mechanisms are planned in order to handle the next phase of the Internet's
growth.
Reliability :
Many applications are able to recover from communication failures and hence do not
require guaranteed error-free communication. The end-to-end argument further supports the view
that the communication subsystem need not provide totally error-free communication; the
detection of communication errors and their correction is often best performed by application-
level software.
The reliability of most physical transmission media is very high. When errors occur they
are usually due to timing failures in the software at the sender or receiver (for example, failure by
the receiving computer to accept a packet) or buffer overflow rather than errors in the network.
Security
A firewall creates a protection boundary between the organization's intranet and the rest
of the Internet. The purpose of the firewall is to protect the resources in all of the computers
inside the organization from access by external users or processes and to control the use of
resources outside the firewall by users inside the organization.
A firewall runs on a gateway, a computer that stands at the network entry point to an
organization's intranet. The firewall receives and filters all of the messages traveling into and out
of an organization. It is configured according to the organization's security policy to allow certain
incoming and outgoing messages to pass through it and to reject all others.
Security can be achieved through the use of cryptographic techniques. It is usually
applied at a level above the communication subsystem. Exceptions include the need to protect
network components such as routers against unauthorized interference with their operation and
the need for secure links to mobile devices and other external nodes to enable them to participate
in a secure intranet, forming a virtual private network (VPN).
Mobility:
Distributed systems need to support portable computers and handheld digital devices,
which implies the need for wireless networks in order to support continuous communication with
such devices. But the consequences of mobility extend beyond the need for wireless networking.
Mobile devices are frequently moved between locations and reconnected at convenient network
connection points.
The addressing and routing schemes of the Internet and other networks were developed
before the advent of mobile devices. Although the current mechanisms have been adapted and
extended to support them, the expected future growth in the use of mobile devices will require
further extensions.
Quality of service:
Quality of service is the ability to meet deadlines when transmitting and processing
streams of real-time multimedia data. This imposes major new requirements on computer
networks.
Applications that transmit multimedia data require guaranteed bandwidth and bounded
latencies for the communication channels that they use. Some applications vary their demands
dynamically and specify both a minimum acceptable quality of service and a desired optimum.
Multicasting
Most communication in distributed systems is between pairs of processes, but there often
is also a need for one-to-many communication. While this can be simulated by sends to several
destinations, this is more costly than necessary, and may not exhibit the fault-tolerance
characteristics required by applications. For these reasons many network technologies support the
simultaneous transmission of messages to several recipients.
1.9 Types of network
1.10 Networking Principles
Packet switching was a radical step beyond the switched telecommunication networks that
used telephone and telegraph communication, exploiting the capability of computers to store data
while it is in transit.
This enables packets addressed to different destinations to share a single communications
link. Packets are queued in a buffer and transmitted when the link is available. Communication is
asynchronous - messages arrive at their destination after a delay that varies depending upon the
time that packets take to travel through the network.
Packet transmission
In most applications of computer networks the requirement is for the transmission of
logical units of information, or messages: sequences of data items of arbitrary length. But before
a message is transmitted it is subdivided into packets.
The simplest form of packet is a sequence of binary data (an array of bits or bytes) of
restricted length, together with addressing information sufficient to identify the source and
destination computers.
Packets of restricted length are used:
• so that each computer in the network can allocate sufficient buffer storage to hold the
largest possible incoming packet;
• to avoid the undue delays that would occur in waiting for communication channels to
become free if long messages were transmitted without subdivision.
Data streaming
There are major exceptions to the rule that message-based communication meets most
application needs. Multimedia applications rely upon the transmission of streams of audio and
video data elements at guaranteed rates and with bounded latencies. Such streams differ
substantially from the message-based type of traffic for which packet transmission was designed.
The streaming of audio and video requires much higher bandwidths than most other forms
of communication in distributed systems. The transmission of a video stream for real-time
display requires a bandwidth of about 1.5 Mbps if the data is compressed or 120 Mbps if
uncompressed.
The play time of a multimedia element is the time at which it must be displayed (for a
video element) or converted to audio (for a sound sample). For example, in a stream of video
frames that has a frame rate of 24 frames per second, frame N has a play time that is N/24
seconds after the stream's start time. Elements that arrive at their destination later than their play
time are no longer useful and will be dropped by the receiving process.
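The play-time rule for a 24 frames-per-second stream can be sketched as follows; the frame numbers and arrival times are invented for illustration.

```python
FRAME_RATE = 24  # frames per second

def play_time(frame_number, stream_start):
    # Frame N must be displayed N/24 seconds after the stream's start.
    return stream_start + frame_number / FRAME_RATE

def receiver(frames, stream_start):
    # Frames that arrive later than their play time are no longer
    # useful and are dropped by the receiving process.
    return [n for n, arrival in frames if arrival <= play_time(n, stream_start)]

# (frame number, arrival time in seconds); the stream starts at t = 0
frames = [(1, 0.03), (2, 0.10), (3, 0.12)]
playable = receiver(frames, stream_start=0.0)   # frame 2 is late (0.10 > 2/24)
```

This shows why bounded latency, not just average bandwidth, is a quality-of-service requirement: a single late frame is wasted work.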
The timely delivery of such data streams depends upon the availability of connections
with guaranteed quality of service bandwidth, latency and reliability must all be guaranteed.
With a predefined route through the network, a reserved set of resources at each node
through which it will travel and buffering where appropriate to cushion any irregularities in the
flow of data through the channel. Data can then be passed through the channel from sender to
receiver at the required rate.
Switching schemes
A network consists of a set of nodes connected together by circuits. To transmit
information between two arbitrary nodes, a switching system is required.
The four types of switching that are used in computer networking are:
Broadcast :
Broadcasting is a transmission technique that involves no switching. Everything is
transmitted to every node, and it is up to potential receivers to notice transmissions addressed to
them.
Some LAN technologies, including Ethernet, are based on broadcasting. Wireless
networking is necessarily based on broadcasting, but in the absence of fixed circuits the
broadcasts are arranged to reach nodes grouped in cells.
Circuit switching:
At one time telephone networks were the only telecommunication networks. Their
operation was simple to understand: when a caller dialed a number, the pair of wires from her
phone to the local exchange was connected by an automatic switch at the exchange to the pair of
wires connected to the other party's phone.
For a long-distance call the process was similar but the connection would be switched
through a number of intervening exchanges to its destination. This system is sometimes referred
to as the plain old telephone system, or POTS. It is a typical circuit-switching network.
Packet Switching:
This new type of communication network is called a store-and-forward network. There is
a computer at each switching node (wherever several circuits need to be interconnected). Packets
arriving at a node are first stored in the memory of the computer at the node and then processed
by a program that forwards them towards their destinations by choosing an outgoing circuit that
will transfer each packet to another node that is closer to its ultimate destination.
Frame relay
The store-and-forward transmission of packets is not instantaneous. It typically takes
anything from a few tens of microseconds to a few milliseconds to switch a packet through each
network node, depending on packet size, hardware speeds and the quantity of other traffic.
Packets may be routed through many nodes before they reach their destination.
The Internet is based on store-and-forward switching; even short Internet packets typically
take around 200 milliseconds to reach their destinations. Delays of this magnitude are much too
long for applications such as telephony, where delays of less than 50 milliseconds are needed to
sustain a telephone conversation without interference.
Since the delay is additive (the more nodes a packet passes through, the more it is
delayed) and much of the delay at each node arises from factors inherent to the packet-switching
technique, frame relay networks reduce the delay by switching small packets (frames) on the fly
as their bits arrive, instead of storing each packet completely before forwarding it.
Protocols
The term protocol is used to refer to a well-known set of rules and formats to be used for
communication between processes in order to perform a given task. The definition of a protocol
has two important parts to it:
• a specification of the sequence of messages that must be exchanged;
• a specification of the format of the data in the messages.
A protocol is implemented by a pair of software modules located in the sending and
receiving computers. For example, a transport protocol transmits messages of any length from a
sending process to a receiving process.
A process wishing to transmit a message to another process issues a call to a transport
protocol module, passing it a message in the specified format. The transport software then
concerns itself with the transmission of the message to its destination, subdividing it into packets
of some specified size and format that can be transmitted to the destination via the network
protocol (another, lower-level protocol).
The corresponding transport protocol module in the receiving computer receives the
packet via the network-level protocol module and performs inverse transformations to regenerate
the message before passing it to a receiving process.
Protocol layers
Network software is arranged in a hierarchy of layers. Each layer presents an interface to
the layers above it that extends the properties of the underlying communication system. A layer
is represented by a module in every computer connected to the network. Following figure
illustrates the structure and the flow of data when a message is transmitted using a layered
protocol.
Each module appears to communicate directly with a module at the same level in another
computer in the network, but in reality data is not transmitted directly between the protocol
modules at each level. Instead, each layer of network software communicates by local procedure
calls with the layers above and below it.
On the sending side, each layer (except the topmost, or application layer) accepts items of
data in a specified format from the layer above it and applies transformations to encapsulate the
data in the format specified for that layer before passing it to the layer below for further
processing.
The following figure illustrates this process as it applies to the top four layers of the OSI
protocol suite. The figure shows the packet headers that hold most network-related data items, but for
clarity it omits the trailers that are present in some types of packet; it also assumes that the application-
layer message to be transmitted is shorter than the underlying network's maximum packet size.
On the receiving side, the converse transformations are applied to data items received from the
layer below before they are passed to the layer above. The protocol type of the layer above is included
in the header of each layer, to enable the protocol stack at the receiver to select the correct software
components to unpack the packets.
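The encapsulation on the sending side and the unpacking on the receiving side can be mimicked with readable string headers; the layer names are illustrative and the bracketed "headers" stand in for the binary headers of a real stack.

```python
def encapsulate(message: bytes, layers) -> bytes:
    # On the sending side each layer accepts data from the layer above
    # and wraps it in its own header before passing it downwards.
    packet = message
    for name in layers:                          # transport first, link last
        packet = f"[{name}]".encode() + packet   # header is prepended
    return packet

def decapsulate(packet: bytes, layers) -> bytes:
    # On the receiving side the converse transformations are applied:
    # the outermost (lowest-layer) header is stripped first.
    for name in reversed(layers):
        header = f"[{name}]".encode()
        assert packet.startswith(header), "unexpected protocol header"
        packet = packet[len(header):]
    return packet

layers = ["transport", "network", "link"]   # top to bottom, below the app layer
wire = encapsulate(b"hello", layers)
assert wire == b"[link][network][transport]hello"
assert decapsulate(wire, layers) == b"hello"
```

In a real stack each header also records the protocol type of the layer above, which is how the receiver selects the right module to do the unwrapping.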
Protocol suites:
A complete set of protocol layers is referred to as a protocol suite or a protocol stack,
reflecting the layered structure. The following figure shows a protocol stack that conforms to the
seven-layer Reference Model for Open Systems Interconnection (OSI) adopted by the International
Standards Organization (ISO) [ISO 1992]. The OSI
Reference Model was adopted in order to encourage the development of protocol standards that
would meet the requirements of open systems.
Protocol layering brings substantial benefits in simplifying and generalizing the software
interfaces for access to the communication services of networks, but it also carries significant
performance costs. The transmission of an application-level message via a protocol stack with N
layers typically involves N transfers of control to the relevant layer of software in the protocol
suite, at least one of which is an operating system entry, and taking N copies of the data as a part
of the encapsulation mechanism. All of these overheads result in data transfer rates between
application processes that are much lower than the available network bandwidth.
The figure includes examples from protocols used in the Internet, but the implementation of the
Internet does not follow the OSI model in two respects.
Packet assembly
The task of dividing messages into packets before transmission and reassembling them at
the receiving computer is usually performed in the transport layer.
The network-layer protocol packets consist of a header and a data field. In most network
technologies, the data field is variable in length, but with a limit called the maximum transfer unit
(MTU).
If the length of a message exceeds the MTU of the underlying network layer, it must be
fragmented into chunks of the appropriate size, with sequence numbers for use on reassembly,
and transmitted in multiple packets. For example, the MTU for Ethernets is 1500 bytes; no more
than that quantity of data can be transmitted in a single Ethernet packet.
Although the IP protocol stands in the position of a network-layer protocol in the Internet
suite of protocols, its MTU is unusually large at 64 kbytes (8 kbytes is often used in practice
because some nodes are unable to handle such large packets). Whichever MTU value is adopted
for IP packets, packets larger than the Ethernet MTU can arise and they must be fragmented for
transmission over Ethernets.
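Fragmentation of an over-long message into MTU-sized chunks with sequence numbers, and reassembly from possibly out-of-order fragments, might look like this sketch:

```python
MTU = 1500  # Ethernet MTU in bytes

def fragment(message: bytes, mtu: int = MTU):
    # Split the message into chunks no larger than the MTU, each
    # tagged with a sequence number for use on reassembly.
    return [(seq, message[i:i + mtu])
            for seq, i in enumerate(range(0, len(message), mtu))]

def reassemble(fragments):
    # Fragments may arrive out of order; sort on the sequence number.
    return b"".join(data for _, data in sorted(fragments))

msg = bytes(4000)          # a 4000-byte message needs three Ethernet packets
frags = fragment(msg)
assert len(frags) == 3 and len(frags[0][1]) == 1500
frags.reverse()            # simulate out-of-order arrival
assert reassemble(frags) == msg
```

Real IP fragmentation carries byte offsets and flags rather than bare sequence numbers, but the division-and-reassembly idea is the same.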
Ports:
The transport layer's task is to provide a network-independent message transport service
between pairs of network ports. Ports are software-definable destination points for
communication within a host computer.
Ports are attached to processes, enabling them to communicate in pairs. The specific
details of the port abstraction may be varied to provide additional useful properties. Here we shall
describe the addressing of ports as it occurs in the Internet and most other networks.
Addressing:
The transport layer is responsible for delivering messages to destinations with transport
addresses that are composed of the network address of a host computer and a port number. A
network address is a numeric identifier that uniquely identifies a host computer and enables it to
be located by nodes that are responsible for routing data to it.
In the Internet every host computer is assigned an IP number, which identifies it and the
subnet to which it is connected, enabling data to be routed to it from any other node as described
in the following sections. In Ethernets there are no routing nodes; each host is responsible for
recognizing and picking up packets addressed to it.
A distributed system generally has a multiplicity of servers, which differs from one time
to another and from one organization to another. Clearly, the allocation of fixed hosts or fixed
port numbers to these services is not feasible.
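A transport address as a (host, port) pair can be demonstrated with two UDP sockets on the loopback interface; the message content is arbitrary, and letting the OS pick a free port echoes the point that fixed port allocations are not always feasible.

```python
import socket

# Two sockets on the local host; each bound port is a software-defined
# destination point for communication within the host (loopback only).
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))       # port 0 = let the OS pick a free port
host, port = server.getsockname()   # the server's transport address

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"read /etc/motd", (host, port))   # address = (host, port)

data, sender = server.recvfrom(1024)
assert data == b"read /etc/motd"

client.close()
server.close()
```

The client never needed to know anything about the server process itself, only its transport address, which is what makes a name service that maps service names to current (host, port) pairs workable.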
Packet delivery :
There are two approaches to the delivery of packets by the network layer:
Datagram packet delivery:
The term 'datagram' refers to the similarity of this delivery mode to the way in which
letters and telegrams are delivered. The essential feature of datagram networks is that the delivery
of each packet is a 'one-shot' process; no setup is required and once the packet is delivered the
network retains no information about it.
In a datagram network a sequence of packets transmitted by a single host to a single
destination may follow different routes (if, for example, the network is capable of adaptation to
handle failures or to mitigate the effects of localized congestion) and when this occurs they may
arrive out of sequence.
Every datagram packet contains the full network address of the source and destination hosts.
Datagram delivery is the concept on which packet networks were originally based and it can be
found in most of the computer networks in use today. The Internet's network layer (IP), the
Ethernet and most wired and wireless local network technologies are based on datagram delivery.
Virtual circuit packet delivery
A virtual circuit must be set up before packets can pass from a source host A to
destination host B. The establishment of a virtual circuit involves the identification of a route
from the source to the destination, possibly passing through several intermediate nodes. At each
node along the route a table entry is made, indicating which link should be used for the next stage
of the route.
Once a virtual circuit has been set up, it can be used to transmit any number of packets.
Each network-layer packet contains only a virtual circuit number in place of the source and
destination addresses. The addresses are not needed, because packets are routed at intermediate
nodes by reference to the virtual circuit number.
When a packet reaches its destination the source can be determined from the virtual
circuit number. In the POTS a telephone call results in the establishment of a physical circuit
from the caller to the callee, and the voice links from which it is constructed are reserved for their
exclusive use.
In virtual circuit packet delivery the circuits are represented only by table entries in
routing nodes, and the links along which the packets are routed are used only for the time taken
to transmit a packet; they are free for other uses for the rest of time.
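Virtual-circuit forwarding by table lookup can be simulated with per-node tables; the nodes, link names and VC numbers below are invented for illustration.

```python
# Per-node table entries map (incoming link, incoming VC number) to
# (outgoing link, outgoing VC number). Packets carry only a VC number,
# not full source and destination addresses.
tables = {
    "A": {("host", 1): ("linkAB", 7)},
    "B": {("linkAB", 7): ("linkBC", 3)},
    "C": {("linkBC", 3): ("host", 9)},   # "host" = deliver locally at C
}

def deliver(start_link, start_vc, route=("A", "B", "C")):
    # Follow the table entries made when the circuit was established.
    link, vc = start_link, start_vc
    for node in route:
        link, vc = tables[node][(link, vc)]
    return link, vc

assert deliver("host", 1) == ("host", 9)   # packet arrives at C on VC 9
```

Setting up the circuit means installing these table entries along the chosen route; tearing it down means deleting them, after which the links are free for other traffic.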
Routing:
Routing is a function that is required in all networks except those LANs, such as the
Ethernet, that provide direct connections between all pairs of attached hosts. In large networks,
adaptive routing is employed: the best route for communication between two points in the
network is re-evaluated periodically, taking into account the current traffic in the network and
any faults such as broken connections or routers.
The delivery of packets to their destinations in a network such as the one shown in the
following figure is the collective responsibility of the routers located at connection points. The
determination of routes for the transmission of packets to their destinations is the responsibility
of a routing algorithm implemented by a program in the network layer at each node.
A routing algorithm has two parts:
1. It must make decisions that determine the route taken by each packet as it travels
through the network. In circuit-switched network layers such as X.25 and frame
relay networks such as ATM the route is determined whenever a virtual circuit or
connection is established.
In packet-switched network layers such as IP it is determined separately for each
packet, and the algorithm must be particularly simple and efficient if it is not to degrade
network performance.
2. It must dynamically update its knowledge of the network based on traffic
monitoring and the detection of configuration changes or failures. This activity is
less time-critical; slower and more computation-intensive techniques can be used.
Routing tables for the network
A router exchanges information about the network with its neighbouring nodes by
sending a summary of its routing table using a Routing Information Protocol (RIP). The RIP actions
performed at a router are described informally as follows:
1. Periodically, and whenever the local routing table changes, send the table (in a
summary form) to all accessible neighbours. That is, send an RIP packet containing a copy of the
table on each non-faulty outgoing link.
2. When a table is received from a neighbouring router, if the received table shows
a route to a new destination, or a better (lower cost) route to an existing destination, then update
the local table with the new route. If the table was received on link n and it gives a different
cost than the local table for a route that begins with link n, then replace the cost in the local
table with the new cost.
This is done because the new table was received from a router that is closer to the
relevant destination and is therefore always more authoritative for routes that pass through it.
Pseudo-code for RIP routing algorithm
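Since the pseudo-code figure is not reproduced here, a minimal distance-vector update in the spirit of the two RIP rules above might look like the following; the table layout and cost metric are simplifying assumptions.

```python
INFINITY = 16  # RIP's conventional "unreachable" metric

def rip_update(local, received, link, link_cost=1):
    # Merge a neighbour's table (received on `link`) into the local
    # table, which maps destination -> (outgoing link, cost).
    changed = False
    for dest, (_, cost) in received.items():
        new_cost = min(cost + link_cost, INFINITY)
        if dest not in local:
            local[dest] = (link, new_cost)          # new destination
            changed = True
        else:
            cur_link, cur_cost = local[dest]
            # Accept a cheaper route, or believe the neighbour for
            # routes that already go via this link: it is closer to the
            # destination and therefore more authoritative.
            if new_cost < cur_cost or (cur_link == link and new_cost != cur_cost):
                local[dest] = (link, new_cost)
                changed = True
    return changed

local = {"X": ("link1", 5)}
neighbour = {"X": ("link9", 2), "Y": ("link9", 1)}
assert rip_update(local, neighbour, "link2")
assert local == {"X": ("link2", 3), "Y": ("link2", 2)}
```

When `rip_update` returns True the local table changed, which by rule 1 triggers sending the updated summary to all neighbours.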
Congestion control
The capacity of a network is limited by the performance of its communication links
and switching nodes. When the load at any particular link or node approaches its capacity,
queues will build up at hosts trying to send packets and at intermediate nodes holding
packets whose onward transmission is blocked by other traffic.
If the load continues at the same high level, the queues will continue to grow until
they reach the limit of available buffer space. Once this state is reached at a node, the node
has no option but to drop further incoming packets.
If packets are dropped at intermediate nodes, the network resources that they have
already consumed are wasted and the resulting retransmissions will require a similar quantity
of resources to reach the same point in the network. As a rule of thumb, when the load on a
network exceeds 80% of its capacity, the total throughput tends to drop as a result of packet
losses unless usage of heavily loaded links is controlled.
In general, congestion control is achieved by informing nodes along a route that congestion
has occurred, and their rate of packet transmission should therefore be reduced. For
intermediate nodes, this will result in the buffering of incoming packets for a longer period. For
hosts that are sources of the packets, the result may be to queue packets before transmission or to
block the application process that is generating them until the network can handle them.
All datagram-based network layers, including IP and Ethernets, rely on the end-to-end
control of traffic. That is, the sending node must reduce the rate at which it transmits packets
based only on information that it receives from the receiver. Congestion information may be
supplied to the sending node by explicit transmission of special messages (called choke packets)
requesting a reduction in transmission rate, or by the implementation of a specific transmission
control mechanism.
Internetworking
There are many network technologies with different network, link and physical layer
protocols. Local networks are built from Ethernet and ATM technologies, wide area networks are
built over analogue and digital telephone networks of various types, satellite links and wide-area
ATM networks. Individual computers and local networks are linked to the Internet or intranets by
modems, ISDN links and DSL connections.
To build an integrated network (an internetwork) we must integrate many subnets, each of which
is based on one of these network technologies. To make this possible, the following are
needed:
1. A unified internetwork addressing scheme that enables packets to be addressed to any
host connected to any subnet.
2. A protocol defining the format of internetwork packets and giving rules according to
which they are handled.
3. Interconnecting components that route packets to their destinations in terms of
internetwork addresses, transmitting the packets using subnets with a variety of
network technologies.
Distributed Systems Notes
For the Internet, (1) is provided by IP addresses, (2) is the IP protocol and (3) is performed
by the components called Internet routers. The following figure shows a small part of the
intranet located at Queen Mary and Westfield College (QMW), University of London.
Routers
Routing is required in all networks except those such as Ethernets and wireless networks, in
which all of the hosts are connected by a single transmission medium. In an internetwork, the
routers may be linked by direct connections or they may be interconnected through subnets. In
both cases, the routers are responsible for forwarding the internetwork packets that arrive on any
connection to the correct outgoing connection.
Bridges: Bridges link networks of different types. Some bridges link several networks, and
these are referred to as bridge/routers because they also perform routing functions. For
example, the campus network at QMW includes a Fibre Distributed Data Interface (FDDI) backbone.
Hubs:
Hubs are simply a convenient means of connecting hosts and extending segments of Ethernet and other broadcast local network technologies. They have a number of sockets (typically 4-64), to each of which a host computer can be connected. They can also be used to overcome the distance limitations on single segments and provide a means of adding additional hosts.
Switches:
Switches perform a similar function to routers, but for local networks (normally
Ethernets) only. That is, they interconnect several separate Ethernets, routing the incoming
packets to the appropriate outgoing network.
They perform their task at the level of the Ethernet network protocol. They start with no
knowledge of the wider internetwork and build up routing tables by the observation of traffic,
supplemented by broadcast requests when they lack information.
The advantage of switches over hubs is that they separate the incoming traffic and transmit it only
on the relevant outgoing network, reducing congestion on the other networks to which they are
connected.
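The traffic-observation behaviour described above can be sketched as a backward-learning table: the switch records which port each source address was seen on and uses that table to forward. This is an illustrative sketch only; the class and method names are invented, and a real switch operates on hardware MAC addresses and Ethernet frames.

```java
import java.util.*;

// Sketch of a backward-learning switch: learn the source's port from
// each incoming frame; forward to the known port for the destination,
// or flood to all other ports when the destination is still unknown.
public class LearningSwitch {
    private final Map<String, Integer> macToPort = new HashMap<>();
    private final int portCount;

    public LearningSwitch(int portCount) { this.portCount = portCount; }

    // Returns the list of ports the frame is sent out on.
    public List<Integer> handleFrame(String srcMac, String dstMac, int inPort) {
        macToPort.put(srcMac, inPort);            // learn: src reachable via inPort
        Integer outPort = macToPort.get(dstMac);
        List<Integer> out = new ArrayList<>();
        if (outPort != null && outPort != inPort) {
            out.add(outPort);                     // known destination: one port
        } else if (outPort == null) {
            for (int p = 0; p < portCount; p++)   // unknown destination: flood
                if (p != inPort) out.add(p);
        }                                         // dst on the arrival port: drop
        return out;
    }
}
```

Once both hosts have sent at least one frame, traffic between them is no longer flooded, which is exactly the congestion-reduction advantage over hubs noted above.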
Tunnelling:
Bridges and routers transmit internetwork packets over a variety of underlying networks, but
there is one situation in which the underlying network protocol can be hidden from those
above without the use of a special internetwork protocol.
When a pair of nodes connected to two separate networks need to communicate through another
type of network or over an `alien' protocol, they can do so by constructing a protocol `tunnel'.
The following figure illustrates the proposed use of tunnelling to support the migration of the
Internet to the recently approved IPv6 protocol. IPv6 is intended to replace the version of IP
currently in use, IPv4, and is incompatible with it.
Dept. of CSE,SIT 50 KNS
The intervening network nodes do not need to be modified to handle the Mobile IP protocol. The IP
multicast protocol is handled in a similar way, relying on the few routers that support IP multicast
routing to determine the routes, but transmitting the multicast packets between those routers
encapsulated in ordinary IP packets.
1.11 Internet protocols
The Internet Protocol (IP) is the method or protocol by which data is sent from one
computer to another on the Internet. Each computer (known as a host) on the Internet has at least
one IP address that uniquely identifies it from all other computers on the Internet. When you send
or receive data (for example, an e-mail note or a Web page), the message gets divided into little
chunks called packets.
Each of these packets contains both the sender's Internet address and the receiver's address. Any
packet is sent first to a gateway computer that understands a small part of the Internet. The
gateway computer reads the destination address and forwards the packet to an adjacent
gateway that in turn reads the destination address and so forth across the Internet until one
gateway recognizes the packet as belonging to a computer within its immediate
neighborhood or domain.
That gateway then forwards the packet directly to the computer whose address is specified.
Because a message is divided into a number of packets, each packet can, if necessary, be sent by
a different route across the Internet. Packets can arrive in a different order than the order they
were sent in. The Internet Protocol just delivers them. It's up to another protocol, the
Transmission Control Protocol (TCP) to put them back in the right order.
IP is a connectionless protocol, which means that there is no continuing connection between
the end points that are communicating. Each packet that travels through the Internet is treated as
an independent unit of data without any relation to any other unit of data. (The reason the packets
do get put in the right order is because of TCP, the connection-oriented protocol that keeps track
of the packet sequence in a message.) In the Open Systems Interconnection (OSI)
communication model, IP is in layer 3, the Networking Layer.
The most widely used version of IP today is Internet Protocol Version 4 (IPv4). However, IP
Version 6 (IPv6) is also beginning to be supported. IPv6 provides for much longer addresses and
therefore for the possibility of many more Internet users. IPv6 includes the capabilities of IPv4
and any server that can support IPv6 packets can also support IPv4 packets.
TCP/IP Layers
Addressing
The scheme used for assigning host addresses to networks and the computers connected to them
had to satisfy the following requirements:
It must be universal: any host must be able to send packets to any other host in the
Internet.
It must be efficient in its use of the address space: it is impossible to predict the
ultimate size of the Internet and the number of network and host addresses likely to
be required. The address space must be carefully partitioned to ensure that addresses will
not run out.
The addressing scheme must lend itself to the development of a flexible and
efficient routing scheme, but the addresses themselves cannot contain very much of
the information needed to route a packet to its destination.
The design adopted for the Internet address space is shown in the following figure. There
are four allocated classes of Internet address: A, B, C and D. Class D is reserved for
Internet multicast communication. Class E contains a range of unallocated addresses, which
are reserved for future requirements.
These 32-bit Internet addresses containing a network identifier and host identifier are usually
written as a sequence of four decimal numbers separated by dots. Each decimal number
represents one of the four bytes, or octets of the IP address. The permissible values for each
class of network address are shown in the following figure.
Three classes of address were designed to meet the requirements of different types of
organization. The Class A addresses, with a capacity for 2^24 hosts on each subnet, are reserved for
very large networks such as the US NSFNet and other national wide area networks. Class B
addresses are allocated to organizations that operate networks likely to contain more than 255
computers, and Class C addresses are allocated to all other network operators.
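The class of a (pre-CIDR) IPv4 address can be read off from its first octet. A minimal sketch, for illustration only (the class and method names are invented):

```java
// Determines the class of a dotted-decimal IPv4 address from its first
// octet: A = 0-127, B = 128-191, C = 192-223, D (multicast) = 224-239,
// E (reserved) = 240-255. Illustrative sketch, not production code.
public class IpClass {
    public static char classOf(String dotted) {
        int first = Integer.parseInt(dotted.split("\\.")[0]);
        if (first < 128) return 'A';
        if (first < 192) return 'B';
        if (first < 224) return 'C';
        if (first < 240) return 'D';
        return 'E';
    }
}
```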
The main difficulty is that network administrators in user organizations cannot easily predict
future growth in their need for host addresses and they tend to overestimate, requesting Class B
addresses when in doubt. Around 1990 it became evident that based on the rate of allocation at
the time, the NIC was likely to run out of IP addresses to allocate around 1996.
Two steps were taken. The first was to initiate the development of a new IP protocol and
addressing scheme, the result of which was the specification of IPv6. The second was to
radically modify the way in which IP addresses were allocated, with a new address allocation and
routing scheme designed to make more effective use of the IP address space: classless
interdomain routing (CIDR).
The IP protocol
The IP protocol transmits datagrams from one host to another, if necessary via intermediate
routers. Several header fields are used by the transmission and routing algorithms.
IP provides a delivery service that is described as offering unreliable or best-effort delivery
semantics, because there is no guarantee of delivery. Packets can be lost, duplicated, delayed or
delivered out of order, but these errors arise only when the underlying networks fail or buffers at
the destination are full.
The only checksum in IP is a header checksum, which is inexpensive to calculate and
ensures that any corruption in the addressing and packet-management data will be detected.
There is no data checksum, which avoids overheads when crossing routers, leaving the higher-
level protocols (TCP and UDP) to provide their own checksums: a practical instance of the end-
to-end argument.
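The header checksum mentioned above is the standard Internet checksum: the header is summed as 16-bit words in ones'-complement arithmetic, and the result is complemented. A sketch (illustrative only; names are my own):

```java
// Illustrative Internet checksum (RFC 1071 style), as used for the IP
// header: add 16-bit words in ones'-complement arithmetic (folding any
// carry back in), then take the ones' complement of the sum. A receiver
// summing the header including its checksum field should obtain zero.
public class HeaderChecksum {
    public static int checksum(int[] words) {      // header as 16-bit words
        int sum = 0;
        for (int w : words) {
            sum += w & 0xFFFF;
            sum = (sum & 0xFFFF) + (sum >>> 16);   // fold carry back in
        }
        return (~sum) & 0xFFFF;
    }
}
```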
The IP layer puts IP datagrams into network packets suitable for transmission in the underlying
network (which might, for example, be an Ethernet). When an IP datagram is longer than the MTU
of the underlying network, it is broken into smaller packets at the source and reassembled at its
final destination. Packets can be further broken up to suit the underlying networks encountered
during the journey from source to destination.
The IP layer must also insert the `physical' network address of the message destination into the
underlying network's packets. It obtains this from the address resolution module in the Internet
network interface layer.
Address Resolution
The address resolution module is responsible for converting Internet addresses to network
addresses for a specific underlying network (sometimes called physical addresses). For example, if
the underlying network is an Ethernet, the Address Resolution module converts 32-bit Internet
addresses to 48-bit Ethernet addresses.
This translation is network technology-dependent:
• Some hosts are connected directly to Internet packet switches; IP packets can be routed to them
without address translation.
• Some local area networks allow network addresses to be assigned to hosts dynamically, and the
addresses can be conveniently chosen to match the host identifier portion of the Internet address
- translation is simply a matter of extracting the host identifier from the IP address.
• For Ethernets and some other local networks the network address of each
computer is hard-wired into its network interface hardware and bears no direct relation to its
Internet address - translation depends upon knowledge of the correspondence between IP
addresses and Ethernet addresses for the hosts on the local Ethernet.
IP spoofing:
We have seen that IP packets include a source address, the IP address of the sending
computer. This, together with the port address encapsulated in the data field (for UDP and TCP
packets), is often used by servers to generate a return address. Unfortunately, it is not possible to
guarantee that the source address given is in fact the address of the sender.
A malicious sender can easily substitute an address that is different from its own. This
loophole has been the source of several well-known attacks, including the distributed denial of
service attacks of February 2000. These malicious ping requests all contained the IP address of a
target computer in their sender address field. The ping responses were therefore all directed to the
target, whose input buffers were overwhelmed, preventing any legitimate IP packets from getting
through.
IP Routing:
The IP layer routes packets from their source to their destination. Each router in the Internet
implements IP-layer software to provide a routing algorithm.
Backbones
The topological map of the Internet is partitioned conceptually into autonomous systems
(ASs), which are subdivided into areas. The intranets of most large organizations such as
universities and large companies are regarded as ASs, and they will usually include several areas.
The collection of routers that connect non-backbone areas to the backbone and the links that
interconnect those routers are called the backbone of the network. The links in the backbone are
usually of high bandwidth and are replicated for reliability.
Routing protocols
RIP-1, the first routing algorithm used in the Internet, is a version of the distance-vector
algorithm. It was subsequently extended to accommodate several additional requirements,
including classless interdomain routing, better multicast routing and the need for authentication
of RIP packets to prevent attacks on the routers.
As the scale of the Internet has expanded and the processing capacity of routers has
increased, there has been a move towards the adoption of algorithms that do not suffer from the
slow convergence and potential instability of distance-vector algorithms.
We should note that the adoption of new routing algorithms in IP routers can proceed
incrementally. A change in routing algorithm results in a new version of the RIP protocol, and a
version number is carried by each RIP packet. The IP protocol does not change when a new RIP
protocol is introduced.
Any IP router will correctly forward incoming IP packets on a reasonable, if not optimum,
route, whatever version of RIP it uses. But for routers to cooperate in the updating of their routing
tables, they must share a similar algorithm.
For this purpose the topological areas defined above are used. Within each area a single
routing algorithm applies and the routers within an area cooperate in the maintenance of their
routing tables.
Default routes
The discussion of routing algorithms so far suggests that every router maintains a full routing
table showing the route to every destination (subnet or directly connected host) in the Internet.
At the current scale of the Internet this is clearly infeasible.
Two possible solutions to this problem:
The first solution is to adopt some form of topological grouping of IP addresses.
The decision was taken that for future allocations, the following regional assignments would
apply:
Addresses 194.0.0.0 to 195.255.255.255 are in Europe
Addresses 198.0.0.0 to 199.255.255.255 are in North America
Addresses 200.0.0.0 to 201.255.255.255 are in Central and South America
Addresses 202.0.0.0 to 203.255.255.255 are in Asia and the Pacific
Because these geographical regions also correspond to well-defined topological regions in the
Internet and just a few gateway routers provide access to each region, this enables a substantial
simplification of routing tables for those address ranges. For example. a router outside
Europe can have a single table entry for the range of addresses 194.0.0.0 to
195.255.255.255 that sends all IP packets with destinations in that range on the same route to the
nearest European gateway router.
The second solution to the routing table size explosion is simpler and very effective. It is based on
the observation that the accuracy of routing information can be relaxed for most routers, as long as
some key routers, those closest to the backbone links, have relatively complete routing tables. A
default entry specifies a route to be used for all IP packets whose destination is not included in
the routing table.
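The default-entry behaviour can be sketched as a table lookup with a fallback route. This is an illustrative sketch only; the table contents, class and method names are invented:

```java
import java.util.*;

// Sketch of a routing table with a default entry: known destinations
// are looked up directly; any destination not in the table is sent on
// the default route (typically towards the backbone).
public class RoutingTable {
    private final Map<String, String> routes = new HashMap<>();
    private final String defaultRoute;

    public RoutingTable(String defaultRoute) { this.defaultRoute = defaultRoute; }

    public void add(String subnet, String outgoingLink) { routes.put(subnet, outgoingLink); }

    public String lookup(String subnet) {
        // fall back to the default route when the destination is unknown
        return routes.getOrDefault(subnet, defaultRoute);
    }
}
```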
Routing on a local subnet: Packets addressed to hosts on the same network as the sender are
transmitted to the destination host in a single hop, using the host identifier
part of the address to obtain the address of the destination host on the underlying network. The IP
layer simply uses ARP to get the network address of the destination and
then uses the underlying network to transmit the packets.
If the IP layer in the sending computer discovers that the destination is on a different network,
it must send the message to a local router. It uses ARP to get the network address of the gateway
or router and then uses the underlying network to transmit the packet to it. Gateways and routers
are connected to two or more networks and they have several Internet addresses, one for each
network to which they are attached.
Classless interdomain routing (CIDR)
The main problem was a scarcity of Class B addresses - those for subnets with more than 255
hosts connected. Plenty of Class C addresses were available. The CIDR solution for this
problem is to allocate a batch of contiguous class C addresses to a subnet requiring more than
255 addresses.
The CIDR scheme also makes it possible to subdivide a Class B address space for allocation
to multiple subnets. The mask is a bit pattern that is used to select the portion of an IP address
that is compared with the routing table entry. This effectively enables the host/subnet address
to be any portion of the IP address, providing more flexibility than the classes A, B and C.
Once again, these changes to routers are made on an incremental basis, so some routers perform
CIDR and others use the old class-based algorithms. This works because the newly allocated
ranges of Class C addresses are assigned modulo 256, so each range represents an integral
number of Class C-sized subnet addresses. If a collection of subnets is connected to the rest of the
world entirely by CIDR routers, then the ranges of IP addresses used within the collection can be
allocated to individual subnets in chunks determined by a binary mask of any size.
For example, a Class C address space can be subdivided into 32 groups of 8 addresses. Figure 3.10
contains an example of the use of the CIDR mechanism to split the 138.37.95 Class C-sized subnet
into several groups of eight host addresses that are routed differently. The separate groups are
denoted by the notations 138.37.95.232/29, 138.37.95.248/29 and so on.
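The mask-based matching that CIDR relies on can be sketched as follows. Names are invented for illustration, and a real router would store prefixes in a longest-prefix-match structure rather than test one prefix at a time:

```java
// Illustrative CIDR match: does an address fall within a prefix such as
// 138.37.95.232/29? Both sides are converted to 32-bit integers and
// compared under the mask formed from the prefix length.
public class CidrMatch {
    static long toInt(String dotted) {
        long v = 0;
        for (String octet : dotted.split("\\."))
            v = (v << 8) | Integer.parseInt(octet);   // pack four octets
        return v;
    }

    public static boolean matches(String addr, String prefix) {
        String[] parts = prefix.split("/");
        int bits = Integer.parseInt(parts[1]);
        long mask = bits == 0 ? 0 : (0xFFFFFFFFL << (32 - bits)) & 0xFFFFFFFFL;
        return (toInt(addr) & mask) == (toInt(parts[0]) & mask);
    }
}
```

For a /29 the mask keeps the top 29 bits, so each group covers 8 consecutive host addresses, matching the groups of eight mentioned above.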
Firewalls
The purpose of a firewall is to monitor and control all communication into and out of an intranet.
A firewall is implemented by a set of processes that act as a gateway to an intranet (Figure
3.20(a)), applying a security policy determined by the organization.
The aims of a firewall security policy may include any or all of the following:
Service control:
To determine which services on internal hosts are accessible for external access and to reject all
other incoming service requests. Outgoing service requests and the responses to them may also be
controlled. These filtering actions can be based on the contents of IP packets and the TCP and
UDP requests that they contain. For example, incoming HTTP requests may be rejected unless
they are directed to an official web server host.
Behaviour control:
To prevent behaviour that infringes the organization's policies, is anti-social or has no
discernible legitimate purpose and is hence suspected of forming part of an attack. Some of these
filtering actions may be applicable at the IP or TCP level, but others may require interpretation of
messages at a higher level. For example, filtering of email `spam' attacks may require
examination of the sender's email address in message headers or even the message contents.
User control:
The organization may wish to discriminate between its users, allowing some access to
external services but inhibiting others from doing so. An example of user control that is perhaps
more socially acceptable than some is to prevent the downloading of software except by users
who are members of the system administration team, in order to prevent virus infection or to
maintain software standards. This particular example would in fact be difficult to implement
without inhibiting the use of the Web by ordinary users.
The policy has to be expressed in terms of filtering operations that are performed by filtering
processes operating at several different levels:
IP packet filtering:
This is a filter process examining individual IP packets. It may make decisions based on
the destination and source addresses. It may also examine the service type field of IP packets and
interpret the contents of the packets based on the type. For example, it may filter TCP packets
based on the port number to which they are addressed, and since services are generally located at
well-known ports, this enables packets to be filtered based on the service requested. For example,
many sites prohibit the use of NFS servers by external clients.
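A port-based filtering decision of the kind described can be sketched as follows. The rule set, class and method names are invented for illustration; real filters inspect the raw packet headers and support much richer rules:

```java
import java.util.Set;

// Sketch of port-based IP packet filtering: reject TCP packets
// addressed to well-known service ports that the site's policy
// prohibits (e.g. blocking external NFS access via port 2049).
public class PacketFilter {
    private final Set<Integer> blockedTcpPorts;

    public PacketFilter(Set<Integer> blockedTcpPorts) {
        this.blockedTcpPorts = blockedTcpPorts;
    }

    // Decide whether to forward a packet, given its transport protocol
    // and destination port extracted from the packet headers.
    public boolean accept(String protocol, int dstPort) {
        if (protocol.equals("TCP") && blockedTcpPorts.contains(dstPort))
            return false;   // policy violation: drop the packet
        return true;        // default: forward
    }
}
```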
TCP gateway:
A TCP gateway process checks all TCP connection requests and segment transmissions. When a
TCP gateway process is installed, the setting-up of TCP connections can be controlled and TCP
segments can be checked for correctness (some denial of service attacks use malformed TCP
segments to disrupt client operating systems). When desired, they can be routed through an
application-level gateway for content checking.
Application-level gateway:
An application-level gateway process acts as a proxy for an application process. For
example, a policy may be desired that allows certain internal users to make Telnet connections to
certain external hosts. When a user runs a Telnet program on his local computer, it attempts to
establish a TCP connection with a remote host.
The request is intercepted by the TCP gateway. The TCP gateway starts a Telnet proxy process
and the original TCP connection is routed to it. If the proxy approves the Telnet operation (the
user is authorized to use the requested host) it establishes another connection to the requested host
and then it relays all of the TCP packets in both directions. A similar proxy process would
run on behalf of each Telnet client, and similar proxies might he employed for FTP and other
services.
1.12 Question Bank
1. Define a distributed system.
2. Give examples of distributed systems.
3. Write short notes on the following: (i) HTTP (ii) HTML (iii) URL
4. What are the uses of web services?
5. Define heterogeneity.
6. What are the characteristics of heterogeneity?
7. What is the purpose of heterogeneous mobile code?
8. Why do we need openness?
9. How do we provide security?
10. Define scalability.
11. What are the types of transparencies?
12. Define transparency.
13. Define a system model.
14. What is the architectural model?
15. What is the fundamental model?
16. What are the threats that are difficult to treat in a distributed system?
17. Define middleware.
18. What are the different types of models?
19. Which types of network can be used by a distributed system?
20. What are the different types of networks?
21. Define latency.
22. What is the difference between networking and internetworking?
23. What is meant by networking?
24. What is meant by internetworking?
25. What are the different types of switching used in computer networking?
26. Define protocol.
27. What is the function of a router?
28. What is meant by the Internet Protocol?
29. Define domain name.
30. Define mobile IP.
PART-B
1. a. Explain the differences between an intranet and the Internet. (8)
   b. Write in detail about the WWW. (8)
2. Explain the various challenges of distributed systems. (16)
3. Write in detail about the characteristics of interprocess communication. (16)
4. a. Explain in detail about marshalling. (8)
   b. Explain about networking principles. (8)
5. Describe in detail about client-server communication. (16)
6. Write in detail about group communication. (16)
7. Explain in detail about the various system models. (16)
2.1 INTRODUCTION - INTER PROCESS COMMUNICATION
Inter process communication is concerned with the communication between processes
in a distributed system, both in its own right and as support for communication between
distributed objects.
The Java API for inter process communication in the internet provides both
datagram and stream communication.
The Application Program Interface (API) to UDP provides a message passing abstraction
- the simplest form of interprocess communication. This enables a sending process to transmit
a single message to a receiving process. The independent packets containing these messages
are called datagrams. In the Java and UNIX APIs, the sender specifies the destination using a
socket
- an indirect reference to a particular port used by the destination process at a
destination computer.
The API to TCP provides the abstraction of a two-way stream between pairs of processes.
The information communicated consists of a stream of data items with no message boundaries.
Request-reply protocols are designed to support client-server communication in the form
of either Remote Method Invocation (RMI) or Remote Procedure Call (RPC). Group
multicast protocols are designed to support group communication. Group multicast is a form of
interprocess communication in which one process in a group of processes transmits the same
message to all members of the group.
2.2 The API for the Internet Protocol
The characteristics of interprocess communication
Message passing between a pair of processes can be supported by two message
communication operations: send and receive.
In order for one process to communicate with another, one process sends a message to a
destination and another process at the destination receives the message. This activity
involves the communication of data from the sending process to the receiving process and may
involve the synchronization of the two processes.
A queue is associated with each message destination. Sending processes cause messages to be
added to remote queues and receiving processes remove messages from local queues.
Communication between sending and receiving processes may be either synchronous
or asynchronous.
In synchronous form of communication, the sending and receiving processes synchronize at
every message. In this case, both send and receive are blocking operations. Whenever a send is
issued the sending process is blocked until the corresponding receive is issued. Whenever
receive is issued, the process blocks until a message arrives.
In asynchronous form of communication, the use of the send operation is non-blocking in
that the sending process is allowed to proceed as soon as the message has been copied to a
local buffer and the transmission of the message proceeds in parallel with the sending
process. The receive operation can have blocking and non-blocking variants.
Messages are sent to (Internet address, local port) pairs. A local port is a message
destination within a computer, specified as an integer. A port has exactly one receiver but
can have many senders. Processes may use multiple ports from which to receive
messages. Servers generally publicize their port numbers for use by clients.
Sockets:
Both forms of communication (UDP and TCP) use the socket abstraction, which
provides an end point for communication between processes. Inter-process
communication consists of transmitting a message between a socket in one
process and a socket in another process.
For a process to receive messages, its socket must be bound to a local port and the
Internet address of the computer on which it runs. Processes may use the same
socket for sending and receiving messages.
Any process may make use of multiple ports to receive messages, but a process
cannot share ports with other processes on the same computer. Processes using IP
multicast are an exception in that they do share ports.
UDP datagram communication
A datagram sent by UDP is transmitted from a sending process to a receiving process
without acknowledgement or retries. If a failure occurs, the message may not arrive. To send or
receive messages, a process must first create a socket bound to an Internet address of the local
host and a local port. A server will bind its socket to a server port - one that it makes known to
clients so that they can send messages to it. A client binds its socket to any free local port.
A few issues relating to datagram communication are:
Message size - The receiving process needs to specify an array of bytes of a particular size.
If the message is too big for the array, it is truncated on arrival. Any application requiring
messages larger than the maximum must fragment them into chunks of that size.
Blocking- Sockets normally provide non-blocking sends and blocking receives for
datagram communication.
Timeouts - A receive that blocks for ever is suitable for use by a server that is waiting to
receive requests from its clients. But in some programs it is not appropriate that a process
that has issued a receive operation should wait indefinitely in situations where the potential
sending process has crashed or the expected message has been lost. To allow for such
requirements, timeouts can be set on sockets.
Receive from any - The receive method does not specify an origin for messages. Instead
an invocation of receive gets a message addressed to its socket from any origin.
A failure model for UDP datagrams - UDP datagrams suffer from the following failures:
Omission failures - messages may be dropped occasionally.
Ordering - messages can sometimes be delivered out of sender order.
Use of UDP:
The Domain Name Service (DNS), which looks up DNS names in the Internet, is
implemented over UDP. UDP datagrams are sometimes an attractive choice because they do not
suffer from overheads associated with guaranteed message delivery.
Java API for UDP datagrams:
The Java API provides datagram communication by means of two classes:
DatagramPacket
DatagramSocket
DatagramPacket: This class provides a constructor that makes an instance out of an array of
bytes comprising a message, the length of the message and the Internet address and local port
number of the destination socket, as shown in the following figure. This class provides another
constructor for receiving a message; its arguments specify an array of bytes in which to receive
the message and its length.
Array of bytes containing message | Length of message | Internet address | Port number
Figure: DatagramPacket
DatagramSocket: This class supports sockets for sending and receiving UDP datagrams. It
provides a constructor that takes a port number as argument, for use by a process that needs
to use a particular port. It also provides a no-argument constructor that allows the system to
choose a free local port.
The class DatagramSocket provides the following methods:
send and receive: These methods are for transmitting datagrams between a pair of sockets.
setSoTimeout: This method allows a timeout to be set. With a timeout set, the receive
method will block for the time specified and then throw an InterruptedIOException.
connect: This method is used for connecting the socket to a particular remote port and Internet
address.
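As an illustration of setSoTimeout, the following sketch wraps receive so that it returns instead of blocking for ever. The class, method names and the timeout value are invented for illustration; SocketTimeoutException is the subclass of InterruptedIOException actually thrown on expiry.

```java
import java.net.*;
import java.io.*;

// Sketch: a receive that gives up after a timeout instead of blocking
// for ever, as suggested in the Timeouts discussion above.
public class TimedReceive {
    public static boolean receiveWithTimeout(DatagramSocket s, DatagramPacket p, int millis) {
        try {
            s.setSoTimeout(millis);
            s.receive(p);
            return true;               // a datagram arrived in time
        } catch (SocketTimeoutException e) {
            return false;              // timed out: caller can retry or report failure
        } catch (IOException e) {
            return false;              // other I/O failure
        }
    }

    // Demonstration: receiving on a fresh, unused socket must time out.
    public static boolean demoTimesOut() {
        try (DatagramSocket s = new DatagramSocket()) {
            DatagramPacket p = new DatagramPacket(new byte[100], 100);
            return !receiveWithTimeout(s, p, 100);
        } catch (SocketException e) {
            return false;
        }
    }
}
```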
Program: UDP client
import java.net.*;
import java.io.*;
public class UDPClient {
    public static void main(String args[]) {
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket();
            byte[] m = args[0].getBytes();
            InetAddress aHost = InetAddress.getByName(args[1]);
            int serverPort = 6789;
            DatagramPacket request = new DatagramPacket(m, args[0].length(), aHost, serverPort);
            aSocket.send(request);
            byte[] buffer = new byte[1000];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            aSocket.receive(reply);
            System.out.println("Reply: " + new String(reply.getData()));
        } catch (SocketException e) { System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) { System.out.println("IO: " + e.getMessage());
        } finally { if (aSocket != null) aSocket.close(); }
    }
}
In the above program, the client creates a socket, sends a message to a server at port 6789
and then waits to receive a reply. The arguments of the main method supply a message and the
DNS hostname of the server.
The code that follows is the corresponding server program, which creates a socket bound to
its server port (6789) then repeatedly waits for the request message from a client, to which it
replies by sending back the same message.
Program - UDP Server:
import java.net.*;
import java.io.*;
public class UDPServer {
    public static void main(String args[]) {
        DatagramSocket aSocket = null;
        try {
            aSocket = new DatagramSocket(6789);
            byte[] buffer = new byte[1000];
            while (true) {
                DatagramPacket request = new DatagramPacket(buffer, buffer.length);
                aSocket.receive(request);
                DatagramPacket reply = new DatagramPacket(request.getData(),
                        request.getLength(), request.getAddress(), request.getPort());
                aSocket.send(reply);
            }
        } catch (SocketException e) { System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) { System.out.println("IO: " + e.getMessage());
        } finally { if (aSocket != null) aSocket.close(); }
    }
}
TCP stream communication
The API to the TCP protocol provides the abstraction of a stream of bytes to which data may be
written and from which data may be read. The following characteristics of the network are
hidden by the stream abstraction.
Message size: The application can choose how much data it writes to a stream or reads
from it.
Lost messages: TCP uses an acknowledgement scheme. If the sender does not
receive an acknowledgement within a timeout, it retransmits the message.
Flow control: TCP attempts to match the speed of the processes that read from and write
to a stream.
Message ordering and duplication: Message identifiers are associated with each IP packet,
which enables the recipient to detect and reject duplicates, or to reorder messages that do
not arrive in sender order.
Message destinations: A pair of communicating processes establishes a connection before
they can communicate over a stream. Once a connection is established, the processes simply
read from and write to the stream without needing to use Internet addresses and ports.
The API for stream communication assumes that when a pair of processes are establishing a
connection, one of them plays the client role and the other plays the server role, but thereafter
they would be peers.
The client role involves creating a stream socket bound to any port and then making a
connect request asking for a connection to a server at its server port. The server role involves
creating a listening socket bound to a server port and waiting for clients to request connections.
The listening socket maintains a queue of incoming connection requests. When the server
accepts a connection, a new stream socket is created for the server to communicate with a client,
meanwhile retaining its socket at the port for listening to other clients.
Some outstanding issues related to stream communication are
Matching of data items: Two communicating processes need to agree as to the contents of
the data transmitted over a stream. E.g., if one process writes an 'int' followed by a
'double' then the reader at the other end must read an 'int' followed by a 'double'.
Blocking: When a process attempts to read data from an input channel, it will get data
from the queue or it will block until data becomes available.
The process that writes data to a stream may be blocked by the TCP flow control
mechanism if the socket at the other end is queuing as much data as the protocol allows.
Threads: When a server accepts a connection, it generally creates a new thread in which to
communicate with the new client.
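The "matching of data items" point above can be illustrated without a network by writing to and reading from an in-memory byte stream; the values 42 and 3.14 are arbitrary.

```java
import java.io.*;

public class MatchingDataDemo {
    static String roundTrip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(42);            // the writer sends an int ...
        out.writeDouble(3.14);       // ... followed by a double
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        int i = in.readInt();        // so the reader must read an int first
        double d = in.readDouble();  // and then a double, in the same order
        return i + " " + d;
    }
    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip());
    }
}
```

Reading the fields in any other order would misinterpret the bytes, which is why both ends must agree on the contents of the stream.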
Failure Model:
To satisfy the integrity property, TCP uses checksums to detect and reject corrupt packets
and sequence numbers to detect and reject duplicate packets. To satisfy the validity
property, TCP uses timeouts and retransmissions.
Use of TCP: Many frequently used services run over TCP connection with reserved port
numbers. These include HTTP, FTP, Telnet and SMTP.
Java API for TCP streams: The Java interface to TCP streams is provided in the classes
ServerSocket and Socket.
ServerSocket: This class is intended for use by a server to create a socket at a server port for
listening for connect requests from clients. Its accept method gets a connect request from the
queue, or if the queue is empty, it blocks until one arrives.
Socket: The client uses a constructor of this class to create a socket, specifying the DNS hostname
and port of a server. The Socket class provides the methods getInputStream and getOutputStream for
accessing the two streams associated with a socket.
TCP client program:
import java.net.*;
import java.io.*;
public class TCPClient {
    public static void main(String args[]) {
        // args[0] is the message, args[1] is the DNS hostname of the server
        Socket s = null;
        try {
            int serverPort = 7896;
            s = new Socket(args[1], serverPort);
            DataInputStream in = new DataInputStream(s.getInputStream());
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            out.writeUTF(args[0]);
            String data = in.readUTF();
            System.out.println("Received: " + data);
        } catch (UnknownHostException e) { System.out.println("Sock: " + e.getMessage());
        } catch (EOFException e) { System.out.println("EOF: " + e.getMessage());
        } catch (IOException e) { System.out.println("IO: " + e.getMessage());
        } finally { if (s != null) try { s.close(); } catch (IOException e) { /* ignore */ } }
    }
}
In the client program, the arguments of the main method supply a message and the DNS
hostname of the server. The client creates a socket bound to the hostname and server port 7896. It
makes a DataInputStream and DataOutputStream, then writes the message to its output stream and
waits to read a reply from its input stream. UTF is an encoding that represents strings in a
particular byte format.
The server program opens a server socket on its server port (7896) and listens for connect
requests. When one arrives, it makes a new thread in which to communicate with the client.
TCP server Program:
import java.net.*;
import java.io.*;
public class TCPServer {
    public static void main(String args[]) {
        ServerSocket listenSocket = null;
        try {
            int serverPort = 7896;
            listenSocket = new ServerSocket(serverPort);
            while (true) {
                Socket s = listenSocket.accept();
                // for brevity each client is handled in turn here;
                // a full server would create a new thread per connection
                DataInputStream in = new DataInputStream(s.getInputStream());
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                String line = in.readUTF();
                out.writeUTF(line);
                s.close();
            }
        } catch (EOFException e) { System.out.println("EOF: " + e.getMessage());
        } catch (IOException e) { System.out.println("IO: " + e.getMessage());
        } finally { if (listenSocket != null) try { listenSocket.close(); } catch (IOException e) { /* ignore */ } }
    }
}
2.3 EXTERNAL DATA REPRESENTATION AND MARSHALLING
The information stored in running programs is represented as data structures, whereas the
information in messages consists of sequences of bytes. Irrespective of the form of
communication used, the data structures must be flattened (converted to a sequence of bytes)
before transmission and rebuilt on arrival. The representation of data items differs between
architectures. Another issue is the set of codes used to represent characters: for example, UNIX
systems use ASCII character coding, taking one byte per character, whereas the Unicode standard
allows for the representation of text in many languages and takes two bytes per character.
One of the following methods can be used to enable any two computers to exchange data
values.
• The values are converted to an agreed external format before transmission and converted to
the local form on receipt.
• The values are transmitted in the sender's format, together with an indication of the format
used, and the recipient converts the values if necessary.
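The first approach can be sketched by agreeing on big-endian byte order as the external format; the value converted below is arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ExternalFormatDemo {
    // convert a local int to the agreed external (big-endian) format before transmission
    static byte[] toExternal(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
    }
    // convert back to the local form on receipt
    static int fromExternal(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt();
    }
    public static void main(String[] args) {
        int original = 0x12345678;
        System.out.println(fromExternal(toExternal(original)) == original);
    }
}
```

Both computers apply the same two conversions regardless of their native byte order, so the value survives the round trip.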
To support RMI (Remote Method Invocation) or RPC (Remote Procedure Call) any data
type that can be passed as an argument or returned as a result must be able to be flattened and the
individual primitive data values represented in an agreed format. An agreed standard for the
representation of data structures and primitive values is called an external data representation.
Marshalling and Unmarshalling :
Marshalling is the process of taking a collection of data items and assembling them into a
form suitable for transmission in a message. Unmarshalling is the process of disassembling them
on arrival to produce an equivalent collection of data items at the destination. Thus marshalling
consists of the translation of structured data items and primitive values into an external data
representation. Similarly, unmarshalling consists of the generation of primitive values from their
external data representation and the rebuilding of the data structures.
Two alternative approaches to external data representation and marshalling are:
CORBA's common data representation
Java's object serialization
CORBA's Common Data Representation (CDR)
CORBA CDR is the external data representation defined with CORBA 2.0. CDR can
represent all of the data types that can be used as arguments and return values in remote
invocations in CORBA. It consists of 15 primitive types, including short (16-bit), long (32-bit),
unsigned short, unsigned long, float (32-bit), double (64-bit), char, boolean (TRUE or FALSE),
octet (8-bit) and any, together with a range of constructed types, as shown in the following figure.
Type          Representation
sequence      length (unsigned long) followed by elements in order
string        length (unsigned long) followed by characters in order
array         array elements in order
struct        in the order of declaration of the components
enumerated    unsigned long
union         type tag followed by the selected member
index in sequence of bytes    4 bytes
0-3      5         (length of "Smith")
4-7      "Smit"
8-11     "h___"    (padded)
12-15    6         (length of "London")
16-19    "Lond"
20-23    "on__"    (padded)
24-27    1934
Figure: CORBA CDR message
The figure shows a message in CORBA CDR that contains the three fields of a structure whose
respective types are string, string and unsigned long (here the values "Smith", "London" and 1934).
The representation of each string consists of an unsigned long giving its length, followed by the
characters in the string. Variable-length data is padded with zeros. Each unsigned long occupies
four bytes, so the starting index of each value is a multiple of four.
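This layout can be sketched in a few lines; the sketch is simplified (real CDR also deals with a terminating null and full alignment rules), but it reproduces the length-then-padded-characters pattern of the figure.

```java
import java.nio.ByteBuffer;

public class CdrStringDemo {
    static byte[] marshalString(String s) {
        int padded = ((s.length() + 3) / 4) * 4;   // round the character count up to a multiple of four
        ByteBuffer buf = ByteBuffer.allocate(4 + padded);
        buf.putInt(s.length());                    // length as a 4-byte unsigned long
        buf.put(s.getBytes());                     // the characters; unused bytes remain zero (padding)
        return buf.array();
    }
    public static void main(String[] args) {
        // "Smith": 4 bytes of length + 5 characters padded to 8 = 12 bytes, as in the figure
        System.out.println(marshalString("Smith").length);
    }
}
```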
The CORBA interface compiler generates appropriate marshalling and
unmarshalling operations for the arguments and results of remote invocations.
Java object serialization
In Java, the term serialization refers to the activity of flattening an object or a connected set of
objects into a serial form that is suitable for storing on disk or transmitting in a message, for example
as an argument or result of an RMI. Deserialization consists of restoring the state of an object or a set
of objects from their serialized form.
Java objects can contain references to other objects. When an object is serialized, all the
objects that it references are serialized together with it. References are serialized as handles; a
handle is a reference to an object within the serialized form.
To serialize an object, its class information is written out, followed by the types and names
of its instance variables. If the instance variables belong to new classes, then their class
information must also be written out followed by the types and names of their instance variables.
This recursive procedure continues until the class information and instance variables of the
necessary classes have been written out. Each class is given a handle, and no class is written
more than once to the stream of bytes- the handles being written where necessary.
Eg: Person p = new Person("Smith", "London", 1934);
The serialized form of this example is shown in the following figure, in which h0 and h1 are handles:

Person    8-byte version number    h0
3         int year    String name    String place
1934      5 Smith     6 London       h1

Figure: Serialized form of a Person object
• Primitive types are written in a portable format using methods of the ObjectOutputStream class.
• Strings and characters are written by its method writeUTF, which uses the UTF-8
encoding.
• To serialize an object (e.g. a Person),
create an instance of the class ObjectOutputStream and invoke its writeObject method,
passing the Person object as argument.
• To deserialize an object from a stream of data, open an ObjectInputStream on the
stream and use its readObject method to reconstruct the original object.
Serialization and deserialization of the arguments and results of remote invocations are
generally carried out automatically by the middleware, without any participation by the
application programmer.
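The whole round trip can be sketched with an in-memory stream and a minimal Person class like the one above (the class shown here is a hypothetical stand-in, not the book's exact definition):

```java
import java.io.*;

public class SerializeDemo {
    static class Person implements Serializable {   // only Serializable objects can be flattened
        String name, place;
        int year;
        Person(String name, String place, int year) {
            this.name = name; this.place = place; this.year = year;
        }
    }
    static Person roundTrip(Person p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(p);                 // serialize the object (and any objects it references)
        out.flush();
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        return (Person) in.readObject();    // deserialize an equivalent copy
    }
    public static void main(String[] args) throws Exception {
        Person copy = roundTrip(new Person("Smith", "London", 1934));
        System.out.println(copy.name + " " + copy.place + " " + copy.year);
    }
}
```

In RMI the same writeObject/readObject calls happen inside the middleware rather than in application code.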
Remote object References:
A remote object reference is an identifier for a remote object that is valid throughout a
distributed system. A remote object reference is passed in the invocation message to specify
which object is to be invoked. Even after the remote object associated with a given remote object
reference is deleted, it is important that the remote object reference is not reused.
A remote object reference can be constructed by concatenating the Internet address of its computer
and the port number of the process that created it with the time of its creation and a local object number.
The local object number is incremented each time an object is created in that process.
32 bits            32 bits       32 bits    32 bits
Internet address | port number | time | object number | interface of remote object

Figure: Representation of a remote object reference
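The construction rule can be sketched as follows; makeRef and its string format are illustrative, not a real middleware API.

```java
public class RemoteRefDemo {
    static long objectNumber = 0;   // incremented each time an object is created in this process
    static String makeRef(String internetAddress, int port, long creationTime) {
        return internetAddress + ":" + port + ":" + creationTime + ":" + objectNumber++;
    }
    public static void main(String[] args) {
        // two objects created by the same process get distinct references,
        // and references from other processes differ in address, port or time
        System.out.println(makeRef("10.0.0.5", 6789, 1000L));
        System.out.println(makeRef("10.0.0.5", 6789, 1000L));
    }
}
```

Because the creation time and object number are part of the reference, a reference to a deleted object is never reissued for a new one.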
2.4 CLIENT-SERVER COMMUNICATION
This form of communication is designed to support the roles and message exchanges in typical
client-server interactions. In general, request-reply communication is synchronous because the
client process blocks until the reply arrives from the server. It can also be reliable because reply
is effectively an acknowledgement to the client.
The client and server exchange messages in terms of send and receive operations in the Java
API. A protocol built over datagrams avoids unnecessary overheads associated with the TCP
stream protocol.
The Request-Reply Protocol:
This protocol is based on three primitives: doOperation, getRequest and sendReply. It may
be designed to provide certain delivery guarantees. If UDP datagrams are used, the delivery
guarantees must be provided by the request-reply protocol, which may use the server reply
message as an acknowledgement of the client request message.
The doOperation method is used by client to invoke remote operations. Its arguments
specify the remote object and which method to invoke, together with additional information
required by the method. It is assumed that the client calling doOperation marshals the arguments
into an array of bytes and unmarshals the results from the array of bytes that is returned.
Syntax:
public byte[ ] doOperation(RemoteObjectRef o, int methodid, byte[ ] arguments)
The request-reply message structure begins with a messageType field, an int indicating
whether the message is a request (0) or a reply (1).
The doOperation method sends a request message to the server whose Internet address and
port are specified in the remote object reference given as argument. After sending the request
message, doOperation invokes receive to get a reply message, from which it extracts the result
and returns it to the caller.
getRequest is used by a server process to acquire service requests. When the server has
invoked the method in the specified object, it uses sendReply to send the reply message to
the client. When the reply message is received by the client, doOperation is unblocked and
execution of the client program continues.
Syntax: public byte[ ] getRequest( );
public void sendReply(byte[ ] reply, InetAddress clientHost, int clientPort);
Figure: Request-reply message structure

messageType        int (0 = request, 1 = reply)
requestId          int
objectReference    RemoteObjectRef
methodId           int or Method
arguments          array of bytes
Failure model of the request-reply protocol:
If the three primitive operations are implemented over UDP datagrams, they suffer from the
following communication failures:
They suffer from omission failures
Messages are not guaranteed to be delivered in sender order
The protocol can suffer from the failure of processes
To allow for occasions when a server has failed or a request or reply message is dropped,
doOperation uses a timeout when it is waiting to get the server's reply message. The action taken
when a timeout occurs depends upon the delivery guarantees to be offered.
The protocol is designed to recognize successive messages with the same request identifier
and to filter out duplicates. If the server has already sent the reply when it receives a duplicate request
it will need to execute the operation again to obtain the result. Some servers can execute their
operations more than once and obtain the same results each time. An idempotent operation is one that
can be performed repeatedly with the same effect as if it had been performed exactly once.
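The distinction matters when a duplicate request slips through. In this sketch, setBalance is idempotent while deposit is not; the operations and values are illustrative.

```java
public class IdempotencyDemo {
    static int balance = 0;
    static void setBalance(int b) { balance = b; }          // idempotent: repeating it has the same effect
    static void deposit(int amount) { balance += amount; }  // not idempotent: repeating it changes the result
    public static void main(String[] args) {
        setBalance(100);
        setBalance(100);   // duplicate request: balance is still 100
        System.out.println(balance);
        deposit(50);
        deposit(50);       // duplicate request: balance is now 200, not 150
        System.out.println(balance);
    }
}
```

A server whose operations are all idempotent can safely re-execute duplicate requests; otherwise it needs the history mechanism described next.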
For servers that require retransmission of replies without re-execution of operations, a history may
be used. The term 'history' refers to a structure that contains a record of reply messages that
have been transmitted. An entry in a history contains a request identifier, a message and an
identifier of the client to which it was sent. Its purpose is to allow the server to retransmit reply
messages when clients request them. A problem associated with the use of a history is its
memory cost, since the history can grow very large unless entries can be discarded.
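A history can be sketched as a map from (client, requestId) to the saved reply; the key format and the execute operation below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class HistoryDemo {
    static Map<String, String> history = new HashMap<>();
    static int executions = 0;
    static String execute(String args) { executions++; return "reply(" + args + ")"; }
    static String handleRequest(String clientId, int requestId, String args) {
        String key = clientId + ":" + requestId;
        String saved = history.get(key);
        if (saved != null) return saved;   // duplicate: retransmit the saved reply, no re-execution
        String reply = execute(args);      // first arrival: execute the operation
        history.put(key, reply);           // record the reply for possible retransmission
        return reply;
    }
    public static void main(String[] args) {
        handleRequest("client1", 1, "op");
        handleRequest("client1", 1, "op"); // duplicate of request 1
        System.out.println(executions);    // the operation ran only once
    }
}
```

The RRA protocol described below exists precisely so that entries like these can eventually be discarded.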
RPC exchange protocols :
The following three protocols are used for implementing various types of RPC. These three
protocols produce differing behaviours in the presence of communication failures.
The request (R) protocol: It may be used when there is no value to be returned from the
procedure and the client requires no confirmation that the procedure has been executed.
The request-reply (RR) protocol: It is useful for most client-server exchanges. Special
acknowledgement messages are not required, because a server's reply message is regarded
as an acknowledgement.
The request-reply-acknowledgement (RRA) protocol: It is based on the exchange of three
messages: request, reply and acknowledgement. The acknowledgement contains the
requestId, which enables the server to discard entries from its history.
Use of TCP streams to implement the request-reply protocol
The desire to avoid implementing multi-packet protocols is one of the reasons for choosing
TCP streams, which allow arguments and results of any size to be transmitted. If TCP is used, it
ensures that messages are delivered reliably, so the request-reply protocol does not need to deal
with retransmission of messages, filtering of duplicates or histories. The overhead due to
acknowledgement messages is reduced when a reply message follows soon after a request message.
HTTP : an example of a request-reply protocol :
HTTP (Hyper Text Transfer Protocol) is a protocol that specifies the messages involved in a
request-reply exchange, the methods, arguments and results and the rules for representing them in
the messages. It supports a fixed set of methods (GET, PUT, POST, etc) that are applicable to all
of its resources. In addition to invoking methods on web resources, the protocol allows for
content negotiation and password-style authentication.
Content negotiation: Client's requests can include information as to what data
representation they can accept, enabling the server to choose the representation that is
most appropriate for the user.
Authentication: Password-style authentication is provided to prove the identity of the
source.
HTTP is implemented over TCP. Each client server interaction consists of the following
steps:-
1. The client requests and the server accept a connection at the default server port or at a
port specified in the URL.
2. The client sends a request message to the server.
3. The server sends a reply message to the client.
4. The connection is closed.
However, the need to establish and close a connection for every request-reply exchange is
expensive, both in overloading the server and in sending too many messages over the network. In
order to overcome it, a later version of the protocol uses persistent connections - connections that
remain open over a series of request-reply exchanges between client and server.
Requests and replies are marshalled into messages as ASCII text strings, but resources can be
represented as byte sequences and may be compressed. Resources supplied as data are carried in
Multipurpose Internet Mail Extensions (MIME)-like structures in arguments and results. MIME is a
standard for sending multipart data containing text, images and sound in e-mail messages. Data is
prefixed with its MIME type so that the recipient will know how to handle it.
HTTP methods:
GET - requests the resources whose URL is given as argument.
HEAD - identical to GET, but it does not return any data. However it does return all the
information about the data such as the time of last modification, its type & size.
POST - specifies the URL of a resource that can deal with the data supplied with the request. The
processing carried out on the data depends on the function of the program specified in the URL.
PUT - requests that the data supplied in the request is stored with the given URL as its
identifier.
DELETE - The server deletes the resource identified by the given URL. Servers may not
always allow this operation, in which case the reply indicates failure.
OPTIONS - The server supplies the client with a list of methods it allows to be applied to
the given URL.
TRACE - used for diagnostic purposes.
Message Contents :
The request message specifies the name of a method, the URL of a resource, the protocol version,
some headers and an optional message body. The figure below shows the contents of an HTTP request
message whose method is GET.
A reply message specifies the protocol version, a status code and reason, some headers
and an optional message body.
HTTP request message:

method    URL or pathname           HTTP version    headers    message body
GET       http://www.yahoo.com      HTTP/1.1

HTTP reply message:

HTTP version    status code    reason    headers    message body
HTTP/1.1        200            OK                   resource data
Status code and reason provide a report on the success or otherwise in carrying out the
request: status code is a 3 digit number for interpretation by a program and reason is a
textual phrase understood by a person.
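These layouts can be sketched by composing a request line and parsing a reply status line; the URL is taken from the example above and the helper names are illustrative.

```java
public class HttpMessageDemo {
    static String requestLine(String method, String url, String version) {
        return method + " " + url + " " + version;   // headers and message body omitted
    }
    static int statusCode(String replyLine) {
        String[] parts = replyLine.split(" ", 3);    // version, status code, reason
        return Integer.parseInt(parts[1]);           // the 3-digit code, for interpretation by a program
    }
    public static void main(String[] args) {
        System.out.println(requestLine("GET", "http://www.yahoo.com", "HTTP/1.1"));
        System.out.println(statusCode("HTTP/1.1 200 OK"));
    }
}
```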
2.5 GROUP COMMUNICATION
The pairwise exchange of messages is not the best model for communication from one
process to a group of other processes. A multicast operation is more appropriate: an
operation that sends a single message from one process to each of the members of a group of
processes, usually in such a way that the membership of the group is transparent to the
sender.
Multicast messages provide a useful infrastructure for constructing distributed systems
with the following characteristics:
• Fault tolerance based on replicated services
• Finding the discovery servers in spontaneous networking
• Better performance through replicated data
• Propagation of event notifications
IP multicast is built on top of the Internet protocol IP. A multicast group is specified by a
class D Internet address. The membership is dynamic, allowing computers to join or leave at any
time. The Java API provides a datagram interface to IP multicast through the class
MulticastSocket, which is a subclass of DatagramSocket with the additional capability of being
able to join multicast groups. A process can join a multicast group with a given multicast address
by invoking the joinGroup method of its MulticastSocket. A process can leave a specified group
by invoking the leaveGroup method of its MulticastSocket.
Program :
import java.net.*;
import java.io.*;
public class MulticastPeer {
    public static void main(String args[]) {
        // args[0] is a message, args[1] is the multicast group address
        MulticastSocket s = null;
        try {
            InetAddress group = InetAddress.getByName(args[1]);
            s = new MulticastSocket(6789);
            s.joinGroup(group);
            byte[] m = args[0].getBytes();
            DatagramPacket message = new DatagramPacket(m, m.length, group, 6789);
            s.send(message);
            byte[] buffer = new byte[1000];
            int n = 3;   // n = number of members in the group (3 chosen for illustration)
            for (int i = 0; i < n; i++) {
                DatagramPacket messageIn = new DatagramPacket(buffer, buffer.length);
                s.receive(messageIn);
            }
            s.leaveGroup(group);
        } catch (SocketException e) { System.out.println("Socket: " + e.getMessage());
        } catch (IOException e) { System.out.println("IO: " + e.getMessage());
        } finally { if (s != null) s.close(); }
    }
}
In the above program, the arguments to the main method specify a message to be multicast and the
multicast address of a group. After joining that multicast group, the process makes an instance of
DatagramPacket containing the message and sends it through its multicast socket to the multicast
group. After that, it attempts to receive 'n' multicast messages from its peers via its socket, which also
belongs to the group on the same port.
2.6 CASE STUDY: INTERPROCESS COMMUNICATION IN UNIX
The IPC primitives in BSD 4.x versions of UNIX are provided as system calls that are
implemented as a layer over the Internet TCP and UDP protocols. Message destinations are specified
as socket addresses; a socket address consists of an Internet address and a local port number. The
interprocess communication operations are based on the socket abstractions. Messages are queued
at the sending socket until the networking protocol has transmitted them, and until an
acknowledgement arrives, if the protocol requires one. When messages arrive they are queued at the
receiving socket until the receiving process makes an appropriate system call to receive them.
Any process can create a socket to communicate with another process. This is done by invoking
socket system call, whose arguments specify the communication domain, the type and sometimes a
particular protocol. The protocol is normally selected by the system according to whether the
communication is datagram or stream.
Datagram communication
In order to send datagrams, a socket pair is identified each time a communication is made.
This is achieved by sending process using its local socket descriptor and the socket address of the
receiving socket each time it sends a message. Figure 1.16 illustrates the sockets used for
datagrams. ClientAddress and ServerAddress are socket addresses.
Sending process:                        Receiving process:
s = socket(AF_INET, SOCK_DGRAM, 0)      s = socket(AF_INET, SOCK_DGRAM, 0)
bind(s, ClientAddress)                  bind(s, ServerAddress)
sendto(s, "message", ServerAddress)     amount = recvfrom(s, buffer, from)

Figure: Sockets used for datagrams (ClientAddress and ServerAddress are socket addresses)
Both processes use the socket call to create a socket and get a descriptor for it. The first
argument of socket specifies the communication domain as the Internet domain and the second
argument indicates that datagram communication is required. The last argument to the socket
call may be used to specify a particular protocol, but setting it to zero causes the system to
select a suitable protocol-UDP in this case.
Both processes use the bind call to bind their sockets to socket addresses. The sending
process binds its socket to a socket address referring to any available local port number. The
receiving process binds its socket to a socket address that contains its server port and must
be made known to the sender.
The sending process uses sendto call with arguments specifying the socket through which
the message is to be sent, the message itself and the socket address of the destination. The
sendto call hands the message to the underlying UDP and IP protocols and returns the actual
number of bytes sent. As datagram service is requested the message is transmitted to its
destination without an acknowledgement. If the message is too long to be sent, there is an
error return.
The receiving process uses the recvfrom call with arguments specifying the local socket
on which to receive a message and memory locations in which to store the message and
the socket address of the sending socket. The recvfrom call collects the first message in
the queue at the socket, or if the queue is empty it will wait until a message arrives.
Communication occurs only when a sendto in one process addresses its message to the
socket used by a recvfrom in another process. In client-server communication there is no need for
servers to have prior knowledge of clients' socket addresses, because the recvfrom operation
supplies the sender's address with each message it delivers. The properties of datagram
communication in UNIX are the same as those described in Section 1.6.
Stream communication
In order to use the stream protocol, two processes must first establish a connection between
their pair of sockets. The arrangement is asymmetric because one of the sockets will be
listening for a request for a connection and the other will be asking for a connection. Once a pair of
sockets has been connected, they may be used for transmitting data in both or either direction.
That is, they behave like streams in that any available data is read immediately in the same order
as it was written and there is no indication of the boundary of the messages. However there is a
bounded queue at the receiving socket and the receiver blocks if the queue is empty; the sender
blocks if it is full. Figure 1.17 illustrates stream communication, in which the details of the
arguments are simplified; it does not show the server closing the socket on which it listens.
Normally a server would first listen and accept a connection and then fork a new process to
communicate with the client. Meanwhile, it will continue to listen in the original process. The
properties of stream communication in UNIX are the same as those described in Section 1.6.
The server or listening process first uses the socket operation to create a stream socket and
the bind operation to bind its socket to the server's socket address. The second argument
to the socket system call is given as SOCK_STREAM, to indicate that stream
communication is required. If the third argument is left as zero, the TCP/IP protocol will
be selected automatically. It uses the listen operation to listen on its socket for client
requests for connections. The second argument to the listen system call specifies the
maximum number of requests for connections that can be queued at this socket.
The server uses accept system call to accept a connection requested by a client and obtain
a new socket for communication with that client. The original socket may still be used to
accept further connections with other clients.
The client process uses the socket operation to create a stream socket and then uses the
connect system call to request a connection via the socket address of the listening process.
As the connect call automatically binds a socket name to the caller's socket, prior binding is
unnecessary.
After a connection has been established, both processes may then use the write and read
operations on their respective sockets to send and receive sequences of bytes via the
connection. The write operation is similar to the write operation for files. It specifies a message to be
sent to a socket. It hands the message to the underlying TCP/IP protocol and returns the actual
number of characters sent. The read operation receives some characters in its buffer and returns
the number of characters received.
Client:                                 Server (listening process):
s = socket(AF_INET, SOCK_STREAM, 0)     s = socket(AF_INET, SOCK_STREAM, 0)
connect(s, ServerAddress)               bind(s, ServerAddress);
write(s, "message", length)             listen(s, 5);
                                        sNew = accept(s, ClientAddress);
                                        n = read(sNew, buffer, amount)

Figure: Sockets used for streams
DISTRIBUTED TRANSACTION PROCESSING
5.1-Transactions
Transactions protect a shared resource against simultaneous access by several
concurrent processes. In particular, transactions are used to protect shared data. They
allow a process to access and modify multiple data items as a single atomic operation. If
the process backs out halfway during the transaction, everything is restored to the point just
before the transaction started.
The goal of transactions is to ensure that all of the objects managed by a server
remain in a consistent state when they are accessed by multiple transactions and in the
presence of server crashes. A transaction is specified by the client as a set of operations
on objects to be performed as an indivisible unit by the servers managing those objects.
Operations that are free from interference from concurrent operations being performed in
other threads are called atomic operations.
In some situations, clients require a sequence of separate requests to a server to be
atomic in the sense that:
They are free from interference by operations being performed on behalf of other
concurrent clients (Isolation).
Either all of the operations must be completed successfully or they must have no
effect at all in the presence of server crashes.
This all-or-nothing effect has two further aspects of its own:
Failure atomicity: The effects are atomic even when the server crashes.
Durability: after a transaction has completed successfully all its effects are saved in
permanent storage. Data saved in a file will survive if the server process crashes.
To support the requirement for failure atomicity and durability, the objects must be
recoverable; when a server process crashes unexpectedly due to a hardware fault or a
software error, the changes due to all completed transactions must be available in
permanent storage so that when the server is replaced by a new process, it can recover
the objects to reflect the all-or-nothing effect. Each transaction is created and managed
by a coordinator, which implements the Coordinator interface shown in Figure 4.1.
The coordinator gives each transaction an identifier, or TID.
openTransaction ( ) → trans;
Starts a new transaction and delivers a unique TID trans. These identifiers will
be used in the other operations in the transaction.
closeTransaction (trans) → (commit, abort);
ends a transaction: a commit return value indicates that the transaction has
committed; an abort return value indicates that it has aborted.
abortTransaction(trans);
Aborts the transaction.
Figure: Operations in Coordinator interface
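The Coordinator interface above can be sketched as a minimal in-memory Python class. The TID scheme (a simple counter) and the state names are illustrative assumptions; a real coordinator would generate globally unique TIDs and persist transaction state.

```python
import itertools

class Coordinator:
    """Minimal sketch of the Coordinator interface from the figure."""
    def __init__(self):
        self._next_tid = itertools.count(1)
        self._status = {}               # TID -> 'active' | 'committed' | 'aborted'

    def open_transaction(self):
        """openTransaction() -> trans: start a new transaction, return its TID."""
        tid = next(self._next_tid)
        self._status[tid] = "active"
        return tid

    def close_transaction(self, tid):
        """closeTransaction(trans) -> 'commit' or 'abort'."""
        if self._status.get(tid) == "active":
            self._status[tid] = "committed"
            return "commit"
        return "abort"

    def abort_transaction(self, tid):
        """abortTransaction(trans): abort the transaction unconditionally."""
        self._status[tid] = "aborted"

coord = Coordinator()
t1 = coord.open_transaction()
t2 = coord.open_transaction()
coord.abort_transaction(t2)
result1 = coord.close_transaction(t1)   # transaction ran normally -> commits
result2 = coord.close_transaction(t2)   # was aborted -> cannot commit
```

The client would pass the TID returned by open_transaction with each invocation on recoverable objects, as described in the text.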
Transactions are:
1. Atomic: To the outside world, the transaction happens indivisibly.
2. Consistent: The transaction does not violate system invariants.
3. Isolated: Concurrent transactions do not interfere with each other.
4. Durable: Once a transaction commits, the changes are permanent.
These properties are often referred to by their initial letters: ACID. A transaction is
achieved by cooperation between a client program, some recoverable objects and a
coordinator. The client specifies the sequence of invocations on recoverable objects that
are to comprise a transaction. To achieve this, the client sends with each invocation
the transaction identifier returned by openTransaction. One way to make this possible is
to include an extra argument in each operation of a recoverable object to carry the TID.
Normally, a transaction completes when the client makes a closeTransaction request. If
the transaction has progressed normally, the reply states that the transaction is committed;
this constitutes an undertaking to the client that all of the changes requested in the
transaction are permanently recorded and that any future transactions that access the
same data will see the results of all of the changes made during the transaction.
Alternatively, when a transaction is aborted, the parties involved (the recoverable objects
and the coordinator) must ensure that none of its effects is visible to future transactions,
either in the objects or in their copies in permanent storage.
Concurrency control
In this section we illustrate two well-known problems of concurrent transactions in
the context of the banking example: the 'lost update' problem and the 'inconsistent
retrievals' problem. Then we discuss how both of these problems can be avoided by
using serially equivalent executions of transactions.
The lost update problem: The lost update problem is illustrated by the following
pair of transactions on bank accounts A, B and C, whose balances are $100, $200 and
$300, respectively. Transaction T transfers an amount from account A to account B.
Transaction U transfers an amount from account C to account B. In both cases, the
amount transferred is calculated to increase the balance of B by 10%. The net effects on
account B of executing the transactions T and U should be to increase the balance of
account B by 10% twice, so its final value is $242.
Let us consider the effects of allowing the transactions T and U to run concurrently, as
in Figure 4.2. Both transactions get the balance of B as $200 and then deposit $20. The result
is incorrect, increasing the balance of account B by $20 instead of $42. This is an illustration
of the 'lost update' problem. U's update is lost because T overwrites it without seeing it.
Both transactions have read the old value before either writes the new value.
Transaction T:
    balance = b.getBalance( );
    b.setBalance(balance*1.1);
    a.withdraw(balance/10)

Transaction U:
    balance = b.getBalance( );
    b.setBalance(balance*1.1);
    c.withdraw(balance/10)

Interleaving:
    T: balance = b.getBalance( )        $200
    U: balance = b.getBalance( )        $200
    U: b.setBalance(balance*1.1)        $220
    U: c.withdraw(balance/10)           $280
    T: b.setBalance(balance*1.1)        $220
    T: a.withdraw(balance/10)           $80

Figure: The lost update problem
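The interleaving in the figure can be reproduced sequentially in a few lines; the Account class below is an illustrative stand-in for the bank's recoverable objects, not part of the original example's code.

```python
class Account:
    """Toy account object for demonstrating the lost update problem."""
    def __init__(self, balance):
        self._balance = balance
    def get_balance(self):
        return self._balance
    def set_balance(self, balance):
        self._balance = balance
    def withdraw(self, amount):
        self._balance -= amount

a, b, c = Account(100), Account(200), Account(300)

# Both transactions read B's old value before either writes the new one:
balance_t = b.get_balance()          # T reads $200
balance_u = b.get_balance()          # U also reads $200
b.set_balance(balance_u * 1.1)       # U writes $220
c.withdraw(balance_u / 10)           # U: C becomes $280
b.set_balance(balance_t * 1.1)       # T overwrites with $220 -- U's update is lost
a.withdraw(balance_t / 10)           # T: A becomes $80

final_b = b.get_balance()            # $220, not the correct $242
```

Had T read B only after U committed, it would have seen $220 and written $242, which is exactly what serial equivalence enforces.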
Inconsistent retrievals: The figure below shows another example related to bank accounts, in
which transaction V transfers a sum from account A to B and transaction W invokes the
branchTotal method to obtain the sum of the balances of all the accounts in the bank. The balances
of the two bank accounts, A and B, are both initially $200. The result of branchTotal includes the
sum of A and B as $300, which is wrong. This is an illustration of the 'inconsistent retrievals'
problem. W's retrievals are inconsistent because V has performed only the withdrawal
part of a transfer at the time the sum is calculated.
Transaction V:
    a.withdraw(100)
    b.deposit(100)

Transaction W:
    aBranch.branchTotal( )

Interleaving:
    V: a.withdraw(100)                  $100
    W: total = a.getBalance( )          $100
    W: total = total + b.getBalance( )  $300
    W: total = total + c.getBalance( )
    V: b.deposit(100)                   $300

Figure: The inconsistent retrievals problem
Serial equivalence:
An interleaving of the operations of transactions in which the combined effect is the
same as if the transactions had been performed one at a time in some order is a serially
equivalent interleaving. The use of serial equivalence as a criterion for correct concurrent
execution prevents the occurrence of lost updates and inconsistent retrievals.
The lost update problem occurs when two transactions read the old value of a
variable and then use it to calculate the new value. This cannot happen if one transaction
is performed before the other, because the later transaction will read the value written by
the earlier one. As a serially equivalent interleaving of two transactions produces the
same effect as a serial one, we can solve the lost update problem by means of serial
equivalence. Figure 4.4 shows one such interleaving, in which the operations that affect
the shared account, B, are actually serial, for transaction T does all its operations on B
before transaction U does. Another interleaving of T and U that has this property is one
in which transaction U completes its operations on account B before transaction T starts.
The inconsistent retrievals problem can occur when a retrieval transaction runs
concurrently with an update transaction. It cannot occur if the retrieval transaction is
performed before or after the update transaction. A serially equivalent interleaving of a
retrieval transaction and an update transaction, for example as in Figure 4.5, will prevent
inconsistent retrievals occurring.
Transaction T:
    balance = b.getBalance( )
    b.setBalance(balance*1.1)
    a.withdraw(balance/10)

Transaction U:
    balance = b.getBalance( )
    b.setBalance(balance*1.1)
    c.withdraw(balance/10)

Interleaving:
    T: balance = b.getBalance( )        $200
    T: b.setBalance(balance*1.1)        $220
    U: balance = b.getBalance( )        $220
    U: b.setBalance(balance*1.1)        $242
    T: a.withdraw(balance/10)           $80
    U: c.withdraw(balance/10)           $278

Figure: A serially equivalent interleaving of T and U
Transaction V:
    a.withdraw(100)
    b.deposit(100)

Transaction W:
    aBranch.branchTotal( )

Interleaving:
    V: a.withdraw(100)                  $100
    V: b.deposit(100)                   $300
    W: total = a.getBalance( )          $100
    W: total = total + b.getBalance( )  $400
    W: total = total + c.getBalance( )

Figure: A serially equivalent interleaving of V and W
Conflicting operations: When we say that a pair of operations conflicts we mean that
their combined effect depends on the order in which they are executed. Consider a pair
of operations read and write. Read accesses the value of an object and write changes its
value. The effect of an operation refers to the value of an object set by a write operation
and the result returned by a read operation. The conflict rules for read and write
operations are given in the figure below. Serial equivalence can be defined in terms of operation
conflicts as follows:
For two transactions to be serially equivalent it is necessary and sufficient that all
pairs of conflicting operations of the two transactions be executed in the same order at all
of the objects they both access.
Operations of different    Conflict   Reason
transactions
read    read               No         The effect of a pair of read operations does not
                                      depend on the order in which they are executed
read    write              Yes        The effect of a read and a write operation
                                      depends on the order of their execution
write   write              Yes        The effect of a pair of write operations
                                      depends on the order of their execution

Figure: Read and write operation conflict rules
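The conflict rules in the table reduce to a one-line predicate: two operations on the same object conflict exactly when at least one of them is a write. A minimal sketch, with operation names assumed to be the strings "read" and "write":

```python
def conflicts(op1, op2):
    """Two operations on the same object conflict when their combined
    effect depends on execution order: at least one must be a write."""
    assert op1 in ("read", "write") and op2 in ("read", "write")
    return op1 == "write" or op2 == "write"
```

Serial equivalence then requires that every conflicting pair of operations from two transactions be executed in the same order at every object both transactions access.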
Recoverability from aborts:
Servers must record the effects of all committed transactions and none of the effects
of aborted transactions. They must therefore allow for the fact that a transaction may
abort, by preventing it from affecting other concurrent transactions if it does so. In this section
we illustrate two problems that are associated with aborting transactions in the context of
the banking example. These problems are called 'dirty reads' and 'premature writes', and
both of them can occur in the presence of serially equivalent executions of transactions.
Dirty reads: The 'dirty read' problem is caused by the interaction between a read
operation in one transaction and an earlier write operation in another transaction on the
same object. Consider the executions illustrated in Figure 4.7 in which T gets the balance
of account A and sets it to $10 more, then U gets the balance of account A and sets it to
$20 more, and the two executions are serially equivalent. Now suppose that the
transaction T aborts after U has committed. Then the transaction U will have seen a value
that never existed, since A will be restored to its original value. We say that the
transaction U has performed a dirty read. As it has committed, it cannot be undone.
Recoverability of transaction: The strategy for recoverability is to delay commits until
after the commitment of any other transaction whose uncommitted state has been observed.
In our example, U delays its commit until after T commits. In the case that T aborts, then
U must abort as well.
Cascading aborts: The aborting of one transaction may cause still further transactions to be
aborted. Such situations are called cascading aborts
Transaction T:
    a.getBalance( )
    a.setBalance(balance + 10)

Transaction U:
    a.getBalance( )
    a.setBalance(balance + 20)

Interleaving:
    T: balance = a.getBalance( )        $100
    T: a.setBalance(balance + 10)       $110
    U: balance = a.getBalance( )        $110
    U: a.setBalance(balance + 20)       $130
    U: commit transaction
    T: abort transaction

Figure: A dirty read when transaction T aborts
Premature writes: This one is related to the interaction between write operations on the
same object belonging to different transactions. For an illustration, we consider two
setBalance transactions T and U on account A, as shown in Figure 4.8. Before the
transactions, the balance of account A was $100. The two executions are serially
equivalent, with T setting the balance to $105 and U setting it to $110. If the transaction U
aborts and T commits, the balance should be $105. Some database systems implement the
action of abort by restoring 'before images' of all the writes of a transaction. In our
example, A is $100 initially, which is the 'before image' of T's write; similarly $105 is the
'before image' of U's write. Thus if U aborts, we get the correct balance of $105.
Now consider the case when U commits and then T aborts. The balance should be
$110, but as the 'before image' of T's write is $100, we get the wrong balance of $100.
Similarly, if T aborts and then U aborts, the 'before image' of U's write is $105 and we get the
wrong balance of $105; the balance should revert to $100. To ensure correct results in a
recovery scheme that uses before images, write operations must be delayed until earlier
transactions that updated the same objects have either committed or aborted.
Transaction T:
    a.setBalance(105)

Transaction U:
    a.setBalance(110)

Interleaving (balance of A initially $100):
    T: a.setBalance(105)                $105
    U: a.setBalance(110)                $110

Figure: Overwriting uncommitted values
Strict executions of transactions: Generally it is required that transactions delay both their
read and write operations so as to avoid 'dirty reads' and 'premature writes'. The executions
of transactions are called strict if the service delays both read and write operations on an
object until all transactions that previously wrote that object have either committed or
aborted. The strict execution of transactions enforces the desired property of isolation.
5.2 NESTED TRANSACTIONS
A nested transaction is constructed from a number of sub transactions. Transactions
other than the top level transaction are called sub transactions. Since transactions can
be nested arbitrarily deeply, considerable administration is needed to get everything right.
Several transactions may be started from within a transaction, allowing transactions
to be regarded as modules that can be composed as required. The outermost transaction
in a set of nested transactions is called the top-level transaction. For example in Figure
4.9 T is a top level transaction, which starts a pair of sub transactions T, and T2. The sub
transaction Tl starts its own pair of sub transactions TM and T12. Also, subtransaction T2
starts its own subtransaction T21 which starts another subtransaction T2U. A
subtransaction appears atomic to its parent with respect to transaction failures and to
concurrent access.
Nested transactions have the following main advantages:
Sub transactions at one level may run concurrently with other sub
transactions at the same level in the hierarchy. This can allow additional
concurrency in a transaction. When sub transactions run in different
servers, they can work in parallel.
Sub transactions can commit or abort independently.
The rules for commitment of nested transactions are rather subtle:
A transaction may commit or abort only after its child transactions
have completed.
When a subtransaction completes it makes an independent decision either to
commit provisionally or to abort. Its decision to abort is final.
When a parent aborts, all of its sub transactions are aborted. For example, if T2
aborts then T21 and T211 must also abort, even though they may have provisionally
committed.
When a subtransaction aborts the parent can decide whether to abort or not. In our
example, T decides to commit although T2 has aborted.
If the top level transaction commits, then all of the sub transactions that have
provisionally committed can commit too, provided that none of their ancestors
has aborted. In our example, T's commitment allows T1, T11 and T12 to commit, but
not T21 and T211, since their parent T2 aborted. Note that the effects of a
subtransaction are not permanent until the top-level transaction commits.
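The commitment rules above can be sketched as a small transaction tree. The class below is an illustrative model only (the tree structure and state names are assumptions); it reproduces the example in the text, where T2 aborts and T then commits.

```python
class Transaction:
    """Sketch of nested-transaction commitment rules."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.state = "active"    # active | provisional | aborted | committed
        if parent:
            parent.children.append(self)

    def provisional_commit(self):
        """A completed subtransaction decides to commit provisionally."""
        self.state = "provisional"

    def abort(self):
        """When a parent aborts, all of its subtransactions abort,
        even those that have provisionally committed."""
        self.state = "aborted"
        for child in self.children:
            child.abort()

    def commit(self):
        """Top-level commit: provisionally committed descendants commit,
        provided none of their ancestors has aborted (aborted subtrees
        are simply skipped)."""
        self.state = "committed"
        for child in self.children:
            if child.state == "provisional":
                child.commit()

# The example from the text: T starts T1 and T2; T1 starts T11, T12;
# T2 starts T21, which starts T211.
T = Transaction()
T1, T2 = Transaction(T), Transaction(T)
T11, T12 = Transaction(T1), Transaction(T1)
T21 = Transaction(T2)
T211 = Transaction(T21)

for t in (T11, T12, T21, T211, T1):
    t.provisional_commit()
T2.abort()      # aborts T21 and T211 as well
T.commit()      # commits T1, T11 and T12, but not T21 or T211
```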
5.3 LOCKS
When a process needs to read or write a data item as a part of a transaction, it
requests the scheduler to grant it a lock for that data item. Likewise, when a data item is no
longer needed, the scheduler is requested to release the lock. The task of the scheduler is to
grant and release locks in such a way that only valid schedules result. In other words, it needs
to apply an algorithm that provides only serializable schedules. One such algorithm is two
phase locking. Two phase locking
In two phase locking, the scheduler first acquires all the locles it needs during the
growing phase, and then releases them during the shrinking phase.
Transactions must be scheduled so that their effect on shared data is serially
equivalent. A server can achieve serial equivalence of transactions by serializing access to
the objects. A simple example of a serializing mechanism is the use of exclusive locks. In
this locking scheme the server attempts to lock any object that is about to be used by any
operation of a client's transaction. If a client requests access to an object that is already
locked due to another client's transaction, the request is suspended and the client must wait
until the object is unlocked.
In the example given in Figure 4.10, it is assumed that when transactions T and U
start the balances of the accounts A,B and C are not yet locked. When transaction T is about
to use account B, it is locked for T. Subsequently, when transaction U is about to use B, it
is still locked for T and transaction U waits. When transaction T is committed, B is
unlocked, whereupon transaction U is resumed. The use of the lock on B effectively
serializes the access to B. Note that if, for example, T had released the lock on B
between its getBalance and setBalance operations, transaction U's getBalance operation
on B could be interleaved between them.
Transaction T:
    balance = b.getBalance( )
    b.setBalance(bal*1.1)
    a.withdraw(bal/10)

Transaction U:
    balance = b.getBalance( )
    b.setBalance(bal*1.1)
    c.withdraw(bal/10)

Operations and locks:
    T: openTransaction
    T: bal = b.getBalance( )            lock B
    T: b.setBalance(bal*1.1)
    U: openTransaction
    T: a.withdraw(bal/10)               lock A
    U: bal = b.getBalance( )            waits for T's lock on B
    T: closeTransaction                 unlock A, B
    U: ...                              lock B
    U: b.setBalance(bal*1.1)
    U: c.withdraw(bal/10)               lock C
    U: closeTransaction                 unlock B, C

Figure: Transactions T and U with exclusive locks
A transaction is not allowed any new locks after it has released a lock. The first phase of
each transaction is a 'growing phase', during which new locks are acquired. In the
second phase, the locks are released (a 'shrinking phase'). This is called two-phase
locking. Under a strict execution regime, a transaction that needs to read or write an
object must be delayed until other transaction that wrote the same object have committed
or aborted. To enforce this rule, any locks applied during the progress of a transaction
are held until the transaction commits
or aborts. This is called strict two-phase locking. The presence of the locks prevents
other transactions reading or writing the objects.
It is preferable to adopt a locking scheme that controls the access to each object
so that there can be several concurrent transactions reading an object, or a single
transaction writing an object, but not both. This is commonly referred to as a 'many
readers/single writers' scheme. Two types of locks are used: read locks and write locks.
Before a transaction's read operation is performed, a read lock should be set on the
object. Before a transaction's write operation is performed, a write lock should be set on
the object. Whenever it is impossible to set a lock immediately, the transaction (and the
client) must wait until it is possible to do so - a client's request is never rejected. As
pairs of read operations from different transactions do not conflict, an attempt to set a
read lock on an object with a read lock is always successful. All the transactions
reading the same object share its read lock - for this reason, read locks are sometimes
called shared locks.
The operation conflict rules tell us that:
If a transaction T has already performed a read operation on a particular object,
then a concurrent transaction U must not write that object until T commits or
aborts
If a transaction T has already performed a write operation on a particular object,
then a concurrent transaction U must not read or write that object until T
commits or aborts.
Figures show the compatibility of read locks and write locks on any particular
object. The entries in the first column in the table show the type of lock already set,
if any. The entries in the first row show the type of lock requested. The entry in each
cell shows the effect on a transaction that requests the type of lock given above
when the object has been locked in another transaction with the type of lock on the
left.
For one object                Lock requested
Lock already set              read        write
none                          OK          OK
read                          OK          wait
write                         wait        wait

Figure: Lock compatibility
The rules for the use of locks in a strict two-phase locking implementation are
summarized below:
1. When an operation accesses an object within a transaction:
(a) If the object is not already locked, it is locked and the operation proceeds.
(b) If the object has a conflicting lock set by another transaction, the transaction
must wait until it is unlocked.
(c) If the object has a non-conflicting lock set by another transaction, the lock is
shared and the operation proceeds.
(d) If the object has already been locked in the same transaction, the lock will be
promoted if necessary and the operation proceeds. (Where promotion is prevented by
a conflicting lock, rule (b) is used).
2. When a transaction is committed or aborted, the server unlocks all objects it locked
for the transaction.
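Rules 1(a)-(d) and 2 can be sketched as a small lock manager. This is an illustrative model, not a real scheduler: acquire() returns False where a real implementation would block the requesting transaction until the lock becomes available.

```python
class LockManager:
    """Many readers/single writer locks under strict two-phase locking."""
    def __init__(self):
        self._locks = {}     # object id -> (mode, set of holder TIDs)

    def acquire(self, tid, obj, mode):
        assert mode in ("read", "write")
        if obj not in self._locks:                     # rule 1(a): not locked
            self._locks[obj] = (mode, {tid})
            return True
        held_mode, holders = self._locks[obj]
        if holders == {tid}:                           # rule 1(d): sole holder,
            if mode == "write" and held_mode == "read":  # promote if necessary
                self._locks[obj] = ("write", holders)
            return True
        if mode == "read" and held_mode == "read":     # rule 1(c): share read lock
            holders.add(tid)
            return True
        return False                                   # rule 1(b): must wait

    def release_all(self, tid):
        """Rule 2: on commit or abort, release all of the transaction's locks."""
        for obj in list(self._locks):
            mode, holders = self._locks[obj]
            holders.discard(tid)
            if not holders:
                del self._locks[obj]

lm = LockManager()
granted = lm.acquire("T", "B", "read")    # T read-locks B
shared = lm.acquire("U", "B", "read")     # read locks are shared
blocked = lm.acquire("U", "B", "write")   # promotion prevented: T also holds B
lm.release_all("T")                       # T commits, releasing its locks
promoted = lm.acquire("U", "B", "write")  # now U alone holds B and can promote
```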
Locking rules for nested transactions: The aim of a locking scheme for nested
transactions is to serialize access to objects so that:
1. Each set of nested transactions is a single entity that must be prevented
from observing the partial effects of any other set of nested transactions.
2. Each transaction within a set of nested transactions must be prevented
from observing the partial effects of the other transactions in the set.
The first rule is enforced by arranging that every lock that is acquired by a successful
subtransaction is inherited by its parent when it completes. Inherited locks are also
inherited by ancestors. Note that this form of inheritance passes from child to parent.
The second rule is enforced as follows:
Parent transactions are not allowed to run concurrently with their child
transactions. If a parent transaction has a lock on an object, it retains the lock
during the time that its child transaction is executing. This means that a child
transaction temporarily acquires the lock from its parent for its duration.
Sub transactions at the same level are allowed to run concurrently, so when they
access the same objects, the locking scheme must serialize their access.
The following rules describe lock acquisition and release:
For a sub transaction to acquire a read lock on an object, no other active
transaction can have a write lock on that object, and the only retainers of a write
lock are its ancestors.
For a subtransaction to acquire a write lock on an object, no other active
transaction can have a read or write lock on that object, and the only retainers of
read and write locks on that object are its ancestors.
When a subtransaction commits, its locks are inherited by its parent, allowing the
parent to retain the locks in the same mode as the child.
When a subtransaction aborts, its locks are discarded. If the parent already retains
the locks it can continue to do so.
Deadlocks
The use of locks can lead to deadlock. Consider the use of locks shown in the figure
below. Each transaction acquires a lock on one account and then gets blocked when it tries
to access the account that the other one has locked. This is a deadlock situation: two
transactions are waiting, and each is dependent on the other to release a lock so it can
resume.
Transaction T                            Transaction U
Operations            Locks              Operations            Locks
a.deposit(100)        write lock A
                                         b.deposit(200)        write lock B
b.withdraw(100)       waits for U's
                      lock on B
                                         a.withdraw(200)       waits for T's
                                                               lock on A

Figure: Deadlock with write locks
Definition of deadlock: Deadlock is a state in which each member of a group of
transactions is waiting for some other member to release a lock. A wait-for-graph can be
used to represent the waiting relationship between current transactions. In a wait-for
graph the nodes represent transactions and the edges represent wait-for relationships
between transactions.
Deadlock prevention: An apparently simple but not very good way to overcome deadlock
is to lock all of the objects used by a transaction when it starts. This would need to be
done as a single atomic step so as to avoid deadlock at this stage. Such a transaction
cannot run into deadlock with other transactions, but it unnecessarily restricts access to
shared resources. In addition, it is sometimes impossible to predict at the start of a
transaction which objects will be used.
Deadlock detection: Deadlocks may be detected by finding cycles in the wait-for graph.
Having detected a deadlock, a transaction must be selected for abortion to break the
cycle. The choice of the transaction to abort is not simple. Some factors that may be taken
into account for aborting a transaction are the age of the transaction and the number of
cycles it is involved in.
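Cycle detection in a wait-for graph is a depth-first search that reports when it revisits a transaction already on the current path. A minimal sketch, with the graph represented as a dict mapping each transaction to the set of transactions it waits for (an assumed representation):

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle in the wait-for
    graph, or None if there is no deadlock."""
    def visit(node, path, seen):
        if node in path:
            return path[path.index(node):]      # cycle: everything from node on
        if node in seen:
            return None                         # fully explored, no cycle here
        seen.add(node)
        for nxt in wait_for.get(node, ()):
            cycle = visit(nxt, path + [node], seen)
            if cycle:
                return cycle
        return None

    seen = set()
    for node in wait_for:
        cycle = visit(node, [], seen)
        if cycle:
            return cycle
    return None

# The figure's situation: T waits for U's lock on B, U waits for T's lock on A.
cycle = find_deadlock({"T": {"U"}, "U": {"T"}})
```

Having found a cycle, the server would pick one of its members to abort, using criteria such as the transaction's age or the number of cycles it appears in.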
Increasing concurrency in locking schemes
Even when locking rules are based on the conflicts between read and write
operations and the granularity at which they are applied is as small as possible, there is
scope for increasing concurrency. In the first approach (two-version locking), the setting
of exclusive locks is delayed until a transaction commits.
Two-version locking: This is an optimistic scheme that allows one transaction to
write tentative versions of objects while other transactions read from the committed
version of the same objects. Read operations only wait if another transaction is currently
committing the same object. Transactions cannot commit their write operations
immediately if other uncompleted transactions have read the same objects. Therefore,
transactions that request to commit in such a situation are made to wait until the reading
transactions have completed. Deadlock may occur when transactions are waiting to
commit. Therefore transactions may need to be aborted when they are waiting to
commit, to resolve deadlocks.
This variation on strict two-phase locking uses three types of lock: a read lock, a
write lock and a commit lock. Before a transaction's read operation is performed, a read
lock must be set on the object; the attempt to set a read lock is successful unless the
object has a commit lock, in which case the transaction waits. Before a transaction's
write operation is performed, a write lock must be set on the object; the attempt to set a
write lock is successful unless the object has a write lock or a commit lock, in which case
the transaction waits.
When the transaction coordinator receives a request to commit a transaction, it
attempts to convert all that transaction's write locks to commit locks. If any of the objects
have outstanding read locks, the transaction must wait until the transactions that set these
locks have completed and the locks are released. The compatibility of read, write and
commit locks is shown in the figure below.
For one object                Lock to be set
Lock already set              read        write       commit
none                          OK          OK          OK
read                          OK          OK          wait
write                         OK          wait
commit                        wait        wait

Figure: Lock compatibility (read, write and commit locks)
Hierarchic locks: At each level, the setting of a parent lock has the same effect as
setting all the equivalent child locks. This economizes on the number of locks to be set.
In the banking example, the branch is the parent and the accounts are the children. In a
diary structured as a hierarchy of weeks, days and time slots, the operation to view a
week would cause a read lock to be set at the top of this hierarchy, whereas the operation
to enter an appointment would cause a write lock to be set on a time slot. The effect of a
read lock on a week would be to prevent write operations on any of the substructures, for
example the time slots for each day in that week.
Each node in the hierarchy can be locked - giving the owner of the lock explicit
access to the node and implicit access to its children. Before a child node is granted a
read/write lock, an intention to read/write lock is set on the parent node and its
ancestors, if any. Intention locks are compatible with other intention locks but
conflict with read and write locks according to the usual rules. Figure 4.17 gives the
compatibility table for hierarchic locks. Hierarchic locks have the advantage of
reducing the number of locks when mixed-granularity locking is required. The
compatibility tables and the rules for promoting locks are more complex.
For one object                Lock to be set
Lock already set              read        write       I-read      I-write
none                          OK          OK          OK          OK
read                          OK          wait        OK          wait
write                         wait        wait        wait        wait
I-read                        OK          wait        OK          OK
I-write                       wait        wait        OK          OK

Lock compatibility table for hierarchic locks
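The hierarchic compatibility table can be encoded directly as a lookup, which is how a lock manager would consult it. A minimal sketch; the string lock names are an assumed representation:

```python
# HIERARCHIC[already_set][requested] is True when the requested lock
# can be granted immediately (OK in the table), False when the
# requester must wait.
HIERARCHIC = {
    "none":    {"read": True,  "write": True,  "I-read": True,  "I-write": True},
    "read":    {"read": True,  "write": False, "I-read": True,  "I-write": False},
    "write":   {"read": False, "write": False, "I-read": False, "I-write": False},
    "I-read":  {"read": True,  "write": False, "I-read": True,  "I-write": True},
    "I-write": {"read": False, "write": False, "I-read": True,  "I-write": True},
}

def compatible(already_set, requested):
    """Look up whether `requested` can be granted given `already_set`."""
    return HIERARCHIC[already_set][requested]
```

For example, before write-locking a time slot, a transaction would set I-write on the week and day nodes; another transaction's I-read on the week is compatible with that, but a read lock on the week is not.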
5.4 OPTIMISTIC CONCURRENCY CONTROL
The drawbacks of locking are:
* Lock maintenance represents an overhead
* The use of locks can result in deadlock
* To avoid cascading aborts, locks cannot be released until the end of the
transaction. This may reduce significantly the potential for concurrency.
The alternative approach proposed by Kung and Robinson is 'optimistic' because it is
based on the observation that, in most applications, the likelihood of two clients'
transactions accessing the same object is low. Transactions are allowed to proceed as
though there were no possibility of conflict with other transactions until the client
completes its task and issues a closeTransaction request. When a conflict arises, some
transaction is generally aborted and will need to be restarted by the client. Each
transaction has the following phases:
Working phase: During this phase, each transaction has a tentative version of
each of the objects that it updates. This is a copy of the most recently committed
version of the object. Read operations are performed immediately: if a tentative
version for that transaction already exists, a read operation accesses it; otherwise
it accesses the most recently committed value of the object. Write operations record
the new values of the objects as tentative values (which are invisible to other
transactions). When there are several concurrent transactions, several different tentative
values of the same object may coexist. In addition, two records are kept of the objects
accessed within a transaction: a read set containing the objects read by the transaction, and
a write set containing the objects written by the transaction.
Validation phase: When the closeTransaction request is received, the transaction is
validated to establish whether or not its operations on objects conflict with operations of
other transactions on the same objects. If the validation is successful then the transaction
can commit. If the validation fails, then some form of conflict resolution must be used
and either the current transaction, or in some cases those with which it conflicts, will
need to be aborted.
Update phase: If a transaction is validated, all of the changes recorded in its tentative
versions are made permanent.
Validation of transactions: To assist in performing validation, each transaction is
assigned a transaction number when it enters the validation phase. Transaction numbers
are integers assigned in ascending sequence; the number of a transaction therefore
defines its position in time - a transaction always finishes its working phase after all
transactions with lower numbers. That is, a transaction with the number Ti always
precedes a transaction with the number Tj if i < j. The validation test on transaction Tv is
based on conflicts between operations in pairs of transactions Ti and Tv. For a transaction Tv
to be serializable with respect to an overlapping transaction Ti, their operations must
conform to the following rules:
1. Ti must not read objects written by Tv.
2. Tv must not read objects written by Ti.
3. Ti must not write objects written by Tv, and Tv must not write objects written by Ti.
To prevent overlapping, the entire validation and update phases can be implemented
as a critical section so that only one client at a time can execute them.
Backward validation: As all the read operations of earlier overlapping transactions were
performed before the validation of Tv started, they cannot be affected by the writes of the
current transaction. The validation of transaction Tv checks whether its read set overlaps with any
of the write sets of earlier overlapping transactions Ti. If there is any overlap, the
validation fails.
Let startTn be the biggest transaction number assigned at the time when transaction
Tv started its working phase, and finishTn be the biggest transaction number assigned at the
time when Tv entered the validation phase. The following program describes the algorithm for
backward validation of Tv:
boolean valid = true;
for (int Ti = startTn + 1; Ti <= finishTn; Ti++) {
    if (read set of Tv intersects write set of Ti) valid = false;
}
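The loop above can be made concrete with Python sets; a sketch, with illustrative read and write sets, not data from the notes:

```python
def backward_valid(read_set_tv, earlier_write_sets):
    """Backward validation: Tv fails if its read set overlaps the write set
    of any earlier overlapping (already committed) transaction Ti."""
    return all(not (read_set_tv & ws) for ws in earlier_write_sets)

# Tv read A and B; earlier committed transactions wrote C and D: no overlap.
assert backward_valid({"A", "B"}, [{"C"}, {"D"}]) == True
# An earlier transaction wrote B, which Tv read: validation fails.
assert backward_valid({"A", "B"}, [{"B", "C"}]) == False
```

The `&` operator is Python's set intersection, standing in for "read set of Tv intersects write set of Ti".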
Figure 4.18 shows overlapping transactions that might be considered in the
validation of a transaction Tv. Time increases from left to right. The earlier committed
transactions are T1, T2 and T3. T1 committed before Tv started; T2 and T3 committed
before Tv finished its working phase. Thus startTn + 1 = T2 and finishTn = T3. In
backward validation, the read set of the transaction being validated is compared with the
write sets of other transactions that have already committed - here, the read set of Tv
must be compared with the write sets of T2 and T3. Therefore, the only way to
resolve any conflicts is to abort the transaction that is undergoing validation.
Forward validation: In forward validation of the transaction Tv, the write set of Tv is
compared with the read sets of all overlapping active transactions - those that are still
in their working phase (rule 1). Rule 2 is automatically fulfilled because the active
transactions do not write until after Tv has completed. Let the active transactions have
(consecutive) transaction identifiers active1 to activeN; the following program
describes the algorithm for the forward validation of Tv:
boolean valid = true;
for (int Tid = active1; Tid <= activeN; Tid++) {
    if (write set of Tv intersects read set of Tid) valid = false;
}
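As with backward validation, this can be sketched with Python sets (the example sets are illustrative):

```python
def forward_valid(write_set_tv, active_read_sets):
    """Forward validation: compare Tv's write set with the read sets of
    all currently active (still working) transactions."""
    return all(not (write_set_tv & rs) for rs in active_read_sets)

# Tv wrote A; the active transactions have only read B and C: no conflict.
assert forward_valid({"A"}, [{"B"}, {"C"}]) == True
# An active transaction has read A, which Tv wrote: validation fails.
assert forward_valid({"A"}, [{"A", "B"}]) == False
```

The symmetry with backward validation is visible: only the roles of the read and write sets are exchanged, and the comparison is against active rather than committed transactions.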
In Figure 4.18, the write set of transaction Tv must be compared with the read
sets of the transactions with identifiers active1 and active2. As the transactions being
compared with the validating transaction are still active, we have a choice of whether
to abort the validating transaction or to take some alternative way of resolving the
conflict.
5.5 TIME STAMP ORDERING
In concurrency control schemes based on timestamp ordering, each operation in a
transaction is validated when it is carried out. If the operation cannot be validated, the
transaction is aborted immediately and can then be restarted by the client. Each
transaction is assigned a unique timestamp value when it starts. The basic
timestamp ordering rule is based on operation conflicts and is very simple:
A transaction's request to write an object is valid only if that object was last
read and written by earlier transactions. A transaction's request to read an
object is valid only if that object was last written by an earlier transaction.
In timestamp ordering, each request by a transaction for a read or write
operation on an object is checked to see whether it conforms to the operation
conflict rules. A request by the current transaction Tc can conflict with previous
operations done by other transactions, Ti, whose timestamps indicate that they should
be later than Tc. These rules are shown in Figure 4.19, in which Ti > Tc means Ti is
later than Tc and Ti < Tc means Ti is earlier than Tc.
Rule  Tc     Ti
1.    write  read   Tc must not write an object that has been read by any Ti
                    where Ti > Tc; this requires that Tc >= the maximum read
                    timestamp of the object.
2.    write  write  Tc must not write an object that has been written by any Ti
                    where Ti > Tc; this requires that Tc > the write timestamp
                    of the committed object.
3.    read   write  Tc must not read an object that has been written by any Ti
                    where Ti > Tc; this requires that Tc > the write timestamp
                    of the committed object.
Figure: Operation conflicts for timestamp ordering
Timestamp ordering write rule: By combining rules 1 and 2 we have the following rule
for deciding whether to accept a write operation requested by transaction Tc on object D:
if (Tc >= maximum read timestamp on D &&
    Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with write timestamp Tc
else
    abort transaction Tc
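The write rule can be sketched as a small Python function; the representation of tentative versions as a dictionary is an illustrative assumption:

```python
def try_write(tc, max_read_ts, committed_write_ts, tentative):
    """Timestamp-ordering write rule for transaction with timestamp tc
    on one object. `tentative` maps write timestamp -> tentative value slot."""
    if tc >= max_read_ts and tc > committed_write_ts:
        # Reuse an existing tentative version with write timestamp tc,
        # or create a new one.
        tentative.setdefault(tc, None)
        return "perform write"
    return "abort"

tent = {}
# tc = 3, object last read at 2 and committed at 1: the write is accepted.
assert try_write(3, 2, 1, tent) == "perform write"
# A later transaction (timestamp 5) has already read the object: abort.
assert try_write(3, 5, 1, tent) == "abort"
```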
If a tentative version with write timestamp Tc already exists, the write operation is
addressed to it; otherwise a new tentative version is created and given write timestamp Tc.
Figure illustrates the action of a write operation by a transaction T3 in cases where T3 >=
maximum read timestamp on the object (the read timestamps are not shown). In cases (a) to
(c), T3 > write timestamp on the committed version of the object and a tentative version with
write timestamp T3 is inserted at the appropriate place in the list of tentative versions ordered by
their transaction timestamps. In case (d), T3 < write timestamp on the committed version of
the object and the transaction is aborted.
Timestamp ordering read rule: By using rule 3 we have the following rule for deciding
whether to accept immediately, to wait, or to reject a read operation requested by transaction Tc
on object D:
if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum write timestamp <= Tc
    if (Dselected is committed)
        perform read operation on the version Dselected
    else
        wait until the transaction that made version Dselected commits or aborts
        then reapply the read rule
} else
    abort transaction Tc
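A corresponding sketch of the read rule; representing an object's versions as a list of (write timestamp, committed?) pairs is an assumption made for illustration:

```python
def try_read(tc, versions):
    """Timestamp-ordering read rule. `versions` lists (write_ts, committed)
    pairs for one object: its committed version plus any tentative ones."""
    committed_ts = max(ts for ts, committed in versions if committed)
    if tc <= committed_ts:
        return "abort"            # object written by a later transaction
    # Select the version with the maximum write timestamp <= tc.
    ts, committed = max((v for v in versions if v[0] <= tc),
                        key=lambda v: v[0])
    return "read" if committed else "wait"

# (a)-style case: only a committed version written at timestamp 1.
assert try_read(3, [(1, True)]) == "read"
# (c)-style case: a tentative version by transaction 2 must commit or abort first.
assert try_read(3, [(1, True), (2, False)]) == "wait"
# (d)-style case: committed version written by a later transaction: abort.
assert try_read(3, [(4, True)]) == "abort"
```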
Figure illustrates the timestamp ordering read rule. It includes four cases labeled (a)
to (d), each of which illustrates the action of a read operation by transaction T3. In each case,
a version whose write timestamp is less than or equal to T3 is selected. If such a version exists,
it is indicated with a line. In cases (a) and (b) the read operation is directed to a committed
version - in (a) it is the only version, whereas in (b) there is a tentative version belonging to
a later transaction. In case (c) the read operation is directed to a tentative version and must wait
until the transaction that made it commits or aborts. In case (d) there is no suitable version to
read and transaction T3 is aborted.
5.6 COMPARISON OF METHODS FOR CONCURRENCY CONTROL
All of the methods carry some overheads in the time and space they require, and they
all limit to some extent the potential for concurrent operation.
The timestamp ordering method is similar to two-phase locking in that both use
pessimistic approaches in which conflicts between transactions are detected as each object is
accessed.
On the one hand, timestamp ordering decides the serialization order statically- when a
transaction starts. On the other hand, two-phase locking decides the serialization order
dynamically- according to the order in which objects are accessed.
Timestamp ordering, and in particular multiversion timestamp ordering, is better than
strict two-phase locking for read-only transactions. Two-phase locking is better when the
operations in transactions are predominantly updates.
The pessimistic methods differ in the strategy used when a conflicting access to
an object is detected. Timestamp ordering aborts the transaction immediately,
whereas locking makes the transaction wait - but with a possible later penalty of
aborting to avoid deadlock.
When optimistic concurrency control is used, all transactions are allowed to
proceed, but some are aborted when they attempt to commit, or in forward validation,
transactions are aborted earlier.
This results in relatively efficient operation when there are few conflicts, but a
substantial amount of work may have to be repeated when a transaction is aborted.
Locking has been in use for many years in database systems, whereas timestamp
ordering has been used in the SDD-1 database system. Both methods have been used
in file servers.
5.7 INTRODUCTION TO DISTRIBUTED TRANSACTIONS
In the general case, a transaction, whether flat or nested, will access objects
located in several different computers. We use the term distributed transactions to
refer to a flat file or nested transaction that accesses objects managed by multiple
servers.
When a distributed transaction comes to an end, the atomicity property of
transactions requires that either all of the servers involved commit the transaction or
all of them abort it. One of the servers takes on a coordinator role, which involves
ensuring the same outcome at all of the servers. The manner
in which the coordinator achieves this depends on the protocol chosen.
A protocol known as the two-phase commit protocol is the one most commonly
used. This protocol allows the servers to communicate with one another to reach a
joint decision as to whether to commit or abort.
Concurrency control in distributed transactions is based on the methods
discussed previously. Each server applies local concurrency control to its own objects,
which ensures that transactions are serialized locally.
Distributed transactions must be serialized globally. How this is achieved varies
as to whether locking, timestamp ordering or optimistic concurrency control is in use.
In some cases, the transactions may be serialized at the individual servers, but at the
same time a cycle of dependencies between the different servers may occur and a
distributed deadlock may arise.
Transaction recovery is concerned with ensuring that all the objects involved in
transactions are recoverable. In addition to that, it guarantees that the values of the
objects reflect all the changes made by committed transactions and none of those
made by aborted ones.
5.8 FLAT AND NESTED DISTRIBUTED TRANSACTIONS
A client transaction becomes distributed if it invokes operations in several
different servers. There are two different ways that distributed transactions can be
structured: as flat transactions and as nested transactions. In a flat transaction, a client
makes requests to more than one server. For example, in Figure 4.22(a), transaction T
is a transaction that invokes operations on objects in servers X,Y and Z. A flat client
transaction completes each of its requests before going on to the next one. Therefore,
each transaction accesses servers' objects sequentially. When servers use locking, a
transaction can only be waiting for one object at a time.
In a nested transaction, the top-level transaction can open subtransactions, and each
subtransaction can open further subtransactions down to any depth of nesting. For example, a
client's transaction T opens two subtransactions, T1 and T2, which access objects at servers
X and Y. The subtransactions T1 and T2 open further subtransactions T11, T12, T21 and T22,
which access objects at servers M, N and P. In the nested case, subtransactions at the same level
can run concurrently, so T1 and T2 are concurrent, and as they invoke objects in different
servers, they can run in parallel. The four subtransactions T11, T12, T21 and T22 also run
concurrently.
The coordinator of a distributed transaction
Servers that execute requests as part of a distributed transaction need to be able to
communicate with one another to coordinate their actions when the transaction commits. A
client starts a transaction by sending an openTransaction request to a coordinator in any
server. The coordinator that is contacted carries out the openTransaction and returns the
resulting transaction identifier to the client. Transaction identifiers for distributed transactions
must be unique within a distributed system.
The coordinator that opened the transaction becomes the coordinator for the distributed
transaction and at the end is responsible for committing or aborting it. Each of the servers
that manage an object accessed by a transaction is a participant in the transaction. During the
progress of the transaction, the coordinator records a list of references to the participants,
and each participant records a reference to the coordinator. The method join is used whenever a
new participant joins the transaction:
join(Trans, reference to participant)
Informs a coordinator that a new participant has joined the transaction Trans
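The mutual bookkeeping between coordinator and participants can be sketched as follows; the class names and attributes are illustrative assumptions, with only `join` taken from the notes:

```python
class Coordinator:
    """The coordinator records a list of references to the participants."""
    def __init__(self, trans_id):
        self.trans = trans_id
        self.participants = []

    def join(self, participant):
        # Informs the coordinator that a new participant has joined
        # the transaction, and links the participant back to it.
        self.participants.append(participant)
        participant.coordinator = self

class Participant:
    """Each participant records a reference to the coordinator."""
    def __init__(self, name):
        self.name = name
        self.coordinator = None

coord = Coordinator("T1")
p = Participant("serverX")
coord.join(p)
assert coord.participants == [p] and p.coordinator is coord
```

At commit time the coordinator walks exactly this participant list to run the atomic commit protocol described next.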
5.9 ATOMIC COMMIT PROTOCOLS
In the case of a distributed transaction, the client has requested the operations at
more than one server. A transaction comes to an end when the client requests that it
be committed or aborted. A simple way to complete the transaction in an
atomic manner is for the coordinator to communicate the commit or abort request to all of
the participants in the transaction and to keep on repeating the request until all of them
have acknowledged that they have carried it out. This is an example of a one-phase atomic
commit protocol. This simple one-phase atomic commit protocol is inadequate because,
in the case when the client requests a commit, it does not allow a server to make a
unilateral decision to abort a transaction. The two-phase commit protocol is designed to
allow any participant to abort its part of a transaction. Due to the requirement for
atomicity, if one part of a transaction is aborted, then the whole transaction must also
be aborted.
The two-phase commit protocol
In the first phase of the two-phase commit protocol the coordinator asks all the
participants if they are prepared to commit; in the second, it tells them to commit (or
abort) the transaction. If a participant can commit its part of a transaction, it will agree as
soon as it has recorded the changes and its status in permanent storage and is thus prepared
to commit. The coordinator in a distributed transaction communicates with the
participants to carry out the two-phase commit protocol by means of the operations
summarized below:
canCommit?(trans) → Yes / No
Call from coordinator to participant to ask whether it can commit a transaction.
Participant replies with its vote.
doCommit(trans)
Call from coordinator to participant to tell participant to commit its part of a
transaction.
doAbort(trans)
Call from coordinator to participant to tell participant to abort its part of a transaction.
haveCommitted(trans, participant)
Call from participant to coordinator to confirm that it has committed the transaction.
getDecision(trans) → Yes / No
Call from participant to coordinator to ask for the decision on a transaction after it has
voted Yes but has still had no reply after some delay. Used to recover from server crash
or delayed messages.
The two-phase commit protocol consists of a voting phase and a completion phase.
The steps involved are summarized below. At the end of the second step the coordinator
and all the participants that voted Yes are prepared to commit. By the end of the third step
the transaction is effectively completed. At step 3(a) the coordinator and the participants
are committed, so the coordinator can report a decision to commit to the client. At 3(b)
the coordinator reports a decision to abort to the client. At step 4 participants confirm
that they have committed so that the coordinator knows when the information it has
recorded about the transaction is no longer needed.
Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the
transaction.
2. When a participant receives a canCommit? request it replies with its vote (Yes
or No) to the coordinator. Before voting Yes, it prepares to commit by saving
objects in permanent storage. If the vote is No the participant aborts
immediately.
Phase 2 (completion according to outcome of vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes, the coordinator decides to
commit the transaction and sends a doCommit request to each of the participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort
requests to all participants that voted Yes.
4. Participants that voted Yes are waiting for a doCommit or doAbort request
from the coordinator. When a participant receives one of these messages it
acts accordingly and, in the case of commit, makes a haveCommitted call as
confirmation to the coordinator.
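The decision logic of the steps above can be sketched in Python. This is a minimal, failure-free sketch that ignores permanent storage and timeouts; the function name is an assumption, while the vote and decision names follow the operations listed earlier:

```python
def two_phase_commit(coordinator_vote, participant_votes):
    """Voting phase: the coordinator collects the votes, including its own.
    Completion phase: it decides doCommit only if every vote is Yes."""
    votes = [coordinator_vote] + list(participant_votes)
    if all(v == "Yes" for v in votes):
        return "doCommit"   # sent to every participant
    return "doAbort"        # sent to all participants that voted Yes

# All parties prepared: the transaction commits.
assert two_phase_commit("Yes", ["Yes", "Yes"]) == "doCommit"
# A single No vote aborts the whole transaction (atomicity).
assert two_phase_commit("Yes", ["Yes", "No"]) == "doAbort"
```

The sketch makes the key property visible: any single participant can force an abort, which is exactly what the one-phase protocol could not offer.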
There are various stages in the protocol at which the coordinator or a participant
cannot progress its part of the protocol until it receives another request or reply from one
of the others.
Consider the situation where a participant has voted Yes and is waiting for the
coordinator to report on the outcome of the vote by telling it to commit or abort the
transaction. As shown in Figure 4.23, the participant is uncertain of the outcome and
cannot proceed any further until it gets the outcome of the vote from the coordinator.
The participant cannot decide unilaterally what to do next, and meanwhile the
objects used by its transaction cannot be released for use by other transactions. The
participant makes a getDecision request to the coordinator to determine the outcome of
the transaction. When it gets the reply it continues the protocol at step 4 mentioned
above. If the coordinator has failed, the participant will not be able to get the decision
until the coordinator is replaced, which can result in extensive delays for participants in
the uncertain state.
Two phase commit protocol for nested transactions:
A coordinator for a subtransaction will provide an operation to open a
subtransaction, together with an operation enabling the coordinator of a subtransaction
to enquire whether its parent has yet committed or aborted, as shown in Figure 4.24. A
client starts a set of transactions by opening a top-level transaction with an
openTransaction operation, which returns a transaction identifier for the top-level
transaction. The client starts a subtransaction by invoking the openSubTransaction
operation, whose argument specifies its parent transaction. The new subtransaction
automatically joins the parent transaction, and a transaction identifier for the
subtransaction is returned.
Operations in coordinator for nested transactions:
openSubTransaction(trans) → subTrans
Opens a new subtransaction whose parent is trans and returns a
unique subtransaction identifier.
getStatus(trans) → committed, aborted, provisional
Asks the coordinator to report on the status of the transaction trans. Returns values
representing one of the following: committed, aborted, provisional.
Figure: Operations in coordinator for nested transactions
Each of the nested transactions carries out its operations. When they are finished,
the server managing a subtransaction records information as to whether the
subtransaction committed provisionally or aborted. Note that if its parent aborts, then
the subtransaction will be forced to abort too.
A parent transaction - including a top-level transaction - can commit even if one
of its child subtransactions has aborted. Consider the top-level transaction T and its
subtransactions shown in Figure 4.25, which is based on the figure above. Each
subtransaction has either provisionally committed or aborted. For example, T12 has
provisionally committed and T11 has aborted, but the fate of T12 depends on its parent T1
and eventually on the top-level transaction, T. Although T21 and T22 have both
provisionally committed, T2 has aborted and this means that T21 and T22 must also
abort. Suppose that T decides to commit in spite of the fact that T2 has aborted, and
that T1 decides to commit in spite of the fact that T11 has aborted.
When a top-level transaction completes, its coordinator carries out a two-phase
commit protocol. The top-level transaction plays the role of coordinator in the two-phase
commit protocol.
The two-phase commit protocol may be performed in either a hierarchic manner or
in a flat manner. The second phase of the two-phase commit protocol is the same as for
the non-nested case. The coordinator collects the votes and then informs the participants
as to the outcome. When it is complete, coordinator and participants will have committed
or aborted their transactions.
Hierarchic two-phase commit protocol: In this approach, the two-phase commit
protocol becomes a multi-level nested protocol. The coordinator of the top-level
transaction communicates with the coordinators of the subtransactions for which it is the
immediate parent.
It sends canCommit? messages to each of the latter, which in turn pass them on to
the coordinators of their child transactions (and so on down the tree). Each participant
collects the replies from its descendants before replying to its parent. In our example, T
sends canCommit? messages to the coordinator of T1, and then T1 sends canCommit?
messages to T12, asking about the descendants of T1.
The protocol does not include the coordinators of transactions such as T2, which
has aborted. The participant receiving the call looks in its transaction list for any
provisionally committed transaction or subtransaction; if it finds any, it replies with a Yes vote,
otherwise it replies with a No vote.
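The way canCommit? replies trickle down the tree and are collected back up can be sketched recursively. The tree encoding as (name, children) pairs and the function name are illustrative assumptions, not from the notes:

```python
def hierarchic_can_commit(node, provisional):
    """Returns the vote a subtree's coordinator gives to its parent.
    `node` is (name, children); `provisional` is the set of transactions
    that have provisionally committed."""
    name, children = node
    if name not in provisional:
        return False   # aborted or failed subtree replies with a No vote
    # Collect the replies from all descendants before replying to the parent.
    return all(hierarchic_can_commit(child, provisional) for child in children)

# The example tree: T is the parent of T1, which is the parent of T12.
tree = ("T", [("T1", [("T12", [])])])
assert hierarchic_can_commit(tree, {"T", "T1", "T12"}) == True
assert hierarchic_can_commit(tree, {"T", "T1"}) == False   # T12 not prepared
```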
Flat two-phase commit protocol: In this approach, the coordinator of the top-level
transaction sends canCommit? messages to the coordinators of all of the subtransactions
in the provisional commit list - in our example, to the coordinators of T1 and T12.
When a participant receives a canCommit? request, it does the following:
If the participant has any provisionally committed transactions that are
descendants of the top-level transaction, trans:
- it checks that they do not have aborted ancestors in the abort list, then prepares to
commit (by recording the transaction and its objects in permanent storage);
- those with aborted ancestors are aborted;
- it sends a Yes vote to the coordinator.
If the participant does not have a provisionally committed descendant of the top-
level transaction, it must have failed since it performed the subtransaction and it
sends a No vote to the coordinator.
5.10 CONCURRENCY CONTROL IN DISTRIBUTED TRANSACTIONS
Locking
In a distributed transaction, the locks on an object are held locally. The local lock
manager can decide whether to grant a lock or make the requesting transaction wait.
However, it cannot release any locks until it knows that the transaction has been
committed or aborted at all the servers involved in the transaction. As lock managers in
different servers set their locks independently of one another, it is possible that different
servers may impose different orderings on transactions. Consider the following interleaving of
transactions T and U at servers X and Y:
T                                  U
Write(A) at X    locks A
                                   Write(B) at Y    locks B
Read(B) at Y     waits for U
                                   Read(A) at X     waits for T
Timestamp ordering concurrency control
In distributed transactions, a globally unique transaction timestamp is issued to the
client by the first coordinator accessed by a transaction. The transaction timestamp is
passed to the coordinator at each server whose objects perform an operation in the
transaction. The servers of distributed transactions are jointly responsible for ensuring
that they are performed m a serially equivalent manner. For example, if the version of an
object accessed by transaction U commits after the version accessed by T at one server,
then if T and U access the same object as one another at other servers, the coordinators
must agree as to the ordering of their timestamps. A timestamp consists of a pair <local
timestamp, server-id>. The agreed ordering of pairs of timestamps is based on a
comparison in which the server-id part is less significant.
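Because a Python tuple compares element by element from the left, the <local timestamp, server-id> pair can be ordered directly; a minimal sketch with illustrative server names:

```python
def ts_key(local_ts, server_id):
    """Globally unique timestamp <local timestamp, server-id>.
    The server-id is the less significant part: it only breaks ties
    between equal local timestamps."""
    return (local_ts, server_id)

# Ordered by local timestamp first...
assert ts_key(5, "X") < ts_key(7, "A")
# ...and by server-id only when the local timestamps are equal.
assert ts_key(5, "X") < ts_key(5, "Y")
```

Making the server-id part of the timestamp guarantees that no two servers ever issue the same global timestamp, so the coordinators always agree on a total order.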
When timestamp ordering is used for concurrency control, conflicts are resolved as
each operation is performed. If the resolution of a conflict requires a transaction to be
aborted, the coordinator will be informed and it will abort the transaction at all the
participants. Therefore any transaction that reaches the client request to commit should
always be able to commit.
Optimistic concurrency control
A distributed transaction is validated by a collection of independent servers, each of
which validates transactions that access its own objects. The validation at all of the servers
takes place during the first phase of the two-phase commit protocol. Consider the
following interleaving of transactions T and U, which access objects A and B servers X
and Y, respectively.
T                                  U
Read(A)  at X
Write(A)
                                   Read(B)  at Y
                                   Write(B)
Read(B)  at Y
Write(B)
                                   Read(A)  at X
                                   Write(A)
The transactions access the objects in the order T before U at server X and in the
order U before T at server Y. Now suppose that T and U start validation at about the
same time, but server X validates T first and server Y validates U first. Each server will
be unable to validate the other transaction until the first one has completed. This is an
example of commitment deadlock. In distributed optimistic transactions, each server
applies a parallel validation protocol. This is an extension of either backward or forward
validation to allow multiple transactions to be in the validation phase at the same time. If
parallel validation is used transactions will not suffer from commitment deadlock.
In a distributed system involving multiple servers being accessed by multiple
transactions, a global wait-for graph can in theory be constructed from the local ones. There
can be a cycle in the global wait-for graph that is not in any single local one - that is, there
can be a distributed deadlock. Figure shows the interleaving of the transactions U, V and
W involving the objects A and B managed by servers X and Y and objects C and D
managed by server Z.
The complete wait-for graph shows that a deadlock cycle consists of alternate edges,
which represent a transaction waiting for an object and an object held by a transaction.
5.11 Distributed deadlocks
U                              V                              W
d.deposit(10)  lock D
a.deposit(20)  lock A  at X
                               b.deposit(10)  lock B  at Y
                                                              c.deposit(30)  lock C  at Z
b.withdraw(30) wait at Y
                               c.withdraw(20) wait at Z
                                                              a.withdraw(20) wait at X
Distributed Systems Notes
Dept. of CSE,SIT 114 KNS
Phantom deadlocks: A deadlock that is 'detected' but is not really a deadlock is called a
phantom deadlock. In distributed deadlock detection, information about wait-for
relationships between transactions is transmitted from one server to another. If there is a
deadlock, the necessary information will eventually be collected in one place and a
cycle will be detected. As this procedure will take some time, there is a chance that one
of the transactions that holds a lock will meanwhile have released it, in which case the
deadlock will no longer exist.
Edge chasing: A distributed approach to deadlock detection uses a technique called
edge chasing or path pushing. In this approach, the global wait-for graph is not
constructed, but each of the servers involved has knowledge about some of its edges. The
servers attempt to find cycles by forwarding messages called probes, which follow the edges
of the graph throughout the distributed system. A probe message consists of transaction
wait-for relationships representing a path in the global wait-for graph.
Edge-chasing algorithms have three steps - initiation, detection and resolution.
Initiation: When a server notes that a transaction T starts waiting for another
transaction U, where U is waiting to access an object at another server, it initiates
detection by sending a probe containing the edge <T → U> to the server of the object at
which transaction U is blocked. If U is sharing a lock, probes are sent to all the holders of
the lock. Sometimes further transactions may start sharing the lock later on, in which
case probes can be sent to them too.
Detection: Detection consists of receiving probes and deciding whether deadlock
has occurred and whether to forward the probes. For example:
When a server of an object receives a probe <T → U> (indicating that T is
waiting for a transaction U that holds a local object), it checks to see whether U is also
waiting. If it is, the transaction it waits for (for example, V) is added to the probe
(making it <T → U → V>), and if the new transaction (V) is waiting for another
object elsewhere, the probe is forwarded. In this way, paths through the global wait-
for graph are built one edge at a time. Before forwarding a probe, the server checks to
see whether the transaction just added already appears in the probe (for example,
<T → U → V → T>). If this is the case, it has found a cycle in the graph and deadlock
has been detected.
Resolution: When a cycle is detected, a transaction in the cycle is aborted to
break the deadlock.
In our example, the following steps describe how deadlock detection is
initiated and the probes that are forwarded during the corresponding detection phase:
Server X initiates detection by sending probe <W → U> to the server of B
(server Y).
Server Y receives probe <W → U>, notes that B is held by V and appends V
to the probe to produce <W → U → V>. It notes that V is waiting for C at
server Z, so this probe is forwarded to server Z.
Server Z receives probe <W → U → V>, notes that C is held by W and
appends W to the probe to produce <W → U → V → W>.
This path contains a cycle, so the server detects a deadlock. One of the transactions in
the cycle must be aborted to break the deadlock. The transaction to be aborted can be
chosen according to transaction priorities. Transaction priorities could also be used to
reduce the number of probes that are forwarded.
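The probe-extension logic above can be simulated in a few lines of Python. This is a single-process sketch, not a distributed implementation: the wait-for map and function name are illustrative, and each server's local knowledge is collapsed into one dictionary.

```python
def edge_chase(waits_for, start_edge):
    """Extend a probe along wait-for edges until a transaction repeats
    (deadlock) or the last transaction in the path is not waiting.
    `waits_for` maps each transaction to the one it is waiting for."""
    probe = list(start_edge)                # e.g. ["W", "U"] for <W → U>
    while probe[-1] in waits_for:
        nxt = waits_for[probe[-1]]
        probe.append(nxt)
        if nxt in probe[:-1]:               # cycle found: deadlock detected
            return probe
    return None                             # no deadlock along this path

# The example from the notes: W waits for U, U waits for V, V waits for W.
waits = {"W": "U", "U": "V", "V": "W"}
assert edge_chase(waits, ["W", "U"]) == ["W", "U", "V", "W"]
# With V's lock released, the probe dies out and no deadlock is reported.
assert edge_chase({"W": "U", "U": "V"}, ["W", "U"]) is None
```

The returned path <W → U → V → W> is exactly the cycle the servers build one edge at a time in the example.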
5.12 TRANSACTION RECOVERY
Recovery is concerned with ensuring that a server's objects are durable and that
the service provides failure atomicity.
The task of a recovery manager is:
• to save objects in permanent storage (in a recovery file) for committed
transactions;
• to restore the server's objects after a crash;
• to reorganize the recovery file to improve the performance of recovery;
• to reclaim storage space (in the recovery file).
Intentions list: Any server that provides transactions needs to keep track of the objects
accessed by clients' transactions. At each server, an intentions list is recorded for each of its
currently active transactions - an intentions list of a particular transaction contains a list
of the references and the values of all the objects that are altered by that transaction. When
a transaction is committed, that transaction's intentions list is used to identify the objects
it affected. The committed version of each object is replaced by the tentative version
made by that transaction, and the new value is written to the server's recovery file. When a
transaction aborts, the server uses the intentions list to delete all the tentative versions of
objects made by that transaction.
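A minimal sketch of how an intentions list drives commit and abort; the recovery file is modelled as a simple append-only list, and all names are illustrative assumptions:

```python
def commit(intentions, committed, recovery_file):
    """On commit: replace each committed version with the tentative one
    and write the new value to the recovery file."""
    for obj, value in intentions.items():
        committed[obj] = value
        recovery_file.append((obj, value))
    intentions.clear()

def abort(intentions):
    """On abort: delete all tentative versions made by the transaction."""
    intentions.clear()

committed, log = {"A": 100}, []
intentions = {"A": 80, "B": 50}   # tentative values altered by the transaction
commit(intentions, committed, log)
assert committed == {"A": 80, "B": 50}
assert len(log) == 2 and not intentions
```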
Logging: In the logging technique, the recovery file represents a log containing the
history of all the transactions performed by a server. The history consists of values
of objects, transaction status entries and intentions lists of transactions. The recovery
file will contain a recent snapshot of the values of all the objects in the server,
followed by a history of the transactions after the snapshot.
During the normal operation of a server, its recovery manager is called whenever
a transaction prepares to commit, commits or aborts. When the server is prepared
to commit a transaction, the recovery manager appends all the objects
in its intentions list to the recovery file, followed by the current status of that transaction
(prepared) together with its intentions list.
The log was recently reorganized and entries to the left of the double line represent
a snapshot of the values of A, B and C before transactions T and U started. Here we
use the names A, B and C as unique identifiers for objects. We show the situation
when transaction T has committed and transaction U has prepared but not
committed. When transaction T prepares to commit, the values of objects A and B are
written at positions P1 and P2 in the log, followed by a prepared transaction status
entry for T with its intentions list (<A, P1>, <B, P2>). When transaction T
commits, a committed transaction status entry for T is put at position P4. Then, when
transaction U prepares to commit, the values of objects C