TRANSCRIPT
1
High Performance Data Streaming in a Service Architecture
Jackson State University Internet Seminar
November 18, 2004
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories, Indiana University, Bloomington IN 47401
[email protected]
http://www.infomall.org http://www.grid2002.org
2
Abstract
We discuss a class of HPC applications characterized by large-scale simulations linked to large data streams coming from sensors, data repositories and other simulations. Such applications will increase in importance to support "data-deluged science". We show how Web service and Grid technologies offer significant advantages over traditional approaches from the HPC community. We cover Grid workflow (contrasting it with dataflow) and how Web Service (SOAP) protocols can achieve high performance.
3
Parallel Computing
Parallel processing is built on breaking problems up into parts and simulating each part on a separate computer node.
There are several ways of expressing this breakup into parts with software:
• Message Passing as in MPI, or
• OpenMP model for annotating traditional languages, or
• Explicitly parallel languages like High Performance Fortran
And several computer architectures designed to support this breakup:
• Distributed Memory with or without custom interconnect
• Shared Memory with or without good cache
• Vectors with usually good memory bandwidth
4
The Six Fundamental MPI Routines
MPI_Init(argc, argv) -- initialize
MPI_Comm_rank(comm, rank) -- find process label (rank) in group
MPI_Comm_size(comm, size) -- find total number of processes
MPI_Send(sndbuf, count, datatype, dest, tag, comm) -- send a message
MPI_Recv(recvbuf, count, datatype, source, tag, comm, status) -- receive a message
MPI_Finalize() -- end MPI
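The execution model behind these six calls (ranked processes exchanging tagged messages) can be mimicked in plain Python purely as an illustration; the sketch below is NOT MPI, and the class and function names are invented for the example.

```python
import threading
import queue

# Toy stand-in for an MPI communicator: one inbox (queue) per rank.
# This only illustrates the send/receive execution model.
class ToyComm:
    def __init__(self, size):
        self.size = size                                   # cf. MPI_Comm_size
        self.inboxes = [queue.Queue() for _ in range(size)]

    def send(self, buf, dest, tag):                        # cf. MPI_Send
        self.inboxes[dest].put((tag, buf))

    def recv(self, rank):                                  # cf. MPI_Recv
        tag, buf = self.inboxes[rank].get()                # blocks until a message arrives
        return tag, buf

def worker(comm, rank, results):
    # Each "process" knows its rank (cf. MPI_Comm_rank); rank 0 sends,
    # every other rank receives one message.
    if rank == 0:
        for dest in range(1, comm.size):
            comm.send(f"hello {dest}", dest=dest, tag=7)
    else:
        results[rank] = comm.recv(rank)

comm = ToyComm(size=3)                                     # cf. MPI_Init
results = {}
threads = [threading.Thread(target=worker, args=(comm, r, results))
           for r in range(comm.size)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                               # cf. MPI_Finalize
print(results)
```

Each rank runs as a thread here only for convenience; in real MPI the ranks are separate processes, usually on separate nodes.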
5
Whatever the Software/Parallel Architecture ...
The software is a set of linked parts:
• Threads or processes sharing the same memory, or independent programs on different computers
And the parts must pass information between them to synchronize themselves and ensure they really are working on the same problem.
The same of course is true in any system:
• Neurons pass electrical signals in the brain
• Humans use a variety of information passing schemes to build communities: voice, book, phone
• Ants and bees use chemical messages
Systems are built of parts, and in interesting systems the parts communicate with each other; this communication expresses "why it is a system" and not a bunch of independent bits.
6
A Picture from 20 years ago
7
Passing Information
Information passing between parts covers a wide range in size (number of bits electronically) and "urgency":
Communication Time = Latency + (Information Size)/Bandwidth
From society we know that we choose multiple mechanisms with different tradeoffs:
• Planes have high latency and high bandwidth
• Walking is low latency but low bandwidth
• Cars are somewhere in between these cases
We can always think of information being transferred as a message:
• Whether an airplane passenger, sound waves or a posted letter
• Whether an MPI message, a UNIX pipe between processes, or a method call between threads
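The formula above is easy to evaluate for representative transports; the parameter values below are illustrative assumptions (roughly BlueGene/L-class MPI versus a wide-area Web service link), not measurements.

```python
# Communication Time = Latency + Size / Bandwidth, for two transports.
# Parameter values are illustrative assumptions, not measured figures.
def comm_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

mpi  = dict(latency_s=5e-6, bandwidth_bytes_per_s=1e9)  # ~5 us, ~1 GB/s
soap = dict(latency_s=0.1,  bandwidth_bytes_per_s=1e7)  # ~100 ms, ~10 MB/s

for size in (1_000, 1_000_000):
    t_mpi = comm_time(size, **mpi)
    t_soap = comm_time(size, **soap)
    print(f"{size:>9} bytes: MPI {t_mpi:.2e} s, SOAP {t_soap:.2e} s")
```

Note how small messages are dominated entirely by latency: bandwidth only matters once Size/Bandwidth exceeds Latency, which is the point made on the BlueGene/L slides below.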
8
Parallel Computing and Message Passing
We worked very hard to get a better programming model for parallel computing that removed the need for the user to:
• Explicitly decompose the problem and derive a parallel algorithm for the decomposed parts
• Write MPI programs expressing the explicit decomposition
This effort wasn't so successful, and on distributed-memory machines (including BlueGene/L) at least, MPI-style message passing is the execution model even if one uses a higher-level language.
So for parallelism we are forced to use message passing, and this is efficient but intellectually hard.
9
The Latest Top 5 in Top500
10
What about Web Services?
• Web Services are distributed computer programs that can be in any language (Fortran .. Java .. Perl .. Python)
• The simplest implementations involve XML messages (SOAP) and programs written in net-friendly languages like Java and Python
• Here is a typical e-commerce use:
[Diagram: e-commerce services -- Security, Catalog, Payment/Credit Card, Warehouse/Shipping -- each exposed through WSDL interfaces]
11
Internet Programming Model
Web Services are designed as the latest distributed computing programming paradigm, motivated by the Internet and the expectation that enterprise software will be built on the same software base.
Parallel Computing is centered on DECOMPOSITION; Internet Programming is centered on COMPOSITION.
The components of e-commerce (catalog, shipping, search, payment) are NATURALLY separated (although they are often mistakenly integrated in older implementations).
These same components are naturally linked by messages.
MPI is replaced by SOAP, and the COMPOSITION model is called Workflow.
Parallel Computing and the Internet have the same execution model (processes exchanging messages) but very different REQUIREMENTS.
12
Requirements for MPI Messaging
MPI and SOAP messaging both send data from a source to a destination:
• MPI supports multicast (broadcast) communication
• MPI specifies destination and a context (in the comm parameter)
• MPI specifies the data to send
• MPI has a tag to allow flexibility in processing in the source processor
• MPI has calls to understand context (number of processors etc.)
MPI requires very low latency and high bandwidth so that tcomm/tcalc is at most 10.
• BlueGene/L has bandwidth between 0.25 and 3 Gigabytes/sec/node and latency of about 5 microseconds
• Latency is hidden only when Message Size/Bandwidth > Latency
13
BlueGene/L MPI I
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf
14
BlueGene/L MPI II
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf
15
BlueGene/L MPI III
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf
500 Megabytes/sec
16
Requirements for SOAP Messaging
Web Services have much the same requirements as MPI, with two differences where MPI is more stringent than SOAP:
• Latencies are inevitably 1 (local) to 100 milliseconds, which is 200 to 20,000 times that of BlueGene/L:
  1) 0.000001 ms -- CPU does a calculation
  2) 0.001 to 0.01 ms -- MPI latency
  3) 1 to 10 ms -- wake up a thread or process
  4) 10 to 1000 ms -- Internet delay
• Bandwidths for many business applications are low, as one just needs to send enough information for the ATM and the bank to define transactions
SOAP has MUCH greater flexibility in areas like security, fault tolerance and "virtualized addressing", because one can run a lot of software in 100 milliseconds.
• It typically takes 1-3 milliseconds to gobble up a modest message in Java and "add value"
17
Ways of Linking Software Modules
METHOD CALL BASED: Module A calls Module B, closely coupled (Java/Python ...), with .001 to 1 millisecond overhead.
MESSAGE BASED (coarse-grain service model): Service A and Service B exchange messages, with 0.1 to 1000 millisecond latency.
EVENT BASED with brokered messages: a Publisher posts events and a "Listener" subscribes to events via a message queue "in the sky".
18
MPI and SOAP Integration
Note SOAP specifies the message format and, through WSDL, the interfaces; MPI only specifies the interface, so interoperability between different MPIs requires additional work:
• IMPI http://impi.nist.gov/IMPI/
Pervasive networks can support high bandwidth (Terabits/sec soon), but the latency issue is not resolvable in a general way.
One can combine MPI interfaces with SOAP messaging, but I don't think this has been done.
Just as walking, cars, planes and phones coexist with different properties, so SOAP and MPI are both good and should be used where appropriate.
19
NaradaBrokering http://www.naradabrokering.org
We have built a messaging system that is designed to support traditional Web Services but has an architecture that allows it to support the high-performance data transport required for scientific applications.
• We suggest using this system whenever your application can tolerate 1-10 millisecond latency in linking components
• Use MPI when you need much lower latency
Use the SOAP approach when MPI interfaces are required but latency is high:
• As in linking two parallel applications at remote sites
Technically it forms an overlay network, supporting in software features often done at the IP level.
20
[Figure: Mean transit delay for message samples in NaradaBrokering for different communication hops (hop-2, hop-3, hop-5, hop-7); transit delay (milliseconds, 0-9) vs. message payload size (100-1000 bytes). Testbed: Pentium-3, 1 GHz, 256 MB RAM, 100 Mbps LAN, JRE 1.3, Linux]
21
[Figure: Standard deviation (milliseconds) for message samples in NaradaBrokering for different communication hops (hop-2, hop-3, hop-5, hop-7) on internal machines, vs. message payload size (1000-5000 bytes)]
22
23
[Figure: Average video delays for one broker (divide by N for N load-balanced brokers): latency (ms) vs. number of receivers, for one session and multiple sessions at 30 frames/sec]
24
NB-enhanced GridFTP
Adds reliability and Web Service interfaces to GridFTP; preserves parallel TCP performance and offers choice of transport and firewall penetration.
25
Role of Workflow
Programming SOAP and Web Services (the Grid): workflow describes the linkage between services.
As distributed, the linkage must be by messages.
Linkage is two-way and has both control and data.
Apply to multi-disciplinary, multi-scale linkage and multi-program linkage; link visualization to simulation, GIS to simulations and visualization filters to each other.
The Microsoft-IBM specification BPEL is the currently preferred Web Service XML specification of workflow.
[Diagram: Service-1, Service-2 and Service-3 linked by workflow]
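As a rough illustration of what such a workflow specification looks like, the fragment below sketches a BPEL4WS-style process that receives data and invokes two services in sequence. The partner and operation names (sensorFeed, dataMiner, visualizer) are invented for this sketch, and all schema detail is elided.

```xml
<!-- Schematic BPEL-style workflow: receive input, then invoke two
     services in sequence. All names here are illustrative. -->
<process name="ExampleWorkflow"
         xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/">
  <sequence>
    <receive partnerLink="sensorFeed" operation="newData"
             variable="input" createInstance="yes"/>
    <invoke partnerLink="dataMiner" operation="mineData"
            inputVariable="input" outputVariable="patterns"/>
    <invoke partnerLink="visualizer" operation="visualize"
            inputVariable="patterns"/>
  </sequence>
</process>
```

The key point is that the workflow document only wires services together by messages; the services themselves are programmed separately.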
26
Example Workflow
Here a sensor feeds a data-mining application (we are extending data mining in DoD applications with Grossman from UIC); the data-mining application drives a visualization.
27
Example Flood Simulation Workflow
[Diagram: Data Archives feed Runoff Models, which feed Flow Models; GIS Grid Services link the distributed data and applications via SOAP messages and events]
28
SERVOGrid Codes, Relationships
[Diagram: Elastic Dislocation, Pattern Recognizers, Fault Model BEM, Viscoelastic Layered BEM, Viscoelastic FEM, Elastic Dislocation Inversion]
This linkage is called Workflow in Grid/Web Service parlance.
29
Two-level Programming I
• The Web Service (Grid) paradigm implicitly assumes a two-level programming model
• We make a Service (same as a "distributed object" or "computer program" running on a remote computer) using conventional technologies:
  - C++, Java or Fortran Monte Carlo module
  - Data streaming from a sensor or satellite
  - Specialized (JDBC) database access
• Such services accept and produce data from users, files and databases
• The Grid is built by coordinating such services, assuming we have solved the problem of programming the service
[Diagram: a Service with its Data]
30
Two-level Programming II
The Grid addresses the composition of distributed services, with runtime interfaces to the Grid as opposed to UNIX pipes/data streams.
This is familiar from the use of UNIX shell, Perl or Python scripts to produce real applications from core programs.
Such interpretative environments are the single-processor analog of Grid programming.
Some projects like GrADS from Rice University are looking at integration between the service and composition levels, but the dominant effort looks at each level separately.
[Diagram: Service1, Service2, Service3 and Service4 composed into an application]
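The scripting analogy can be made concrete: below, three toy "services" (plain Python functions, with names invented for the sketch) are composed into an application by a thin script, just as a shell pipeline composes core programs.

```python
# Two-level model in miniature: each "service" is programmed
# conventionally (level 1); the final line only composes them
# (level 2). Function names are invented for illustration.
def fetch_archive():
    # Level 1: a data-source service.
    return [3.2, 5.1, 0.4, 7.8]

def filter_data(values):
    # Level 1: a filter service -- drop small readings.
    return [v for v in values if v > 1.0]

def summarize(values):
    # Level 1: an analysis service -- mean of the filtered readings.
    return sum(values) / len(values)

# Level 2: composition -- the "workflow" is just the wiring.
result = summarize(filter_data(fetch_archive()))
print(result)
```

In the Grid version the wiring is expressed in a workflow document and the links are SOAP messages rather than in-process calls, but the division of labor is the same.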
31
3-Layer Programming Model
[Diagram:
Level 1 programming: the Application itself (MPI, Fortran, C++ etc.)
Level 2 "programming": Application semantics (metadata, ontology; the Semantic Web)
Level 3 programming: Workflow (BPEL), linking Web Services 1-4 over the basic Web Service infrastructure]
Workflow will be built on top of NaradaBrokering as the messaging layer.
32
Structure of SOAP
• SOAP defines a very obvious message structure, with a header and a body, just like email
• The header contains information used by the "Internet operating system":
  - Destination, Source, Routing, Context, Sequence Number ...
• The message body is partly further information used by the operating system, and partly information for the application; the latter is not looked at by the "operating system" except to encrypt or compress it etc.
  - Note WS-Security supports separate encryption for different parts of a document
• Much discussion in the field revolves around what is referenced in the header
• This structure makes it possible to define VERY sophisticated messaging
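A minimal SOAP 1.1 envelope makes the header/body split concrete; the header elements in the "ex" namespace and the payload below are made up for illustration.

```xml
<!-- Minimal SOAP 1.1 envelope. The header mimics the kind of
     routing/sequencing information discussed above; the elements
     in the "ex" namespace are invented for this sketch. -->
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:ex="http://example.org/headers">
  <soap:Header>
    <ex:Destination>http://example.org/DataMiner</ex:Destination>
    <ex:SequenceNumber>42</ex:SequenceNumber>
  </soap:Header>
  <soap:Body>
    <ex:SensorReading>
      <ex:value>3.14</ex:value>
    </ex:SensorReading>
  </soap:Body>
</soap:Envelope>
```

Intermediaries act on the header; only the final application interprets the body.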
33
Deployment Issues for "System Services"
"System Services" (handlers/filters) are ones that act before the real application logic of a service.
They gobble up the part of the SOAP header identified by the namespace they care about, and possibly part or all of the SOAP body:
• e.g. the XML elements in the header from the WS-RM namespace
They return a modified SOAP header and body to the next handler in the chain.
[Diagram: Header and Body passing through a WS-RM Handler, then a WS-…….. Handler; e.g. ……. could be WS-Eventing, WS-Transfer ….]
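The handler-chain idea can be sketched as follows; the message layout (a dict of namespace-keyed header blocks plus a body) and the handler names are invented for this example.

```python
# Sketch of a SOAP-style handler chain: each handler consumes the
# header block for the namespace it cares about and passes the
# modified message on. Structure and names are invented.
def wsrm_handler(message):
    # "Gobble up" the WS-RM header block, e.g. to track sequencing.
    block = message["header"].pop("wsrm", None)
    if block is not None:
        message.setdefault("seen", []).append(("wsrm", block["seq"]))
    return message

def security_handler(message):
    # Consume a (made-up) security header block.
    block = message["header"].pop("security", None)
    if block is not None:
        message.setdefault("seen", []).append(("security", block["token"]))
    return message

def run_chain(message, handlers):
    for handler in handlers:
        message = handler(message)
    return message

msg = {"header": {"wsrm": {"seq": 1}, "security": {"token": "abc"}},
       "body": "<SensorReading>3.14</SensorReading>"}
out = run_chain(msg, [wsrm_handler, security_handler])
# The application logic now sees a header stripped of system blocks.
print(out["header"], out["seen"])
```

Each handler touches only its own namespace, which is what lets system services be deployed and chained independently of the application.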
34
Fast Web Service Communication I
• Internet messaging systems allow one to optimize message streams at the cost of "startup time"
• Web Services can deliver the fastest possible interconnections, with or without reliable messaging
• Typical results from Grossman (UIC) comparing slow SOAP over TCP with binary and UDP transport (the latter gains a factor of 1000):

Record Count   SOAP/XML (pure SOAP)     WS-DMX/ASCII (SOAP over UDP)   WS-DMX/Binary (binary over UDP)
               MB      µ       σ/µ      MB      µ      σ/µ             MB      µ      σ/µ
10000          0.93    2.04    6.45%    0.5     1.47   0.61%           0.28    1.45   0.38%
50000          4.65    8.21    1.57%    2.4     1.79   0.50%           1.4     1.63   0.27%
150000         13.9    26.4    0.30%    7.2     2.09   0.62%           4.2     1.94   0.85%
375000         34.9    75.4    0.25%    18      3.08   0.29%           10.5    2.11   1.11%
1000000        93      278     0.11%    48      3.88   1.73%           28      3.32   0.25%
5000000        465     7020    2.23%    242     8.45   6.92%           140     5.60   8.12%

(Compare 7020 for pure SOAP with 5.60 for binary over UDP at 5,000,000 records.)
35
Fast Web Service Communication II
• The mechanism only works for streams -- sets of related messages
• The SOAP header in a stream is constant except for the sequence number (Message ID), time stamp, ..
• One needs two types of new Web Service specification:
• "WS-StreamNegotiation" to define how one can use WS-Policy to send messages at the start of a stream that define the methodology for treating the remaining messages in the stream
• "WS-FlexibleRepresentation" to define new encodings of messages
36
Fast Web Service Communication III
• Then use "WS-StreamNegotiation" to negotiate the stream in Tortoise SOAP -- ASCII XML over HTTP and TCP:
  - Deposit the basic SOAP header through the connection -- it is part of the context for the stream (linking of 2 services)
  - Agree on firewall penetration, reliability mechanism, binary representation and fast transport protocol
  - Naturally transport UDP plus WS-RM
• Use "WS-FlexibleRepresentation" to define the encoding of a fast transport (on a different port), with messages just having a "FlexibleRepresentationContextToken", sequence number, and time stamp if needed
  - RTP packets have essentially this structure
  - Could add stream termination status
• Can monitor and control with the original negotiation stream
• Can generate different streams optimized for different endpoints
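The effect of such a negotiation can be sketched in a few lines: one full SOAP-style header establishes a context, and subsequent messages carry only a small token plus a sequence number, with the receiver reattaching the stored header. All names and the message layout below are invented for this illustration.

```python
# Sketch of stream-style header elision. The first exchange deposits
# the full header under a context token; later messages carry only
# (token, sequence number, payload). Names are invented.
class StreamReceiver:
    def __init__(self):
        self.contexts = {}          # token -> full header, set at negotiation

    def negotiate(self, token, full_header):
        # cf. WS-StreamNegotiation: header deposited once per stream.
        self.contexts[token] = full_header

    def receive(self, token, seq, payload):
        # cf. WS-FlexibleRepresentation: tiny per-message framing;
        # the stored header is reattached with the per-message fields.
        header = dict(self.contexts[token], seq=seq)
        return {"header": header, "body": payload}

rx = StreamReceiver()
rx.negotiate("ctx-1", {"source": "svcA", "dest": "svcB", "reliable": True})
msgs = [rx.receive("ctx-1", seq, data)
        for seq, data in enumerate([b"frame0", b"frame1"])]
print(msgs[1]["header"]["dest"], msgs[1]["header"]["seq"])
```

This is exactly the RTP-like structure mentioned above: a fixed per-stream context plus a minimal per-packet header.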
37
Data Deluged Science
In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn't consider it as an enabler of new algorithms and new ways of computing.
Data assimilation was not central to HPCC; DoE ASC was set up because one didn't want test data!
Now particle physics will get 100 petabytes from CERN:
• Nuclear physics (Jefferson Lab) is in the same situation
• Use around 30,000 CPUs simultaneously 24x7
Weather, climate, solid earth (EarthScope).
Bioinformatics curated databases (Biocomplexity has only 1000's of data points at present).
Virtual Observatory and SkyServer in Astronomy.
Environmental sensor nets.
38
Weather Requirements
[Diagram: the Data Deluged Science Computing Paradigm -- Data becomes Information becomes Ideas; Simulation (Computational Science, with Model and Assimilation) and Datamining/Reasoning (Informatics) link the stages]
40
Virtual Observatory Astronomy Grid: Integrate Experiments
[Images: Radio, Far-Infrared, Visible, and Visible + X-ray views; Dust Map; Galaxy Density Map]
41
DAME Data Deluged Engineering
Rolls Royce and UK e-Science Program: Distributed Aircraft Maintenance Environment
[Diagram: in-flight data flows from the aircraft to a Ground Station, then over a global network such as SITA to the Engine Health (Data) Center; the Airline and Maintenance Centre are linked via Internet, e-mail and pager]
~ Gigabyte per aircraft per engine per transatlantic flight; ~5000 engines
42
USArray Seismic Sensors
43
[Figure: EarthScope/PBO observations -- topography (1 km), stress change, earthquakes; site-specific irregular scalar measurements and constellations for plate-boundary-scale vector measurements; ice sheets (Greenland), volcanoes (Long Valley, CA), earthquakes (Northridge, CA; Hector Mine, CA)]
44
Data Deluged Science Computing Architecture
[Diagram: distributed Data Filters massage data for an HPC Simulation; Analysis/Control and Visualization services, other Grid and Web services, OGSA-DAI Grid services, and Grid Data Assimilation are linked through the Grid]
45
Data Assimilation
Data assimilation implies one is solving some optimization problem, which might have a Kalman-filter-like structure:

  min_{Unknowns} Σ_{i=1}^{N_obs} (Data_i(position, time) − Simulated_Value_i)² / Error_i²

Due to the data deluge, one will become more and more dominated by the data (N_obs much larger than the number of simulation points).
The natural approach is to form, for each local (position, time) patch, the "important" data combinations, so that the optimization doesn't waste time on large-error or insensitive data.
Data reduction is done in a naturally distributed fashion, NOT on the HPC machine, as distributed computing is most cost-effective if the calculations are essentially independent.
• Filter functions must be transmitted from the HPC machine
46
Distributed Filtering
[Diagram: geographically distributed sensor patches send N_obs data per local patch through Data Filters on a distributed machine; the HPC machine sends the needed filter to each patch and receives back N_filtered data per patch]
N_obs(local patch) >> N_filtered(local patch) ≈ Number_of_Unknowns(local patch)
In the simplest approach, the filtered data are gotten by linear transformations on the original data, based on a Singular Value Decomposition of the least-squares matrix, factorizing the matrix into a product over local patches.
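A toy version of this reduction: a hand-written 2x4 filter projects N_obs = 4 raw observations in one patch down to N_filtered = 2 combinations. In practice the filter rows would come from an SVD of the least-squares matrix; the numbers here are made up for the sketch.

```python
# Toy distributed filtering: each patch applies a small precomputed
# linear filter (here hand-written; in practice derived from an SVD
# of the least-squares matrix) to reduce N_obs raw values to
# N_filtered combinations before anything is sent to the HPC machine.
def apply_filter(filter_rows, observations):
    # filtered[k] = sum_j F[k][j] * obs[j] -- a plain matrix-vector product
    return [sum(f * x for f, x in zip(row, observations))
            for row in filter_rows]

# One patch: N_obs = 4 observations, reduced to N_filtered = 2.
patch_obs = [1.0, 2.0, 3.0, 4.0]
filter_rows = [[0.5, 0.5, 0.0, 0.0],   # combination of first two obs
               [0.0, 0.0, 0.5, 0.5]]   # combination of last two obs
filtered = apply_filter(filter_rows, patch_obs)
print(filtered)   # only these N_filtered numbers travel to the HPC machine
```

The filter is computed once on the HPC machine and shipped out; each patch then does an independent, cheap matrix-vector product, which is why the reduction parallelizes so naturally.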