www.bsc.es pol alvarez, victor anton, rosa m. badia, javier conejero, carlos diaz, jorge ejarque...

Download Www.bsc.es Pol Alvarez, Victor Anton, Rosa M. Badia, Javier Conejero, Carlos Diaz, Jorge Ejarque Daniele Lezzi, Francesc Lordan Raül Sirvent, Cristian

If you can't read please download the document

Upload: rosamond-johnson

Post on 18-Jan-2018

213 views

Category:

Documents


0 download

DESCRIPTION

Outline (Feb 4 th 2016) Lunch Break (13:00 -14:00) Session 3 (14: :30) Hands-on I – Virtual Machine –Virtual Machine Setup - Jorge –Java Hands-on – Jorge Word-count taskified code Configuration, monitoring, debugging Graph generation Coffee break: 15:30 – 16:00 Session 4 (16:00 – 18:00): Hands-on II – MareNostrum –MN accounts –Python Hands-on (Javi, Sandra support) Word-count without annotations Annotate tasks Execution in MN Overview of tracing, trace analysis Code optimization (reduce in a tree) Final notes Support: Carlos, Sandra, Cristian, Victor 3

TRANSCRIPT

Pol Alvarez, Victor Anton, Rosa M. Badia, Javier Conejero, Carlos Diaz, Jorge Ejarque Daniele Lezzi, Francesc Lordan Ral Sirvent, Cristian Ramon-Cortes, COMPSs Tutorial February 19th 2015, Barcelona Outline (Feb 4 th 2016) Roundtable(9:00 - 9:30): Presentation and background of participants Session 1 (9:30 11:00): Introduction to COMPSs Programming model (Raul) Java Syntax (Raul) Demo: First Java example (Raul) Python syntax (Rosa) Demo: First Python example (Rosa) Coffee break (11:00 11:30) Session 2 (11:30 13:00): COMPSs execution environments (Jorge) Demo: First example executed in MN Demo: first example executed in cloud (video) Other sample codes -slides (Rosa) 2 Outline (Feb 4 th 2016) Lunch Break (13:00 -14:00) Session 3 (14: :30) Hands-on I Virtual Machine Virtual Machine Setup - Jorge Java Hands-on Jorge Word-count taskified code Configuration, monitoring, debugging Graph generation Coffee break: 15:30 16:00 Session 4 (16:00 18:00): Hands-on II MareNostrum MN accounts Python Hands-on (Javi, Sandra support) Word-count without annotations Annotate tasks Execution in MN Overview of tracing, trace analysis Code optimization (reduce in a tree) Final notes Support: Carlos, Sandra, Cristian, Victor 3 Introduction Motivation New complex architectures constantly emerging With their own way of programming them Fine grain: e.g. APIs to run with GPUs, NVMs (Non-Volatile Memories) Coarse grain: e.g. APIs to deploy in Clouds Difficult for programmers Higher learning curve / Time To Market (TTM) What about non computer scientists??? Difficult to understand what is going on during execution Was it fast? Could it be even faster? Am I paying more than I should? (Efficiency) Tune your application for each architecture (or cluster) E.g. partitioning data among nodes 5 Motivation Create tools that make users life easier Intermediate layer: let the difficult parts to those tools Act on behalf of the user Distributing the work through resources Dealing with architecture specifics Automatically improving performance Tools for visualization Monitoring Performance analysis 6 The parallel programming revolution Parallel programming in the past Where to place data What to run, where How to communicate Parallel programming in the future What do I need to compute What data do I need to use Hints (not necessarily very precise) on potential concurrency, locality, YOU PROGRAM SEQUENTIALLY!!! programmers mind Static system Dynamic Complexity: Divergence between our mental model and reality Variability 7 Living in the programming revolution At the beginning there was one language Simple interface Sequential program Simple interface Sequential program ILP ISA / API Programs decoupled from hardware Programs decoupled from hardware Applications Living in the programming revolution Multicores made the interface to leak ISA / API Address spaces (hierarchy, transfer), control flows, Applications Application logic + Platform specificities Application logic + Platform specificities Applications BSC Vision in the programming revolution (StarSs) Need to decouple again ISA / API General purpose Task based Single address space General purpose Task based Single address space Reuse architectural ideas under new constraints (Out-Of-Order execution) Reuse architectural ideas under new constraints (Out-Of-Order execution) Application logic Arch. independent Application logic Arch. independent Applications Power to the runtime PM: High-level, clean, abstract interface 10 BSC Vision in the programming revolution (StarSs) ISA / API Special purpose Must be easy to develop/maintain Special purpose Must be easy to develop/maintain Fast prototyping Applications Power to the runtime PM: High-level, clean, abstract interface DSL1 DSL2 DSL3 11 The StarSs Granularities StarSs OmpSsCluster Average task Granularity: 100 microseconds 10 milliseconds 10 ms - 1 day Language bindings: C, C++, FORTRAN Java, C/C++, Python Address space to compute dependences: Memory Files, Objects, NVMs SMPs, ClustersClusters, Clouds 12 Lets narrow the StarSs ideafor Distributed Architectures Cluster / Cloud applications are complex to develop Even more if you want to run things in parallel Goal 1: Keep a Sequential Programming Paradigm Writing an application for a computational distributed infrastructure should be as easy as writing a sequential application Goal 2: Exploit parallelism Run it as fast as possible Target applications: composed of tasks, most of them repetitive Granularity of the tasks: enough to be distributed (simulators, ) Data: files, objects, arrays and primitive types 14 15 Programming Model: Properties (I) Main Program { taskA(data1); for (int i=0; i< N; i++) taskB(data1, data2); if (condition) process(data2); } t taskA taskB main thread synch Based on sequential programming No APIs, no threading, no messaging No parallel constructs, no pragmas Sequential consistency Runtime System 16 ApplicationTask Selection Interface task GridClusterCloud Runtime System Supported Features Basic Features: Data dependency analysis Data transfer Task scheduling Resource management Results collection Fault tolerance Method and Web Service Tasks Advanced Features: Shared disks support Constraints based scheduling Task versioning support 17 18 for (int i = 0; i < N; i++) { newBWD = random(); subst(refCFG, newBWD, newCFG ); dimemas( newCFG, trace, dimOUT ); extract(newBWD, dimOUT, finalOUT ); if (i % 2 == 0) display( finalOUT ); } Main Program subst dimemas extract subst dimemas extract subst dimemas extract subst dimemas extract subst dimemas extract display Programming Model: Dependency detection Automatic on-the-fly creation of a task dependency graph Java Syntax COMPSs Bindings 20 Java Runtime C/C++ binding Python binding Java App C/C++ App Python App JNI C++ library PyCOMPSs Programming Model: Steps Resource 2 1. Identify tasks main program { } 2. Select tasks taskA(...); taskB(...); task selection interface { } taskA taskB Task Unit of parallelism... taskA Asynchrony taskB Resource 1 Resource N 21 22 public interface SampleItf = 1, memoryPhysicalSize = = servicess.Example) void = INOUT) Reply r = http://servicess.es/example, name = SampleService, port = SamplePort) Reply = IN) Query q ); } Programming Model: Task selection interface 23 Programming Model: Regular Main program public class App { public static void main(String[] args) { Query query = new Query(); Reply reply = myServiceOp(query); myMethod(reply); reply.printToLog(); } } Service task call Method task call Synchronization myServiceOp myMethod public class ServiceApp public static void sampleComposite() { Query query = new Query(); Reply reply = myServiceOp(query); myMethod(reply); reply.printToLog(); } myServiceOp myMethod Programming Model: Service Operation 24 Programming Model: Summary Resource 1... for (i=0; i Increment Des dEclipse exportar Fer execucio sequencial Anar al directori I fer el runcompss -> una primera sense monitor Segona execucio: Mostrar monitor -> graf, execucio Goal: same sequential applications runs in parallel 31 Python Syntax Why Python? Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open. * Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C Large community using it, including scientific and numeric Object-oriented programming and structured programming are fully supported Large number of software modules available (38,000 as of January 2014) 33 * From python.org PyCOMPSs: Task definition Task definition with Python decorators Provide information about task parameters (TYPE_DIRECTION): Type Only mandatory for files Inferred for the rest of the types Direction Default IN (read-only) Mandatory for INOUT (read-write) and OUT a = INOUT, b = FILE_OUT ) def my_func(a, b, c): 34 type inferred explicit type and direction type inferred, default direction (IN) PyCOMPSs: Task definition (II) decorator: special arguments Type of the return value mandatory if a value is = int) def ret_func(): return 1 Does the task modify the callee object? default True class = False) def instance_method(self): # self is NOT modified here Is it a priority task? Default = True) def prio_func(): 35 PyCOMPSs: Task types What can be selected as a task? (a) Functions (b) Instance methods (c) Class methods 36 class ( ) def my_i_method (self,( ) def my_c_method (cls, ): ( ) def my_function (): (a) (b) (c) PyCOMPSs: Main program Synchronization API Data created or updated by a task can be used in the main program of the application But we need to synchronize first! Two API methods for synchronization compss_open files my_file = file.txt func(my_file) fd = compss_open(my_file) compss_wait_on objects my_obj = MyClass() my_obj.method() my_obj = compss_wait_on(my_obj) 37 func is a task that modifies my_file method is a task that modifies my_obj PyCOMPSs: Main program Future objects Mechanism to make asynchronous those tasks that return a value Synchronization is only triggered when necessary The future object is a representative of the object yet to be = MyClass) def ret_func(): return MyClass() o = ret_func() 38 future object PyCOMPSs: Main program Future objects (II) A future object can be involved in a subsequent task call PyCOMPSs will automatically enforce the dependency o = ret_func() another_task(o) o.yet_another_task() Synchronization from main program (same as other objects): o = ret_func() o = compss_wait_on(o) 39 future object PyCOMPSs Constraints Enables definition of tasks constraints Resource to execute the task should meet the constraint Decorator definition: constraint2=value2, ) Examples of supported constratints: ProcessorArch ProcessorCoreCount MemoryPhysicalSize AppSoftware 40 PyCOMPSs: Wrap-up example Invoke tasks as Python functions/methods API for data synchronization 41 class () def myMethod (self): foo = Foo() myFunction( foo ) foo.myMethod() foo = compss_wait_on(foo) foo.bar() Main Program Function definition Task selection in function definition ( par = INOUT ) def myFunction (par): myF myM synch Demo Python Increment.py example Show code Make dem as Java demo Fer execucio sequencial Anar al directori I fer el runcompss -> una primera sense monitor Segona execucio: Mostrar monitor -> graf, execucio Goal: same sequential applications runs in parallel 42 Increment.py 43 Increment is an application that takes three different values and increases them a number of given times. The purpose of this application is to show parallelism between the different increments. def main_program(): # Check and get parameters if len(sys.argv) != 5: usage() exit(-1) N = int(sys.argv[1]) counter1 = int(sys.argv[2]) counter2 = int(sys.argv[3]) counter3 = int(sys.argv[4]) # Initialize counter files initializeCounters(counter1, counter2, counter3) print "Initial counter values:" printCounterValues() # Execute increment for i in range(N): increment(FILENAME1) increment(FILENAME2) increment(FILENAME3) # Write final counters state (sync) print "Final counter values:" printCounterValues() Counters stored in files Increment.py 44 Increment is an application that takes three different values and increases them a number of given times. The purpose of this application is to show parallelism between the different increments. def main_program(): # Check and get parameters if len(sys.argv) != 5: usage() exit(-1) N = int(sys.argv[1]) counter1 = int(sys.argv[2]) counter2 = int(sys.argv[3]) counter3 = int(sys.argv[4]) # Initialize counter files initializeCounters(counter1, counter2, counter3) print "Initial counter values:" printCounterValues() # Execute increment for i in range(N): increment(FILENAME1) increment(FILENAME2) increment(FILENAME3) # Write final counters state (sync) print "Final counter values:" = FILE_INOUT) def increment(filePath): # Read value fis = open(filePath, 'r') value = fis.read() fis.close() # Write value fos = open(filePath, 'w') fos.write(str(int(value) + 1)) fos.close() Increment.py Task graph identical to the Java case 45 Task Graph counter1counter2 counter3 1st iteration 2nd iteration 3rd iteration COMPSs Execution Environments Runtime System 47 ApplicationTask Selection Interface task GridClusterCloud Runtime System Execution Environment Configuration Overview 48 Runtime System Comm. Adaptor GAT NIO Job & Data Management Cloud Connector jClouds rOCCI Resource Management project.xml resources.xml Master-Worker Comm. Mechanism GAT: Restrictred environments (only ssh access) and Grid Middleware NIO: Efficient Persistent workers implementation Controlled and secured environments Provided as execution command argument Application Exec. Description Selection of resources Application Code Location Working directory Provided as execution command argument Resource Scalability jClouds: access to most of commercial public clouds rOCCI: OGF standard Extensible (support others..) Described in the resources and project files Infrastructure Description Describe the available resource in the infrastructure Describe Cloud Providers: Images and VM Templates 49 Remote host Demo Matrix Multiplication execution in bscgrid06 Clusters Demo Matrix Multiplication execution in MN Cloud Demo Matrix Multiplication execution in Google Compute Engine Basic Execution Examples Demo application: Block Matrix multiplication Code Example Task dependency graph for matrices of 2x2 blocks Dynamically Generated //Main Code for (int i = 0; i < MSIZE; i++){ for (int j = 0; j < MSIZE; j++){ for (int k = 0; k < MSIZE; k++){ multiplyAccumulative(C[i][j], A[i][k], B[k][j] ); } //Task Code public static void multiplyAccumulative ( Block c, Block a, Block b ) { for( int i = 0; i < c.bRows; i++ ) for( int j = 0; j < c.bCols; j++ ) for ( int k = 0; k < c.bCols; k++ ) c.data[i][j] += a.data[i][k] * b.data[k][j]; } 51 COMPSs in remote hosts (interactive) Typical setup: Master node: main program (+ master runtime) Worker nodes: tasks (+ worker runtime) Communication Adaptor GAT COMPSs Master RT App main program COMPSs Worker RT Task code Master Workers Described by resources.xml files NIO 52 Configuration: Resources Specification Resources.xml 0 short x Linux 0 short x Linux 1 8 Java... 1 8 Java... 53 Configuration: Project Specification Project.xml /opt/COMPSs/Runtime/scripts/ /tmp/ user 1 . /opt/COMPSs/Runtime/scripts/ /tmp/ user 1 . 54 DEMO: COMPSs using Remote hosts (interactive) Demo: Deploy code in worker Run the application with specific resources and project.xml and GAT the adaptor Communication Adaptor GAT... COMPSs Master RT App main program COMPSs Worker RT Task code localhost Bscgrid06 (Only ssh access is allowed) Cluster Compute Nodes Cluster Login Node 55 COMPSs in a Cluster (queue system) Execution divided in two phases Launch scripts queue a whole COMPSs app execution Actual execution starts when reservation is obtained Queue System (LSF, PBS,...) Launch scripts Automatically generated XML files Application COMPSs RT Communication Adaptor NIO MN Compute Nodes Application MN Login Node (mn1.bsc.es) 56 COMPSs in a Cluster (queue system) DEMO: Deploy app in MN Connect login node Launch enqueue_compss script requesting the number of nodes LSF Queue System Enqueue_compss COMPSs RT Communication Adaptor NIO 57 Create/delete VMs Execute /Copy COMPSs in Clouds Execution of COMPSs applications in Clouds Select de connector to interact the Cloud provider Adaptor to communicate VMs (NIO if provider supports firewall management, GAT if only ssh) Application COMPSs Runtime Cloud Connector jClouds rOCC I Comm. Adaptor NIO GAT Provider API 58 Cloud Configuration: Resources Specification Resources.xml https://bscgrid20.bsc.es:11443 integratedtoolkit.connectors.rocci.ROCCI > 59 Cloud Configuration: Project Specification Project.xml user-cred /home/.../cert.pem user userbsc user-cred /home/.../cert.pem user userbsc /opt/COMPSs/Runtime/scripts /tmp/ user /home/.../AppName.tar.gz /home/user 60 DEMO: COMPSs in Clouds Execution of COMPSs applications in Google Compute Engine (GCE) Application COMPSs Runtime Cloud Conn. jCloud s Comm. Adapt. NIO GCE API https://www.youtube.com/watch?v=XGaqUje_2zY 61 Combine Different Environment Cloud Bursting Multiple Grids COMPSs for scaling Web Service Real Applications implemented with COMPSs Advanced Usage Examples Local Cluster 62 Cloud Bursting Execution of COMPSs applications in Clouds Select de connector to interact Cloud providers connectors, SSH Application COMPSs Runtime 63 Cloud Bursting Increase/decrease number of VMs depending on task load Bursting to Amazon EC2 to face peak load 64 COMPSs in multiple Grids COMPSs Runtime 65 Web service implementations with COMPSs A WS method implements a worklfow of tasks Different invocations generate different tasks Runtime manages the execution of the different calls in the available services IDE for implementing and deploying applications 66 Building & Deployment: -Generate Packages -Define hosts & Deploy IDE for COMPSs applications Tasks Definition: -Service Operations (Orchestration -Tasks (Core Element ) 67 Applications using COMPSs Personalized medicine eIMRT: planning radiotherapy treatments Earth Science HRT: modeling global ocean-atmosphere circulation 3D render LuxRender: renderize architectural designs Civil Engineering EnergyPlus: modeling airflows in buildings Architrave: force-effects on buildings 68 Applications using COMPSs BioInformatics Discrete: simulate molecular dynamics for proteins Blast: alignments of protein sequences Hmmer: alignment of protein sequences GeneDetect: genetics algorithm QSAR: drug design REPET ABYSS Other sample codes Matrix multiply in Java 70 for (int i = 0; i < MSIZE; i++){ for (int j = 0; j < MSIZE; j++){ for (int k = 0; k < MSIZE; k++){ MatmulImpl.multiplyAccumulative( _C[i][j], _A[i][k], _B[k][j] ); } public static void multiplyAccumulative( String f3, String f1, String f2 ) { Block a = new Block( f1 ); Block b = new Block( f2 ); Block c = new Block( f3 ); c.multiplyAccum( a, b ); try } public void multiplyAccum ( Block a, Block b ) { for( int i = 0; i < this.bRows; i++ )// rows for( int j = 0; j < this.bCols; j++ )// cols for ( int k = 0; k < this.bCols; k++ )// cols this.data[i][j] += a.data[i][k] * b.data[k][j]; } Matrix multiply in Java 71 package matmul; import integratedtoolkit.types.annotations.Constraints; import integratedtoolkit.types.annotations.Method; import integratedtoolkit.types.annotations.Parameter; import integratedtoolkit.types.annotations.Parameter.*; public interface MatmulItf = 4, memoryPhysicalSize = = "matmul.MatmulImpl") void = Type.FILE, direction = Direction.INOUT) String = Type.FILE, direction = Direction.IN) String = Type.FILE, direction = Direction.IN) String file3 ); } Matrix multiply with constraints in Python 72 from pycompss.api.constraint import constraint from pycompss.api.task import task from pycompss.api.parameter= INOUT) def multiply(a, b, c): import numpy c += a*b args = sys.argv[1:] MSIZE = int(args[0]) BSIZE = int(args[1]) A = B = C = [] # Initialize A, B & C as np.array(BSIZE, BSIZE) initialize_variables() for i in range(MSIZE): for j in range(MSIZE): for k in range(MSIZE): multiply(A[i][k], B[k][j], C[i][j]) PyCOMPSs constraints 73 Worker (Numpy) Workerdef fitness(source, target): import numpy fitval = 0 a = numpy.array([ord(i) for i in source]) b = numpy.array([ord(i) for i in target]) fitval = numpy.linalg.norm(a-b) return def mutate(source): charpos = random.randint(0, len(source) - 1) parts = list(source) parts[charpos] = chr(ord(parts[charpos]) + random.randint(-1,1)) return ''.join(parts) fitval = fitness(source, target) i = 0 while True: i += 1 m = mutate(source) fitval_m = fitness(m, target) fitval_m = compss_wait_on(fitval_m) if fitval_m < fitval: fitval = fitval_m m = compss_wait_on(source) Sample code: PyCOMPSs 74 from pycompss.api.api import compss_wait_on size = int(numV / numFrag) X = [genFragment(size, dim) for _ in range(numFrag)] mu = init_random(dim, k) oldmu = [] n = 0 startTime = time.time() while not has_converged(mu, oldmu, epsilon, n, maxIterations): oldmu = mu clusters = [cluster_points_partial(X[f], mu, f * size) for f in range(numFrag)] partialResult = [partial_sum(X[f], clusters[f], f * size) for f in range(numFrag)] mu = merge_reduce(reduceCentersTask, partialResult) mu = compss_wait_on(mu) mu = [mu[c][1] / mu[c][0] for c in mu] n += 1 return (n, priority=True) def reduceCentersTask(a, b): for key in b: if key not in a: a[key] = b[key] else: a[key] = (a[key][0] + b[key][0], a[key][1] + b[key][1]) return priority=True) def reduceCentersTask(a, b): for key in b: if key not in a: a[key] = b[key] else: a[key] = (a[key][0] + b[key][0], a[key][1] + b[key][1]) return def partial_sum(XP, clusters, ind): import numpy as np XP = np.array(XP) p = [(i, [(XP[j - ind]) for j in clusters[i]]) for i in clusters] dic = {} for i, l in p: dic[i] = (len(l), np.sum(l, axis=0)) return def partial_sum(XP, clusters, ind): import numpy as np XP = np.array(XP) p = [(i, [(XP[j - ind]) for j in clusters[i]]) for i in clusters] dic = {} for i, l in p: dic[i] = (len(l), np.sum(l, axis=0)) return def cluster_points_partial(XP, mu, ind): import numpy as np dic = {} XP = np.array(XP) for x in enumerate(XP): bestmukey = min([(i[0], np.linalg.norm(x[1] - mu[i[0]])) for i in enumerate(mu)], key=lambda t: t[1])[0] if bestmukey not in dic: dic[bestmukey] = [x[0] + ind] else: dic[bestmukey].append(x[0] + ind) return def cluster_points_partial(XP, mu, ind): import numpy as np dic = {} XP = np.array(XP) for x in enumerate(XP): bestmukey = min([(i[0], np.linalg.norm(x[1] - mu[i[0]])) for i in enumerate(mu)], key=lambda t: t[1])[0] if bestmukey not in dic: dic[bestmukey] = [x[0] + ind] else: dic[bestmukey].append(x[0] + ind) return dic Sample code: PyCOMPSs Task graph: 8 fragments 4 iterations 75 Neuroscience Data Processing example Computation of mutual cross- correlations between all pairs of a set of spike data Also computes the cross-correlations for surrogate data sets for each neuron pair 76 f = open(./spikes.dat, r) spikes = pickle.load(f) f.close() #preallocate result variables num_ccs = (num_neurons**2 - num_neurons)/2 cc_orig = zeros((num_ccs,2*maxlag+1)) cc_surrs = zeros((num_ccs,2*maxlag+1,num_surrs)) idxrange = range(num_bins-maxlag,num_bins+maxlag+1) row = 0 #for all pairs ni,nj such that nj > ni for ni in range(num_neurons-1): for nj in range(ni+1,num_neurons): cc_orig[row,:] = correlate(spikes[ni,:],spikes[nj,:], num_spikes_i = sum(spikes[ni,:]) num_spikes_j = sum(spikes[nj,:]) for surrogate in range(num_surrs): surr_i = zeros(num_bins) surr_i[random.random_integers(0,num_bins-1,num_spikes_i)] = 1 surr_j = zeros(num_bins) surr_j[random.random_integers(0,num_bins-1,num_spikes_j)] = 1 cc_surrs[row,:,surrogate] = correlate(surr_i,surr_j,"full")[idxrange] row = row +1 #save results f = open(./result_cc_originals.dat,w) pickle.dump(cc_orig,f) f.close() f = open(./result_cc_surrogates.dat,w) pickle.dump(cc_surrs,f) f.close() Sequential code # tuple of all parallel python servers to connect with ppservers = ('comp1.my-network', 'comp2.my-network if len(sys.argv) > 1: ncpus = int(sys.argv[1]) #creates jobserver with ncpus workers job_server = pp.Server(ncpus, else: #creates jobserver with workers automatically detected job_server = pp.Server(ppservers=ppservers, #wait for servers to come up time.sleep(5) #calculate number of nodes in total nlocalworkers = job_server.get_ncpus() activenodes = job_server.get_active_nodes() workerids = activenodes.keys() nworkers=sum( [activenodes[workerids[i]] for i in range(len(workerids))] ) + nlocalworkers num_ccs = (num_neurons**2 - num_neurons)/2 #calculate number of pairs each worker should process step = ceil(float(num_ccs)/nworkers) start_idx = 0 end_idx = 0 starts = zeros((nworkers+1,)) Neuroscience Data Parallel Python Main program for worker in range(nworkers): start_idx = end_idx end_idx = int(min((worker+1)*step,num_ccs)) depfuncs = () depmodules = "numpy","pickle", jobs.append(job_server.submit(cc_surrogate_range, cc_original = zeros((num_ccs,2*maxlag+1)) cc_surrs = zeros((num_ccs,2*maxlag+1,2)) for worker in arange(nworkers): start = starts[worker] end = starts[worker + 1] result = jobs[worker]() cc_original[start:end,:] = result[0] cc_surrs[start:end,:,:] = result[1] f = open('./result_cc_originals.dat','w') pickle.dump(cc_original,f) f.close() f = open('./result_cc_surrogates_conf.dat','w') pickle.dump(cc_surrs,f) f.close() Main program (cont) def cc_surrogate_range(start_idx, end_idx, seed, num_neurons, num_surrs, num_bins, maxlag): Function definition Explicit resources declaration Explicit fork join Data back 77 Import sys from pycompss.api.api import compss_wait_on num_frags = int(sys.argv[1]) #calculate number of pairs per fragment num_ccs = (num_neurons**2 - num_neurons)/2 step = ceil(float(num_ccs)/num_frags) start_idx = 0 end_idx = 0 seed = delta = Neuroscience Data PyCOMPSs Main = INOUT, cc_surrs = INOUT, priority = True) def gather(result, cc_original, cc_surrs, start, end): cc_original[start:end,:] = result[0] cc_surrs[start:end,:,:] = = list) def cc_surrogate_range(start_idx, end_idx, seed, num_neurons, num_surrs, num_bins, maxlag): Tasks definition cc_original = zeros((num_ccs,2*maxlag+1)) cc_surrs = zeros((num_ccs,2*maxlag+1,2)) for frag in range(num_frags): start_idx = end_idx end_idx = int(min((frag+1)*step,num_ccs)) result = cc_surrogate_range(start_idx, end_idx, seed, gather(result, cc_original, cc_surrs, start_idx, end_idx) seed = seed + delta f = open('./result_cc_originals.dat','w') cc_original = compss_wait_on(cc_original) pickle.dump(cc_original,f) f.close() f = open('./result_cc_surrogates_conf.dat','w') cc_surrs = compss_wait_on(cc_surrs) pickle.dump(cc_surrs,f) f.close() Main program (cont) 78 HANDS-ON 80 COMPSs Virtual Machine Set-up Java Hands-on Compilation & Execution Configuration Monitoring, debugging, graph generation Python Hands-on Annotate tasks Execution in MN Overview of tracing, trace analysis Code optimization Hands-On: Overview 81 COMPSs development VM Installation COMPSs Virtual Appliance Available from website:VirtualBox: Import Virtual Appliance Java Hands-on Word Count 83 Counting words of a document Parallelization Split documents in blocks Count words of Blocks Merge results Java Hands On: Exercise Complete the Word Count parallelization with COMPSs Level 0: No Java background Look the implementation (wordcount project) Level 1: Basic Java background Define methods in the interface (wordcount_sequential) Level 2: Java background Define methods in the interface and complete the part of the main code with helper methods (wordcount_blanks) 84 Java Hands On: Exercise Solution 85 public interface WordcountItf = "wordcount.uniqueFile.Wordcount") public HashMap HashMap HashMap m2 = "wordcount.uniqueFile.Wordcount") HashMap = Type.FILE, direction = Direction.IN) String int int bsize ); } private static void computeWordCount() { HashMap result = new HashMap (); int start = 0; for (int i = 0; i < NUM_BLOCKS; ++i) { HashMap partialResult = wordCountBlock(DATA_FILE, start, BLOCK_SIZE); start = start + BLOCK_SIZE; result = mergeResults(result, partialResult); } System.out.println("[LOG] Counted Words is : " + result.keySet().size()); } Main Code Interface Java Hands On: Alternative Solution 86 private static void computeWordCount() { // Passing a block HashMap result = new HashMap (); String block; while ((block = getNextBlock()) != null){ HashMap partialResult = wordCountBlock(block); result = mergeResults(result, partialResult); } System.out.println("[LOG] Counted Words is : " + result.keySet().size()); } Main Code Interface public interface WordcountItf = "wordcount.uniqueFile.Wordcount") public HashMap HashMap HashMap m2 = "wordcount.uniqueFile.Wordcount") HashMap String block ); } Java Hands-on: Compilation and Simple Execution 87 Compilation (Eclipse IDE) Package Explorer -> Project (wordcount) -> Export (Solution) Use runcompss command to run the application runcompss [options] Exercise: Simple wordcount execution Usage: wordcount.uniqueFile.Wordcount cd ~/workspace_java/wordcount/jar runcompss wordcount.uniqueFile.Wordcount /home/compss/workspace_java/wordcount/data/file_short.txt 650 Java Hands-on: Result 88 runcompss wordcount.uniqueFile.Wordcount /workspace_java/wordcount/data/file_short.txt 500 Using default location for project file: /opt/COMPSs/Runtime/scripts/user/../../configuration/xml/projects/project.xml Using default location for resources file: /opt/COMPSs/Runtime/scripts/user/../../configuration/xml/resources/resources.xml Executing wordcount.uniqueFile.Wordcount WARNING: IT Properties file is null. Setting default values [ API] - Deploying COMPSs Runtime v1.3 (build xxxx) [ API] - Starting COMPSs Runtime v1.3 (build xxxx) DATA_FILE parameter value = /home/compss/workspace_java/wordcount/data/file_short.txt BLOCK_SIZE parameter value = 650 [LOG] Computing word count result [LOG] Counted Words is : 250 [ API] - No more tasks for app 1 [ API] - Getting Result Files 1 [ API] - Execution Finished Application Logs Java Hands-on: Configuration 89 /opt/COMPSs/Runtime/scripts/system /tmp/localhost /opt/COMPSs/Runtime/scripts/system /tmp/localhost Project.xml: /opt/COMPSs/Runtime/configuration/xml/projects/project.xml Other optional parameters User, AppDir, LibraryPath Java Hands-On: Configuration 90 Resources.xml: /opt/COMPSs/Runtime/configuration/xml/resources/resources.xml 0 short IA Linux 32 8 0 short IA Linux 32 8 2 8 Java 2 8 Java Affects to application parallelism Java Hands-On: Monitoring 91 The runtime of COMPSs provides real-time monitoring http://localhost:8080/compss-monitor/ If not started run as root: /etc/init.d/compss-monitor start The user can log-in and follow the progress of the executions Running tasks, resources usage, execution time per task, real-time execution graph, etc. Activate monitoring with a runcompss flag Setting a monitoring interval runcompss --monitoring= With a default monitoring interval runcompss m (or) runcompss --monitoring Exercise: run wordcount enabling monitoring cd ~/workspace_java/wordcount/jar runcompss m wordcount.uniqueFile.Wordcount /home/compss/workspace_java/wordcount/data/file_long.txt Java Hands-on: Debugging 92 Different log levels activated as runcompss options runcompss --log_level= (off: for performance | info: basic logging | debug: detect errors) runcompss debug or runcompss -d The output/errors of the main code of the application are shown in the console Other logging files are stored in: $HOME/.COMPSs/ _XX Inside this folder, the user can check the following: The output/error of a task # N : /jobs/jobN.[out|err] Messages from the COMPSs : runtime.log Task to resources allocation: resources.log Exercise: run wordcount with debugging cd ~/workspace_java/wordcount/jar runcompss d wordcount.uniqueFile.Wordcount /home/compss/workspace_java/wordcount/data/file_short.txt 650 Java Hands-on: Graph generation 93 To generate the graph of an application, it must be run with the monitor or graph flags activated runcompss -m (or) runcompss graph (or) runcompss -g The graph will be stored in: $HOME/.COMPSs/ _XX/monitor/complete_graph.dot To convert the graph to a PDF format use gengraph command Usage: gengraph Exercise: generate the graph for the wordcount application cd ~/workspace_java/wordcount/jar runcompss g wordcount.uniqueFile.Wordcount /home/compss/workspace_java/wordcount/data/file_short.txt 650 application execution cd ~/.COMPSs/wordcount.uniqueFile.Wordcount_04/monitor $~/.COMPSs/wordcount.uniqueFile.Wordcount_04/monitor> gengraph complete_graph.dot Output file: complete_graph.pdf $~/.COMPSs/wordcount.uniqueFile.Wordcount_04/monitor> evince complete_graph.pdf Python Hands-on PyCOMPSs Hands On: Exercise Complete the WordCount parallelization with PyCOMPSs Task definition with python decorators. Execution in MareNostrum III. Optimize the reduce method. 95 Word Count Counting words of a set of documents Parallelization Phase 1: Count words of a set of document Phase 2 : Merge results 96 First step - Connecting to MareNostrum III How to connect to MN3? > ssh -X Password: COMPSS.nct.2016.XXX Update.bashrc Edit:.bashrc Add: module load COMPSs/1.3 at the end Execute: source.bashrc Where is the source code? cd cp /gpfs/projects/nct01/nct01001/source/*. Where is the dataset? cp -r /gpfs/projects/nct01/nct01001/dataset/. Available editors vi emacs 97 launch_pycompss.sh wordcount.py launch_pycompss.sh wordcount.py Sequential (wordcount.py) def read_file(file_path): data = [] with open(file_path, 'r') as f: for line in f: data += line.split() return data def wordCount(data): partialResult = {} for entry in data: if entry in partialResult: partialResult[entry] += 1 else: partialResult[entry] = 1 return partialResult def merge_two_dicts(dic1, dic2): for k in dic2: if k in dic1: dic1[k] += dic2[k] else: dic1[k] = dic2[k] return dic1 if __name__ == "__main__": pathDataset = sys.argv[1] # Construct a list with the file's paths from the dataset paths = [] for fileName in os.listdir(pathDataset): paths.append(os.path.join(pathDataset, fileName)) # Read file's content data = map(read_file, paths) # From all file's data execute a wordcount on it partialResult = map(wordCount, data) # Accumulate the partial results to get the final result. result = reduce(merge_two_dicts, partialResult) Time Sequential Remember the dataset path How to launch with python sequentialy? > python wordcount.py /home/nct01/dataset/4files/ Results: Submit jobs to MareNostrum III All jobs should be submited to the queuing system (LSF) We will use a launcher script: launch_pycompss.sh Useful commands: bjobs This command shows the status of the job. bkill jobId This command kills a job with id jobId. 99 python wordcount.py /path/to/dataset/ Ellapsed Time (s) Words: PyCOMPSs Hands On: Exercise Complete the WordCount parallelization with PyCOMPSs Task definition with python decorators. Modify the wordcount.py Execution in MareNostrum III. Use launch_pycompss.sh Optimize the reduce method. 100 Execution in MareNostrum III - HandsOn launch_pycompss.sh Parameters: num_nodes: amount of nodes where to execute (1 master + 1 worker). tasks_per_node: amount of tasks that can be processed in parallel (1-16). Dataset path: /gpfs/home/nct01/nct01XXX/dataset/64files How to execute with PyCOMPSs? ./launch_pycompss.sh 101 #/bin/bash enqueue_compss \ --exec_time=10 \ --num_nodes=2 \ --tasks_per_node=8 \ --lang=python \ --classpath=/gpfs/home/nct01/nct01XXX/ \ --tracing=true \ --graph=true \ /home/nct01/nct01XXX/wordcount.py /gpfs/home/nct01/nct01XXX/dataset/64files PyCOMPSs (option1) def read_file(file_path): data = [] with open(file_path, 'r') as f: for line in f: data += line.split() return def wordCount(data): partialResult = {} for entry in data: if entry in partialResult: partialResult[entry] += 1 else: partialResult[entry] = 1 return priority=True) def merge_two_dicts(dic1, dic2): for k in dic2: if k in dic1: dic1[k] += dic2[k] else: dic1[k] = dic2[k] return dic1 if __name__ == "__main__": from pycompss.api.api import compss_wait_on pathDataset = sys.argv[1] # Construct a list with the file's paths from the dataset paths = [] for fileName in os.listdir(pathDataset): paths.append(os.path.join(pathDataset, fileName)) # Read file's content data = map(read_file, paths) # From all file's data execute a wordcount on it partialResult = map(wordCount, data) # Accumulate the partial results to get the final result. result = reduce(merge_two_dicts, partialResult) # Wait for result result = compss_wait_on(result) Tim e PyCOMPSs (option file_path=FILE_in) def read_file(file_path): data = [] with open(file_path, 'r') as f: for line in f: data += line.split() return def wordCount(data): partialResult = {} for entry in data: if entry in partialResult: partialResult[entry] += 1 else: partialResult[entry] = 1 return priority=True) def merge_two_dicts(dic1, dic2): for k in dic2: if k in dic1: dic1[k] += dic2[k] else: dic1[k] = dic2[k] return dic1 if __name__ == "__main__": from pycompss.api.api import compss_wait_on pathDataset = sys.argv[1] # Construct a list with the file's paths from the dataset paths = [] for fileName in os.listdir(pathDataset): paths.append(os.path.join(pathDataset, fileName)) # Read file's content data = map(read_file, paths) # From all file's data execute a wordcount on it partialResult = map(wordCount, data) # Accumulate the partial results to get the final result. result = reduce(merge_two_dicts, partialResult) # Wait for result result = compss_wait_on(result) Tim e Performance Analysis 104 Paraver is the BSC tool for trace visualization Trace events are encoding in Paraver (.prv) format by Extrae Paraver is a powerful tool for trace visualization. An experimented user could obtain many different views of the trace events. For more information about Paraver visit: http://www.bsc.es/computer-sciences/performance- tools/paraver Performance Analysis 105 COMPSs can generate post-execution traces of the distributed execution of the application Useful for performance analysis and diagnosis How it works? Task execution and file transfers are application events An XML file is created at workers to keep track of these events At the end of the execution all the XML files are merged to get the final trace file COMPSs uses Extrae tool to dynamically instrument the application In a worker: Extrae keeps track of the events in an intermediate file In the master: Extrae merges the intermediate files to get the final trace file Performance Analysis Executing wc_reduce.py Welcome to Extrae 3.1.1rc (revision 3360 based on extrae/trunk) Extrae: Generating intermediate files for Paraver traces. Extrae: Intermediate files will be stored in /.statelite/tmpfs/gpfs/home/bsc19/bsc19000/Apps/WC/src/tutorial Extrae: Tracing buffer can hold events Extrae: Tracing mode is set to: Detail. Extrae: Successfully initiated with 1 tasks [ API] - Deploying COMPSs Runtime v1.3 (build ) [ API] - Starting COMPSs Runtime v1.3 (build )... [ API] - No more tasks for app 0 [ API] - Getting Result Files 0 [ API] - Execution Finished... Extrae: Application has ended. Tracing has been terminated. merger: Output trace format is: Paraver merger: Extrae 3.1.1rc (revision 3360 based on extrae/trunk) mpi2prv: Selected output trace format is Paraver mpi2prv: Parsing intermediate files mpi2prv: Generating tracefile (intermediate buffers of events) mpi2prv: Congratulations!./trace/wc_reduce.py_compss_trace_ prv has been generated COMPSs runtime starts Extrae starts before the user application execution The merge process starts COMPSs runtime ends The final trace file is generated Intermediate trace files are processed The application finishes and the tracing process ends Performance Analysis Open Paraver > module load paraver > cd $HOME/.COMPSs/wordcount.py_01 > wxparaver trace/*.prv COMPSs provides some configuration files to automatically obtain the view of the trace File/Load Configuration... (/gpfs/apps/MN3/COMPSs/1.3/Dependencies/paraver/cfgs/compss_tasks.cfg) Performance Analysis Fit window Right click on the trace window Fit Semantic Scale/ Fit Both View Event flags Right click on the trace window View / Event Flags Performance Analysis Show info Panel Right click on the trace window Check info panel option Select Colors tab of the panel Legend with task names Threads Tasks Execution time Performance Analysis Zoom to see details Select a region in the trace window to see in detail And repeat the process until the needed zoom level The undo zoom option is in the right click panel Previous task ends Processor is idle New task starts Performance Analysis Summarizing: Lines in the trace: One line for the master N lines for the workers Meaning of the colours: Black: idle Other colors: task running see the color legend Flags (events): Start / end of task PyCOMPSs Hands On: Exercise Complete the WordCount parallelization with PyCOMPSs Task definition with python decorators. Execution in MareNostrum III. Optimize the reduce method. 112 PyCOMPSs (option file_path=FILE_in) def read_file(file_path): data = [] with open(file_path, 'r') as f: for line in f: data += line.split() return def wordCount(data): partialResult = {} for entry in data: if entry in partialResult: partialResult[entry] += 1 else: partialResult[entry] = 1 return priority=True) def merge_two_dicts(dic1, dic2): for k in dic2: if k in dic1: dic1[k] += dic2[k] else: dic1[k] = dic2[k] return dic1 if __name__ == "__main__": from pycompss.api.api import compss_wait_on pathDataset = sys.argv[1] # Construct a list with the file's paths from the dataset paths = [] for fileName in os.listdir(pathDataset): paths.append(os.path.join(pathDataset, fileName)) # Read file's content data = map(read_file, paths) # From all file's data execute a wordcount on it partialResult = map(wordCount, data) # Accumulate the partial results to get the final result. result = reduce(merge_two_dicts, partialResult) # Wait for result result = compss_wait_on(result) Tim e PyCOMPSs file_path=FILE_in) def read_file(file_path): data = [] with open(file_path, 'r') as f: for line in f: data += line.split() return def wordCount(data): partialResult = {} for entry in data: if entry in partialResult: partialResult[entry] += 1 else: partialResult[entry] = 1 return priority=True) def merge_two_dicts(dic1, dic2): for k in dic2: if k in dic1: dic1[k] += dic2[k] else: dic1[k] = dic2[k] return dic1 if __name__ == "__main__": from pycompss.api.api import compss_wait_on pathDataset = sys.argv[1] # Construct a list with the file's paths from the dataset paths = [] for fileName in os.listdir(pathDataset): paths.append(os.path.join(pathDataset, fileName)) # Read file's content data = map(read_file, paths) # From all file's data execute a wordcount on it partialResult = map(wordCount, data) # Accumulate the partial results to get the final result. result = mergeReduce(merge_two_dicts, partialResult) # Wait for result result = compss_wait_on(result) Tim e (, ) 115 mergeReduce function Solution d1d1 def mergeReduce(function, data): """ Apply function cumulatively to the items of data, from left to right in binary tree structure, so as to reduce the data to a single value. :param function: function to apply to reduce data :param data: List of items to be reduced :return: result of reduce the data to a single value """ from collections import deque q = deque(xrange(len(data))) while len(q): x = q.popleft() if len(q): y = q.popleft() data[x] = function(data[x], data[y]) q.append(x) else: return data[x] d2d2 d3d3 d4d4 f q= 134 d1d1 = (, ) d3d3 f = 2 d1d1 d3d3 d1d1 f = d1d Execution in MareNostrum III - HandsOn launch_pycompss.sh Parameters: num_nodes: amount of nodes where to execute (1 master + 1 worker). tasks_per_node: amount of tasks that can be processed in parallel (1-16). Dataset path: /gpfs/home/nct01/nct01XXX/dataset/64files How to execute with PyCOMPSs? ./launch_pycompss.sh 116 #/bin/bash enqueue_compss \ --exec_time=10 \ --num_nodes=2 \ --tasks_per_node=8 \ --lang=python \ --classpath=/gpfs/home/nct01/nct01XXX/ \ --tracing=true \ --graph=true \ /home/nct01/nct01XXX/wordcount_optimized.py /gpfs/home/nct01/nct01XXX/dataset/64files FINAL NOTES Final Notes Sequential programming approach Parallelization at task level Transparent data management and remote execution Can operate on different infrastructures: Cluster Grid Cloud (Public/Private) PaaS IaaS Web services 118 Final Notes Project page: http://www.bsc.es/compsshttp://www.bsc.es/compss Direct downloads page: http://www.bsc.es/computer-sciences/grid-computing/comp- superscalar/downloadhttp://www.bsc.es/computer-sciences/grid-computing/comp- superscalar/download Virtual Appliance for testing & sample applications Tutorials Red-Hat & Debian based installation packages Source Code Application Repository http://compss.bsc.es/projects/bar/wiki/Applicationshttp://compss.bsc.es/projects/bar/wiki/Applications Several examples of applications developed with COMPSs 119 Thank you! For further information please contact