Basic Competency Exam: Introduction to Computer Organization


2010 International Journal of Computer Applications (0975-8887), Volume 1, No. 12

Performance Assessment using Text Mining

Radha Shankarmani
Asst. Prof., SPIT, Sardar Patel Institute of Technology
Munshi Nagar, Andheri (W), Mumbai - 400 058

Nikhil Kedar
Student, SPIT
903, Sai Darshan, Versova, Andheri (W), Mumbai - 400 058

Khandelwal
Student, SPIT
B-401, Mahesh Tower, Sector-2 Charkop, Kavdivli (W), Mumbai - 400 067

ABSTRACT

In this paper we use text mining, a feature of Web Intelligence, to derive information from unstructured textual data on the web and devise a consensus-based strategy for business decisions. This has a twofold advantage: first, it mitigates risk early; second, it supports our understanding and decision making. The concept is explained with the example of evaluating a player's performance based on minute-to-minute commentary of a match. Parameters such as the player's position on the field (for example, in football: defender, midfielder, forward, or goalkeeper), past performance, present fitness and form, and other such parameters are considered. A weightage/value for each parameter is decided, and information can be derived for analyzing a player's performance. During analysis we view the comments and read through fan forums, blogs, newspaper reviews of the play, expert commentator views, etc. These are then used as a correction factor to enhance the credibility of the model.

Existing search engines have many remarkable capabilities, but what is not among them is deduction capability: the capability to synthesize an answer to a query from bodies of information residing in various parts of the World Wide Web. Web Intelligence is an area of research which attempts to provide this capability. The whole procedure involves four main stages: web crawling, i.e. identifying information resources; information retrieval and extraction; text mining; and finally converting unstructured data to structured data.

Keywords
Web Intelligence, NLP, Text Mining, Information Extraction, Information Retrieval, GATE.

1. IDENTIFYING INFORMATION RESOURCES

The first phase is to gather information. Information cannot be gathered from the entire web due to its vastness, so to reduce search time, static sources (web sites) are chosen for extracting information. The authenticity, correctness, and up-to-dateness of such sources are very important for information retrieval. Apart from these static sources, there may be sources that provide additional information but are unknown to the user. A web crawler (also known as a web spider) is a program or automated script that browses the World Wide Web to find such sources.
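As a rough illustration of the crawling idea (this sketch is ours, not part of the original project; the seed URL, page limit, and reliance on Python's standard library are all assumptions), a minimal breadth-first crawler can be written as:

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href targets of all <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=20):
        """Visit pages breadth-first from a seed URL, returning the URLs seen."""
        seen, frontier = set(), [seed]
        while frontier and len(seen) < max_pages:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue  # skip unreachable pages and non-HTTP links
            parser = LinkParser()
            parser.feed(html)
            # resolve relative links against the current page
            frontier.extend(urljoin(url, link) for link in parser.links)
        return seen

A production crawler would additionally respect robots.txt, throttle its requests, and restrict itself to the identified static sources.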

Static sites that give the performance of all the players and their rankings/values are identified, and the relevant information is retrieved. In addition to performance, information regarding the popularity or fan following of a player can be judged from blog sites.

2. INFORMATION RETRIEVAL

An information retrieval system identifies the documents and pages in a collection that match a user query, allowing us to narrow down the set of documents relevant to the particular problem. In our example, the analysis of minute-to-minute commentary, news flows, and blogs will be done using text mining.

As text mining involves applying very computationally intensive algorithms to large document collections, it can be limited to the documents returned by information retrieval. Information retrieval can therefore speed up analysis considerably by reducing the number of documents to be analyzed. As most of the resources used to gather information are static, not all the information in the form of web pages is required for further analysis.

In the case study considered, information is derived from sites where a minute-by-minute description of football matches is available. The project intends to evaluate players based on expert opinions combined with reviews from common people obtained from blogs. IR systems allow us to retrieve such documents, which eases the process of text mining. These documents are then fed to the information extraction systems, which are the systems used for text mining.
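To make the narrowing step concrete, here is a small sketch (our own illustration with an invented overlap score, not a description of any particular IR system) of filtering a document collection down to those relevant to a query:

    def relevance(doc_tokens, query_tokens):
        """Crude overlap score: fraction of query terms found in the document."""
        doc = set(doc_tokens)
        return sum(1 for t in query_tokens if t in doc) / max(len(query_tokens), 1)

    def retrieve(docs, query, threshold=0.5):
        """Keep only documents that mention at least half of the query terms."""
        q = query.lower().split()
        return [d for d in docs if relevance(d.lower().split(), q) >= threshold]

    # retrieve(commentary_pages, "goal penalty save")  # hypothetical usage

Real IR systems use inverted indexes and weighting schemes such as TF-IDF, but the effect is the same: only a small, relevant subset of pages reaches the expensive text mining stage.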

3. INFORMATION EXTRACTION

After shortlisting the required web pages, text mining is applied. The following three phases - information extraction, text mining, and converting unstructured data to structured data - are closely related to each other.

Information extraction is the process of automatically obtaining structured data from unstructured natural-language documents. Often this involves defining the general form of the information we are interested in as one or more templates, which are then used to guide the extraction process. Information extraction relies heavily on the data generated by NLP systems.

The role of natural language processing in text mining is to provide linguistic data to the next phase. This is done by annotating documents with information such as sentence boundaries, parts of speech, and parsing results. NLP may be deep (parsing every part of every sentence and attempting to account semantically for every part) or shallow (parsing only certain passages or phrases within sentences, or producing only a limited semantic analysis), and it may even use statistical means to disambiguate word senses or multiple parses of the same sentence.

Tasks that information extraction systems can perform include:

A) Term analysis: analysis of one or more words and multi-word terms, in documents such as papers, PDFs, etc.

B) Named entity recognition: identification of names, dates, expressions of time, quantities, and associated percentages and units.

C) Fact extraction: relationships between entities or events.

The data generated during the information extraction phase is structured information derived from unstructured textual data. Information extraction gives information in the form of relationships: the analysis of words and phrases gives us information about an entity and the relationship of that entity with other entities. In our case this information is used to design a database which can serve as a useful tool to evaluate a player's performance. The database contains attributes such as goals scored, red cards, assists, etc., for allocating points.
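A minimal sketch of such a points database (the schema, table name, and point weight below are our assumptions for illustration; the paper does not specify them) could look like this in Python with SQLite:

    import sqlite3

    conn = sqlite3.connect("players.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS performance (
                        player    TEXT PRIMARY KEY,
                        goals     INTEGER DEFAULT 0,
                        assists   INTEGER DEFAULT 0,
                        red_cards INTEGER DEFAULT 0,
                        points    INTEGER DEFAULT 0)""")

    def record_goal(player, weight=5):
        """Credit one goal to the player and add its weighted points."""
        conn.execute("""INSERT INTO performance (player, goals, points)
                        VALUES (?, 1, ?)
                        ON CONFLICT(player) DO UPDATE SET
                            goals  = goals + 1,
                            points = points + ?""", (player, weight, weight))
        conn.commit()

Each extracted event (goal, card, assist) would map to one such update, so the table always reflects the cumulative points of every player.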

Text mining using NLP is one approach that can be used; other approaches, such as the Semantic Web and OWL, can also provide a solution.

In the Semantic Web approach we search for keywords; each keyword is considered a class if it can be further described (i.e. it has attributes), and an attribute if it has no further description. For example, a player's name can be a class, and the number of goals scored can form an attribute.

In OWL, we create a data dictionary which has all the possible replacements that can be used for a particular attribute, e.g. a player could be referred to by his jersey number or by his nickname. The information extracted using NLP also contains similar replacements for an attribute; in this case we can create our own data dictionary to use as a reference.

The contents of the tables are updated at the end of every match. This structured data (e.g. the number of goals scored by a particular player) is used to determine performance. To get information about a player's performance in a particular match, we make use of the textual data available as minute-to-minute commentary on the web. To get information about a player's popularity, we can make use of fan forums and blogs; considering the performance of a player in a recent match, the blogs can give the people's verdict. Natural language processing can then be used to process this information, and analysis of this information can be used in performance evaluation.

4. IMPLEMENTATION

In our project, the process of text mining is implemented using an open-source toolkit, GATE (General Architecture for Text Engineering). The IE system in GATE is called ANNIE, i.e. A Nearly-New Information Extraction System (developed by Hamish Cunningham, Valentin Tablan, Diana Maynard, Kalina Bontcheva, Marin Dimitrov, and others).

The functioning of ANNIE relies on finite state algorithms and the JAPE (Java Annotation Patterns Engine) language. ANNIE has two parts: language resources and processing resources. Language resources consist of GATE documents and corpora. The source of a GATE document can be specified either by giving the path of a locally stored file on the hard disk or by specifying the URL of a web page. Document formats supported by GATE are XML, HTML, SGML, plain text, RTF, email, PDF, and Microsoft Word. A corpus in GATE is a Java Set whose members are documents; ANNIE can run only on a corpus.

Processing resources are responsible for performing the text mining on the corpus. Many plugins are available that provide various functionalities; the important ones applicable in our project are the Tokeniser, the Gazetteer, and the JAPE Transducer.

The Tokeniser splits the text into simple tokens. There are five types of token: word, number, symbol, punctuation, and space token. The aim is to limit the work of the Tokeniser in order to maximize efficiency, and to enable greater flexibility by placing the burden on the grammar rules (JAPE), which are more adaptable.
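The following sketch mimics this kind of tokenisation in Python (the regular expressions are our own simplification; GATE's actual Tokeniser is rule-driven and more refined):

    import re

    # token classes loosely mirroring the five GATE token types
    TOKEN_RE = re.compile(r"""(?P<word>[A-Za-z]+)
                             |(?P<number>\d+)
                             |(?P<spacetoken>\s+)
                             |(?P<punctuation>[.,;:!?])
                             |(?P<symbol>\S)""", re.VERBOSE)

    def tokenise(text):
        """Yield (token_type, token_text) pairs for each simple token."""
        for m in TOKEN_RE.finditer(text):
            yield m.lastgroup, m.group()

    # list(tokenise("Goal in minute 42!")) yields word, spacetoken, number,
    # and punctuation tokens in document order.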

The gazetteer lists used are plain text files, with one entry per line. An index file (lists.def) is used to access these lists; for each list, a major type is specified and, optionally, a minor type. These lists are compiled into finite state machines. Any text tokens matched by these machines will be annotated with features specifying the major and minor types. Grammar rules (JAPE) then specify the types to be identified in particular circumstances. Each gazetteer list should reside in the same directory as the index file.

JAPE provides finite state transduction over annotations based on regular expressions. A JAPE grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand side (LHS) of a rule consists of an annotation pattern that may contain regular-expression operators (e.g. *, ?, +). The right-hand side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels attached to the pattern elements. The RHS of a rule contains information about the annotation: information about the matched annotation is carried over from the LHS using the label just described and annotated with the entity type (which follows it); finally, attributes and their corresponding values are added to the annotation. Alternatively, the RHS of a rule can contain Java code to create or manipulate annotations. JAPE grammars are written as files with the extension .jape, which are parsed and compiled at run time to be executed over the GATE document(s).

In our project we have created gazetteer lists consisting of the names of players, organized by the club to which they belong and by their position. The Tokeniser is used to annotate the entries in the gazetteer lists. The JAPE rules we wrote use these annotations on the LHS to identify the player, and the Java code in the RHS assigns points to that player. Different rules are written to identify various actions such as a goal scored, a red/yellow card shown, a save made, a penalty missed, a foul committed, a clearance made, and so on (a plain-Python sketch of this pipeline is given after the figure captions below). The following two figures describe the implementation.

Figure 3. A screenshot of GATE, showing the various language resources and processing resources in GATE, with ANNIE loaded to run on the corpus.

Figure 4. A screenshot of the database, showing the allocation of points to a particular player based on his performance in the match.
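The GATE/JAPE machinery itself cannot be reproduced in a few lines, but the overall effect of our gazetteer-plus-rules pipeline can be sketched in plain Python (the player names, patterns, and point weights below are invented for illustration and are not the project's actual rules):

    import re

    # stand-in for a gazetteer list: player name -> club
    GAZETTEER = {"PlayerA": "ClubX", "PlayerB": "ClubY"}

    # event patterns and the points each event is worth
    RULES = [
        (re.compile(r"goal by (\w+)"), 5),
        (re.compile(r"yellow card for (\w+)"), -2),
        (re.compile(r"save by (\w+)"), 3),
    ]

    def score_commentary(lines, points):
        """Apply every rule to every commentary line; credit known players."""
        for line in lines:
            for pattern, value in RULES:
                for name in pattern.findall(line):
                    if name in GAZETTEER:  # only gazetteer entries are annotated
                        points[name] = points.get(name, 0) + value
        return points

    # score_commentary(["goal by PlayerA", "yellow card for PlayerB"], {})
    #   -> {'PlayerA': 5, 'PlayerB': -2}

In the real implementation, the matching is done by JAPE's finite state transducers over annotations, and the point assignment by Java code on the RHS of each rule.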

5. OTHER APPLICATIONS

In the above example, the information derived was used to provide a performance analysis of players in the form of points, i.e. structured data. But crawlers along with text mining can support a number of applications; the only things that change are the information resources and the analysis of the information.

To track vote counts during presidential elections, or to track news about the stock market and the performance of firms, the sites to crawl will be different (e.g. news channels' websites, stock exchange websites), and the analysis, instead of a database, can be a dynamic application showing textual updates. One can get information regarding the quarterly reports of a company, or about the status of a particular stock and expert opinions on that stock, from various websites/news channels, all on a single screen. Such an analysis can help a prospective trader decide which shares to buy, when to buy, and when to sell.

Other applications include analyzing reviews and the performance of different products through a comparative study, to identify the best product based on the individual needs of a customer. For example, if a customer intends to buy a digital camera, he would want a comparative report across the various companies to decide the best product for himself. For this, the websites of companies like SONY, NIKON, etc., as well as those of retail outlets, need to be crawled.


6. CONCLUSION

The goal of web intelligence is to retrieve information about the customer decision process, customer needs, and customer behavior. Retrieving this information gives marketing intelligence the opportunity to improve its predictive models and to create a serious customer view. In this article, we used the example of a football league to explain web intelligence using text mining techniques for player assessment.

7. ACKNOWLEDGEMENT

We are very grateful and indebted to our project guide, Prof. Radha Shankarmani, for her enduring patience, guidance, and invaluable suggestions. She was a constant source of inspiration for us and took the utmost interest in our project. We would also like to thank all the IT staff members for their invaluable co-operation. We are also thankful to all the students for their useful advice and immense co-operation. Lastly, we would like to convey our regards to the developers of GATE, as well as its other worldwide users, for their constant support and guidance through the online mailing lists.

8. REFERENCES

[1] GATE, www.gate.ac.uk. General Architecture for Text Engineering (GATE) is a Java software toolkit developed at the University of Sheffield since 1995.

[2] Web Intelligence: kis.maebashi-it.ac.jp/wi01/, www.web-intelligence.com/

[3] Muslea, I. (Ed.). (2004). Papers from the AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), San Jose, CA. AAAI Press.

[4] Weiguo Fan, et al., Tapping the Power of Text Mining, Communications of the ACM, 49(9), 2006.

[5] Tan, A.-H. (1999), Text Mining: The State of the Art and the Challenges, in Proceedings, PAKDD'99 Workshop on Knowledge Discovery from Advanced Databases, Beijing, April 1999.

[6] Intelligence on the Web: www.fas.org/irp/intelwww.html; WIN: Web Intelligence Network, smarter.net/

[7] J. Srivastava et al., Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12-23.

[8] Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996), From Data Mining to Knowledge Discovery: An Overview, in U. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, Mass.

[9] IEEE 2000. IEEE Standard for Modeling and Simulation (M&S) High Level Architecture (HLA) Federate Interface Specification. IEEE Std 1516.1-2000. IEEE Computer Society, New York, NY.

Existing search engines have remarkable capabilities, but what they lack is deduction capability: the ability to synthesize an answer to a query from bodies of information residing in various parts of the World Wide Web. Web intelligence is a research area that attempts to provide this capability.

This paper uses text mining, a feature of web intelligence, to derive information from unstructured textual data on the web and to devise a consensus-based strategy for business decisions. This yields two advantages: first, it mitigates risk early; second, it supports understanding and decision making. The concept is explained with the example of evaluating a player's performance based on minute-to-minute match commentary. Parameters such as the player's position on the field (for example, in football: defenders, midfielders, and goalkeeper), previous performance, current fitness, and other parameters are considered. A value for each parameter is determined, and information can be drawn to analyze the player's performance. During analysis we look at comments and read fan forums, blogs, newspaper reviews of the play, expert commentators' views, etc. These are used as a correction factor to improve the credibility of the model.

The whole procedure involves four main stages: web crawling, i.e. identifying information sources; information retrieval and extraction; text mining; and finally converting unstructured data into structured data.

The first stage is gathering information. Information cannot be collected from the entire web because of its vastness. To reduce search time, static sources (websites) are chosen for extracting information. The authenticity, correctness, and currency of such sources are very important for information retrieval. Apart from these static sources, there may be sources that can provide additional information but are unknown to the user. A web crawler (also known as a web spider) is a program or automated script that browses the World Wide Web to find such sources.

An information retrieval system identifies the documents and pages in a collection that match a user's query. Information retrieval systems allow us to narrow down the set of documents relevant to a particular problem.

After shortlisting, text mining is applied to the web pages. The following three phases - information extraction, text mining, and converting unstructured data into structured data - are closely related to one another.

In the example above, the information obtained was used to provide a performance analysis of players in the form of points, i.e. structured data. But crawlers together with text mining can support a number of applications; the only things that change are the information sources and the analysis process.

The goal of web intelligence is to retrieve information about the customer decision process, customer needs, and customer behavior. Retrieving this information gives marketing intelligence the opportunity to improve its predictive models and to create a serious customer view. This article used the example of a football league to explain web intelligence using text mining techniques for player assessment.

National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012)
Proceedings published by International Journal of Computer Applications (IJCA)

Multicore Processing for Classification and Clustering Algorithms

V. Vaitheeshwaran, School of Computing, KL University, Vijayawada, India
Kapil Kumar Nagwanshi, School of Computing, KL University, Vijayawada, India
T. V. Rao, School of Computing, KL University, Vijayawada, India

ABSTRACT

Data mining algorithms such as classification and clustering are the future of computation, though they require multidimensional data processing. People are using multicore processors with GPUs, yet most programming languages don't provide multiprocessing facilities, and processing resources are therefore wasted. Clustering and classification algorithms are especially resource-consuming. In this paper we show strategies to overcome such deficiencies using the multicore processing platform OpenCL.

Keywords
Parallel Processing, Clustering, Classification, OpenCL, CUDA, NVIDIA, AMD, GPU.

1. INTRODUCTION

CLUSTERING is an unsupervised learning technique that separates data items into a number of groups, such that items in the same cluster are more similar to each other and items in different clusters tend to be dissimilar, according to some measure of similarity or proximity. Pizzuti and Talia [1] present P-AutoClass, a scalable parallel clustering technique for mining large data sets. Data clustering is an important task in the area of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters, such that items in the same cluster are more similar to each other than items in different clusters according to some defined criteria. Clustering algorithms are computationally intensive, particularly when used to analyze large amounts of data. A possible approach to reducing the processing time is the implementation of clustering algorithms on scalable parallel computers. Their paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based upon the Bayesian model for determining optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges intermediate results with the other processors. The system architecture, its implementation, and experimental performance results for different processor counts and data sets are presented and compared with theoretical performance. In particular, the experimental and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.

Unlike supervised learning, where training examples are associated with a class label that expresses the membership of every example in a class, clustering assumes no information about the distribution of the objects; it has the task of both discovering the classes present in the data set and assigning objects among those classes in the best way. A large number of clustering methods have been developed in several different fields, with different definitions of clusters and of similarity among objects. The variety of clustering techniques is reflected by the variety of terms used for cluster analysis, such as clumping, competitive learning, unsupervised pattern recognition, vector quantization, partitioning, and winner-take-all learning.

Most of the early cluster analysis algorithms come from the area of statistics and were originally designed for relatively small data sets. Fayyad et al. [2] found that clustering algorithms have been extended to work efficiently for knowledge discovery in large databases and, therefore, to classify large data sets with high-dimensional feature items. Clustering algorithms are very computationally demanding and thus require high-performance machines to get results in a reasonable amount of time. Hunter and States [3] describe classification on protein databases, where clustering runs taking one week or about 20 days of computation time on sequential machines are not rare. Scalable parallel computers can provide an appropriate setting in which to efficiently execute clustering algorithms for extracting knowledge from large-scale databases, and recently there has been increasing interest in parallel implementations of data clustering algorithms. A variety of parallel approaches to clustering have been explored [4], [5], [6], [7].

Classification: Classification is one of the primary data mining tasks [8]. The input to a classification system consists of example tuples, called a training set, with each tuple having several attributes. Attributes can be continuous, coming from an ordered domain, or categorical, coming from an unordered domain. A special class attribute indicates the label or category to which an example belongs. The goal of classification is to induce a model from the training set that can be used to predict the class of a new tuple. Classification has applications in diverse fields such as retail target marketing, fraud detection, and medical diagnosis (Michie, 1994). Among the many classification methods proposed over the years [9][10], decision trees are particularly suited for data mining, since they can be built relatively fast compared to other methods and are easy to interpret [11]. Trees can also be converted into SQL statements that can be used to access databases efficiently [12]. Finally, decision-tree classifiers obtain similar, and often better, accuracy compared to other methods [10]. Prior to the interest in classification for database-centric data mining, it was tacitly assumed that training sets could fit in memory; recent work has targeted the massive training sets usual in data mining. Developing classification models using larger training sets can enable the development of higher-accuracy models, and various studies have confirmed this [13].

Recent classifiers that can handle disk-resident data include SLIQ [14], SPRINT [15], and CLOUDS [16]. As data continue to grow in size and complexity, high-performance scalable data mining tools must necessarily rely on parallel computing techniques. Past research on parallel classification has focused on distributed-memory (also called shared-nothing) machines. Examples include parallel ID3 [17], which assumed that the entire dataset could fit in memory; the Darwin toolkit with parallel CART [18] from Thinking Machines, whose details are not available in the published literature; parallel SPRINT on the IBM SP2 [15]; and ScalParC [19] on a Cray T3D. While distributed-memory machines provide massive parallelism, shared-memory machines (also called shared-everything systems) are also capable of delivering high performance for low to medium degrees of parallelism at an economically attractive price. Increasingly, SMP machines are being networked together via high-speed links to form hierarchical clusters; examples include the SGI Origin 2000 and the IBM SP2 system, which can have an 8-way SMP as one high node. A shared-memory system offers a single memory address space that all processors can access. Processors communicate through shared variables in memory, and synchronization is used to co-ordinate processes. Any processor can also access any disk attached to the system. The SMP architecture offers new challenges and trade-offs that are worth investigating in their own right.

2. THE GPU ARCHITECTURE

Initially intended as a fixed many-core processor dedicated to transforming 3-D scenes into a 2-D image composed of pixels, the GPU architecture has undergone several innovations to meet the computationally demanding needs of supercomputing research groups across the globe. The traditional GPU pipeline, designed to serve its original purpose, came with several disadvantages. Shortcomings such as the limited data reuse in the pipeline, excessive variation in hardware usage, and the lack of integer instructions coupled with weak floating-point precision rendered the traditional GPU a weak candidate for HPC. In November 2006 [20], NVIDIA introduced the GeForce 8800 GTX with a novel unified pipeline and shader architecture. In addition to overcoming the limitations of the traditional GPU pipeline, the GeForce 8800 GTX architecture added the concept of the streaming processor (SMP) architecture that is highly pertinent to current GP-GPU programming. SMPs can work together in close proximity with extremely high parallel processing power. The outputs produced can be stored in fast cache and used by other SMPs. SMPs have instruction decoder units and execution logic performing similar operations on the data. This architecture allows SIMD instructions to be efficiently mapped across groups of SMPs. The streaming processors are accompanied by units for texture fetch (TF), texture addressing (TA), and caches. This structure is maintained and scaled up to 128 SMPs in the GeForce 8800 GTX. The SMPs operate at 1.35 GHz in the GeForce 8800 GTX, separate from the core clock operating at 575 MHz. Several GP-GPUs used thus far for HPC applications have architectures similar to the GeForce 8800 GTX architecture. However, the introduction of Fermi by NVIDIA in September 2009 [21] has radically changed the contours of the GP-GPU architecture, as we will explore in the next subsection.

The GPU's amazing evolution in both computational capability and functionality extends the application of GPUs to the field of non-graphics computations, so-called general-purpose computation on GPUs (GPGPU) [22]. Design and development of GPGPU are becoming significant for the following reasons:

1. Cost-performance: Using only commodity hardware is important for achieving high-performance computing at low cost, and GPUs have become commonplace even in low-end PCs. Due to a hardware architecture designed for exploiting the parallelism of graphics, even today's low-end GPU exhibits high performance for data-parallel computing. In addition, the GPU has much higher sequential memory access performance than the CPU, because one of the GPU's key tasks is filling regions of memory with contiguous texture data. That is, the GPU's dedicated memory can provide data to the GPU's processing units at high memory bandwidth.

2. Evolution speed: GPU performance, such as the number of floating-point operations per second, has been growing at a rapid pace. These rapidly evolving GPU capabilities may enable the GPU implementation of a task to outperform its CPU implementation in the future. Modern GPUs have two kinds of programmable processors, the vertex shader and the fragment shader, on the graphics pipeline to render an image.

Figure 1 illustrates a block diagram of the programmable rendering pipeline of these processors. The vertex shader handles the transformation and lighting of the vertices of polygons to transform them into the viewing coordinate system. Polygons projected into the viewing coordinate system are then decomposed into fragments, each corresponding to a pixel on the screen. Subsequently, the color and depth of a fragment are computed by the fragment shader. Finally, composition operations, such as tests using the depth, alpha, and stencil buffers, are applied to the outputs of the fragment shader to determine the final pixel colors written to the frame buffer.

It is emphasized here that vertex and fragment shaders are designed to exploit multi-grain parallelism in the rendering process: coarse-grain vertex/fragment-level parallelism and fine-grain vector-component-level parallelism. To exploit the coarse-grain parallelism at the GPU level, individual vertices and fragments can be processed in parallel; the fragment shaders (and vertex shaders) of recent GPUs have several processing units for parallel-processing multiple fragments (vertices). For example, NVIDIA's high-end GPU, the GeForce 6800 Ultra, has 16 processing units in the fragment shader and can therefore compute the colors and depths of up to 16 fragments at the same time. On the other hand, to exploit the fine-grain parallelism involved in all vector operations, the shaders have SIMD instructions that can simultaneously operate on four 32-bit floating-point values within a 128-bit register. For example, one of the powerful SIMD instructions, the multiply-and-add (MAD) instruction, performs a component-wise multiply of two registers, each storing four floating-point components, and then does a component-wise addition of the product to another register; the MAD instruction thus performs these eight floating-point operations in a single cycle. Due to its application-specific architecture, however, the GPU does not work well universally. To achieve high performance for a non-graphics application, we must therefore consider how to map it onto the GPU's programmable rendering pipeline.

Fig. 1. Overview of a programmable rendering pipeline.

The most critical restriction in GPU programming for non-graphics applications is the restricted data flow within and between the vertex shader and the fragment shader. The arrows in Fig. 1 show the typically permitted data flows. Both vertex and fragment shader programs have to write their outputs to write-only dedicated registers; random-access writes are not provided. This is a severe impediment to the effective implementation of many data structures and algorithms. In addition, the lack of loop controls, conditionals, and branching is a serious problem for most practical applications. Although the latest GPUs with Shader Model 3.0 support dynamic control flow, flow-control operations carry some overhead and can limit the GPU's performance. If an application violates the restrictions of the GPU programming model mentioned above, it is not a good idea to implement the application entirely on GPUs. For such an application, collaboration between CPU and GPU often leads to better performance, though it is essential to keep the time-consuming data transfer between them to a minimum. From the viewpoint of data accessibility, the fragment shader is superior to the vertex shader because the fragment shader can randomly access the video memory and fetch data as texture colors. Furthermore, the fragment shader usually has more processing units than the vertex shader, and thereby has greater potential to exploit data parallelism effectively. Consequently, this paper presents an implementation of data clustering accelerated effectively using multi-grain parallel processing on the fragment shader.

2.1 GPU Computing with the AMD/ATi Radeon 5870

The AMD/ATi Radeon 5870 architecture [23] is very different from NVIDIA's Fermi architecture. The AMD/ATi Radeon 5870 used in our study has 1600 ALUs organized in a different fashion than the Fermi. The ALUs are grouped into five-ALU Very Long Instruction Word (VLIW) processor units. While all five ALUs can issue the basic arithmetic operations, only the fifth ALU can additionally execute transcendental operations. The five-ALU groups, along with the branch execution unit and general-purpose registers, form another group called the stream core. This translates to 320 stream cores in all, which are further grouped into compute units. Each compute unit has 16 stream cores, resulting in 20 total compute units in the ATi Radeon 5870. One thread can be executed on one stream core, so 16 threads can run on a single compute unit. To hide memory latency, 64 threads are assigned to a single compute unit: when one 16-thread group accesses memory, another 16-thread group executes on the ALUs. Theoretically, therefore, a throughput of 16 threads per cycle is possible on the Radeon architecture. Each ALU can execute a maximum of 2 single-precision flops (multiply and add instructions) per cycle. The clock rate of the Radeon GPU is 850 MHz; for 1600 ALUs this translates to a throughput of 2.72 TFLOPS. The Radeon 5870 has a memory hierarchy similar to Fermi's, including global memory, L1 and L2 caches, shared memory, and registers. The 1 GB global memory has a peak bandwidth of 153 GB/s and is controlled by eight memory controllers. Each compute unit has an 8 KB L1 cache with an aggregate bandwidth of 1 TB/s. Multiple compute units share a 512 KB L2 cache, with 435 GB/s of bandwidth between the L1 and L2 caches. Each compute unit also has 32 KB of shared memory, providing a total aggregate bandwidth of 2 TB/s. The registers have the highest bandwidth: 48 bytes per cycle in each stream core (an aggregate bandwidth of 48 bytes x 320 stream cores x 850 MHz, i.e. about 13 TB/s). A 256 KB register space is available per compute unit, totaling 5.1 MB for the entire GPU.

3. THE K-MEANS ALGORITHM

In data clustering [24], multivariate data units are grouped according to their similarity or dissimilarity. MacQueen used the term k-means to denote the process of assigning each data unit to that cluster (of k clusters) with the nearest centroid. That is, k-means clustering employs the Euclidean distance between data units as the dissimilarity measure; a partition of data units is assessed by the squared error

$$E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - y_j \rVert^2, \qquad (3.1)$$

where $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, m$ is a data unit, $y_j \in \mathbb{R}^d$, $j = 1, 2, \ldots, k$ denotes a cluster centroid, and $C_j$ is the set of data units assigned to cluster $j$.

Although there is a vast variety of k-means algorithms [5], for simplicity of explanation this paper focuses on a simple, standard k-means algorithm, summarized as follows (a sketch in code follows the list):

1. Begin with any desirable initial state; e.g. initial cluster centroids may be drawn randomly from the given data set.

2. Allocate each data unit to the cluster with the nearest centroid. The centroids remain fixed through the entire data set.

3. Calculate the centroids of the new clusters.

4. Repeat Steps 2 and 3 until a convergence condition is met, e.g. no data units change their membership at Step 2, or the number of repetitions exceeds a predefined threshold.
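A compact sketch of these four steps (our own illustration in Python/NumPy, independent of the paper's GPU implementation):

    import numpy as np

    def kmeans(x, k, max_iter=100, seed=0):
        """Standard k-means as summarized above; x has shape (m, d)."""
        rng = np.random.default_rng(seed)
        # Step 1: draw initial centroids randomly from the data set
        y = x[rng.choice(len(x), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: squared Euclidean distances, shape (m, k), by broadcasting
            d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)  # nearest centroid per data unit
            # Step 3: recompute centroids (keep the old one if a cluster empties)
            y_new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                              else y[j] for j in range(k)])
            # Step 4: stop when no centroid moves
            if np.allclose(y_new, y):
                break
            y = y_new
        return labels, y

The distance matrix in Step 2 corresponds exactly to the km distance computations discussed next, which is why that step dominates the running time.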

At each repetition, the assignment of m data units to k clusters in Step 2 requires km distance computations (and (k - 1)m distance comparisons) to find the nearest cluster centroids, the so-called nearest neighbor search. The cost of each distance computation increases in proportion to the dimension of the data, i.e. the number of vector elements in a data unit, d. The nearest neighbor search consists of approximately 3dkm floating-point operations, and thus its computational cost grows as O(dkm). In practical applications, the nearest neighbor search consumes most of the execution time of k-means clustering, because m and/or d are often tremendous. However, the nearest neighbor search involves massive SIMD parallelism: the distance between every pair of a data unit and a cluster centroid can be computed in parallel, and each distance computation can further be parallelized over its vector components.

This motivates us to implement the distance computation on recent programmable GPUs as multi-grain SIMD-parallel coprocessors. On the other hand, there is no need to consider accelerating Steps 1 and 4 with GPU programming, because they require little execution time and contain almost no parallelism. In Step 3, cluster centroid recalculation consists of dm additions and dk divisions of floating-point values. Although most of these calculations can be performed in parallel, conditionals and random-access writes are required for an effective implementation of individually summing up the vectors within each cluster. In addition, the divisions require conditional branching to prevent divide-by-zero errors. Since the execution time of Step 3 is much less than that of Step 2, there is no room for a performance improvement that outweighs the overheads arising from the lack of random-access writes and conditionals in GPU programming. Therefore, we decide to implement Steps 1, 3, and 4 as CPU tasks.
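Since the paper's abstract names OpenCL as the target platform, the offloaded Step 2 might look like the following host-side sketch using the pyopencl bindings (the kernel, buffer layout, and one-work-item-per-data-unit mapping are our assumptions, not the authors' code; in a real implementation the context and buffers would be created once and reused across repetitions):

    import numpy as np
    import pyopencl as cl

    KERNEL = """
    __kernel void dist2(__global const float *x, __global const float *y,
                        __global float *out, const int d, const int k) {
        int i = get_global_id(0);          /* one work-item per data unit */
        for (int j = 0; j < k; ++j) {
            float s = 0.0f;
            for (int c = 0; c < d; ++c) {  /* component-wise squared distance */
                float diff = x[i * d + c] - y[j * d + c];
                s += diff * diff;
            }
            out[i * k + j] = s;            /* distance to centroid j */
        }
    }
    """

    def gpu_distances(x, y):
        """Squared distances between every data unit and every centroid."""
        m, d = x.shape
        k = y.shape[0]
        ctx = cl.create_some_context()
        queue = cl.CommandQueue(ctx)
        mf = cl.mem_flags
        xb = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                       hostbuf=np.ascontiguousarray(x, dtype=np.float32))
        yb = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
                       hostbuf=np.ascontiguousarray(y, dtype=np.float32))
        out = np.empty((m, k), dtype=np.float32)
        ob = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)
        prog = cl.Program(ctx, KERNEL).build()
        prog.dist2(queue, (m,), None, xb, yb, ob, np.int32(d), np.int32(k))
        cl.enqueue_copy(queue, out, ob)    # the read-back discussed in Sec. 4
        return out

The host then takes the argmin over each row (Step 2's comparisons) and performs Steps 1, 3, and 4 on the CPU, as decided above.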

4. DISCUSSION

Fig. 2. Parallel processing of data clustering that uses the GPU as a co-processor to exploit two kinds of data parallelism in the nearest neighbor search.

Our preliminary analysis shows that the data transfer from CPU to GPU at each rendering pass is not a bottleneck in the nearest neighbor search. This is because the large data set has already been placed in the GPU-side video memory in advance; only the geometry data of a polygon, including texture coordinates as a cluster centroid, are transferred at each rendering pass. On the other hand, the data transfer from the GPU-side video memory to the main memory induces a certain overhead even when using the PCI-Express interface. Therefore, we should be judicious about reading data back from the GPU, even when the GPU is connected via the PCI-Express interface. In our implementation scheme, the overhead of the data transfer is negligible except for trivial-scale data clustering, because the data placed in the GPU-side video memory are transferred only once in each repetition. Accordingly, our implementation scheme of data clustering with GPU co-processing can exploit the GPU's computing performance without critical degradation attributable to the data transfer between CPU and GPU.

5. CONCLUSION

Further research tasks include using a cluster of GPUs for texture classification. This could be done by including several GPUs in the same machine or by dividing the computation across a cluster of PCs, and it would significantly increase the applicability of the architecture to complex industrial applications. With this approach, a deeper analysis of the parallelization strategy for the multi-GPU setting introduced here is needed. Undoubtedly, the increase in performance will not be proportional to the number of GPUs, and will be hurt by factors such as data communication and synchronization among different hosts.

We have proposed a three-level hierarchical parallel processing scheme for the k-means algorithm using a modern programmable GPU as a SIMD-parallel co-processor. Based on the divide-and-conquer approach, the proposed scheme divides a large-scale data clustering task into subtasks of clustering small subsets, and the subtasks are executed on a PC cluster system in an embarrassingly parallel manner. In the subtasks, a GPU is used as a multi-grain SIMD-parallel co-processor to accelerate the nearest neighbor search, which consumes a considerable part of the execution time of the k-means algorithm. The distances from one cluster centroid to several data units are computed in parallel, and each distance computation is parallelized by component-wise SIMD instructions. As a result, parallel data clustering with GPU co-processing significantly improves the computational efficiency of massive data clustering. Experimental results clearly show that the proposed hierarchical parallel processing scheme remarkably accelerates massive data clustering tasks. In particular, acceleration of the nearest neighbor search by GPU co-processing saves significant total execution time in spite of the overhead of the data transfer from the GPU-side video memory to the CPU-side main memory. GPU co-processing is also effective in retaining the scalability of the proposed scheme by accelerating the aggregation stage, a non-parallelized part of the proposed scheme.

This paper has discussed the GPU implementation of the nearest neighbor search and compared it with the CPU implementation to clarify the performance gain of GPU co-processing. However, a multi-threading approach could allow both the CPU and the GPU to execute the nearest neighbor search in parallel without interrupting each other. In such an implementation, GPU co-processing would always bring additional computing power, even when only a low-end GPU is available. A multi-threading implementation with effective load balancing between CPU and GPU will be investigated in our future work.

6. REFERENCES

[1] C. Pizzuti and D. Talia, "P-AutoClass: Scalable Parallel Clustering for Mining Large Data Sets," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 629-641, 2003.

[2] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, From Data Mining to Knowledge Discovery: An Overview, NY: AAAI/MIT Press, 1996.

[3] L. Hunter and D. States, "Bayesian Classification of Protein Structure," IEEE Expert, vol. 7, no. 4, pp. 67-75, 1992.

[4] C. Olson, "Parallel Algorithms for Hierarchical Clustering," Parallel Computing, vol. 21, pp. 1313-1325, 1995.

[5] D. Judd, P. McKinley and A. Jain, "Large-Scale Parallel Data Clustering," in Int'l Conf. Pattern Recognition, New York, 1996.

[6] J. Potts, Seeking Parallelism in Discovery Programs, Master's thesis, Univ. of Texas, Arlington, 1996.

[7] K. Stoffel and A. Belkoniene, "Parallel K-Means Clustering for Large Data Sets," in Parallel Processing, UK, 1999.

[8] R. Agrawal, T. Imielinski and A. Swami, "Database mining: A performance perspective," vol. 5, no. 6, pp. 914-925, Dec 1993.

[9] S. Weiss and C. Kulikowski, Computer Systems that Learn, vol. 1, New York: Morgan Kaufmann, 1991.

[10] D. Michie, Machine Learning, Neural and Statistical Classification, vol. I, NJ: Ellis Horwood, 1994.

[11] J. Quinlan, Programs for Machine Learning, vol. I, New York: Morgan Kaufmann, 1999.

[12] R. Agrawal, "An interval classifier for database mining applications," in VLDB Conference, New York, Aug 1992.

[13] J. Catlett, Megainduction: Machine Learning on Very Large Databases, PhD thesis, Univ. of Sydney, 1991.

[14] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A fast scalable classifier for data mining," in 5th Intl. Conf. on Extending Database Technology, NJ, March 1996.

[15] J. Shafer, R. Agrawal and M. Mehta, "SPRINT: A scalable parallel classifier for data mining," in 22nd VLDB Conference, NJ, Sept 1996.

[16] K. Alsabti, S. Ranka and V. Singh, "CLOUDS: A decision tree classifier for large datasets," in 4th Intl. Conf. on Knowledge Discovery and Data Mining, Aug 1998.

[17] D. Fifield, Distributed Tree Construction from Large Data Sets, Bachelor's thesis, Australian Natl. Univ., 1992.

[18] L. Breiman, Classification and Regression Trees, Belmont: Wadsworth, 1984.

[19] M. Joshi, G. Karypis and V. Kumar, "ScalParC: A scalable and parallel classification algorithm for mining large datasets," in Intl. Parallel Processing Symp., 1998.

[20] "Technical Brief: NVIDIA GeForce 8800 GPU architecture overview," [Online]. Available: www.nvidia.com.

[21] "NVIDIA's next generation CUDA compute architecture: Fermi," [Online]. Available: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf.

[22] Z. Fan, F. Qiu, A. Kaufman and S. Yoakum-Stover, "GPU cluster for high performance computing," NY, 2004.

[23] "ATI Mobility Radeon HD 5870 GPU specifications," [Online]. Available: http://www.amd.com/us/products/notebook/graphics/ati-mobility-hd-5800/Pages/hd-5870-specs.aspx.

[24] D. Judd, P. McKinley and A. Jain, "Performance Evaluation on Large-Scale Parallel Clustering in NOW Environments," in Eighth SIAM Conf. Parallel Processing for Scientific Computing, Mar 1997.

FEMBI REKRISNA GRANDEA PUTRA
M0513019

BASIC COMPETENCY EXAM 2: INTRODUCTION TO COMPUTER ORGANIZATION

    Data mining algorithms such as classification and clustering are the future of computing, even though they require multidimensional data processing. Users today have multicore processors paired with GPUs, yet most programming languages provide no multiprocessing facilities, and processing resources are therefore wasted. Clustering and classification are among the algorithms that consume the most resources. The paper demonstrates a strategy for overcoming this shortcoming by using OpenCL as a multicore processing platform.
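
    To make the gap concrete: the data-parallel part of such an algorithm maps naturally onto an OpenCL kernel. The sketch below, in OpenCL C (a C dialect), shows an assignment step for k-means; the kernel name, buffer layout and argument list are illustrative assumptions, not code from the paper.

    /* Each work-item assigns one data point to its nearest centroid. */
    __kernel void assign_points(__global const float *points,    /* n x d, row-major */
                                __global const float *centroids, /* k x d, row-major */
                                __global int *labels,
                                const int n, const int d, const int k)
    {
        int i = get_global_id(0);
        if (i >= n) return;

        int best = 0;
        float best_dist = FLT_MAX;               /* predefined in OpenCL C */
        for (int c = 0; c < k; ++c) {
            float dist = 0.0f;
            for (int j = 0; j < d; ++j) {
                float diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;             /* squared Euclidean distance */
            }
            if (dist < best_dist) { best_dist = dist; best = c; }
        }
        labels[i] = best;
    }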

    CLUSTERING is an unsupervised learning technique that separates data items into groups, such that items in the same cluster are more similar to one another and items in different clusters tend to differ, according to some measure of similarity or proximity. Pizzuti and Talia present P-AutoClass, a technique for scalable parallel clustering of large data sets. Data clustering is an important task in the field of data mining. Clustering is the unsupervised classification of data items into homogeneous groups called clusters. Clustering methods partition a set of data items into clusters such that items in the same cluster are more similar to each other than items in different clusters, according to some defined criterion. Clustering algorithms are computationally intensive, especially when they are used to analyze large amounts of data. One possible approach to reducing the processing time is based on running the clustering algorithm on a scalable parallel computer. The paper describes the design and implementation of P-AutoClass, a parallel version of the AutoClass system based on a Bayesian model for determining the optimal classes in large data sets. The P-AutoClass implementation divides the clustering task among the processors of a multicomputer so that each processor works on its own partition and exchanges its results with the other processors. The system architecture, its implementation, and experimental performance results for different numbers of processors and data sets are presented and compared with the theoretical performance. In particular, the measured and predicted scalability and efficiency of P-AutoClass versus the sequential AutoClass system are evaluated and compared.
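
    A minimal sketch of the partition-and-exchange pattern attributed to P-AutoClass above, written in C with MPI; the routine accumulate_local_stats and the flat statistics layout are our own illustrative assumptions, not P-AutoClass's actual interface.

    #include <stdlib.h>
    #include <mpi.h>

    /* Assumed user routine: accumulates per-class sufficient statistics
     * (counts, sums, ...) over this processor's partition of the data. */
    void accumulate_local_stats(const float *local_data, int local_n,
                                int k, int stat_len, double *stats);

    void clustering_round(const float *local_data, int local_n,
                          int k, int stat_len, double *global_stats)
    {
        double *local_stats = calloc((size_t)k * stat_len, sizeof *local_stats);

        /* Each processor works only on its own partition of the data set. */
        accumulate_local_stats(local_data, local_n, k, stat_len, local_stats);

        /* Exchange results with the other processors: every rank ends up
         * with the sum of all partial statistics. */
        MPI_Allreduce(local_stats, global_stats, k * stat_len,
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        free(local_stats);
    }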

    Originally intended as fixed many-core processors dedicated to transforming 3-D scenes into 2-D images composed of pixels, GPU architectures have undergone several innovations to meet the computing demands of supercomputing research groups around the world. The traditional GPU pipeline, designed to serve that original purpose, came with several drawbacks: limited reuse of data within the pipeline, excessive variation in hardware utilization, and a lack of integer instructions coupled with weak floating-point precision made the traditional GPU a weak candidate for HPC. In November 2006, NVIDIA introduced the GeForce 8800 GTX with a new unified pipeline and shader architecture. Besides overcoming the limitations of the traditional GPU pipeline, the GeForce 8800 GTX architecture added


    the concept of the streaming processor (SMP) architecture that is highly relevant to today's GP-GPU programming. SMPs can work closely together, yielding very high parallel processing power. The output they produce can be stored in a fast cache and used by other SMPs. An SMP has an instruction decoder unit and execution logic that perform similar operations on data. This architecture allows SIMD instructions to be mapped efficiently across groups of SMPs. The streaming processors are accompanied by units for texture fetch (TF), texture addressing (TA), and caching. This structure was retained and scaled up to 128 SMPs in the GeForce 8800 GTX. The SMPs in the GeForce 8800 GTX operate at 1.35 GHz, separately from the 575 MHz core clock. Several of the GP-GPUs used so far for HPC applications have architectures similar to that of the GeForce 8800 GTX. However, NVIDIA's introduction of Fermi in September 2009 [21] has radically changed the contours of GP-GPU architecture, as we explore in the next subsection.

    Our initial analysis shows that the transfer of data from CPU to GPU in each rendering pass is not the bottleneck in the nearest-neighbour search. This is because the large data set has been placed in GPU-side video memory in advance; only the polygon geometry data, including the texture coordinates serving as the cluster centroids, is transferred in each rendering pass. On the other hand, transferring data from GPU-side video memory to main memory induces a certain overhead even when using the PCI-Express interface. We must therefore be judicious about reading data back from the GPU, even when the GPU is connected through a PCI-Express interface. In our implementation scheme, the data transfer overhead is negligible, except when clustering trivially small data sets, because the data placed in GPU-side video memory is transferred only once in each iteration. Thus, our implementation scheme for data clustering with GPU co-processing can exploit the computing performance of the GPU without critical degradation from data transfers between the CPU and the GPU.
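
    A host-side sketch in C, using standard OpenCL API calls, of the transfer discipline described above: the large point array crosses the bus once, only the small centroid array moves per iteration, and read-back is kept to the compact label vector. All names are illustrative, and context, program and clSetKernelArg setup are assumed to have been done beforehand.

    #include <CL/cl.h>

    void kmeans_gpu_loop(cl_command_queue queue, cl_kernel kernel,
                         cl_mem d_points, cl_mem d_centroids, cl_mem d_labels,
                         const float *points, float *centroids, int *labels,
                         size_t n, size_t k, size_t d, int iters)
    {
        size_t global = n;

        /* One-time upload: the large data set becomes resident on the GPU. */
        clEnqueueWriteBuffer(queue, d_points, CL_TRUE, 0,
                             n * d * sizeof(float), points, 0, NULL, NULL);

        for (int it = 0; it < iters; ++it) {
            /* Only the k x d centroid array crosses the bus each iteration. */
            clEnqueueWriteBuffer(queue, d_centroids, CL_TRUE, 0,
                                 k * d * sizeof(float), centroids, 0, NULL, NULL);

            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                                   0, NULL, NULL);

            /* Read-back is the costly direction, so fetch only the labels,
             * then recompute the centroids on the host. */
            clEnqueueReadBuffer(queue, d_labels, CL_TRUE, 0,
                                n * sizeof(int), labels, 0, NULL, NULL);

            /* ... update `centroids` from `labels` on the CPU ... */
        }
    }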

    Further research tasks for this work include using a group of GPUs for texture classification. This could be done either by installing several GPUs in the same machine or by dividing the computation across a cluster of PCs, and it would significantly increase the applicability of the architecture to complex industrial applications. With this approach, a deeper analysis of the parallelization strategies introduced for multiple GPUs becomes necessary. Undoubtedly, the performance gain will not be proportional to the number of GPUs and will be eroded by factors such as data communication and synchronization between different hosts. We have proposed a three-level hierarchical parallel processing scheme for the k-means algorithm that uses a modern programmable GPU as a SIMD-parallel co-processor.


  • International Journal of Electronics, Vol. 95, No. 2, February 2008, pp. 125-138
    DOI: 10.1080/00207210701828085

    An efficient memory allocation algorithm and hardware design with VHDL synthesis

    F. KARABIBER*, A. SERTBAS, S. OZDEMIR and H. CAM

    Computer Engineering Department, Istanbul University, 34320, Avcılar, Istanbul; Computer Engineering Department, Gazi University, 06570, Maltepe, Ankara; Computer Science and Engineering Department, Arizona State University, Tempe, AZ 85287

    (Received 27 November 2006; in final form 21 November 2007)

    This paper presents a hardware-efficient memory allocation technique, called EMA, that detects the existence of any free block of requested size in memory. EMA can allocate a free memory block of any number of chunks in any part of memory without any internal fragmentation. The gate-level design of the hardware unit, along with its area-time measurements, is given in this paper. Simulation results indicate that EMA is fast and flexible enough to allocate/deallocate a free block in any part of memory, resulting in efficient utilization of memory space. In addition, the VHDL synthesis with FPGA implementation shows that EMA has less complicated hardware, and is faster, than the known hardware techniques.

    Keywords: Memory allocation; Buddy system; Digital system design; VHDL synthesis; Simulation; FPGA

    1. Introduction

    High-performance algorithms for dynamic memory allocation (DMA) have been of considerable interest in the literature, as DMA is a critical factor for improving a computer system's performance. The common allocation techniques can be divided into four categories: sequential fits, segregated fits, buddy systems and bitmapped fits. The sequential fits and segregated fits algorithms keep a free list of all available memory chunks and scan the list for allocation. These two methods produce good memory utilization (storage), but incur a time penalty because of scanning the free-space list. The bitmapped fits method uses two bitmaps for the allocation process, one for the requested allocation size and another for encoding the allocated block boundaries, i.e. the size of allocated blocks. If the object sizes are small, it can have space advantages; otherwise it causes long search times.

    The buddy system, introduced by Knowlton (1965), is a fast and simple memory allocation technique which allocates memory in blocks whose lengths are powers of 2. If the requested block size is not a power of 2, then the size is rounded up to the next power of two. This may leave a big chunk of unused space at the end of an allocated block (Puttkamer 1975), thereby resulting in internal fragmentation that occurs in


    the case of more memory being allocated than requested. The buddy system also suffers from external fragmentation, which arises when a request for memory allocation cannot be satisfied even though the total amount of free space is adequate.
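
    A small plain-C illustration (helper name ours) of the rounding rule just described: the request is rounded up to the next power of two, and the difference is the internal fragmentation.

    #include <stdio.h>

    static unsigned next_pow2(unsigned k)
    {
        unsigned p = 1;
        while (p < k)
            p <<= 1;
        return p;
    }

    int main(void)
    {
        unsigned k = 38;                  /* requested size in chunks      */
        unsigned alloc = next_pow2(k);    /* a buddy system allocates 64   */
        printf("request %u -> allocate %u (%u chunks of internal waste)\n",
               k, alloc, alloc - k);      /* prints 38 -> 64 (26 wasted)   */
        return 0;
    }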

    The performance of the buddy system can be improved using hardware techniques (Puttkamer 1975, Page and Hagins 1986, Chang and Gehringer 1991). A modified hardware-based buddy system which eliminates internal fragmentation is proposed in Chang and Gehringer (1996). In this allocation technique, the memory is divided into fixed-size word groups called chunks, and their statuses are determined by using a bit vector. Each bit of the vector represents a leaf of the or-gate tree given in figure 1. The method in Chang and Gehringer (1996) can detect a free block of size j only if the starting address of the free block is a multiple of j, i.e. k·j where k ≥ 0 and j is a power of 2. Even though a block with the requested free space is available in the memory, their hardware design may not be able to detect it due to the limitations of the or-gate tree structure. Another memory allocation technique, which detects free blocks of size 2^⌈log2 k⌉, is proposed in Cam, Abd-El-Barr and Sait (1999).

    In recent years, multiple buddy systems that predefine the size set have been introduced in order to reduce external fragmentation (Chang, Srisa-an, Dan Lo and Gehringer 2002). However, the modified buddy system still performs better than the multiple buddy system. An active memory management algorithm based on the modified buddy system is implemented in a hardware unit (AMMU) and embedded into an SoC design in Agun and Chang (2001). The dynamic memory management unit allows easy integration into any processor, but it requires a large amount of memory for big object sizes and numbers, since the total amount of memory space for the needed bitmaps is proportional to the object size and the number of objects. Finally, a bitmap-based memory allocator designed in combinational logic works in conjunction with an application-specific instruction set extension (Chang et al. 2002).

    This paper presents a fast and efficient memory allocation (EMA) algorithm to detect any available free block of requested size and to minimize internal and external fragmentation. The proposed technique can allocate a free memory block of any length located in any part of a memory. EMA can allocate free blocks of size 2^⌊log2 k⌋ + 2^⌈log2(k - 2^⌊log2 k⌋)⌉ and it is capable of detecting any available free block of

    Figure 1. Generic structure of or-gate prefix circuit.


    requested size using a proposed or-gate prefix circuit. Furthermore, the proposed circuits for implementing EMA are less complicated than those introduced in Chang and Gehringer (1996) and Lo, Srisa-an and Chang (2001). The simulation results show that EMA occupies approximately 9.2% less memory space than the modified buddy system (Chang and Gehringer 1996). In addition, the EMA hardware has been synthesized with VHDL, tested for several measurements, such as the mean allocation and deallocation time and the total area, and compared to the memory management system in Agun and Chang (2001). Simulation results show that EMA achieves a significant improvement in memory allocation behaviour compared to Agun and Chang (2001) by offering 22% less external fragmentation.

    The rest of the paper is organised as follows: § 2 describes the proposed memory allocation algorithm; § 3 presents the detailed hardware design of the allocator/deallocator proposed in this work; § 4 includes EMA's performance analysis with the simulation, the VHDL analysis and the FPGA implementation results; concluding remarks are made in § 5.

    2. Efficient memory allocation algorithm

    Consider that the memory is partitioned into a number of chunks which have the same number of words, and that a memory block consists of one or more chunks. The status of each memory chunk is represented in a bit-vector by either a 0 or a 1, depending on whether the chunk is free or used, respectively. The bit-vector thus holds the memory allocation information. The bits of the bit-vector are labelled from left to right in ascending order, starting with 0. Each bit of the bit-vector has an address register containing its label as the address. Figure 2 shows a simplified flowchart of Algorithm EMA, which is presented next.

    Algorithm:

    Input:
      Allocation: the size value k of the requested block.
      Deallocation: the starting address of the block to be deallocated.

    Output:
      Allocation: (i) the starting address of the allocated block; (ii) the bits corresponding to the allocated blocks are inverted from 0 to 1.
      Deallocation: the bits corresponding to the deallocated blocks are inverted from 1 to 0.

    In order to implement the above algorithm, the logic structures of the or-gate-prefix, search-free block and detect-free block circuits are given in § 3. Step 1 determines whether the request is a memory allocation or a deallocation. Step 2 detects the free memory chunks of size 2^⌊log2 k⌋. If k is not a power of 2, then Step 3 detects the free memory chunks of size 2^⌊log2 k⌋ + 2^⌈log2(k - 2^⌊log2 k⌋)⌉. Step 4 selects the highest address among all detected free blocks. Step 5 allocates the blocks and inverts the k bits of the bit-vector corresponding to the chunks of the allocated blocks from 0 to 1. Step 6 frees the blocks and inverts the k bits corresponding to the chunks of the deallocated blocks from 1 to 0.
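
    The following plain-C sketch (helper names ours) models the block-size decomposition behind steps 2 and 3: a request for k chunks is served as one block of 2^⌊log2 k⌋ chunks plus, when k is not a power of two, an adjacent block of 2^⌈log2(k - 2^⌊log2 k⌋)⌉ chunks. It is a behavioural model of the algorithm, not of the hardware.

    #include <stdio.h>

    static unsigned floor_pow2(unsigned k)   /* 2^floor(log2 k) */
    {
        unsigned p = 1;
        while (p * 2 <= k)
            p *= 2;
        return p;
    }

    static unsigned ceil_pow2(unsigned k)    /* 2^ceil(log2 k) */
    {
        unsigned p = 1;
        while (p < k)
            p *= 2;
        return p;
    }

    int main(void)
    {
        unsigned k = 38;
        unsigned s1 = floor_pow2(k);                     /* 32 */
        unsigned s2 = (k == s1) ? 0 : ceil_pow2(k - s1); /*  8 */
        printf("k = %u -> search for %u + %u = %u chunks\n",
               k, s1, s2, s1 + s2);   /* k = 38 -> 32 + 8 = 40 */
        return 0;
    }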


    3. Hardware design of the memory allocator

    In order to implement the above algorithm, the search-free block and detect-free block circuits are designed. Details of the circuits and the algorithm steps are given in the following.

    3.1 Search of free blocks (steps 2 and 3)

    To detect all free blocks of size 2^⌊log2 k⌋ + 2^⌈log2(k - 2^⌊log2 k⌋)⌉, we use the or-gate-prefix circuit whose logic structure was given in Cam et al. (1999). The or-gate-prefix circuit designed for a memory of N chunks is shown in figure 1. Each node at level Li represents an OR gate. As seen from figure 1, there are n + 1 level selectors labelled S0, S1, ..., Sn for a 2^n-bit vector. For any free block of size 2^i, there will be exactly one corresponding or-gate node with value 0 at level Li of the

    Figure 2. Block diagram of EMA algorithm.


    or-gate-prefix circuit. The outputs of all or-gates are inverted and then become the inputs of the tri-state buffers. When level selector Si is asserted, the outputs of those tri-state buffers which correspond to free blocks drive their associated vertical lines V(j), j ≥ 0, depicted in figure 3. These vertical lines (called the V-vector) generate the address associated with the first chunk of the available blocks of the requested size. When a block of size k is requested, depending on k, in Step 2 or Step 3 only level selector Si is asserted, with i = ⌊log2 k⌋ or i = ⌈log2(k - 2^⌊log2 k⌋)⌉, respectively.

    In this technique, the or-gate-prefix circuit can detect any free block if its size is a power of 2, no matter where the free block is located in the memory. If k is a power of 2, the technique uses the or-gate-prefix circuit only once, in step 2. In step 2, free blocks of size 2^⌊log2 k⌋ are detected by a high-priority encoder; the decoded bits are then compared with the requested block size, and if they are equal it can easily be seen that the requested size is a power of 2, and S1 is selected as the input size of the or-gate-prefix circuit (S). However, if k is not a power of 2, the or-gate-prefix circuit is used twice (step 2 and step 3). In step 3, instead of the bit-vector bits, the NANDed V-vector bits are used, as shown in figure 4. Using this new bit-vector, the algorithm detects the free blocks of size 2^⌈log2(k - 2^⌊log2 k⌋)⌉, which are equivalent to free blocks of size 2^⌊log2 k⌋ + 2^⌈log2(k - 2^⌊log2 k⌋)⌉ in the original bit-vector.

    As seen in figure 4, the subtractor subtracts S1 from the requested block size k. This difference (k - S1) corresponds to the free blocks of size 2^⌈log2(k - 2^⌊log2 k⌋)⌉. Following a procedure similar to step 2, in step 3 the obtained block size passes through a high-priority encoder and then a decoder circuit and is loaded into a shift register which holds the content of S2. If the decoded size bits are not equal to the difference size, the shift register is enabled to shift one bit position to the left; otherwise the decoded size bits are loaded into the register without any shift.

    Figure 3. The or-gate-prefix logic circuit.


    Example 1: Assume that the requested memory size is 38 chunks. In step 2, using the or-gate-prefix circuit, EMA detects free blocks of size 32 = 2^⌊log2 38⌋ and activates the V-vector address bits of those blocks by setting S5 = 1. The requested size is not a power of 2, therefore EMA employs the or-gate-prefix circuit again to detect all the free blocks of size 2^⌊log2 38⌋ + 2^⌈log2(38 - 2^⌊log2 38⌋)⌉ = 32 + 8 = 40. Since, in step 2, the address registers which hold the V-vector are activated from level S5 of the or-gate-prefix circuit, each active register represents a free block of size 32. Moreover, 9 consecutive active address registers are equivalent to a free block of size 40, because the or-gate-prefix circuit has the following property: if address register a represents n chunks of memory, say bit-vector bits 1 to n, then address register a + 1 represents bit-vector bits 2 to n + 1. EMA takes advantage of this property of the or-gate-prefix circuit to detect the free blocks of size 40 by using the V-vector as the bit-vector input of the or-gate-prefix circuit. After the NAND operation, the free block of size 40 is represented by 8 consecutive active bits of the V1 register, which can be detected by the or-gate-prefix circuit. The results of the NAND operations are inserted into the bit-vector by inverting each result, so that the active bits of the new bit-vector are represented by 0s in the original bit-vector. Finally, in step 3, EMA detects the free blocks of size 40 by finding the 8 consecutive active bits in the new bit-vector.
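
    In software terms, the level-Li output of the or-gate-prefix circuit is a window-OR over the bit-vector: position a is active exactly when the 2^i bits starting at a are all 0. The plain-C sketch below (names ours) models this detection step, which Example 1 applies twice, with window sizes 32 and then 8.

    /* v[a] = 1 when bits a .. a + size - 1 of bv are all free (0). */
    static void detect_free(const unsigned char *bv, int n,
                            int size, unsigned char *v)
    {
        for (int a = 0; a + size <= n; ++a) {
            int used = 0;
            for (int j = 0; j < size; ++j)
                used |= bv[a + j];        /* the OR-gate tree, in software */
            v[a] = !used;
        }
    }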

    3.2 The free block detection with the highest address (step 4)

    This step determines the free block whose first chunk address is the greatest; the address of a block's first chunk is called its starting address. The bits of the V-vector generated by the or-gate-prefix circuit indicate whether the requested blocks can be allocated or not. If the corresponding k bits are 1s, a free block of size k can be allocated there. When more than one bit is set in the V-vector, the selection of the highest address corresponding to these bits is achieved by using a high-priority encoder circuit.

    Example 2: Consider a memory partitioned into 2^4 chunks that each have an equal number of words. Each memory chunk is represented by one bit of the bit-vector, as shown in figure 5. Assume that a free block of size 2 is requested. The level selector S shown in figure 5 is activated and the V-vector bits with addresses 0010,

    Figure 4. The Search and Detect-free block circuit.


    0101, 1001 and 1010 are set. The function of the high-priority encoder is to select the free block of size 2 having the highest address, so the output of the encoder is obtained as 1010.
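
    A plain-C analogue (function name ours) of the high-priority encoder used in step 4: among all set V-vector bits it returns the highest address. For Example 2 it would return 10, i.e. binary 1010.

    static int highest_free_address(const unsigned char *v, int n)
    {
        for (int a = n - 1; a >= 0; --a)
            if (v[a])
                return a;   /* highest starting address of a suitable block */
        return -1;          /* no free block of the requested size exists   */
    }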

    3.3 Bit inversion (step 5)

    Let us denote the starting and ending addresses of the determined block by SA and EA, respectively. Using the formula EA = SA + k - 1, EA can easily be computed for the size k of the requested block. Since the V-vector represents the bits corresponding to the allocated block, SA and EA correspond to the addresses of the first and last bits of this vector, respectively. The bits of the V-vector must be inverted to 1 to indicate that they are allocated.

    This is achieved in two steps by the circuit illustrated in figure 6. The objective of the first step is to set the gates connected to the (EA + 1)th bit in such a way that the bits with addresses EA + 1, EA + 2, and so on cannot be inverted in the second step. The goal of the second step is to invert all the bits of the subvector V.

    As soon as the free block with the highest address is determined in step 4, the bit-inversion operation is done simultaneously to update the status of the bit-vector. Therefore, the bit-inversion circuit does not contribute any delay to the current memory allocation request. The only constraint imposed by the bit-inversion circuit is that the next memory allocation request cannot start until the inversion of the bits in V is complete.

    In the first step of the bit-inversion operation, the reset line shown in figure 6 is asserted high, and EA + 1 is put on the address lines of the address-control unit. For each bit of the bit-vector, there is a corresponding match cell, illustrated by a dashed-line box in figure 6. The comparator in each match cell compares the address on the address lines with the contents of the address register in its match cell.

    Obviously, only the comparator associated with bit EA + 1 shows a match, because the address lines carry the address EA + 1. Therefore, only the output of the comparator of match cell EA + 1 is 1, and the outputs of all other comparators are 0. The contents of the bit-latch in a match cell can be changed only when the reset line is 1, so that the reset line acts like an enable line for the bit-latch. Since the reset line is also 1 during the first step only, the number 0 is stored in the bit-latch of the

    Figure 5. The contents of the bit-vector and V-vector for the example.


    match cell EA + 1, and all the other bit-latches of the match cells store the number 1. Note that the output of the bit-latch of match cell EA + 1 is an input to an AND gate whose output automatically becomes 0. Finally, the reset line is made 0 and the first step ends. The settings of the circuit in the first step ensure that the bits with addresses such as EA + 1 and EA + 2 cannot be inverted in the second step.

    In the second step of the bit-inversion operation, all the bits of the subvector V are inverted. To accomplish this, the address lines of the address-control unit in figure 6 are set to the starting address SA, and the set line is asserted high. The comparator of each match cell again compares its address with the address SA carried by the address lines. Consequently, only the output of the comparator of match cell SA becomes 1. Because the set line is also 1, both inputs of one of the AND gates in match cell SA become 1, thereby producing an output of 1. This leads the output of the OR gate connected to that AND gate to be 1. The output of this OR gate is connected to the AND gate which enables the bit-inverter with address SA. Because the bit-latches of the match cells with addresses SA to EA have been set to 1 in the first step, both inputs of the AND gates located between the OR gates are 1s. Therefore, the outputs of the OR gates connected to the bit-inverters SA to EA inclusive become 1 sequentially. Hence, all the bit-inverters from SA to EA are enabled. This amounts to saying that all the bits of the subvector V are inverted during the second step. At the end of the second step, the set line is made 0.
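
    Behaviourally, steps 5 and 6 both reduce to inverting the k bit-vector bits from SA to EA = SA + k - 1; a plain-C model (name ours) of what the two-step hardware circuit accomplishes is simply:

    /* Invert bits SA .. SA + k - 1: 0 -> 1 allocates (step 5),
     * 1 -> 0 deallocates (step 6). */
    static void invert_bits(unsigned char *bv, int sa, int k)
    {
        for (int a = sa; a < sa + k; ++a)
            bv[a] ^= 1;   /* one bit per memory chunk */
    }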

    Figure 6. Bit-inversion circuit.


    3.4 Memory deallocation (step 6)

    In the case of memory deallocation, the starting address SA and the size k of the block to be deallocated are given. Since the ending address EA of the block is then known, in step 6 the bit inverters invert the bits from 1 to 0 to indicate that they are free.

    4. Simulation results and VHDL synthesis

    This section consists of two parts. In the first part, a comparative performance analysis of EMA and previous memory allocation techniques is given. The second part presents how EMA is implemented in hardware using VHDL synthesis. Also in this part, a comparison of EMA and the active memory management unit (AMMU) (Agun and Chang 2001) is given using the criteria of fragmentation, execution time and chip area.

    4.1 Simulation results

    A simulator was written to compare the memory utilization of EMA with the following three memory allocators: Chang and Gehringer's memory allocator (CGMA) (Chang 1996), the complete multiple buddy system (CMBS) (Chang 1997) and the multiple buddy system (MBS) (Lo et al. 2001). As in the simulations in Lo et al. (2001), the MBS in our simulations has the size set (2^n, 3·2^n); that is, MBS is able to allocate any requested block of size 2^n or 3·2^n. It is stated in Chang (1997) that CMBS yields the best memory utilization among all previously known buddy systems, therefore CMBS is also used as a benchmark in the simulations.

    The simulator uses memory allocation/deallocation traces collected using the eight Java benchmark programs of SPECjvm98 (1998). These benchmark programs are designed to measure the performance of Java virtual machine (JVM) implementations. The Kaffe (2002) JVM, version 1.0.7, is customized to generate a trace file of memory allocation/deallocation requests during the execution of each Java program of SPECjvm98.

    As in Chang and Gehringer (1996), the performance metric used in the simulations is the highest address allocated (HAA) in the memory. The simulator takes a previously generated memory trace file and generates the highest address values for the allocation/deallocation of the requested memory blocks by each of the above four memory allocators. The simulation results in table 1 indicate that CMBS has the best memory utilization by generating the lowest HAAs, followed by EMA, CGMA and MBS, respectively.

    Also, table 1 shows that the average memory utilization of EMA is 9.2% and 13.8% better than that of CGMA and MBS, respectively. This is due to the following facts.

    - EMA is capable of allocating an available block of the requested size wherever it is in the memory.

    - In EMA, the size of the free memory block that has to be detected to satisfy a request is smaller than the ones in CGMA and MBS.

    Although this memory utilization improvement of 9.2% may not seem to be a big number, it is indeed a significant improvement, knowing that the best buddy-type memory allocator, CMBS, is shown to be only around 20% better than the modified


    buddy system (Lo et al. 2001). Moreover, the memory utilization improvement of EMA may be increased further when the circular prefix circuit is used.

    Table 2 shows the number of allocation/deallocation requests (NADR) and the average allocation/deallocation size (AADS) of each simulation program, and the percentage of memory overhead associated with EMA, CGMA (Chang and Gehringer 1996) and MBS (Lo et al. 2001) compared to CMBS (Chang 1997).

    As seen from tables 1 and 2, CMBS has the best memory utilization, whereas EMA clearly outperforms CGMA and MBS, resulting in lower HAAs for all benchmark programs. However, the implementation cost of CMBS is much higher than EMA's, because the number of nodes in CMBS's or-gate tree is O(N^2), whereas EMA's or-gate tree has only O(N log N) nodes. CGMA and MBS both have O(N) nodes in their or-gate trees. The implementation costs of the above memory allocators, in terms of the number of nodes in their or-gate trees, are presented in figure 7 for various bit-vector sizes. Figure 7 indicates that CMBS is extremely expensive to implement compared to EMA, CGMA and MBS.

    4.2 VHDL synthesis and test results

    In this section, the chip area and execution time of the proposed hardware unit are examined. Test results are obtained for different parameters, such as the bit-vector size and the maximum object size.

    Table 1. Highest Address Allocated (HAA) values in number of blocks when block size is eight.

    Trace program    CMBS      EMA       CGMA      MBS
    Check            109,895   139,239   171,769   195,625
    Compress         104,808   133,085   162,738   174,129
    Jess             355,718   405,967   457,114   533,299
    Db               192,531   224,888   260,012   273,879
    Javac            313,403   342,351   391,277   374,762
    Mpegaudio        138,698   169,043   214,083   239,170
    Mtrt             741,602   902,414   928,932   950,535
    Jack             522,589   619,558   650,956   666,731

    Table 2. Percentage overhead of memory allocators compared to CMBS, with NADR and AADS.

    Trace program    EMA overhead   CGMA overhead   MBS overhead   NADR         AADS
    Check            21.07          36.02           43.82          86,738       54
    Compress         21.25          35.60           39.81          77,233       53
    Jess             12.38          22.18           33.30          21,184,857   46
    Db               14.39          25.95           29.70          6,550,585    25
    Javac            8.46           19.90           16.37          822,007      41
    Mpegaudio        17.95          35.21           42.01          92,525       53
    Mtrt             17.82          20.17           21.98          13,843,281   32
    Jack             15.65          19.72           21.62          28,039,374   36


    The bit-vector is used to represent the status and capacity of the memory. The maximum object size is defined as the maximum number of chunks that can be allocated at once. In this application, for the different allocation requests, the bit-vector sizes range from 64 bytes up to 512 bytes, whereas the maximum allocation sizes vary from 16 to 128 blocks. For this purpose, the Xilinx ISE (2005) 6.2.03i tool is used to generate a gate-level representation of the memory allocator/deallocator hardware design. The hardware unit is implemented on a Virtex II Pro FPGA (2005) with device type XC2VP100, package type FF, 1696 pins and speed grade -5. The analysis was performed on a desktop computer with an Intel Pentium IV running at 2 GHz with 512 Mbyte of RAM.

    The results regarding the FPGA implementation of the allocator/deallocator unit are given in table 3. As seen from table 3, only 16% of the total FPGA is used for the maximum bit-vector size selected in the EMA implementation. It should be noted that the bit-vector size can be increased according to the application. These results also show that the amount of FPGA used in the implementation is only slightly affected by the maximum object size parameter.

    Table 4 shows the number of cycles needed to perform an allocation/deallocation process. Each allocation request takes 5 clock cycles if its size is a power of 2, and 6 cycles otherwise. Each deallocation request needs only 2 cycles.

    In addition, a timing analysis of the EMA hardware unit was performed, and the results in terms of the mean allocation and deallocation times are reported in figure 8. As seen from figure 8, large bit-vector sizes increase both the mean allocation and the mean deallocation delay. However, when the bit-vector length is held constant, increasing the maximum allocation size does not add any delay to the allocation and deallocation times.

    The active memory management algorithm based on the modified buddy system is implemented in a hardware unit (AMMU) and embedded into an SoC design using VHDL in Agun and Chang (2001). The total fragmentation, maximum clock frequency and number of slices used are reported in Agun (2001). The comparison of EMA and AMMU is given in table 5.

    Figure 7. The implementation costs of CMBS, EMA, CGMA and MBS in terms of the number of nodes.


    EMA reduces total fragmentation by 22%. As AMMU allocates memory blocks whose sizes are each a power of two, it suffers from both internal and external fragmentation, whereas in EMA only external fragmentation can exist. Furthermore, EMA can perform the allocation process in a shorter time, since the allocation time is proportional to the inverse of the maximum clock frequency. On the other hand, the implementation cost of EMA in terms of the number of slices used is higher than that of AMMU.

    5. Conclusions

    The paper presented a memory allocation and deallocation technique, EMA, together with its hardware design. When a free block of k chunks is requested, an or-gate-prefix circuit detects all free blocks of 2^⌊log2 k⌋ + 2^⌈log2(k - 2^⌊log2 k⌋)⌉ chunks in memory. The free

    Table 3. Memory Alloc./Dealloc. hardware impl