On matching programmers' chunks with program structures: An empirical investigation

<ul><li><p>Int. J. Man-Machine Studies (1987) 27, 65-89 </p><p>On matching programmers' chunks with program structures: An empirical investigation </p><p>IRIS VESSEY </p><p>University of Pittsburgh, Pittsburgh, PA 15260, U.S.A. </p><p>(Received 2 June 1986 and in revised form 13 February 1987) </p><p>Expertise in a given domain is generally regarded as being manifested in the possession of a large body of knowledge stored as chunks or schemata in long-term memory. Recall experiments in a variety of domains have demonstrated that experts possess larger chunks of knowledge on meaningful tasks, while their performance falls to that of novices on non-meaningful tasks. Three experiments are reported, two recall and one construction, designed to provide information on the knowledge structures of COBOL programmers. In the initial experiment, programmers' chunking ability, as revealed by program recall, predicted performance on a debugging task less successfully than did their problem-solving processes. A second experiment sought to determine whether the lack of a match between programmers' chunks and the information structures in the program used for recall was responsible for the poor differentiation of programming skill afforded by the recall test. Although expert programmers recalled more than novice programmers, there were no qualitative differences in the types of structures the two groups recalled. A third experiment required expert programmers to construct a routine performing a function similar to that of the program used for recall. The programmers constructed routines with diverse program structures. In general, the results show that both expert and novice programmers possess a wide variety of chunks of the kind incorporated into the recall program. It appears, however, that even professional programmers do not have well-formulated scripts for validation stored in long-term memory. </p><p>1. 
Introduction </p><p>de Groot (1965, 1966) and Chase and Simon (1973a, b), with their pioneering experiments on chess-playing, demonstrated that expertise is vested in the possession of a large body of meaningful chunks of knowledge† stored in long-term memory. The basic studies of this nature expose experts and novices to both meaningful and non-meaningful domain-related materials for limited periods of time. The possession of expertise is demonstrated by the superior recall of experts on meaningful tasks, but equivalent expert-novice performance on non-meaningful tasks. Hence, experts are distinguished from novices on the basis of their possession of a large store of organized knowledge rather than on specific superior characteristics, such as short-term memory. On recall of meaningful materials, experts retrieve from long-term memory knowledge structures corresponding to the stimuli of the experiment. Since novices do not have these large stores of knowledge in long-term memory, they cannot perform much better on meaningful materials than they do on random materials. Correspondingly, experts do not have knowledge structures in long-term memory from which to recall random materials, and their performance on the non-meaningful tasks falls to that of novices. </p><p>† The term, knowledge structures, is used throughout this paper to refer to chunks of knowledge in long-term memory. </p><p>65 0020-7373/87/070065 + 25 $03.00/0 © 1987 Academic Press Limited </p></li><li><p>66 I. VESSEY </p><p>The chess studies spawned a series of similar studies aimed at demonstrating that expertise is manifested in similar ways in domains other than chess. Examination of the literature shows that the paradigm has been established across a wide variety of domains. Reitman (1976), in Go, and Egan and Schwartz (1979), in electronics, found that expertise was similarly associated with the possession of knowledge in these domains, which also use pictorial representations. 
More important from the viewpoint of computer programming, however, are the studies that have validated the paradigm in domains using symbolic representations. Sloboda (1976) conducted a study in the domain of music, Engle and Bukstel (1978) investigated expertise in bridge, and Larkin, McDermott, Simon and Simon (1980) investigated expertise in physics. Although the representations in these domains resemble those of computer programs in some respects and differ in others, they are all basically symbolic and share the overriding characteristic that the dominant type of information they carry is sequencing information.† Further, a number of studies have been conducted in the domain of computer programming; see, for example, Shneiderman (1976), McKeithen, Reitman, Rueter and Hirtle (1981), and Barfield (1986). Hence, it appears that the paradigm of expertise first elaborated in the domain of chess is also applicable to domains where symbolic representations dominate. </p><p>As Pennington (1982) comments, however, these prior program recall studies have not sought to code data systematically "to reveal principles that might govern chunking or that might imply hierarchical coding", i.e. they have not sought to investigate the nature of the chunks programmers reveal on recall. Hence, the current research investigates the nature of the chunks expert and novice programmers store in long-term memory by examining the importance to recall of a match between program structures and programmers' knowledge structures. </p><p>The paper proceeds as follows. Section 2 presents the results of the first experiment, a study that used the analogy to chess expertise to identify groups of more and less expert programmers for use as subjects in a further experiment. Since the recall pre-test performed poorly according to objective performance measures, this section analyses the recall data from that experiment. 
It compares the nature of the information structures incorporated in the program vis-à-vis the chunks possessed by programmers, in an attempt to understand the poor performance of expert programmers on this task. Section 3 tests the thesis derived from the analysis of the results of the first experiment: that it is essential to match program information structures with the chunks of knowledge possessed by expert and novice programmers. Section 4 describes a third experiment that sought to determine the types of chunks possessed by experienced programmers by asking them to construct their preferred routines to fulfill a function similar to that of the recall program used in the previous two experiments. Section 5 presents a general discussion of the results of the three experiments and the conclusions. </p><p>† That sequencing information is the primary information propagated through a program is due to the linear representation employed by current programming languages. However, the structure of languages, and therefore the types of information foregrounded in those languages, may change, for example with the advent of visual programming languages. See, for example, Brown, Carling, Herot, Kramlich &amp; Souza (1985) and Raeder (1985). </p><p>‡ See Vessey (1986) for a review of recall research in computer programming. </p></li><li><p>PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 67 </p><p>2. Experiment I </p><p>The aim of the first experiment was to derive a set of "expert" and "novice" programmers from a group of professional programmers serving as subjects in a protocol investigation of expert and novice debugging processes (Vessey, 1984). Vessey used three methods in an attempt to assess the expertise of programmers for the research. One of these was the effectiveness of program recall. 
The other methods were manager ratings, or expert opinion (Reilly et al., 1975), and an ex post measure derived from the protocols themselves. The ex post classification was based on the "tuning" of expert problem-solving, whereby "(w)ith continued use of the material about a topic, performance becomes much smoother, more efficient, less hesitant" (Norman, 1978).† Hence, instead of assessing the extent or the content of programmers' chunks, this method of classifying programmers assessed the efficiency with which programmers chunked the material under examination, i.e. it assessed chunking processes. </p><p>2.1. TASK </p><p>The program used was a slightly modified, short (67-line) COBOL program taken from the text, Learning to Program in Structured COBOL, by Yourdon, Gane, Sarson and Lister (1976, pp. 32-33) (see Fig. 1).‡ The program was originally used by Shneiderman (1977) for recall in programming factors research, where both functional and verbatim recall were used as surrogates for program comprehension. </p><p>2.2. RECALL TEST METHODOLOGY USED </p><p>Vessey's recall test differed from other recall experiments in the domain of computer programming (Shneiderman, 1976; McKeithen et al., 1981) in that the task objective was functional recall of a program that subjects were not required to memorize. Experts use cues in the task to retrieve chunks stored in long-term memory, and the effectiveness of their recall is reflected in the extent to which the information structures in the task domain match their stored chunks. Hence, this study took the view that recall is a process of pattern matching rather than memorization. This recall study also took the view that recall of the DATA DIVISION of a COBOL program is not meaningful in current-day programming environments. With the advent of copy facilities, data dictionaries, and the like, few programmers actually write the DATA DIVISIONs of their programs. Recalling the DATA DIVISION verbatim, especially, would no doubt require memorization! 
</p><p>† The ex post programmer classification was derived from three debugging "efficiency" criteria that sought to determine how smoothly programmers' chunking processes flowed. Experts are expected to demonstrate chunking ability by displaying a smooth approach to problem solving, referencing different parts of the program minimally to gain understanding of its functioning. Novices, on the other hand, are expected to exhibit more erratic behavior, returning to parts of the program they have already inspected. The debugging efficiency criteria were derived from a model of the debugging process. For further information on the ex post classification of programmers, see Vessey (1984, 1985). </p><p>‡ The program was modified in two ways: (1) The CARDS-LEFT field was initialized in the WORKING-STORAGE SECTION with the VALUE 'YES' clause instead of using the initialization statement, MOVE 'YES' TO CARDS-LEFT, in the PROCEDURE DIVISION; (2) The READ statement was placed in a separate module rather than appearing in two locations in the program: the main routine and the program loop (the PROCESS-CARDS module). </p></li><li><pre>IDENTIFICATION DIVISION.
PROGRAM-ID. TEST.

ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT CARD-FILE ASSIGN TO CDR.
    SELECT PRINT-FILE ASSIGN TO PTR.

DATA DIVISION.
FILE SECTION.
FD  CARD-FILE
    LABEL RECORDS ARE OMITTED.
01  CARD.
    05  NAME            PICTURE X(20).
    05  ADDR            PICTURE X(40).
    05  PHONE.
        10  AREA-CODE   PICTURE X(3).
        10  NUMBR       PICTURE X(8).
    05  FILLER          PICTURE X(9).
FD  PRINT-FILE
    LABEL RECORDS ARE OMITTED.
01  PRINT-RECORD.
    05  FILLER          PICTURE X(3).
    05  NAME            PICTURE X(20).
    05  FILLER          PICTURE X(5).
    05  ADDR            PICTURE X(40).
    05  FILLER          PICTURE X(5).
    05  PHONE.
        10  AREA-CODE   PICTURE X(3).
        10  FILLER      PICTURE X.
        10  NUMBR       PICTURE X(8).
    05  FILLER          PICTURE X(10).
    05  MESSAGE-AREA    PICTURE X(25).
    05  FILLER          PICTURE X(12).
WORKING-STORAGE SECTION.
01  WARNING-1   PICTURE X(25) VALUE '***** NAME MISSING *****'.
01  WARNING-2   PICTURE X(25) VALUE '***** ADDRESS MISSING *****'.
01  WARNING-3   PICTURE X(25) VALUE '***** PHONE MISSING *****'.
01  CARDS-LEFT  PICTURE X(3)  VALUE 'YES'.

PROCEDURE DIVISION.
    OPEN INPUT CARD-FILE
         OUTPUT PRINT-FILE.
    PERFORM READ-CARD.
    PERFORM PROCESS-CARDS
        UNTIL CARDS-LEFT IS EQUAL TO 'NO'.
    CLOSE CARD-FILE
          PRINT-FILE.
    STOP RUN.

READ-CARD.
    READ CARD-FILE
        AT END MOVE 'NO' TO CARDS-LEFT.

PROCESS-CARDS.
    MOVE SPACES TO PRINT-RECORD.
    IF NAME IN CARD IS EQUAL TO SPACES
        MOVE WARNING-1 TO MESSAGE-AREA
    ELSE
        MOVE NAME IN CARD TO NAME IN PRINT-RECORD.
    IF ADDR IN CARD IS EQUAL TO SPACES
        MOVE WARNING-2 TO MESSAGE-AREA
    ELSE
        MOVE ADDR IN CARD TO ADDR IN PRINT-RECORD.
    IF PHONE IN CARD IS EQUAL TO SPACES
        MOVE WARNING-3 TO MESSAGE-AREA
    ELSE
        MOVE AREA-CODE IN CARD TO AREA-CODE IN PRINT-RECORD
        MOVE NUMBR IN CARD TO NUMBR IN PRINT-RECORD.
    WRITE PRINT-RECORD.
    PERFORM READ-CARD.
</pre><p>FIG. 1. The Recall program. </p></li><li><p>In view of the above analysis, subjects in this recall experiment were required to recall the PROCEDURE DIVISION of the COBOL program, which was presented with the first three divisions on the first page of the listing and the PROCEDURE DIVISION on the second. Perusal and recall times were determined by pilot tests. Subjects were permitted 90 s to familiarize themselves with the complete program. Then, referencing only the first three divisions, they were permitted 10 min to reproduce a functionally equivalent PROCEDURE DIVISION, i.e. the PROCEDURE DIVISION they produced had to be error-free and to produce the same output as the program they had viewed. The recalled programs were rated by two independent, experienced programmers who were instructed to ensure that the recalled programs performed the same functions as the original program and to regard semantic errors as more serious than syntactic errors. The two scorers achieved 100% agreement in ranking experts and novices, and their rankings within each category also agreed remarkably closely. </p><p>2.3. ANALYSIS OF THE PERFORMANCE OF THE RECALL PRE-TEST </p><p>Here, we are concerned with the performance of the recall test in determining expertise. Table 1 shows the correspondence among the three classifications used in Vessey's study. The recall and the manager classifications classified eight of the 16 subjects similarly, the recall and ex post classifications also agreed in eight of the 16 cases, while the manager and ex post classifications agreed in 10 of the 16 cases. These three classifications were then assessed against the objective performance criteria of debug time and the number of mistakes subjects made to determine which provided the best expert-novice classification. </p><p>TABLE 1 Subject classifications based on three methods for distinguishing between experts and novices </p><table>
<tr><th rowspan="2">Subject</th><th colspan="3">Expert-novice classification</th></tr>
<tr><th>Recall</th><th>Manager</th><th>Ex post</th></tr>
<tr><td>1002</td><td>N</td><td>E</td><td>N</td></tr>
<tr><td>1003</td><td>E</td><td>N</td><td>N</td></tr>
<tr><td>1004</td><td>E</td><td>E</td><td>E</td></tr>
<tr><td>1005</td><td>E</td><td>N</td><td>N</td></tr>
<tr><td>1006</td><td>N</td><td>N</td><td>E</td></tr>
<tr><td>1007</td><td>E</td><td>E</td><td>E</td></tr>
<tr><td>1008</td><td>E</td><td>E</td><td>E</td></tr>
<tr><td>1009</td><td>N</td><td>E</td><td>E</td></tr>
<tr><td>1010</td><td>N</td><td>E</td><td>N</td></tr>
<tr><td>1011</td><td>N</td><td>N</td><td>N</td></tr>
<tr><td>1012</td><td>E</td><td>N</td><td>N</td></tr>
<tr><td>1014</td><td>E</td><td>N</td><td>E</td></tr>
<tr><td>1015</td><td>E</td><td>E</td><td>E</td></tr>
<tr><td>1016</td><td>N</td><td>N</td><td>N</td></tr>
<tr><td>1018</td><td>N</td><td>E</td><td>E</td></tr>
<tr><td>1019</td><td>N</td><td>N</td><td>E</td></tr>
</table></li><li><p>An ANCOVA model, with verbalization rate as the covariate, was used to test for the effects of the programmer classification and bug level on debug time.† The recall model accounted for 30.7% of the variation in debug time, the manager model 36.1%, and the ex post model 73.7%. In addition, the recall classification classed only one of the five programmers who made mistakes as a novice, the manager classification classed four of the five as novices, while t...</p></li></ul>