Transcript

Int. J. Man-Machine Studies (1987) 27, 65-89

On matching programmers" chunks with program structures: An empirical investigation

IRIS VESSEY

University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.

(Received 2 June 1986 and in revised form 13 February 1987)

Expertise in a given domain is generally regarded as being manifested in the possession of a large body of knowledge stored as chunks or schema in long-term memory. Recall experiments in a variety of domains have demonstrated that experts possess larger chunks of knowledge on meaningful tasks, while their performance fails to that of novices on non-meaningful tasks. Three experiments are re- ported, two recall and one construction, that were designed to provide information on programmers' (COBOL) knowledge structures. In the initial experiment, the chunking ability of computer programmers, as revealed by program recall, was less successful in predicting performance on a debugging task than were programmers' problem-solving processes. A second experiment sought to determine whether the lack of a match between programmers' chunks and the information structures in the program used for recall was responsible for the poor differentiation of programming skill afforded by the recall test. Although expert programmers recalled more than novice programmers, there were no qualitative differences in the types of structures the two groups recalled. A third experiment required expert programmers to construct a routine to accomplish a similar function to that of the program used for recall. The programmers constructed routines with diverse program structures. In general, the results show that both expert and novice programmers possess a wide variety of chunks of the kind incorporated into the recall program. It appears, however, that even professional programmers do not have well-formulated scripts for validation stored in long-term memory.

1. Introduction

de Groot (1965, 1966) and Chase and Simon (1973a, b), with their pioneering experiments on chess-playing, demonstra ted that expertise is vested in the posses- sion of a large body of meaningful chunks of knowledget stored in long-term memory. The basic studies of this nature expose experts and novices to both meaningful and non-meaningful domain-related materials for limited periods of time. The possession of expertise is demonstra ted by the superior recall of experts on meaningful tasks, but equivalent exper t -novice performance on non-meaningful tasks. Hence, experts are distinguished f rom novices on the basis of their possession of a large store of organized knowledge rather than on specific superior characteris- tics, such as short- term memory. On recall of meaningful materials, experts retrieve from long-term m e m o r y knowledge structures corresponding to the stimuli of the experiment. Since novices do not have these large stores of knowledge in long-term memory, they cannot perform at levels much better than for random recall. Correspondingly, experts do not have knowledge structures in long-term memory

t The term, knowledge structures, is used throughout this paper to refer to chunks of knowledge in long-term memory.

65 0020-7373/87/070065 + 25503.00/0 (~) 1987 Academic Press Limited

66 ~. VESSEY

from which to recall random materials and their performance on the non-meaningful tasks falls to that of novices.

The chess studies spawned a series of similar studies aimed at demonstrating that expertise is manifested in similar ways in domains other than chess. Examination of the literature shows that the paradigm has been established across a wide variety of domains. Reitman (1976) in Go and Egan and Schwartz (1979) in electronics found that expertise was associated with possession of knowledge in other domains with pictorial representations. More important from the viewpoint of computer program- ming, however, are the studies that have validated the paradigm in domains using symbolic representations. Sloboda (1976) conducted a study in the domain of music, Engle and Bukstel (1978) investigated expertise in bridge, and Larkin, McDermott, Simon and Simon (1980) investigated expertise in physics. Although there are similarities and differences among the representations in these domains and computer programs, they are all basically symbolic and possess the overriding characteristic that the principal type of information that dominates them is sequencing information.t Further, a number of studies have been conducted in the domain of computer programming; see, for example, Shneiderman (1976), McKeithen, Reitman, Rueter and Hirtle (1981), and Barfield (1986). Hence, it appears that the paradigm of expertise first elaborated in the domain of chess is applicable also to domains where symbolic representations dominate.

As Pennington (1982) comments, however, these prior program recall studies have not sought to code data systematically: "to reveal principles that might govern chunking or that might imply hierarchical coding", i.e. they have not sought to investigate the nature of the chunks programmers reveal on recall. Hence, the current research investigates the nature of the chunks expert and novice program- mers store in long-term memory by examining the importance to recall of a match between program structures and programmers' knowledge structures.

The paper proceeds as follows. Section 2 presents the results of the first experiment, a study that used the analogy to chess expertise to determine groups of more and less expert programmers for use as subjects in a further experiment. Since the recall pre-test performed poorly according to objective performance measures, this section analyses the recall data from that experiment. It compares the nature of the information structures incorporated in the program vis-h-vis the chunks possessed by programmers, in an attempt to understand the poor performance of expert programmers on this task. Section 3 tests the thesis derived from the analysis of the results of the first experiment: that it is essential to match program information structures with the chunks of knowledge possessed by expert and novice programmers. The following section, section 4, describes a third experiment that sought to determine the types of chunks possessed by experienced programmers by requesting them to construct their preferred routines to fulfill a function similar to that of the recall program used in the previous two experiments. Section 5 presents a general discussion of the results of the three experiments and the conclusions.

~" That sequencing information is the primary information propagated through a program is due to the linear representation employed by current programming languages. However, the structure of languages, and therefore the types of information foregrounded in those languages, may change, for example with the advent of visual programming languages. See, for example, Brown, Carling, Herot, Kramlich & Souza (1985) and Raeder (1985).

$ See Vessey (1986) for a review of recall research in computer programming.

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 67

2. Experiment I

The aim of the first experiment was to derive a set of "expert" and "novice" programmers from a group of professional programmers serving as subjects in a protocol investigation of expert and novice debugging processes (Vessey, 1984). Vessey used three ratings in an attempt to assess the expertise of programmers for the research. One of the ways used to assess programmer expertise was the effectiveness of program recall. The other methods were manager ratings or expert opinion (Reilly et al., 1975) and an ex post measure derived from the protocols themselves. The ex post classification was based on the "tuning" of expert problem-solving, whereby: "(w)ith continued use of the material about a topic, performance becomes much smoother, more efficient, less hesitant" (Norman, 1978).t Hence, instead of assessing the extent or the content of programmers' chunks, this method of classifying programmers assessed the efficiency with which programmers chunked the material under examination, i.e. it assessed chunking processes.

2.1. TASK

The program used was a slightly modified, short (67-line) COBOL program taken from the text, Learning to Program in Structured CO BO L , by Yourdon, Gane, Sarson and Lister (1976, pp. 32-33) (see Fig. 1).:~ The program was originally used by Shneiderman (1977) for recall in programming factors research, where both functional and verbatim recall were used as surrogates for program comprehension.

2.2. RECALL TEST METHODOLOGY USED

Vessey's recall test differed from other recall experiments in the domain of computer programming (Shneiderman, 1976; McKeithen et al., 1981) in that the task objective was functional recall of a program that subjects were not required to memorize. Experts use cues to chunks stored in long-term memory and the effectiveness of their recall is reflected in the extent to which the information structures in the task domain match their stored chunks. HenCe, this study took the view that recall is a process of pattern matching rather than memorization. This recall study also took the view that recall of the DATA DIVISION of a COBOL program is not meaningful in current-day programming environments. With the advent of copy facilities, data dictionaries, and the like, few programmers actually write the DATA DIVISIONs of their programs. Recalling the D A T A DIVISION (especially) verbatim would no doubt require memorization!

t The ex pos t programmer classification was derived from three debugging "efficiency" criteria that sought to determine the extent of smooth-flowingness in programmers' chunking processes. Experts are expected to demonstrate chunking ability by displaying a smooth approach to problem solving, referencing different parts of the program minimally to gain understanding of its functioning. Novices, on the other hand, are expected to exhibit more erratic behavior, returning to parts of the program they have already inspected. The debugging efficiency criteria were derived from a model of the debugging process. For further information on the ex post classification of programmers, see Vessey (1984, 1985).

~:The program was modified in two ways: (1) The CARDS-LEFT field was initialized in the WORKING- STORAGE SECTION with the VALUE 'YES' clause instead of using the initialization statement, MOVE 'YES' TO CARDS-LEFT, in the PROCEDURE DIVISION; (2) The READ statement was placed in a separate module rather than having it appear in two locations in the program--the main routine and the program loop (the PROCESS-CARDS module).

IDENTIFICATION DIVISION. PROGRAM-ID. TEST.

ENVIRONMENT DIVISION. INPUT-OUTPUT SECTION. FILE-CONTROL.

SELECT CARD-FILE ASSIGN TO CDR. SELECT PRINT-FILE ASSIGN TO PTR.

DATA DIVISION. FILE SECTION. FD CARD-FILE

LABEL RECORDS ARE OMITTED. 01 CARD.

05 NAME PICTURE X(20). 05 ADDR PICTURE X(40). 05 PHONE

10 AREA-CODE PICTURE X(3). 10 NUMBR PICTURE X(8).

05 FILLER PICTURE X(9). FD PRINT-FILE

LABEL RECORDS ARE OMITTED. 01 PRINT-RECORD.

05 FILLER PICTURE X(3). 05 NAME PICTURE X(20). 05 FILLER PICTURE X(5). 05 ADDR PICTURE X(40). 05 FILLER PICTURE X(5). 05 PHONE.

10 AREA-CODE PICTURE X(3). 10 FILLER PICTURE X. 10 NUMBR PICTURE X(8).

05 FILLER PICTURE X(10). 05 MESSAGE-AREA PICTURE X(25). 05 FILLER PICTURE X(12).

WORKING-STORAGE SECTION. 01 WARNING-1 PICTURE X(25) VALUE '***** NAME MISSING *****' 01 WARNING-2 PICTURE X(25) VALUE '***** ADDRESS MISSING *****' 01 WARNING-3 PICTURE X(25) VALUE '***** PHONE MISSING *****' 01 CARDS-LEFT PICTURE X(3) VALUE 'YES'.

PROCEDURE DIVISION. OPEN INPUT CARD-FILE

OUTPUT PRINT-FILE. PERFORM READ-CARD. PERFORM PROCESS-CARDS

UNTIL CARDS-LEFT IS EQUAL TO 'NO'. CLOSE CARD-FILE

PRINT-FILE. STOP RUN.

READ-CARD. READ CARD-FILE

AT END MOVE 'NO' TO CARDS-LEFT.

PROCESS-CARDS. MOVE SPACES TO PRINT-RECORD. IF NAME IN CARD IS EQUAL TO SPACES

MOVE WARNING-1 TO MESSAGE-AREA ELSE

MOVE NAME IN CARD TO NAME IN PRINT-RECORD. IF ADDR IN CARD IS EQUAL TO SPACES

MOVE WARNING-2 TO MESSAGE-AREA ELSE

MOVE ADDR IN CARD TO ADDR IN PRINT-RECORD. IF PHONE IN CARD IS EQUAL TO SPACES

MOVE WARNING-3 TO MESSAGE-AREA ELSE

MOVE AREA-CODE IN CARD TO AREA-CODE IN PRINT-RECORD MOVE NUMBR IN CARD TO NUMBR IN PRINT-RECORD.

WRITE PRINT-RECORD.

PERFORM READ-CARD.

FIG. 1. The Recall program.

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 69

In view of the above analysis, subjects in this recall experiment were required to recall the PROCEDURE DIVISION of the COBOL program presented to them with the first three divisions on the first page of the listing and the PROCEDURE DIVISION on the second. Perusal and recall times were determined by pilot tests. Subjects were permitted 90 s to familiarize themselves with the complete program. Referencing only the first three divisions, they were then permitted 10 min to reproduce a functionally equivalent PROCEDURE DIVISION, i.e. the PROCEDURE DIVISION they produced had to be error-free and to result in the same output as the program they had viewed. Subjects were rated by two independent, experienced programmers who were instructed to ensure that the recalled programs performed the same functions as the original program and to regard semantic errors as more serious than syntactic errors. The two scorers achieved 100% agreement in ranking experts and novices. There was also remarkable agreement within categories.

2.3. ANALYSIS OF THE PERFORMANCE OF THE RECALL PRE-TEST

Here, we are concerned with performance of the recall test in determining expertise. Table 1 shows the correspondence among the three classifications used in Vessey's study. The recall and the manager classifications classified eight of the 16 subjects similarly, the recall and ex post classifications also agreed in eight of the 16 cases, while the manager and ex post classifications agreed in 10 of the 16 cases. These three classifications were then assessed against the objective performance criteria of debug time and the number of mistakes subjects made to determine which provided the best expert-novice classification. An A N C O V A model, with verbalization rate as the covariate, was used to test for the effects of the programmer classification and

TABLE 1 Subject classifications based on three methods for distin-

guishing between experts and novices

Expert-novice classification

1002 N E N 1003 E N N 1004 E E E 1005 E N N 1006 N N E 1007 E E E 1008 E E E 1009 N E E 1010 N E N 1011 N N N 1012 E N N 1014 E N E 1015 E E E 1016 N N N 1018 N E E 1019 N N E

Subject Recall Manager Ex post

70 i. VESSEY

bug level on debug time.t The recall model accounted for 30-7% of the variation in debug time, the manager model 36.1%, and the ex post model 73.7%. In addition, the recall classification classed only one of the five programmers who made mistakes as a novice, the manager classification classed four of the five as novices, while the ex post classification classed all five programmers in error as novices. Hence, both the recall and the manager classifications performed badly compared with the ex post classification.

2.4. DISCUSSION OF PROGRAM RECALL AS A MEANS FOR DETERMINING EXPERTISE

There appear to be two major reasons why the recall pre-test was a poor indicator of debug time and the errors subjects made. First, the chunks programmers possess may not be important factors in debugging expertise, i.e. programmers may possess effective debugging strategies that are more important to the debugging process than the programming knowledge they possess. Support for this notion was obtained in this experiment in terms of the performance of the ex post programmer class- ification, which was based on chunking processes rather than the extent of programmers' chunks. Second, the information structures inherent in the program used for recall may not match the chunks stored in programmers' long-term memories. Hence, the lack of a match of program and programmers' knowledge structures could result in degraded performance. The next section examines the possibility that the chunks in the recall program did not match those programmers had stored in long-term memory.

2.5. ANALYSIS OF THE POOR PERFORMANCE OF THE RECALL PRE-TEST

The recall pre-test data was further examined in light of the information structures inherent in the program used for recall and the knowledge structures demonstrated by programmers in performing the recall task, i.e. this section examines the content of programmers' knowledge bases rather than the extent of their knowledge bases.

2. 5.1. Comparison of program and programmer information structures $ Most of the differences in chunks programmers revealed on recall were semantic differences. This would be expected from Brooks' (1977) observations on the relative sizes of programmers' semantic and syntactic knowledge bases.w Of the semantic differences, two were considered significant to recall performance. They are discussed below.

2.5.1.1. Structure of the validation routine. The recall program used in the experiment was a validation routine that tested for the presence of NAME, ADDRESS, and PHONE fields in a name and address file. NAME and ADDRESS are elementary items, while PHONE was a group item made up of the elementary items, AREA-CODE and NUMBR. The output record in the DATA DIVISION

t Since verbalizing is recognized as consuming some mental processing time (Ericsson & Simon, 1980), the test for debug time was controlled for verbalization rate. The location of the bug in the program was also manipulated in this research (see, Atwood & Ramsey, 1978).

:~ Vessey (1986) contains a full description of all semantic and syntactic differences identified in recall.

w (1977) states that programmers possibly possess between 50000 and 100000 semantic knowledge structures and between 100 and 150 syntactic knowledge structures.

PROGRAMMERS CHUNKS AND PROGRAM STRUCTURES 71

consisted of a transaction record area and a message area. The output from the program consisted of the transaction record and the message for the last non-present data item.

It seems likely that experienced programmers would not be familiar with validation routines that overwrite previous messages concerning fields in error, only to produce a message for the last erroneous field. They would most probably expect validation routines to test for all fields and report all those in error (see, for example, Weber, 1982, p. 289). A less desirable approach, though still better than in the original program, would be to produce a message for the first erroneous field.

Postulating that programmers possess a well-defined procedure for validating a record is based on the script approach to knowledge organization (Schank and Abelson, 1977; Bower, Black and Turner, 1979).t Individuals store their knowledge of routine activities, such as visiting the dentist or eating in a restaurant, as scripts. A script is, therefore, an action stereotype for a given event. We postulate, following Weber (1982), that an expert programmer's script for validating a record consists of checking and reporting on all possible data items in error.

Three subjects reconstructed validation routines that did not perform the same function as the program they were requested to recall. They did not perform well in the recall task, therefore, and were classified as novices. Two of the subjects (1010 and 1016) produced routines that printed messages for all erroneous fields, while the third (1018) printed a message for the first erroneous field only. Of these subjects, 1018 was classified as an expert by the ex post classification, while 1016 was classified as a novice. 1010 was a borderline expert-novice.$ Hence these results provide some support for the notion that expert programmers' knowledge structures may not include the information structures inherent in the program used for recall.

2.5.1.2. Validation of the group item, PHONE. PHONE is defined as a group item in the DATA DIVISION. It is also tested as a group item in the validation routine. This means that, for PHONE to be blank, both AREA-CODE and NUMBR must be blank. This unusual treatment is compounded by the fact that the elementary items in the original program are moved separately to their correspond- ing output fields, highlighting the ambiguity about the nature of the data item. The script argument used for analysing the message format, above, applies also to this sub-set of the validation structure.

Six of the 16 subjects did not reproduce coding for PHONE as in the original program. Two subjects (1003 and 1005) fully understood the ambiguity: one MOVEd PHONE IN CARD TO PHONE IN PRINT-RECORD, and then inserted CORRESPONDING before the first PHONE; the other tested for AREA-CODE AND NUMBR. Two subjects (1009 and 1016) MOVEd PHONE IN CARD TO PHONE IN PRINT-RECORD without recognizing the ambiguity in moving a group item to a non-corresponding group item. The fifth subject (1019) tested for

t According to Gioia and Manz (1985), a script is "a procedural knowledge structure or schema for understanding and enacting behaviors".

~t Subject 1010 was classified as a novice by the ex post classification. However, further analysis of debugging strategies revealed that he exhibited expert debugging strategies. The ex post classification was derived from rating scales with half the subjects classified as novices and half as experts. 1010 was ranked close to the cut-off point for each of the three scales used. Hence, the fact that he was classified as a novice by the ex post classification may be an artifact of the rating procedure used in that research to distinguish experts from novices.

72 i. vzssEu

AREA-CODE and NUMBR separately, apparently without realizing that there was no provision for a fourth warning message in WORKING-STORAGE. The sixth subject (1011) reached the point of testing for AREA-CODE OR; he produced no further code.

2. 5.1.3. Critique of the semantic program structures. The two information struc- tures discussed above--the structure of the validation routine and validating a group item--appear to have concerned a number of the programmer subjects. The single, most important factor appears to have been the structure of the validation routine; this routine is the essence of the program, while the validation of PHONE is only a small part of the larger routine. Two of the three subjects who did not reproduce a functionally equivalent version of the validation routine did not complete the recall exercise, perhaps indicating "cognitive dissonance" with the original routine. One of these programmers was classified as an expert by the ex post classification, while the second was shown to possess the debugging strategy of an expert. Of the six subjects who did not produce functionally equivalent code for testing the group item, PHONE, one failed to complete the recall task.

This analysis was substantiated by three independent judges, expert programmers who had a mean of 13 years' professional programming experience and 2 years' experience with COBOL; they knew, on average, six programming languages.t The judges were shown the four program versions used in Experiment II and asked to rate them on the basis of their suitability as validation routines in a business environment. All showed primary concern with the structure of the validation routine, followed by the method of validating PHONE.~:

2. 5. 2. The effect of non-matching knowledge structures on expertise Given the poor results of Vessey's recall test when compared with other measures of expertise and the above analysis of information structures, it appears that the structures in a program used for recall may influence the outcome of recall. We would expect expert programmers to be affected more by non-matching knowledge structures than novice programmers since experts have better defined chunks stored in long-term memory. Novices may be less confused than experts and may perform as well as experts when the program and programmer knowledge structures do not

t Curtis, Sheppard, Kruesi-Bailey, Bailey and Boehm-Davis (1986) suggest that programming expertise is related to "the number of programming languages known and other factors related to the breadth of programming experience", rather than to the length of that experience.

:~ The reasons given for preferring the revised validation structure differed, however. Judge I preferred to see the total record printed first, so that it was easy to see the error field(s) and identify the bad record. However, she preferred to see the record only if it was in error. Judge 2 preferred the revised validation structure because it is "shorter and simpler" than the original version. He regarded the method of validating PHONE in the revised version as "more refined". Judge 3 preferred the revised validation structure because "it is more compact and avoids the repetitive ELSE's. After each IF s t a t ement . . , it's easier for me to follow the logic when the highly similar statements are together as closely (physically) in parallel as possible".

It is apparent, therefore, that these three expert programmers had similar preferences for the four program versions to those presented in Experiment II. However, while they could articulate the problem with the method of validating PHONE, they evaluated the problem with the validation structure subjectively rather than structurally. That is, they evaluated it in terms of how easy the routine was to comprehend rather than in terms of the output messages they expected to see in a validation routine.

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 73

match. Adelson and Soloway (1984) go even further when they state:

"Rechunking a concept is a process which is error prone since it places a substantial strain on both short term memory's storage space and processing capacity. When the structure of the new, to-be-learned, chunk is not coherent the task becomes more difficult. Learning an incoherent chunk is made even more difficult when a meaningful chunk already exists as a representation".

This research takes the more conservative view that the performance of expert programmers vis-gt-vis novice programmers will depend on the levels of experience of the two groups and the extent of the lack of a match between the program information structures and the chunks programmers already possess.

3. Experiment II

The aim of this experiment was to assess empirically the support for the above critique of the information structures inherent in the COBOL validation program. The analysis suggests the following propositions for empirical testing in a further experiment:t

PI The performance of expert programmers in a program recall task exceeds that of novice programmers.

P2 The performance of expert programmers in a program recall task increases as the extent of a match between their knowledge structures and the information structures in the program used for recall increases.

P3 The performance of novice programmers in a computer program recall task is not influenced by the information structures in a program used for recall.

3.1. RESEARCH METHODOLOGY

Since this experiment derives from the prior critique of program information structures, certain aspects of the research procedures and the task materials adopted therefore closely resemble those used in the experiment analysed (Experiment I).

3.1.1. Subjects Thirty-eight expert and forty-two novice programmers, all from Brisbane, Australia, were used to test the above hypotheses. The expert programmers were employed as programming professionals. The novice programmers were two groups of students who were completing courses in COBOL programming at a tertiary institution. In this respect, the group of novice programmers differs from the less experienced professional programmers who were used as novice programmers in Experiment I. This approach was taken in an attempt to better determine the characteristics of expert and novice performance. All subjects were randomly allocated to treatments.

3.1.2. Performance variables This experiment assessed the recall efforts according to the structure of the PROCEDURE DIVISIONs produced, as did the original experiment. However,

t Propositions are identified by the letter P.

74 1. VESSEY

this experiment also used objective measures of recall performance to assess the effects of the four program versions. The number of semantic and syntactic errors subjects committed was used, rather than ranking the programmer subjects according to expert opinion.

3.1.3. Task materials Recall performance for three modified versions of the COBOL program was measured against that for the original program. The original program is shown as Fig. 1. The relationships among the four program versions are shown in Table 2. They are outlined below:

Version 1 is the original COBOL program; it validates PHONE as a group item and prints a message for the last non-present data item.

Version 2 validates PHONE as two elementary items and prints a message for the last non-present data item.

Version 3 validates PHONE as a group item and prints messages for all data items in error.

Version 4 validates PHONE as two elementary items and prints messages for all data items in error.

Versions 2 and 4 have a fourth warning message in the WORKING-STORAGE SECTION. The only other differences in the four versions appear in the PROCESS- CARDS module. The three new versions of this module are shown in Fig. 2(a), (b), (c). Versions 3 and 4, which record error messages for all non-present data items, print the transaction record first, followed by any error messages.

3.1.4. Procedure Since this experiment seeks to explain the results of the previous study (Vessey, 1984 and Experiment I), the procedure used in that study was again employed here. Subjects were permitted 90 s to persue a given program version; they were then requested to reproduce an equivalent PROCEDURE DIVISION within 10 min, i.e. the program had to execute and it had to perform the same function as the original program.

TABLE 2 Program versions used in the current recall study

Structure of the validation routine

Validation of PHONE

As a group item

As an elementary item

Message for last data item in error

Messages for all data items in error

Version 1

Version 3

Version 2

Version 4

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 75

PROCESS-CARDS. MOVE SPACES TO PRINT-RECORD. IF NAME IN CARD IS EQUAL TO SPACES

MOVE WARNING-1 TO MESSAGE-AREA ELSE

MOVE NAME IN CARD TO NAME IN PRINT-RECORD. IF ADDR IN CARD IS EQUAL TO SPACES

MOVE WARNING-2 TO MESSAGE-AREA ELSE

MOVE ADDR IN CARD TO ADDR IN PRINT-RECORD. IF AREA-CODE IN CARD IS EQUAL TO SPACES

MOVE WARNING-3 TO MESSAGE-AREA ELSE

MOVE AREA-CODE IN CARD TO AREA-CODE IN PRINT-RECORD. IF NUMBR IN CARD IS EQUAL TO SPACES

MOVE WARNING-4 TO MESSAGE-AREA ELSE

MOVE NUMBR IN CARD TO NUMBR IN PRINT-RECORD. WRITE PRINT-RECORD. PERFORM READ-CARD.

(a)

PROCESS-CARDS. MOVE SPACES TO PRINT-RECORD. MOVE CORRESPONDING CARD TO PRINT-RECORD. WRITE PRINT-RECORD. MOVE SPACES TO PRINT-RECORD. IF NAME IN CARD IS EQUAL TO SPACES

MOVE WARNING-1 TO MESSAGE-AREA WRITE PRINT-RECORD

IF ADDR IN CARD IS EQUAL TO SPACES MOVE WARNING-2 TO MESSAGE-AREA WRITE PRINT-RECORD.

IF PHONE IN CARD IS EQUAL TO SPACES MOVE WARNING-3 TO MESSAGE-AREA WRITE PRINT-RECORD.

PERFORM READ-CARD.

(b)

PROCESS-CARDS. MOVE SPACES TO PRINT-RECORD. MOVE CORRESPONDING CARD TO PRINT-RECORD. WRITE PRINT-RECORD. MOVE SPACES TO PRINT-RECORD. IF NAME IN CARD IS EQUAL TO SPACES

MOVE WARNING-1 TO MESSAGE-AREA WRITE PRINT-RECORD.

IF ADDR IN CARD IS EQUAL TO SPACES MOVE WARNING-2 TO MESSAGE-AREA WRITE PRINT-RECORD.

IF AREA-CODE IN CARD IS EQUAL TO SPACES MOVE WARNING-3 TO MESSAGE-AREA WRITE PRINT-RECORD.

IF NUMBR IN CARD IS EQUAL TO SPACES MOVE WARNING-4 TO MESSAGE-AREA WRITE PRINT-RECORD.

PERFORM READ-CARD.

FIG. 2. (a),

(c)

main module for version 2 of the Recall experiment. (b), main module for version 3 of the Recall experiment, (c), main module for version 4 of the Recall experiment.

3.2. RESEARCH HYPOTHESES TESTED

Table 3 presents the research hypotheses tes ted in this study. T h e y der ive f rom the

proposi t ions p r e s e n t e d at the beginning of sec t ion 3. These hypo theses re la te to the

effect on exper t and novice p rog ram recall of different va l ida t ion s t ructures and

different m e t h o d s o f test ing the g roup i tem, P H O N E . Reca l l that these two

structures were ident i f ied as undes i rab le f rom the v i ewpo in t of the cogni t ive

76 t. VESSEV

TABLE 3 Hypotheses tested in Experiment II

Proposition Hypothesist

P1

P2

P3

H1 Overall expert vs novice performance E > N for versions 1 to 4

HE1 Overall expert programmer performance version 4 > version 1

HE2 Expert programmer performance on different validation structures (a) version 3 > version 1 (b) version 4 > version 2 (c) versions 3 + 4 > versions 1 + 2

HE3 Expert programmer performance on PHONE validation (a) version 2 > version 1 (b) version 4 > version 3 (c) versions 2 + 4 > versions 1 + 3

HN1 Overall novice programmer performance version 4 = version 1

HN2 Novice programmer performance on different validation structures (a) version 3 = version 1 (b) version 4 = version 2 (c) versions 3 + 4 = versions 1 + 2

HN3 Novice programmer performance on PHONE validation (a) version 2 = version 1 (b) version 4 = version 3 (c) versions 2 + 4 = versions 1 + 3

t Hypotheses are identified by the letter H. Except for the general hypothesis HI, they are presented in pairs for expert and novice programmers. They are, therefore, further identified by E for expert and N for novice. In alI instances, the "greater than" sign (>) is used to denote superior performance of the variables on the left of the inequality compared with the variables on the right of the inequality.

matching of the knowledge structures in the p rogram and those possessed by p rogrammers .

3.3. DATA ANALYSIS

D a t a were analysed in two steps. First, the data were analysed qualitatively to de te rmine the extent to which subjects recalled the p rog ram versions they had originally viewed. Second, the recalled programs were assessed according to the n u m b e r of semant ic and syntactic errors their authors made . A doubly mult ivariate analysis of var iance model was fitted to the data for the number of semant ic and syntactic errors p rog rammers commi t t ed since Pearson p roduc t m o m e n t corre la t ion coefficients indicated the dependen t variables were modera te ly correlated.

3.3.1. The qualitative data analysis The subjects ' recall responses were analysed for changes in the recalled val idat ion and phone structures. Table 4 shows the numbers of different message formats

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES

TABLE 4

Validation structure responses

Expert Novice

Version Last First All Last First All

1 (7)? 1 2 (8) 2 - - 2 (9) - - - - (5) 2 4 3 2 1 (7) 4 2 (4) 4 1 1 (7) 2 3 (6)

t The figures in brackets represent the numbers of subjects who re- produced the versions they were shown. Hence, versions 1 and 2 produced a message for the last non-present data item, while versions 3 and 4 produced messages for all non-present data items.

77

produced by subjects, according to the original program version they viewed. Three of the 19 expert programmers who were shown the less desirable validation structure (versions 1 and 2) produced more desirable validation structures (versions 3 and 4). However, five of the 19 experts who were shown the more desirable validation structure reverted to a less desirable structure on recall, three of them to the least desirable structure that produced a message for the last non-present data item, and two to the version that produced a message for the first non-present data item. The difference between the number of expert programmers changing from a more to a less desirable validation structure and vice versa was not significant according to the Chi-square test for differences in probabilities.

Table 4 also shows that eight of the 21 novice programmers who were shown the less desirable validation structure produced routines with bet ter validation struc- tures. However , 11 of the 21 who were shown the more desirable validation structure reverted to less-desirable validation structures, six to the least-desirable last-message format and five to the first-message format. The difference between the number of novice programmers changing from a more- to a less-desirable validation structure and vice versa was not significant according to the Chi-square test.

More novices changed the structure of the validation routine they were to recall than experts (19/42 compared with 8/38). This result is statistically significant according to the Chi-square test for differences in probabilities (the normal distribution is used for a one-sided test; T1 = 2-284; p = 0-012; Conover, 1980). However, the proport ions of expert "and novice programmers changing from a worse to a better validation structure and vice versa were quite similar (3/19 compared with 5/19 and 8/21 compared with 11/21, for ratios of 0.60 and 0.73, respectively).

The results for P H O N E validation are mixed. One expert subject who received version 1 showed that he fully understood the problem when he tested for both A R E A - C O D E A N D PHONE, essentially the same as testing for P H O N E as a group item. A second subject with version 1 committed an error in MOVEing PHONE TO P R I N T - R E C O R D ; this was an error exhibited also by novices. Two expert subjects with version 4 used the test, IF P H O N E , following their tests for A R E A - C O D E . The use of the group item, P H O N E , rather than N U MBR appeared to be a casual error.

78 I. VESSEY

Novices showed remarkably similar patterns of deviations from the phone version they were shown. None of the 10 novices who viewed version 1 reproduced that version. Five novices did not reproduce sufficient code to determine their approach to phone validation. Three subjects validated PHONE as required by this version, but moved PHONE instead of moving AREA-CODE and NUMBR individually to non-equivalent data items. One subject MOVEd CARD TO PRINT-RECORD, when he probably intended to MOVE CORRESPONDING CARD. The remaining subject used the statement:

IF AREA-CODE IN CARD = SPACES OR

NUMBR IN CARD = SPACES,

apparently realizing that there was only one message for what were, in fact, two elementary data items. Of the 11 novices who received version 2, three produced the version they had viewed, while four others did not reproduce sufficient code to determine the phone construct they were using. A further four subjects attempted to use the MOVE CORRESPONDING construct to move input data items to output data items. Three subjects omitted the CORRESPONDING, while the fourth subject MOVEd CARD TO CORRESPONDING PRINT-RECORD, instead of MOVEing CORRESPONDING CARD. The only other divergence from the versions presented occurred for version 4. One novice tested for AREA-CODE and then tested for PHONE rather then NUMBR, as did two experts.

3. 3. 2. The multivariate model Subjects' recall efforts were scored according to the number of semantic and syntactic errors they made. The total numbers of each type of error were recorded, with the exception of the syntactic error, qualification. There are two reasons for recording errors in qualification as one syntactic error. First, subjects who made errors in qualification may have recorded from three to 16 errors, depending on the program version, for what was essentially the same error (version 3 and version 2, respectively). Second, subjects ranged substantially in their knowledge of qualifica- tion, so that including the total number of syntactic errors may have led to confounding with the effects under investigation. An incomplete or missing program statement was scored as one semantic error.

The multivariate model permitted the two main effects, program version (P) and expertise (E), and their interaction (P x E) to be investigated. Since Box's M statistic, which measures the homogeneity of the variance-covariance matrices, was significant for the model with four program versions and two levels of expertise, the dependent variables were transformed using the formula log 10 (X + 1) (see Kirk, 1968). The transformed variables then complied with the assumptions of the MANOVA model and were used throughout the ensuing analysis. Table 5 shows the means and standard deviations for the performance measures for the logarithmic transformation of each program version-expertise combination. The Appendix contains the means and standard deviations for the untransformed variables.

3.3.2.1. The analytical procedure used. A procedure described by Messmer and Homans (1980) was used to determine which criterion measure produced sig- nificant effects. This procedure involves a series of step-down tests, first entering one

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 79

TABLE 5 Means and standard deviations for dependent measures

(logarithmic transformations)

Errors Expert Novice

Version 1 Semantic 0.306 (0-161) 0.892 (0.281) Syntactic 0.168 (0.257) 0.405 (0-345)

Version 2 Semantic 0.358 (0.320) 0-727 (0.348) Syntactic 0.220 (0.298) 0.431 (0.331)

Version 3 Semantic 0.181 (0.324) 0.635 (0.398) Syntactic 0.208 (0.286) 0.561 (0.314)

Version 4 Semantic 0.332 (0.313) 0.748 (0.389) Syntactic 0~254 (0.339) 0.439 (0.279)

criterion variable, then entering a second with the first as a covariate, and so on. To use the procedure, it is necessary to have an a priori ordering of the importance of the criterion variables. This study used the Messmer-Homans procedure with semantic errors selected as the more important dependent variable, based on two premises: first, that the study hypotheses test semantic rather than syntactic constructs; and, second, that semantic errors, in general, are more crucial to expertise in programming, since by definition syntactic errors are errors that compilers can detect (Shneiderman and McKay, 1976). The data were recoded to fit a one-way ANOVA model so that an overall measure of significance could be obtained from the Messmer-Homans procedure. Since the individual tests of significance for each criterion variable are related, the Bonferroni inequality was used to calculate the individual statement levels of significance that must be attained (Neter & Wasserman, 1974). Using a 0.05 family level of significance, the individual statement level of significance for two dependent variables is 0-025. Similarly, the individual statement level of significance for a family significance level of 0.01 is 0.005. Factors responsible for producing effects on the criterion variables were determined using pairwise contrasts. 3.3.2.2. Results of hypothesis testing. Tables 6, 7, 8, and 9 present the results of testing the hypotheses stated in section 3.2. The following sections describe the results:

(a) Testing the performance of experts and novices in program recall Hypothesis H1 tested expert-novice performance in recall across all four program versions. Table 6 presents the results. There is an expertise effect for semantic errors across all program versions. Further, novice programmers made significantly more semantic errors than expert programmers for each of the program versions (see, also, Table 5 and the Appendix). There is no effect for syntactic errors once the effect of semantic errors has been removed. Hence, hypothesis H1 is supported.

80 I. VESSEY

TABLE 6 Results of testing the effects of expertise on program recall

Semantic Syntactic with errors semantic errors

Overall 0.000t 0-335 Experts

Versions 1, 2 0.735 0.748 Versions 1, 3 0.403 0.686 Versions 1, 4 0.867 0.561 Versions 2, 3 0.251 0.942 Versions 2, 4 0.867 0.800 Versions 3, 4 0.327 0.852

Novices Versions 1, 2 0.263 0.734 Versions 1, 3 0-090 0.183 Versions 1, 4 0.327 0.699 Versions 2, 3 0.530 0.294 Versions 2, 4 0.885 0.962 Versions 3, 4 0.442 0.316

Version 1-expertise 0.000t 0.258 Version 2-expertise 0.0165 0.259 Version 3-expertise 0.003t 0.047 Version 4-expertise 0.0075 0.363

t Significant at the 0.01 family level. $ Significant at the 0-05 family level.

(b) Testing the effect of program knowledge structures on expert programmer recall

This group of hypotheses tested the effect of matching program and programmer knowledge structures on the recall performance of expert programmers. Hypothesis HE1 compared the performance of expert programmers on the best and the worst program versions based on the preceding analysis. Table 7 presents the results. Program version did not significantly affect the number of semantic or syntactic errors experts made (p = 0.861 and 0.558, respectively).

Hypothesis HE2 assessed the effect of the validation structure on expert recall performance. This was achieved by comparing performance on versions 1 and 3, 2 and 4, and 1 + 2 and 3 + 4. Table 8 presents the results. None of the measures for program version was significant for expert programmers (p = 0.387 and 0.629; p = 0-873 and 0.815; p = 0.467 and 0.640, respectively). Hence subhypotheses HE2 (a), (b) and (c) are not supported.

Hypothesis HE3 assessed the effect on expert recall performance of validating PHONE as a group item rather than two elementary items. The hypothesis was tested by comparing performance on versions 1 and 2, 3 and 4, and 1 + 3 and 2 + 4. Table 9 presents the results. Program version did not significantly affect either the number of semantic or number of syntactic errors expert programmers made (p =0.712 and 0.692; p =0.366 and 0.993; p =0.350 and 0.692, respectively). Hence, none of the subhypotheses HE3 (a), (b), and (c) was supported.

pROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 81

TABLE 7 Results of testing the effects of matching knowledge structures

on program recall

Semantic Syntactic with errors semantic errors

Overall 0.000t 0.455 Experts-version 0-861 0-558 Novices-version 0.306 0.760 Version 1-expertise 0.000t 0-234 Version 4-expertise 0.006t 0-304

t Significant at the 0-01 family level. :~ Significant at the 0-05 family level.

(c) Testing the effect of program knowledge structures on novice p rogrammer recall

This set of hypotheses tested the proposit ion that novice p rogrammers are not affected by the information structures incorporated in a program when they perform a recall task. Hypothes is HN1 compared the performance of novice programmers under the least and the most favorable conditions according to the previous analysis of desirable knowledge structures. Table 7 presents the results. There were no significant differences in the performance of novices for the two program versions

TABLE 8 Results of testing the effects of validation structure on recall

performance

Semantic Syntactic with errors semantic errors

Versions 1, 3 0.000t 0-139 Experts-version 0.387 0.629 Novices-version 0-082 0.144

Version 1-expertise 0-000t 0.479 Version 3-expertise 0.003t 0-097

Versions 2, 4 0.011t 0.519 Experts-version 0.873 0-815 Novices-version 0.890 0-955

Version 2-expertise 0.0235 0.225 Version 4-expertise 0-011:~ 0.307

Versions 1 + 2, 3 + 4 0-000t 0.077 Experts-version 0.467 0-640 Novices-version 0.279 0.331

Version 1 + 2-expertise 0.000t 0.117 Version 3 + 4-expertise 0.000t 0.045

t Significant at the 0.01 family level. :~ Significant at the 0-05 family level.

82 I. VESSEY

TABLE 9

Results of testing the effects of PHONE validation on recall performance

Semantic Syntactic with errors semantic errors

Versions 1, 2 0.000t 0.206 Experts-version 0.712 0.692 Novices-version 0.226 0-955

Version 1-expertise 0.000t 0.092 Version 2-expertise 0.0115 0.116

Versions 3, 4 0-0115 0.519 Experts-version 0-366 0.993 Novices-version 0-478 0-232

Version 3-expertise 0-0085 0-134 Version 4-expertise 0-0145 0-657

Versions 1 + 3, 2 + 4 0.000t 0-108 Experts-version 0.350 0.692 Novices-version 0.802 0.627

Version 1 + 3-expertise 0-000t 0.028 Version 2 + 4-expertise 0.000t 0.127

(p =0.306 and 0.760 for semantic and syntactic errors, respectively). Hence hypothesis HN1 is supported.

Hypothesis HN2 assessed the effect of the validation structure on novice recall performance, by comparing the number of errors produced with three sets of program pairs. Table 8 presents the results. None of the measures for program version was significant for novice programmers (p = 0.082 and 0.144; p = 0.890 and 0.995; p = 0.279 and 0.331, respectively). Hence subhypotheses HN2 (a), (b), and (c) are supported.

Hypothesis HN3 assessed the effect on novice recall performance of validating a group item rather than two elementary items by comparing performance on three sets of program pairs. Table 9 presents the results. None of the measures for program version was significant for novice programmers (p =0.226 and 0.955; p = 0.478 and 0.232; p = 0.802 and 0.627, respectively). Hence subhypotheses HN3 (a), (b), and (c) are supported.

3.4. DISCUSSION OF RESULTS

Several observations can be made based on the qualitative results. First, the fact that more novices changed the structure of the validation routine they recalled than experts indicates that novice programmers found greater difficulty in abstracting the essence of the program function in the time permitted for perusal. Second, the qualitative analysis revealed that the content of expert and novice knowledge structures are quite similar since similar proportions of experts and novices changed from poorer to better and from better to poorer validation structures.

pROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 83

Third, as many programmers changed to less desirable validation structures as changed to more desirable structures (16/40 compared with 11/40). The reason for this apparent anomaly appears to lie in the fact that programmers do not possess the desirable validation structure in long-term memory. Hence, expert programmers do not appear to possess well-defined (i.e. generally accepted) knowledge structures of the kind examined in this research.

Fourth, the results for the validation of the PHONE structure show that novices were more troubled by the structure than were experts, although the pattern of divergences from the original version was similar for experts and novices. Only two subjects, one expert and one novice, both with version 1, changed the structure of the PHONE validation procedure presented to them. It is impossible to state whether or not they were aware of this fact. Other divergences from the procedure presented occurred when subjects moved PHONE as a group item (three novices compared with one expert, all with version 1), avoided the movement of PHONE by using or trying to use the MOVE CORRESPONDING statement (four novices all with version 2), and tested for PHONE following a test for AREA-CODE (two experts and one novice, all with version 4).

The quantitative results show that expert programmers perform better in a program recall task than novice programmers, the difference being manifested in the semantic errors committed. This research therefore supports, in essence, the prior research conducted by Shneiderman and McKeithen et al. that showed that experts recall more of a meaningful program than do novices.t There were no differences in the number of semantic and syntactic errors committed by expert or novice programmers between the worst and the best program versions examined, the two validation structures, or the two methods of validating PHONE. These were the results predicted for novices and they are supported by novice performance in this recall experiment. Experts were expected to perform better as the information structures in the recalled program matched better the chunks programmers were expected to possess. Hence, the quantitative results support the qualitative analysis--expert programmers' knowledge does not appear to be organized in long-term memory in terms of well-defined program structures of the kind investigated here.

4. Experiment I I I

Since the second experiment did not differentiate among the chunks possessed by expert and novice programmers, a third experiment was conducted. This time the programmers were requested to cons t ruc t a validation routine to fulfill a similar function to the program used in the previous experiments. This strategy removed the constraints imposed in the previous experiments where subjects were requested to reproduce the information structures they had already viewed in the program.

t There is a basic difference between this recall experiment and those seeking to replicate the chess results, oiz. all program versions used in this experiment were meaningful (i.e. executable) programs.

Strictly speaking, this research assessed the accuracy of recall in testing for the number of semantic and syntactic errors programmers made, while the Shneiderman (1976) and McKeithen et al., (1981) studies assessed qualitatively the extent of program recall. However, the recall efforts in this research were scored so that a missing program statement was regarded as a semantic error. Hence, the results of this experiment can be compared with those of Shneiderman and McKeithen et al.

84 ~. VESSEY

The construction approach, therefore, permitted maximum flexibility in investigat- ing programmers' knowledge structures.

4.1. SUBJECTS

The subjects in this experiment were four doctoral students at the University of Minnesota, who had completed or were about to complete preliminary examina- tions. The subjects had a mean of 5 years' professional programming experience (ranging from 2.5 to 8 years) and a mean of 2 years' COBOL programming experience.

4.2. TASK

The first three subjects were given the first three divisions of the original COBOL program up to and including the heading for the WORKING-STORAGE SECTION. They were not given the WORKING-STORAGE SECTION as this would have constrained them, at least in validating PHONE. The fourth subject was even less constrained than the first three subjects in that he received only the input record format of the DATA DIVISION. This subject, then, was free to determine his own message format and method of presentation.

4.3. PROCEDURE

Subjects were requested to write an appropriate PROCEDURE DIVISION based on the information presented to them, and anything else they wished to add to the WORKING-STORAGE SECTION. The instructions stressed the importance of constructing a PROCEDURE DIVISION that would be suitable for business practise.

All subjects were requested to complete a detailed debriefing questionnaire concerning their choice of validation format, the structures they considered, and the structures they considered should be produced.

4.4. RESULTS

The results for the four subjects are discussed in turn.

4. 4.1. Subject 1 This subject produced a message for the last non-present data item. However, he indicated via the questionnaire that:

(a) another "possibil i ty. . . which leads to better output is to print each input record followed by zero, one, or more error messages";

(b) he considered a message for all items in error; (c) "all error messages should be produced".

During debriefing, he further indicated that he had produced the code he had because the experimenter had indicated that it was not a difficult problem. However, it would have been equally as simple to produce code for the first non-present data item as for the last, for example.

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 85

4. 4. 2. Subject 2 This subject produced a very complex routine with a number of switches and counts of valid and invalid records. From the viewpoint of validation structure, she produced a routine that wrote a generic error message for any field in error--"This record has a missing value".

This subject, unique to this group, produced a non-standard PHONE validation structure. She wrote:

IF AREA-CODE OF CARD = SPACES OR NUMBR OF CARD = SPACES

M O V E . . .

The routine as presented in the original experiment and versions 1 and 3 of Experiment II, used:

IF PHONE EQUALS SPACES M O V E . . .

which, in effect, is the same as testing for:

IF AREA-CODE EQUALS SPACES AND NUMBR EQUALS SPACES

M O V E . . .

This subject possibly realized that AREA-CODE would not have been essential on all occasions, though her interpretation did not adequately deal with the possibility of NUMBR being absent.

4. 4. 3. Subject 3 This subject produced a message for the first non-present data item. He gave two reasons:

(a) he thought the requirement of the experiment was for one error message only;

(b) he "considered messages for all items in error, but could not see how that could be supported with X(25). [Had it been 9(9), one digit could have been stored per error]".

Hence, his overriding concern was. to produce a single line of output per record.

4. 4. 4. Subject 4 This subject had greater flexibility than the preceding three subjects in that he could determine the structure of his own output format and was therefore free to present the error message in whatever way he chose. Similar to the original output record, this subject constructed a message section at the end of the output record format--this time of 38 characters, rather than 25. He wrote an asterisk in the output record for each missing input data item. In the message area, he wrote: "ASTERISKED ITEMS ARE MISSING ON INPUT".

The factor that motivated the subject to choose this format was the "desire not to produce multiple lines of output for a single input record". He further indicated that

86 i. VESSEY

messages should be produced for all items in error, and that is, in effect, what his routine accomplished.

4.5. DISCUSSION

When left to their own initiatives, four programmers with considerable practical experience, validated multiple input items in four different ways:

(a) one subject produced a message for the last non-present data item (subject 1); (b) a second subject produced a message for the first non-present data item

(subject 3); (c) a third subject produced a message applicable to all non-present data items

(subject 4); while (d) the fourth subject produced a generic message that a data item was missing

(subject 2).

These diverse observations provide further support for the results of Experiments I and II. Further, one subject validated PHONE in a manner difficult to explain.

5. Discussion and conclusions

What, then, are the implications of the results of these experiments for a concept of programmer expertise based on the storage and retrieval of well-defined chunks of programming knowledge in long-term memory?

First, the results of all three experiments reported here and the results of the three judges' evaluation of the four program versions investigated in Experiment II (see footnote on p. 72), all indicate that expert programmers do not have well-defined concepts of what a validation routine should do, i.e. they do not appear to possess well-defined scripts for validating a record. It is apparent, however, that expert programmers do possess large bodies of knowledge (see, for example, Brooks, 1977). There have been (almost) innumerable suggestions made regarding the size and organization of programmers' knowledge bases. Programmers' knowledge may be organized as: (a) a large store to stereotypic patterns; (b) a moderate store of conceptual building blocks; or (c) a small store of data episode forms (Pennington, 1982). Probably more important than size is the organization of those knowledge structures, i.e. the bases on which the knowledge structures are formulated. We must determine whether, for example, knowledge structures are based on specific content, procedures, scripts, programming structures, plan knowledge, functions, data structure, data flow, control flow, algorithms, and so on. In this research, Experiment II showed that scripts did not prove to be a fruitful way of determining the knowledge structures programmers possess.

Second, a direct implication of the apparent lack of a script for validating records is the normative question of whether programmers ought to have a script for validation in long-term memory. This raises the issue of what we need to teach in our training courses and/or how effective our training methods are.

Third, the structure of our programming languages may contribute to the diversity of knowledge structures programmers possess. Malhotra, Thomas, Carroll and Miller (1980), for example, claim that current programming languages are so non-specific that they provide no guidance to programmers as to which knowledge

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 87

structures are more appropriate than others. They state:

"Programming languages and access methods have been designed to be as general and as powerful as possible. One can do literally anything with them and they do little to constrain, suggest or facilitate good, appropriate or preferred data structures and algorithms. Thus, the software designer can create a bewildering array of potential pieces from which the program can be put together... This diversity of potential intermediate components and the difficulty of evaluation makes software design difficult and the designer often picks intermediate structures on arbitrary bases. Familiarity is probably the most important of these. Furthermore, comparing competing data structures or algorithms is difficult and it is usually possible to find several examples that all perform adequately (pp. 129-130).t

Hence, with the current trend in programming languages, the possibility of ever forming a standard set of building blocks for computer programming as, for example, Alexander, Ishikawa, Silverstein, Jacobson, Fiksdahl-King and Angel (1977) have done for architectural design, is very remote. Rather than supporting the proliferation of programming languages per se, we would be well advised to focus the direction of programming language evolution by identifying psychologi- cally meaningful cognitive components and supporting them in our programming languages (see, for example, Soloway, Bonar and Ehrlich, 1983).

Fourth, the results of the first experiment suggest that programming knowledge per se may not play a particularly important role in debugging. The success of the ex post classification of subjects in predicting debugging time and success suggests that processes or strategies may have a greater impact on debugging performance. Further, subsequent analysis of programmers' debugging processes showed that the formulation of a model of how the program functioned and the role of the error in that context was essential to successful debugging (Vessey, 1984). Brooks (1977) also provides evidence that programmers exhibit this type of behavior. If, following Malhotra et al. (1980), our programming languages permit the formulation of almost any information structure, then programmers engaged in debugging programs they have not authored must first assimilate the information structures contained in the program. Hence, the processes they use to assimulate that knowledge be more important than the knowledge structures they possess prior to examining the program.

In summary, this research substantiates previous research by Shneiderman (1976) and McKeithen et al. (1981) that expert programmers perform better on a program recall task than novice programmers. It appears likely, however, that programmers possess a variety of knowledge structures in long-term memory. They do not appear to possess well-defined scripts for validating records. Further, it appears that debugging strategies may have a greater influence on debugging performance than the chunks of programming knowledge programmers possess.

References

ADELSON, B. & SOLOWAY, E. (1984). Methodology revisited: The cognitive underpinnings of programmers' behavior. In SALVENDY, G. Ed., Human-Computer Interaction, pp. 181-184. Amsterdam: Elsevier.

t See also Floyd (1979).

88 I. VESSEY

ALEXANDER, C., ISHIKAWA, S., SILVERSTEIN, M., JACOBSON, M., FIKSDAHL-KING, I. & ANGEL, S. (1977). A Pattern Language: Towns, Buildings, Construction. New York: Oxford University Press.

Aa-wooD, M. E. & RAMSEY, H. R. (1978). Cognitive structures in the comprehension and memory of computer programs: An investigation of computer program debugging. NTIS, AD-A060522/0.

BARFIELD, W. (1986). Expert-novice differences for software: Implications for problem- solving and knowledge acquistion. Behaviour and Information Technology, 5, 15-29.

BOWER, G. H., BLACK, J. B. &; TURNER, T. J. (1979). Scripts in memory for text. Cognitive Psychology, 11, 177-220.

BROOKS, g. E. (1977). Towards a theory of the cognitive processes in computer program- ming. International Journal of Man-Machine Studies, 9, 737-751.

BROWN, G. P., CARLING, R. T., HEROT, C. F., KRAMLICH, D. A. & SOUZA, P. (1985). Program visualization: Graphical support for software development. Computer, 18, 27-35.

CHASE, W. G. and SIMON, H. A. (1973a). Perception in chess. Cognitive Psychology, 4, 55-81.

CHASE, W. G. and SIMON, H. A. (1973b). The mind's eye in chess. In CHASE, W. G. Ed., Visual Information Processing. New York: Academic Press.

CONOVER, W. J. (1980). Practical Nonparametric Statistics (2nd edn). New York: Wiley. CURTm, B., SHEPPARD, S. B., KRUESI-BAILEY, E., BAILEY, J. & BOEHM-DAvIS, D. (1986).

Experimental evaluation of software specification formats. Journal of Systems and Software (forthcoming).

EGAN, D. E. & SCHWARTZ, B. J. (1979). Chunking in recall of symbolic drawings. Memory and Cognition, 7, 149-158.

ENGLE, R. W. & BUKSTEL, L. (1978). Memory processes among bridge players of differing expertise. American Journal of Psychology, 91, 673-689.

ERICSSON, K. A. & SIMON, H. A. (1980). Verbal reports as data. Psychological Review, 87, 215-251.

FLOYD, R. W. (1979). The paradigms of programming. Communications of the ACM, 22, 455-460.

GIOIA, D. A. & MANZ, C. C. (1985). Linking cognition and behavior: A script processing interpretation of vicarious learning. Academy of Management Review, 10, 527-539.

de GROOT, A. D. (1965). Thought and Choice in Chess. The Hague: Mouton. de GROOT, A. D. (1966). Perception and memory vs. thought: Some old ideas and recent

findings. In KLEINMUNTZ, B. Ed., Problem Solving: Research, Method, and Theory. New York: Wiley.

KIRK, R. E. (1968). Experimental Design: Procedures for the Behavioral Sciences. Belmont, California: Brooks.

LARKIN, J. H., McDERMOTr, J., SIMON, D. P. & SIMON, H. A. (1980). Models of competence in solving physics problems. Cognitive Science, 4, 317-345.

MALHOTRA, A., THOMAS, J. C., CARROLL, J. M. & MILLER, L. A. (1980). Cognitive processes in design. International Journal of Man-Machine Studies, 12, 119-140.

McKEITHEN, K. B., REITMAN, J. S., RUETER, H. H. & HIRTLE, S. C. (1981). Knowledge organization and skill differences in computer programmers. Cognitive Psychology, 13, 307-325.

MESSMER, D. J. & HOMANS, R. E. (1980). Methods for analyzing experiments with multiple criteria. Decision Sciences, 11, 42-57.

NETER, J. & WASSERMAN, W. (1974). Applied Linear Statistical Models. Homewood, Illinois: Irwin.

NORMAN, D. A. (1978). Notes toward a theory of complex learning. In LESGOLD, A. M. PELLIGRINO, J. W., FOKKEMA, S. & GLASER, R. Eds., Cognitive Psychology and Instruction. New York: Plenum Press.

PENNINGTON, N. (1982). Cognitive components of expertise in computer programming: a review of the literature. Technical Report No. 46, University of Michigan.

PROGRAMMERS' CHUNKS AND PROGRAM STRUCTURES 89

RAEDER, G. A. (1985). A survey of current graphical programming techniques. Computer, 18, 11-25.

REILLV, R. et al. (1975). The use of expert judgment in the assessment of experimental learning. CAEL Working Paper No. 10, 1975.

REITMAN, J. S. (1976). Skilled perception in Go: Deducing memory structures from interresponse times. Cognitive Psychology, 8, 336-356.

SCHANK, R. C. & ABELSON, R. B. (1977). Scripts, Plans, Goals, and Understanding. Hillsdale, New Jersey: Lawrence Erlbaum Associates.

SHNEIDERMAN, B. (1976). Exploratory experiments in programmer behavior. International Journal of Computer and Information Sciences, 5, 123-143.

SHNEIDERMAN, B. (1977). Measuring computer program quality and comprehension. International Journal of Man-Machine Studies, 9, 465-478.

SHNEIDERMAN, B. & McKnv, D. (1976). Experimental investigations of computer program debugging and modification. Proceedings of the 6th International Congress of the International Ergonomics Association.

SLOBODA, J. A. (1976). Visual perception of musical notation: Registering pitch symbols in memory. Quarterly Journal of Psychology, 28, 1-16.

SOLOWAY, E. S., BONAR, J. & EHRLICH, K. (1983). Cognitive strategies and looping constructs: An empirical study. Communications of the A CM, 26, 853-860.

VESSEY, I. (1984). An investigation of the psychological processes underlying the debugging of computer programs. Doctoral dissertation, University of Queensland.

VESSEV, I. (1985). Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies, 23, 459-494.

VESSEY, I. (1986). Expert-novice knowledge organization: An analysis and empirical investigation of computer program recall. Behaviour and Information Technology (forthcoming).

WEBER, R. A. (1982). EDP Auditing. New York: McGraw-Hill. YOURDON, E., GANE, C., SARSON, Z. t~ LISTER, T. R. (1976). Learning to Program in

Structured COBOL. New York: Yourdon, Inc.

Appendix: means and standard deviations for dependent variables

Errors Expert Novice

Version 1 Semantic 1-400 (1-578) 8.200 (5-224) Syntactic 0.800 (1-549) 2-400 (2-914)

Version 2 Semantic 1.889 (2-086) 6-091 (5-594) Syntactic 1.111 (1-764) 2-455 (2.464)

Version 3 Semantic 1.100 (2-283) 4.800 (3.824) Syntactic 1.000 (1-491) 3.600 (3-658)

Version 4 Semantic 1.778 (2.489) 7-091 (7-582) Syntactic 1.444 (2-351) 2.273 (1.954)


Top Related