CONSTRUCT VALIDITY OF ASSESSMENT CENTRES: LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS (31st Annual ACSG Conference, March 2011)

Slide 1

CONSTRUCT VALIDITY OF ASSESSMENT CENTRES: LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS (31st Annual ACSG Conference, March 2011)

Slide 2

What is known about construct validity currently:
- Over the last 50 years, assessment centres (ACs) have been popular for assessing individual differences for managerial development purposes.
- Multi-occupation, multi-company investigations show high face validity.
- Exercise effects in AC post-exercise dimension ratings (PEDRs) are more pervasive than cross-situational stability in candidate ratings (Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimension and exercise factors on assessment center ratings. Journal of Applied Psychology, 91, 1114-1124).
- Lance, Lambert, Gewin, Lievens, & Conway (2004) found in a meta-analysis that exercise effects explain almost three times more variance than dimension effects.
- This is problematic for construct validity: PEDRs are then a function of exercise design and not of person competencies.

Slide 3

What is known about construct validity currently:
- Recently there have been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory.
- FOUR basic models within the CFA tradition:
  - Correlated dimensions, correlated exercises (CDCE) model: MTMM
  - One-dimension, correlated exercises (1DCE) model
  - Uncorrelated dimensions, correlated exercises, plus g (UDCE + g) model
  - Correlated dimensions, correlated uniqueness (CDCU) model
- Lance, Woehr & Meade (2007). A Monte Carlo investigation of assessment center construct validity models. Organizational Research Methods, 10(3), 430-448.

Slide 4

Advantages of the CFA approach:
- It partitions out error variance and ALSO partitions out exercise effects; PEDRs are thus modelled as a function of both exercise and dimension effects.
- However, the CDCE (also called CTCE) model is technically difficult to fit (empirical under-identification).
- Construct validity is a prerequisite before partitioning out exercise effects.
- Thus a critical first step was to assess the construct validity of the dimensions with actual DAC data.

Slide 5

An example: Achievement Motivation and Financial Perspective

Slide 6

An example: Achievement Motivation (AM)

Dimension: Achievement Motivation
Exercises: Analysis Problem (AP), Simulated In-Basket (SIB)
Traits and PEDRs:
- Innovation: IN_AP
- Energy: EN_AP
- Process Skills: PS_AP, PS_SIB

Slide 7

Correlation Matrix [matrix not reproduced in the transcript]

Slide 8

AM: Option 1. A CDCE model would be preferable. WHY? It differentiates the sources of variance. [Path diagram: an Achievement Motivation dimension factor plus Analysis Problem and Simulated In-Basket exercise factors loading on the PEDRs.]

Slide 9

- Empirical under-identification: we have 13 parameters to estimate in the model, yet only 10 pieces of information in the covariance matrix.
- Thus we have too many model parameters to gauge with too little information (-3 df).
- This is similar to the equation X + Y = 6: there are unlimited possible combinations that solve it. [Path diagram of the under-identified model not reproduced in the transcript.]
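To make the counting argument concrete, here is a minimal sketch (not part of the original presentation) of the identification bookkeeping; the parameter counts passed in come from Slides 9 and 11, while the helper function itself is illustrative:

```python
def model_df(n_indicators: int, n_free_parameters: int) -> int:
    """Degrees of freedom = unique pieces of information in the
    covariance matrix of the PEDRs minus free model parameters."""
    pieces_of_information = n_indicators * (n_indicators + 1) // 2
    return pieces_of_information - n_free_parameters

# Slide 9: four PEDRs give 10 pieces of information, but the model
# needs 13 free parameters, so df = -3 (under-identified, like
# solving X + Y = 6 for both unknowns).
print(model_df(4, 13))   # -3
# Slide 11: the global-method-effect variant needs 12 parameters
# against the same 10 pieces of information, so df = -2.
print(model_df(4, 12))   # -2
```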
Slide 10

AM: Technical problems
- The Simulated In-Basket measures only one dimension (trait), Process Skills, whereas Innovation, Energy, and Process Skills are all gauged with the Analysis Problem exercise.
- For a basic CFA we need at least three indicators for each dimension; however, if we have a single dimension and a single exercise effect, we need a minimum of five indicators.
- This has DAC design implications if we want to gauge the measurement (exercise) effect in addition to the dimension effects.
- A literature review by Lievens and Conway (2001) suggests a median of three exercises and five dimensions.

Slide 11

AM: Option 2. [Path diagram: Achievement Motivation plus a global method effect.] There are still not enough degrees of freedom; we need at least 5 indicators (10 available pieces of information, yet 12 parameters to estimate, thus -2 df). SOLUTION: include more exercises per dimension.

Slide 12

Financial Perspective (FP)

Dimension: Financial Perspective
Exercises: Analysis Problem (AP), Group Discussion (GD), One:One (ONE), Simulated In-Basket (SIB)
Traits and PEDRs:
- Broker Market (BM): BM_AP, BM_GD, BM_ONE, BM_SIB
- Cross Up Selling (CUS): CUS_AP, CUS_GD, CUS_ONE, CUS_SIB
- Profit (PROF): PROF_AP, PROF_GD, PROF_ONE, PROF_SIB

Slide 13

Correlation Matrix: LARGE CORRELATIONS BETWEEN EXERCISES [matrix not reproduced in the transcript]

Slide 14

FP: CTCE. [Path diagram: trait factors Broker Market, Cross Up Selling, and Profit plus exercise factors Analysis Problem, Group Discussion, One:One, and Simulated In-Basket, loading on the twelve PEDRs BM_AP through PROF_SIB.]

Slide 15

FP: CDCE
- The model did not converge although there were enough df (78 - 44 = 34 df).
- Singularity problems, chiefly because of multicollinearity.
- Go back to the dimension level only, without exercise effects: thus Broker Market, Cross Up Selling, and Profit individually.

Slide 16

CFA: Broker Market: FIT [fit indices not reproduced in the transcript]

Slide 17

CFA: BM: Parameter estimates
- Thus BM showed good fit and parameter estimates.
- Broker Market in the Simulated In-Basket was the best predictor of Broker Market.
- All factor loadings were statistically significant (p < [value cut off in the source]).

[Slides 18 to 25 are missing from the source transcript.]

Slide 26

Construct validity
- For G-study construct validity, the person, dimension, and person*dimension variance components must collectively exceed the exercise and person*exercise effects.
- Consider a practical DAC example with G-Theory: N = 372, nine dimensions with mostly two exercises each (Simulated In-Basket and Role Play).

Slide 27

A practical example

Exercises: SIB, Role Play, Interview
Dimensions:
- Change Orientation
- Communication
- Customer Service Orientation
- Interpersonal Interaction
- Planning & Organizing
- Problem Analysis & Decision-making
- Self-Management
- Team Management

Slide 28

A practical example: Variance components for the entire DAC [table not reproduced in the transcript]

Slide 29

A practical example: Important note
- In SPSS, for the ANOVA and MINQUE methods, negative variance component estimates may occur. Some possible reasons for their occurrence are: (a) the specified model is not the correct model, or (b) the true value of the variance equals zero.
- In light of the foregoing example: exercise effects accounted for .108 of the variance and person effects for .322; relative to the dimension effects, the exercise effects explained 2.9 times more variance.
- This finding seems to be in line with Lance et al.'s (2004) contention that method effects are about three times larger than trait effects.
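As a bridge to the dimension-level analyses that follow, here is a minimal sketch (not from the presentation) of how variance components from a crossed person x dimension x exercise G-study combine into a relative G coefficient for persons. Apart from the person component (.322) quoted above, every component value below is a hypothetical placeholder, and `relative_g` is an illustrative helper, not the authors' code:

```python
def relative_g(var_p: float, var_pd: float, var_pe: float,
               var_pde: float, n_d: int, n_e: int) -> float:
    """Universe-score variance over itself plus relative error variance.

    For relative decisions only the interactions involving persons
    contribute to error, each divided by the number of conditions
    averaged over (n_d dimensions, n_e exercises). Main effects such
    as the exercise component (.108 on the slide) enter absolute,
    not relative, error."""
    rel_error = var_pd / n_d + var_pe / n_e + var_pde / (n_d * n_e)
    return var_p / (var_p + rel_error)

# Nine dimensions and (mostly) two exercises, as in the practical
# example; the interaction components here are placeholders.
g = relative_g(var_p=0.322, var_pd=0.04, var_pe=0.15, var_pde=0.10,
               n_d=9, n_e=2)
print(round(g, 3))
```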
Slide 30

A practical example: Variance components for selected dimensions
- However, could it be that the G-study on the entire DAC ironed out some of the robust dimension effects at the sub-dimension level? That is, are we throwing out the good with the bad?
- To investigate the relative contribution of each dimension to the overall G-coefficient, one could conduct a forward G-analysis at the individual dimension level.
- However, when we calculate the coefficient at the subscale level, there will be no variance component for the dimension, dimension*exercise, dimension*person, or dimension*person*exercise effects.
- The biggest problem with this approach is that it cannot compare person*dimension variance with person*exercise variance, since no person*dimension variance component is generated.
- However, it is still possible to compare person variance with person*exercise variance.

Slide 31

A practical example: Variance of Communication [table not reproduced in the transcript]

Slide 32

A practical example: Variance of Team Management [table not reproduced in the transcript]

Slide 33

Final verdict: G-study and DAC
- Investigate dimensions individually to assess the contribution of the different sources of variance.
- Poorly designed dimensions may inflate the observed variance attributed to exercise, exercise*dimension, and exercise*person effects.
- The way G-studies are conducted has design implications for DACs: an all-versus-some approach to design.

Slide 34

IRT ANALYSIS
- Previously we noted that there have recently been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory.
- A fairly new area: IRT modelling with interval data.
- Consider the Achievement Motivation example discussed earlier.

Slide 35

IRT approach
- The logistic model dictates that a respondent's response to an item should depend on two parameters only: the difficulty of endorsing the item (item location parameter) and the respondent's standing on the latent trait (person location parameter).
- The expectation is that persons with a higher standing on the latent trait should have a higher probability of endorsing a particular item than persons with a lower standing on the same trait.
- This is a key requirement for DACs, since their central aim is to discriminate between persons who are low and those who are high on the trait (dimension).
- Deviations from these expectations might suggest that the DAC exercises are not operating as expected.
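Before turning to the rating-scale diagnostics on the next slide, a minimal sketch (not from the presentation) of the Andrich rating-scale model may help: it shows how ordered thresholds generate category probabilities, and how one can check that each category becomes modal somewhere along the latent continuum. The threshold values are hypothetical, and `category_probs` is an illustrative helper:

```python
import math

def category_probs(theta: float, delta: float, taus: list) -> list:
    """Andrich rating-scale model: probabilities of the response
    categories given person location theta, item location delta,
    and thresholds taus (tau_1..tau_m for m+1 categories)."""
    # Cumulative numerators in log form; category 0 has an empty sum.
    logits = [0.0]
    total = 0.0
    for tau in taus:
        total += theta - delta - tau
        logits.append(total)
    expvals = [math.exp(v) for v in logits]
    s = sum(expvals)
    return [v / s for v in expvals]

# Hypothetical ordered thresholds for a 5-point scale. Categories are
# indexed 0-4 here, whereas the slides label them 1-5. On a
# well-behaved scale the modal category climbs steadily with theta.
taus = [-3.0, -1.0, 1.0, 3.0]
for theta in (-4, -2, 0, 2, 4):
    probs = category_probs(theta, delta=0.0, taus=taus)
    print(theta, max(range(len(probs)), key=probs.__getitem__))
```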
Slide 36

Rating scale
- The current DAC was rated on a 5-point response scale with non-integer values (i.e. decimal values).
- Common wisdom holds that more response categories give a more reliable measure that resembles interval data. However, it remains to be seen whether people actually make distinctions between the response categories.
- It is expected that the thresholds between the 5 response categories will be sequentially ordered along the latent trait.
- We can examine the graphed category response functions to see whether each of the 5 response categories becomes the modal category at some point on the latent trait continuum.

Slide 37

Empirical response categories for IN_AP [figure not reproduced in the transcript]

Slide 38

Empirical response categories for EN_AP [figure not reproduced in the transcript]

Slide 39

Empirical response categories for PS_AP [figure not reproduced in the transcript]

Slide 40

Empirical response categories for PS_SIB [figure not reproduced in the transcript]

Slide 41

Empirical response categories (item difficulty measure of -1.13 added to measures):

CATEGORY                        OBSVD   SAMPLE  INFIT  OUTFIT  STRUCTURE  CATEGORY
LABEL    SCORE  COUNT    %      AVRGE   EXPECT  MNSQ   MNSQ    CALIBRATN  MEASURE
1        1          1    1      -4.95   -4.41   .29    .10     NONE       (-9.04)
2        2         27   54       -.60    -.43   .55    .62     -6.82       -3.31
3        3         15   30       1.89    1.67   .72    .42      2.46        2.28
4        4          7   14       3.55    3.31   .68    .60      4.36       (4.43)
5        5          1    1                                     NONE

(The OBSERVED AVERAGE is the mean of the measures in a category; it is not a parameter estimate.)

Slide 42

Response scales
- What we see here is that, although there are supposed to be 5 response categories, raters effectively make use of three response categories when rating PEDRs.
- Furthermore, person reliability is not very good. Person reliability estimates the confidence we have that people would be allocated to the same rank order if they were exposed to the Achievement Motivation DAC again.
- This is similar to the person*dimension effect in G-studies.

Slide 43

Fit statistics: summary of 97 measured (non-extreme) persons

        TOTAL                    MODEL   INFIT         OUTFIT
        SCORE   COUNT  MEASURE   ERROR   MNSQ   ZSTD   MNSQ   ZSTD
MEAN     34.6     8.0      .66     .64   1.01    -.2   1.05    -.1
S.D.      6.0      .0     2.32     .16    .94    1.3   1.29    1.3
MAX.     48.0     8.0     5.09    1.29   5.40    3.4   8.32    4.2
MIN.     19.0     8.0    -8.77     .45    .10   -2.5    .08   -2.7

REAL RMSE .76; TRUE SD 2.19; SEPARATION 2.88; PERSON RELIABILITY .89
MODEL RMSE .66; TRUE SD 2.22; SEPARATION 3.39; PERSON RELIABILITY .92
S.E. OF PERSON MEAN = .24
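For readers unfamiliar with the infit and outfit mean-squares reported above and in the next table, here is a minimal sketch (not from the presentation) of how they are formed from residuals: outfit is the unweighted mean of squared standardized residuals, while infit weights each squared residual by its statistical information and is therefore less dominated by a few surprising responses. The dichotomous example data are hypothetical:

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Rasch fit mean-squares from observed scores, model-expected
    scores, and model variances (all arrays of equal length)."""
    sq_resid = (observed - expected) ** 2
    outfit = (sq_resid / variance).mean()      # mean squared std. residual
    infit = sq_resid.sum() / variance.sum()    # information-weighted
    return infit, outfit

# Hypothetical dichotomous responses: expected = p, variance = p(1 - p).
# The last response (success on a hard item) is the surprising one,
# so outfit rises well above 1 while infit is pulled up less.
p = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
x = np.array([1, 1, 0, 0, 1])
print(infit_outfit(x, p, p * (1 - p)))
```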
Slide 44

Fit statistics: person and item parameters. Item statistics in misfit order:

ENTRY   TOTAL  TOTAL          MODEL  INFIT        OUTFIT       PT-MEASURE  EXACT MATCH
NUMBER  SCORE  COUNT  MEASURE  S.E.  MNSQ  ZSTD   MNSQ  ZSTD   CORR.  EXP.  OBS%  EXP%  ITEM       G
8        590     98     -.24   .17   1.48   2.7   2.45   4.3   A .61  .74   53.6  64.1  TRANS_SIB  0
6        632     98     1.63   .14   1.24   1.4   1.14    .6   C .80  .81   62.9  62.0  TRANS_EN   0
7        545     98    -1.53   .14    .95   -.3    .95   -.1   D .80  .78   57.7  56.6  TRANS_PS   0
5        545     98    -1.53   .14    .92   -.5    .83   -.6   d .81  .78   58.8  56.6  TRANS_IN   0
MEAN    426.4   98.0     .00   .18   1.00   -.1   1.05   -.1                65.7  64.4
S.D.    154.5     .0    1.52   .03    .29   1.9    .58   2.0                 8.2   5.4

Thus, from the high ZSTD infit statistics in this table, we can see that PS_SIB underestimates the expected item scores.

Slide 45

Expected item characteristic curves: PS_SIB [figure not reproduced in the transcript]

Slide 46

Expected item characteristic curves: EN_AP [figure not reproduced in the transcript]

Slide 47

[no content captured in the transcript]

Slide 48

Expected item characteristic curves: PS_SIB [figure not reproduced in the transcript]

Slide 49

Validation problems of DACs. If the SEM approach is preferred, the empirical considerations are (see the sketch at the end of this transcript):
- At least 5 exercises per dimension for a unidimensional construct and a single exercise effect.
- If the 1DCE approach is used with multiple sub-dimensions, then at least 3 exercises per sub-dimension are needed.
- Multiple raters for each dimension.
- Sample size > 150.
- A minimum of a 5-point rating scale.

Slide 50

Validation problems of DACs. Substantive considerations:
- The theoretical underpinnings of DAC dimensions: are we really measuring more than fluid intelligence (g) in DACs?
- Have we considered discriminant and convergent validity outside the MTMM doctrine, e.g. cross-validation with paper-and-pencil measures?
- Rater calibration: higher inter-rater agreement may come at the expense of restriction of range and construct validity.

Slide 51

PEDRs lie at the heart of the problem: what are we rating? Competency potential? Competence? Observable behaviour?

Slide 52

PEDRs lie at the heart of the problem: what are we rating?
- If we are proposing to measure competency potential, would it not be better to use paper-and-pencil measures with more control (standardisation) and objectivity?
- When designing exercises to measure AC dimensions, what is the constitutive meaning of the proposed dimensions (e.g. Creative Thinking and Entrepreneurial Energy)?
- Why not cross-validate AC constructs with known constructs? For example: Empowering Leadership (DAC) and Trans...
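As referenced on the "Validation problems of DACs" slide, the empirical design requirements can be collected into a simple checklist. The sketch below (not from the presentation) merely encodes the thresholds listed there; `check_dac_design` is a hypothetical helper and the example arguments are illustrative:

```python
def check_dac_design(exercises_per_dimension: int,
                     uses_sub_dimensions: bool,
                     raters_per_dimension: int,
                     sample_size: int,
                     rating_points: int) -> list:
    """Return the list of design requirements a proposed DAC fails."""
    problems = []
    # 1DCE with sub-dimensions needs >= 3 exercises per sub-dimension;
    # a unidimensional construct plus a single exercise effect needs >= 5.
    required = 3 if uses_sub_dimensions else 5
    if exercises_per_dimension < required:
        problems.append(f"need >= {required} exercises per (sub-)dimension")
    if raters_per_dimension < 2:
        problems.append("need multiple raters per dimension")
    if sample_size <= 150:
        problems.append("need a sample size > 150")
    if rating_points < 5:
        problems.append("need at least a 5-point rating scale")
    return problems

# The Achievement Motivation example (two exercises, N = 97) would
# fail several of these checks.
print(check_dac_design(2, False, 1, 97, 5))
```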