

International Journal of Osteopathic Medicine 14 (2011) 43–47


Masterclass

Diagnostic reliability in osteopathic medicine

Nicholas Lucas a,*, Nikolai Bogduk b

a Screening and Test Evaluation Program, Edward Ford Building, Sydney School of Public Health, University of Sydney, Sydney NSW, Australia
b Department of Clinical Research, Royal Newcastle Centre and University of Newcastle, Australia

Article info

Article history: Received 17 November 2010; Accepted 21 January 2011

Keywords: Osteopathic medicine; Osteopathy; Reliability; Diagnosis

* Corresponding author. E-mail address: [email protected] (N. Lucas).

1746-0689/$ – see front matter © 2011 Published by Elsevier Ltd. doi:10.1016/j.ijosm.2011.01.001

Abstract

In order to apply an effective treatment we must first know how to identify those who will and will not likely respond to that treatment. Determining the accuracy and reliability of diagnostic tests used in osteopathy is therefore a high priority. Diagnostic research in osteopathy is far reaching, as diagnosis impacts treatment choice, prognosis, referral, and patient monitoring. The accuracy and reliability of diagnostic tests also impact the selection of patients for participation in clinical trials and can be a source of misclassification bias. This masterclass provides a brief overview of diagnostic research and then explains in more detail the methodology, statistical analysis and quality appraisal of diagnostic reliability studies.

© 2011 Published by Elsevier Ltd.

1. Introduction

Many of those within and outside the profession have called for, or wished for, a larger evidence base for osteopathy. While early research in osteopathy focussed on the pathophysiology of somatic dysfunction and the physiological effects of osteopathic treatment,1 this research could only support arguments for why osteopathy 'should' work, rather than being able to show that osteopathy 'does' work.

In order to answer the question, "does osteopathy work?", the randomised controlled trial has been used and recommended.1–3

As a profession, we entertained the idea that if we were to compare 'osteopathic treatment' with 'no osteopathic treatment', then the positive results would demonstrate the value of our profession and the work we do with patients on a daily basis. After all, we know that our patients respond to our treatment because they tell us so, and we observe their improvement.

This optimistic expectation of randomised controlled trials in osteopathy has not eventuated as we might have hoped, and many research findings seem to be incongruent with the everyday experience of osteopaths. As a result, the effectiveness of the RCT as a way to measure outcomes in osteopathy has been questioned.4

While I would argue that RCTs do have their place in osteopathy, it is true that this methodology is not as straightforward to implement as early adopters thought. There are many factors which can complicate the supposedly simple RCT, and which can threaten the validity of the results.5–8

Of the many factors that can influence an RCT, one of the most fundamental is that of diagnosis.9 Patients are selected or excluded from participation in an RCT based on the outcome of a diagnostic test for the condition of interest. If the test correctly identifies those with and without the condition, then we know that the subjects enrolled in the RCT have been appropriately selected. If the test, however, is not able to correctly identify a person as either having the condition or not, then this inaccuracy will compromise the results of the RCT before it is even begun.

Clearly, then, it is important to know how well a diagnostic test can identify those with, and without, the condition of interest. This is not only important for clinical studies, of course, but for the safe and effective practice of osteopathy. The reach of diagnostic research is broad, and the potential ramifications are both daunting and exciting. It is for these reasons that understanding diagnostic research is important for the development of the profession.

The purpose of this masterclass is to provide a brief overview of diagnostic research and to provide an argument for the particular importance of diagnostic reliability studies in osteopathy. This is then followed by a more detailed explanation of the methodology, statistical analysis and quality appraisal of diagnostic reliability studies.

2. Diagnostic test requirements

The fundamental requirements of a diagnostic test are that it be both accurate and reliable.10 Accuracy and reliability each represent a different diagnostic concept, and each is investigated using different research methodology.11

Table 1. The structure of a 2 × 2 contingency table. When matched, the dichotomous results of two phenomena generate four cells: a, b, c, and d.

                        Examiner 1
                        Yes       No
Examiner 2   Yes        a         b         a + b
             No         c         d         c + d
                        a + c     b + d     N = a + b + c + d

2.1. Diagnostic accuracy

A diagnostic test should measure what it is supposed to measure.9,12 For example, if a test is supposed to measure blood pressure, then it should measure blood pressure. If a test is supposed to measure joint range of motion, then it should measure joint range of motion. For many tests, however, this is not so straightforward.

For example, the standing flexion test is supposed to measure the movement of the sacroiliac joint, but perhaps it is measuring the movement of the soft tissues overlying the joint?13,14 Similarly, palpation of the cranium is supposed to measure movement of the cranial bones or restrictions in tissues not directly in contact with the palpating hands and fingers, yet perhaps palpation is measuring something else?15

The only way to assess the accuracy of the test is to compare it to a "gold standard", or what is now referred to as the reference standard. The reference standard is the test believed to be the most accurate, or most direct, measure of the index condition. For example, an MRI might indicate an area of haemorrhage in the brain of a stroke victim, whereas the haemorrhage is directly confirmed at autopsy. The autopsy is the most direct measure of the haemorrhage and is therefore the reference standard. If the MRI is as good, or nearly as good, as autopsy at identifying patients who do and do not have a brain haemorrhage, then MRI is said to be 'accurate'.

A problem arises when there is no accepted reference standard by which to confirm the accuracy of a diagnostic test. For example, if there is no accepted reference standard available to measure restriction in the movement of the sacroiliac joints, then there is nothing with which to compare a physical examination test that is also supposed to measure sacroiliac joint movement. Similarly, if there is no accepted reference standard to measure restriction in the movement of the cranial bones, then there is nothing by which to verify the palpatory tests that are supposed to measure such restriction. This problem is common in osteopathy, and it has been difficult to investigate the accuracy of many of the tests used in everyday clinical practice.

2.2. Diagnostic reliability

A diagnostic test should be reproducible in the same patient by two or more independent examiners, or at the very least, by the same examiner on two separate occasions. These are referred to as inter-examiner and intra-examiner reliability. The principle here is that for a sign to be clinically relevant, two independent examiners should be able to agree on its presence or absence. If a diagnostic test does not satisfy this basic requirement, then it is said to be unreliable.

Unreliable tests are problematic for a number of reasons. First, the results cannot be relied upon for the identification of a diagnosis, which has ramifications for both individual patients and the enrolment of participants in RCTs. Even if a test were accurate, too much variability in the results may make it unsuitable for use in clinical practice. Second, unreliable tests cannot be relied upon to form a prognosis. Again, the variability in the test results reduces the confidence the clinician can place in predictions that are made on the basis of the test. Third, a patient who is labelled with an incorrect diagnosis can suffer adverse psychological effects that may contribute to their sense of being unwell. Fourth, an unreliable diagnosis carries the potential to either administer an ineffective treatment or fail to prescribe the correct treatment. Finally, an unreliable test cannot be used to monitor patient progress, which precludes the test from being used in clinical trials and in clinical practice.

For these reasons, it is clear that we should use tests that are reliable, and avoid those that are unreliable. In order to determine if a test is reliable, the test must be subjected to a study of its reliability. Unlike diagnostic accuracy, studies of diagnostic reliability are not based on a comparison between the test and a reference standard. Instead, the comparison is made between the results of two or more independent examiners, or between two or more results obtained by the same examiner on the same patient. Studies of diagnostic reliability, therefore, can be a useful tool in the development of an evidence base for osteopathy.

2.3. How is the reliability of a diagnostic test calculated?

Reliability, which is also known as agreement, is often presumed to be a simple calculation of the number of occasions that two examiners agree on a given set of test results. Whilst this approach does provide one answer to examiner agreement, it fails to take into account the agreement that would occur between examiners by chance alone. Since we are not interested in chance agreement, but in real agreement beyond that of chance, it is necessary to use statistical tests to calculate reliability that specifically control for chance agreement.

The type of statistical test used depends upon the type of data generated by the diagnostic test. For continuous data, such as joint range of motion or blood pressure, the intraclass correlation coefficient (ICC) is used.16 For diagnostic tests that generate categorical data, such as 'positive' or 'negative', or 'mild', 'moderate', or 'severe', the kappa statistic is used.

In osteopathy, many diagnostic tests generate categorical data, and in many instances continuous data are reduced into categories for ease of reporting. For example, leg length discrepancy is a continuous variable that is measured in millimetres; however, leg length discrepancy is often rated as being either 'absent' or 'present', which is categorical. Similarly, a radial fissure in a lumbar intervertebral disc is a continuous variable, whereas it is the categorical grade of the fissure (grade 1, 2, 3 or 4) that is used for diagnostic purposes.17 For this reason, categorical data and the kappa statistic will be the focus of this masterclass.
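To make this reduction concrete, the following minimal Python sketch converts continuous leg length measurements into the categories a kappa analysis would use. It is purely illustrative: the 8 mm cut-off, the variable names and the measurements are our own assumptions and are not taken from this article.

# Hypothetical illustration: reducing a continuous measure (leg length
# discrepancy in millimetres) to a dichotomous category before a
# reliability analysis. The 8 mm cut-off is an arbitrary assumption.

def categorise_discrepancy(discrepancy_mm, threshold_mm=8.0):
    """Label a leg length discrepancy as 'present' or 'absent'."""
    return "present" if abs(discrepancy_mm) >= threshold_mm else "absent"

measurements_mm = [2.0, 9.5, 11.0, 4.5, 7.9]                  # continuous data
categories = [categorise_discrepancy(m) for m in measurements_mm]
print(categories)   # ['absent', 'present', 'present', 'absent', 'absent']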

In order to measure the reliability of a test, two examiners are invited to use the test to examine the same sample of subjects. Each examiner independently decides for each subject whether the condition (or sign) is present or not. After the examination is concluded, the results of the two examiners are tabulated for the purpose of calculating kappa (Table 1).

Referring to Table 1, N represents the total number of subjects in the sample, which is the sum of cells a, b, c, and d. The row and column totals represent the total number of patients reported as positive and negative by each examiner. Thus, examiner 2 recorded positive results in a + b subjects, and negative results in c + d subjects. Meanwhile, examiner 1 recorded positive results in a + c subjects and negative results in b + d subjects. These totals, however, do not tell us anything about how well the two examiners agreed on individual patients. For this information we must look at the values in each cell.

Table 2. Hypothetical results of two doctors' examination findings for occipital neuralgia tabulated in a contingency table.

                            Examiner 1
                            Positive   Negative
Examiner 2   Positive       57         19         76
             Negative       11         13         24
                            68         32         100

2.3.1. Examiner agreement

The value, a, is the number of subjects in whom both examiners agreed that the test was positive. The value, d, is the number of subjects in whom both examiners agreed that the test was negative. Therefore cells a and d represent the total observed agreement between the two examiners. By dividing the total observed agreement by the total number of subjects, (a + d)/N, one can calculate the percentage agreement (Po).

2.3.2. Examiner disagreement

The value, b, is the number of subjects in which examiner 1 found the test to be negative, while examiner 2 found the test to be positive. The value, c, is the number of subjects in which examiner 1 found the test to be positive while examiner 2 found the test to be negative. Therefore, cells b and c represent the total observed disagreement between the two examiners, which can be reported as the percentage disagreement when divided by the total number of subjects, (b + c)/N.
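As a sketch of how these cells and proportions might be computed in practice, the following Python fragment tabulates a, b, c and d from two examiners' dichotomous findings and reports the observed agreement (a + d)/N and disagreement (b + c)/N. The findings themselves are invented for illustration.

# Sketch: tabulating the cells of Table 1 from two examiners' dichotomous
# findings (True = positive, False = negative); the data are invented.
examiner_1 = [True, True, False, True, False, False, True, True]
examiner_2 = [True, False, False, True, False, True, True, True]

pairs = list(zip(examiner_1, examiner_2))
a = sum(e1 and e2 for e1, e2 in pairs)                 # both positive
b = sum((not e1) and e2 for e1, e2 in pairs)           # examiner 1 negative, examiner 2 positive
c = sum(e1 and (not e2) for e1, e2 in pairs)           # examiner 1 positive, examiner 2 negative
d = sum((not e1) and (not e2) for e1, e2 in pairs)     # both negative
n = a + b + c + d

observed_agreement = (a + d) / n        # Po
observed_disagreement = (b + c) / n     # (b + c)/N
print(a, b, c, d, observed_agreement, observed_disagreement)   # 4 1 1 2 0.75 0.25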

2.3.3. Controlling for chance agreement

When making categorical decisions, there is a probability that the examiners will agree simply by chance. For example, if one examiner were to assess the stiffness of a joint and decide if it were restricted or not, and another examiner were to simply toss a coin, there would still be some agreement between the two examiners' results. It is for this reason that the observed agreement between examiners is an incomplete measure of reliability.18,19 The principle at hand is that the true strength of a test lies not in its apparent agreement, but in its agreement beyond chance. Two or more examiners might score well simply by chance, but only a good test would secure agreement beyond chance alone.20

For example, if the total observed agreement between examiners was 70%, yet they agreed by chance in 40% of cases, then their score should be discounted by 40%, leaving a score of 30% agreement above chance. In this example, the total available range of agreement beyond chance is 100% − 40%, which is 60%. Since the examiners achieved only 30% agreement beyond chance, their true acumen is 30%/60%, which amounts to 50% true agreement, not the original observed agreement of 70% (Fig. 1).

In order to determine the reliability of a test, a calculation of the true agreement discounted for chance is required. This calculation, known as kappa, is derived from a contingency table such as presented in Table 1. An illustrative example will be used to demonstrate the calculation of kappa, derived from data presented in Table 2.

Fig. 1. A dissection of agreement. For an ideal test, complete agreement would occur in all cases. In reality, there will be observer agreement in a proportion of cases, and disagreement in the remainder. The observed agreement, however, has two components: the agreement due to chance alone, and the agreement beyond chance. Of the total possible agreement, there is a range beyond chance alone in which agreement could have been achieved. The strength of the test lies not in its observed agreement, but in the extent to which the test finds cases in the range of available agreement beyond chance alone, i.e. x/y.

Note how, on the average, examiner 2 recorded positive results in (a + b)/N subjects (expressed as a proportion). Therefore, when asked to review the (a + c) subjects that examiner 1 recorded as positive, one would expect that, on the average, examiner 2 would find (a + b)/N of these cases as positive. Therefore, the number of cases (a*) in which the observers would agree by chance as being positive is:

a* = [(a + b)/N] × (a + c), or 0.76 × 68 = 51.68

Similarly, on the average, examiner 2 recorded negative results in (c + d)/N subjects. Therefore, when asked to review the (b + d) subjects that examiner 1 recorded as negative, one would expect that, on the average, examiner 2 would record (c + d)/N of these subjects as negative. Therefore, the number of cases (d*) in which the examiners would agree by chance as being negative is:

d* = [(c + d)/N] × (b + d), or 0.24 × 32 = 7.68

The total number of cases in which the examiners would be expected to agree by chance is (a* + d*), and the agreement rate (expressed as a decimal fraction) will be (a* + d*)/N. The available range of agreement beyond chance is therefore

1 − (a* + d*)/N, which is 1 − (51.68 + 7.68)/100, or 1 − 0.59 = 0.41

The observed agreement, however, is (a + d), and the percentage agreement is (a + d)/N. The difference between observed agreement and chance agreement is

(a + d)/N − (a* + d*)/N, or 0.70 − 0.59 = 0.11

The true reliability of the test, therefore, is the proportion of the cases in the range available beyond chance that the examiners agreed upon, and will be:

Reliability = {[(a + d)/N] − [(a* + d*)/N]} / {1 − [(a* + d*)/N]}, or (0.70 − 0.59)/0.41 = 0.27
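The worked example can be reproduced in a few lines of Python, shown below as a minimal sketch using the Table 2 counts (the variable names are ours). Note that the article rounds the intermediate chance agreement to 0.59; carrying full precision gives a kappa of approximately 0.26 rather than 0.27.

# Worked example from Table 2: a = 57, b = 19, c = 11, d = 13, N = 100.
a, b, c, d = 57, 19, 11, 13
n = a + b + c + d                    # 100

po = (a + d) / n                     # observed agreement: 0.70

a_star = (a + b) / n * (a + c)       # chance agreement on positives: 0.76 * 68 = 51.68
d_star = (c + d) / n * (b + d)       # chance agreement on negatives: 0.24 * 32 = 7.68
pe = (a_star + d_star) / n           # 0.5936 (rounded to 0.59 in the text)

kappa = (po - pe) / (1 - pe)         # about 0.26 (0.27 with the text's rounded figures)
print(round(po, 2), round(pe, 4), round(kappa, 2))   # 0.7 0.5936 0.26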


2.4. The kappa statistic (k)

The preceding calculation introduces the kappa statistic,18 which in its simplified form is mathematically expressed as:

k = (Po − Pe) / (1 − Pe)

where Po is the percentage of observed agreement; Pe is the expected (chance) agreement; and (1 − Pe) is the available range of agreement beyond that of chance.

The advantage of kappa is that in one figure it expresses the numerical value of the agreement beyond chance. The available range of kappa is −1.0 to +1.0, with +1.0 being perfect agreement, 0 being chance agreement, and −1.0 being perfect disagreement.

The various ranges of kappa have been described by a series of adjectives that translate the numerical value into a qualitative judgement (Table 3)21; however, the choice of adjectives is arbitrary. Whereas some investigators might prefer to describe a score between 0.8 and 1.0 as "excellent", others might prefer to describe this as "acceptable". Nevertheless, the kappa statistic serves its purpose in summarising agreement into one figure that can be used to evaluate the reliability of a test.
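As a small illustration of how Table 3 might be applied, the sketch below maps a kappa value to the Landis and Koch descriptor for its band; the function name is our own.

# Sketch: the Landis and Koch descriptors reproduced in Table 3.
def describe_kappa(kappa):
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(describe_kappa(0.27))   # 'fair', using the worked example above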

2.4.1. Limitations of kappa

Kappa suffers from a variety of problems when the frequency of the clinical sign or index condition in the study sample is too high or too low.22–26 In this regard, Cicchetti and Feinstein,23 and Lantz and Nebenzahl,26 suggest that authors should report the observed agreement and the observed disagreement in order to give the kappa statistic context, and argue that reported on its own, kappa does not provide enough useful information. Also, for such circumstances, certain adjustments can be applied to kappa, such as the bias- and prevalence-adjusted kappa or weighted kappa.22
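For a dichotomous test, the prevalence- and bias-adjusted kappa (PABAK) of Byrt et al.22 reduces to a simple function of the observed agreement. The sketch below illustrates that single adjustment under that assumption (PABAK = 2 × Po − 1 for a 2 × 2 table), using the Table 2 counts; it is not a full treatment of the adjusted statistics.

# Sketch: prevalence- and bias-adjusted kappa (PABAK) for a 2 x 2 table,
# which reduces to 2 * Po - 1 in the dichotomous case (Byrt et al.).
a, b, c, d = 57, 19, 11, 13          # Table 2 counts
n = a + b + c + d
po = (a + d) / n                     # observed agreement: 0.70

pabak = 2 * po - 1                   # 0.40
print(po, pabak)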

The original description of kappa is that of Cohen,18 and various authors have described modifications to the original statistic for different needs and for use in different circumstances.19 In addition, the version described by Fleiss can be used to calculate kappa for multiple examiners.27
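Where a library implementation is preferred, the manual calculation can be cross-checked; the sketch below assumes scikit-learn is available and uses its cohen_kappa_score for the two-examiner case, with per-subject ratings reconstructed from the Table 2 counts. For multiple examiners, implementations of Fleiss' kappa exist in other statistical packages, though none is shown here.

# Sketch: cross-checking the manual kappa with a library implementation.
# Assumes scikit-learn is installed; the ratings are rebuilt from Table 2.
from sklearn.metrics import cohen_kappa_score

examiner_1 = [1] * 57 + [0] * 19 + [1] * 11 + [0] * 13   # 57 both positive, 19 only examiner 2 positive,
examiner_2 = [1] * 57 + [1] * 19 + [0] * 11 + [0] * 13   # 11 only examiner 1 positive, 13 both negative

print(round(cohen_kappa_score(examiner_1, examiner_2), 2))   # about 0.26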

3. Critical appraisal of studies of diagnostic reliability

Prior to accepting the results of a reliability study, it is important to assess its quality and applicability. There are a number of essential criteria for reliability studies which, if absent, make the study invalid, even though other aspects of the study may have been well conducted. These criteria have been published in a quality appraisal tool for studies of reliability (QAREL) and are described below.11

Table 3. Qualitative descriptors used to describe the meaning of kappa.a

Value of k
0.81–1.0     Almost perfect
0.61–0.80    Substantial
0.41–0.60    Moderate
0.21–0.40    Fair
0.00–0.20    Slight
<0.00        Poor

a From Landis and Koch.21

One must initially consider if the study was standardised or non-standardised. In a standardised study, the test is investigated under ideal conditions with certain variables controlled. In a non-standardised study, the test is investigated as it is used in everyday clinical practice, with little or no intervention by the researchers. If the subjects are a highly selected group, or if the examiners are experts with intense training, then this may reduce the likelihood that typical practitioners will perform at the same level of reliability as reported in the paper. The issue here is whether you should expect to get the same results in your clinic as those reported in the paper.

3.1. Examiner blinding

In studies of inter-examiner reliability, the examiners should be blinded to the findings of other examiners, as this knowledge may influence their findings. In studies of intra-examiner reliability, the examiners should be prevented from recognising subjects that they have previously examined, as remembering their initial findings may influence their subsequent findings.

Examiners should be blinded to the results of any reference standard tests that have been conducted, or other clinical data not intended to form part of the test under investigation. Examiners should also be blinded to additional cues that were not intended to form part of the test procedure. Such cues might include recognisable pain behaviour or identifying features such as scars, tattoos, hair colour, and voice accent. Knowledge of clinical data or other non-clinical cues may influence the examiners' findings, such that their findings are due not to the test, but to the other data they have been provided with.

3.2. Order of examination

The order in which subjects are examined should be varied. In studies of intra-examiner reliability, this can be achieved by randomising the order in which subjects are examined in the test and re-test conditions. In studies of inter-examiner reliability, the order of examination can be varied by alternating or randomising the order in which the examiners examine the subjects. This is not practical or necessary in all types of studies. For example, in a study of the inter-rater reliability of radiologists' MRI reports, it was unnecessary to vary the order in which the MRI films were read, as the MRI films are static, not dynamic, and are therefore not subject to change by the examination procedure.
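As a sketch of how such an order might be generated (hypothetical subject identifiers; Python's random module used for illustration), one could draw an independent random ordering for each examiner, or for each session in a test-retest design:

# Hypothetical sketch: randomising the order in which subjects are examined.
import random

subjects = ["subject_%02d" % i for i in range(1, 11)]     # invented identifiers

# Inter-examiner design: an independent random order for each examiner.
order_examiner_1 = random.sample(subjects, k=len(subjects))
order_examiner_2 = random.sample(subjects, k=len(subjects))

# Intra-examiner (test-retest) design: re-randomise the order for the retest session.
order_retest = random.sample(subjects, k=len(subjects))

print(order_examiner_1[:3], order_retest[:3])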

3.3. Test application and interpretation

Readers should determine whether the application and interpretation of the test was appropriate. Firstly, the incorrect application of the test may give poor results. In this case, it would not be the test, or the examiners, who are unreliable per se, but error associated with incorrect test application. Examples include orthopaedic tests that are performed incorrectly, or laboratory tests that have been contaminated.

Secondly, the incorrect interpretation of the test may give poor results, and in this situation it is not the test that is unreliable per se, but error associated with variation in the interpretation of test results by the examiners. One example of this is a group of examiners, each of whom has a different threshold for "positive" and "negative". Another is the situation in which examiners agree on the presence of the index condition, or sign, but label it incorrectly on their reports; for example, examiners may agree on the exact location of a rib fracture on chest X-ray, but label the rib incorrectly on their report, which would result in disagreement.


3.4. Time interval between repeated measures

The time interval between each test procedure should take into account random and systematic changes in the variable being measured. These changes can relate to true biologic variation, or systematic variation associated with the measurement procedure. For example, the straight leg raise has been found to have a diurnal pattern, with variations of up to 17° in available range of motion in those with lumbar disc protrusion.28 This should be taken into account when planning a study to evaluate the reliability of the straight leg raise. As another example, physical examination of ankle dorsiflexion may result in a change in ankle stiffness, so that a subsequent measurement might be expected to give a different result.

Readers will often be left to make their own judgement on this issue based on the underlying theoretical rationale of the test. If the clinical sign of interest is subject to change simply by the application of the test procedure, then the role the clinical sign plays in the aetiology or maintenance of the condition of interest is questionable. For example, if a persistent myofascial trigger point is considered to be the cause of chronic muscle pain, then a single physical examination of the trigger point should not be sufficient to modify it in any clinically meaningful way.

3.5. Statistics

It is essential that the authors have reported appropriate statistical measures of agreement. For continuous data, the intraclass correlation coefficient should be used. For categorical data, kappa should be used. If alternative statistics are reported, these should be explained and justified. Confidence intervals should also be reported to help readers determine the relevance of the findings.
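One way to obtain a confidence interval for kappa is a nonparametric bootstrap over subjects. The sketch below is only an illustration, assuming NumPy and scikit-learn are available and that per-subject ratings from two examiners have been recorded (here rebuilt from the Table 2 counts); other interval estimators are equally acceptable.

# Sketch: a percentile bootstrap 95% confidence interval for kappa.
# Assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_1 = np.array([1] * 57 + [0] * 19 + [1] * 11 + [0] * 13)
rater_2 = np.array([1] * 57 + [1] * 19 + [0] * 11 + [0] * 13)

rng = np.random.default_rng(seed=0)
n = len(rater_1)
boot_kappas = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)                       # resample subjects with replacement
    boot_kappas.append(cohen_kappa_score(rater_1[idx], rater_2[idx]))

lower, upper = np.percentile(boot_kappas, [2.5, 97.5])
point = cohen_kappa_score(rater_1, rater_2)
print("kappa = %.2f, 95%% CI approximately %.2f to %.2f" % (point, lower, upper))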

3.6. Summary

The critical appraisal of a reliability study is completed by consideration of the conclusions in the context of the full appraisal of the study: "what do these results mean, and can they be applied to my patients?" The aim of the critical appraisal is to develop an answer to this question.

4. Conclusion

There are continued calls for randomised controlled trials to assess the effectiveness of osteopathic treatment. Randomised controlled trials, however, can be compromised by diagnostic tests that are used to select or exclude participants, but which are either inaccurate, unreliable, or both. Determining the accuracy and reliability of diagnostic tests in osteopathy is therefore a high priority. This masterclass has provided a brief overview of diagnostic research and then explained in more detail the methodology, statistical analysis and quality appraisal of diagnostic reliability studies.

Author contribution statement

Nicholas P. Lucas and Nikolai Bogduk are the authors of this work. Both contributed to the idea, planning, preparation, writing and reviewing of this article. Both approved the final version submitted for consideration.

References

1. Patterson MM. Foundations for osteopathic medical research. In: Ward RC, editor. Foundations for osteopathic medicine. 2nd ed. Philadelphia: Lippincott Williams & Wilkins; 2003.

2. Licciardone JC. Responding to the challenge of clinically relevant osteopathic research: efficacy and beyond. Int J Osteopath Med 2007;10:3–7.

3. Licciardone JC. Time for the osteopathic profession to take the lead in musculoskeletal research. Osteopath Med Prim Care 2009;3:6.

4. Leach J. Towards an osteopathic understanding of evidence. Int J Osteopath Med 2008;11:3–6.

5. Chalmers TC, Smith Jr H, Blackburn B. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981;2:31–49.

6. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996;17:1–12.

7. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 1995;16:62–73.

8. Verhagen AP, de Vet HCW, de Bie RA, Kessels AGH, Boers M, Bouter LM, et al. The Delphi list: a criteria list for quality assessment of randomized clinical trials for conducting systematic reviews developed by Delphi consensus. J Clin Epidemiol 1998;51:1235–41.

9. Haynes RB, Sackett DL, Guyatt GH, Tugwell P. Clinical epidemiology. How to do clinical practice research. 2nd ed. Boston: Lippincott Williams & Wilkins; 2005.

10. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003;138:W1–12.

11. Lucas NP, Macaskill P, Irwig L, Bogduk N. The development of a quality appraisal tool for studies of diagnostic reliability (QAREL). J Clin Epidemiol 2010;63.

12. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003;3:25.

13. Vincent-Smith B, Gibbons P. Inter-examiner and intra-examiner reliability of the standing flexion test. Man Ther 1999;4:87–93.

14. Kmita A, Lucas N. Reliability of physical examination to assess asymmetry of anatomical landmarks indicative of pelvic somatic dysfunction in subjects with and without low back pain. Int J Osteopath Med 2008;11:16–25.

15. Jordan T. Swedenborg's influence on Sutherland's 'primary respiratory mechanism' model in cranial osteopathy. Int J Osteopath Med 2009;12:100–5.

16. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–7.

17. Vanharanta H, Sachs BL, Spivey MA, Guyer RD, Hochschuler SH, Rashbaum RF, et al. The relationship of pain provocation to lumbar disc deterioration as seen by CT/discography. Spine 1987;12:295–8.

18. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur 1960;20:37–46.

19. Kramer MS, Feinstein AR. Clinical biostatistics. LIV. The biostatistics of concordance. Clin Pharmacol Ther 1981;29:111–23.

20. Bogduk N. Truth in diagnosis: reliability. Australasian Musculoskel Med 1998:21–3.

21. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–74.

22. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol 1993;46:423–9.

23. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551–8.

24. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543–9.

25. Kraemer HC, Bloch DA. Kappa coefficients in epidemiology: an appraisal of a reappraisal. J Clin Epidemiol 1988;41:959–68.

26. Lantz CA, Nebenzahl E. Behavior and interpretation of the kappa statistic: a resolution of the two paradoxes. J Clin Epidemiol 1996;49:431–4.

27. Fleiss JL. Statistical methods for rates and proportions. 3rd ed. Hoboken, NJ: Wiley-Interscience; 2003.

28. Porter RW, Trailescu IF. Diurnal changes in straight leg raising. Spine 1990;15:103–6.