Outcome Measures

Table 3. Criteria for Rating Outcome Measures

Criterion

Definition

Standard

Reliability

- The reproducibility and internal consistency of the tool (synonyms include stability, repeatability, etc.).
- Reproducibility is the degree to which the score is free from random error. Test-retest and inter-/intra-observer reliability are commonly evaluated using statistics including the intraclass correlation coefficient (ICC), Pearson’s or Spearman’s coefficients, and kappa coefficients (weighted or unweighted).
- Internal consistency assesses the homogeneity of the scale items.  It is generally examined using split-half reliability or Cronbach’s alpha statistics.  Item-to-item and item-to-scale correlations are also accepted methods.
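As an illustration of the internal consistency statistic mentioned above, the following is a minimal sketch of Cronbach's alpha using the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the input layout (one list of item scores per respondent) are assumptions made for this example, not part of the source.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores: list of respondents, each a list of k item scores
    (hypothetical layout chosen for this sketch).
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Two items that rank respondents identically give alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```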

- Internal consistency ratings are: excellent (≥0.80), adequate (0.70-0.79), or poor (≤0.69) (Andresen 2000).
- ICC and kappa ratings for inter-/intra-observer and test-retest reliability are: excellent (≥0.75), adequate (0.40-0.74), or poor (≤0.39) (Andresen 2000).
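The reliability cut-offs above can be applied mechanically. The sketch below is one possible encoding; the function names are hypothetical, and it assumes that values falling between the printed bands (for example 0.795) are rated by the lower bound of the next band up.

```python
def rate_internal_consistency(alpha):
    """Rate Cronbach's alpha or split-half reliability (Andresen 2000 cut-offs)."""
    if alpha >= 0.80:
        return "excellent"
    if alpha >= 0.70:
        return "adequate"
    return "poor"


def rate_reproducibility(coefficient):
    """Rate an ICC or kappa for test-retest or inter-/intra-observer reliability."""
    if coefficient >= 0.75:
        return "excellent"
    if coefficient >= 0.40:
        return "adequate"
    return "poor"


print(rate_internal_consistency(0.83))  # excellent
print(rate_reproducibility(0.62))       # adequate
```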

Validity

Does the instrument measure what it purports to measure? Forms of validity include face, content, construct and criterion. Concurrent, convergent or discriminative, and predictive validity are all considered to be forms of criterion validity. However, concurrent, convergent and discriminative validity all depend on the existence of a “gold standard” to provide a basis for comparison. If no gold standard exists, they represent a form of construct validity in which the relationship to another measure is hypothesized (Finch et al. 2002).

Construct/convergent and concurrent correlations:
Excellent (≥0.60), Adequate (0.30-0.59), Poor (≤0.29) (Andresen 2000)
ROC analysis – AUC: Excellent (≥0.90), Adequate (0.70-0.89), Poor (≤0.69) (McDowell & Newell 1996)
There are no agreed-upon standards by which to judge sensitivity and specificity as a validity index (Riddle & Stratford 1999).
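The validity cut-offs above can be encoded in the same way. This is a sketch under two assumptions that the source does not state: correlations are rated on their absolute value (so that a strong negative correlation still counts as strong), and values between printed bands are rated by the lower bound of the next band up. The function names are hypothetical.

```python
def rate_validity_correlation(r):
    """Rate a construct/convergent or concurrent correlation (Andresen 2000)."""
    magnitude = abs(r)  # assumption: only the strength of the relationship is rated
    if magnitude >= 0.60:
        return "excellent"
    if magnitude >= 0.30:
        return "adequate"
    return "poor"


def rate_auc(auc):
    """Rate the area under a ROC curve (McDowell & Newell 1996)."""
    if auc >= 0.90:
        return "excellent"
    if auc >= 0.70:
        return "adequate"
    return "poor"


print(rate_validity_correlation(-0.65))  # excellent
print(rate_auc(0.84))                    # adequate
```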

Responsiveness

Sensitivity to changes within patients over time (which might be indicative of therapeutic effects).
Responsiveness is most commonly evaluated through correlation with other change scores, effect sizes, standardized response means, relative efficiency, sensitivity and specificity of change scores, and ROC analysis.
Assessment of possible floor and ceiling effects is included as they indicate limits to the range of detectable change beyond which no further improvement or deterioration can be noted.

Sensitivity to change:
Excellent:
Evidence of change in expected direction using methods such as standardized effect sizes:
Small (<0.50), Moderate (0.50-0.79), Large (≥0.80)
Also by way of standardized response means, ROC analysis of change scores (area under the curve – see above) or relative efficiency.
Adequate:
Evidence of moderate/less change than expected; conflicting evidence.
Poor:
Weak evidence based solely on p-values (statistical significance) (Andresen 2000).
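For reference, the two most common responsiveness statistics named above can be sketched as follows. The definitions used here are the conventional ones (standardized effect size = mean change divided by the standard deviation of baseline scores; standardized response mean = mean change divided by the standard deviation of the change scores). The function names and the paired-list input are assumptions for this example, and a value of exactly 0.80 is treated as large.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def label_effect_size(effect_size):
    """Label the magnitude of a standardized effect size."""
    magnitude = abs(effect_size)
    if magnitude >= 0.80:
        return "large"
    if magnitude >= 0.50:
        return "moderate"
    return "small"


# Hypothetical paired scores for four patients
baseline = [10, 12, 14, 16]
follow_up = [12, 15, 15, 18]
es = standardized_effect_size(baseline, follow_up)
print(round(es, 2), label_effect_size(es))
print(round(standardized_response_mean(baseline, follow_up), 2))
```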
Floor/Ceiling Effects:
Excellent: No floor or ceiling effects
Adequate: Floor and ceiling effects present, with ≤20% of patients attaining either the minimum (floor) or maximum (ceiling) score.
Poor: >20% of patients attain the minimum or maximum score (Hobart et al. 2001).
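The floor/ceiling rating can be checked directly from a set of scores. The sketch below assumes that the 20% threshold is applied to the floor and the ceiling separately and that the scale is rated on the worse of the two; the source wording leaves this open. The function name and arguments are hypothetical.

```python
def floor_ceiling_rating(scores, min_score, max_score):
    """Return (floor %, ceiling %, rating) for a list of patient scores."""
    n = len(scores)
    floor_pct = 100 * sum(s == min_score for s in scores) / n
    ceiling_pct = 100 * sum(s == max_score for s in scores) / n
    worst = max(floor_pct, ceiling_pct)  # assumption: rate on the worse effect
    if worst == 0:
        rating = "excellent"
    elif worst <= 20:
        rating = "adequate"
    else:
        rating = "poor"
    return floor_pct, ceiling_pct, rating


# Hypothetical scores on a 0-10 scale: 1 of 10 at the floor, 3 of 10 at the ceiling
print(floor_ceiling_rating([0, 5, 10, 10, 10, 7, 3, 8, 9, 6], 0, 10))
```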

Interpretability

How meaningful are the scores?  Are there consistent definitions and classifications for results?  Are there norms available for comparison?

Jutai & Teasell (2003) point out that these practical issues should not be separated from consideration of the values that underscore the selection of outcome measures. A brief assessment of practicality will accompany each summary evaluation.

Acceptability

How acceptable the scale is in terms of completion by the patient – does it represent a burden?  Can the assessment be completed by proxy, if necessary? Are there different formats available?

Feasibility

Extent of effort, burden, expense & disruption to staff/clinical care arising from the administration of the instrument. Availability of the tool or representative version of the tool. Cost of the tool.

Clinical Summary

Will the tool prove useful in clinical situations? Which SCI subgroups is it suitable for? What type of information is generated (descriptive, predictive, or evaluative)? Will it help with discharge planning? Is the tool used as a component of an administrative database?