Tutorial date/time
September 10 from 09:00–12:00
Room E15-359
Tutorial presenter(s)
Jeffrey Girard, Department of Psychology, University of Kansas
Tutorial description
Does your label really measure what you think it does? How can you provide evidence of this to reviewers and readers of your research? In affective computing, researchers are often interested in studying and building models to predict psychological quantities that are difficult to measure. For example, there is no ruler or thermometer for measuring amusement, depression, or extraversion. These quantities must be measured indirectly, e.g., using self-report questionnaires, structured interviews, or observer rating scales. The process of evaluating the extent to which such measurements are consistent and trustworthy (i.e., are “valid” and therefore can be used to measure what they were developed to measure) is called “measurement validation.” This three-hour tutorial will teach attendees about the theory and practice of this critically important part of the research process. Theoretical topics will include overviews of classical test theory, generalizability theory, and contemporary validity theory. Practical topics will include the estimation of external and criterion validity coefficients, inter-item reliability for self-report questionnaires (using Cronbach’s alpha and McDonald’s omega), and inter-rater reliability for structured interviews and observer rating scales (using generalized kappa coefficients and modern intraclass correlation coefficients). We will discuss best practices for designing, conducting, and reporting a measurement validation study, drawing examples from the affective computing community. We will also discuss common challenges that come up in this process (e.g., imbalanced classes, low variance, ordered categories, and missing data) and how to address these challenges using recent advances in statistical methods (e.g., generalized coefficients, multilevel decomposition, and simulation-based methods).
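To make the inter-item reliability topic concrete, the sketch below computes Cronbach’s alpha directly from its variance-decomposition formula. This is a minimal illustration in Python/NumPy with made-up Likert-scale data, not part of the official tutorial code; the tutorial itself also covers McDonald’s omega, which relaxes alpha’s assumption that every item relates to the construct equally strongly.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return n_items / (n_items - 1) * (1 - item_vars.sum() / total_var)

# Illustrative (made-up) data: 5 respondents answering a 4-item questionnaire (1-5 Likert scale)
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```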
Structure and Contents
The three-hour block will be broken into six roughly equal sections.
- Introduction / Rationale / Theory
  - What is measurement and what are we measuring?
  - What are measurement errors and where do they come from?
  - What are validity and validation? How are they similar and distinct?
  - When and why do validity and validation matter in Affective Computing?
- Evidence Sources: Content and Development, Administration and Response Processes
  - Where does a measurement instrument (and its items) come from?
  - How do we determine whether the construct has been adequately captured?
  - Measurement administration: who, what, where, when, how, and why?
  - How do participants perceive and respond to the measurement items?
- Evidence Sources: Relationships Among Internal Variables
  - Is the construct unitary or does it have dimensions/facets?
  - How consistent are scores across items?
  - How consistent are scores across raters, time, etc.? (see the agreement sketch after this list)
  - Generalizability studies (advanced topic)
- Evidence Sources: Relationships with External Variables
  - How correlated are our scores with criterion (or “ground truth”) variables?
  - Do our scores correlate with other variables in expected/desired ways?
  - Do our scores differ (on average) between groups in expected/desired ways?
  - Multitrait-multimethod (MTMM) analyses (advanced topic)
- Practical Issues in Affective Computing
  - How to develop, refine, and report a construct conceptualization?
  - How to choose an existing instrument or develop a new one?
  - How to design, conduct, and evaluate a measurement validation study?
  - How to deal with skewed data, ordered categories, and missing data?
- Group or Individual Activities
  - Assess papers in your area for how well they provide evidence of validity
  - Assess existing measurement instruments in your area for validity evidence
  - Work on a construct conceptualization for a construct in your area
  - Plan a measurement validation study for one of your projects
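As a preview of the inter-rater reliability material outlined above, the sketch below computes Cohen’s kappa, the simplest chance-corrected agreement coefficient, for two raters assigning categorical labels. This is a minimal illustration in Python/NumPy with invented ratings, not official tutorial code; the tutorial extends the same idea to generalized kappa coefficients and intraclass correlations that accommodate more than two raters, ordered categories, and missing ratings.

```python
import numpy as np

def cohen_kappa(rater1, rater2) -> float:
    """Chance-corrected agreement between two raters assigning categorical labels."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    p_o = np.mean(rater1 == rater2)  # observed agreement
    # Expected agreement under independence, from each rater's marginal label proportions
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Illustrative (made-up) data: two annotators labeling 10 clips for amusement (0 = absent, 1 = present)
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(f"Cohen's kappa = {cohen_kappa(annotator_a, annotator_b):.3f}")
```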
Tutorial materials
Recommended Articles
Cizek, G. J. (2016). Validating test score meaning and defending test score use: Different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225. https://doi.org/10.1080/0969594x.2015.1063479
Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465. https://doi.org/10/ghnbdg
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378. https://doi.org/10.1177/1948550617693063
Flora, D. B. (2020). Your coefficient alpha is probably wrong, but which coefficient omega is right? A tutorial on using R to obtain better reliability estimates. Advances in Methods and Practices in Psychological Science, 3(4), 484–501. https://doi.org/10.1177/2515245920951747
Gehlbach, H., & Brinkworth, M. E. (2011). Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology, 15(4), 380–387. https://doi.org/10/bnn2s3
Jacobucci, R., & Grimm, K. J. (2020). Machine Learning and Psychological Research: The Unexplored Effect of Measurement. Perspectives on Psychological Science, 15(3), 809–816. https://doi.org/10/ghdp3b
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066x.50.9.741
Qiu, L., Chan, S. H. M., & Chan, D. (2017). Big data in social and psychological science: Theoretical and methodological issues. Journal of Computational Social Science, 1(1), 59–66. https://doi.org/10.1007/s42001-017-0013-6
ten Hove, D., Jorgensen, T. D., & van der Ark, L. A. (2022). Updated guidelines on selecting an intraclass correlation coefficient for interrater reliability, with applications to incomplete observational designs. Psychological Methods. https://doi.org/10.1037/met0000516
Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267–295. https://doi.org/10.1037/emo0000226
Recommended Books
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association. https://www.testingstandards.net/open-access-files.html
Bowden, S. C. (Ed.). (2017). Neuropsychological assessment in the age of evidence-based practice. Oxford University Press.
Brennan, R. L. (2001). Generalizability theory. Springer.
Gwet, K. L. (2021). Handbook of inter-rater reliability: Chance-corrected agreement coefficients (5th ed., Vol. 1).
Gwet, K. L. (2021). Handbook of inter-rater reliability: Analysis of quantitative ratings (5th ed., Vol. 2).
Kline, R. (2015). Principles and practice of structural equation modeling (4th ed.). Guilford Press.
Revelle, W. (2014). An introduction to psychometric theory with applications in R. https://www.personality-project.org/r/book/
Zumbo, B. D., & Hubley, A. M. (Eds.). (2017). Understanding and investigating response processes in validation research. Springer.
Contact
Jeffrey Girard (University of Kansas): jmgirard@ku.edu
General enquiries to the ACII 2023 Tutorial Chairs:
Emily Mower Provost (University of Michigan): emilykmp@umich.edu
Albert Ali Salah (Universiteit Utrecht): a.a.salah@uu.nl