ACII 2023 | Jeffrey M. Girard
https://affcom.ku.edu/mvac
To understand and predict events in the world, it often helps to hypothesize explanatory latent variables
These explanatory variables aren’t directly observable but rather are inferred from observations they explain
Examples of explanatory latent variables: emotions, attitudes, personality traits, intelligence
Hypothesized latent variables are called constructs
The observations that they explain are called indicators
Most constructs are estimated using “effect” indicators, i.e., observable variables that the construct is thought to cause
We often want to know an individual, group, or object’s “standing” or score on a psychological construct
This means assigning a numerical score to it
We call this construct estimation or measurement
By definition, constructs cannot be directly measured
So we must infer their scores from measured indicators
Estimating construct scores thus proceeds as follows: measure the indicators, then infer the construct score from those measurements
Self-report questionnaires ask participants to indicate their own standing on one or more constructs
Questionnaires are composed of multiple items, each of which is responded to using a numerical rating scale
Each item is meant to measure a single indicator
Construct scores are often estimated by summing or averaging all items that correspond to that construct (more sophisticated methods include PCA, SEM, and IRT); see the scoring sketch below
*Sum and average scores assume that all items reflect the construct equally well; these assumptions can be tested and even relaxed with advanced methods.
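As a minimal illustration of sum/average scoring (a sketch only; the data file, item names, and reverse-keyed item below are hypothetical):

```python
import pandas as pd

# Hypothetical questionnaire data: one row per participant, one column per item
df = pd.read_csv("responses.csv")  # e.g., items extra_1 ... extra_5 rated 1-5

# Reverse-key any negatively worded items (extra_3 is assumed reverse-keyed)
df["extra_3"] = 6 - df["extra_3"]  # for a 1-5 scale: 6 - response

items = ["extra_1", "extra_2", "extra_3", "extra_4", "extra_5"]
df["extraversion_sum"] = df[items].sum(axis=1)    # sum score
df["extraversion_mean"] = df[items].mean(axis=1)  # average score
```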
Observers or raters are individuals who view stimuli (e.g., multimedia) and provide scores on various constructs
These measurements are standardized using an instrument (e.g., coding scheme or rating scale)
Such instruments tell observers what to focus on and help them to make consistent measurements
The goal is to remove unwanted subjectivity from the measurement process
A smile is a facial movement that pulls the mouth corners upwards and toward the ears (see examples below). You will view several images; please examine each image carefully and determine whether the image would be best described as either Smiling or Not Smiling.
Persuasiveness is the capability of a person or argument to convince or persuade someone to accept a desired way of thinking. You will read several brief movie reviews in which a reviewer will argue that a movie is either worth watching or not worth watching. Use the following scale to indicate how persuasive you found each movie review to be overall.
Measurement errors are differences between an object’s estimated score and its “true” standing on the construct
Measurement errors come from two main sources: random error (noise) and systematic error (bias)
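In classical test theory terms (a standard framing added here for clarity, not taken verbatim from this deck), an observed score can be decomposed as \[x = \tau + \beta + \varepsilon\] where \(\tau\) is the true score, \(\beta\) is systematic error (bias), and \(\varepsilon\) is random error (noise) with an expected value of zero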
Validity is the degree to which scores on an appropriately administered instrument support inferences about variation in the characteristic it was designed to measure.
Validation is the ongoing process of gathering, summarizing, and evaluating relevant evidence concerning the degree to which that evidence supports the intended meaning of scores yielded by an instrument and inferences about standing on the characteristic it was designed to measure.
Validity applies to inferences, not instruments
Validity varies across populations and contexts
Validity is integrative and singular
Validity exists on a continuum
Validation is an ongoing process
Validation has three main phases: substantive, structural, and external
Ginsberg et al. (2009) used ML to predict flu outbreaks from Google searches faster than traditional CDC methods, but Lazer et al. (2014) later found that the model was merely capturing seasonality and completely missed nonseasonal influenza
Liu et al. (2015) found that positive emotional expressions online were unrelated to self-reported life satisfaction, but it is difficult to weigh this finding against past theories because the inconsistency could be due to measurement error
Ribeiro et al. (2016) found that an ML model learned to distinguish images of huskies from wolves merely by looking for the presence of snow, yet many participants trusted the model before learning about this shortcut
Bleidorn & Hopwood (2019) and Tay et al. (2020) find that validity issues may be holding back ML approaches to personality assessment
Jacobucci & Grimm (2020) found that predictors/features with low reliability (an aspect of validity) attenuate predictive performance, especially for ML algorithms
Defining the construct’s breadth, scope, and indicators
Does the construct definition make sense?
Do the selected indicators represent the construct well?
Are the items/scales being understood/used as intended?
Is the instrument being administered properly?
How has this construct been defined before?
What are the main aspects of this construct?
What are the main indicators of this construct?
How is it related to and distinct from other constructs?
How has it been measured in previous work?
What theories are relevant to this construct?
What support is there for these theories?
Provide a precise definition of the construct
Create a list of indicators for the construct
Create a list of related constructs / hierarchies
Create a Venn Diagram of related constructs
For each aspect of the construct, generate a list of indicators
Create multiple “candidate” items/scales for each indicator
Have experts review the list of aspects, indicators, and items
Pilot test the items/scales and refine based on feedback
Select the best items/scales based on pilot testing
Ensure representation of all aspects in final selection
Consider how participants will respond to items/scales
Are participants understanding the questions?
Are participants understanding the response options?
Are participants choosing responses as you intended?
Cognitive interviewing and think-aloud techniques can help answer these questions
Consider where/how the instrument will be completed
Are there any distractions in the environment?
Are there any biasing factors in the environment?
Are there any sources of error in the procedure?
Are there any test security issues (e.g., cheating)?
Relationships Among Internal Variables
Are scores consistent across items?
Average inter-item correlation (simplistic)
Congeneric CFA (confirmatory factor analysis)
\[x_j = \lambda_jf + e_j\]
McDonald’s omega
The reliability of the total score \((x = \sum_j x_j)\) as a measure of the factor \(f\) is captured by \(\omega_u\)
How much of the total score variance is explained by \(f\)?
\[\omega_u = \frac{\left(\sum_j \lambda_j\right)^2}{\sigma_x^2}\]
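These quantities can be estimated in Python; the sketch below uses the third-party factor_analyzer package, with an exploratory one-factor fit standing in for a full congeneric CFA (my choice, not named in the slides):

```python
import numpy as np
from factor_analyzer import FactorAnalyzer  # assumed third-party dependency

def avg_interitem_r(X):
    """Average inter-item correlation (the 'simplistic' check above)."""
    R = np.corrcoef(X, rowvar=False)           # item-by-item correlations
    return R[~np.eye(R.shape[0], dtype=bool)].mean()

def omega_u(X):
    """Rough omega estimate from a one-factor model of items X.

    X: (n_participants, n_items) array of item responses.
    """
    fa = FactorAnalyzer(n_factors=1, rotation=None)
    fa.fit(X)
    lam = fa.loadings_.flatten()               # standardized loadings (lambda_j)
    psi = 1.0 - lam**2                         # uniquenesses (error variances)
    # (sum of loadings)^2 over model-implied total score variance (sigma_x^2)
    return lam.sum()**2 / (lam.sum()**2 + psi.sum())
```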
Are scores consistent across raters or time?
Weighting scheme (partial credit for close errors)
\[r_{ik}^\star = \sum_{l=1}^{q}w_{kl}r_{il}\]
Observed agreement
\[p_o = \frac{1}{n'}\sum_{i=1}^{n'}\sum_{k=1}^{q}\frac{r_{ik}(r_{ik}^\star-1)}{r_i(r_i-1)}\]
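A minimal sketch of the weighted observed-agreement formula above (function and variable names are mine):

```python
import numpy as np

def observed_agreement(ratings, q, weights=None):
    """Weighted observed agreement (p_o) following the formula above.

    ratings: (n_items, n_raters) float array of category indices 0..q-1,
             with np.nan marking missing ratings.
    weights: (q, q) agreement weights w_kl; identity = exact match only.
    """
    if weights is None:
        weights = np.eye(q)                       # unweighted (nominal) case
    total, n_prime = 0.0, 0
    for row in ratings:
        scores = row[~np.isnan(row)].astype(int)
        r_i = len(scores)
        if r_i < 2:
            continue                              # skip items with < 2 raters
        n_prime += 1
        r_ik = np.bincount(scores, minlength=q)   # raters per category
        r_star = weights @ r_ik                   # partial-credit counts
        total += np.sum(r_ik * (r_star - 1)) / (r_i * (r_i - 1))
    return total / n_prime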
How much score variance is due to items vs. raters?
Variance components can be estimated in many ways (e.g., via ANOVA mean squares or mixed-effects models)
\(ICC\in(-1,+1)\) where higher is better
\(ICC\ge0.75\) is acceptable, \(ICC\ge0.90\) is good
One-way Single-Measures ICC \[ICC(1)=\frac{\sigma_i^2}{\sigma_i^2+\sigma_{r:i}^2}\]
One-way Average-Measures ICC \[ICC(k)=\frac{\sigma_i^2}{\sigma_i^2 + \sigma_{r:i}^2/k}\]
Two-way Single-Measures Agreement ICC \[ICC(A,1) = \frac{\sigma_i^2}{\sigma_i^2 + \sigma_r^2 + \sigma_{ir}^2}\]
Two-way Average-Measures Agreement ICC \[ICC(A,k) = \frac{\sigma_i^2}{\sigma_i^2 + (\sigma_r^2 + \sigma_{ir}^2)/k}\]
Two-way Single-Measures Consistency ICC \[ICC(C,1) = \frac{\sigma_i^2}{\sigma_i^2 + \sigma_{ir}^2}\]
Two-way Average-Measures Consistency ICC \[ICC(C,k) = \frac{\sigma_i^2}{\sigma_i^2 + \sigma_{ir}^2/k}\]
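These ICCs can be estimated in Python, e.g., with the third-party pingouin package (my choice, not named in the slides); the long-format data and column names below are hypothetical:

```python
import pandas as pd
import pingouin as pg  # assumed third-party dependency

# Hypothetical long-format data: one row per (item, rater) pair
df = pd.DataFrame({
    "item":  [1, 1, 2, 2, 3, 3, 4, 4],
    "rater": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "score": [4, 5, 2, 2, 5, 4, 3, 3],
})

# Returns ICC1/ICC1k (one-way), ICC2/ICC2k (agreement), ICC3/ICC3k
# (consistency), which map roughly onto the six formulas above
icc = pg.intraclass_corr(data=df, targets="item", raters="rater",
                         ratings="score")
print(icc[["Type", "ICC"]])
```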
Relationships with External Variables
A good score of X should be highly correlated with other, trusted measures of X (called criterion variables)
Do our scores correlate with criterion variables?
A good score of X should correlate positively with A…
A good score of X should correlate negatively with B…
A good score of X should be uncorrelated with C…
Do our scores correlate with others as expected?
A good score of X should differ between groups A and B…
Do our scores differ between known groups?
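A minimal sketch of these external checks with scipy; the simulated data and all variable names are illustrative assumptions, not real results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins for real scores (illustrative only)
x = rng.normal(size=100)                          # our scores on construct X
criterion = x + rng.normal(scale=0.5, size=100)   # trusted measure of X
unrelated = rng.normal(size=100)                  # construct C

r, p = stats.pearsonr(x, criterion)    # convergent: expect a high positive r
print(f"criterion r = {r:.2f}, p = {p:.3f}")

r, p = stats.pearsonr(x, unrelated)    # discriminant: expect r near zero
print(f"discriminant r = {r:.2f}, p = {p:.3f}")

# Known-groups check: do scores differ between groups A and B as predicted?
t, p = stats.ttest_ind(x[:50] + 0.8, x[50:])      # simulated group difference
print(f"known-groups t = {t:.2f}, p = {p:.3f}")
```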
Gathered ~290,000 facial images of celebrities
Used OpenFace 2.0 to measure smiling
Compared smiling between genders and countries
How should we go about validating our measure?
Selected ~300 images (balanced gender, country, smile)
Each coded by 1 of 3 certified FACS coders for AU12 intensity
Each coded by 5 untrained crowdworkers
Computed correlations between OpenFace, FACS coding, and crowdworker ratings
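For instance, the correlation matrix could be computed as below (the file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical wide-format table: one row per image (column names are mine)
df = pd.read_csv("smile_scores.csv")  # openface_au12, facs_au12, crowd_mean
print(df[["openface_au12", "facs_au12", "crowd_mean"]].corr())
```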
Recorded 96 participants interacting in 3-person groups
Want to predict “perceived emotional expressivity”
Developed a four-item rating scale for our construct
Collected ratings of each video from 8 crowdworkers
How to validate these measures?
How expressive was the person in this video (use your own understanding of what it means to be expressive)?
How much did the person in this video show their emotions (through their words and nonverbal behavior)?
How animated (lively, energetic, or active) was the person in this video?
How much did the person in this video react to the other people (through their words and nonverbal behavior)?
Inter-Rater Reliability: \(ICC(C,8)\)
Inter-Item Reliability: \(\omega_u = 0.966\)
External Phase: Nomological Network