EPSI Psychometric Properties

Note: The information below refers to early work on the EPSI assessment. Monthly benchmark scores and the calculation for the EPSI Total score have recently been updated as described in the EPSI Update. A more complete description of the methods and revised EPSI psychometric properties are forthcoming.

The Early Problem Solving Indicator (EPSI) was developed in a program of research designed to test its soundness as a measure of early problem solving (cognitive) skills (see Walker, Greenwood, Carta, et al., 2005).  Some of the important features of soundness, or the technical adequacy expected of any sound measure, are reliability and validity.

A measure is reliable when two observers simultaneously recording a child’s performance return the same, or nearly the same, score. A measure is also reliable when a child’s score on one occasion is comparable to that obtained on another occasion separated by only a very brief period of time, e.g., several days.

A measure is valid when it is shown to measure what it is supposed to measure, in this case, early problem solving.

  • One proof of validity is a significant correlation between the EPSI and a standardized measure of early cognitive/mental abilities like the Bayley Scales of Infant Development, 2nd Edition (BSID-II, Bayley, 1993) or the Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R, Wechsler, 1989).
  •  A second proof would be finding a significant difference in the problem solving proficiency of older children measured by the EPSI compared to younger children – because in general, we expect older children to be more proficient than younger children in the birth to 3 years age range.

Sample Description

Thirty children were recruited from two child care centers serving infants and toddlers in metropolitan Kansas City and eventually participated in data collection. Any child in the 12 to 48 month age range was eligible to participate in the study. Nineteen children (63%) participated from Center 1 and 11 (37%) from Center 2. The proportions of one-, two-, and three-year-olds participating from each center did not differ statistically, and boys and girls were equally represented (15 each). The mean age of children at first measurement was 31.4 months (SD = 8.0, min = 14.6, max = 46.4). The general design was a cross-sectional, longitudinal study with children in three age cohorts at start (C1 = 12-23 [n = 8], C2 = 24-35 [n = 14], and C3 = 36-48 [n = 8] months of age).

Centers represented a range of parochial and private sponsoring groups and some were affiliated with neighboring high schools serving adolescent mothers. The centers served children of varied racial and socioeconomic backgrounds. Five additional children were subsequently identified in the developmental delay range by research staff based on mental ability scores at or below -1.5 SD of the Bayley Scales mean. Three of these five children identified by research staff had existing IFSPs.

Each eligible child’s parent received a packet of information that included an informed consent form and demographic questionnaire. Any child whose parents returned a signed informed consent form participated over the next 6 months. By the end of the study, 2 children had dropped from participation because they had moved or otherwise left the center without forwarding information. Thus, the analysis sample consisted of 28 children who had completed at least some form of measurement.


Technical Measurement Results


Reliability – Interobserver Agreement

Interobserver agreement assesses the extent to which two observers produce the same score. Agreement assessments tap the extent to which two observers record the same key skill elements when both observe the same child at the same time. High percentage agreement indicates that observers are well trained because they understand and apply the key skill element definitions in the same way in the recording process.

Interobserver agreement was calculated with the frequency ratio method (Kazdin, 1982) on 58 paired agreement checks. The ratio was computed by dividing the smaller of the two scores by the larger and multiplying by 100.

  • Overall percentage agreement was 93% (range, 76% to 100%)
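
The frequency ratio calculation described above can be sketched as follows. This is only an illustration: the function name, the paired counts, and the choice to average the per-check percentages into an overall figure are assumptions, not details taken from the study.

```python
def frequency_ratio_agreement(score_a: float, score_b: float) -> float:
    """Percentage agreement: smaller score divided by larger score, times 100."""
    if score_a < 0 or score_b < 0:
        raise ValueError("scores must be non-negative counts")
    larger = max(score_a, score_b)
    if larger == 0:
        # Neither observer recorded the skill; treating this as full
        # agreement is an assumption made for this sketch.
        return 100.0
    return min(score_a, score_b) / larger * 100.0


# Hypothetical paired counts (observer 1, observer 2) for three agreement checks.
paired_checks = [(19, 20), (12, 12), (38, 50)]
per_check = [frequency_ratio_agreement(a, b) for a, b in paired_checks]

# Averaging across checks is one plausible way to get an overall figure.
overall = sum(per_check) / len(per_check)
print([round(p, 1) for p in per_check], round(overall, 1))
```

Because the ratio always places the smaller score in the numerator, it cannot exceed 100% and does not show which observer recorded more.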

By Key Skill Element, percentage agreement scores between two observers’ recordings of the same child were:

  • 71.1%, Look
  • 86.1%, Explore
  • 95.3%, Function
  • 92.9%, Solutions
  • 72.0%, Engagement

Pearson’s r was also used to assess the similarity between observers’ scores, yielding an overall correlation of .97. By Key Skill Element, correlations were:

  • .60, Look
  • .76, Explore
  • .98, Function
  • .99, Solutions
  • .99, Engagement
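
A minimal sketch of the Pearson correlation between two observers' scores is given below. The function and the example counts are hypothetical. The example also illustrates a general property of the two statistics: a correlation can be high while percentage agreement is low, because r reflects how consistently the two sets of scores rise and fall together, not whether they match in absolute size.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical counts: observer 2 consistently records half of what observer 1 records.
observer_1 = [2, 4, 6, 8]
observer_2 = [1, 2, 3, 4]

r = pearson_r(observer_1, observer_2)
# Frequency ratio agreement for each pair, for comparison with r.
ratios = [min(a, b) / max(a, b) * 100 for a, b in zip(observer_1, observer_2)]
print(round(r, 2), ratios)
```

In this constructed case the correlation is perfect even though each paired check shows only 50% agreement, which is why the two statistics are reported side by side.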

View complete table of percent agreement and correlations

Reliability – Split-half (Odd vs. Even)


This form of reliability tests the comparability of EPSI scores based on odd-numbered observation occasions with scores based on even-numbered occasions.

Split-half correlations were very strong for Functions (.88), all skills combined (.88), and Functions + Solutions (.83). Explore produced a moderate correlation of .60, while all remaining correlations were weak.
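
One way to picture the odd versus even comparison is sketched below: each child's scores are averaged separately over odd-numbered and even-numbered occasions, and the two sets of means are correlated across children. The data, the names, and the use of per-child means are assumptions for illustration. The source does not say how occasions were aggregated or whether a Spearman-Brown correction was applied, so the sketch reports the uncorrected correlation.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one set of scores has no variance")
    return sxy / math.sqrt(sxx * syy)


def split_half_means(scores):
    """Return (mean of 1st, 3rd, ... occasions, mean of 2nd, 4th, ... occasions)."""
    odd, even = scores[0::2], scores[1::2]
    if not odd or not even:
        raise ValueError("need at least 2 occasions per child")
    return sum(odd) / len(odd), sum(even) / len(even)


# Hypothetical Functions rates (responses per minute) over repeated occasions.
children = {
    "child_a": [1.0, 1.5, 2.0, 2.5, 3.0],
    "child_b": [4.0, 5.0, 4.5, 5.5, 6.0],
    "child_c": [7.0, 6.5, 8.0, 8.5],
    "child_d": [9.0, 10.0, 9.5, 10.5, 11.0],
}
halves = [split_half_means(s) for s in children.values()]
odd_means = [h[0] for h in halves]
even_means = [h[1] for h in halves]
split_half_r = pearson_r(odd_means, even_means)
print(round(split_half_r, 2))
```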

View complete table of split-half reliability data

Reliability – Alternate Toy Forms

This test of reliability compares EPSI scores obtained when observations were made using alternate toys, in this case the Form A and Form B toy sets. A desired reliability outcome is for children’s scores to be equal when measured with both forms very close in time.

Pearson correlations were very strong at .90 for Function, All combined, and Functions + Solutions. Look and Engagement produced the weakest correlations at .32 and .26 respectively. The weak correlation with Engagement was explained by its limited variability as most children scored at its ceiling level.

Scores for each skill and composite (means and standard deviations) were similar in magnitude across the two forms. Means differed significantly only for Solutions: Form B consistently underestimated Solutions relative to Form A.

Validity – Criterion

Analyses were conducted to test whether EPSI scores correlated with other measures of problem solving and mental ability. The criterion measure was the Bayley Scales of Infant Development, 2nd Edition (BSID-II, Bayley, 1993) for children younger than 42 months of age, and the Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R, Wechsler, 1989) for the two children who were older than 42 months. Both measures yield standard scores (M = 100, SD = 15) for mental ability.
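
On this standard-score scale, the developmental delay criterion mentioned in the Sample Description (scores at or below -1.5 SD) can be converted to a score value as sketched below. The function names are hypothetical, and the sketch assumes the cutoff is taken relative to the normative mean of 100 and SD of 15.

```python
def standard_score_cutoff(z: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Standard score that lies z standard deviations from the mean."""
    return mean + z * sd


def in_delay_range(score: float, z_cutoff: float = -1.5) -> bool:
    """True when a score is at or below the cutoff (assumed normative M = 100, SD = 15)."""
    return score <= standard_score_cutoff(z_cutoff)


cutoff = standard_score_cutoff(-1.5)
print(cutoff, in_delay_range(77), in_delay_range(85))
```

Under these assumptions the criterion corresponds to a standard score of 77.5 or lower.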

The Bayley’s Mental Scale is a measure of cognitive functioning for children birth through 42 months of age that is administered to a respondent who is familiar with the child’s behavior via a semi-structured interview. The mental scale taps a range of skills including: sensory/perceptual acuities, discriminations, and response; acquisition of object constancy; memory, learning, and problem solving; vocalization, beginning of verbal communication; basis of abstract thinking; habituation; mental mapping; complex language; and mathematical concept formation. For children older than 15 months, the full protocol requires up to 60 minutes to complete, up to 35 minutes for children younger than 15 months.  The BSID-II is normed on a stratified random sample of 1,700 U.S. children (850 boys and 850 girls) ages one month to 42 months, grouped at one-month to three-month intervals on the variables of age, sex, region, race/ethnicity, and parental education.

The Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R) also measures the cognitive ability of young children. A range of studies indicate that the WPPSI has adequate construct, concurrent, and predictive validity for many types of normal and handicapped children in the age range from 4 to 6.5 years. The WPPSI-R contains 12 subtests, 6 in the Performance Scale and 6 in the Verbal Scale. Five of the six subtests in each scale are designated as the standard subtests. They are Object Assembly, Geometric Design, Block Design, Mazes, and Picture Completion in the Performance Scale and Information, Comprehension, Arithmetic, Vocabulary, and Similarities in the Verbal Scale. The optional subtests are Animal Pegs in the Performance Scale and Sentences in the Verbal Scale. The WPPSI-R was standardized on 1,700 children, 100 boys and 100 girls in each of eight age groups from 3 to 7 years and one group of 50 boys and 50 girls from 7 years. The 1986 U.S. census data were used to select representative children for the normative sample. Test-retest reliabilities for a period of approximately 3 to 7 weeks for Performance, Verbal, and Full Scale IQs were .87, .89, and .91, respectively.

Does the EPSI measure early cognitive ability? Children received five repeated EPSI measures separated in time by 3 weeks. Twenty-one children (70%) had all 5, four children had 4 observations (13%), five had 3 or fewer observations (17%).

Moderately strong criterion validity correlations were obtained between the criterion measure and three of the EPSI scores:

  • Functions (r = .48)
  • Solutions (r = .40)
  • Total composite (r = .42)

Is the EPSI sensitive to age differences in early problem solving? Analyses by age cohort indicated that Functions, Solutions, Total Composite (rate per minute), and Engagement (percentage duration) showed orderly increases from one age span to the next. Older children were more proficient in these skills than younger children. This was not the case for Look and Explore. Look remained relatively stable across age cohorts, while Explore increased slightly in Cohort 2 compared to Cohort 1 and then dropped back in Cohort 3.


Is the EPSI sensitive to changes in Key Skill Elements (Look, Explore, Functions, Solutions, Engagement)?


  • Across all children and ages combined, Function was the most frequently occurring key skill (6.7 responses per minute) followed by Explore (5.5), Look (1.1), and Solutions (0.9) in rank order.
  • Beginning at 14 months of age, children had greater than zero rates of occurrence of all key skills except Solutions.
  • Solutions emerged 8 months later at 22 months of age.
  • Beginning at 1.0 per minute, Functions accelerated to 8.0 per minute at 36 months of age, increasing again to 10.0 per minute at 49 months of age.
  • Solutions accelerated from 0 to 1.0 per minute by 28 months of age leveling off thereafter through 49 months of age.
  • Look, Explore, and Engagement were flat and relatively unchanged over the entire age span.

View raw data trends in key skill elements over age at testing


Is the EPSI sensitive to growth over time? The linear slopes (responses per minute per month) for the key skills and their combinations were:

  • 0.001, Look
  • 0.003, Explore
  • 0.050, Solutions
  • 0.275, Functions
  • 0.320, Composite (Look + Explore + Solutions + Functions)
  • 0.325, Function + Solution

These slopes indicate that Look and Explore, with slopes near zero, were flat and not growing over time. Functions + Solutions grew at a rate of about one-third of a response per minute for each month of age.
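
To make the slope units concrete, the sketch below fits an ordinary least-squares line to rate per minute as a function of age in months. The data points and function name are hypothetical, and ordinary least squares on a single series is an assumption. The source does not state how the reported slopes were estimated.

```python
def linear_slope(ages_months, rates_per_minute):
    """Least-squares slope and intercept of rate (responses/min) on age (months)."""
    n = len(ages_months)
    if n != len(rates_per_minute) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_age = sum(ages_months) / n
    mean_rate = sum(rates_per_minute) / n
    sxy = sum((a - mean_age) * (r - mean_rate)
              for a, r in zip(ages_months, rates_per_minute))
    sxx = sum((a - mean_age) ** 2 for a in ages_months)
    if sxx == 0:
        raise ValueError("ages must not all be identical")
    slope = sxy / sxx
    intercept = mean_rate - slope * mean_age
    return slope, intercept


# Hypothetical rates observed at five ages (not study data).
ages = [14, 20, 26, 32, 38]
rates = [1.0, 2.5, 4.0, 5.5, 7.0]

slope, intercept = linear_slope(ages, rates)
# A slope is read as the change in responses per minute for each month of age.
print(round(slope, 3), round(intercept, 2))
```

Read the same way, the reported Function + Solution slope of 0.325 would correspond to a gain of roughly 3.9 responses per minute over 12 months, if growth were strictly linear over that period.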