9 Validity Studies

The preceding chapters and the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017) provide evidence in support of the overall validity argument for results produced by the DLM assessment. This chapter presents additional evidence collected during 2019–2020 for four of the five critical sources of evidence described in Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014): evidence based on test content, response process, internal structure, and external variables. Additional evidence can be found in Chapter 9 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017) and the subsequent annual technical manual updates (Dynamic Learning Maps Consortium, 2017, 2018a, 2018b, 2019a).

9.1 Evidence Based on Test Content

Evidence based on test content relates to the evidence “obtained from an analysis of the relationship between the content of the test and the construct it is intended to measure” (American Educational Research Association et al., 2014, p. 14).

This section presents results from data collected during 2019–2020 regarding an internal alignment review and an external alignment follow-up study. For additional evidence based on test content, including the alignment of test content to content standards, see Chapter 9 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017).

An external alignment study was originally conducted in 2017 to evaluate the relationships between the NGSS, the DLM science EEs and linkage levels, and assessment items (see Nemeth & Purl, 2017). The results from the study demonstrated acceptable overall alignment within the assessment system. However, a few specific areas did not meet the criteria and benefited from additional evaluation. The following sections describe the steps taken to further evaluate the 2017 alignment study findings and the results from those investigations.

9.1.1 Internal Alignment Follow-up Review

One finding from the 2017 external alignment study indicated that 87% of high school unique EEs were rated as matching the corresponding science and engineering practice (SEP) from the NGSS performance expectation (PE), just missing the criterion threshold of 90% (Nemeth & Purl, 2017). In that study, alignment was reported for three EE pools: those unique to the high school general science blueprint, those unique to the high school Biology blueprint, and those that overlapped the two blueprints. The original study concluded that, when the ratings for unique and common EEs were considered together based on the blueprint for the general high school science assessment, the criterion was met. While no corrective action was recommended, internal staff reviewed the two EEs rated as partially or not aligned to the NGSS SEP by some of the panelists. The review was conducted by two internal DLM staff who were not engaged with the original EE review or writing process. The purpose of the review and feedback was to inform any future potential revisions to the EEs.

The consensus from the internal review of the two high school EEs confirmed the findings from the 2017 study. That is, the reviewers agreed that the two EEs, as currently written, aligned to different SEPs than the intended corresponding PE SEP. Both reviewers suggested revising the language of the two EEs to better align to the intended SEP.

Staff determinations for the two EEs fed into separate but related development work, already planned, to revise and expand the existing DLM Science EEs. That process began with an educator event in November 2019. The two high school EEs noted in the HumRRO study (Nemeth & Purl, 2017) and confirmed during the internal review were included in the review process during the November event. By 2022, the updated DLM science EEs will be voted on for acceptance by the DLM Governance Board. ATLAS staff will work with the DLM Governance Board and Technical Advisory Committee to develop a timeline and process to revise the blueprints based on the 2022 EEs, develop and field test new items, administer new assessments, set new achievement standards, and provide validity evidence including technical quality for the new assessments.

9.1.2 External Alignment Follow-up Study

The original external alignment study for the Dynamic Learning Maps (DLM) Science Assessment System was conducted by HumRRO in 2016 (Nemeth & Purl, 2017). HumRRO used panelist ratings to examine the alignment between the NGSS and DLM Science EEs, the vertical articulation of linkage levels for each EE, and the alignment of testlets and items to linkage levels. Alignment was evaluated using various criteria, which included examining the relationship between the cognitive process dimension (CPD) of the EEs and the NGSS, and the relationship between the CPD of the science items and the corresponding EE. The Target linkage level was used in the item-EE CPD comparisons. While the taxonomy for CPD ranges from 1 (pre-intentional) to 10 (create), items were written to focus on level 5 (remember) and above. Results of the study indicated good overall alignment on most criteria. However, three middle school EEs were rated as having a higher CPD than the NGSS performance expectation, and in 65% of cases, the CPD for high school and high school biology common items was rated higher than that of the associated EE. For further details about the original external alignment study findings, refer to the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017).

EdMetric conducted a follow-up external alignment study in 2019 (Davidson, 2020). The purpose of the follow-up study was to (1) provide formative feedback about the three middle school EEs that were rated as having a higher CPD than the NGSS performance expectation, and (2) evaluate how the CPD of an updated pool of science items (N = 409) aligns with the EEs at each intended linkage level (Initial, Precursor, and Target). The criterion for the CPD alignment was based on the previous study (Nemeth & Purl, 2017).

The following sections provide a brief summary of findings from the follow-up external alignment study. A summary of the ATLAS response to those findings is described after the EdMetric findings.

9.1.2.1 Middle School Essential Element Review of Cognitive Process Dimension

To investigate the three middle school EEs flagged in the original study as having a higher CPD than the NGSS (EE.MS.PS3.3, EE.MS.LS1.5, and EE.MS.ESS3.3), panelists reviewed the original findings and indicated through focus group interviews whether they recommended different CPD ratings than the original study and whether they had an explanation for the original study’s findings. Panelists’ responses were reviewed and coded. Table 9.1 compares the panelists’ ratings with the ratings from the original study. While the updated panelists’ ratings differed slightly from the original ratings, the overall alignment outcome did not change based on the original criterion (that 75% or more of the EE ratings were at the same or lower cognitive process dimension as the NGSS performance expectation). Panelists supported the original finding that the three middle school EEs had a higher CPD rating than the NGSS performance expectations. Panelists provided language from each standard as evidence for their ratings (Davidson, 2020).

Table 9.1: Cognitive Process Dimension Ratings for EE.MS.PS3.3, EE.MS.LS1.5, and EE.MS.ESS3.3
Standard	NGSS Rating	Group	Original Rating	New Rating
PS3.3	8	1	9	9
PS3.3	8	2	9	9
LS1.5	7	1	9	8
LS1.5	7	2	9	9
ESS3.3		1	9	10
ESS3.3		2	9	(Rater 1) 9; (Rater 2) 10

9.1.2.2 Linkage Level and Item Cognitive Process Dimensions

To evaluate the relationship of the CPD of science items to the linkage levels for each EE, panelists first rated the CPD for the Initial and Precursor linkage levels for each EE. The Target level ratings from the original study were used (Nemeth & Purl, 2017). Panelists completed independent ratings and then participated in a group discussion to determine a consensus CPD rating for the linkage levels. For the 409 items in 102 total testlets across grade bands (elementary, middle, and high school), panelists rated the items independently before coming to a consensus based on group discussions. The item-level CPD ratings for each grade band represented each level in the intended range (level 5 [remember] was the lowest rating; there were no level 10 [create] ratings; Davidson, 2020).

The alignment results comparing the CPD ratings of items to the CPD ratings of linkage levels are displayed in Table 9.2 by domain and Table 9.3 by linkage level. The alignment criterion was based on the original study (Nemeth & Purl, 2017): acceptable alignment based on CPD meant that 75% or more of the item ratings were lower than or equal to the rating of the corresponding linkage level.
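The criterion can be expressed as a short check. The following is a minimal sketch, not the procedure used in the study: the function name `cpd_alignment` and the example ratings are hypothetical, and the only value taken from the text is the 75% threshold.

```python
def cpd_alignment(item_cpd, linkage_level_cpd, threshold=0.75):
    """Return (share of items rated at or below the linkage level CPD, criterion met?)."""
    if len(item_cpd) != len(linkage_level_cpd) or not item_cpd:
        raise ValueError("ratings must be non-empty and paired one-to-one")
    # Count items whose CPD rating is lower than or equal to the linkage level rating.
    at_or_below = sum(i <= l for i, l in zip(item_cpd, linkage_level_cpd))
    share = at_or_below / len(item_cpd)
    return share, share >= threshold


# Hypothetical consensus ratings for four items and their linkage levels
# (illustrative values only, not taken from the study).
items = [5, 6, 7, 8]
levels = [6, 6, 7, 7]
share, aligned = cpd_alignment(items, levels)
print(f"{share:.0%} at or below; aligned: {aligned}")
```

Treating a share of exactly 75% as meeting the criterion follows the "75% or more" wording above.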

By domain, the alignment criterion was met for items in the elementary and middle school grade bands at the domain level and the overall level (Table 9.2). In the high school grade band (general science), 70% of items in the earth/space science domain and 71% of items in the physical science domain had a CPD rated as lower than or equal to the CPD of the corresponding linkage level, which is near the 75% alignment threshold. In the life science domain, the alignment criterion was not met for high school (general science) or high school (biology), with approximately 50% of items rated at or below the linkage level CPD in both pools. By linkage level, the criterion for alignment was met across grade bands and assessments for items at the Initial linkage level (Table 9.3). Items at the Precursor linkage level met the criterion for alignment in the elementary and middle school grade bands but did not meet the criterion for high school (general science) and high school (biology). Items at the Target linkage level met the criterion for alignment for middle school but did not reach the 75% threshold for the elementary grade band, high school (general science), or high school (biology).

Table 9.2: Cognitive Process Dimension Rating Alignment Results by Domain
Domain	Total Items	% Higher CPD	% Lower or Equal CPD	Alignment
Elementary
Earth/Space Science	35	20	80	Yes
Life Science	26	19	81	Yes
Physical Science	48	4	96	Yes
Total	109	13	87	Yes
Middle School
Earth/Space Science	39	3	97	Yes
Life Science	39	3	97	Yes
Physical Science	39	18	82	Yes
Total	117	8	92	Yes
High School – General Science
Earth/Space Science	37	30	70	No
Life Science	40	50	50	No
Physical Science	38	29	71	No
Total	115	37	64	No
High School – Biology
Life Science	107	51	49	No
Note. Three Life Science EEs were common to both the General Science and Biology assessments and had the same items.

Table 9.3: Cognitive Process Dimension Rating Alignment Results by Linkage Level
Linkage Level	Total Items	% Higher CPD	% Lower or Equal CPD	Alignment
Elementary
Initial	39	0	100	Yes
Precursor	43	14	86	Yes
Target	27	30	70	No
Total	109	13	87	Yes
Middle School
Initial	45	0	100	Yes
Precursor	45	11	89	Yes
Target	27	15	85	Yes
Total	117	8	92	Yes
High School – General Science
Initial	42	10	90	Yes
Precursor	45	40	60	No
Target	28	72	29	No
Total	115	37	64	No
High School – Biology
Initial	37	19	81	Yes
Precursor	37	49	51	No
Target	33	88	12	No
Total	107	50	50	No
Note. Three Life Science EEs were common to both the General Science and Biology assessments and had the same items.

9.1.2.3 ATLAS Response to EdMetric Findings

Parallel to the EdMetric follow-up study, development work began in 2019 to revise and expand upon the existing DLM Science EEs, which were originally developed in 2014. The three middle school EEs flagged as misaligned based on CPD were included in a broader educator panel review process during a November 2019 EE expansion event. During the review, educators who were trained to evaluate EEs against the three dimensions of the NGSS suggested edits to the three middle school EE descriptions to address the identified CPD issue. In 2022, the updated DLM science EEs will be voted on for acceptance by the DLM Governance Board. ATLAS staff will work with the DLM Governance Board and Technical Advisory Committee to develop a timeline and process to revise the blueprints based on the 2022 EEs, develop and field test new items, administer new assessments, set new achievement standards, and provide validity evidence including technical quality for the new assessments.

Additionally, the DLM Essential Element Concept Maps that item writers use in testlet development will be updated by spring 2022 to include a range of expected CPDs that may be demonstrated for a linkage level. This information is intended to allow for flexibility in design while promoting further alignment between items and their corresponding linkage level. Item alignment to linkage level within testlets will also be reviewed internally and externally using updated criteria and in accordance with the annual testlet development cycle.

9.2 Evidence Based on Response Processes

The study of test takers’ response processes provides evidence about the fit between the test construct and the nature of how students actually experience test content (American Educational Research Association et al., 2014). Due to the COVID-19 pandemic, teacher survey responses and test administration observations were significantly reduced from prior years. The data collected from the limited samples of survey responses and test administration observations are not included in this chapter as they may not accurately represent the full DLM teacher population. Information on the number of test administration observations collected is presented in this section. For additional evidence based on response process, including studies on student and teacher behaviors during testlet administration and evidence of fidelity of administration, see Chapter 9 of the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017).

9.2.1 Test Administration Observations

Prior to the onset of the COVID-19 pandemic, test administration observations were conducted in multiple states during 2019–2020 to further understand student response processes. Students’ typical test administration process with their actual test administrator was observed. Test administration observations were collected by state and local education agency staff.

Consistent with previous years, the DLM Consortium used a test administration observation protocol to gather information about how educators in the consortium states deliver testlets to students with the most significant cognitive disabilities. This protocol gave observers, regardless of their role or experience with DLM assessments, a standardized way to describe how DLM testlets were administered. The test administration observation protocol captured data about student actions (e.g., navigation, responding), educator assistance, variations from standard administration, engagement, and barriers to engagement. The observation protocol was used only for descriptive purposes; it was not used to evaluate or coach educators or to monitor student performance. Most items on the protocol were a direct report of what was observed, such as how the test administrator prepared for the assessment and what the test administrator and student said and did. One section of the protocol asked observers to make judgments about the student’s engagement during the session.

In 2019–2020, there were 223 observations collected in six states.

9.3 Evidence Based on Internal Structure

Analyses of an assessment’s internal structure indicate the degree to which “relationships among test items and test components conform to the construct on which the proposed test score interpretations are based” (American Educational Research Association et al., 2014, p. 16). Given the heterogeneous nature of the DLM student population, statistical analyses can examine whether particular items function differently for specific subgroups (e.g., male versus female). Additional evidence based on internal structure is provided across the linkage levels that form the basis of reporting.

9.3.1 Evaluation of Item-Level Bias

Differential item functioning (DIF) addresses the challenge created when some test items are “asked in such a way that certain groups of examinees who are knowledgeable about the intended concepts are prevented from showing what they know” (Camilli & Shepard, 1994, p. 1). DIF analyses can uncover internal inconsistency if particular items function differently in a systematic way for identifiable subgroups of students (American Educational Research Association et al., 2014). While identification of DIF does not always indicate weakness in a test item, it can point to construct-irrelevant variance, posing considerations for validity and fairness.

9.3.1.1 Method

DIF analyses for 2020 followed the same procedure used in previous years and examined ethnicity in addition to gender. Analyses included data from 2015–2016 through 2018–2019⁵ to flag items for evidence of DIF. Items were selected for inclusion in the DIF analyses based on minimum sample-size requirements for the two gender subgroups (male and female) and five ethnicity subgroups (White, African American, American Indian, Asian, and two or more races).

⁵ DIF analyses are conducted on the sample of data used to update the model calibration, which uses data through the previous operational assessment. See Chapter 5 of this manual for more information.

The DLM student population is unbalanced in both gender and ethnicity. The number of female students responding to items is smaller than the number of male students by a ratio of approximately 1:2. Similarly, the number of non-white students responding to items is smaller than the number of white students by a ratio of approximately 1:2. Therefore, a threshold for item inclusion was retained from previous years whereby the focal group must have at least 100 students responding to the item. The threshold of 100 was selected to balance the need for a sufficient sample size in the focal group with the relatively low number of students responding to many DLM items.

Consistent with previous years, additional criteria were included to prevent estimation errors. Items with an overall proportion correct (p-value) greater than .95 or less than .05 were removed from the analyses. Items for which the p-value for one gender or ethnicity group was greater than .97 or less than .03 were also removed from the analyses.
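The inclusion rules described above can be sketched as a single filter. This is an illustrative sketch only: the function name `dif_inclusion`, the order in which the criteria are checked, and the example values are assumptions. The thresholds (100 students, .05/.95, .03/.97) come from the text.

```python
def dif_inclusion(focal_n, overall_p, subgroup_ps, min_focal_n=100):
    """Return the first failed criterion, or None if the comparison is retained.

    The order of the checks is an assumption; the text lists the criteria
    but does not say which is applied first.
    """
    # Focal group must have at least 100 students responding to the item.
    if focal_n < min_focal_n:
        return "sample size"
    # Overall proportion correct must lie between .05 and .95.
    if overall_p > 0.95 or overall_p < 0.05:
        return "item proportion correct"
    # Each group's proportion correct must lie between .03 and .97.
    if any(p > 0.97 or p < 0.03 for p in subgroup_ps):
        return "subgroup proportion correct"
    return None


# Hypothetical item-by-focal-group comparisons (illustrative values only).
print(dif_inclusion(150, 0.60, [0.58, 0.62]))  # retained
print(dif_inclusion(99, 0.60, [0.58, 0.62]))   # focal group too small
```

The three return labels mirror the column groups used to report excluded comparisons in Tables 9.5 and 9.6.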

Using the above criteria for inclusion, 490 (88%) items were selected for gender, and 490 (88%) items were selected for at least one ethnicity group comparison. For both gender and ethnicity, the number of items evaluated by grade band ranged from 157 in grades 3–5 to 169 in grades 6–8. Because there are a total of seven ethnicity groups that students can be categorized in for DLM assessments,⁶ there are up to six comparisons that can be made for each item, with the White ethnic group as the reference group and each of the other six ethnic groups (i.e., African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, two or more races) as the focal group. Across all items, this results in 3,360 possible comparisons. Using the inclusion criteria specified above, 1,814 (54%) item and focal group comparisons were selected for analysis. Overall, 7 items were evaluated for two ethnic groups, 132 items were evaluated for three ethnic groups, and 351 items were evaluated for four ethnic groups. Table 9.4 shows the number of items that were evaluated for each ethnic focal group. Sample sizes for each comparison ranged from 2,319 to 16,249 for gender and from 1,597 to 13,566 for ethnicity.

⁶ See Chapter 7 of this manual for a summary of participation by ethnicity and other demographic variables.

Table 9.4: Number of Items Evaluated for Each Ethnicity
Focal Group	Items (n)
Asian	483
African American	490
American Indian	351
Two or more races	490
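The counts reported above can be cross-checked with simple arithmetic. The pool size of 560 items is not stated directly; it is inferred here from the 3,360 possible comparisons and the six focal groups.

\[\begin{align} 3{,}360 / 6 &= 560 \text{ items in the pool} \\ 490 / 560 &= 0.875 \approx 88\% \\ (7 \times 2) + (132 \times 3) + (351 \times 4) &= 14 + 396 + 1{,}404 = 1{,}814 \\ 483 + 490 + 351 + 490 &= 1{,}814 \\ 1{,}814 / 3{,}360 &\approx 0.54 \end{align}\]

Both the per-item breakdown and the per-focal-group counts in Table 9.4 sum to the same 1,814 comparisons selected for analysis.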

All 70 items that were not included in the DIF analysis for gender had a focal group sample size of less than 100. A total of 70 items were not included in the DIF analysis for ethnicity for any of the subgroups. Of the 1,546 item and focal group comparisons that were not included in the DIF analysis for ethnicity, 1,532 (99%) had a focal group sample size of less than 100 and 14 (1%) had a subgroup p-value greater than .97. Table 9.5 and Table 9.6 show the number and percent of comparisons that did not meet each inclusion criterion for gender and ethnicity, respectively, by the linkage level the items assess.

Table 9.5: Comparisons Not Included in DIF Analysis for Gender, by Linkage Level
	Sample Size		Item Proportion Correct		Subgroup Proportion Correct
Linkage Level	n	%	n	%	n	%
Initial	22	31.4	0	0.0	0	0.0
Precursor	25	35.7	0	0.0	0	0.0
Target	23	32.9	0	0.0	0	0.0

Table 9.6: Comparisons Not Included in DIF Analysis for Ethnicity, by Linkage Level
	Sample Size		Item Proportion Correct		Subgroup Proportion Correct
Linkage Level	n	%	n	%	n	%
Initial	598	39.0	0	0.0	0	0.0
Precursor	626	40.9	0	0.0	6	42.9
Target	308	20.1	0	0.0	8	57.1

For each item, logistic regression was used to predict the probability of a correct response, given group membership and performance in the subject. Specifically, the logistic regression equation for each item included a matching variable consisting of the student’s total linkage levels mastered in the subject of the item and a group membership variable, with the reference group (i.e., males for gender, White for ethnicity) coded as 0 and the focal group (i.e., females for gender; African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, or two or more races for ethnicity) coded as 1. An interaction term was included to evaluate whether non-uniform DIF was present for each item (Swaminathan & Rogers, 1990); the presence of non-uniform DIF indicates that the item functions differently because of the interaction between total linkage levels mastered and the student’s group (i.e., gender or ethnic group). When non-uniform DIF is present, the group with the highest probability of a correct response to the item differs along the range of total linkage levels mastered; thus, one group is favored at the low end of the spectrum and the other group is favored at the high end.

Three logistic regression models were fitted for each item:

\[\begin{align} \text{M}_0\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} \tag{9.1} \\ \text{M}_1\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G \tag{9.2} \\ \text{M}_2\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G + \beta_3\text{X}G\tag{9.3}; \end{align}\]

where \(\pi_i\) is the probability of a correct response to the item for group \(i\), \(\text{X}\) is the matching criterion, \(G\) is a dummy coded grouping variable (0 = reference group, 1 = focal group), \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(\beta_2\) is the group-specific parameter, and \(\beta_3\) is the interaction term.
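To make the difference between uniform and non-uniform DIF concrete, the following sketch evaluates equations (9.1) through (9.3) for given coefficients. It is illustrative only: the function name `p_correct` and all coefficient values are hypothetical and are not estimates from the DLM data.

```python
import math


def p_correct(x, g, b0, b1, b2=0.0, b3=0.0):
    """Probability of a correct response under M2.

    Setting b3 = 0 gives M1; setting b2 = b3 = 0 gives M0.
    x is the matching criterion; g is 0 (reference) or 1 (focal).
    """
    logit = b0 + b1 * x + b2 * g + b3 * x * g
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical coefficients, chosen only to illustrate the two patterns.
uniform = dict(b0=-1.0, b1=0.3, b2=-0.5)             # M1: constant shift in the logit
nonuniform = dict(b0=-1.0, b1=0.3, b2=0.6, b3=-0.2)  # M2: gap changes with x

for x in (0, 3, 6):
    # Focal-minus-reference difference in probability at the same x.
    gap_u = p_correct(x, 1, **uniform) - p_correct(x, 0, **uniform)
    gap_n = p_correct(x, 1, **nonuniform) - p_correct(x, 0, **nonuniform)
    print(x, round(gap_u, 3), round(gap_n, 3))
```

With these made-up values, the uniform model disadvantages the focal group at every level of the matching criterion, whereas the non-uniform model favors the focal group at low values and the reference group at high values, which is the crossing pattern the interaction term is designed to detect.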

Because of the number of items evaluated for DIF, Type I error rates were susceptible to inflation. The incorporation of an effect-size measure can be used to distinguish practical significance from statistical significance by providing a metric of the magnitude of the effect of adding group and interaction terms to the regression model.

For each item, the change in the Nagelkerke pseudo \(R^2\) measure of effect size was captured, from \(M_0\) to \(M_1\) or \(M_2\), to account for the effect of the addition of the group and interaction terms to the equation. All effect-size values were reported using both the Zumbo and Thomas (1997) and Jodoin and Gierl (2001) indices for reflecting a negligible, moderate, or large effect. The Zumbo and Thomas thresholds for classifying DIF effect size are based on Cohen’s (1992) guidelines for identifying a small, medium, or large effect. The thresholds for each level are .13 and .26; values less than .13 have a negligible effect, values between .13 and .26 have a moderate effect, and values of .26 or greater have a large effect. The Jodoin and Gierl thresholds are more stringent, with lower threshold values of .035 and .07 to distinguish between negligible, moderate, and large effects.
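The effect-size step can be sketched as follows. This is a minimal illustration under stated assumptions: the Nagelkerke formula is the standard one based on model and intercept-only log-likelihoods, the function names and log-likelihood values are hypothetical, the label "A" for a negligible effect is an assumption (the tables in this chapter show only B and C), and the Jodoin and Gierl boundaries are assumed to be handled the same way as the Zumbo and Thomas boundaries.

```python
import math


def nagelkerke_r2(ll_model, ll_null, n):
    """Nagelkerke pseudo R^2 from model and intercept-only log-likelihoods."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def classify(delta_r2, moderate, large):
    """Label a change in R^2 as A (negligible), B (moderate), or C (large)."""
    if delta_r2 >= large:
        return "C"
    if delta_r2 >= moderate:
        return "B"
    return "A"


ZUMBO_THOMAS = (0.13, 0.26)
JODOIN_GIERL = (0.035, 0.07)

# Hypothetical log-likelihoods for one item (not DLM results).
n_students = 1000
ll_null, ll_m0, ll_m1 = -650.0, -600.0, -598.0
delta = nagelkerke_r2(ll_m1, ll_null, n_students) - nagelkerke_r2(ll_m0, ll_null, n_students)
print(classify(delta, *ZUMBO_THOMAS), classify(delta, *JODOIN_GIERL))
```

Because the Jodoin and Gierl cut points are lower, a change of .05 would be moderate under their criteria but negligible under the Zumbo and Thomas criteria.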

9.3.1.2 Results

9.3.1.2.1 Uniform DIF Model

A total of 86 items were flagged for evidence of uniform DIF for gender when comparing \(\text{M}_0\) to \(\text{M}_1\). Additionally, 215 item and focal group combinations across 171 items were flagged for evidence of uniform DIF for ethnicity. Table 9.7 and Table 9.8 summarize the total number of combinations flagged for evidence of uniform DIF by grade band for gender and ethnicity, respectively. The percentage of combinations flagged for uniform DIF ranged from 11% to 25% for gender and 11% to 13% for ethnicity.

Table 9.7: Combinations Flagged for Evidence of Uniform DIF for Gender
Grade	Items flagged (n)	Total items (N)	Items flagged (%)	Items with moderate or large effect size (n)
3–5	18	157	11.5	0
6–8	42	169	24.9	1
9–12	26	164	15.9	0

Table 9.8: Combinations Flagged for Evidence of Uniform DIF for Ethnicity
Grade	Items flagged (n)	Total items (N)	Items flagged (%)	Items with moderate or large effect size (n)
3–5	68	593	11.5	0
6–8	67	624	10.7	0
9–12	80	597	13.4	1

For gender, under both the Zumbo and Thomas (1997) and the Jodoin and Gierl (2001) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the gender term was added to the regression equation.

The results of the DIF analyses for ethnicity were similar to those for gender. Under both the Zumbo and Thomas (1997) and the Jodoin and Gierl (2001) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the ethnicity term was added to the regression equation.

Table 9.9 provides information about the flagged items with a non-negligible effect-size change after the addition of the group term, as represented by a value of B (moderate) or C (large). The \(\beta_2G\) values in Table 9.9 indicate which group was favored on the item after accounting for total linkage levels mastered, with positive values indicating that the focal group had a higher probability of success on the item and negative values indicating that the focal group had a lower probability of success on the item. The focal group was not favored on either combination.

Table 9.9: Combinations Flagged for Uniform DIF With Moderate or Large Effect Size
Item ID	Focal	Grade	EE	\(\chi^2\)	\(p\)-value	\(\beta_2G\)	\(R^2\)	Z&T^*	J&G^*
50455	Female	6–8	SCI.EE.MS.PS2-2	7.09	.008	−0.14	.859	C	C
49386	Black	9–12	SCI.EE.HS.LS4-2	18.25	< .001	−0.31	.771	C	C
Note. EE = Essential Element; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
^* Effect-size measure.
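One way to read the \(\beta_2\) estimates in Table 9.9 is as odds ratios. This reading assumes the values are reported on the logit scale of model \(\text{M}_1\) in equation (9.2), with \(G = 1\) for the focal group. Holding the matching variable \(\text{X}\) fixed, the difference in log-odds between the focal and reference groups equals \(\beta_2\), so \(e^{\beta_2}\) is the focal-to-reference odds ratio:

\[\begin{align} \frac{\text{odds}(G = 1)}{\text{odds}(G = 0)} &= e^{\beta_2} \\ \text{Item 50455: } e^{-0.14} &\approx 0.87 \\ \text{Item 49386: } e^{-0.31} &\approx 0.73 \end{align}\]

Under that assumption, focal group students with the same number of linkage levels mastered had roughly 13% and 27% lower odds of a correct response on the two items, respectively.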

9.3.1.2.2 Combined Model

A total of 123 items were flagged for evidence of DIF when both the gender and interaction terms were included in the regression equation, as shown in equation (9.3). Additionally, 242 item and focal group combinations across 193 items were flagged for evidence of DIF when both the ethnicity and interaction terms were included in the regression equation. Table 9.10 and Table 9.11 summarize the number of combinations flagged by grade. The percentage of combinations flagged ranged from 21% to 31% for gender and 12% to 16% for ethnicity.

Table 9.10: Items Flagged for Evidence of DIF for the Combined Model for Gender
Grade	Items flagged (n)	Total items (N)	Items flagged (%)	Items with moderate or large effect size (n)
3–5	35	157	22.3	0
6–8	53	169	31.4	1
9–12	35	164	21.3	0

Table 9.11: Items Flagged for Evidence of DIF for the Combined Model for Ethnicity
Grade	Items flagged (n)	Total items (N)	Items flagged (%)	Items with moderate or large effect size (n)
3–5	76	593	12.8	0
6–8	72	624	11.5	0
9–12	94	597	15.7	1

Under both the Zumbo and Thomas (1997) and the Jodoin and Gierl (2001) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation.

The results of the DIF analyses for ethnicity were similar to those for gender. Under both the Zumbo and Thomas (1997) and the Jodoin and Gierl (2001) effect-size classification criteria, all but one combination were found to have a negligible effect-size change after the ethnicity and interaction terms were added to the regression equation.

Information about the flagged items with a non-negligible change in effect size after adding both the group and interaction terms is summarized in Table 9.12, where B indicates a moderate effect size and C a large effect size. In total, two combinations had a large effect size. These are the same two combinations flagged with a non-negligible effect size for the uniform model. The \(\beta_3\text{X}G\) values in Table 9.12 indicate which group was favored at lower and higher numbers of linkage levels mastered. All combinations favored the focal group at lower numbers of total linkage levels mastered and the reference group at higher numbers of total linkage levels mastered.

Table 9.12: Combinations Flagged for DIF With Moderate or Large Effect Size for the Combined Model
Item ID	Focal	Grade	EE	\(\chi^2\)	\(p\)-value	\(\beta_2G\)	\(\beta_3\text{X}G\)	\(R^2\)	Z&T^*	J&G^*
50455	Female	6–8	SCI.EE.MS.PS2-2	7.12	.028	−0.11	0.00	.859	C	C
49386	Black	9–12	SCI.EE.HS.LS4-2	22.79	< .001	−0.11	−0.05	.771	C	C
Note. EE = Essential Element; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
^* Effect-size measure.
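The relationship between the \(\chi^2\) statistics and p-values in Tables 9.9 and 9.12 can be illustrated with a short calculation. The sketch below assumes the statistics are likelihood-ratio tests with 1 degree of freedom for \(\text{M}_0\) versus \(\text{M}_1\) and 2 degrees of freedom for \(\text{M}_0\) versus \(\text{M}_2\); the chapter does not state the test or degrees of freedom explicitly, and the function name `lr_p_value` is hypothetical.

```python
import math


def lr_p_value(chi2, df):
    """Upper-tail chi-square probability using closed forms for df = 1 or 2."""
    if df == 1:
        # P(Z^2 > chi2) for a standard normal Z.
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        # Chi-square with 2 df is exponential with mean 2.
        return math.exp(-chi2 / 2.0)
    raise ValueError("only df = 1 or df = 2 is supported in this sketch")


# Item 50455: uniform model (assumed df = 1) and combined model (assumed df = 2).
print(round(lr_p_value(7.09, 1), 3))
print(round(lr_p_value(7.12, 2), 3))
```

Under these assumed degrees of freedom, the computed values round to .008 and .028, which agrees with the p-values tabled for item 50455.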

Appendix A includes plots labeled by the item ID, which display the best-fitting regression line for each sub-group, with jitter plots representing the total linkage levels mastered for individuals in each sub-group. Plots are included for the two combinations with a non-negligible effect-size change in the uniform DIF model (Table 9.9), which are the same two items as the two combinations with non-negligible effect-size changes in the combined model (Table 9.12).

9.3.1.3 Test Development Team Review of Flagged Items

The science test development team was provided with a data file that contained information about the items flagged with a large effect size. To avoid biasing the review of the items, the file did not indicate which group was favored.

During their review of the flagged items, the test development team was asked to consider facets of the items that may lead one gender or ethnicity group to provide correct responses at a higher rate than the other. Because DIF is closely related to issues of fairness, the bias and sensitivity external review criteria (see Clark, Beitling, et al., 2016) were provided for the test development team to consider as they reviewed the items. After reviewing the flagged items and considering their context in the testlet, including the engagement activity, the test development team was asked to provide one of three decision codes.

1. Accept: There is no evidence of bias favoring one group or the other. Leave item as is.
2. Minor revision: There is a clear indication that a fix will correct the item if the edit can be made within the allowable edit guidelines.
3. Reject: There is evidence the item favors one gender or ethnicity group over the other. There is no allowable edit to correct the issue. The item is slated for retirement.

After review, the items flagged with a large effect size were given a decision code of 1 (accept) by the test development team. No evidence could be found in the items indicating the content favored one gender or ethnicity group over the other.

As additional data are collected in subsequent operational years, the scope of DIF analyses will be expanded to include additional items and approaches to detecting DIF.

9.3.2 Internal Structure Within Linkage Levels

Internal structure traditionally indicates the relationships among items measuring the construct of interest. However, for DLM assessments, the level of scoring is each linkage level, and all items measuring the linkage level are assumed to be fungible. Therefore, DLM assessments instead present evidence of internal structure across linkage levels, rather than across items. Further, traditional evidence, such as item-total correlations, is not presented because DLM assessment results consist of the set of mastered linkage levels, rather than a scaled score or raw total score.

Chapter 5 of this manual includes a summary of the parameters used to score the assessment, which includes the probability of a master providing a correct response to items measuring the linkage level and the probability of a non-master providing a correct response to items measuring the linkage level. Because a fungible model is used for scoring, these parameters are the same for all items measuring the linkage level. Chapter 5 also provides a description of the linkage level discrimination (i.e., the ability to differentiate between masters and non-masters).

When linkage levels perform as expected, masters should have a high probability of providing a correct response, and non-masters should have a low probability of providing a correct response. As indicated in Chapter 5 of this manual, for 102 (> 99%) linkage levels, masters had a greater than .5 chance of providing a correct response to items. Additionally, for 98 (96%) linkage levels, masters had a greater than .6 chance of providing a correct response, compared to 0 (< 1%) linkage levels where masters had a less than .4 chance of providing a correct response. Similarly, for 84 (82%) linkage levels, non-masters had a less than .5 chance of providing a correct response to items. For most linkage levels (n = 60; 59%), non-masters had a less than .4 chance of providing a correct response; however, for 4 (4%) linkage levels, non-masters had a greater than .6 chance of providing a correct response. Finally, 65 (64%) linkage levels had a discrimination index greater than .4, indicating that most linkage levels are able to discriminate well between masters and non-masters.
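As a quick arithmetic check on the counts above, the following sketch recomputes each percentage. It assumes a total of 102 linkage levels; this denominator is not stated in this section and is inferred only because it reproduces every reported percentage. The labels and the helper name are illustrative.

```python
# Assumed total number of linkage levels (inferred, not stated in this section).
TOTAL_LINKAGE_LEVELS = 102

# label: (reported count, reported percentage)
reported = {
    "masters > .6": (98, 96),
    "non-masters < .5": (84, 82),
    "non-masters < .4": (60, 59),
    "non-masters > .6": (4, 4),
    "discrimination > .4": (65, 64),
}


def percent(count: int, total: int = TOTAL_LINKAGE_LEVELS) -> int:
    """Percentage of linkage levels, rounded to the nearest whole number."""
    return round(100 * count / total)


for label, (count, reported_pct) in reported.items():
    print(f"{label}: {count}/{TOTAL_LINKAGE_LEVELS} = {percent(count)}% "
          f"(reported {reported_pct}%)")
```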

Chapter 3 of this manual includes additional evidence of internal consistency in the form of standardized difference figures. Standardized difference values are calculated to indicate how far from the linkage level mean each item’s p-value falls. Across all linkage levels, 560 (> 99%) items fell within two standard deviations of the mean for the linkage level.
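The standardized difference idea can be sketched as follows. This is a minimal illustration, not the procedure documented in Chapter 3: it assumes an unweighted mean and a sample standard deviation of the item p-values within one linkage level, and the item labels and p-values are hypothetical.

```python
from statistics import mean, stdev


def standardized_differences(p_values: dict) -> dict:
    """Distance of each item's p-value from the linkage level mean, in SD units."""
    values = list(p_values.values())
    center = mean(values)
    spread = stdev(values)  # sample standard deviation (an assumption)
    return {item: (p - center) / spread for item, p in p_values.items()}


def flag_items(p_values: dict, threshold: float = 2.0) -> list:
    """Items whose p-value falls more than `threshold` SDs from the mean."""
    z = standardized_differences(p_values)
    return [item for item, value in z.items() if abs(value) > threshold]


# Hypothetical p-values for ten items measuring one linkage level.
linkage_level_items = {
    "item_01": 0.58, "item_02": 0.60, "item_03": 0.62, "item_04": 0.59,
    "item_05": 0.61, "item_06": 0.60, "item_07": 0.63, "item_08": 0.57,
    "item_09": 0.60, "item_10": 0.20,
}

print(flag_items(linkage_level_items))
```

In this hypothetical set, only the last item falls more than two standard deviations from the linkage level mean; the other nine fall within the band described above.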

These sources, combined with procedural evidence for developing fungible testlets at the linkage level, provide evidence of the consistency of measurement at the linkage levels. For more information on the development of fungible testlets, see the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017). In instances where linkage levels and the items measuring them do not perform as expected, test development teams review flags and prioritize content for revision and re-field test, or retirement, to ensure the content measures the construct as expected.

9.4 Evidence Based on Relation to Other Variables

According to Standards for Educational and Psychological Testing, “analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence” (American Educational Research Association et al., 2014, p. 16). Results from the assessment should be related to other external sources of evidence measuring the same construct.

9.4.1 Postsecondary Opportunities

During 2019–2020, evidence was collected to evaluate the extent to which the DLM alternate academic achievement standards are aligned to ensure that a student who meets these standards is on track to pursue postsecondary education or competitive integrated employment. The 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017) provides evidence of vertical alignment for the alternate academic achievement standards.

Further evidence describes the relationship of the DLM alternate academic achievement standards to the knowledge, skills, and understandings needed for pursuit of postsecondary opportunities. We developed two hypotheses about the expected relationship between meeting DLM alternate academic achievement standards and being prepared for a variety of postsecondary opportunities.

1. Nearly all academic skills will be associated with performance level descriptors at a variety of grades between grade 3 and high school. Few if any academic skills will first occur before grade 3 At Target or after high school At Target.
2. Because academic skills may be associated with multiple opportunities and with soft skills needed for employment and education, we expected Hypothesis 1 to hold for academic skills associated with employment opportunities, education opportunities, and soft skills.

Similar to academic education for all students, academics for students with significant cognitive disabilities develops across grades. Individuals use academic skills at varying levels of complexity, depending on specific employment or postsecondary education settings. Therefore, academic skills associated with achieving At Target in lower grades demonstrate where students are able to apply the least-complex version of the skill. Given the vertical alignment of DLM content and achievement standards, students are expected to continue learning new skills in subsequent grades and be prepared for more-complex applications of the academic skills by the time they transition into postsecondary education and employment.

A panel of experts on secondary transition and/or education of students with significant cognitive disabilities identified postsecondary competitive integrated employment and education opportunities. Their goal was to identify an extensive sampling of opportunities rather than an exhaustive list. Panelists also considered the types of educational and employment opportunities currently available to students with significant cognitive disabilities as well as opportunities that may be more aspirational (i.e., opportunities that may become available in the future). Panelists identified 57 employment opportunities and seven postsecondary education opportunities. Employment opportunities spanned sectors including agriculture, business, arts, education, health sciences, hospitality, information technology, manufacturing, and transportation.

Panelists next identified the knowledge, skills, and understandings needed to fulfill the responsibilities for the employment opportunities as well as eight common responsibilities across all postsecondary education opportunities. Finally, the panel identified the knowledge, skills, and understandings within soft skills (e.g., social skills, self-advocacy) applicable across multiple postsecondary settings. A science subject-matter expert reviewed and refined the academic skills statements to provide clarity and consistency across skills. This resulted in 150 science skills, from which 53 were systematically sampled to be used in the next phase of the study.

The second panel examined the relationship between the academic skills and the types of academic knowledge, skills, and understandings typically associated with meeting the DLM alternate academic achievement standards (i.e., achieving At Target). By identifying the lowest grade where a student achieving At Target is likely to consistently demonstrate the academic skill, the second panel identified the first point where students would be ready to pursue postsecondary opportunities that required the least-complex application of the skill.

Panels consisted of general educators and special educators who administered DLM assessments from across DLM states. Most panelists had expertise across multiple grade bands, and some had certification in both an academic subject and special education. Panels completed training and calibration activities prior to making independent ratings. When there was no initial majority agreement, panels discussed ratings until they reached consensus.
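The rule for deciding when a skill needed panel discussion can be sketched as follows. This is an illustration only: it assumes "majority agreement" means that more than half of the panelists independently chose the same grade, and the function name and ratings are hypothetical.

```python
from collections import Counter
from typing import Optional


def initial_majority(ratings: list) -> Optional[str]:
    """Return the grade chosen by more than half of the panelists, else None.

    A result of None means the independent ratings did not reach a majority,
    so the skill would go to panel discussion.
    """
    grade, votes = Counter(ratings).most_common(1)[0]
    return grade if votes > len(ratings) / 2 else None


# Hypothetical independent ratings from four panelists for two skills.
print(initial_majority(["grade 5", "grade 5", "grade 4", "grade 5"]))
print(initial_majority(["grade 3", "grade 4", "grade 5", "grade 4"]))
```

In the second hypothetical case, the most common rating was chosen by only two of four panelists, so under the assumed rule the skill would be discussed until consensus.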

Panels identified the lowest grade in which students who achieve At Target on the DLM alternate assessment are at least 80% likely to be able to demonstrate each skill, showing the first point of readiness to pursue postsecondary opportunities that require the least-complex application of academic skills. Skill ratings were distributed fairly evenly across grade bands: 40% of science skills are expected to be first demonstrated by students achieving At Target in elementary grades, followed by 35% in middle grades and 25% in high school. Within high school, two skills (4%) were associated with the biology PLDs, nine (17%) with general science, and two (4%) with both biology and high school science.

Overall, findings from panels indicate that most academic skills needed to pursue postsecondary opportunities are first associated with meeting the DLM academic achievement expectations in elementary or middle grades. Given the vertical alignment of the DLM academic achievement standards, students who achieve At Target in early grades further develop these skills so that, by the time they leave high school, they are ready to pursue postsecondary opportunities that require more-complex applications of the academic skills.

Evaluations of panelists’ experiences from both panels and DLM Technical Advisory Committee members’ review of the processes and evaluation results provide evidence that the methods and processes used achieved the goals of the study. See Karvonen et al. (2020) for the full version of the postsecondary opportunities technical report.

9.5 Conclusion

This chapter presents additional studies as evidence for the overall validity argument for the DLM Alternate Assessment System. The studies are organized into categories where available (content, response process, internal structure, and relation to other variables), as defined by the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014), the professional standards used to evaluate educational assessments.

The final chapter of this manual, Chapter 11, references evidence presented throughout the technical manual, including Chapter 9, and expands the discussion of the overall validity argument. Chapter 11 also provides areas for further inquiry and ongoing evaluation of the DLM Alternate Assessment System, building on the evidence presented in the 2015–2016 Technical Manual—Science (Dynamic Learning Maps Consortium, 2017) and the subsequent annual technical manual updates (Dynamic Learning Maps Consortium, 2017, 2018a, 2018b, 2019a), in support of the assessment’s validity argument.