Computerized adaptive tests based on new assessment formats research team:

Since September 2023, we have been working on this exciting project, which addresses topics such as the use of the graded response format in adaptive tests, the exploration of factorial structure, and the question of whether models for continuous or discrete traits should be employed to maximize reliability. We’ll be sharing the results here. Stay tuned!
Some results:
9 Escudero, S., Sorrel, M. A., Kreitchmann, R. S., & Abad, F. J. (manuscript accepted for publication). A comparison of optimization algorithms for forced-choice questionnaire assembly. Methodology. https://meth.psychopen.eu/index.php/meth/aam
This study addresses the challenge of constructing forced-choice questionnaires (FCQs) that balance resistance to faking with accurate trait score recovery, given that item pairing strategies can substantially influence measurement quality. To address this issue, we compare four FCQ assembly methods—a genetic algorithm (GA), two simulated annealing (SA) variants (blueprint-based and scale-parameter-optimized), and a brute-force (BF) random search—through simulation and an empirical application. We examine how questionnaire length, social desirability matching, and the relationship between item discrimination and SD influence score recovery across different item bank configurations, including the presence of heteropolar blocks. Results indicate that GA consistently yields the most accurate and reliable trait estimates, followed by SA with parameter optimization, while all design factors significantly affect performance. The study provides practical recommendations for optimizing FCQ construction, particularly in short tests and high a_j–SD_j correspondence settings.
8 Escudero, S., Vázquez-Lira, R., Leenen, L., & Sorrel, M. A. (in press). Issues and possible solutions in cognitive diagnosis modeling applications: The case of a large-scale educational assessment in Mexico. Annals of Psychology, 42(2), 1-14. https://doi.org/10.6018/analesps
This study addresses the limited number of applied investigations of cognitive diagnosis models (CDMs), despite their strong theoretical development and increasing use in psychological and educational measurement. To bridge the gap between theory and practice, we apply CDM to a large-scale assessment of high school teachers in Mexico that was explicitly designed within a diagnostic framework. Through this application, we identify five key practical issues that emerge during model implementation and illustrate how they can be addressed using R-based procedures. Results highlight the importance of careful empirical evaluation of CDMs in real-world settings, as well as the need for close collaboration with content experts to ensure meaningful interpretation and valid diagnostic inferences. The study provides practical guidance for future applied CDM research and large-scale assessment design.
7 Iglesias, D., Sorrel, M. A., & Olmos, R. (in press). Evaluating the performance of R-Squared measures in multilevel models. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2026.2634294
This study addresses the lack of empirical evaluation of multilevel model (MLM) R² measures, which, although integrated into a unifying interpretative framework, have been primarily defined at the population level without assessing their performance under realistic applied conditions. To address this gap, we evaluate the performance of various MLM R² measures as estimators of their population counterparts through Monte Carlo simulations, systematically varying factors such as the number of level-1 and level-2 predictors, the presence of cross-level interactions, and random slopes. Results indicate that increasing the number of level-2 predictors requires a larger number of clusters to achieve accurate estimates, while greater model complexity—reflected in more level-1 predictors, cross-level interactions, and random slopes—demands increases in either cluster size or observations per cluster.
6 Nájera, P., Abad, F. J., Chiu, C-Y., & Sorrel, M. A. (in press). Variable-length cognitive diagnostic computerized adaptive testing in small-scale assessments. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/10769986251366581
This study addresses the limitations of cognitive diagnostic computerized adaptive testing (CD-CAT) in small-scale assessments, where traditional parametric approaches often suffer from overfitting and inflated reliability due to limited calibration samples, while nonparametric methods lack reliability information. To overcome these issues, we propose four CD-CAT procedures based on the parsimonious R-DINA model, including both calibration-dependent and calibration-free approaches. We evaluate their performance through simulation under varying conditions relevant to small-sample contexts. Results indicate that the proposed methods achieve a better balance between classification accuracy and reliability estimation, with calibration-free approaches showing particular promise when calibration samples are unavailable. The study provides practical guidance for implementing adaptive cognitive diagnostic testing in small-scale settings. The proposed method is available in the R package cdcatR.
5 Graña, D. F., Kreitchmann, R. S., Sorrel, M. A., Garrido, L. E., & Abad, F. J. (2026). Dimensionality assessment in forced-choice questionnaires: First steps toward an exploratory framework. Educational and Psychological Measurement, 86(1), 54-81. https://doi.org/10.1177/00131644251358226
This study addresses the challenge of assessing dimensionality in forced-choice (FC) questionnaires, a format increasingly used to reduce social desirability but characterized by complex multidimensional structures and restrictive confirmatory assumptions. To fill the lack of systematic evaluation of exploratory approaches, we compare five commonly used dimensionality assessment methods through a Monte Carlo simulation that manipulates key design features such as number of dimensions, item structure, response format, and sample size. Results indicate that Parallel Analysis and the Maximal Kaiser Criterion outperform alternative methods in terms of accuracy and bias, with performance improving under specific design conditions (e.g., longer tests or inclusion of heteropolar blocks). The study provides practical guidance for improving both questionnaire design and dimensionality assessment in FC contexts.
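For readers unfamiliar with Parallel Analysis, the basic idea can be sketched as follows. This is a minimal, generic illustration (eigenvalues of a Pearson correlation matrix compared against eigenvalues from random normal data), not the implementation evaluated in the article; the function name `parallel_analysis` and its parameters are hypothetical, and forced-choice data would normally call for a correlation matrix suited to that response format.

```python
import numpy as np

def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Suggest a number of dimensions by comparing observed eigenvalues
    with eigenvalues obtained from random data of the same shape.

    Assumption: Pearson correlations and normal random data are used
    purely for illustration.
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    rng = np.random.default_rng(seed)

    # Observed eigenvalues, sorted from largest to smallest.
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    # Reference eigenvalues from uncorrelated random data.
    sim = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        sim[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(sim, percentile, axis=0)

    # Retain leading eigenvalues until one falls below its threshold.
    exceeds = obs > threshold
    n_dims = p if exceeds.all() else int(np.argmin(exceeds))
    return n_dims, obs, threshold
```

Calling `parallel_analysis(X)` on an n-by-p data matrix returns the suggested number of dimensions together with the observed eigenvalues and the random-data thresholds, which makes the decision easy to inspect.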
4 Nájera, P., Ma, W., Sorrel, M. A., & Abad, F. J. (2025). Assessing item-level fit for the sequential G-DINA model. Behaviormetrika, 1-24. https://doi.org/10.1007/s41237-025-00263-8
This study addresses the need for item-level model fit assessment in the Sequential Process Model (SPM), an extension of Cognitive Diagnostic Models (CDMs) designed for graded responses. While prior work proposed tools for Q-matrix validation and category-level model selection in the SPM, item-level fit had not yet been explored. The authors adapt three well-known item-fit statistics to the SPM and evaluate their performance through simulation. Results show the methods are generally conservative but effective in detecting meaningful misspecifications, with practical guidance provided for applied use.
3 Nájera, P., Kreitchmann, R. S., Escudero, S., Abad, F. J., de la Torre, J., & Sorrel, M. A. (2025). A general diagnostic modeling framework for forced-choice items. British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12393
This study extends diagnostic classification modeling (DCM) to better handle forced-choice (FC) item formats used in assessing noncognitive traits. It introduces an adaptation of the G-DINA model for FC responses, addressing limitations in Huang’s (2023) FC-DCM, particularly under variable item discrimination. Simulation studies and a real-data application show that the adapted G-DINA model achieves more accurate classifications and better model fit.
The proposed method is available in the R package cdmTools.
2 Iglesias, D., Sorrel, M. A., & Olmos, R. (2025). Cross-validation and predictive metrics in psychological research: Do not leave out the leave-one-out. Behavior Research Methods, 3, 57-85. https://doi.org/10.3758/s13428-024-02588-w
This study explores how to better integrate explanatory and predictive practices in psychological research by improving how prediction error is estimated. It highlights the limitations of common cross-validation (CV) methods, especially when estimating prediction error in the widely used R² metric. We propose the use of an alternative method to compute leave-one-out (LOO) R², which outperforms traditional approaches. Results from simulations and real data show that this method is more accurate, and it is available in the R package OutR2.
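As a rough illustration of what a leave-one-out R² is, the sketch below computes one for ordinary least squares using the standard hat-matrix shortcut, so the model is fitted only once. It is a generic textbook computation and not necessarily the estimator proposed in the article or implemented in OutR2; the function name `loo_r2` is hypothetical, and using the leave-one-out mean as the baseline prediction is an assumption of this sketch.

```python
import numpy as np

def loo_r2(X, y):
    """Leave-one-out R^2 for an OLS model with intercept.

    Uses the identity e_loo_i = e_i / (1 - h_ii), where h_ii is the
    i-th diagonal element of the hat matrix, so no refitting is needed.
    Assumption: the baseline is the mean of y computed without case i.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept

    # Leverages from the thin QR decomposition of the design matrix.
    Q, _ = np.linalg.qr(Xd)
    h = np.sum(Q**2, axis=1)

    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta

    # Out-of-sample squared error of the model (PRESS).
    press = np.sum((resid / (1.0 - h)) ** 2)

    # Out-of-sample squared error of the mean-only baseline.
    loo_mean = (y.sum() - y) / (n - 1)
    sst_loo = np.sum((y - loo_mean) ** 2)

    return 1.0 - press / sst_loo
```

Unlike the in-sample R², this quantity can be negative when the model predicts new cases worse than the simple mean does.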
1 Graña, D. F., Kreitchmann, R. S., Abad, F. J., & Sorrel, M. A. (2024). Equally vs. unequally keyed blocks in forced-choice questionnaires: Implications on validity and reliability. Journal of Personality Assessment, 1-14. https://doi.org/10.1080/00223891.2024.2420869
Forced-choice (FC) questionnaires have gained interest, but the inclusion of unequally keyed item pairs remains debated. Unequal pairs may introduce social desirability issues but could enhance reliability and score interpretation. A study with 1,125 psychology students compared two FC questionnaires (with and without unequally keyed pairs) and assessed reliability, validity, and ipsativity. The results showed no significant differences in reliability, validity, or ipsativity, suggesting no clear advantage for one format. The study recommends using equally keyed blocks to avoid potential validity issues due to response biases.