Carrying out testing has a purpose. However, seeking to measure something in line with this purpose is not a simple matter.
Tests can serve many goals: measuring a person's level of ability or learning progress, determining their relative position within a group of examinees, or analyzing trends across a group. If a test isn't designed correctly for its purpose, accurate measurement isn't possible.
At JIEM, we think carefully about how to measure human ability correctly, and we research, develop, and apply testing techniques to achieve that.
Besides the effects of the population, it's also necessary to consider the difficulty of the test itself. It would be strange if a test's difficulty fluctuated from one administration to the next while the pass mark remained fixed: one year's test might be difficult and produce very few successful candidates, while another year's might contain many relatively easy questions and produce a large number. The result would be pass rates that vary from year to year for reasons unrelated to the examinees' ability.
In testing, reliability means that a test produces consistent results: if the same person repeatedly takes the test under the same conditions, the result will always be the same.
Also, however reliable a test may be, it's meaningless if it measures something other than the targeted ability. The validity of a test is the degree to which the test correctly measures the ability it seeks to assess.
When creating a test, it’s necessary to consider whether it measures the same thing however many times it’s taken, irrespective of the population or test items, and whether the test really measures the targeted ability.
Varying the questions in every test is intended to prevent candidates from being able to pass by studying for those particular questions alone. But isn't it strange that, even though the questions differ from test to test, the results are still expressed as comparable scores?
In fact, standards can be established that make it possible to measure values correctly every time by eliminating the effects of differing test questions. This is called test equating.
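One simple form of equating can be sketched as follows. This is a minimal illustration of linear (mean-sigma) equating, which maps scores from one test form onto the scale of another so that the two forms have matching means and standard deviations; the function name, the two score lists, and the assumption of equivalent examinee groups are all illustrative, not a description of any specific test's method.

```python
# A minimal sketch of linear (mean-sigma) test equating, assuming two
# test forms X and Y taken by equivalent groups. All data are made up.
from statistics import mean, stdev

def equate_linear(score_x, scores_x, scores_y):
    """Map a score on form X onto the scale of form Y so that the two
    forms end up with matching means and standard deviations."""
    mx, sx = mean(scores_x), stdev(scores_x)
    my, sy = mean(scores_y), stdev(scores_y)
    return sy / sx * (score_x - mx) + my

# Form X was harder: its raw scores run lower on average.
scores_x = [40, 45, 50, 55, 60]
scores_y = [50, 55, 60, 65, 70]
print(equate_linear(50, scores_x, scores_y))  # 60.0 -- 50 on X equates to 60 on Y
```

After equating, a 50 on the harder form X and a 60 on the easier form Y represent the same standing, so scores from the two forms can be compared directly.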
In classical test theory (true scores and deviation scores), the effects of the examinee population and of test difficulty are inescapable, because a test score mixes two elements: examinee ability and test difficulty. In item response theory, test difficulty and examinee ability are modeled separately. The probability that a person of a certain ability answers a question of a certain difficulty correctly is X%, and from this, examinee ability is estimated probabilistically.
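The "X% probability" idea can be made concrete with the simplest item response model, the Rasch (one-parameter logistic) model. The sketch below is illustrative only: the ability and difficulty values are assumed, and real tests may use richer models with more parameters.

```python
# A minimal sketch of an item response function under the Rasch
# (one-parameter logistic) model: the probability that an examinee of
# ability theta answers an item of difficulty b correctly.
import math

def p_correct(theta, b):
    """Rasch model: P(correct | theta, b) = 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability exactly matches difficulty, the probability is 50%.
print(p_correct(0.0, 0.0))  # 0.5
# A stronger examinee on the same item has a higher chance.
print(p_correct(1.0, 0.0))  # ~0.73
```

Because ability and difficulty enter the formula separately, an examinee's ability estimate no longer depends on which particular items happened to appear on the test.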
Since this requires extremely advanced statistics, it has still not spread widely in Japanese education, but overseas it's used extensively as a reliable method of measurement for TOEFL® and other tests.
Once a standard is established whereby a person of a certain ability answers a question of a certain difficulty with a certain probability, it becomes possible to adapt the questions to the situation, much as in an eye test: if the examinee misses a question, a slightly easier one is offered next; if the examinee answers correctly, a slightly harder one follows.
Using a computer, the next question is varied according to the examinee's response. This computerized adaptive testing (CAT) enables highly accurate measurement of an examinee's ability in a short time.
Paper tests that were simply computerized (true score and deviation score tests) were the first generation. Adaptive testing based on item response theory was the second generation. Research is now underway on third generation testing, called continuous measurement, for children's English tests and other applications. Whereas earlier testing was confined to measurement, this approach actively supports learning.
The items that must be acquired in a given unit are encoded in a mastery map on which testing is based. The test identifies areas that haven’t been completely learned, and through retesting or self-study, this approach aims to link measurement directly to efficient learning.
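The mastery-map idea can be sketched minimally as a mapping from skills to item-level results: skills that fall below a mastery threshold become the targets for retesting or self-study. The skill names, response data, and the "2 of 3 items correct" rule are hypothetical, chosen only to illustrate the data structure.

```python
# A hypothetical sketch of a mastery map: each skill in a unit is
# marked as mastered or not from item responses, and the unmastered
# skills become the focus of retesting or self-study. Skill names and
# the mastery rule (>= 2 of 3 items correct) are illustrative.
responses = {
    "past tense":     [1, 1, 0],   # item-level results per skill
    "plural nouns":   [0, 0, 1],
    "question forms": [1, 1, 1],
}

def mastery_map(responses, threshold=2):
    """Mark each skill mastered if enough of its items were answered correctly."""
    return {skill: sum(items) >= threshold for skill, items in responses.items()}

gaps = [skill for skill, ok in mastery_map(responses).items() if not ok]
print(gaps)  # ['plural nouns']
```

The learner then concentrates on the listed gaps rather than re-studying the whole unit, which is what links measurement directly to efficient learning.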
JIEM is currently developing fourth generation AI technology for CASEC-GTS, CASEC-WT, and other tests.
There was a time when selection and relative evaluation through "examination hell" and academic competition were the norm. Tests are of course useful for selection and acceptance decisions, but there's a growing demand for tests that support individual ability development and classroom improvement.
Such a test is a means of correctly assessing proficiency so that each individual examinee can grow.
The day isn't far off when everybody realizes that getting poor marks on a test is neither bad nor shameful; instead, it's a positive means of correctly identifying your weak points.