The process of basic training, applied training, maintaining the performance of an observer

In the field of observational methodology the observer is obviously a central figure, and close attention should be paid to the process through which he or she acquires, applies, and maintains the skills required. Basic training in how to apply the operational definitions of categories and the rules for coding, coupled with the opportunity to use the observation instrument in real-life situations, can have a positive effect in terms of the degree of agreement achieved when one evaluates intra- and inter-observer reliability. Several authors, including Arias et al. (Apunts, 4:40–45, 2009) and Medina and Delgado (Motricidad: Revista de Ciencias de la Actividad Física y del Deporte, 5:69–86, 1999), have put forward proposals for the process of basic and applied training in this context. Reid and De Master (ORI Res Bull, 12:1–13, 1972) focus on the observer's performance and how to maintain the acquired skills, arguing that periodic checks are needed after initial training because an observer may, over time, become less reliable due to the inherent complexity of category systems. The purpose of this subsequent training is to maintain acceptable levels of observer reliability. Various strategies can be used to this end, including providing feedback about those categories associated with a good reliability index, or offering re-training in how to apply those that yield lower indices. The aim of this study is to develop a performance-based index that is capable of assessing an observer's ability to produce reliable observations in conjunction with other observers.

when it comes to the use of observation instruments (Sánchez-Algarra and Anguera 2013), the reliability of which depends to a considerable extent on the skill of the observer and the training that he or she has received. Although the first studies in this field suggested that the quality of observational registers depended less on the observer's training and more on his or her personal characteristics (Sweeney and Cottle 1976), subsequent research has demonstrated the value of training for observers, as well as highlighting certain ways of improving their skills.
To our knowledge, there are no published descriptions of the knowledge and/or characteristics which observers should have (Arias et al. 2009). However, there are certain characteristics that should be taken into account as they may lead the observer to make errors, thereby reducing the reliability of the register. This observer bias can take different forms. Firstly, it may involve mechanical errors in completing the register, that is, misinterpretations of the category system, due either to a mistake on the observer's part, which is then accepted, or simply to the complexity of the category system. Secondly, there may be perceptual errors, such as those related to spatial or temporal location, to stimulus duration, to a poor choice of attentional focus, or to centering or assimilation effects. Finally, an observer's reliability may also be limited by his or her personal characteristics, in other words, by errors due to biosocial, biographical, psychosocial, situational, or expectancy effects.
For a number of reasons, observational competence in itself has not been widely studied. On occasions it has been associated with learning processes, in other words, refreshing the observer's skills through training. Alternatively, it has been regarded as being synonymous with success (Anguera et al. 1999), or related to different kinds of observational skills. Whatever the case, the study of observational competence should be based on the observer's capacity to learn and to develop his or her skills, the maintenance of which needs to be monitored over time. Therefore, the process of applying observational methodology should include the adequate selection and training of an observer. In terms of where in the process this should occur, it clearly needs to be after construction of the observation instrument (regardless of whether this is based on a category system or field format) and before the observational register is produced. Planning the prior preparation required by the observer helps to provide an overview of the whole process and reduces the likelihood of mistakes being made during its execution (Arias et al. 2009). It also helps to ensure the quality of any data that will subsequently be obtained.
From an operational point of view it is important to distinguish between the observer's basic training, his or her applied training, and the subsequent maintenance stage. The basic training will focus on understanding and learning to use the observation instrument, a stage that involves cognitive processes and the development of the skills required to apply the categories. In the next stage, applied training, the observer will perform several trials with the observation instrument in the context that is to be studied, using it to describe different behaviors (Anguera 2003; Medina and Delgado 1999; Reid and De Master 1972) so as to ensure that the registers produced are of high quality. Finally, the maintenance stage refers to monitoring of the observer's performance once a sufficient level of reliability has been achieved during the applied training stage. In what follows, each of these three stages will be described in greater detail. For the basic and applied training stages, we propose a series of objective criteria for assessing the observer's performance, similar to what would occur in a process of evaluation (Chacón et al. 2006).

Basic training
According to Anguera (2003) the basic training should enable the observer to gain detailed knowledge about the stages and key aspects of the observational process that would be followed in a given study. The process begins with basic training being given to various observers, using the same instrument and unit of observation, and making successive observations of increasing complexity. Once this initial training is complete, the registers produced by these observers are subjected to an analysis of data quality based on the index of inter-observer reliability. Medina and Delgado (1999) describe a two-stage basic training process based on the work of Heyns and Zander (1972). The first of these stages is also divided into two parts: theoretical training, in which the target behavior is presented, and the development of the observer's practical skills. In the second stage the observer produces progressively complex registers of the target behavior until an optimum index of reliability is achieved.
The basic training stage, similar to the preparatory phase of training described by Medina and Delgado (1999), involves learning about the behavior to be observed and understanding its operational definition. One way of doing this is through observation exercises involving video recordings of part of the behavior to be observed. The aim here is to learn and memorize the categories to be applied, and this can be done with the aid of visual prompts (photographs, drawings, logos, diagrams, etc.) accompanied by explanatory texts. The final step of this process would involve discussing the categories/codes so as to help the trainee observer to understand and register them.
Another approach to basic training would be to simulate reality through the use of artificial scenarios in which the observer can develop the skills required (Anguera 2010). These scenarios, and the different instruments applied in them, will vary depending on the skills to be acquired. This type of basic training will always include a feedback session in which the observer and the researcher analyze together the registers obtained, the aim being to identify strong points and those aspects that need to be improved. The sequential use of different types of simulation can form a learning cycle that ends with an evaluation of the skills acquired. Due to its effectiveness and speed, this method yields an excellent learning curve, and it also boosts the confidence of observers.

Applied training
According to Anguera (2003) the applied training stage consists in understanding the basic aspects of the observational process and how it works. Reid and De Master (1972) stress the need for training to continue until an acceptable skill level is achieved. Determining whether the skill level is acceptable requires an appropriate indicator of the observer's progress in using the observation instrument. In the case of an individual observer who performs a series of registers at different time points, this aspect can be studied by calculating the index of intra-observer reliability, taking a criterion value of 80 % (Remmert 2003).
In the applied training stage the observer performs a series of registers using the definitive category system. When determining the number and duration of training sessions required, it is important to take into account the characteristics of the observer and the complexity of the category system. This training process will be applied individually, and will be followed by an evaluation of intra-observer reliability. The process is complete when the observer has acquired the necessary level of competence. This can be tested by means of Cohen's (1960) kappa index of agreement; the specific form of kappa that is used (e.g., classic kappa, alignment kappa, or time-unit kappa with tolerance) will depend on the nature of the data (Bakeman et al. 2009). The final result must be a value that is accepted by the scientific community (e.g., 0.80).
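As an illustration of the reliability check described above, the following minimal sketch computes classic Cohen's kappa for two registers of the same session. It assumes that each register is a sequence of category codes aligned unit by unit; the function name `cohen_kappa`, the category labels, and the example data are hypothetical, and the alignment and time-unit variants of kappa are not covered.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Classic Cohen's kappa for two equally long sequences of category codes."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("registers must be non-empty and of equal length")
    n = len(codes_a)
    # Observed proportion of agreement between the two registers.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from the marginal frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical registers: the same session coded twice by one observer.
first_pass = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]
second_pass = ["A", "A", "B", "B", "C", "C", "A", "B", "C", "B"]

kappa = cohen_kappa(first_pass, second_pass)
meets_criterion = kappa >= 0.80  # criterion value taken from the text
print(f"kappa = {kappa:.3f}, criterion met: {meets_criterion}")
```

In this made-up example the two registers differ in one of ten units, which gives a kappa slightly above the 0.80 criterion.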
We will now present a quantitative approach to evaluating an observer's progress during the applied training stage. The premise of this proposal is that it is not only the outcome of applied training (i.e., κ = 0.8) that is important, but also how the observer's reliability has evolved during the process. In this regard, one would assume that the observer would become increasingly reliable, in other words, that later registers would resemble one another much more than would those produced at the start of training. In order to illustrate the proposed method, let us consider a case in which four sessions are observed (Fig. 1).
With four sessions, the index of agreement can be calculated between six pairs of registers: 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. The four registers (sessions) therefore yield six different kappa indices. This leads us directly to the problem of determining the most suitable order for checking whether the observer has achieved the desired level of competence through the training. The training process can be monitored by considering all the possible orderings, by magnitude, of the six kappas: 6! = 6·5·4·3·2·1 = 720. On this basis, the optimum order would be one in which the greatest consistency is achieved at the end of the applied training stage, which in this example corresponds to the agreement between registers 3 and 4, whereas the lowest level of agreement is that between the first and last registers. Were this to be the order observed we would have evidence that the applied training stage had progressed adequately, since it yields increasingly consistent registers. Other desirable orders share the same key features: all of them imply that the greatest consistency is achieved between the final two registers, and also that the subsequent kappa values in the sequence correspond to contiguous registers (i.e., the agreement between registers 2 and 3, and that between registers 1 and 2). This first block contains the four orders that can be considered optimum. The probability of obtaining one of these four orders by chance is 4/720 = 0.0056 (Fig. 2).
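The counting argument can be checked with a short sketch that enumerates the six pairs of registers and the 720 possible orders of the kappa values. The rule coded in `is_optimum` is an assumption, not a rule taken from the text: it is one reading of the verbal description (3-4 highest, then the contiguous pairs 2-3 and 1-2 in either order, then 2-4 and 1-3 in either order, with 1-4 lowest) that happens to reproduce the stated count of four optimum orders.

```python
from itertools import combinations, permutations
from math import factorial

sessions = [1, 2, 3, 4]
# Pairs of registers for which kappa is computed: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.
pairs = list(combinations(sessions, 2))
# Number of possible orders (highest to lowest) of the six kappa values.
n_orders = factorial(len(pairs))


def is_optimum(order):
    """ASSUMED rule: one reading of the description that yields four orders."""
    return (
        order[0] == (3, 4)
        and set(order[1:3]) == {(2, 3), (1, 2)}
        and set(order[3:5]) == {(2, 4), (1, 3)}
        and order[5] == (1, 4)
    )


optimum_orders = [o for o in permutations(pairs) if is_optimum(o)]
p_optimum = len(optimum_orders) / n_orders

print(len(pairs), n_orders, len(optimum_orders), round(p_optimum, 4))
```

Under this assumed rule the enumeration gives 6 pairs, 720 orders, and 4 optimum orders, so the probability of an optimum order arising by chance is 4/720.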
Among the 720 possible orders there are others which can be regarded as indicative of an adequate, although not optimum, training process. This second block comprises 32 "appropriate orders". In all of them the greatest agreement is that between the final two registers, followed by contiguous registers (either 2 and 3 or 1 and 2). Adding the 4 optimum orders to the 32 appropriate orders yields a total of 36 orders of kappa values, and the probability of obtaining one of these orders by chance is 36/720 = 0.05. This probability value corresponds to the nominal significance level that is commonly used in statistical tests, and it can therefore be used in the same way as a p value. In fact, it would be a p value obtained in a similar way to that which would be obtained through a permutation test (Manly 2007), although it is necessary to highlight an important difference. In permutation tests the sampling distribution is based on the statistics (in ascending order of magnitude) obtained for each of the possible permutations; each permutation would correspond to one of the orders. By contrast, the present proposal does not use a statistical test, but rather decides in advance which orders are the most appropriate, from among all those that could be obtained by chance. These orders can be classified according to their degree of adequacy, as we have done in the above example. As was shown, it is possible to calculate the probability of obtaining the appropriate orders of kappa values. Therefore, a small p value would be used as evidence that the applied training process is adequate, it being highly unlikely that such a value would be obtained by chance (i.e., were gradual learning not to be taking place). In summary, the proposed method for evaluating the applied training stage involves complementing the reliability criterion obtained at the end of the stage (i.e., κ = 0.80) with a study of how the process has evolved. In this way, one examines whether the degree of similarity between the different registers follows a logical order, one that would be unlikely to be observed were the observer not performing progressively better over time.
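The procedure can be sketched as follows. The kappa values are hypothetical, and so are the helper names `rank_order` and `order_p_value`. The two checks shown are only the features that the text attributes to every acceptable order; they are necessary conditions and do not by themselves define the 36 acceptable orders. Ties between kappa values are not handled.

```python
from math import factorial


def rank_order(kappas):
    """Return the register pairs sorted from highest to lowest kappa."""
    return tuple(sorted(kappas, key=kappas.get, reverse=True))


def order_p_value(n_acceptable, n_kappas):
    """Probability of obtaining one of the acceptable orders by chance."""
    return n_acceptable / factorial(n_kappas)


# Hypothetical kappas for the six pairs of registers of one trainee.
observed = {
    (1, 2): 0.70, (1, 3): 0.62, (1, 4): 0.55,
    (2, 3): 0.76, (2, 4): 0.66, (3, 4): 0.84,
}

order = rank_order(observed)
# Features shared by all acceptable orders according to the text.
final_pair_highest = order[0] == (3, 4)
second_is_contiguous = order[1] in {(2, 3), (1, 2)}
# 4 optimum + 32 appropriate orders out of 6! = 720 possible orders.
p_value = order_p_value(4 + 32, len(observed))

print(order, final_pair_highest, second_is_contiguous, p_value)
```

To decide whether an observed order actually belongs to the acceptable set, the full list of 36 orders would have to be supplied; the sketch only reports the chance probability attached to that set.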

Maintenance
The final stage described by Reid and De Master (1972) refers to the maintenance of performance, which includes improving the observation instrument by providing new definitions for those categories that did not show good reliability. The present proposal, by contrast, defines maintenance as the situation whereby the degree of reliability achieved by the end of the applied training stage is broadly maintained and only varies within acceptable limits.
In this section we propose an objective and quantitative rule for evaluating whether the maintenance stage is evolving as expected. The aim is to determine whether the observer continues to achieve an adequate index of reliability, in other words, that kappa values remain close to the one achieved at the end of the applied training stage, it being assumed that the final kappa value obtained in that stage (κ34 in the above example) was considered appropriate, as otherwise the applied training stage would not have been complete.
In order to verify numerically whether the kappa values remain within acceptable limits we need a rule that can detect when a given value deviates too much from the value achieved during training (e.g., κ34), such that we can quantify the variability. Before describing the proposed method in more detail, let us consider some of the procedures most commonly used to assess variability. We will also discuss the criteria on which our proposal is based, as well as the context in which these criteria were developed.
In contrast to the applied training stage, where the aim is to achieve increasingly greater reliability, the objective in the maintenance stage is that the kappa index that was achieved and regarded as sufficient during training does not change. In other words, the kappa values should vary only minimally from the final value achieved by the end of applied training. One of the classical tools for evaluating statistically whether a series of data presents excessive variability is statistical process control (also known as Shewhart charts), which was originally applied in the context of management and organizational research (Mawhinney 1992). Statistical process control is also considered to provide an objective basis for decision making in studies that take longitudinal measures of an individual (Callahan and Barisa 2005; Hantula 1995). This technique combines graphical representation of the data with quantitative criteria, based on two of the main descriptive statistics: the mean and the standard deviation. Specifically, limits are established above and below the mean, corresponding to one, two, or three standard deviations. The decision as to how many values must fall outside these limits for the variability to be considered excessive rests on knowing what proportion of values lies within a given number of standard deviations either side of the mean in the normal distribution (e.g., the interval μ ± 2σ contains 95.45 % of values). Consequently, it must be assumed that the variable of interest is normally distributed. A further point is that the aim of statistical process control is to detect large deviations and, therefore, its use is not appropriate in a context where the aim is to achieve minimum variability in the kappa values obtained by an observer.
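For illustration only, the limits used in this kind of chart can be sketched as the mean plus or minus k standard deviations. This is a simplified version that assumes the sample standard deviation is used for the limits; the function names and the series of kappa values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(values, k=3):
    """Lower and upper limits at k sample standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)
    return centre - k * spread, centre + k * spread


def out_of_control(values, k=3):
    """Values that fall outside the mean +/- k standard deviation limits."""
    low, high = control_limits(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical series of kappa values with one clearly lower value.
kappas = [0.82, 0.84, 0.80, 0.83, 0.81, 0.60, 0.82, 0.83]

print(out_of_control(kappas, k=2))  # flagged with the 2-SD limits
print(out_of_control(kappas, k=3))  # nothing flagged with the 3-SD limits
```

The example shows the point made in the text: because the limits are computed from the data themselves, a single low value widens them, and only a large deviation is flagged.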
An alternative approach, which also relies on certain assumptions, is exploratory data analysis (Tukey 1977). One of the functions of exploratory data analysis is to identify individual values that are far removed from the rest (i.e., anomalous values). These values are usually identified by means of the rule Md ± 1.5 IQR, where Md represents the median and IQR is the interquartile range, this being the basis of the box plot in exploratory data analysis. In other words, instead of using indices based on moments (mean and standard deviation), this approach uses resistant indicators based on the ordering of data (i.e., the quartiles). In the case that concerns us here, however, there are problems with using quartiles and standard deviations, since a certain amount of data is required to obtain both these indicators. For example, obtaining the first, second, and third quartiles would not be very informative when only three measures are available (i.e., three kappa values). The utility of exploratory data analysis therefore depends on the situation to which it will be applied, and in the case of intra-observer reliability values obtained during the maintenance stage, one would not expect to have enough measures to justify this approach.
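The small-sample problem can be seen in a brief sketch of the Md ± 1.5 IQR rule as stated above. It assumes Python's `statistics.quantiles` with its default method for estimating the quartiles (other quartile definitions give different values); the helper name `anomaly_limits` and the three kappa values are hypothetical.

```python
from statistics import median, quantiles


def anomaly_limits(values):
    """Limits Md - 1.5*IQR and Md + 1.5*IQR, following the rule in the text."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default estimation method
    iqr = q3 - q1
    md = median(values)
    return md - 1.5 * iqr, md + 1.5 * iqr


# Hypothetical maintenance-stage kappas: only three measures are available.
three_kappas = [0.78, 0.82, 0.85]

quartiles = quantiles(three_kappas, n=4)
limits = anomaly_limits(three_kappas)
print(quartiles, limits)
```

With three measures the estimated quartiles simply coincide with the three observed values, so the resulting limits carry very little information about what counts as an anomalous kappa.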
Given the likelihood of a small number of kappa values, we require a measure based on an absolute criterion that does not depend on the data. In the context of the present proposal, one such criterion is the stability envelope (Gast and Spriggs 2009), which is used in the visual analysis of measures obtained over time. The original idea here involves establishing two lines either side of the median, each at a distance of 0.1 Md. Therefore, the range of values that indicates stability is defined as Md ± 0.1 Md. It can be seen that this criterion does not depend on the variability that is present in the data, since it is fixed prior to the data being obtained. Note too that although the stability envelope was originally applied in the field of single case designs, which usually have a clinical or educational purpose, the kappa values obtained in relation to an observer's registers are consistent with the idea of longitudinal measures of a single subject.
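A minimal sketch of the stability envelope, assuming the Md ± 0.1 Md definition given above, is shown below. The function names and the data are hypothetical.

```python
from statistics import median


def stability_envelope(values, proportion=0.10):
    """Lower and upper lines at a fixed proportion of the median (Md +/- 0.1 Md)."""
    md = median(values)
    return md - proportion * md, md + proportion * md


def share_within(values, limits):
    """Proportion of values that fall inside the envelope (limits included)."""
    low, high = limits
    return sum(low <= v <= high for v in values) / len(values)


# Hypothetical longitudinal measures for a single observer.
measures = [0.80, 0.84, 0.79, 0.86, 0.70]

envelope = stability_envelope(measures)
print(envelope, share_within(measures, envelope))
```

The width of the envelope is a fixed fraction of the median and is not computed from the spread of the data, which is what makes the criterion usable with a handful of values.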
Finally, before setting out our proposal, mention should also be made of another criterion that is similar in form to the stability envelope, but which comes from a different area of application. In simulation studies designed to test whether a statistical technique (e.g., the t-test) is robust in the event of its assumptions being violated (e.g., homogeneity of variances, normality), one usually compares the empirical and nominal rates of Type I error. Thus, having set the nominal significance level (α), the test's robustness will be demonstrated if the rate with which a correct null hypothesis is rejected falls within the interval α ± 0.1α. This rule is known as Bradley's (1978) stringent criterion. In this case we are not dealing with longitudinal measures, but what is used is still a relative criterion of variation around the reference value, namely 10 % above and below.
Our proposed way of establishing an objective criterion for determining whether or not the maintenance stage has evolved correctly is based on the latter two ideas, even though their original sphere of application was not observational methodology. However, instead of the point of reference that is used to construct the stability envelope (i.e., the median) or the interval that corresponds to robustness (i.e., nominal alpha), the present proposal takes as its reference point the last kappa value obtained during the applied training stage (e.g., κ34). It is also important to note another difference with respect to the stability envelope. As originally conceptualized, this latter criterion requires that 80 % of the data fall within the envelope. Having defined the range of acceptable values, it is then necessary to compare each kappa value obtained during the applied training stage with the upper and lower limits. The reference kappa value at the start of the maintenance stage will be the final value obtained during the applied stage. When deviation occurs, its direction must be established, since only in the case of values that fall below the lower limit would the observer require further applied training. Conversely, if the deviation is beyond the upper limit, this indicates an increase in the kappa value and, therefore, an improvement in data quality.
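The decision rule can be sketched as follows. The sketch assumes that the range of acceptable values is the reference kappa plus or minus 10 % of that reference, by analogy with the stability envelope and Bradley's criterion; the exact limits, the function names, and the kappa values used here are assumptions for illustration.

```python
def acceptable_range(reference_kappa, tolerance=0.10):
    """ASSUMED range: reference kappa +/- 10 % of the reference value."""
    margin = tolerance * reference_kappa
    return reference_kappa - margin, reference_kappa + margin


def classify(kappa, reference_kappa, tolerance=0.10):
    """Label a kappa value as 'below', 'within', or 'above' the range."""
    low, high = acceptable_range(reference_kappa, tolerance)
    if kappa < low:
        return "below"  # reliability has dropped: further applied training
    if kappa > high:
        return "above"  # reliability has increased: better data quality
    return "within"


# Hypothetical values: final training kappa and later maintenance checks.
kappa_34 = 0.84
maintenance_kappas = [0.82, 0.86, 0.74, 0.93]

labels = [classify(k, kappa_34) for k in maintenance_kappas]
needs_retraining = "below" in labels
print(labels, needs_retraining)
```

Only the direction of a deviation determines the action taken: a value below the lower limit signals a need for further applied training, whereas a value above the upper limit is read as an improvement.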
Finally, it should be emphasized that both kinds of criteria, namely those based on distances in the form of standard deviations and those on which our proposal is founded, are arbitrary. However, we believe that having a conventionally accepted rule is better than relying on subjective judgment, even if the question of which rule is the most appropriate is one that requires continued debate.
In conclusion, it should be highlighted that the processes of learning, training, and maintenance are an essential part of observational recording, given that they help to increase external control and data efficacy when using observational methodology.
Training observers according to this procedure would guarantee the stability and constancy of the recordings throughout the research.
Finally, as regards inter-observer (or between observers) reliability as a means of gaining evidence on the quality of the recordings, if the observers have followed the process described here, the concordance between them is likely to be higher (Fig. 3).