Use of morphometric characters of a fish species to predict its location: a statistical approach

Precise taxonomic identification is the preliminary requirement in the study of an organism or specimen. Correct identification, however, gives only the identity of the specimen. The value of a correctly identified specimen as study material is low when the habitat or location of its collection is unknown. Knowing the exact place of collection makes it possible to gather information on the distribution of the organism and the environmental conditions that it encounters, and to describe the variation found in its morphological and genetic features. The present study therefore aimed to develop a statistical rule to predict the place of collection (the river, which is unknown) of a given Puntius dorsalis (a freshwater fish species) specimen using its morphometric characters. Fifty-two individuals were collected from four major rivers (Mahaweli, Kelani, Kalu, Nilwala) in Sri Lanka, and 23 morphometric characters were measured on each specimen. The individuals were categorized into four groups according to the river from which they were collected. The measured morphometric characters were used as independent variables of the model to predict the unknown group membership (river) of a given Puntius dorsalis specimen. Under re-substitution, 82.7% of the Puntius dorsalis specimens were correctly classified with respect to the place of collection (river) using their posterior probabilities. Under cross-validation, the procedure had a hit ratio of 69.2%, which indicates how well it generalizes as a tool to classify a fresh Puntius dorsalis specimen of unknown group membership. It was also found that linear classification functions can be used to predict the unknown place of collection of a fish. The paper concludes with suggestions to move to nonparametric approaches such as Classification and Regression Trees (CART) and Neural Networks.


Introduction
There are two aspects of discriminant analysis: Predictive Discriminant Analysis (PDA) and Descriptive Discriminant Analysis (DDA). The differences between these two analyses are not well understood by most researchers [10]. Predictive discriminant analysis focuses on classifying subjects into one of several groups (that is, on predicting group membership), whereas in descriptive discriminant analysis the focus is on revealing major differences among the groups. Hence, PDA is appropriate when the researcher is interested in assigning units (individuals) to groups based on composite scores on several predictor variables [28,25,11,15]. The accuracy of such prediction can be assessed by comparing "hit rates" against chance [12].
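To make the comparison of hit rates against chance concrete, the following Python sketch computes two common chance baselines from the group sizes reported in this study (10, 8, 17 and 17 specimens). The maximum chance and proportional chance criteria used here are standard baselines and are our assumption; the source does not state which chance criterion was applied.

```python
# Group sizes reported in the data collection section.
group_sizes = {"Nilwala": 10, "Kalu": 8, "Kelani": 17, "Mahaweli": 17}
n_total = sum(group_sizes.values())

# Maximum chance criterion: always guess the largest group.
max_chance = max(group_sizes.values()) / n_total

# Proportional chance criterion: guess each group in proportion to its size.
prop_chance = sum((n / n_total) ** 2 for n in group_sizes.values())

print(f"maximum chance      = {max_chance:.3f}")
print(f"proportional chance = {prop_chance:.3f}")
```

Both baselines lie well below the hit rates of 82.7% (re-substitution) and 69.2% (cross-validation) reported in this paper.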
Discriminant functions (DFs) are linear combinations of independent variables [15]. The first DF is the one that maximally separates the groups, and the second DF, orthogonal to the first, maximally separates the groups on variance not yet explained by the first DF [20,30]. Classification functions (CFs) are also linear combinations, but their coefficients differ from the coefficients of the DFs [15]. In fact, there will be k classification functions and s = min(p, k − 1) discriminant functions, where k is the number of groups and p is the number of variables. DFs are used for both the PDA and DDA aspects [1], and CFs are used for PDA. In many cases we do not need all DFs to describe group differences effectively (the DDA aspect), whereas all k classification functions must be used in assigning observations to groups [25,30].
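As a small illustration of these counts, the sketch below evaluates k and s = min(p, k − 1). The function name is hypothetical, and the values k = 4 and p = 7 are assumptions taken from the four rivers and the seven characters retained by the stepwise analysis in this study.

```python
def count_functions(n_groups, n_variables):
    """Return the number of classification and discriminant functions."""
    return {
        "classification": n_groups,
        "discriminant": min(n_variables, n_groups - 1),
    }

# Assumed values: four rivers (groups), seven retained morphometric characters.
counts = count_functions(n_groups=4, n_variables=7)
print(counts)
```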
In biological experiments and research, knowledge of the place of collection of an organism is one of the preliminary requirements for studying that organism. The place of collection can provide information on the habitat, environmental conditions, adaptations, and morphological and genetic variability of the specimen. In addition, the place of collection is important for taxonomic studies.
In museum collections there are many specimens whose place of collection is unknown, which lowers their value as biological specimens.
Morphometric characters are linear measurements of a fish, and are known to vary with factors such as river, altitude range and the environmental conditions of the habitat [19,16,21,7,9]. Puntius dorsalis is a fish species commonly found in the freshwater bodies of Sri Lanka. Puntius dorsalis has also shown distinct morphometric heterogeneity with respect to rivers, altitude and environmental factors [8]. This morphological variability enabled us to use Puntius dorsalis as a test model for the present study. The objective of the present study is to develop a statistical rule to predict the place of collection (the river, which is unknown) of a Puntius dorsalis specimen using its morphometric characters.

Data collection
A sample of 52 Puntius dorsalis specimens was collected from seventeen sites in the basins of four major rivers of Sri Lanka, Nilwala, Kalu, Kelani and Mahaweli, with 10, 8, 17 and 17 specimens representing those rivers respectively (Table 1). Fish were caught using gape nets, cast nets and scoop nets. Twenty-three morphometric characters were recorded from each specimen (Figure 1).
Linear measurements were made using vernier calipers to the nearest 0.01 millimeter. All morphometric variables were standardized to remove the effect of individual size: eye diameter (ED) and post orbital length (POL) were divided by head length (HL), and all other variables shown in Figure 1 were divided by standard length (SL). "Log" and "Arcsine square-root" transformations were applied to skewed and proportional morphometric characters respectively [4,22].
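The size standardization and the two transformations can be sketched in Python as follows. This is a minimal illustration, not the procedure actually run in SPSS: the function names and the measurement values are hypothetical, and the use of the natural logarithm is an assumption, since the base of the "Log" transformation is not stated.

```python
import math

def size_standardize(measurements, sl, hl, head_based=("ED", "POL")):
    """Divide head characters by head length (HL), all others by standard length (SL)."""
    return {
        name: value / (hl if name in head_based else sl)
        for name, value in measurements.items()
    }

def log_transform(x):
    """Log transformation for skewed characters (natural log assumed)."""
    return math.log(x)

def arcsine_sqrt(p):
    """Arcsine square-root transformation for proportions in [0, 1]."""
    return math.asin(math.sqrt(p))

# Hypothetical measurements in millimeters for one specimen.
specimen = {"ED": 5.0, "POL": 8.0, "CFL": 20.0}
ratios = size_standardize(specimen, sl=80.0, hl=20.0)
transformed = {name: arcsine_sqrt(r) for name, r in ratios.items()}
print(ratios)
print(transformed)
```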
Stepwise Discriminant Analysis was used to find the variables that made significant independent and combined contributions [13,29]. These were the caudal fin length (CFL), standard length (SL), head length (HL), length of the caudal peduncle (LCPD), anal fin length (AFL), preventral length (PVL) and eye diameter (ED).

Data analysis
The aim of the study was to build a statistical rule to predict the place of collection (the river, which is unknown) of a given Puntius dorsalis specimen. That is, a given Puntius dorsalis specimen whose group membership (river) is unknown is to be classified into one of four groups formed on the basis of place of collection. Unknown group membership can be predicted using posterior probability [5,2], CFs [24], CART [3], Neural networks [17,20], etc. The data were processed using the SPSS statistical software package [13,22], which classifies subjects into predicted groups using posterior probability.
Bayes' rule is used in the posterior probability method.
Table 1. River, name of the location and altitude range of P. dorsalis collected.

Bayes' rule
If there are k groups, the Bayes' rule is to assign the object to group $G_i$, where
$$p(G_i \mid D) > p(G_j \mid D) \quad \text{for all } j \neq i, \qquad i, j = 1, 2, \ldots, k. \tag{1}$$
We want to know the probability $p(G_i \mid D)$ that an object belongs to group $G_i$, given the values $D$ on each of the DFs. The subject is then classified (predicted) to be in the group with the highest posterior probability [13]. There is a relationship between the two conditional probabilities, well known as Bayes' Theorem:
$$p(G_i \mid D) = \frac{p(D \mid G_i)\, p(G_i)}{\sum_{j=1}^{k} p(D \mid G_j)\, p(G_j)}. \tag{2}$$
The prior probability $p(G_i)$ is an estimate of the probability that an object belongs to a particular group when no information about it is available. However, it is hard to use the Bayes' rule directly, because $p(D \mid G_i)$ is difficult to calculate [2].
If we assume that all groups have the same covariance matrix, then we get another classification technique which is called Fisher's Linear Classification [25,11].
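The posterior probability calculation behind the Bayes' rule can be sketched in a few lines of Python. The likelihood values p(D|G_i) below are hypothetical, and taking the priors proportional to the group sizes is an assumption made only for illustration; the source does not state which priors were used.

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' theorem: p(G_i|D) = p(D|G_i) p(G_i) / sum_j p(D|G_j) p(G_j)."""
    joint = [p * f for p, f in zip(priors, likelihoods)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Assumed priors proportional to the group sizes (Nilwala, Kalu, Kelani, Mahaweli).
priors = [10 / 52, 8 / 52, 17 / 52, 17 / 52]
# Hypothetical values of p(D|G_i) for one specimen.
likelihoods = [0.02, 0.01, 0.08, 0.10]

post = posterior_probabilities(priors, likelihoods)
# Bayes' rule: assign to the group with the highest posterior (0-based index).
predicted = max(range(len(post)), key=post.__getitem__)
print(post, predicted)
```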

Multivariate test of significance
Box's M test [22] was used to evaluate the homogeneity of the covariance matrices, and the test statistic is not significant at the 0.01 significance level (Table 2).

Linear classification analysis
We use samples from each of the k groups to find the sample mean vectors $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k$. As a univariate approach cannot address any joint effect (interactions) of variables, each individual was considered to be a single multivariate observation in the analysis [32,6]. For a specimen whose group membership is unknown, one approach is to use a distance function [12] to find the mean vector that $y$ (the vector of independent variables measured for a fresh Puntius dorsalis specimen) is closest to, and to assign the specimen to the corresponding group. We can estimate the common population covariance matrix [25] by a pooled sample covariance matrix,
$$S_{pl} = \frac{1}{N - k} \sum_{i=1}^{k} (n_i - 1) S_i = \frac{E}{N - k}, \tag{3}$$
where $n_i$ and $S_i$ are the sample size and covariance matrix of the $i$th group, $E$ is the error matrix from one-way MANOVA, and $N = \sum_i n_i$. We compare $y$ to each $\bar{y}_i$, $i = 1, 2, \ldots, k$, by the distance function
$$D_i^2(y) = (y - \bar{y}_i)' S_{pl}^{-1} (y - \bar{y}_i), \tag{4}$$
and assign $y$ to the group for which $D_i^2(y)$ is smallest. We can obtain a simplified form as a linear classification rule by expanding (4),
$$D_i^2(y) = y' S_{pl}^{-1} y - 2 \bar{y}_i' S_{pl}^{-1} y + \bar{y}_i' S_{pl}^{-1} \bar{y}_i. \tag{5}$$
The first term $y' S_{pl}^{-1} y$ can be neglected because it does not change from group to group. The second term is a linear function of $y$, and the third does not involve $y$. So, by deleting the first term and multiplying by $-\tfrac{1}{2}$, we get the linear classification function, denoted by $L_i(y)$,
$$L_i(y) = \bar{y}_i' S_{pl}^{-1} y - \tfrac{1}{2} \bar{y}_i' S_{pl}^{-1} \bar{y}_i. \tag{6}$$
Assign $y$ to the group for which $L_i(y)$ is a maximum [15] (multiplying by $-\tfrac{1}{2}$ reverses the sign, so the smallest distance corresponds to the largest $L_i(y)$).
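The pooled covariance matrix and the distance-based assignment described above can be sketched with NumPy as follows. The two small groups and the observation y are made-up numbers chosen so that the result is easy to check by hand; the function names are ours and are not taken from SPSS.

```python
import numpy as np

def pooled_covariance(groups):
    """S_pl = sum_i (n_i - 1) S_i / (N - k) for a list of (n_i x p) arrays."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # (n_i - 1) S_i summed over groups is the error matrix E of one-way MANOVA.
    error = sum((len(g) - 1) * np.cov(g, rowvar=False) for g in groups)
    return error / (n_total - k)

def squared_distances(y, means, s_pl):
    """D_i^2(y) = (y - ybar_i)' S_pl^{-1} (y - ybar_i) for every group mean."""
    s_inv = np.linalg.inv(s_pl)
    return np.array([(y - m) @ s_inv @ (y - m) for m in means])

# Hypothetical data: two groups, two variables.
g1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g2 = g1 + 4.0
groups = [g1, g2]

means = [g.mean(axis=0) for g in groups]
s_pl = pooled_covariance(groups)
y = np.array([1.0, 2.0])
d2 = squared_distances(y, means, s_pl)
assigned = int(np.argmin(d2))  # 0-based index of the closest group
print(s_pl, d2, assigned)
```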
To highlight the linearity of (6) as a function of $y$, we can express it as
$$L_i(y) = c_i' y + c_{i0}, \tag{7}$$
where $c_i' = \bar{y}_i' S_{pl}^{-1}$ and $c_{i0} = -\tfrac{1}{2} \bar{y}_i' S_{pl}^{-1} \bar{y}_i$. To assign $y$ to a group using this procedure, we calculate $c_i$ and $c_{i0}$ for each of the k groups, evaluate $L_i(y)$, $i = 1, 2, \ldots, k$, and allocate $y$ to the group for which $L_i(y)$ is largest. This will be the same group for which $D_i^2(y)$ in (4) is smallest, that is, the group whose mean vector $\bar{y}_i$ is closest to $y$. In order to use the prior probabilities, the density functions for the two populations, $f(y \mid G_1)$ and $f(y \mid G_2)$, must also be known. Then the optimal classification rule [31] that minimizes the probability of misclassification is: Assign $y$ to $G_1$ if
$$p_1 f(y \mid G_1) > p_2 f(y \mid G_2), \tag{8}$$
and to $G_2$ otherwise, where $p_1$ and $p_2$ are the prior probabilities of the two groups. Note that $f(y \mid G_1)$ is a convenient notation for the density when sampling from the population represented by $G_1$. It does not represent a conditional distribution in the usual sense.
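The coefficient form of the linear classification function can be sketched with NumPy as below. The group means, the pooled covariance matrix and the observation y are hypothetical values, assumed to have been estimated beforehand; the function name is ours.

```python
import numpy as np

def linear_classification_coefficients(means, s_pl):
    """Return (c, c0): row i of c is ybar_i' S_pl^{-1}, c0[i] = -0.5 * ybar_i' S_pl^{-1} ybar_i."""
    s_inv = np.linalg.inv(s_pl)
    c = means @ s_inv
    c0 = -0.5 * np.einsum("ij,ij->i", c, means)
    return c, c0

# Hypothetical group means (one row per group) and pooled covariance matrix.
means = np.array([[0.5, 0.5], [4.5, 4.5]])
s_pl = np.eye(2) / 3

c, c0 = linear_classification_coefficients(means, s_pl)
y = np.array([1.0, 2.0])
scores = c @ y + c0               # L_i(y) for each group
assigned = int(np.argmax(scores))  # group with the largest L_i(y), 0-based

# The same group has the smallest squared distance to y.
s_inv = np.linalg.inv(s_pl)
d2 = np.array([(y - m) @ s_inv @ (y - m) for m in means])
print(scores, assigned, int(np.argmin(d2)))
```

Because L_i(y) differs from −D_i²(y)/2 only by a term that is the same for every group, the two rules always select the same group.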
For the case of several groups, the optimal rule in (8) extends to: Assign $y$ to the group for which $p_i f(y \mid G_i)$ is maximum. (9)
The probability of misclassification is minimized with this rule. If we assume normality with equal covariance matrices and with prior probabilities of group membership $p_1, p_2, \ldots, p_k$, then $f(y \mid G_i) = N_p(\mu_i, \Sigma)$, and the rule in (9) becomes (with estimates in place of parameters) [24]: Calculate
$$L_i(y) = \ln p_i + \bar{y}_i' S_{pl}^{-1} y - \tfrac{1}{2} \bar{y}_i' S_{pl}^{-1} \bar{y}_i, \qquad i = 1, 2, \ldots, k, \tag{10}$$
and assign $y$ to the group with the maximum value of $L_i(y)$. If $p_1 = p_2 = \ldots = p_k$, then (10), which optimizes the classification rate for the normal distribution, reduces to (6), which was based on the heuristic approach of minimizing the distance of $y$ to $\bar{y}_i$.
Table 4. Classification result for analysis and cross validated samples.
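The effect of the prior probabilities on the linear classification rule can be illustrated with a short Python sketch: the log prior is simply added to each equal-prior score. The scores and priors below are hypothetical and chosen so that unequal priors change the assignment.

```python
import math

def add_log_priors(equal_prior_scores, priors):
    """Add ln p_i to each equal-prior linear classification score."""
    return [s + math.log(p) for s, p in zip(equal_prior_scores, priors)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical equal-prior scores for two groups and two choices of priors.
scores = [3.75, 3.10]
equal = add_log_priors(scores, [0.5, 0.5])
unequal = add_log_priors(scores, [0.2, 0.8])

print(argmax(scores), argmax(equal), argmax(unequal))
```

With equal priors the same constant is added to every score, so the assignment is unchanged; with unequal priors a close decision can move to the group that is more probable a priori.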

Results and Discussion
The SPSS statistical software package uses the posterior probability method for classification. It calculates the posterior probabilities of a subject being in each of the four groups, and the subject is then classified (predicted) to be in the group with the highest posterior probability (Table 4). The misclassification rate for each group is the proportion of sample observations in that group that are misclassified.
In the case of re-substitution [23,1], 82.7% of the Puntius dorsalis specimens were correctly classified into their respective rivers. However, this is not an unbiased estimator of the actual correct classification rate, because the data set used to compute the DFs is also used to evaluate them [25,14,18,12].
Cross validation [23,1], in contrast, treats n − 1 out of n training observations as a training set. It determines the DFs based on these n − 1 observations and then applies them to classify the one observation left out. This is done for each of the n training observations. For this study, the cross-validated correct classification rate is 69.2%. It is a nearly unbiased estimate, but with a relatively large variance [25,14,18,12]. Neither the analysis nor the cross-validated correct classification rate is particularly high. One reason is the violation of multivariate normality, which can be mitigated by increasing the sample sizes and removing unnecessary variables from the study [5,14,25].
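The difference between the re-substitution and leave-one-out estimates can be sketched with NumPy as follows. This is an illustration under stated assumptions, not a reproduction of the SPSS analysis: the data are synthetic random numbers (only the group sizes 10, 8, 17 and 17 follow this study), equal priors are assumed, and all function names are ours.

```python
import numpy as np

def fit_lcf(X, labels):
    """Fit equal-prior linear classification functions; return (classes, c, c0)."""
    classes = np.unique(labels)
    means = np.array([X[labels == g].mean(axis=0) for g in classes])
    # Error matrix E: within-group sums of squares and cross products.
    E = sum(
        (X[labels == g] - m).T @ (X[labels == g] - m)
        for g, m in zip(classes, means)
    )
    s_pl = E / (len(X) - len(classes))
    c = means @ np.linalg.inv(s_pl)
    c0 = -0.5 * np.einsum("ij,ij->i", c, means)
    return classes, c, c0

def predict(model, X):
    classes, c, c0 = model
    return classes[np.argmax(X @ c.T + c0, axis=1)]

def resubstitution_rate(X, labels):
    """Hit rate when the training data are also used for evaluation."""
    return float(np.mean(predict(fit_lcf(X, labels), X) == labels))

def leave_one_out_rate(X, labels):
    """Hit rate when each observation is classified by a rule fitted without it."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        model = fit_lcf(X[keep], labels[keep])
        hits += int(predict(model, X[i:i + 1])[0] == labels[i])
    return hits / len(X)

# Synthetic data: 52 observations, 3 variables, group means shifted by label.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(1, 5), [10, 8, 17, 17])
X = rng.normal(size=(52, 3)) + labels[:, None] * 0.8

resub = resubstitution_rate(X, labels)
loo = leave_one_out_rate(X, labels)
print(f"re-substitution hit rate = {resub:.3f}")
print(f"leave-one-out hit rate   = {loo:.3f}")
```

The leave-one-out rate is typically lower than the re-substitution rate, which mirrors the drop from 82.7% to 69.2% reported above.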
After transforming the data, the model assumptions should be reassessed to see the effect of the transformation on the distribution of the data. In this study, we applied "Log" and "Arcsine square-root" transformations to skewed and proportional variables respectively. However, reassessment showed that those transformations did not appreciably improve the distributional characteristics and merely changed our measured variables. Therefore, "Log" and "Arcsine square-root" transformations are not necessary for these data.
The combined groups plot [15] (Figure 3) can be used to see where each specimen in the training data set falls in the space defined by the first two DFs, and it shows good separation of the groups formed on the basis of place of collection (rivers). Our intention, however, is to classify a future observation, that is, a fresh Puntius dorsalis specimen (not a member of the training data) caught from one of the four rivers (Mahaweli, Kalu, Kelani and Nilwala) when it is not known from which river.
In its original form proposed by Fisher, the method assumes equality of population covariance matrices, but does not explicitly require multivariate normality. However, optimal classification performance of Fisher's discriminant function can only be expected when multivariate normality is present as well, since only good discrimination can ensure good allocation [27]. This means that when a parametric (linear) classification criterion is derived from a population that critically violates normality, the probability of misclassification is high [25].
If the covariance matrices are unequal but the joint distribution of the variables is normal, then the quadratic classification rule is the optimum one. However, if the covariance matrices are not too dissimilar, linear classification performs quite well, especially if the sample sizes are small [13]. In this study the sample sizes are small, and the main assumption of linear classification analysis, that is, homogeneity of the covariance matrices, is acceptable. The SPSS statistical software package outputs the coefficients [15] of the linear classification functions, and they can be used to construct the four CFs, $L_1(y), \ldots, L_4(y)$, one for each river. By similar substitution of $y_3$, the observation vector of a fresh specimen from the Kelani (R-3) river, into $L_3(y)$ and $L_4(y)$, we get $L_3(y_3) = 1808.687386$ and $L_4(y_3) = 1814.801304$. The maximum is $L_4(y_3)$, which is the classification function corresponding to the Mahaweli (R-4) river. Therefore, a fish from the Kelani river is misclassified into the Mahaweli river. Using the above procedure, the unknown place of collection (river) of a fresh Puntius dorsalis specimen can be found easily.
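The final assignment step can be written as a short Python sketch using the classification scores reported in this paper for the two fresh specimens. Only the L_3 and L_4 scores are reported, so the sketch compares just those two; treating them as the two largest of the four scores is an assumption based on the statement that L_4 is the maximum. The dictionary keys are our own labels.

```python
# Reported classification scores for the two fresh specimens.
scores = {
    "y3 (from Kelani, R-3)": {"R-3 Kelani": 1808.687386, "R-4 Mahaweli": 1814.801304},
    "y4 (from Mahaweli, R-4)": {"R-3 Kelani": 1841.846301, "R-4 Mahaweli": 1843.042745},
}

# Assign each specimen to the river whose classification function is largest.
assigned = {name: max(s, key=s.get) for name, s in scores.items()}

for name, river in assigned.items():
    print(f"{name} -> assigned to {river}")
```

Both specimens are assigned to the Mahaweli river, so the Mahaweli specimen is classified correctly and the Kelani specimen is misclassified, as stated in the text.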
Only four rivers were considered in developing the statistical rule; therefore, only a P. dorsalis specimen caught from one of these four rivers (Mahaweli, Kalu, Kelani and Nilwala) can be assigned correctly to its place of collection.

Conclusions
The overall percentages of correct classification for both the analysis and hold-out samples, which are measures of predictive ability, show that discriminant analysis can be used to predict the unknown place of collection (river) of fish specimens. We can also conclude that the morphometric characters of a fish can be used to predict its unknown place of collection. Predicting the place of collection of a fresh specimen from its posterior probability is a difficult task, because it is hard to calculate the posterior probability that a specimen belongs to a particular group given each of its discriminant scores. Linear classification functions, however, make such a prediction easy.
In nonparametric classification, distributional assumptions are not necessary. This motivates us to try techniques like CART and Neural Networks in further research in this area.

Figure 2: Visual inspection of multivariate normality for four groups.
Since Box's M test is not significant (Table 2), covariance matrix homogeneity is acceptable. The multivariate Shapiro-Wilk test [26] was used to evaluate multivariate normality, and the test is significant for each group at the 0.01 significance level (Table 3). So, as a statistical test result, multivariate normality is violated for each of the four groups, and the p values indicate that the violation is greater in groups one and two than in the other two groups. This is also confirmed by visual inspection (Figure 2), based on the fact that straight lines indicate multivariate normally distributed data and deviation from the straight line is a measure of deviation from multivariate normality [12].
Here R-3 and R-4 represent the fish collected from the Kelani and Mahaweli rivers respectively (Table 2). Suppose two fresh Puntius dorsalis specimens (not used to calculate the CFs) are considered, one from the Mahaweli (R-4) river and the other from the Kelani (R-3) river, with observation vectors $y_4$ and $y_3$ respectively. We need only to measure, and apply the relevant standardization to, the morphometric variables that appear in the CFs. By similar substitution of $y_4$ into $L_3(y)$ and $L_4(y)$, we get $L_3(y_4) = 1841.846301$ and $L_4(y_4) = 1843.042745$. The maximum is $L_4(y_4)$, which is the classification function corresponding to the Mahaweli (R-4) river.