Evaluation of uncertainty in the measurement of sense of natural language constructions

The task of evaluating uncertainty in the measurement of sense in natural language constructions (NLCs) was researched through formalization of the notions of the language image, artificial cognitive systems (ACSs) and units of meaning. The method for measuring the sense of natural language constructions incorporates fuzzy relations of meaning, which ensures that information about the links between lemmas of the text is taken into account and permits the evaluation of two types of measurement uncertainty of sense characteristics. Using purpose-built application programs, experiments were conducted to investigate how the proposed method identifies informative characteristics of text. The experiments yielded parameter dependencies that allow the Pareto distribution law to be used to describe relations between lemmas; their analysis identifies indicators of the average number of connections of a language image as the most informative characteristics of text.


Introduction
The complexity of the tasks of semantic analysis of text information is considered to be one of the main barriers to building artificial intelligence in general, and to resolving with appropriate levels of quality a considerable range of problems relating to computer linguistics in particular. Ontogeny is intrinsic to how a person learns and acquires new knowledge all their life; therefore each natural intelligence is a unique and dynamic phenomenon capable of improving and embodying a good understanding of its own kind. Construction of linguistic knowledge bases should therefore be based on such principles, and the problems in obtaining new formal methods of semantic analysis of natural language constructions, based upon knowledge bases, are of central importance. Formal approaches to the study of artificial cognitive systems need to be determined. Such systems should be able to simulate human activity in the processes of understanding, refining meaning, and the effective use of input text information.
In [1,2], the introduction of a unit of measurement of image sense, 1 SAW (syntactic associative weighting), was proposed and justified for solving problems of computer linguistics related to the creative thinking of humans. In the process of such modelling, however, it is necessary to take into account the dynamic and subjective nature of cognitive ontogenesis, including speech activity. Formally, this can be done in various ways, one of which is to assess the uncertainty of the measurement result of the sense of separate natural language constructions (NLCs), of texts, and of artificial cognitive systems (ACSs) in general, at a given time. It is known [3] that the uncertainty of measurement is a parameter associated with measurement results that characterizes the dispersion of values that can quite reasonably be attributed to the measured value. It is important, however, that the value directly used to express uncertainty should be internally consistent, directly derived from the components that comprise it, and independent of the grouping of these components and of their subdivision into sub-components [4]. In the sources known to us that consider standard measurement uncertainty of types A and B, neither the concept of uncertainty nor its basic requirements have been applied to problems of semantic text analysis.
The subject chosen to be studied is the process of building knowledge bases for linguistic cognitive systems, with the focus of the research on assessing the uncertainty of formal characteristics of NLC sense. The purpose is to obtain values of measurement uncertainty of the sense of NLCs, as components of an ACS. To achieve this goal it is necessary to formally define the concept of an ACS, justify the method used to measure NLC sense based on fuzzy relations, and obtain and interpret a formal assessment of the uncertainty of the measurement results of the sense of the NLC.

Formulation of the problem
A flow $X = \{x_1, x_2, \ldots\}$ entering any system $S_i$ with a known quantity $n_t$ may, as at time $t_L$, be defined by a Berge graph $G_Q(V, E)$ with a corresponding adjacency matrix $A_Q$ of dimensions $L \times L$. We also know that in the sparse matrix $A_Q$ the number of non-zero $(l, j)$-th elements equals $m$, and each of them takes the value $k_{lj}$. It is necessary to obtain values for the uncertainty $\sigma$ of the results of observations $k_{lj}$ of each system $S_i$ and to calculate the standard uncertainty of type A, $u_A(X)$, and of type B, $u_B(X)$, for all systems. Given the purpose of the study, it is necessary to interpret and analyze the formal results in terms of the domain of computer linguistics.
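As an illustration of this formulation, the sparse adjacency matrix $A_Q$ can be held as a mapping from ordered pairs $(l, j)$ to the counts $k_{lj}$. The following is a minimal Python sketch, not the authors' implementation: the input format (each construction given as a list of ordered image pairs), the helper name `accumulate_links` and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def accumulate_links(flow):
    """Count k_lj for every ordered pair of images seen in the flow X.

    `flow` is an iterable of constructions x_1, x_2, ...; each construction
    is assumed here to be a list of ordered pairs (i_l, i_j).
    """
    links = Counter()
    for construction in flow:
        links.update(construction)  # each occurrence of a pair adds 1 to k_lj
    return links


# Hypothetical toy flow of three constructions (not data from the paper).
flow = [
    [("alice", "say"), ("say", "nothing")],
    [("alice", "say"), ("alice", "think")],
    [("alice", "say")],
]

links = accumulate_links(flow)
m = len(links)                                   # number of non-zero elements of A_Q
images = {image for pair in links for image in pair}
n_images = len(images)                           # number of distinct images in the graph
```

Only the non-zero entries are stored, so memory grows with $m$ rather than with the square of the number of images.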

Literature review
Consider the fundamental requirements for the notion of uncertainty of measurement as set out in [4,5]. The ideal method for determining the uncertainty of measurement results should be universal, suitable for all kinds of measurements and for all types of input data used in the measurements. The internal consistency of the values directly used to express uncertainty allows the uncertainty of one result to be used directly as a component in determining the uncertainty of another result that uses the first.
The uncertainty of the measurement result generally consists of several components, which can be grouped into two categories, depending on the method of evaluation of their numerical value: type A components, which are evaluated by statistical methods, and type B components, which are evaluated by other means. Each detailed statement of uncertainty must include the full list of components and specify for each of them the method used to obtain its numerical value.
The components of category A are generally characterized by their estimated variances $s_i^2$ (or their estimated "standard deviations" $s_i$) and a number of degrees of freedom. If necessary, their covariances should be indicated. Components of category B should be characterized by values $u_j^2$, which can be regarded as approximations to the corresponding variances, the existence of which is assumed. The values $u_j^2$ can be treated as variances and $u_j$ as standard deviations. If necessary, the covariances should be treated similarly.
The combined uncertainty should be characterized by a numerical value obtained by applying the usual method for the combination of variances. The combined uncertainty and its components should be expressed in the form of "standard deviations". If in some cases the total uncertainty is obtained by multiplying the combined uncertainty by a factor, then that factor should always be specified. In general terms, the word uncertainty means doubt, and thus, in the broadest sense, "uncertainty of measurement" means doubt about the validity of the result of a measurement.
Consequently, the uncertainty of the measurement result does not necessarily show the probability that the measurement result is close to the value of the measured quantity; it is only an estimate of the closeness of the measurement result to the best value consistent with the currently available information. The introduction of the concept of "uncertainty of measurement" is a necessary step towards a uniform and simplified assessment of the reliability of a measurement result, since its definition is based on the obtained measurement results, the known conditions of the measurement, and the characteristics of the equipment, and not on the unknown actual value of the measured quantity [6].
To evaluate an input variable $X_i$ that was not obtained as a result of repeated observations, the estimated variance $u^2(x_i)$ and the associated standard uncertainty $u(x_i)$ must be determined by scientific judgment based on all available information about the possible variability of $X_i$. That is, the type B standard uncertainty is obtained from an assumed probability density function based on the degree of belief that an event will occur (this probability is often called subjective probability).
Since information that enables the evaluation of measurement uncertainty can comprise the data of previous measurements discussed in [2], our approach enables a measurement process of the NLC sense based on fuzzy measures. Thus [1], the fuzzy binary relation set on the same base set of language images (or universe) $I$ is defined as the fuzzy relation $Q$ of expression (1), where $\mu_Q(\langle i_l, i_j \rangle)$ is the membership function (here called the dependency function) of the binary fuzzy relation, defined as a mapping $\mu_Q$ on $I \times I$. In expression (1), a sequence (ordered pair) of two elements is denoted by $\langle i_l, i_j \rangle$, where $i_l \in I$, $i_j \in I$. If the carrier (support) $Q_s$ of the fuzzy relation $Q$ is finite, then the cardinality of this fuzzy relation is numerically equal to the number of sequences in its carrier and is denoted by $\mathrm{card}(Q_s)$.
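The notions of carrier and cardinality can be sketched in a few lines of Python. This sketch assumes that the fuzzy relation is stored as a dictionary from ordered pairs to membership values and that the carrier consists of the pairs with non-zero membership, which is the usual definition of the support of a fuzzy set; the pairs and values are hypothetical.

```python
def support(relation):
    """Return the carrier Q_s: all pairs whose membership value is non-zero."""
    return {pair for pair, mu in relation.items() if mu > 0}


# Hypothetical fuzzy relation Q on a toy universe of language images.
Q = {
    ("white", "fang"): 0.9,
    ("fang", "white"): 0.0,   # zero membership: not part of the carrier
    ("boat", "river"): 0.4,
}

Q_s = support(Q)
card_Q = len(Q_s)  # card(Q_s), the number of sequences in the carrier
```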
If the binary fuzzy relation (1) is a basic cognitive feature of the ACS, then the dependency function $\mu_Q(\langle i_l, i_j \rangle)$ should be considered a natural numerical measure of sense. The value $\mu_Q(\langle i_l, i_j \rangle) = 1$ is, according to [1], assigned the sense value of one SAW unit. In general, the dependency function of the fuzzy relation of sense for a pair of language images at the basic level is defined by expression (2), where $k_{lj}$ is the number of connections between the $l$-th and the $j$-th images fixed by the ACS at the moment of time $t_L$. The value of $k_{lj}$ is not difficult to obtain by counting the sequences $\langle i_l, i_j \rangle$ fixed by the ACS, using the technological capabilities of modern linguistic software packages, which allow, for the first time, the concept of measurement uncertainty of NLC sense to be applied and justified.

Let us consider a system $S$, henceforth called an artificial cognitive system (ACS), from the point of view of the process of accumulating its knowledge base. Let $S$ be able to identify images from an infinite set $I = \{i_1, i_2, \ldots, i_{n_t}, \ldots\}$ and to perceive associative links between pairs of images as elements $v \in V$, where $V \subseteq I \times I$ is the space of ordered pairs. To determine an image construction, we apply the notion of a sigma-algebra ($\sigma$-algebra) $\mathcal{F}$ of subsets of $V$. Further assume that a subset $g \subseteq V$ is a language construction that has the property $g \in \mathcal{F}$. In accordance with the properties of a $\sigma$-algebra [7], for sets $A, B \in \mathcal{F}$ the union, intersection and difference of $A$ and $B$ in the set-theoretic sense also belong to $\mathcal{F}$. Suppose that the system $S$ exchanges information with the outside world, treated as a black box, exclusively in the form of language constructions, among which we distinguish a sequence of incoming events $X = \{x_1, x_2, \ldots\}$ and a set of image responses of the system $Y = \{y_1, y_2, \ldots\}$, where $x_i \in \mathcal{F}$, $y_i \in \mathcal{F}$.
Figure 1 shows a diagram of an abstract model of cognitive activity, which includes an external "black box" and an internal ACS that receives as input a continuous set of images of events in the form of a stream $X$. The ACS output images appear as $Y$, which is the response of this system to the external situation $X$ according to the modelling approach to human image thinking [2].
We will now use the ontogenetic principle to build an ACS. The cognitive resource $V$ of the system $S$, which determines the sense of its functioning, can be obtained exclusively through successive accumulation of elements $v$ from the external "black box" and further self-improvement of the set $V$. Formally, the ontogenetic principle is reflected in the fact that the knowledge base of the system $S$ is built from the input constructions $x_i$, $i = 1, \ldots, m_0$, where $m_0$ is the overall number of input image constructions accepted by the system at a given time.
In order to solve applied problems of computer linguistics, let us interpret the components of the derived abstract model of cognitive activity. For a linguistic ACS, we will consider an image $i$ to be a language image that is approximately defined by a lexeme or a word form [8]. Then the associative link between a pair of images, $v$, is a phrase, and the image construction $g$ is a sentence or an utterance, in general an NLC. The accumulated cognitive resources $V$ of the ACS correspond to a processed set of texts, and the result is the building of a linguistic knowledge base $C$.
Unlike the existing models of knowledge in computer linguistics, where the vocabulary of word forms is combined with a multitude of morphological, syntactic and semantic rules, in our case the knowledge base $C$ is formed exclusively from associative knowledge about the combinability of language images $i$. This gives grounds for a unified evaluation of the unit of sense and the quantity of sense of the NLC.

Measurement method for NLC sense based on fuzzy relationship
Under the proposed approach [9] we detail the dependency function that generates the binary fuzzy relation of sense (1) at the following three successive levels, built on the basic level (2):

1. The level of probabilistic forecasting. To standardise the dependency function to the range [0, 1], we calculate the statistical estimate $\lambda$ (the mathematical expectation): if $k_{lj}$ is known for the $n_t$ images of the given ACS at the given time, and $m$ is the number of all non-zero sequences $\langle i_l, i_j \rangle$, then $\lambda = k_\Sigma / m$, where $k_\Sigma$ is the sum of all $k_{lj}$. In this case we apply the known sigmoid function [10], which gives dependency function (3). As a result of the standardisation, the dependency function obtained by the proposed method has the characteristic property of an average value $\mu_Q = 0.5$.

2. The level of incorporation of emotional state. We introduce the possibility of incorporating a binary model of emotion for the ACS [9] by means of the indicator $m \in \{\ldots, -2, -1, 1, 2, \ldots\}$, which gives dependency function (4). In the case of $m = -1$ or $m = 1$, emotions do not affect sense in the functioning of the ACS, and the dependency function (4) reduces to function (3). Increasing the indicator $m$ symmetrically smooths the sigmoid function, as shown in Figure 2.

3. The level of incorporation of motivational components based on the image centre of needs. It is proposed that the image centre of needs $j_0$ be considered as a model of the ACS motive at a given time $t_L$, and that the variance and mean square deviation of the results of observations $k_{lj}$ be calculated according to (5). The obtained value $\sigma$ is then taken as the uncertainty arising from the imprecision of the ACS motive model. This uncertainty is characterised in particular by the imperfection of the basic dependency (3), on the basis of which it is proposed to take the motivational component into account through the image centres of needs.

Depending on the degree of proximity $r$ of the pair of images $\langle i_l, i_j \rangle$ to the image centre of needs, function (4) can shift to the left along the x-axis by reducing the mathematical expectation for the pair, $\lambda_{lj} = \lambda - r \cdot \sigma$, where $r \in \{0, 1, 2, 3\}$, which results in dependency function (6). The construction of a separate algorithm to determine the degree of proximity $r$ of the pair $\langle i_l, i_j \rangle$ to the image centre of needs $j_0$, and the introduction of an additional level that takes into account reflexes and the results of external training, are considered in [9]. Note that, unlike (3) and (4), the dependency function of sense (6) loses the property $\mu_Q = 0.5$ as a result of local shifts in the mathematical expectation. The authors consider this to be evidence of a proper formal interpretation of the known facts of psychology and physiology on contradictions between generally accepted (statistically average) sense and actions influenced by strong motives.
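The three levels can be sketched numerically as follows. The exact forms of expressions (3) to (6) are not reproduced in the text, so this Python sketch rests on stated assumptions: a standard logistic function centred at $\lambda$ for level 1, division of the exponent by the absolute value of the emotion indicator for level 2 (chosen only because it leaves the curve unchanged for $m = \pm 1$ and flattens it as the indicator grows), a population standard deviation for $\sigma$, and the shift $\lambda_{lj} = \lambda - r \cdot \sigma$ for level 3. All function names and toy values are hypothetical.

```python
import math


def mean_links(k_values):
    """lambda = k_Sigma / m: mean of the non-zero link counts k_lj."""
    return sum(k_values) / len(k_values)


def std_links(k_values):
    """Mean square deviation sigma of the link counts (population form assumed)."""
    lam = mean_links(k_values)
    return math.sqrt(sum((k - lam) ** 2 for k in k_values) / len(k_values))


def mu_sense(k, lam, emotion=1, r=0, sigma=0.0):
    """Assumed logistic dependency function for a pair with link count k.

    emotion: non-zero integer indicator; |emotion| == 1 leaves the curve unchanged.
    r:       degree of proximity to the image centre of needs, in {0, 1, 2, 3}.
    sigma:   mean square deviation of the link counts.
    """
    if emotion == 0:
        raise ValueError("the emotion indicator must be non-zero")
    shifted = lam - r * sigma  # lambda_lj = lambda - r * sigma
    return 1.0 / (1.0 + math.exp(-(k - shifted) / abs(emotion)))


# Hypothetical link counts for four pairs (not data from the paper).
k_values = [1, 2, 3, 10]
lam = mean_links(k_values)
sigma = std_links(k_values)

at_mean = mu_sense(lam, lam)                        # level 1: value at k = lambda
smoothed = mu_sense(10, lam, emotion=3)             # level 2: flatter curve
motivated = mu_sense(lam, lam, r=1, sigma=sigma)    # level 3: curve shifted left
```

Under these assumptions the value at $k = \lambda$ is 0.5 for levels 1 and 2, and it rises above 0.5 once a non-zero $r$ shifts the curve to the left.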

Uncertainty of measurement results of NLC sense
The approach to the measurement of sense corresponds to the linguistic knowledge base of one ACS, the source data of which can be either a separate text or a unique set of texts. It should be understood that every text reflects a unique worldview of an author, depicted in their language. To solve the problem of identifying informative text attributes it is important to define the reliability of the knowledge base in general and of the meaning of a pair of images $\mu_Q(\langle i_l, i_j \rangle)$, as a basic component of the knowledge base, in particular. Inasmuch as this actually refers to the measurement of sense, it is proposed that reliability be assessed by applying the concept of uncertainty of the results of multiple measurements of NLC sense.
In the first approximation, assume that a subjective estimate of the amount of sense of one pair of language images is embodied in $N$ statistical arrays of numerical values for different ACSs. Thus, for an arbitrary sequence $\langle i_l, i_j \rangle$ the value $Y = \mu_Q(\langle i_l, i_j \rangle)$, measured according to (3), is functionally dependent on the results of repeated measurements $X_1, X_2, \ldots, X_N$ for different ACSs and, in general, is as follows:

$$Y = f(X_1, X_2, \ldots, X_N). \qquad (7)$$

The estimate of the measured value $Y$, denoted henceforth by $y$, is obtained from the general equation (7) using input estimates $x_1, x_2, \ldots, x_N$ for the $N$ values $X_1, X_2, \ldots, X_N$. Thus, the output estimate $y$, which is the result of a measurement, is expressed as follows:

$$y = f(x_1, x_2, \ldots, x_N).$$

The best estimate of the mathematical expectation, or expected value, $\mu_q$ of a randomly varying value $q$ is the arithmetic mean, or average value, $\bar{q}$ of $n$ observations:

$$\bar{q} = \frac{1}{n} \sum_{k=1}^{n} q_k. \qquad (8)$$

The experimental standard deviation characterizes the variability of the values $q_k$ or, more specifically, their dispersion $s^2$ about the mean value $\bar{q}$, which is calculated by the formula [6]

$$s^2(q_k) = \frac{1}{n-1} \sum_{k=1}^{n} (q_k - \bar{q})^2. \qquad (9)$$

As the average value $\bar{q}$ is taken as the result of multiple measurements, it is important to determine its dispersion. The best estimate of $\sigma^2(\bar{q}) = \sigma^2 / n$, the dispersion of the mean value, is $u_A^2(\bar{q})$, which may be expressed as:

$$u_A^2(\bar{q}) = \frac{s^2(q_k)}{n}. \qquad (10)$$

The experimental dispersion of the mean $u_A^2(\bar{q})$ and the experimental standard deviation of the mean $u_A(\bar{q})$, equal to the positive square root of $u_A^2(\bar{q})$, quantify how well $\bar{q}$ estimates the expectation $\mu_q$ of the value $q$. Given expressions (9) and (10), the experimental standard deviation of the mean $u_A(\bar{q})$ is calculated by the formula [6]

$$u_A(\bar{q}) = \sqrt{\frac{1}{n(n-1)} \sum_{k=1}^{n} (q_k - \bar{q})^2}. \qquad (11)$$

For a deeper consideration of the subjective nature of the measured sense of the sequences in function (7), components of type B standard uncertainty are applied. These are usually determined on the basis of information on the upper and lower boundaries $[a_-; a_+]$ of a predicted (specified a priori) distribution law, or on an interval $U$ with a given confidence level $p$.
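The type A evaluation described above (arithmetic mean, experimental variance and standard deviation of the mean) can be sketched in Python as follows. The function name and the sample values are hypothetical and serve only to make the arithmetic concrete.

```python
import math


def type_a_uncertainty(observations):
    """Return (mean, experimental variance, u_A of the mean) for repeated observations."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(observations) / n
    # Experimental variance with n - 1 degrees of freedom.
    variance = sum((q - mean) ** 2 for q in observations) / (n - 1)
    # Standard deviation of the mean: positive square root of variance / n.
    u_a = math.sqrt(variance / n)
    return mean, variance, u_a


# Hypothetical sense values of one pair of language images in four ACSs.
q_values = [0.50, 0.52, 0.48, 0.50]
q_mean, q_variance, u_a = type_a_uncertainty(q_values)
```

The value of `u_a` shrinks with the square root of the number of observations, so adding further ACSs that contain the same pair reduces the type A uncertainty of its mean sense.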
To determine the type B standard uncertainty in the case of discrete data, the confidence level of each value is multiplied by the square of the deviation of that value, all such products are summed, and the positive square root of the sum is taken; this gives the general formula (12) for the type B standard uncertainty. As the upper and lower limits $[a_-; a_+]$ can be determined for the value $X_i$, the type B standard uncertainty under assumptions about the possible shape of the distribution law can be determined by the formulas of [4][5][6] (expressions (13)-(16), respectively):

(a) for the triangular distribution law;
(b) for the exponential distribution law, where $x$ is the expected value and $\lambda$ is the distribution parameter;
(c) for the Pareto distribution law, where $x_m$ is the initial value and $k$ is the distribution parameter (the density for $x_m$);
(d) for the uniform distribution law.

For given intervals $U_p$ with a known level of confidence $p$, where the normal distribution law is assumed, the type B uncertainty is given by the formula $u_B = U_p / k_p$, where $k_p$ is the coverage factor, which for the normal distribution law is equal to 1.64, 1.96, 2.58 and 3 for confidence levels 0.9, 0.95, 0.99 and 0.9973, respectively [11].
In the absence of information about the applicability of laws (13)-(16) to the distribution of the input value $X_i$, for symmetrical boundaries $\pm a_i$ the standard uncertainty of type B is determined by the formula $u_B(x_i) = a_i / \sqrt{3}$, which can be applied at an early stage of experimental research into the ACS.
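For orientation, the type B evaluations listed above can be sketched in Python using the textbook standard deviations of the named distributions. This is an assumption about the intended formulas, since expressions (12) to (16) are not reproduced in the text: the sketch assumes a symmetric triangular law on $[a_-; a_+]$, deviations taken from the probability-weighted mean in the discrete case, an exponential law parameterised by its rate, and a Pareto law with shape $k > 2$. All function names are hypothetical.

```python
import math


def u_b_discrete(values, probabilities):
    """Square root of the probability-weighted squared deviations from the mean."""
    mean = sum(p * x for x, p in zip(values, probabilities))
    return math.sqrt(sum(p * (x - mean) ** 2 for x, p in zip(values, probabilities)))


def u_b_triangular(a_minus, a_plus):
    """Symmetric triangular law on [a_minus, a_plus]: half-width / sqrt(6)."""
    return (a_plus - a_minus) / (2.0 * math.sqrt(6.0))


def u_b_uniform(a_minus, a_plus):
    """Uniform law on [a_minus, a_plus]: half-width / sqrt(3)."""
    return (a_plus - a_minus) / (2.0 * math.sqrt(3.0))


def u_b_exponential(rate):
    """Exponential law with rate parameter: the standard deviation equals 1 / rate."""
    return 1.0 / rate


def u_b_pareto(x_m, k):
    """Pareto law with scale x_m and shape k; the variance is finite only for k > 2."""
    if k <= 2:
        raise ValueError("the Pareto variance is finite only for k > 2")
    return x_m / (k - 1.0) * math.sqrt(k / (k - 2.0))


def u_b_interval(u_p, k_p):
    """Interval U_p with coverage factor k_p (normal law assumed)."""
    return u_p / k_p
```

For symmetric limits $\pm a_i$, `u_b_uniform(-a, a)` reduces to $a / \sqrt{3}$, the value used when nothing is known about the shape of the distribution.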

Experiments
The leading linguistic package DKPro Core, which is built on the Apache UIMA framework [12], was used to verify experimentally the results of evaluating the measurement uncertainty of NLC sense as a component of ACSs by the proposed method. To implement this series of experiments, an additional Java application program was developed, which not only uses but also extends the collection of DKPro Core software components for natural language processing [13]. A feature of the program, which is built on Java/Maven/Eclipse technology, is that it determines the list of lemmas of a text and the complex dependencies between these lemmas, as described in [14], in the form of a list of $m$ links. As an experimental basis, three well-known, freely available literary works from Project Gutenberg [15] were selected, giving four English-language texts of different sizes: two from "Alice in Wonderland" by Lewis Carroll (an excerpt of 4204 words and the full version of 26690 words), "White Fang" by Jack London (48907 words), and "Three Men in a Boat (To Say Nothing of the Dog)" by Jerome K. Jerome (67328 words). The purpose of the series of experiments was to study the basic characteristics of uncertainty of each of the 4 texts, and to obtain values of uncertainty for the set of pairs of language images $\langle i_l, i_j \rangle$ common to all four texts, according to the proposed method.

Results
The research formalized and interpreted, for the subject area of computer linguistics, the notion of artificial cognitive systems, incorporating the basic ontogenetic principle of constructing an ACS. Formal characteristics of the method of creating binary relations of image sense $Q$ of the ACS $S_Q$ were obtained by modelling the notions of motivational goals and emotional state. The principles of successive multilevel construction of the dependency function $\mu_Q(\langle i_l, i_j \rangle)$ that generates a fuzzy relation $Q$ were proposed, and a characteristic feature, $\mu_Q = 0.5$, of the method of measuring NLC sense was defined.

O.V. Bisikalo and O.M. Vasilevskyi: Int. J. Metrol. Qual. Eng. 8, 6 (2017)

In accordance with the task of identifying informative features of the text, formal theoretical values of the uncertainty $\sigma$ of the results of observations $k_{lj}$ were obtained for each ACS $S_i$, and the standard uncertainties of type A, $u_A(X)$, and type B, $u_B(X)$, were calculated for all ACSs.
With the help of the DKPro Core package, the software program developed in [13] processed the four chosen English texts, which may be interpreted as four different ACSs. The basic results of processing as defined in (5) are presented in Table 1, where the last three columns contain the following data: the mean square deviation $\sigma$ as a percentage of the mathematical expectation $\lambda$; the number of lemmas in the text identified by DKPro Core; and the mean number of different links for one lemma in the text.
The resulting histograms of the experimental density distribution showed a significant resemblance to a Pareto distribution law, as illustrated by a comparison of the experimental results for text 1 (Carroll_part) with the theoretical Pareto density with parameter value $k = 2108$ (Fig. 3).
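A comparison of this kind can be sketched in Python by setting the empirical relative frequencies of the link counts against the standard Pareto density $f(x) = k \, x_m^k / x^{k+1}$ for $x \ge x_m$. The sketch below assumes this standard two-parameter form; the helper names and the toy counts are hypothetical, and the density is evaluated in logarithmic form to avoid overflow for large parameter values.

```python
import math
from collections import Counter


def pareto_pdf(x, x_m, k):
    """Standard Pareto density with scale x_m > 0 and shape k > 0."""
    if x < x_m:
        return 0.0
    return math.exp(math.log(k) + k * math.log(x_m) - (k + 1.0) * math.log(x))


def pareto_mean(x_m, k):
    """Mathematical expectation k * x_m / (k - 1), defined for k > 1."""
    if k <= 1:
        raise ValueError("the Pareto mean is finite only for k > 1")
    return k * x_m / (k - 1.0)


def relative_frequencies(link_counts):
    """Empirical share of pairs that have each observed link count k_lj."""
    total = len(link_counts)
    return {value: n / total for value, n in Counter(link_counts).items()}


# Hypothetical link counts for eight pairs (not data from the paper).
counts = [1, 1, 1, 1, 2, 2, 3, 7]
empirical = relative_frequencies(counts)
theoretical = {value: pareto_pdf(value, 1.0, 3.0) for value in empirical}
```

Plotting `empirical` against `theoretical` for a fitted pair of parameters gives a picture comparable in spirit to a histogram overlaid with a theoretical density.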
Analysis of the language-pair images $\langle i_l, i_j \rangle$, sorted in descending order of $k_{lj}$, revealed four common pairs at the top of the list. The source data, the assessment results $\bar{q}$ according to (8), and the uncertainties of types A and B, in accordance with (11) and (12), are presented in Table 2.

Discussion
The numerical values of the uncertainty of the measurement results of the sense of language-pair images obtained from the experiments yielded new information about the texts analyzed. Presenting each text as a separate ACS shows that the experimental density distribution of the characteristic $k_{lj}$ of the pairs of language images is very similar to a Pareto distribution. However, this conclusion is not consistent with the values of the mathematical expectation $\lambda$, which should have diminished and moved closer to 1 ($\lambda_{\mathrm{Pareto}} = (k \cdot x_m)/(k - 1)$) with an increase in the number of pairs [16], nor with the mean square deviation $\sigma$, which is too large for a Pareto distribution; this can be seen, for example, for text 1 according to (5). Nevertheless, analysis of the data in Table 1 provides a formal basis for advancing the hypothesis that the most informative characteristic of an ACS is the average number of links for a single lemma (language image). The justification for this is that the Pearson correlation coefficient between the columns containing $\lambda$ and the number of lemmas for all 4 ACSs equals 0.198, whereas for the columns containing $\lambda$ and the average number of links it equals 0.945.
At the same time, for the pair of columns $\sigma$ and "average number of links" the correlation coefficient is 0.984, and for the pair of columns "%" and "average number of links" it equals 0.996. This suggests that the distribution law is only Pareto-like, but the uncertainty of the sense of ACSs (parameter $\sigma$) is directly proportional to the mean number of links. Advancing this hypothesis requires further large-scale experimental verification and clarification.
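The Pearson correlation coefficient between two columns of a table such as Table 1 can be computed as in the following Python sketch. The column values shown are hypothetical placeholders, not the values of Table 1, which is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("columns must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical columns for four ACSs (placeholders, not the data of Table 1).
average_links = [2.0, 3.0, 4.0, 5.0]
sigma_column = [10.0, 15.0, 21.0, 24.0]

r_links_sigma = pearson(average_links, sigma_column)
```

With only four ACSs the coefficient is sensitive to each individual text, which is one reason why larger-scale verification is needed.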
The data in Table 2 show a high degree of sense similarity, in accordance with the approach put forward, for the 4 selected pairs of language images, which are used by 3 different authors. The general trend is that the values of type A uncertainty $u_A(X)$ are lower than the corresponding type B values $u_B(X)$ for all ACSs by a factor of approximately 1.5. At the same time, the uncertainty does not exceed 4% of the value of the mathematical expectation $\bar{q}$ for all pairs $\mu_Q(\langle i_l, i_j \rangle)$, other than the pair "know-I" (up to 22.03%). This has an understandable explanation: in the selected excerpt (text 1) from the book by Lewis Carroll, this pair is found relatively more often than in the whole book (text 2, "Alice in Wonderland"). These results allow us to hope that the proposed approach will improve the quality of problem-solving in automatic semantic analysis of texts, in particular, the identification of authors. However, it is likely that a similar comparison of pairs at the bottom of the sorted lists, which are rarely found, would demonstrate high uncertainty.
Further research is also required to define the laws of the distribution of experimental values $\mu_Q(\langle i_l, i_j \rangle)$ and to obtain subjective characteristics for an ACS knowledge base to enable dynamic uncertainty measurement.

Conclusion
The research solved the task of obtaining values of the uncertainty of sense for NLCs as components of ACSs, which is directly related to the problem of understanding the sense of textual information. The method for measuring sense in an NLC based on fuzzy relations was developed further; unlike existing methods, it is based on two formal notions, the artificial cognitive system and the language image, which enable initial statistical data to be obtained in order to evaluate measurement uncertainty of types A and B. For the first time we obtained and interpreted formal values of the uncertainty of measurement results of the sense of NLCs that enable us to take into account information on links between lemmas of a text to solve the tasks of identifying informative features of a text.
The practical significance of the results lies in the software technology obtained for producing tools based on the DKPro Core linguistic package, which allows us to implement our proposed method for semantic analysis of English-language texts. The results of a series of experiments revealed that the distribution law of links between lemmas of a text is Pareto-like, but differs significantly from a formal, classical Pareto distribution, including significantly higher values of the mathematical expectation $\lambda$ (up to 46.3%) and of the mean square deviation $\sigma$ (by several orders of magnitude).
In terms of this proposed approach to determining the sense of NLCs, the augmentation of the size of a text by number of words and, accordingly, the size of its vocabulary by the number of lemmas does not affect the parameters of the distribution law and uncertainty of the sense of each individual ACS.Analysis of the results obtained suggests that the parameter of the average number of links of a language image be considered as the most informative characteristic of the text, since the Pearson correlation coefficient between it and the parameters related to the uncertainty of the sense is greater than 0.945.
Comparison of uncertainty values for 4 pairs of language images, used by 3 different authors, showed a high degree of similarity in the sense of such pairs according to the approach put forward. The type A uncertainty values $u_A(X)$ are proportionally lower than the corresponding type B values $u_B(X)$ for all ACSs by about 1.5 times, which allows us to obtain only a single value $u_A(X)$ for uncertainty. The results of the research include formal parameters for the uncertainty of sense and the average number of links of language pairs, which offer potential improvement in solving tasks of semantic analysis of NLCs, including clustering, classification and determination of authorship of texts.

Nomenclature
$u_A(X)$: evaluation of type A uncertainty
$u_B(X)$: evaluation of type B uncertainty
$\lambda$; $\bar{q}$: mathematical expectation
$\sigma$: standard quadratic (mean square) deviation (SCR)
NLC: natural language construction
ACS: artificial cognitive system

Fig. 1. Diagram of an abstract model of cognitive activity.

Table 2. Results of uncertainty assessment of the 4 selected language-pair images.