# Data Management

## Method

In this experiment, we examined Edward Bloom's Big Fish in detail. We sampled 10 pages of the book and, on each page, counted the number of lines starting with particular types of words and particular types of letters. To select the pages at random, we used the VassarStats randomizer to generate 10 random page numbers. On each selected page, we recorded the number of lines that started with a noun, a verb, an adjective, a vowel, and a consonant. For lines starting with nouns and verbs, we also divided the counts into those that began with vowels and those that began with consonants.

The data were entered into an Excel spreadsheet and then transferred to JMP IN. We treated each of the 10 pages as an individual, which gave us 10 values for the number of lines starting with each word and letter type. In JMP IN, we constructed 5 histograms showing the frequency distribution of the number of lines starting with each type. We then calculated descriptive statistics for each data set and summarized the 5 data sets in a table. The descriptive statistics we chose to include were the mean, median, maximum, minimum, upper quartile, lower quartile, 95% confidence interval, and sample size.

We then entered the numbers of noun and verb lines that started with vowels and consonants into JMP IN and used it to produce a contingency table. The purpose was to determine whether there is a statistically significant relationship between the type of word and the type of letter that word begins with. Once the contingency table was created, we ran a Pearson chi-square test on the data in JMP IN.
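The page-selection step can be sketched in a few lines of Python. This is only an illustration of simple random sampling, not a reproduction of what the VassarStats randomizer does internally. The page count `TOTAL_PAGES`, the seed, and the choice to sample without replacement (so that no page is drawn twice) are all assumptions made for the example.

```python
import random

TOTAL_PAGES = 180  # hypothetical page count; replace with the total for the edition used
SAMPLE_SIZE = 10   # number of pages sampled in the study


def sample_pages(total_pages, sample_size, seed=None):
    """Draw page numbers uniformly, without replacement, from 1..total_pages."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return sorted(rng.sample(range(1, total_pages + 1), sample_size))


pages = sample_pages(TOTAL_PAGES, SAMPLE_SIZE, seed=2024)
print(pages)
```

Recording the seed alongside the data would let someone else regenerate the same list of pages and check the counts.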

## Results

Figure 1. Histogram of lines starting with nouns on 10 pages of Edward Bloom's Big Fish

Figure 2. Histogram of lines starting with verbs on 10 pages of Edward Bloom's Big Fish

Figure 3. Histogram of lines starting with adjectives on 10 pages of Edward Bloom's Big Fish

Figure 4. Histogram of lines starting with vowels on 10 pages of Edward Bloom's Big Fish

Figure 5. Histogram of lines starting with consonants on 10 pages of Edward Bloom's Big Fish

Table 1. Descriptive statistics of data sets for lines starting with nouns, verbs, adjectives, vowels, and consonants on 10 pages of Edward Bloom's Big Fish. Columns give the values for lines starting with each word or letter type.

| Statistic | Nouns | Verbs | Adjectives | Vowels | Consonants |
|---|---|---|---|---|---|
| Mean | 4.6 | 4.5 | 2.5 | 5.4 | 21.6 |
| Median | 4.5 | 4 | 2.5 | 4.5 | 22.5 |
| Upper quartile | 5 | 6.25 | 4 | 7.25 | 23 |
| Lower quartile | 3 | 3 | 1 | 4 | 19.75 |
| Maximum | 10 | 8 | 4 | 9 | 24 |
| Minimum | 2 | 2 | 1 | 3 | 18 |
| Upper 95% confidence limit | 6.152689 | 5.90059 | 3.407999 | 6.83864 | 23.03864 |
| Lower 95% confidence limit | 3.047311 | 3.09941 | 1.592001 | 3.96136 | 20.16136 |
| Sample size | 10 | 10 | 10 | 10 | 10 |
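
As a consistency check on Table 1, the confidence limits can be related back to the means. The sketch below assumes the limits are the usual t-based 95% confidence interval for a mean (mean ± t × s / √n with n − 1 = 9 degrees of freedom); the report does not state the formula, so this is an assumption. Under it, each interval should be centred on the reported mean, and the half-width implies a sample standard deviation, which is not reported in the table.

```python
import math

N_PAGES = 10
T_CRIT = 2.262157  # two-sided 95% critical value of Student's t with 9 degrees of freedom

# (mean, lower limit, upper limit), copied from Table 1
TABLE_1 = {
    "Nouns": (4.6, 3.047311, 6.152689),
    "Verbs": (4.5, 3.09941, 5.90059),
    "Adjectives": (2.5, 1.592001, 3.407999),
    "Vowels": (5.4, 3.96136, 6.83864),
    "Consonants": (21.6, 20.16136, 23.03864),
}


def implied_sd(lower, upper, n=N_PAGES, t_crit=T_CRIT):
    """Back out the sample standard deviation implied by a t-based interval."""
    half_width = (upper - lower) / 2
    standard_error = half_width / t_crit
    return standard_error * math.sqrt(n)


for name, (mean, lower, upper) in TABLE_1.items():
    midpoint = (lower + upper) / 2  # should match the reported mean
    sd = implied_sd(lower, upper)
    print(f"{name}: mean={mean}, midpoint={midpoint:.3f}, implied SD={sd:.3f}")
```

The implied standard deviations are derived values, valid only if the t-based assumption holds.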

Figures 1 through 5 show the histograms for the number of lines starting with nouns, verbs, adjectives, vowels, and consonants, respectively, on the 10 randomly selected pages of Edward Bloom's Big Fish. The descriptive statistics for Figures 1 through 5 are contained in Table 1.
We found no statistically significant relationship between the type of word and the type of letter that word starts with (Pearson chi-square test, chi-square = 0.735, n = 91, p = 0.3914). Because the p value was greater than 0.05, we could not reject the hypothesis that the two variables are independent. In other words, our data provide no evidence that the letter a word begins with depends on whether the word is a noun or a verb. This does not prove that no relationship exists, only that our sample did not detect one.
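
For readers who want to see how the test works, here is a minimal Python sketch of a Pearson chi-square test of independence. It is not JMP's implementation. The observed counts from the contingency table are not given in this report, so none are filled in. Instead, the last lines recompute the p value from the reported statistic, assuming a 2 x 2 table (noun or verb by vowel or consonant) and therefore 1 degree of freedom.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for a table of observed counts.

    `table` is a list of rows, e.g. [[a, b], [c, d]]. Every row and column
    total must be positive, otherwise an expected count of zero occurs.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def p_value_df1(chi2):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Recompute the p value from the statistic reported above
reported_chi2 = 0.735
print(round(p_value_df1(reported_chi2), 4))
```

The result should agree with the reported p value up to the rounding of the chi-square statistic, which supports the 1 degree of freedom assumption.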

## Discussion

In conducting this study, we used the VassarStats randomizer to select our sample so that the 10 pages would be chosen at random. Because the randomizer generated page numbers based only on the total number of pages in Big Fish, every page had the same chance of being selected, which guards against selection bias on our part. The population we analyzed is the full set of pages in Edward Bloom's Big Fish. The frequency distributions, descriptive statistics, and test of association in our results therefore apply solely to Big Fish, because the pages we sampled were drawn from this book alone and are representative only of it. To improve the study, we would recommend increasing the sample size, which would make the sample more representative of the population and narrow the confidence intervals. We would also recommend finding a way to tabulate by computer the number of lines starting with each type of word or letter, which would reduce the possibility of human error in counting.
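
As a rough illustration of the computer tabulation suggested above, the sketch below counts the lines on a page that start with a vowel or a consonant. It assumes the page text is available as a plain string with one printed line per text line, that leading quotation marks and spaces are skipped, and that "y" counts as a consonant. The sample text is made up for the example and is not taken from the book. Classifying the first word as a noun, verb, or adjective would need a part-of-speech tagger and is not attempted here.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def first_letter_type(line):
    """Return 'vowel' or 'consonant' for the first letter in the line, or None."""
    for ch in line:
        if ch.isalpha():  # skip leading spaces, quotation marks, digits
            return "vowel" if ch.lower() in VOWELS else "consonant"
    return None  # blank line or no letters at all


def tally_page(page_text):
    """Count the lines of a page by the type of their first letter."""
    counts = {"vowel": 0, "consonant": 0}
    for line in page_text.splitlines():
        kind = first_letter_type(line)
        if kind is not None:
            counts[kind] += 1
    return counts


# Made-up sample text, used only to exercise the functions
sample_page = "\n".join([
    "An example line of text",
    '"Quoted lines are handled too',
    "  indented lines as well",
    "",
    "Every blank line is skipped",
])
print(tally_page(sample_page))
```

Running `tally_page` once per sampled page would give the vowel and consonant counts directly, leaving only the word-type counts to be done by hand.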