Amazon’s take on the Book of Mormon

Amazon has an algorithm for noting the “Statistically Improbably Phrases” in any given book. The idea is to look for word combinations that are uncommon generally but common in the book, in the hope that this gives potential buyers some insight into what the book is about. Here are the ones for the Doubleday edition of the Book of Mormon:

hearts upon riches,
believest thou,
exceeding faith,
mine epistle,
more wicked part,
nowise inherit,
hath covenanted,
abominable church,
land northward,
labor exceedingly,
angel spake,
choice above all other lands,
plain unto,
thou hast beheld,
continual peace,
beareth record,
time cometh,
exceedingly great joy,
made manifest unto,
land southward,
whoso believeth,
soul delighteth,
stiffnecked people,
secret abominations,
salvation cometh

Not a bad list, all things considered, although where is “and it came to pass”? I guess that phrase isn’t improbable enough. Nor, apparently, are the various references to deity, which, although ubiquitous in the Book of Mormon, are not uncommon in other books. The link is here, although, you know, if you’re reading this I bet you already have a copy of the Book of Mormon.

16 comments for “Amazon’s take on the Book of Mormon”

  1. March 8, 2007 at 9:28 am

    What a philosophy we could construct from this list! It’s a relief that “secret abominations” and “hearts upon riches” do not form the subject of much discussion, but how sad that “continual peace” and “exceedingly great joy” are so “improbable.” How would the tourist board of Arizona change its advertising if it realized that “land southward” is probably read/said more in Utah than elsewhere? And how will the statistical frequency of “abominable church” change in the Bible Belt during the Romney campaign?

  2. March 8, 2007 at 10:28 am

    Hmmmm… Since those phrases are all less than 5 words in length, I wonder what algorithm they’re using? Maybe bi-grams something or other? Wonder what it would look like if you used long snippets…. n-grams where n = 10? n = 20?

  3. March 8, 2007 at 11:44 am

    Their algorithms also offer unrelated books as recommended buys…

  4. Frank McIntyre
    March 8, 2007 at 2:04 pm

    A NM,

    I confess complete ignorance, but on the other hand surely huge chunks of the book stand out as unique if you make the chunks long enough (with the obvious exception being the overlap with the Bible).


    That is weird, because the Doubleday link in the post you mention goes to a different page than the one I linked to. The Doubleday book I link to suggests buying The Pearl of Great Price (which seems surprisingly good). Such are the vagaries of computer algorithms.

    Ardis, I think you should contact Arizona and New Mexico.

  5. DKL
    March 8, 2007 at 3:18 pm

    Well, my favorite “improbable phrase” would have to be more wicked part. However improbable one might suppose that phrase to be, a sizable part of Christianity has historically been given over to controlling (in some measure) our more wicked part.

  6. Gavin Guillaume
    March 8, 2007 at 5:20 pm

    Most n-gram tools don’t go above n=5, sadly. I agree, though, that it would be interesting to go n=20 or so. The problem isn’t reprocessing the BoM with n=20; it’s whether or not the original source material (for comparison) has been processed with n=20 (which it hasn’t, probably).

  7. Frank McIntyre
    March 8, 2007 at 6:16 pm

    Gavin/Nonny Mouse,

    Is the n keeping track of the maximum number of words in each phrase or is it more esoterically related to the number of words in a phrase?

  8. March 8, 2007 at 6:19 pm

    Frank: bigrams look at the statistical probabilities of the text in 2 word chunks, tri-grams in 3 word chunks, n-grams in n word chunks. I went back and looked over their description of the algorithm (pretty vague, but still interesting) and it looks like what they do is come up with the statistical probabilities of all the words (or sets of n words) in their entire corpus of scanned text, and then look at which ones appear statistically more often or less often in the target book compared to the rest of the corpus. Which is definitely interesting, even on the 2 word level.

    Gavin’s right: it’s not enough to compute just the BOM 5-grams or what have you; you have to re-process the entire corpus. It’s just that 2 word phrases tend to be pretty small, you know? :)
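
As a rough illustration of the comparison described in this comment, here is a minimal Python sketch. It is not Amazon's actual method, which their description leaves vague. The function names, the add-one smoothing, and the toy texts are all assumptions made for the example.

```python
from collections import Counter

def ngrams(text, n=2):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def improbable_phrases(book, corpus, n=2, top=3):
    """Rank the book's n-grams by how much more frequent they are
    in the book than in the comparison corpus (hypothetical scoring)."""
    book_counts = Counter(ngrams(book, n))
    corpus_counts = Counter(ngrams(corpus, n))
    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())
    scores = {}
    for phrase, count in book_counts.items():
        p_book = count / book_total
        # Assumption: add-one smoothing, so a phrase that never appears
        # in the corpus does not cause a division by zero.
        p_corpus = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = p_book / p_corpus
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Toy texts, invented for the example.
book = "it came to pass land northward land northward land northward"
corpus = "it came to pass it came to pass and it came to pass"
print(improbable_phrases(book, corpus, n=2, top=3))
```

In this toy case "land northward" ranks first because it is frequent in the book and absent from the corpus, while "it came" scores low because the corpus is full of it. That matches the post's guess about why "and it came to pass" did not make the list.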

  9. March 8, 2007 at 6:25 pm

    I was writing my post while you asked your question :) So, for n-grams usually what you do is just look in successive n-word windows. So, you count up all the probabilities for the first sentence from this comment like this, where n = 2: “I was”, “was writing”, “writing my”, “my post”, “post while”, “while you”. Etc. And you figure out what the chance of each of those combinations in the document is. Then you reference that versus the chance of that occurring in the overall corpus. If it just occurs once, then you know it’s pretty unique to your volume, and can be an “interesting phrase”.

    That’d be the easiest way to generate that info, but it’s too hard to tell from their description if that’s what they’re really doing.
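
The successive n-word windows described in this comment can be sketched in a few lines of Python. The helper name `word_windows` is made up for the example, and splitting on whitespace is a simplifying assumption (a real tool would also handle punctuation and case).

```python
def word_windows(sentence, n=2):
    """Slide an n-word window across the sentence, one word at a time."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# The first sentence of the comment above, minus the smiley.
sentence = "I was writing my post while you asked your question"

bigrams = word_windows(sentence, n=2)
for pair in bigrams:
    print(" ".join(pair))

# A window longer than the sentence yields nothing, which is one reason
# large n only makes sense over long stretches of text.
print(word_windows(sentence, n=20))
```

The first six windows are the ones listed in the comment: “I was”, “was writing”, “writing my”, “my post”, “post while”, “while you”.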

  10. Gavin Guillaume
    March 8, 2007 at 7:10 pm

    This is, at a very generic level, one of the techniques they used to catch the Unabomber.

  11. Mike Parker
    March 8, 2007 at 7:21 pm

    Is it just me, or do those phrases look like examples of subject lines from spam email?

  12. DKL
    March 9, 2007 at 12:07 am

    Mike, I noticed that, too. Just add the word “viagra” and they cover most of the spams currently in my inbox:

    viagra. exceedingly great joy
    thou hast beheld viagra
    soul delighteth viagra
    secret abominations viagra
    viagra, more wicked part
    land southward, viagra
    viagra, salvation cometh

  13. March 9, 2007 at 1:29 pm

    DKL: This looks like the high priests’ variation on the old game played by the priests, adding “in a bathtub” to the names of the hymns.

  14. DKL
    March 9, 2007 at 8:12 pm

    LOL. So that’s where all that spam is coming from.

  15. Sarah
    March 11, 2007 at 9:44 pm

    I guess “for behold, I say unto you nay,” was too long to count.

    Meanwhile, the Book of Mormon, as sold by the Church for $2, has to be a ridiculously good value, since the Doubleday version at $17 comes out to 16,889 words per dollar. Though, depending on the index used, as few as 16% of all books with the Search Inside features enabled are “harder” to read.
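
The value comparison above can be worked through as a quick calculation. This sketch assumes both editions contain the same number of words and that the 16,889 words-per-dollar figure is exact rather than rounded, neither of which the comment confirms. The variable names are invented for the example.

```python
DOUBLEDAY_PRICE = 17               # dollars, from the comment
DOUBLEDAY_WORDS_PER_DOLLAR = 16_889
CHURCH_PRICE = 2                   # dollars, from the comment

# Implied word count of the Doubleday edition.
approx_words = DOUBLEDAY_PRICE * DOUBLEDAY_WORDS_PER_DOLLAR

# Assumption: the Church edition has the same word count.
church_words_per_dollar = approx_words / CHURCH_PRICE

print(approx_words)             # 287113
print(church_words_per_dollar)  # 143556.5
```

Under those assumptions the $2 edition works out to about 8.5 times as many words per dollar, which is just the price ratio 17/2.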

  16. Paul
    March 12, 2007 at 12:12 pm

    “Statistically Improbably Phrases”
    What are the odds someone wishing to address statistically improbable phrases would mis-spell “improbable”?

    semper et ubique

Comments are closed.