The more spam a user gets, the less likely he will be to notice one innocent mail sitting in his spam folder. And strangely enough, the better your spam filters get, the more dangerous false positives become, because when the filters are highly effective, users will be more likely to ignore everything they catch. Spamming methods improve together with spam filtering methods: spammers attack filters in order to decrease their effectiveness. We can categorize several common attacks on spam filters as follows:
The actual origin of the theory appears to be an unpublished PhD thesis [9].
The original text passed around reads as follows:
Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it
deosn’t mttaer waht oredr the ltteers in a wrod are, the
olny iprmoetnt tihng is taht thefrist and lsat ltteres are
at the rghit pclae. The rset can be a tatol mses and
you can sitll raed it wouthit a porbelm. Tihs is bcuseae
we donot raed ervey lteter by it slef but the wrod as a
wlohe.
After the technique became known on the internet, some people wrote simple scripts⁴ that randomize the letters of each word in a text, leaving the first and last letters in place.
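Such a script can be sketched in a few lines. The version below is only an illustration, not one of the scripts referred to above; the function names `scramble_word` and `scramble_text` and the optional `seed` parameter are assumptions introduced here. It shuffles the interior letters of every alphabetic word and leaves words of three letters or fewer unchanged, since they have at most one interior letter.

```python
import random
import re


def scramble_word(word, rng):
    """Shuffle the interior letters of a word, keeping the first and last fixed."""
    if len(word) <= 3:
        # At most one interior letter, so there is nothing to shuffle.
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text, seed=None):
    """Scramble every run of ASCII letters in text; other characters are untouched."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


if __name__ == "__main__":
    print(scramble_text("According to research at Cambridge university"))
```

Because the shuffle is random, a word may occasionally come out unchanged; a stricter script could reshuffle until the result differs from the input.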
Nowadays, spammers are constantly searching for new ways to
obfuscate words in order to fool spam filters, and this technique often
appears in spam text:
Do you wnat to l00k c00l and w3althy but do not have
the m0ney to aff0rd a= sweeeeet n3w R0lex wtach? Get
a 98% L00kalike R0lex watc_h here! We have replika_s of
all the fines_t bran_d watc_hes. Check the_m out here!
Given a word w, we want to choose the most likely spelling correction for it. Our goal is to find the correction c, out of all possible corrections (which may be w itself, if no better correction can be found), that maximizes the probability of c given the original word w: P(c|w). This is equivalent to asking, "What is the most likely spelling correction c if the user types the word w?"

By Bayes’ Theorem this is equivalent to:
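$$
\begin{aligned}
\operatorname*{argmax}_{c} P(c \mid w) &= \operatorname*{argmax}_{c} \frac{P(w \mid c)\,P(c)}{P(w)} \\
&= \operatorname*{argmax}_{c} P(w \mid c)\,P(c)
\end{aligned}
$$

(Since P(w) is the same for every possible c, we can ignore it.)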
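The maximization of P(w|c)P(c) can be illustrated with a small sketch. Everything concrete in it is an assumption made for the example: the toy word counts in `WORD_COUNTS`, the restriction of candidates to words within one edit of w, and the crude error model in `likelihood`, whose weights 0.95 and 0.05 are arbitrary and not a normalized distribution. A real system would estimate P(c) from a large corpus and P(w|c) from observed obfuscations.

```python
from collections import Counter

# Hypothetical toy language model: word counts standing in for a real corpus.
WORD_COUNTS = Counter({"watch": 50, "want": 40, "money": 30, "wash": 10, "witch": 5})
TOTAL = sum(WORD_COUNTS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def prior(c):
    """P(c): relative frequency of the candidate word in the toy corpus."""
    return WORD_COUNTS[c] / TOTAL


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:] for left, right in splits if right for ch in LETTERS]
    inserts = [left + ch + right for left, right in splits for ch in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def likelihood(w, c):
    """P(w|c): crude, assumed error model with arbitrary weights."""
    if w == c:
        return 0.95  # the word was written as intended
    if w in edits1(c):
        return 0.05  # w is a single edit away from c
    return 0.0       # anything further away is ignored in this sketch


def correct(w):
    """Return the candidate c that maximizes P(w|c) * P(c), or w if none applies."""
    candidates = [c for c in WORD_COUNTS if likelihood(w, c) > 0]
    if not candidates:
        return w
    return max(candidates, key=lambda c: likelihood(w, c) * prior(c))


if __name__ == "__main__":
    for word in ["wtach", "wnat", "wach", "xyzzy"]:
        print(word, "->", correct(word))
```

Note that P(w) never appears in `correct`: it would scale every candidate's score by the same factor and so cannot change which candidate wins. The sketch handles only lowercase letter edits; obfuscations that use digits or punctuation, such as "l00k" or "watc_h", would need a richer error model.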