serafeim's journal

 

Still up and running!


Blog moved! Jun. 11th, 2005 @ 09:39 pm
I've moved this blog to http://noisy.compsoc.man.ac.uk/~szanik/blog. Not a terribly fancy URL, but I get more freedom in terms of hosting.
Current Music: Ο παραμυθάς (Μίλτος Πασχαλίδης)

Blog hosting needed May. 28th, 2005 @ 02:16 am
I'm fed up with LiveJournal: it has way too many limitations. I only get to choose from predefined options, and I'm not allowed to edit the code directly. Also, I can't add a list of links to external blogs. Well, I can, but only up to five!

I've tried blogspot.com but it doesn't play well with konqueror :-/

Suggestions please?

Surviving spam May. 23rd, 2005 @ 01:13 am
No doubt, spam is a pain. I receive tens of spam mails per day and used to delete them
manually, until I discovered bogofilter, which came with Kubuntu. Bogofilter is a
spam filter that implements an improved version of Paul Graham's Bayesian plan for spam.

Bogofilter keeps track of the spamminess of words by counting the number of
ham (i.e., legitimate) and spam mails in which they appear (that is, based on user
feedback). A mail is classified as ham or spam based on roughly the 15 least
neutral (i.e., very hammy or very spammy) words it contains. Thus, a spam with an
article excerpt thrown in will typically not confuse bogofilter.
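
To make the idea concrete, here is a minimal Python sketch of this kind of scoring. It follows the simple scheme from Paul Graham's plan for spam rather than bogofilter's actual implementation, and the function names, the neutral value for unseen words and the clamping bounds are my own assumptions for illustration.

```python
from math import prod


def word_spamminess(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how spammy a word is from the mails it has appeared in."""
    s = spam_counts.get(word, 0)
    h = ham_counts.get(word, 0)
    if s + h == 0:
        return 0.5  # assumption: an unseen word is treated as neutral
    spam_freq = s / n_spam
    ham_freq = h / n_ham
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))  # never let one word be fully decisive


def spam_score(words, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the top_n least neutral words into a score between 0 and 1."""
    probs = [
        word_spamminess(w, spam_counts, ham_counts, n_spam, n_ham)
        for w in set(words)
    ]
    # Least neutral first: largest distance from 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    extreme = probs[:top_n]
    spam_evidence = prod(extreme)
    ham_evidence = prod(1 - p for p in extreme)
    return spam_evidence / (spam_evidence + ham_evidence)


# Made-up counts, standing in for what user feedback would have built up.
spam_counts = {"viagra": 40, "free": 30, "meeting": 1}
ham_counts = {"meeting": 25, "free": 5, "thesis": 20}

spammy = spam_score(["free", "viagra", "now"], spam_counts, ham_counts, 50, 50)
hammy = spam_score(["meeting", "thesis"], spam_counts, ham_counts, 50, 50)
```

Because only the least neutral words are combined, padding a spam with neutral text changes the score very little.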

I enabled bogofilter at the beginning of the month, fed it a corpus
of about 550 spams, and trained it with two weeks' incoming spam. (Unfortunately
I haven't kept the data from the training period.) The following plot shows (i)
the total number of spams I received during the third week, and the subset of
those that were correctly identified by (ii) bogofilter and (iii) my mail provider's filter
(which adds "possible spam" to the subject of suspected spams).


[Figure: detection of spam over time]


The plot shows that my mail provider's filter typically detects far less
than half of the spam I receive, while bogofilter detects almost all of it. This isn't
surprising, because bogofilter benefits from user feedback and, through the
kmail filters, from my address book.

It's also worth noting that my provider's filter wrongly flags (i.e., gives false
positives for) a particular class of legitimate mails: empty mails with
image/office attachments that I receive from friends. Bogofilter would give false
positives only in the unlikely case that someone I don't know were to send me a
mail using very spammy words.

Here's the list of filters I use in kmail:

filter 1 (bogofilter check):
if size <= 256000, pipe mail through "bogofilter -p -e -u"

filter 2 (the people I know are okay):
if From header is in address book, mark as ham

filter 3 (train bogofilter if a friend's mail isn't classified as ham):
if mail is marked as ham (by filter 2) and X-Bogosity header doesn't contain "Ham,", pipe through "bogofilter -S"

filter 4 (I don't want to see mails that are certainly spam):
if mail is not marked as ham (by filter 2) and X-Bogosity header contains "Spam,", mark as spam, mark as read, and move it into spam folder
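
For readers who don't use kmail, the decision logic of filters 1 to 4 can be sketched in Python roughly as follows. This is only an illustration: the function name and the action strings are made up, and the X-Bogosity value is passed in as an argument instead of being produced by bogofilter, so that the sketch runs without it.

```python
MAX_SIZE = 256000  # size limit used by filter 1


def incoming_actions(size, sender_in_address_book, bogosity_header):
    """Return the actions filters 1-4 would take for one incoming mail.

    bogosity_header stands in for the X-Bogosity value that bogofilter
    would add (hypothetical input, so no bogofilter call is needed here).
    """
    actions = []
    header = ""  # mails that skip filter 1 carry no X-Bogosity header

    # filter 1: only mails up to the size limit go through bogofilter
    if size <= MAX_SIZE:
        actions.append('pipe through "bogofilter -p -e -u"')
        header = bogosity_header

    # filter 2: the people I know are okay
    if sender_in_address_book:
        actions.append("mark as ham")

    # filter 3: train bogofilter if a friend's mail isn't classified as ham
    if sender_in_address_book and "Ham," not in header:
        actions.append('pipe through "bogofilter -S"')

    # filter 4: hide mails that are certainly spam
    if not sender_in_address_book and "Spam," in header:
        actions += ["mark as spam", "mark as read", "move to spam folder"]

    return actions


stranger_spam = incoming_actions(1000, False, "Spam, spamicity=0.99")
friend_misjudged = incoming_actions(1000, True, "Spam, spamicity=0.99")
friend_ok = incoming_actions(1000, True, "Ham, spamicity=0.01")
```

Note that a friend's mail never reaches the spam folder, whatever bogofilter thinks of it; it only triggers the training step of filter 3.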


The above filters are checked during the receipt of new mails. I also have a
couple of filters for training bogofilter on false negatives, ie, spams which
weren't identified as such, and false positives (thankfully, none yet). These
filters are meant to be invoked manually.

filter 5 (mark selected mails as spam):
pipe mail through "bogofilter -Ns", mark as spam, and move into spam folder

filter 6 (mark selected mails as ham):
pipe mail through "bogofilter -Sn" and mark as ham


Any comments on how you cope with spam or suggestions are most welcome!
And BTW here's the (messy) script that generates the plot.
Current Music: Εικόνες στα σύννεφα (Υπόγεια ρεύματα)

A puzzle for the bash experts out there! May. 19th, 2005 @ 11:24 pm
I'm trying to figure out what's wrong in a bash script. Here's a simplified (and
meaningless) version of the buggy part:


 1 aa=0
 2 bb=0
 3 echo -e "a\nb" | while read outer; do
 4  for inner in a b; do
 5  #echo -e "a\nb" | while read inner; do
 6   bb=`expr $bb + 1`
 7  done
 8  aa=`expr $aa + 1`
 9  echo "$aa $bb"
10 done




This works as expected:

$ sh a.sh
1 2
2 4


However, replacing the inner for with an apparently equivalent
while (the one in line 5) results in:

$ sh a.sh
1 0
2 0


Running the script with bash -x suggests that bb is increased
within the inner loop as expected, but for some reason is zero when
echo is reached. Any idea what's wrong?
Current Mood: puzzled :)
Current Music: Αϊβαλί (Λέκκας)

How powerful is your favourite programming language? May. 14th, 2005 @ 07:36 pm
I recently read Paul Graham's Hackers and Painters. In one of the essays he lists the features that were introduced by Lisp (in '58), to illustrate why/how Lisp is a really powerful language.

Graham argues that a language is powerful to the extent that it is succinct (in the sense of abstractions, as opposed to, say, merely having short keywords or short constructs). A high level of abstraction means more productivity (less code to write, read, debug and maintain*). From this perspective, power has nothing to do with run-time performance: assembly can deliver the best possible performance, but no one uses it unless it's required. After all, what's more valuable, your time or a computer's?

Bearing in mind how sensitive people can be when it comes to languages, I thought it'd be interesting to classify some common languages according to their power :). Here's a brief description of the features (for completeness), followed by the actual classification.


Conditionals

Support for an if-then-else construct


A function type

Can a function be stored in a variable or passed as an argument?


Recursive functions



Dynamic typing

All variables are pointers to values. It's the values that have a
type, not the variables


Garbage collection



Programs made exclusively of expressions

(Instead of expressions and statements)


A symbol type

(For storing and comparing strings)


A notation for code using trees of symbols and constants



Manipulate code as data

Does the language provide the means for a program to read, write, compile
and run other programs?
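
A few of these features are easier to recognise in code than in prose. Here's a small Python sketch of three of them (a function type, recursive functions and dynamic typing); the function and variable names are made up for the example.

```python
def apply_twice(f, x):
    """A function type: f arrives as an ordinary argument."""
    return f(f(x))


def factorial(n):
    """Recursive functions: factorial is defined in terms of itself."""
    return 1 if n <= 1 else n * factorial(n - 1)


add_one = lambda n: n + 1   # a function stored in a variable
stored = factorial          # an existing function bound to another name

result = apply_twice(add_one, 5)
fact5 = stored(5)

# Dynamic typing: the name `thing` has no type of its own,
# only the value it currently points to does.
thing = 42
first_type = type(thing).__name__
thing = "forty-two"
second_type = type(thing).__name__
```

Here result is 7 and fact5 is 120, while first_type and second_type come out as 'int' and 'str' for the very same variable.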



language  conditionals  function type  recursion  dynamic typing  garbage collection  programs made of expressions  symbol type  tree notation  manipulate code
COBOL     y             -              -          -               -                   -                             -            -              -
BASIC     y             -              y          -               -                   -                             -            -              -
Fortran   y             -              y          -               -                   -                             -            -              -
Pascal    y             -              y          -               -                   -                             -            -              -
C/C++     y             y              y          -               -                   -                             -            -              -
Java      y             -              y          -               y                   -                             -            -              -
PERL      y             y              y          y               y                   -                             -            -              -
Prolog    y             -              y          y               y                   y                             -            -              -
Python    y             y              y          y               y                   -                             y            -              -
Lisp      y             y              y          y               y                   y                             y            y              y


Please let me know if I missed something or if you'd like a language added. Can anyone provide the details for Ruby and Tcl?

* For the productivity benefits of abstraction, you may have a look at Scripting: Higher-Level Programming for the 21st Century, by John Ousterhout (Tcl/Tk creator).
Current Music: High hopes (Pink Floyd)
Other entries
» Tangled up in Python namespaces
Python's a very elegant language but its namespace behaviour can be confusing. Let's warm up with a snippet:

01: #!/usr/bin/env python
02: 
03: class Foo:
04:         # the following statement runs once, when the class body is executed
05:         j = -1 # class variable
06:         def __init__(self,name):
07:                 self.name = name
08:                 self.j = 1 # instance variable
09:                 j = 10 # local variable
10:         def bar(self):
11:                 print("%s: self.j=%d, Foo.j=%d" % (self.name, self.j, Foo.j))
12:                 self.j += 2 # applies only to current instance of Foo
13:                 Foo.j += 3 # applies to all instances of Foo
14: 
15: if __name__=='__main__':
16:         a = Foo('a')
17:         a.bar()
18:         b = Foo('b')
19:         b.bar()
20:         c = Foo('c')
21:         c.bar()

Running this illustrates the difference between instance and class variables:
$ python Foo.py
a: self.j=1, Foo.j=-1
b: self.j=1, Foo.j=2
c: self.j=1, Foo.j=5
Nothing special here. But let's re-run the snippet after commenting out the initialisation of self.j at line 8:
$ python Foo.py
a: self.j=-1, Foo.j=-1
b: self.j=2, Foo.j=2
c: self.j=5, Foo.j=5
Being uninitialised, self.j falls back to Foo.j. But wait a minute, if self.j and Foo.j refer to the same thing then every invocation of bar() should increase j by 2+3! A little more investigation is needed, let's add a second print at the end of bar():

01: #!/usr/bin/env python
02: 
03: class Foo:
04:         # the following statement runs once, when the class body is executed
05:         j = -1 # class variable
06:         def __init__(self,name):
07:                 self.name = name
08: #                self.j = 1 # instance variable
09:                 j = 10 # local variable
10:         def bar(self):
11:                 # self.j is not initialised so falls back to the class variable
12:                 print("%s: self.j=%d, Foo.j=%d" % (self.name, self.j, Foo.j))
13:                 self.j += 2 # applies only to current instance of Foo
14:                 Foo.j += 3 # applies to all instances of Foo
15:                 print("%s: self.j=%d, Foo.j=%d" % (self.name, self.j, Foo.j))
16: 
17: if __name__=='__main__':
18:         a = Foo('a')
19:         a.bar()
20:         b = Foo('b')
21:         b.bar()
22:         c = Foo('c')
23:         c.bar()


This yields:

$ python Foo.py
a: self.j=-1, Foo.j=-1
a: self.j=1, Foo.j=2
b: self.j=2, Foo.j=2
b: self.j=4, Foo.j=5
c: self.j=5, Foo.j=5
c: self.j=7, Foo.j=8
self.j still falls back to the class variable Foo.j during the first print (because it isn't yet initialised). However, self.j becomes an instance variable after the addition at line 13. Let me rephrase this: the same name (self.j) within the same scope (the bar method) refers to two different things!

The official Python tutorial is far from informative on this issue. For a clear explanation check out this note at Python's reference manual.

PS. welcome to my new blog ;)
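
The switch from class variable to instance variable can be made visible by peeking at the instance's own attribute dictionary before and after an augmented assignment. This is a stripped-down sketch rather than the snippet above: the class is reduced to its class variable and the variable names are made up.

```python
class Foo:
    j = -1  # class variable, stored in Foo.__dict__


a = Foo()

# Before the assignment the instance has no attribute "j" of its own,
# so reading a.j finds Foo.j instead.
own_before = "j" in vars(a)
value_before = a.j

# a.j += 2 behaves like a.j = a.j + 2: the read finds Foo.j (-1),
# and the write creates a brand new attribute on the instance.
a.j += 2

own_after = "j" in vars(a)
value_after = a.j
class_value = Foo.j  # the class variable itself is left untouched
```

After the assignment own_after is True, value_after is 1 and class_value is still -1, which matches the jump from -1 to 1 in the first two lines of the output above (where Foo.j is additionally increased by 3).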
