A semantic experiment to separate good from bad
Yesterday was Sunday and I came up with a fascinating idea: what happens if I use WordNet to measure the distance between two words? By assigning a weight to each relation type and navigating the relation graph, I thought I could measure the distance between a chosen word and any other word as the minimum sum of the edge weights along a path connecting the two.
So I tried to assign weights using the relation type as the discriminator. As an example, take the word ‘sword’ and its relations:
related-term relation type assigned weight
weapon: hypernym 5
backsword: hyponym 2
blade: part_meronym 3
broadsword: hyponym 2
cavalry sword: hyponym 2
cutlas: hyponym 2
Excalibur: instance_hyponym 2
falchion: hyponym 2
fencing sword: hyponym 2
foible: part_meronym 3
forte: part_meronym 3
haft: part_meronym 3
hilt: part_meronym 3
rapier: hyponym 2
point: part_meronym 3
The weight I chose for each relation type tries to follow the rule
‘the more closely two words are related, the smaller the number’. So ‘weapon’ is less
related to ‘sword’ than ‘broadsword’ is, because the first expresses a concept broader
than sword (a nuclear bomb is also a weapon), while the second narrows down the word
‘sword’ and makes the statement ‘A broadsword is always a sword’ true, so it is more
closely related to the chosen word.
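As a minimal sketch of that rule, the weights can be kept in a plain hash keyed by relation type and summed along a path. Only the four values from the ‘sword’ table come from this experiment; the WEIGHTS constant and the path_distance helper are names made up for illustration and are not part of the code that follows.

```ruby
# Weights copied from the 'sword' table; other relation types are omitted.
WEIGHTS = {
  'hypernym'         => 5,
  'hyponym'          => 2,
  'instance_hyponym' => 2,
  'part_meronym'     => 3
}.freeze

# Hypothetical helper: sum the weights of the relation types crossed
# along one path of the relation graph.
def path_distance(relation_types, weights = WEIGHTS)
  relation_types.sum { |type| weights.fetch(type) }
end

# One part_meronym step, then one hyponym step
# (e.g. sword -> blade -> a specific kind of blade).
puts path_distance(%w[part_meronym hyponym]) # prints 5

# One hypernym step, then one hyponym step
# (e.g. sword -> weapon -> another kind of weapon).
puts path_distance(%w[hypernym hyponym])     # prints 7
```

With these numbers, stepping up to a broader concept and back down already costs 7, while staying among the parts and kinds of a sword stays at 5 or below.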
Following this general rule, I associated a weight with each of the most common relation types and wrote a few lines of code to compute the distances by navigating the relation graph:
def compute_distances(weights = :default, max_depth_allowed = 6)
  # retrieve the weights associated with each relation type
  # (it's just a hash {:relation_type => weight})
  weights = CONFIG_FILE['distance']["#{weights}"]
  data = Words::Wordnet.new
  # get a list of synsets as a starting point (e.g. red, crimson);
  # each entry is [synset_id, distance, depth]
  synsets_to_analyze = self.synsets.map { |s| [s.synset_id, 0, 0] }
  synsets_to_store = []
  # process the first element of the list
  # until the synsets_to_analyze stack is empty
  while (sys = synsets_to_analyze.shift) do
    sys_id, dis, dep = *sys
    next if dep >= max_depth_allowed
    sys = Words::Synset.new(sys_id, data.wordnet_connection, nil) rescue next
    # save the words of the current synset into the output array
    sys.words.each { |w| synsets_to_store.unshift([sys_id, w, dis]) }
    # push each synset related to this one onto the stack, unless it
    # has already been stored
    sys.relations.each do |r|
      synsets_to_analyze.unshift(
        [r.destination.synset_id, dis + weights["#{r.relation_type}"], dep + 1]
      ) if r.is_semantic? and
           !synsets_to_store.find { |s| s.first == r.destination.synset_id }
    end
  end
  # synsets_to_store now holds every word found, each with the weight
  # that separates it from the starting synsets.
  # (I store them in a db, but only because the context is the same as the Abacus gem)
  synsets_to_store.each do |s|
    a_id = ArticleKey.find_by_the_key(s[1]).id rescue next
    self.distances.find_or_create_by_article_key_id(a_id, :distance => s[2])
  end
end
Here are some of the results for ‘sword’ with depth = 3:
sword: 0
brand: 0
steel: 0
broadsword: 2
rapier: 2
tuck: 2
backsword: 2
fencing sword: 2
falchion: 2
Excalibur: 2
cutlas: 2
sabre: 2
cavalry sword: 2
saber: 2
cutlass: 2
foible: 3
blade: 3
hilt: 3
forte: 3
tip: 3
peak: 3
point: 3
helve: 3
haft: 3
claymore: 4
scimitar: 4
saber: 4
sabre: 4
foil: 4
epee: 4
arm: 5
basket hilt: 5
head: 5
weapon system: 5
knife blade: 5
weapon: 5
widow's peak: 5
cusp: 5
razorblade: 5
cutting edge: 6
pommel: 6
knob: 6
knife edge: 6
fire ship: 7
shaft: 7
slasher: 7
missile: 7
Greek fire: 7
missile: 7
weapon of mass destruction: 7
light arm: 7
WMD: 7
gun: 7
flamethrower: 7
pike: 7
brass knucks: 7
knucks: 7
brass knuckles: 7
knuckles: 7
W: 7
tomahawk: 7
hatchet: 7
lance: 7
knuckle duster: 7
bow and arrow: 7
projectile: 7
sling: 7
bow: 7
stun baton: 7
spear: 7
stun gun: 7
convexity: 8
cutting implement: 8
convex shape: 8
portion: 8
part: 8
handle: 8
grip: 8
hold: 8
handgrip: 8
reap hook: 9
knife: 9
sticker: 9
dagger: 9
axe: 9
file: 9
awl: 9
lawn mower: 9
mower: 9
scissors: 9
ax: 9
sickle: 9
cone shape: 9
conoid: 9
cone: 9
reaping hook: 9
pencil: 9
arrowhead: 9
knife: 9
pair of scissors: 9
spatula: 9
spatula: 9
alpenstock: 9
instrument: 10
weapons system: 11
implements of war: 11
arms: 11
munition: 11
weaponry: 11
Now, as you may notice, there is still a lot of tuning to do; for example, it is pretty strange that ‘weapon of mass destruction’ is more semantically related to ‘sword’ than ‘dagger’ is :-).
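Independently of the weights, one part of the tuning could be the traversal itself. compute_distances walks the graph with a stack, so the distance it records for a synset is the one along the path it happened to follow, which is not guaranteed to be the minimum sum of weights. The sketch below shows one possible alternative, a Dijkstra-style search that always settles the closest pending node first. It runs on an invented four-node toy graph, not on WordNet data; EDGE_WEIGHTS, TOY_GRAPH and minimum_distances are hypothetical names used only here.

```ruby
# Weights reused from the 'sword' table.
EDGE_WEIGHTS = { 'hypernym' => 5, 'hyponym' => 2, 'part_meronym' => 3 }.freeze

# Invented toy graph: node => list of [neighbour, relation_type].
# 'b' is reachable directly (cost 5) or through 'c' (cost 2 + 2 = 4).
TOY_GRAPH = {
  'a' => [['b', 'hypernym'], ['c', 'hyponym']],
  'b' => [['d', 'part_meronym']],
  'c' => [['b', 'hyponym']],
  'd' => []
}.freeze

# Returns a hash {node => minimum sum of weights from start}.
def minimum_distances(graph, start, weights = EDGE_WEIGHTS)
  distances = { start => 0 }
  pending   = [start]
  settled   = {}

  until pending.empty?
    # Settle the closest pending node; its distance is now final.
    node = pending.min_by { |n| distances[n] }
    pending.delete(node)
    settled[node] = true

    graph.fetch(node, []).each do |neighbour, relation_type|
      next if settled[neighbour]
      candidate = distances[node] + weights.fetch(relation_type)
      # Keep the candidate only if it improves on what is known so far.
      if distances[neighbour].nil? || candidate < distances[neighbour]
        distances[neighbour] = candidate
        pending << neighbour unless pending.include?(neighbour)
      end
    end
  end

  distances
end

result = minimum_distances(TOY_GRAPH, 'a')
puts result['b'] # prints 4 (via 'c'), not 5 (the direct hypernym edge)
puts result['d'] # prints 7
```

The linear scan with min_by keeps the sketch short; on a graph the size of WordNet a real priority queue would probably be needed, and a depth limit like max_depth_allowed could still be applied on top.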
Anyway, I’m pretty pleased with the results of this small experiment, though I’m still far from my initial idea: calculating the weight of each word in the dictionary in relation to ‘good’ and ‘bad’, and using these weights to estimate the ‘mood’ of some common trends on Twitter.
Tags: Semantic Relations, wordnet