Biological Event Extraction
Our team participated in the BioNLP'09 Shared Task on Event Extraction. Our system is described in:
György Móra, Richárd Farkas, György Szarvas, Zsolt Molnár: Exploring ways beyond the simple supervised learning approach for biological event extraction. In Proceedings of BioNLP'09 (NAACL workshop).
Abstract
Our paper presents the comparison of a machine-learnt and a manually constructed expert-rule-based biological event extraction system and some preliminary experiments to apply a negation and speculation detection system to further classify the extracted events.
Supplementary information for the paper is provided below.
The feature set
Two types of features were used:
a) token-based features
trigger word (keyword)
every token of the keyword was a feature suffixed by "_kw" and the index of the token in the trigger expression
binding_kw_0, specificities_kw_1: the features from the trigger expression "binding specificities"
every token, indexed by its position relative to the trigger expression
examples_l_2: the token "examples" is the second token to the left of the trigger word
the first two characters of the POS code of each token
NN_p_r_3: the POS code of the third token to the right of the trigger word is NN
the base form of each token
example_b_l_2: the base form of the second token to the left of the trigger word is "example"
For every token annotated in the A1 files as Protein an extra feature was added, where the name of the protein was replaced with the "$protein" sequence.
When a token was tagged by the Genia Tagger as protein, protein complex or protein family (independently of the A1 annotation), the "$genia_protein" extra feature was added as a suffix to the position of the token.
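The token-based features above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function name, the `window` limit, and the exact form of the "$protein" feature (side and distance appended) are assumptions, and the "$genia_protein" feature is left out.

```python
def token_features(tokens, pos_tags, lemmas, kw_start, kw_end,
                   protein_idx=frozenset(), window=3):
    """Build token-based features around the trigger tokens[kw_start:kw_end].

    `window` (how far from the trigger we look) and `protein_idx`
    (indices of tokens annotated as Protein) are hypothetical parameters.
    """
    feats = set()
    # Trigger tokens: "<token>_kw_<index inside the trigger expression>"
    for i, tok in enumerate(tokens[kw_start:kw_end]):
        feats.add(f"{tok}_kw_{i}")
    # Context tokens, indexed by side (l/r) and distance from the trigger
    for dist in range(1, window + 1):
        for side, idx in (("l", kw_start - dist), ("r", kw_end - 1 + dist)):
            if not 0 <= idx < len(tokens):
                continue
            feats.add(f"{tokens[idx]}_{side}_{dist}")            # surface form
            feats.add(f"{pos_tags[idx][:2]}_p_{side}_{dist}")    # POS prefix
            feats.add(f"{lemmas[idx]}_b_{side}_{dist}")          # base form
            if idx in protein_idx:
                # Assumed naming: protein name replaced by "$protein"
                feats.add(f"$protein_{side}_{dist}")
    return feats


# Made-up sentence; the trigger is "binding specificities" (tokens 3 and 4).
tokens = ["two", "examples", "of", "binding", "specificities",
          "for", "the", "NFKB", "gene"]
pos_tags = ["CD", "NNS", "IN", "NN", "NNS", "IN", "DT", "NN", "NN"]
lemmas = ["two", "example", "of", "binding", "specificity",
          "for", "the", "NFKB", "gene"]
feats = token_features(tokens, pos_tags, lemmas, 3, 5, protein_idx={7})
```

With this toy input the feature set contains, among others, the names used in the examples above, such as binding_kw_0 and examples_l_2.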
b) distance-based features
These numeric features help determine the participant proteins of an event and add information about the proteins nearest to the trigger expression.
$left_first_protein: the distance of the nearest protein on the left side from the trigger word (measured in tokens)
$right_first_protein: the distance of the nearest protein on the right side from the trigger word
$theme_protein: the distance of the theme candidate protein from the trigger expression. If it is negative, the protein is on the left-hand side of the keyword.
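A possible way to compute these three distances is sketched below. It assumes that distances are counted in tokens from the nearest edge of the trigger expression and that the theme candidate lies outside the trigger; the function and argument names are illustrative only.

```python
def distance_features(protein_idx, kw_start, kw_end, theme_idx):
    """Distance features for the trigger tokens[kw_start:kw_end].

    protein_idx: indices of tokens annotated as proteins.
    theme_idx:   index of the theme candidate protein (assumed to be
                 outside the trigger expression).
    """
    feats = {}
    left = [kw_start - p for p in protein_idx if p < kw_start]
    right = [p - (kw_end - 1) for p in protein_idx if p >= kw_end]
    if left:
        feats["$left_first_protein"] = min(left)
    if right:
        feats["$right_first_protein"] = min(right)
    # Signed distance: negative when the theme is left of the trigger
    if theme_idx < kw_start:
        feats["$theme_protein"] = theme_idx - kw_start
    else:
        feats["$theme_protein"] = theme_idx - (kw_end - 1)
    return feats


# Trigger at tokens 3-4, proteins at tokens 0 and 7
right_theme = distance_features([0, 7], kw_start=3, kw_end=5, theme_idx=7)
left_theme = distance_features([0, 7], kw_start=3, kw_end=5, theme_idx=0)
```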
Parameters of the statistical system
The features were filtered by information gain ranking; the best result was obtained with the 300 top-ranked features.
The J48 decision tree learner was run with a confidence factor of 0.2 and a minimum of 5 objects per leaf.
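For readers unfamiliar with information gain ranking, the idea can be sketched in plain Python as below. This is only an illustration of the ranking criterion on discrete features, not the toolkit code used in the experiments; all names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(columns, labels, k=300):
    """Return the names of the k features with the highest information gain.

    columns maps a feature name to its list of values, one per instance.
    """
    ranked = sorted(columns, key=lambda name: info_gain(columns[name], labels),
                    reverse=True)
    return ranked[:k]


# Toy data: feature "a" predicts the label perfectly, "b" not at all
labels = [1, 1, 0, 0]
columns = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0]}
best = top_k_features(columns, labels, k=1)
```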
Expert rules
Each rule consists of three tab-separated parts:
the name of the pattern (for identification purposes),
the biological event class that the rule matches,
the recognising pattern, which is similar to a regular expression.
Types of patterns
"keyword": the expression indicating the event. Every pattern must contain exactly one trigger expression.
"word": the tokens between "" characters are matched as normal text
_ The underscore character matches any word
$protein Matches every token that is annotated as protein in the A1 files
Every piece of the pattern except the keyword can be suffixed with regular-expression-like modifiers. This gives flexibility and power to the patterns.
Suffixes
? Zero or one occurrence of the pattern.
* Zero or more occurrences of the pattern.
+ One or more occurrences of the pattern.
{a,b} Minimum a, maximum b occurrences of the pattern.
For example:
rule: binding5 Binding
_? "of" _{2,3} $protein
rule name: binding5
rule class: Binding
matches: The trigger expression "binding" is followed by zero or one arbitrary word, then comes the word "of".
The text continues with a minimum of two and a maximum of three arbitrary words, followed by the theme protein.
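To make the pattern semantics concrete, here is a small backtracking matcher. It is a sketch under several assumptions, not the rule engine used in the system: it only handles the part of a pattern that follows the trigger expression, quoted words are single tokens, "_" also matches protein tokens, and tokens are given as (text, is_protein) pairs.

```python
import re

# One pattern element: _, $protein or "word", plus an optional suffix
ELEMENT = re.compile(r'(_|\$protein|"[^"]*")(\?|\*|\+|\{(\d+),(\d+)\})?$')


def parse(pattern):
    """Turn a pattern string into (kind, min_count, max_count) triples."""
    parsed = []
    for piece in pattern.split():
        m = ELEMENT.match(piece)
        if m is None:
            raise ValueError(f"unsupported pattern element: {piece}")
        kind, suffix, lo, hi = m.groups()
        if suffix is None:
            bounds = (1, 1)
        elif suffix == "?":
            bounds = (0, 1)
        elif suffix == "*":
            bounds = (0, None)      # None means no upper limit
        elif suffix == "+":
            bounds = (1, None)
        else:
            bounds = (int(lo), int(hi))
        parsed.append((kind, *bounds))
    return parsed


def token_matches(kind, token):
    text, is_protein = token
    if kind == "_":
        return True
    if kind == "$protein":
        return is_protein
    return text == kind.strip('"')


def match_from(elements, tokens, pos=0):
    """True if the elements match a prefix of tokens[pos:]."""
    if not elements:
        return True
    (kind, lo, hi), rest = elements[0], elements[1:]
    ends = [pos] if lo == 0 else []   # positions where this element may stop
    i, count = pos, 0
    while (i < len(tokens) and (hi is None or count < hi)
           and token_matches(kind, tokens[i])):
        i += 1
        count += 1
        if count >= lo:
            ends.append(i)
    # Try the longest repetition first, then backtrack to shorter ones
    return any(match_from(rest, tokens, e) for e in reversed(ends))


binding5 = parse('_? "of" _{2,3} $protein')
# Made-up tokens that follow the trigger word "binding"
after_trigger = [("activity", False), ("of", False), ("the", False),
                 ("human", False), ("STAT1", True)]
too_short = [("of", False), ("the", False), ("STAT1", True)]
```

Here `match_from(binding5, after_trigger)` succeeds, while `match_from(binding5, too_short)` fails because fewer than two arbitrary words separate "of" from the protein.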
Contribution of individual rules
Event matches were counted for matching trigger expressions on the development set of the Shared Task.
tp: true positive matches
fp: false positive matches
event: Phosphorylation
pattern: Phosphorylation3 Phosphorylation $protein
tp: 17
fp: 0
event: Phosphorylation
pattern: Phosphorylation1 Phosphorylation _{0,4} $protein
tp: 61
fp: 5
event: Gene_expression
pattern: Gene_expression1 Gene_expression $protein
tp: 78
fp: 0
event: Gene_expression
pattern: Gene_expression7 Gene_expression $protein _?
tp: 363
fp: 74
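The precision of each rule follows directly from these counts as tp / (tp + fp). The short sketch below computes it for the four rules listed; the dictionary layout is just an illustrative choice.

```python
def rule_precision(tp, fp):
    """Precision of a rule from its true and false positive counts."""
    return tp / (tp + fp)


# (tp, fp) counts copied from the list above
rule_counts = {
    "Phosphorylation3": (17, 0),
    "Phosphorylation1": (61, 5),
    "Gene_expression1": (78, 0),
    "Gene_expression7": (363, 74),
}

precisions = {name: rule_precision(tp, fp)
              for name, (tp, fp) in rule_counts.items()}
for name, value in precisions.items():
    print(f"{name}: {value:.3f}")
```

The stricter rules (Phosphorylation3, Gene_expression1) make no false positive matches, while the looser variants find many more events at the cost of some precision.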
Comparison of the expert rules and the statistical system
The events matched by the rules are largely different from the events extracted by the statistical system: for the two event classes involved, the recall of the combined system was nearly the sum of the recall values of the individual subsystems.
This may be due to the different structure of the events found by the two subsystems. The statistical system uses features that encode the exact token position of the words. The rule-based patterns, on the other hand, can also express just the order and the relative position of the tokens.
If the position of a trigger token (or of several tokens) shows a great variance, the statistical system is unable to learn its positions, but simple rules can be constructed to find the event.
The next simple example can be detected by both the rule based and the statistical systems:
binding of STAT1 alpha
Enumerations, however, are typically hard to handle for a distance-based statistical system, e.g.:
binding of Ets-1, PU.1, or the muE3-binding protein TFE3
Detailed results of experiments
| Method | Features | Training set | All events (R/P/F) | Gene_expression (R/P/F) | Phosphorylation (R/P/F) |
|---|---|---|---|---|---|
| VSM | 100 | train | 15.05 / 31.74 / 20.42 | 33.66 / 41.40 / 37.13 | 74.81 / 37.97 / 50.37 |
| VSM | 300 | train | 16.78 / 31.73 / 21.95 | 36.43 / 41.61 / 38.85 | 74.81 / 37.97 / 50.37 |
| VSM | 500 | train | 16.18 / 33.38 / 21.80 | 35.60 / 42.06 / 38.56 | 73.33 / 38.37 / 50.38 |
| VSM | 100 | train+devel | 15.62 / 34.02 / 21.41 | 32.27 / 41.53 / 36.32 | 74.81 / 37.97 / 50.37 |
| VSM | 300 | train+devel | 16.69 / 33.86 / 22.36 | 37.81 / 41.24 / 39.45 | 74.81 / 37.97 / 50.37 |
| VSM | 500 | train+devel | 16.72 / 32.44 / 22.07 | 37.81 / 41.24 / 39.45 | 74.81 / 37.97 / 50.37 |
| pattern | - | - | 5.41 / 81.13 / 10.14 | 20.64 / 86.63 / 33.33 | 17.04 / 57.50 / 26.29 |
| VSM+pattern | 100 | train | 19.70 / 37.43 / 25.82 | 53.05 / 51.41 / 52.22 | 80.74 / 39.78 / 53.30 |
| VSM+pattern | 300 | train | 21.53 / 36.99 / 27.21 | 56.23 / 51.20 / 53.60 | 80.74 / 39.78 / 53.30 |
| VSM+pattern | 500 | train | 20.93 / 38.90 / 27.22 | 55.40 / 51.81 / 53.55 | 79.26 / 40.23 / 53.37 |
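The F values in the table can be cross-checked from the recall and precision columns, assuming F is the balanced F-measure, F = 2PR / (P + R). The sketch below does this for the "pattern" row; small deviations are expected because R and P are themselves rounded to two decimals.

```python
def f_measure(recall, precision):
    """Balanced F-measure (harmonic mean of recall and precision)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (R, P, F) triples of the "pattern" row, in percent
pattern_row = {
    "All events": (5.41, 81.13, 10.14),
    "Gene_expression": (20.64, 86.63, 33.33),
    "Phosphorylation": (17.04, 57.50, 26.29),
}

recomputed = {name: f_measure(r, p) for name, (r, p, _) in pattern_row.items()}
for name, (_, _, reported) in pattern_row.items():
    print(f"{name}: reported {reported}, recomputed {recomputed[name]:.2f}")
```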