LASSO: Ligand Activity in Surface Similarity Order
Frequently Asked Questions
- What does the ‘rank’/LASSO-score output mean in a LASSO run?
- The LASSO score gives only a 'relative' similarity of the compounds in my chemical databases to my known actives. Are there any ‘confidence’ measures of the adequacy of training? Are there any ‘metrics’ that measure ‘standard error’ or ‘reliability of the neural network training’?
- Since the LASSO score gauges 'similarity' of the LASSO descriptor vectors, are there any ways I may validate my current trained LASSO neural net to gauge its reliability in extracting novel ligands from my databases that have an 'enriched' probability of being active?
- When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, some of the ‘known actives’ I seed are given moderately high scores, e.g. >0.7, and some have LASSO scores <0.1. Does this mean the training is insufficient?
- When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, even the ‘known actives’ I seed are given small LASSO scores (i.e. close to zero). Does this mean that the LASSO training hasn’t adequately picked up the “facets” of the active ligands that are important for ‘binding’ and activation of my protein/receptor target?
- Given LASSO scores where some of the seeded actives receive high LASSO scores and some receive lower values, should I prioritize for synthesis/testing those compounds near the higher-scoring 'known actives'?
Q-1: What does the ‘rank’/LASSO-score output mean in a LASSO run?
A: As indicated by the diagram below, the LASSO score appraises how similar the features of a ligand are to those that the neural network ‘training’ determined to be common to the “active” ligands in the training set.
LASSO first determines, for each ligand, the types of its surface points, which fall into 23 distinct chemical categories, and forms a ‘vector’ holding the count of each feature type in that ligand.
The 23 counts for each ligand are fed into the 5 nodes of the hidden layer of a feed-forward neural network, each connection carrying a ‘weight’ that is adjusted during LASSO training. The hidden layer feeds a single output, which is ‘normalized’ to lie in [0,1] during the ‘filter’ portion of a LASSO run. The larger the LASSO score of a ligand from a test database, the more similar its ‘feature vector’ is to those of the ‘actives’ (or dissimilar to those of the decoys) used in training.
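The architecture described above (23 inputs, 5 hidden nodes, 1 output in [0,1]) can be illustrated with a minimal sketch. This is not the LASSO implementation: the sigmoid activation, the use of a sigmoid as the output ‘normalization’, the bias terms, and the random weights are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

N_FEATURES = 23  # surface point type counts per ligand (from the text)
N_HIDDEN = 5     # hidden-layer nodes (from the text)


def sigmoid(x):
    """Numerically stable logistic function; maps any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_random_net(seed=0):
    """Build an UNTRAINED net with small random weights (illustration only)."""
    rng = random.Random(seed)
    hidden_w = [[rng.uniform(-0.1, 0.1) for _ in range(N_FEATURES)]
                for _ in range(N_HIDDEN)]
    hidden_b = [0.0] * N_HIDDEN
    out_w = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
    out_b = 0.0
    return hidden_w, hidden_b, out_w, out_b


def lasso_like_score(counts, net):
    """Forward pass: 23 counts -> 5 hidden nodes -> 1 output in [0, 1]."""
    if len(counts) != N_FEATURES:
        raise ValueError("expected %d feature counts" % N_FEATURES)
    hidden_w, hidden_b, out_w, out_b = net
    hidden = [sigmoid(sum(w * c for w, c in zip(row, counts)) + b)
              for row, b in zip(hidden_w, hidden_b)]
    # Assumption: the output 'normalization' is modeled here as a sigmoid.
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


net = make_random_net(seed=1)
score = lasso_like_score([3] * N_FEATURES, net)
```

With a trained net in place of the random one, ranking a database amounts to computing this score for every ligand and sorting in descending order.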
Q-2: The LASSO score gives only a 'relative' similarity of the compounds in my chemical databases to my known actives. Are there any ‘confidence’ measures of the adequacy of training? Are there any ‘metrics’ that measure ‘standard error’ or ‘reliability of the neural network training’?
A: There are two:
- The MSE and SSE values printed to the screen during the LASSO neural net training step indicate how the ‘active’ range of the LASSO score should be viewed. They show how good the training was, which is one of the reasons they are printed on the screen.
- SSE is the "sum of squared errors" and MSE is the SSE per pattern (i.e., the "mean of squared errors"). For example, for the vector of surface point counts of an active molecule the expected output is 1, so if during training the neural net outputs only 0.9, this adds (0.1)^2 to the SSE.
To make this concrete: an MSE value of, say, 0.05 means that on average the squared error for each pattern is 0.05. In other words, the typical error per pattern is about 0.22 (the square root of 0.05), so decoys will be assigned a value of roughly 0.25 on average and actives a value of roughly 0.75 (in an active/decoy training scenario).
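The two definitions can be checked with a few lines of code. This is a sketch with made-up outputs, not LASSO's own training code; the function names and the example values are hypothetical.

```python
import math


def sse(targets, outputs):
    """Sum of squared errors over all training patterns."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


def mse(targets, outputs):
    """Mean of squared errors: SSE divided by the number of patterns."""
    return sse(targets, outputs) / len(targets)


# One active (expected output 1) scored 0.9 contributes (0.1)^2 to the SSE.
single_active = sse([1.0], [0.9])

# Hypothetical mini training set: two actives (target 1), two decoys (target 0).
targets = [1.0, 1.0, 0.0, 0.0]
outputs = [0.9, 0.75, 0.25, 0.2]
total = sse(targets, outputs)
mean = mse(targets, outputs)

# Typical per-pattern error implied by an MSE of 0.05.
typical_error = math.sqrt(0.05)
```

Taking the square root of the MSE converts it back to the scale of the LASSO score, which is what makes it usable as a rough guide to where actives and decoys will land.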
Note that the MSE gives you an additional guideline for interpreting the LASSO score relative to the 'particular' neural net training you will then use for screening. This is important because the LASSO score range corresponding to 'probable' actives will differ between differently trained neural nets! If you 'retrain' each month, so that the identity of the training set actives and decoys changes, then the precise LASSO score assigned to a known active compound can change, because the 'similarity' learned by the LASSO-NN (LASSO neural net) has changed.
Bottom Line: The LASSO score is a relative (not an absolute) gauge of the ISPT surface vector similarity of your lead-database compounds to known actives, but the metrics discussed above indicate the score ranges likely to correspond to 'actives' as well as the 'error/uncertainty' of the training.
Q-3: Since the LASSO score gauges 'similarity' of the LASSO descriptor vectors, are there any ways I may validate my current trained LASSO neural net to gauge its reliability in extracting novel ligands from my databases that have an 'enriched' probability of being active?
A: Recovering ‘known’ actives seeded in your test database with high scores instills confidence that the NN retrieves ‘actives’ a significant part of the time, which is what we hope for in screening for the sake of enrichment. So if you edit the database you will be screening and insert some known actives (with 'labels', e.g. ACTIVE-1, ACTIVE-2, etc.), you can see how these 'actives' are 'perceived' in terms of LASSO scores.
Note: the actives that you seed into your database should not be among the actives you used for training the LASSO neural net. They must be a different set in order for their extraction with high scores to be a confidence measure.
One does not expect all of the actives you seeded in the test database to be in the top ranking. If, for example, 25-50% of the known ‘actives’ you seeded into your test database get a large LASSO score, this would still be a useful net with which to filter ligands in compound databases to discover new leads or to scaffold hop.
Say you have 20 known actives. Use one part of them in the training and hold out the other part to test the model.
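The hold-out test can be sketched as follows. The helper names, the 50/50 `train_fraction`, the 0.7 threshold, and the example scores are assumptions for illustration; LASSO itself is not called here, and the scores dictionary stands in for the output of a real run.

```python
import random


def split_actives(names, train_fraction=0.5, seed=0):
    """Randomly split known actives into a training part and a held-out part.

    train_fraction is an assumed value; choose it to suit your data.
    """
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def recovery_fraction(scores, seeded, threshold):
    """Fraction of seeded (held-out) actives scoring at or above threshold.

    scores: dict mapping compound name -> LASSO score from a screening run.
    """
    if not seeded:
        raise ValueError("no seeded actives given")
    hits = sum(1 for name in seeded if scores[name] >= threshold)
    return hits / len(seeded)


actives = ["ACTIVE-%d" % i for i in range(1, 21)]  # 20 known actives
train, held_out = split_actives(actives)

# Hypothetical scores for two seeded actives and one database compound.
example_scores = {"ACTIVE-1": 0.9, "ACTIVE-2": 0.05, "ZINC-X": 0.5}
fraction = recovery_fraction(example_scores, ["ACTIVE-1", "ACTIVE-2"], 0.7)
```

Train on `train`, seed `held_out` into the test database, and compare the resulting recovery fraction against the 25-50% guideline given above.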
Q-4: When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, some of the ‘known actives’ I seed are given moderately high scores, e.g. >0.7, and some have LASSO scores <0.1. Does this mean the training is insufficient?
A: No, this is OK. We had, in fact, a concrete example during our working session: the output of the test problem provided gives the known DHFR actives ‘scores’ in two ranges (I have color coded these, with BLUE being the seeded actives found at high LASSO scores and RED the seeded actives found at low LASSO scores):
4; 0.90912; DHFR_ACTIVE_03814902
1487; 0.83384; ZINC00580098
13; 0.83384; DHFR_ACTIVE_03814911
12; 0.83384; DHFR_ACTIVE_03814910
11; 0.83384; DHFR_ACTIVE_03814909
10; 0.83384; DHFR_ACTIVE_03814908
14; 0.83372; DHFR_ACTIVE_03814912
5; 0.82617; DHFR_ACTIVE_03814903
0; 0.82509; DHFR_ACTIVE_03814896
...
18; 0.03714; DHFR_ACTIVE_03814916
16; 0.03714; DHFR_ACTIVE_03814914
889; 0.01856; ZINC00333604
264; 0.01463; ZINC00059762
849; 0.01359; ZINC00298998
1375; 0.01254; ZINC00536826
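A listing like the one above can be split into the two score ranges programmatically. This sketch assumes each record has three semicolon-separated fields (a number, the score, and the compound name) and that seeded actives are recognizable by the `DHFR_ACTIVE_` prefix; the meaning of the first field is not stated in the text, so it is simply called `index` here. The thresholds 0.7 and 0.1 are the ones quoted in the question.

```python
LISTING = """\
4; 0.90912; DHFR_ACTIVE_03814902
1487; 0.83384; ZINC00580098
13; 0.83384; DHFR_ACTIVE_03814911
12; 0.83384; DHFR_ACTIVE_03814910
11; 0.83384; DHFR_ACTIVE_03814909
10; 0.83384; DHFR_ACTIVE_03814908
14; 0.83372; DHFR_ACTIVE_03814912
5; 0.82617; DHFR_ACTIVE_03814903
0; 0.82509; DHFR_ACTIVE_03814896
...
18; 0.03714; DHFR_ACTIVE_03814916
16; 0.03714; DHFR_ACTIVE_03814914
889; 0.01856; ZINC00333604
264; 0.01463; ZINC00059762
849; 0.01359; ZINC00298998
1375; 0.01254; ZINC00536826
"""


def parse_listing(text):
    """Parse 'index; score; name' records, skipping lines such as '...'."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            continue  # not a record (e.g. the elision marker)
        index, score, name = parts
        rows.append((int(index), float(score), name))
    return rows


def split_seeded_actives(rows, prefix="DHFR_ACTIVE_", high=0.7, low=0.1):
    """Return (high-scoring, low-scoring) names of the seeded actives."""
    seeded = [(score, name) for _, score, name in rows if name.startswith(prefix)]
    high_names = [name for score, name in seeded if score > high]
    low_names = [name for score, name in seeded if score < low]
    return high_names, low_names


rows = parse_listing(LISTING)
high_actives, low_actives = split_seeded_actives(rows)
```

Counting the two lists gives the recovery picture at a glance: how many seeded actives the net retrieves at high scores and how many it misses.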
Q-5: When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, even the ‘known actives’ I seed are given small LASSO scores (i.e. close to zero). Does this mean that the LASSO training hasn’t adequately picked up the “facets” of the active ligands that are important for ‘binding’ and activation of my protein/receptor target?
A: YES. Getting all low scores for known active compounds that you seeded into your test database does raise flags (i.e. that is a problem). Let's discuss why, and the solution to this problem.
LASSO training uses only half of the given "training" set for training (let's call this set A1); the other half (let's call it A2) is used for internal testing (validation). Therefore, the molecules that actually participated in training (A1) are SUPPOSED to get high scores. If they do not, then indeed there is a problem. If the other half (A2) gets low scores, i.e. lower than most of the decoys, that is also not good: it indicates that A1 and A2 do not have enough similarity for the NN to recognize. Similarly, if you test an active set A3 against another decoy set D3 and do not get a good ranking, that also indicates that there wasn't enough similarity between A1 and A3 for the NN to work.
Of course, with any selection you may end up with some actives that are very different from the ones in the training set and therefore get a low score, i.e. are missed by LASSO. The only defense against such cases is to include in the training set representatives of all kinds of actives (based on ISPT vector similarity).
If one has enough actives and wants to select a very good training set, it is not always the best idea to throw them all at the NN for training. Instead, one should select a fairly diverse, representative set. How can that be done? We have recently written a little awk script (attached, called rmsd-min.awk) to compute the RMSD between ISPT descriptor vectors. You pass it 2 files containing ASCII descriptors, the first for actives and the second for decoys, and it computes the minimum and average RMSD of each decoy from all actives. Repeated use of this script can help you select a diverse active set:
- Take a random molecule, put its count file (*.desc) into file a1.count, and put all other actives into d1.count.
- Run: awk -f rmsd_min.awk a1.count d1.count | sort -nk 25 | tail -1 >a2.count
- cat a2.count >>a1.count
- Repeat the previous two steps (the awk run and the cat) six more times. You now have an a1.count file with the 8 most diverse actives.
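The selection loop above can also be sketched in Python. This is a re-implementation of the idea, not a translation of the awk script: it assumes the quantity being maximized at each step is the minimum RMSD to the actives already selected (the text does not say which column the sort key refers to), and it removes each chosen molecule from the candidate pool, which the steps above do not mention. The function names and toy vectors are hypothetical.

```python
import math


def rmsd(u, v):
    """Root-mean-square deviation between two equal-length descriptor vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))


def select_diverse(vectors, n_select, start=0):
    """Greedy max-min selection of a diverse subset.

    vectors: dict mapping molecule name -> list of descriptor counts.
    start: position of the seed molecule (the 'random molecule' of step 1).
    """
    names = list(vectors)
    selected = [names[start]]
    remaining = [n for n in names if n != names[start]]
    while remaining and len(selected) < n_select:
        # Pick the candidate whose nearest already-selected molecule is farthest.
        best = max(
            remaining,
            key=lambda n: min(rmsd(vectors[n], vectors[s]) for s in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-dimensional vectors for illustration; real ISPT vectors have 23 entries.
toy = {"a": [0, 0], "b": [1, 0], "c": [10, 0], "d": [5, 0]}
picked = select_diverse(toy, 3)
```

Calling `select_diverse(vectors, 8)` on the full set of actives mirrors the one-plus-seven selections of the procedure above.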
If you train with these diverse actives and a well-selected decoy set (one that does not contain accidental actives!), you will likely get a very good net file that is capable of differentiating actives from decoys and of scaffold hopping too.
Q-6: Given LASSO scores where some of the seeded actives receive high LASSO scores and some receive lower values, should I prioritize for synthesis/testing those compounds near the higher-scoring 'known actives'?
A: YES. Ignore the compounds that have low scores next to the low-scoring known actives. Neighborhood in the rank order does not carry any similarity meaning. For example, in the LASSO test case printed above, the low-scoring actives (e.g. DHFR_ACTIVE_03814907 and DHFR_ACTIVE_03814905) were NOT similar to the training set actives. The ZINC entries near them in the list were also NOT similar to the training actives, but that does not mean the ZINC entries and those actives are similar to each other. Let's use a metaphor to explain it better: say we are looking for cities that are close to New York. We get a low score for Los Angeles (because it is far) and we also get a low score for London, because that is also far from New York. Does that mean LA and London are close to each other? Of course not. And cities are laid out in a 2D space (distance is measured on the surface of a sphere, not in 3D), while the descriptor space is 23-dimensional, so two points at the same distance from a reference have even less chance of being close to each other.
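The same point can be made numerically. In this sketch the three 23-dimensional vectors are invented for illustration and plain Euclidean distance stands in for whatever similarity the trained net has learned.

```python
import math


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


reference = [2] * 23      # stands in for a training active ("New York")
far_a = list(reference)
far_a[0] += 10            # differs from the reference in feature 0 only
far_b = list(reference)
far_b[1] += 10            # differs from the reference in feature 1 only

d_ref_a = distance(reference, far_a)
d_ref_b = distance(reference, far_b)
d_a_b = distance(far_a, far_b)
```

Both `far_a` and `far_b` are equally far from the reference, yet they are farther from each other than either is from the reference, so sharing a low score says nothing about their mutual similarity.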