LASSO:
Ligand Activity in Surface Similarity Order

Frequently Asked Questions


  1. What does the ‘rank’/LASSO-score output mean in a LASSO run?
  2. Since the LASSO score gives me only a 'relative' similarity of the compounds in my chemical databases to my known actives, are there any ‘confidence’ measures of the adequacy of training, or ‘metrics’ such as a ‘standard error’ that measure the reliability of the neural network training?
  3. Since the LASSO score gauges the 'similarity' of the LASSO descriptor vectors, is there a way to validate my trained LASSO neural net, to gauge how reliably it extracts novel ligands with an 'enriched' probability of being active from my databases?
  4. When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, some of the ‘known actives’ I seed get moderately high scores (e.g. >0.7) while others get LASSO scores <0.1. Does this mean the training is insufficient?
  5. When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, even the ‘known actives’ I seed get small LASSO scores (i.e. close to zero). Does this mean that the LASSO training has not adequately picked up the “facets” of the active ligands that are important for ‘binding’ and activation of my protein/receptor target?
  6. Given that some of the seeded actives receive high LASSO scores and some receive lower ones, should I prioritize for synthesis/testing the compounds ranked near the higher-scoring 'known actives'?

Q-1: What does the ‘rank’/LASSO-score output mean in a LASSO run?

A: As indicated by the diagram below, the LASSO score measures how similar a ligand's features are to the features that the neural network ‘training’ found to be common among the “active” ligands of the training set.

For each ligand, LASSO first determines the types of its surface points, which fall into 23 distinct chemical categories, and forms a ‘vector’ holding the number of surface points of each type.

The 23 inputs from each ligand are fed into the 5 nodes of the feed-forward hidden layer of the neural network, each connection carrying a ‘weight’ that is set during LASSO training. The hidden layer feeds a single output, which is ‘normalized’ to lie in the range [0,1] during the ‘filter’ portion of a LASSO run. The larger the LASSO score of a ligand from a test database, the more similar its ‘feature vector’ is to those of the ‘actives’ (or the more dissimilar to those of the decoys) used in training.
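The data flow described above (23 counts in, 5 hidden nodes, one output in [0,1]) can be illustrated with a small Python sketch. This is not the LASSO implementation: the logistic activation, the bias terms, the function name `lasso_like_score` and the random weights are all assumptions made only so the example runs.

```python
import math
import random

N_INPUTS = 23   # one count per surface point type
N_HIDDEN = 5    # nodes in the hidden layer

def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def lasso_like_score(counts, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 23-5-1 feed-forward net (illustrative only)."""
    if len(counts) != N_INPUTS:
        raise ValueError("expected 23 surface point counts")
    # Each hidden node sees all 23 inputs, each with its own weight.
    hidden = [
        sigmoid(sum(w * c for w, c in zip(weights, counts)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # The single output is squashed into (0, 1) by the same function.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Hypothetical, untrained weights; real weights come from LASSO training.
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-1.0, 1.0) for _ in range(N_HIDDEN)]
b_out = 0.0

# Made-up surface point counts for one ligand (23 values).
counts = [3, 0, 1, 5, 2, 0, 0, 4, 1, 0, 2, 6,
          0, 1, 0, 3, 2, 0, 1, 0, 0, 2, 1]
score = lasso_like_score(counts, w_hidden, b_hidden, w_out, b_out)
print(f"score = {score:.3f}")
```

With untrained weights the printed score carries no meaning; the point is only that any 23-element count vector is mapped to a single number between 0 and 1.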

Q-2: Since the LASSO score gives me only a 'relative' similarity of the compounds in my chemical databases to my known actives, are there any ‘confidence’ measures of the adequacy of training, or ‘metrics’ such as a ‘standard error’ that measure the reliability of the neural network training?

A:  There are two, the SSE and MSE values:

  1. The MSE and SSE values printed to the screen during the LASSO neural net training step indicate how good the training was (that is one of the reasons they are printed), and so how the ‘active’ range of the LASSO score should be viewed.
  2. SSE is the "sum of squared errors" and MSE is the SSE per pattern (i.e., the "mean of squared errors"). For example, for the vector of surface point counts of an active molecule the expected output is 1, so if during training the neural net outputs only 0.9, this pattern adds (0.1)^2 to the SSE.
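The two definitions above amount to a few lines of arithmetic. The Python sketch below computes SSE and MSE for a handful of made-up network outputs; the function name `sse_mse` and the sample numbers are illustrative assumptions, not values produced by LASSO.

```python
import math

def sse_mse(outputs, targets):
    """Return (SSE, MSE) for network outputs against expected outputs."""
    squared_errors = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    sse = sum(squared_errors)            # sum of squared errors
    mse = sse / len(squared_errors)      # SSE per pattern
    return sse, mse

# Made-up example: two actives (expected output 1) and two decoys (expected 0).
outputs = [0.9, 0.8, 0.2, 0.1]
targets = [1.0, 1.0, 0.0, 0.0]

sse, mse = sse_mse(outputs, targets)
rms_error = math.sqrt(mse)   # typical size of the error per pattern
print(f"SSE = {sse:.3f}, MSE = {mse:.3f}, RMS error = {rms_error:.3f}")
```

The square root of the MSE is a convenient way to read the value on the same scale as the score itself.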

To make this concrete: an MSE value of, say, 0.05 means that on average the squared error for each pattern is 0.05, so the typical error per pattern is about sqrt(0.05), or roughly 0.22. In other words, decoys will be assigned a value of around 0.25 on average and actives a value of around 0.75 (in an active/decoy training scenario). This gives you an additional guideline for how to view the LASSO score relative to the 'particular' neural net training you will then use for screening. This is important because the LASSO score range corresponding to 'probable' actives differs between differently trained neural nets!

So if you 'retrain' each month, such that the identity of the training set actives and decoys changes, then the precise LASSO score assigned to a known active compound can change, because the 'similarity' that the LASSO-NN (LASSO neural net) has 'learned' is different.

Bottom Line: The LASSO score is a relative (and not an absolute) gauge of the ISPT surface vector similarity of your lead database compounds to known actives, but the metrics discussed above indicate both the score ranges likely to correspond to 'actives' and the 'error/uncertainty' of the training.

Q-3: Since the LASSO score gauges the 'similarity' of the LASSO descriptor vectors, is there a way to validate my trained LASSO neural net, to gauge how reliably it extracts novel ligands with an 'enriched' probability of being active from my databases?

A: Recovering ‘known’ actives seeded in your test database with high scores gives confidence that the NN will retrieve ‘actives’ a significant part of the time, which is what we hope for in screening for the sake of enrichment. So if you edit the database you will be screening and insert some known actives (with 'labels', e.g. ACTIVE-1, ACTIVE-2, etc.), you can see how these 'actives' are 'perceived' in terms of LASSO scores.

Note: the actives that you seed into your database should not be among the actives that you used for training the LASSO neural net. They must be a different set for their extraction with high scores to be a confidence measure.

One does not expect all of the actives you seeded in the test database to rank at the top. If, for example, 25-50% of the known ‘actives’ you seeded into your test database get a large LASSO score, this would still be a useful net with which to filter ligands in compound databases to discover new leads or to scaffold hop.

Say you have 20 known actives. Use half of them in the training and use the other half to test the model.
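Splitting the known actives into a training part and a held-out test part can be sketched in Python as follows. The function name `split_actives`, the `train_fraction` and `seed` parameters and the ACTIVE-n labels are assumptions for illustration; LASSO itself does not provide this helper.

```python
import random

def split_actives(actives, train_fraction=0.5, seed=0):
    """Randomly split known actives into (training set, held-out test set)."""
    shuffled = list(actives)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Twenty hypothetical labels standing in for 20 known actives.
actives = [f"ACTIVE-{i}" for i in range(1, 21)]
train_set, test_set = split_actives(actives)

print("train:", sorted(train_set))
print("test: ", sorted(test_set))
```

Only the training part would be given to LASSO training; the test part is what gets seeded into the screening database, so that none of the seeded actives was seen during training.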

Q-4: When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, some of the ‘known actives’ I seed get moderately high scores (e.g. >0.7) while others get LASSO scores <0.1. Does this mean the training is insufficient?

A: No, this is OK. We had, in fact, a concrete example during our working session. The output of the test problem provided scores the known DHFR actives (the DHFR_ACTIVE entries) in two ranges, with some of the seeded actives found at high LASSO scores and others found at low LASSO scores:

4; 0.90912; DHFR_ACTIVE_03814902
1487; 0.83384; ZINC00580098
13; 0.83384; DHFR_ACTIVE_03814911
12; 0.83384; DHFR_ACTIVE_03814910
11; 0.83384; DHFR_ACTIVE_03814909
10; 0.83384; DHFR_ACTIVE_03814908
14; 0.83372; DHFR_ACTIVE_03814912
5; 0.82617; DHFR_ACTIVE_03814903
0; 0.82509; DHFR_ACTIVE_03814896
...
18; 0.03714; DHFR_ACTIVE_03814916
16; 0.03714; DHFR_ACTIVE_03814914
889; 0.01856; ZINC00333604
264; 0.01463; ZINC00059762
849; 0.01359; ZINC00298998
1375; 0.01254; ZINC00536826
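A listing like the one above can be checked programmatically. The Python sketch below assumes the three fields are separated by semicolons as shown, and reads the first field as an integer identifier without interpreting it further. The function names and the `DHFR_ACTIVE_` prefix default are choices made for this illustration; the thresholds 0.7 and 0.1 are the ones mentioned in the question.

```python
def parse_listing(text):
    """Parse 'id; score; name' lines into (id, score, name) tuples."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(";")]
        if len(parts) != 3:
            continue  # skip blank lines and the "..." elision
        ident, score, name = parts
        rows.append((int(ident), float(score), name))
    return rows

def seeded_active_scores(rows, prefix="DHFR_ACTIVE_"):
    """Map the name of each seeded active to its score."""
    return {name: score for _, score, name in rows if name.startswith(prefix)}

# A few lines copied from the listing above.
listing = """\
4; 0.90912; DHFR_ACTIVE_03814902
1487; 0.83384; ZINC00580098
...
18; 0.03714; DHFR_ACTIVE_03814916
889; 0.01856; ZINC00333604
"""

scores = seeded_active_scores(parse_listing(listing))
high = [name for name, s in scores.items() if s > 0.7]
low = [name for name, s in scores.items() if s < 0.1]
print("high-scoring seeded actives:", high)
print("low-scoring seeded actives: ", low)
```

Dividing the number of high-scoring seeded actives by the total number seeded gives the recovery fraction discussed under Q-3.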

Q-5:  When I seed my ‘unknown’ database with ‘actives’ as a test of the LASSO training and filtering, even the ‘known actives’ I seed get small LASSO scores (i.e. close to zero). Does this mean that the LASSO training has not adequately picked up the “facets” of the active ligands that are important for ‘binding’ and activation of my protein/receptor target?

A: YES. Getting low scores for all of the known active compounds that you seeded into your test database does raise flags (i.e. it is a problem). Let's discuss why, and the solution to this problem.

LASSO training uses only half of the given "training" set for training (let's call this set A1); the other half (let's call that A2) is used for internal testing (validation). Therefore, those molecules that actually participated in training (A1) are SUPPOSED to get high scores. If they do not, then there is indeed a problem. If the other half (A2) gets low scores, i.e. lower than most of the decoys, that is also not good: it indicates that A1 and A2 do not share enough similarity for the NN to recognize. Similarly, if you test an A3 active set against another D3 decoy set and do not get a good ranking, that also indicates that there was not enough similarity between A1 and A3 for the NN to work.

Of course, with any selection you may end up with some actives that are very different from the ones in the training set and that will therefore get a low score, i.e. will be missed by LASSO. The only defense against such cases is to include in the training set representatives of all kinds of actives (based on ISPT vector similarity).

If one has enough actives and wants to select a very good training set, it is not always the best idea to throw them all at the NN for training. Instead, one should select a fairly diverse, representative set. How can that be done? We have recently written a little awk script (attached, called rmsd_min.awk) that computes the RMSD between ISPT descriptor vectors. You pass it 2 files containing ASCII descriptors, the first for actives and the second for decoys, and it computes the minimum and average RMSD of each decoy from all actives. Repeated use of this script can help you select a diverse active set:

  1. Take a random molecule and put its count file (*.desc) into the file a1.count; put all other actives into d1.count.
  2. Run: awk -f rmsd_min.awk a1.count d1.count | sort -nk 25 | tail -1 >a2.count
  3. Run: cat a2.count >>a1.count
  4. Repeat steps 2 and 3 six more times. You now have an a1.count file with the 8 most diverse actives.
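The greedy idea behind these steps can also be sketched in Python. This is not the attached awk script and does not read .count files (their exact layout is not given here). It works on in-memory vectors and assumes the selection criterion is the minimum RMSD to the already selected set, so that each round picks the candidate farthest from everything chosen so far. The names `rmsd` and `select_diverse` are hypothetical.

```python
import math

def rmsd(u, v):
    """Root-mean-square deviation between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / len(u))

def select_diverse(vectors, n_select, start=0):
    """Greedy selection: repeatedly add the vector whose minimum RMSD
    to the already selected vectors is largest."""
    selected = [start]
    remaining = [i for i in range(len(vectors)) if i != start]
    while remaining and len(selected) < n_select:
        best = max(
            remaining,
            key=lambda i: min(rmsd(vectors[i], vectors[j]) for j in selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 3-element vectors; real ISPT descriptor vectors would have 23 counts.
vectors = [
    [0, 0, 0],
    [1, 0, 0],
    [12, 0, 0],
    [0, 10, 0],
    [1, 1, 0],
]
picked = select_diverse(vectors, n_select=3)
print("selected indices:", picked)   # prints: selected indices: [0, 2, 3]
```

Starting from index 0, the sketch first picks the vector farthest from it, then the one farthest from both, which skips the near-duplicates at indices 1 and 4.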

If you train with them and a well selected decoy set (that does not contain accidental actives!) then you will likely get a very good net file that is capable of differentiating actives from decoys and scaffold hopping too.

Q-6:  Given that some of the seeded actives receive high LASSO scores and some receive lower ones, should I prioritize for synthesis/testing the compounds ranked near the higher-scoring 'known actives'?

A: YES. Ignore those that have low scores, next to the low-scoring known actives. Neighborhood in the rank order does not carry any similarity meaning. For example, in the LASSO test case printed above, those low-scoring actives (e.g. DHFR_ACTIVE_03814907 and DHFR_ACTIVE_03814905) were NOT similar to the training set actives. The ZINC entries near them in the list were also NOT similar to the training actives, but that does not mean that they are similar to each other. Let's use a metaphor to explain it better. Say we are looking for cities that are close to New York. We get a low score for Los Angeles (because it is far away), and we also get a low score for London, because that is also far from New York. Does that mean LA and London are close to each other? Of course not. And the cities are laid out in a 2D space (distance is measured on the surface of a sphere, not in 3D), while the descriptor space is 23-dimensional, so two points at the same distance from a reference have even less chance of being close to each other there.
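The city metaphor can be checked numerically in the 23-dimensional descriptor space. In the Python sketch below all vectors are made up for illustration: three points lie at exactly the same distance from a reference vector, yet they are not close to one another.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

DIM = 23
reference = [5.0] * DIM      # stands in for a training-set active

a = list(reference)
a[0] = 10.0                  # differs from the reference in one feature, upward
b = list(reference)
b[0] = 0.0                   # differs in the same feature, downward
c = list(reference)
c[1] = 10.0                  # differs in a different feature

d_ref_a = distance(reference, a)
d_ref_b = distance(reference, b)
d_ref_c = distance(reference, c)
d_a_b = distance(a, b)
d_a_c = distance(a, c)

print(d_ref_a, d_ref_b, d_ref_c)   # all three are equally far from the reference
print(d_a_b, d_a_c)                # but they are even farther from each other
```

Equal distance from the training actives (and hence a similar low score) says nothing about how close two low-scoring compounds are to each other.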




Copyright © 2008 SimBioSys Inc. All rights reserved.