
CLiDE:
Chemical Literature Data Extraction


List of Abstracts

  1. Chemical Structure Recognition and Generic Text Interpretation in the CLiDE project
    P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson

    Proceedings on Online Information 92, 1992, London, England

    Abstract: Chemical information, especially that concerning chemical reactions, is becoming increasingly available in a variety of computer-readable databases. However, the creation of these databases is a time-consuming and expensive process. CLiDE (Chemical Literature Data Extraction) is a new software project to help solve the problem of building substance and reaction databases. CLiDE uses a combination of imaging and artificial intelligence techniques to recognize a range of chemical diagrams and extract the information they contain. The steps necessary to transform a chemical structure drawing into a computer-readable output are detailed. The interpretation of the generic structures is discussed.

  2. Chemical Literature Data Extraction. Bond Crossing in Single and Multiple Structures
    F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson

    Proceedings of the 1992 Chemical Information Conference, 1992, Annecy, France

    Abstract: The procedure to convert a scanned image of a page of chemical structure diagrams (with accompanying text) into a set of connection tables is one of the primary aims of the CLiDE project. These connection tables can be used in a variety of computer-based applications such as building and maintaining databases. The image is decomposed into component graphics and text which are further analysed to find the lines, wedges, and chemical text strings. In an interpretation phase the connection tables for the molecules are built from these items. The correct interpretation of chemical bonding in the image is often hampered by the constraints of representing a three-dimensional molecule in two dimensions where one bond may be drawn over another. A method of identifying and successfully dealing with these situations is described. A related situation where a bond is drawn crossing a ring implying an undetermined point of attachment is also solved. Examples are presented to illustrate these situations, and the rules implemented to handle these structures within the CLiDE program are discussed.

  3. Chemical Literature Data Extraction: The CLiDE Project
    P. Ibison, M. Jacquot, F. Kam, A. G. Neville, R.W. Simpson, C. Tonnelier, T. Venczel and A.P. Johnson

    Journal of Chemical Information and Computer Sciences, vol. 33, no. 3, pp: 338-344, 1993

    Abstract: Chemical information, especially that concerning chemical reactions, is becoming increasingly available in a variety of computer-readable databases. However, the creation of these databases is a time-consuming and expensive process. CLiDE (Chemical Literature Data Extraction) is a new software project to help solve the problem of building substance and reaction databases. CLiDE uses a combination of imaging and artificial intelligence techniques to recognize a range of chemical diagrams and extract the information they contain. The steps necessary to transform a chemical structure drawing into a computer-readable output are detailed. Several examples are given to illustrate the scope of the current work.

  4. (Chem)DeTeX Automatic Generation of a Markup Language Description of (Chemical) Documents from Bitmap Images
    Aniko Simon, Jean-Christophe Pret and A. Peter Johnson

    Proc. of the Third International Conference on Document Analysis and Recognition (ICDAR'95)
    vol. I, pp: 458-462, 1995, Montreal, Canada

    Abstract: This paper presents a novel view of document processing, as being the reverse process to TeX. This concept simplifies the analysis of the physical structure of documents, and also suggests the use of a style file for layout recognition. An algorithm is given for both phases, layout analysis and layout recognition. The bottom-up layout analysis method employed is based on Kruskal's algorithm and uses the distances between the components to construct the physical page structure. The algorithm is linear with respect to the number of connected components. For layout recognition, a document style description language (DSDL) is introduced. This helps a fault-tolerant, recursive parsing algorithm to label the blocks of the document. The presented methods were designed to be used for scientific publications (papers, reports, books), but could be applied to a broader range of documents.

  5. Recent Advances in the CLiDE Project: Logical Layout Analysis of Chemical Documents
    Aniko Simon and A. Peter Johnson

    Journal of Chemical Information and Computer Sciences, vol. 37, no. 1, pp: 109-116, 1997

    Abstract: The CLiDE system for chemistry document image processing consists of three major steps: physical layout analysis, recognition of the primitives, and logical layout analysis. This paper presents the new methods for logical layout analysis: role assignment to the elements of the document with the use of a style description language. The results are illustrated by application to generic reaction interpretation.

  6. A Fast Algorithm for Bottom-Up Document Layout Analysis
    Aniko Simon, Jean-Christophe Pret and A. Peter Johnson

    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp: 273-277, 1997

    Abstract: This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system, but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance-metric between the components to construct the physical page structure. The method has all the major advantages of the bottom-up systems: independence from different text spacing and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path-compression.

  7. CLiDE Pro: the latest generation of CLiDE, a tool for optical chemical structure recognition.
    Aniko T. Valko and A. Peter Johnson

    J Chem Inf Model. 2009 Apr;49(4):780-787

    Abstract: We present CLiDE Pro, the latest version of the output of the long-term CLiDE project for the development of tools for automatic extraction of chemical information from the literature. CLiDE Pro is concerned with the extraction of chemical structure and generic structure information from electronic images of chemical molecules available online as well as pages of scanned chemical documents. The information is extracted in three phases: first the image is segmented into text and graphical regions, then graphical regions are analyzed and where possible the connection tables are reconstructed, and finally any generic structures are interpreted by matching R-groups found in structure diagrams with the ones located in the text. The program has been tested on a large set of images of chemical structures originating from various sources. The results demonstrate good performance in the reconstruction of connection tables with few errors in the interpretation of the individual drawing features found in the structure diagrams. This full test set is presented for use in the validation of other similar systems.




Copyright © 2011 SimBioSys Inc., All rights reserved.