Autism spectrum disorder detection from semi-structured and unstructured medical data
© The Author(s) 2017
Received: 25 May 2016
Accepted: 6 January 2017
Published: 1 February 2017
Autism spectrum disorder (ASD) is a developmental disorder that significantly impairs patients’ ability to perform normal social interaction and communication. Moreover, the diagnostic procedure for ASD is highly time-consuming and labor-intensive, and requires extensive expertise. Although there is no known cure for ASD, there is consensus among clinicians on the importance of early intervention for the recovery of ASD patients. Therefore, to benefit autism patients by enhancing their access to treatments such as early intervention, we aim to develop a robust machine learning-based system for autism detection, using Natural Language Processing techniques on information extracted from the medical forms of potential ASD patients. Our detection framework involves converting semi-structured and unstructured medical forms into digital format, preprocessing, learning document representations, and finally, classification. Testing results are evaluated against the ground truth set by expert clinicians, and the proposed system achieves 83.4% accuracy and 91.1% recall, which is very promising. The proposed ASD detection framework could significantly simplify and shorten the procedure of ASD diagnosis.
Keywords: Autism spectrum disorder; Distributed representation; Medical forms; Classification
Autism spectrum disorder (ASD) is a general classification for a broad range of disorders with a variety of issues stemming from complications with neurological development. Symptoms of ASD vary in severity and involve difficulties with verbal and nonverbal communication, repetitive behaviors, and typical social interaction. The defining features of ASD are deficits in reciprocal social communication and frequent or intense repetitive or restrictive behaviors. ASD has a prenatal or early childhood onset and a chronic course. Although previously considered rare, ASD is now estimated to occur in approximately 1 in 68 individuals, a threefold increase in reported prevalence over 10 years. No conclusions have been reached on whether the rising prevalence estimates reflect an actual increase in prevalence or are just an artifact of changes in screening and diagnosis. Studies also show that ASD is about four times more common among boys than girls.
Currently, no laboratory test for ASD exists, and the process of diagnosing the disorder is highly complex and labor-intensive, requiring extensive expertise. Clinicians diagnose ASD based on a variety of factors including a review of medical records, medical and neurological examinations, standardized developmental tests, and behavioral assessments, such as the Autism Diagnostic Observation Schedule. Because of the resources and skill required to assemble and integrate this information, few centers offer ASD diagnostic evaluations, and these centers have lengthy waiting lists, ranging from 2–12 months for an initial appointment. Waiting is not only stressful for children with ASD and their families, but it delays their access to early intervention services, which have been shown to improve outcomes dramatically in many cases. Furthermore, symptoms of ASD are similar to, and easily confused with, those of other mental illnesses, such as depression, whose treatment procedures are very different. To simplify the diagnostic process and shorten the waiting time, a computerized method for detecting ASD that requires little or no expert supervision would be a major advance over current practice.
Our main contributions are as follows:
- We propose a robust machine learning approach to a challenging problem that involves mining semi-structured and unstructured medical data in hand-written format.
- We convert semi-structured and unstructured medical forms into de-identified text content in a ready-to-use format; the same conversion procedure can be used to extract information from confidential hand-written forms at large scale.
- We apply different word embedding models, including state-of-the-art distributed representations, and establish a promising baseline for automated ASD detection on such a dataset.
2 Related work
We are in an era of exploring data from all domains, such as multimedia data from social networks and forms and videos from the biomedical domain, and of taking advantage of such data to benefit human lives. For example, researchers have used social multimedia data to monitor mental health conditions or emotional status. Other researchers have successfully recognized human sentiments based on recorded voice. Yuan et al. analyzed users’ sentiment changes over time based on massive social multimedia data, including texts and images from Twitter, and found a strong correlation between sentiments expressed in textual and visual content. Zhou et al. integrated unobtrusive multimodal sensing such as head movement, breathing rate, and heart rate for mental health monitoring.
Much research has focused on medical applications and has involved machine learning techniques. Compared with traditional biomedical diagnostic procedures, which are usually time-consuming, labor-intensive, and limited to a small scope, the adoption of machine learning techniques in practical medical applications has advantages in terms of efficiency, scalability, and reliability. For example, Devarakonda and Tsou developed a machine learning framework to automatically generate an open-ended medical problem list for a patient using lexical and medical features extracted from the patient’s Electronic Medical Records. Hernandez et al. explored the feasibility of monitoring users’ physiological signals using Google Glass and showed promising results. For diagnosing ASD, the most relevant data are observations of the child’s social communication and repetitive behavior. To obtain these data, we focus our research only on previously acquired records of potential patients, as these records contain comments about children’s behavior. The most similar work to ours is that of Liu et al., who analyzed digital early intervention records to detect ASD based on bigram and unigram features. Another research perspective on automated ASD assessment is to extract patterns from deficits in semantic and pragmatic expression [11, 12].
Another family of related work is on learning representations of texts, which embed words or documents into a relatively low-dimensional vector space of real numbers. Lexical features include Bag-of-Words (BoW), n-grams (typically bigrams and trigrams), and term frequency-inverse document frequency (tf-idf). Topic models such as Latent Dirichlet Allocation (LDA) are also used as features in document classification problems, and research shows that topic models outperform lexical features in some cases, such as sentiment analysis [14, 15]. Recent word embedding algorithms are driven by the development of deep learning techniques. Distributed representations obtained from neural language models [16, 17], such as the skip-gram model with subsampling of frequent words, achieve a significant speedup in training and more accurate representations of less frequent words.
3 ASD detection framework
The biggest challenges in applying machine learning algorithms to medical studies are limited data scale, data labeling, and domain knowledge. Patients’ and non-patients’ data are more difficult to obtain than social media data because fewer public biomedical data resources exist. For example, one video for medical use would require hours of recording and the participation of a doctor with special expertise in the area. These data are also kept strictly confidential unless patients expressly authorize release. In contrast, we can easily crawl thousands of tweets from Twitter about a certain topic in one hour. Additionally, data labeling and result evaluation become further issues after data collection. Though crowd-sourcing services such as Amazon Mechanical Turk have been widely used for labeling in machine learning and computer vision tasks, they are not feasible in our case because we cannot reveal personal information to the crowd. We depend instead on reviews by expert clinicians for data labeling.
3.1 Data collection
We have collected semi-structured and unstructured medical forms of children who have been referred for an evaluation of possible ASD. We first scan all the medical forms into digital format (tif) and run them through preprocessing. In the next step, we use OCR software to recognize text content in the scanned documents. Handwriting recognition is a well-established problem, and we have experimented with different resources including OmniPage Capture SDK, Captricity, and ABBYY, which to the best of our knowledge are among the best tools on the market for recognizing hand-written letters and have been widely used for recognizing and transforming documents into usable digital forms. Even so, the results are not satisfactory in some cases. We therefore inspect and manually correct all the medical documents that have been processed through OCR, which makes data collection much more time-consuming and labor-intensive.
In this study, we have digitized forms for 199 patients, with 56 children diagnosed as actual ASD patients (positive samples) and 143 non-ASD patients (negative samples). The medical forms we analyzed include: referral form from primary care physician, parent and teacher questionnaire, preschool and early intervention questionnaire, and additional forms including phone intake by social workers. All the forms for each potential patient are concatenated together and treated as one document for the classification. Ground truth labels of patients (ASD or not ASD) are obtained from clinical reports.
3.2 Data preprocessing
Document scanning introduces a new problem: the scanned forms are sometimes skewed. Additionally, the scanned forms contain personal information such as names, phone numbers, addresses, and so on. Therefore, we apply preprocessing procedures including de-skewing and de-identification. Such preprocessing is necessary because OCR SDKs such as Captricity do not have an embedded de-skewing option, and their process involves recognizing documents slice by slice horizontally. Our preprocessing improves the generated results significantly in most cases. By applying preprocessing, OCR, and manual correction afterwards, we are able to reduce the time of data collection and conversion by about 80%.
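The de-skewing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binarized page image and searches a small range of angles for the rotation that best aligns text rows, scored by the variance of horizontal projection profiles (level text lines produce sharply peaked row sums). All function names are illustrative.

```python
# Hypothetical de-skewing sketch for scanned forms (illustrative only).
# Score each candidate angle by the variance of the horizontal projection
# profile; a level page has sharply peaked row sums.
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angles=np.arange(-5.0, 5.25, 0.25)):
    """Return the rotation angle (degrees) that best aligns text rows."""
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rotated = rotate(binary_img, a, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # ink per row
        score = profile.var()           # peaked profile => aligned lines
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle

def deskew(binary_img):
    """Rotate the page by the estimated skew angle."""
    return rotate(binary_img, estimate_skew(binary_img),
                  reshape=False, order=0)
```

A projection-profile search like this is a common, simple baseline; production OCR pipelines typically use more robust methods (e.g. Hough-transform line detection).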
4 Learning document representation and performing document classification
Learning good representations of documents to capture the semantics behind text content is central to a wide range of NLP tasks such as sentiment analysis and, as in our case, document classification.
4.1 Lexical features
Lexical features widely used in NLP tasks include the Bag-of-Words model, the n-gram model, and tf-idf. These features capture the occurrences of words or phrases and usually produce high-dimensional feature spaces of tens of thousands of dimensions, depending on the dataset.
Bag-of-Words and N-Gram Model: The Bag-of-Words (BoW) model is a common way to represent documents in matrix form. A sentence or a document is represented as a vector whose length equals the size of the dictionary, where each entry indicates the occurrence of the corresponding word in the input sentence or document. However, the BoW model captures neither the ordering nor the semantic meanings of words. The n-gram model is similar to the BoW model but extends it from a bag of single words to a bag of phrases, typically of two or three words, known as bigrams and trigrams. The n-gram model preserves local ordering of words and captures a better sense of semantics than the BoW model.
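The lexical features above can be sketched in a few lines. This uses scikit-learn as a stand-in for whatever tooling was actually used, and the two toy "forms" are invented placeholders, not real patient data.

```python
# Sketch of the lexical features described above (BoW, n-grams, tf-idf),
# using scikit-learn; the toy documents are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

forms = [
    "child avoids eye contact and repeats phrases",
    "child plays well and makes eye contact",
]

bow = CountVectorizer()                       # Bag-of-Words: raw term counts
X_bow = bow.fit_transform(forms)

ngrams = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams, trigrams
X_ng = ngrams.fit_transform(forms)

tfidf = TfidfVectorizer()                     # counts reweighted by idf
X_tfidf = tfidf.fit_transform(forms)

# n-grams greatly expand the vocabulary, which is why lexical features
# reach tens of thousands of dimensions on a realistic corpus
assert X_ng.shape[1] > X_bow.shape[1]
```

On a real corpus the n-gram vocabulary grows rapidly, which is the source of the high-dimensional feature spaces mentioned above.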
4.2 Latent Dirichlet Allocation (LDA)
Assuming that each document is a mixture of latent topics, LDA is a probabilistic model that learns P(t|w), the probability that word w belongs to a certain latent topic t in a topic set T (usually with a pre-defined number of topics). By normalizing each word vector from a sentence or a document based on the word-topic probabilities, we obtain the sentence or document vector over the topic distribution and thus embed the target document into a vector based on the LDA model. Compared with the lexical features mentioned above, the document representation learned by the LDA model indicates the distribution of topics given the input document; it has lower dimensionality and focuses more on the latent semantic meaning of the input texts.
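A hedged sketch of using LDA to embed documents as topic distributions, with scikit-learn assumed in place of the authors' toolchain; the corpus and the topic count are illustrative.

```python
# Sketch of LDA-based document vectors: each document is embedded as its
# distribution over latent topics. Toy corpus and n_components are invented.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "repetitive behavior restricted interests repetitive play",
    "social communication eye contact conversation",
    "speech delay language development words",
]

counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document

# each row is a probability distribution over the 2 latent topics,
# i.e. a low-dimensional document vector
assert np.allclose(doc_topics.sum(axis=1), 1.0)
```

Note how the representation is far lower-dimensional than the lexical features: its width is the number of topics, not the vocabulary size.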
4.3 Distributed Representation (Doc2vec)
Upsampling: Since our dataset is imbalanced, with more negative samples than positive ones, we upsample the positive samples before training. Our experimental results show an improvement over results without upsampling, which will be discussed later in Section 5. For each pair of positive samples, we compute their Euclidean distance, and then find the nearest positive neighbours of each positive sample. Artificial positive samples are generated at random points between each positive sample and its nearest positive neighbours.
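The upsampling step above (closely related to SMOTE) can be sketched in pure NumPy: synthetic positives are drawn at random points on the segment between each positive sample and its nearest positive neighbour. Function and parameter names here are illustrative.

```python
# Sketch of the interpolation-based upsampling described above.
# New positives lie on the segment between a positive sample and its
# nearest positive neighbour (a SMOTE-like scheme); names are illustrative.
import numpy as np

def upsample_positives(X_pos, n_new, rng=None):
    """Generate n_new synthetic positive samples by interpolation."""
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances between positive samples
    d = np.linalg.norm(X_pos[:, None, :] - X_pos[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude self-distance
    nearest = d.argmin(axis=1)           # nearest positive neighbour
    idx = rng.integers(0, len(X_pos), n_new)
    alpha = rng.random((n_new, 1))       # random point on the segment
    return X_pos[idx] + alpha * (X_pos[nearest[idx]] - X_pos[idx])
```

Because the synthetic points are convex combinations of real positives, they stay inside the region occupied by the positive class rather than drifting toward the negatives.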
5 Results and discussion
In this section, we demonstrate preprocessing results and evaluate our proposed ASD detection framework.
5.1 Preprocessing Results
5.2 Classification results
[Table: number of extracted lexical features]
[Table: classification results without upsampling, including all lexical features]
[Table: classification results with upsampling, including all lexical features]
As the results show, our proposed framework was tuned towards better performance on recall while maintaining decent precision and accuracy, because we do not want to miss any potential ASD patients. Comparison among lexical features shows that the combination of all lexical features yields the best performance. BoW and tf-idf features perform similarly, and n-gram features alone come close to the combination of all three. On the other hand, features extracted using the LDA model show some improvement in precision and recall over all combinations of lexical features, but the improvements are neither significant nor as good as those of doc2vec features. The reason could be that LDA emphasizes modeling topics from documents and is not guaranteed to generate robust document representations. According to our experimental results, distributed representations provide the best classification results, with the best performance obtained at 150 dimensions. Additionally, as demonstrated in Fig. 4, adding more dimensions for LDA and doc2vec yields little improvement in performance, if any. For the LDA model, we expect that not many topics can be extracted from the data given our data scale, and a larger number of topics will reduce the effectiveness of the LDA model and add noise to the learned document vectors. For the doc2vec model, a decent semantic embedding of the documents can be learned within low dimensions, and in our case adding more dimensions increases the risk of overfitting given the scale of our dataset.
[Table: top 10 selected features with the largest weights; e.g., “vocalizes vowel sounds”]
The reported prevalence of ASD has risen sharply over the past 25 years, and the diagnosis of ASD is highly time-consuming and labor-intensive. Our proposed ASD detection algorithm has demonstrated high promise for detecting ASD based on patients’ medical forms. Our method could significantly shorten the waiting time of the ASD diagnosis procedure and benefit patients by facilitating potential early intervention services, which have been proven to be very useful in many cases. Although the main focus of this paper is on ASD detection, the proposed NLP-based framework can potentially be extended to other health-related issues such as depression and anxiety. For future work, we are working on computerized generation of an index indicating the severity of ASD patients’ symptoms based on their medical data, so that it can be used to monitor their progress over time. Furthermore, changes in the index could potentially serve as an outcome measure in trials of different therapies.
We are grateful for the funding from the New York State through the Goergen Institute for Data Science, and the University Multidisciplinary Research Award.
JY contributed to the main algorithm design, experiments and results analysis. CH worked mainly on data collection, data cleaning and data correction. TS contributed to data collection, establishing ground truth and providing multi-discipline view from the biomedical perspective. JL instructed the algorithm, experiments and analysis from a data-driven machine learning perspective. All authors have contributed to the writing of this paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
We only publish de-identified information. We have IRB approval on record.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- American Psychiatric Association, Committee on Nomenclature and Statistics, Diagnostic and Statistical Manual of Mental Disorders, Third Edition (American Psychiatric Association, Washington, DC, 1980).
- M Wingate, RS Kirby, S Pettygrove, C Cunniff, E Schulz, T Ghosh, C Robinson, L-C Lee, R Landa, J Constantino, Prevalence of autism spectrum disorder among children aged 8 years – Autism and Developmental Disabilities Monitoring Network, 11 sites, United States, 2010. MMWR Surveill. Summ. 63(SS02), 1–21 (2014).
- CP Johnson, SM Myers, Identification and evaluation of children with autism spectrum disorders. Pediatrics 120(5), 1183–1215 (2007).
- B Reichow, Overview of meta-analyses on early intensive behavioral intervention for young children with autism spectrum disorders. J. Autism Dev. Disord. 42(4), 512–520 (2012).
- N Yang, R Muraleedharan, J Kohl, I Demirkol, W Heinzelman, M Sturge-Apple, in Proceedings of the 4th IEEE Workshop on Spoken Language Technology. Speech-based emotion classification using multiclass SVM with hybrid kernel and thresholding fusion (IEEE, Miami, 2012), pp. 455–460.
- J Yuan, Q You, J Luo, in Multimedia Data Mining and Analytics. Sentiment analysis using social multimedia (Springer, 2015), pp. 31–59.
- D Zhou, J Luo, V Silenzio, Y Zhou, J Hu, G Currier, H Kautz, in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. Tackling mental health by integrating unobtrusive multimodal sensing (Austin, 2015), pp. 1401–1408.
- M Devarakonda, C-H Tsou, in Proceedings of the 27th Conference on Innovative Applications of Artificial Intelligence (IAAI). Automated problem list generation from electronic medical records in IBM Watson (Austin, 2015), pp. 3942–3947.
- J Hernandez, Y Li, JM Rehg, RW Picard, in Wireless Mobile Communication and Healthcare (Mobihealth), 2014 EAI 4th International Conference On. BioGlass: physiological parameter estimation using a head-mounted wearable device (IEEE, Athens, 2014), pp. 55–58.
- M Liu, Y An, X Hu, D Langer, C Newschaffer, L Shea, in Bioinformatics and Biomedicine (BIBM), 2013 IEEE International Conference On. An evaluation of identification of suspected autism spectrum disorder (ASD) cases in early intervention (EI) records (IEEE, Shanghai, 2013), pp. 566–571.
- M Rouhizadeh, E Prud’hommeaux, B Roark, J Van Santen, in Proceedings of the Conference. Association for Computational Linguistics. North American Chapter. Meeting, 2013. Distributional semantic models for the evaluation of disordered language (NIH Public Access, Atlanta, 2013), p. 709.
- E Prud’hommeaux, E Morley, M Rouhizadeh, L Silverman, J van Santen, B Roark, R Sproat, S Kauper, R DeLaHunta, in Spoken Language Technology Workshop (SLT), 2014 IEEE. Computational analysis of trajectories of linguistic development in autism (IEEE, South Lake Tahoe, 2014), pp. 266–271.
- J Turian, L Ratinov, Y Bengio, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Word representations: a simple and general method for semi-supervised learning (Association for Computational Linguistics, Uppsala, 2010), pp. 384–394.
- DM Blei, AY Ng, MI Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
- AL Maas, RE Daly, PT Pham, D Huang, AY Ng, C Potts, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Learning word vectors for sentiment analysis (Association for Computational Linguistics, Portland, 2011), pp. 142–150.
- T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean, in Advances in Neural Information Processing Systems 26. Distributed representations of words and phrases and their compositionality (Neural Information Processing Systems (NIPS) Conference, 2013), pp. 3111–3119.
- QV Le, T Mikolov, in Proceedings of the International Conference on Machine Learning (ICML), 14. Distributed representations of sentences and documents (Beijing, 2014), pp. 1188–1196.
- OmniPage Capture SDK. http://www.nuance.com/for-business/by-product/omnipage/csdk/index.htm. Accessed March 2015.
- Captricity, Unprecedented Data Accessed at Your Service. https://captricity.com/. Accessed March 2015.
- ABBYY Recognition Server. http://www.abbyy.com/recognition-server/. Accessed March 2015.
- J Shin, in Big Data (Big Data), 2014 IEEE International Conference On. Investigating the accuracy of the OpenFDA API using the FDA Adverse Event Reporting System (FAERS) (2014), pp. 48–53. doi:10.1109/BigData.2014.7004412.
- R Rehurek, P Sojka, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Software framework for topic modeling with large corpora (2010).
- R-E Fan, K-W Chang, C-J Hsieh, X-R Wang, C-J Lin, LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008).