publications | Yonsei DataLab

2026

Prototype learning with structural-semantic alignment for interpretable molecular relational learning

Knowledge-Based Systems, Mar 2026

Molecular Relational Learning (MRL) is a critical technology driving the intelligent development of molecular science research, with the objective of capturing rich molecular representations. The chemical prior knowledge of functional group types in molecular images has attracted significant attention in molecular representation. However, the conformational sensitivity of molecular images leads to inconsistencies between their structural and visual semantic representations, which undermines the interpretability of MRL. To address this limitation, we propose Prototype Learning with Structural-Semantic Alignment (PSSA) to improve the interpretability of MRL while maintaining predictive performance. PSSA achieves this improvement by integrating Chemically-Structured-Aware Molecular Representation Alignment (ChemAlign) and Representation Clustering-based Prototype Learning (RCProto). In ChemAlign, the chemical structure-masked attention mechanism is designed to fuse molecular image patch-level information with substructure topological representations via a masking matrix. This approach consistently represents molecular structural information and semantic features. Subsequently, we introduce a dynamic weighting strategy to align cross-modal representations between molecular images and graph-based structural topologies. Building on molecular representations, we adopt unsupervised clustering to derive prototypes of molecular representations. Moreover, RCProto merges and fine-tunes prototypes with the similarity between molecular embeddings and prototypes, better matching the prototypes to the molecular representations and improving the interpretability of MRL. The experimental results on two types of MRL tasks in nine datasets demonstrate that PSSA outperforms the competitive baselines and provides satisfactory interpretability for MRL.
The effect of gender diversity on scientific team impact: A team roles perspective

Journal of Informetrics, Mar 2026

The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classified members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employed multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.
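An inverted U-shape of the kind reported above is commonly checked by adding a squared term to a regression and looking for a negative coefficient on it. The following is a minimal sketch under that assumption, not the authors' model: the function names `fit_quadratic` and `turning_point` are hypothetical, the data are synthetic, and the covariates and threshold regression described in the abstract are omitted.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    # Power sums: s[k] = sum(x^k), t[k] = sum(y * x^k)
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 matrix of the normal equations
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def turning_point(b1, b2):
    """x-value where the fitted parabola peaks (b2 < 0) or bottoms out (b2 > 0)."""
    return -b1 / (2 * b2)


# Synthetic example: impact peaks at a diversity score of 3
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [-(x - 3) ** 2 + 10 for x in xs]
b0, b1, b2 = fit_quadratic(xs, ys)
peak = turning_point(b1, b2)
```

A negative `b2` with a turning point inside the observed range of `xs` is what an inverted U-shape looks like in this simplified setting.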
Predicting and explaining recurrent child abuse using interpretable machine learning: Evidence from national-level child abuse data in South Korea (2017–2020)

Social Science & Medicine, Jan 2026

Objective: This study aimed to develop machine learning models to predict the risk of each type of recurrent child abuse using national-level child abuse data. Additionally, it sought to identify key factors based on abuse type and explain recurrence risk in individual cases. Methods: The analysis employed a broad range of factors related to child, perpetrator, the initial episode of child abuse, and service, covering 51,517 abused children, 46,497 perpetrators, and 64,774 reported cases. Distinct predictive models were developed for each type of recurrent abuse: recurrent physical abuse (RPA), recurrent emotional abuse (REA), recurrent sexual abuse (RSA), and recurrent neglect (RN). Results: The highest AUC-ROC score was 0.793 for RN, followed by 0.749, 0.702, and 0.700 for RSA, REA, and RPA, respectively. Younger perpetrator age likely increased the risk of all types of recurrent child abuse. Planning or completing counseling for abused children and their perpetrators was an important factor in mitigating the risk of RPA, REA, and RSA. The initial episode of abuse being of the same type was the most influential factor for RPA, REA, and RN, while the perpetrator being male was the most significant factor for RSA. Additionally, factors associated with parenting and family environment, including inappropriate parenting attitudes, a lack of parenting knowledge and skills, and conflicts with family members, were crucial factors for RSA. Conclusion: These findings can assist child protection agencies and related organizations in facilitating the early prevention of child abuse recurrence and designing interventions tailored to each abuse type.
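For readers unfamiliar with the AUC-ROC scores quoted above, the metric can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it that way on made-up labels and scores; the function name `auc_roc` is illustrative and nothing here reproduces the study's data or models.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 1 = recurrence, 0 = no recurrence (hypothetical values)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.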
From Bureaucracy to Startups: How Workplace Bullying Is Talked About on Online Employee Communities

Cyberpsychology, Behavior, and Social Networking, Jan 2026

Workplace bullying is increasingly recognized as a serious social and organizational problem with detrimental effects on employees’ well-being and productivity. While previous research has primarily focused on individual-level experiences or legal/policy frameworks, relatively little attention has been paid to how organizational characteristics—such as size and sector—shape the nature and discourse of workplace bullying. This study addresses this gap by analyzing user-generated content on Blind, a widely used anonymous platform for Korean employees. Using topic modeling techniques, we examine bullying-related discourse across four organizational types: public institutions, large companies, midsize companies, and small firms/startups. Our findings indicate that workplace bullying manifests differently across organizational types. In public institutions, it is characterized by hierarchical pressure and institutional rigidity; in large companies, by a sense of systemic helplessness and trauma following reporting; in midsize companies, by structurally embedded abuse rooted in informal power networks; and in small firms, by interpersonal conflict, gendered divisions of labor, and dissatisfaction with broader Korean workplace culture. These findings highlight the structural and cultural dimensions of workplace bullying and demonstrate how anonymous digital platforms can amplify marginalized employee voices. The study offers a more differentiated understanding of organizational variations in bullying discourse and underscores the need for tailored interventions based on workplace context.

2025

Multi-Category Fusion Contrastive Learning with Core Data Selection for Robust RGB Image-based Dental Caries Classification

Information Fusion, Dec 2025

Dental caries represents one of the most prevalent diseases affecting humankind, particularly among adolescent populations. RGB images offer a convenient and cost-effective method for dental caries detection. However, the captured images may suffer from blurriness, which, together with label errors introduced during manual annotation, can degrade the performance of models trained for dental caries detection. To address this problem, we propose Multi-Category Fusion Contrastive Learning with Core Data Selection (M3C) to improve the predictive performance of dental caries classification models. Instead of fine-tuning the backbone network structure, M3C focuses on improving the robustness of the model to label errors from a novel perspective by identifying core data that is highly relevant to the dental caries category. We analyzed and validated that M3C has better robustness in dental caries detection from model architecture representation, theoretical analysis, and mutual information computation. Specifically, M3C quantifies the average mutual information between dental caries images and dental caries category centers based on Jensen-Shannon Divergence (JSD), which is then used for selecting the core data to mitigate the impact of label errors on model performance. Furthermore, we design inter-category contrastive learning to enhance the performance of the model in distinguishing the categories of dental caries by improving the feature representation for samples of different categories. With theoretical justification, we jointly optimized model training using prediction loss and confusion contrastive loss. Extensive experiments demonstrate that M3C significantly surpasses comparative data selection methods in dental caries detection on dental caries RGB image datasets. Notably, M3C achieves superior predictive performance using only 50% of the core data compared to state-of-the-art dental caries detection methods using the entire dataset.
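The abstract above relies on the Jensen-Shannon Divergence (JSD). As a point of reference only, the sketch below computes the standard JSD between two discrete probability distributions; it does not reproduce M3C's mutual information estimator or its core data selection, and the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2) in nats."""
    # m is the equal-weight mixture of p and q; m[i] > 0 wherever p[i] or q[i] > 0
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = js_divergence([0.5, 0.5], [0.5, 0.5])   # no divergence
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])    # maximal divergence
```

Unlike the plain KL divergence, JSD is symmetric and stays finite when the two distributions have non-overlapping support.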
Examining the Changes in Bullying Discourse on Reddit: A Comparative Analysis Before and After the COVID-19 Pandemic

Cyberpsychology, Behavior, and Social Networking, May 2025

Bullying, a type of power abuse, deserves to be addressed, and this study examines bullying-related discussions on Reddit before and after the pandemic to better understand its dynamics during this time. We analyzed 8,720 posts and 21,607 comments from the r/Bullying subreddit using static and dynamic topic modeling (DTM) to understand the major topics discussed in the subreddit. Based on static topic modeling, we discovered that before the COVID-19 pandemic, the topics surrounding bullying focused on bullying in the school context, cyberbullying, and help-seeking, but changed to bullying against minority groups, workplace bullying, relationships and communication, and coping strategies. The long-term impact of bullying has emerged for both periods, implying that more efforts to prevent and combat bullying are needed to reduce the negative impacts throughout an individual’s lifecycle. We also discovered that the proportion of cyberbullying/antibullying, negative emotions, and self-esteem increased following the COVID-19 pandemic, according to DTM. Our findings suggest that following the pandemic, victims and places of bullying became more distinct. In addition to the widely studied and disseminated bullying research and policies concerning children at school, more assistance is needed to prevent and assist bullying victims who are racial and religious minorities in various settings.
Comparing social media engagement between women with suicidal ideation and those who have attempted suicide

Death Studies, Apr 2025

This study examined differences in social media engagement between women with suicidal ideation only (SI-only) and those who had attempted suicide (SA). We used content, statistical, and time series analyses on 3,510 tweets from 41 women in South Korea who had ideated and/or attempted suicide. Most tweets focused on everyday life and interests (38.0%, 1,355 tweets), while others expressed mental health distress and challenges (0.6%, 22), school or work-related stress (3.4%, 120), social relationship stress (1.5%, 52), and explicit suicidal statements (0.6%, 22). SI-only users posted the most on Saturdays, while SA users peaked on Sundays. Closer to the time of suicide, SA users increasingly posted explicit suicidal content, whereas SI-only users expressed more negative emotions. Our findings could help identify individuals at risk of suicide on social media, distinguishing between SI-only and SA users to inform better interventions.

2024

Understanding Double Stigma Toward Minority Groups on Social Media in the COVID-19 Pandemic: Findings from South Korea

Journal of Asian Sociology, Sep 2024

This study aims to compare social perceptions of four minority groups in South Korea categorized by religion, sexual orientation, and occupation. Using data collected from Naver News, a prominent Korean website, between February and June 2020, dynamic topic modeling was conducted on over 200,000 data points. The findings revealed that stigma-related topics such as labeling, negative stereotypes, separation, and status loss emerged in discussions about religious and sexual minority groups, subjecting them to the double stigma linked to COVID-19 and AIDS. In contrast, non-stigma-related topics, such as sympathy, criticism of government actions, and COVID-19 prevention, appeared in discussions about occupational minority groups. Over time, blame toward religious minorities increased, while sympathy towards occupational minorities increased. This study suggests that interventions on social media platforms can enhance the awareness of double stigma, contributing to its reduction.
Testing a Predictive Model to Identify the Risks of Online Sexual Victimization Among Korean Female Adolescents Using Machine Learning Algorithms

Jungtae Choi, Yongjun Zhu, and Kihyun Kim

Victims & Offenders, Sep 2024

Few studies have explored which factors most likely contribute to online sexual victimization (OSV) among female adolescents when all possible levels of factors are included in one model. Using machine learning algorithms, we investigate which factors are relatively more important predictors of OSV. We conducted surveys and crawled data from social media (Twitter and Instagram) in 2020, and 472 female adolescents participated in the study (mean = 16.7 years old). Information about demographic characteristics, online behaviors and experiences, offline victimization, and psychological characteristics was collected. We employ several machine learning algorithms as an exploratory analysis to identify the top ten most important predictors of OSV among 51 variables. Results show that offline victimization (offline sexual victimization and adverse childhood experiences (ACEs)), online behaviors and experiences (negative experiences on social media, talking with someone met online, disclosure of personal information, online social support, and number of negative comments), and psychological factors (social assurance and social connectedness) are important predictors of OSV. These findings suggest that using machine learning algorithms to identify the most important predictors of OSV will provide an opportunity to understand the complex phenomenon of OSV among female adolescents.
Do more heads imply better performance? An empirical study of team thought leaders’ impact on scientific team performance

Yi Zhao, Yuzhuo Wang, Heng Zhang, Donghun Kim, Chao Lu, Yongjun Zhu, and Chengzhi Zhang

Information Processing & Management, Apr 2024

Thought leadership plays a crucial role in boosting team performance; thus, teams with more thought leaders may perform better. However, the impact of the number of thought leaders on team performance in a scientific context remains understudied. In this study, we consider the authors of a publication as a scientific team and define the authors responsible for conceptual tasks, i.e., “conceived and designed the experiments” (one of the tasks described in the PLOS contribution statements classification system), as thought leaders. Leveraging more than 140,000 papers from PLOS journals, we examine the relationship between the number of thought leaders and two aspects of team performance (i.e., team impact and team disruptiveness) from both correlational and causal perspectives. The results showed that (1) an inverted U-shaped relationship exists between the number of thought leaders and the team’s impact, and (2) teams with more thought leaders tend to produce less disruptive ideas. We also explored the impact of international collaboration, team size, and gender diversity together with the number of thought leaders on team performance and found that (3) international collaboration improves team impact but lowers the disruptiveness of team outputs. This study advances scholarly understanding of thought leadership in scientific teams and provides valuable insights for policymakers and team managers.
Stalking Discourse on Reddit: A Comparative Analysis of Pre- and Post-COVID-19 Pandemic Using Big Data

Sou Hyun Jang, Donghun Kim, Yongjun Zhu, and Kim Chunrye

Cyberpsychology, Behavior, and Social Networking, Jun 2024

Stalking, a widespread and distressing phenomenon, has recently garnered considerable attention. The advent of digital platforms has revolutionized the landscape of stalking, presenting new avenues and challenges for research. However, the impact of the coronavirus disease (COVID)-19 pandemic on stalking remains underexplored, despite extensive studies on similar crimes such as intimate partner violence and domestic violence. To address this gap, our study focused on Reddit, a prominent online platform with a diverse user base and open discussion. Through an analysis of posts from the subreddit (https://www.reddit.com/r/Stalking/), we sought to compare the discourse on stalking before and after the COVID-19 pandemic. We found notable shifts in stalking-related posts before and after the COVID-19 pandemic, particularly with the emergence of new topics centered on cyberstalking. We also observed that the experiences of stalking victims have significantly changed following the COVID-19 pandemic. Based on our findings, we discussed the implications for policies to help stalking victims.
Understanding and comparing risk factors and subtypes in South Korean adult and adolescent women’s suicidal ideation or suicide attempt using survey and social media data

Donghun Kim, Ting Jiang, Ji Hyun Baek, Sou Hyun Jang, and Yongjun Zhu

Digital Health, Apr 2024

Objective: This study aimed to investigate the similarities and differences in risk factors for suicide among adult and adolescent women in South Korea and identify subtypes of suicidal ideation or suicide attempt in each group. Methods: Multifaceted data were collected and analyzed by linking survey and social media data. Interpretable machine learning models were constructed to predict suicide risk and major risk factors were extracted by investigating their feature importance. Additionally, subtypes of suicidal adult and adolescent women were identified and explained using risk factors. Results: The risk factors for adult women were primarily related to mental disorders, while those for adolescent women were primarily related to interpersonal experiences and needs. Two subtypes of suicidal adult women were one with high psychiatric symptoms and mental disorders of them and/or their families and the other with excessive social media use and high online victimization. Two subtypes of suicidal adolescent women were one with high psychiatric symptoms, high ACEs, and high social connectedness, and the other with frequent social media use, high online sexual victimization, and high social assurance. Conclusions: These findings enable a stratified and targeted understanding of suicide in women and help develop customized suicide prevention plans in South Korea.
Rhetorical structure parallels research topic in LIS articles: a temporal bibliometrics examination

Wen Lou, Jiangen He, Qianqian Xu, Zhijie Zhu, Qiwen Lu, and Yongjun Zhu

Library Hi Tech, Apr 2024

Purpose: The effectiveness of rhetorical structure is essential to communicate key messages in research articles (RAs). The interdisciplinary nature of library and information science (LIS) has led to unclear patterns and practice of using rhetorical structures. Understanding how RAs are constructed in LIS to facilitate effective scholarly communication is important. Numerous studies investigated the rhetorical structure of RAs in a range of disciplines, but LIS articles have not been well studied. Design/methodology/approach: In this study, the authors encoded rhetorical structures to 2,216 articles in the Journal of the Association for Information Science and Technology covering a period from 2001 to 2018 with the approaches of co-word analysis and visualization. The results show that the predominant rhetorical structures used by LIS researchers follow the sequence of Introduction-Literature Review-Methodology-Result-Discussion-Conclusion (ILMRDC). Findings: The authors’ temporal examination reveals the shifts of evolutionary pattern of rhetorical structure in 2008 and 2014. More importantly, the authors’ study demonstrates that rhetorical structures have varied greatly across research areas in LIS community. For example, scholarly communication and scientometrics studies tend to exclude literature review in articles. Originality/value: The present paper offers a first systematic examination of how rhetorical structures are used in a representative sample of a LIS journal, especially from a temporal perspective.
Dependency, reciprocity, and informal mentorship in predicting long-term research collaboration: A co-authorship matrix-based multivariate time series analysis

Yongjun Zhu, Donghun Kim, Ting Jiang, Yi Zhao, Jiangen He, Xinyi Chen, and Wen Lou

Journal of Informetrics, Feb 2024

In this study, we examine the roles of dependency, reciprocity, and informal mentorship in the prediction of long-term research collaboration in five disciplines. We use co-authorship matrix-based multivariate time series features and interpretable machine learning to train long-term collaboration prediction models and interpret the feature importance of trained models. Overall, long-term research collaboration that is defined using various standards was rare across the examined disciplines, and the prediction results were moderate to good. We found dependency, reciprocity, and informal mentorship to have different roles in different disciplines. Among the three, informal mentorship was important in predicting long-term research collaboration in Agriculture, Geology, and Library and Information Science. Reciprocity, which measures the interdependence between two researchers was important to prediction in the fields of Agriculture and Geology. Finally, dependency was important in all the disciplines with varying degrees of importance.
NCH-DDA: Neighborhood contrastive learning heterogeneous network for drug–disease association predictions

Peiliang Zhang, Chao Che, Bo Jin, Jingling Yuan, Ruixin Li, and Yongjun Zhu

Expert Systems with Applications, Mar 2024

Exploring new therapeutic diseases for existing drugs plays an essential role in reducing drug development costs. However, existing methods for predicting drug–disease association (DDA) lack fusion to multi-neighborhood information, which limits their ability to generalize and forces them to rely on prior knowledge. To this end, we propose a novel DDA model called the Neighborhood Contrastive Learning Heterogeneous Networks (NCH-DDA). NCH-DDA uses both single-neighborhood and multi-neighborhood feature extraction modules to extract important features of drugs and diseases in parallel from multiple potential spaces, such as heterogeneous networks and similarity networks. NCH-DDA fuses single-neighborhood and multi-neighborhood features using contrastive learning to enhance information interaction in different neighborhood spaces, ultimately obtaining universal domain features of drugs and diseases. NCH-DDA uses a combination of predictive loss and triplet loss to reduce dependence on prior knowledge. In different partition schemes of multiple datasets, NCH-DDA achieved the best performance in predicting DDA, outperforming several current state-of-the-art methods. Moreover, NCH-DDA demonstrated better performance in experiments on data sparsity and drug repositioning for Alzheimer’s disease, indicating its greater potential in DDA prediction with sparse omics data and drug repositioning applications.
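The abstract above mentions a triplet loss without giving its form. The sketch below shows the standard margin-based triplet loss on plain Python lists; the Euclidean distance, the margin value, and the function names are assumptions for illustration and are not taken from NCH-DDA.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-d embeddings
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # 1 - 5 + 1 < 0
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])       # 5 - 1 + 1
```

In a drug-disease setting, one plausible reading is that the anchor is a drug embedding, the positive a known associated disease, and the negative an unassociated disease, but that mapping is an assumption here.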

2023

Support behind the scenes: the relationship between acknowledgement, coauthor, and citation in Nobel articles

Wen Lou, Jiangen He, Lingxin Zhang, Zhijie Zhu, and Yongjun Zhu

Scientometrics, Aug 2023

Acknowledging individuals in research articles is known to be a personal and private expression of appreciation compared to other types of acknowledgement, such as financial support. Early studies have demonstrated the significant relationship between acknowledgement, coauthor, and citation. However, little is known about the extent of these relationships or about which behavior prompts which. We adopt a series of multivariate analyses, Bayes’ theorem, statistical analysis, and “before and after” matched-group studies to illustrate the acknowledgement patterns in 6323 research articles of 196 Nobel Prize laureates (NPL) from 2008 to 2018. Acknowledgement consistently proves to be significantly related to co-authorship and citation, where co-authorship and citing have an approximately 10% increasing effect on acknowledgement behavior. Our study is the first to state the order of such triangle: acknowledgement is significantly ahead of co-authorship and arguably occurs before citing behavior. Moreover, acknowledgement strengthens more than half of NPL on their co-authorship for 11% and citation for 72% after they acknowledge others. We verify the substantive possibility of co-authorship and citing behavior from acknowledgement and introduce a formation of a new norm of scholarly communication. This will greatly contribute to the matter of evaluation metrics and social network detection.
CariesFG: A fine-grained RGB image classification framework with attention mechanism for dental caries

Hao Jiang, Peiliang Zhang, Chao Che, Bo Jin, and Yongjun Zhu

Engineering Applications of Artificial Intelligence, Aug 2023

Dental caries is one of the most prevalent oral diseases, and deep learning methods have been used for caries diagnosis in large populations by leveraging RGB images. Existing attention-based fine-grained image classification methods underutilize features and are easily disturbed by background and irrelevant information. To tackle these issues, we propose a fine-grained RGB image classification framework with attention mechanism for dental caries (CariesFG). Specifically, it consists of 4 components: (1) Multi-Spectral channel Attention Module (MSAM), which can retain the useful frequency components in the feature map. (2) Position Attention Module (PAM), which captures feature dependencies in the spatial dimension. (3) Discriminative Point Selection strategy (DPS), which can find the most discriminative feature points. (4) Graph Convolution and Aggregation module (GCA), which aims to aggregate discriminative feature points at different scales of feature maps. To enhance the ability to extract discriminative features, PAM and MSAM are integrated into the backbone network to form a feature extraction network that incorporates attention mechanisms. Discriminative feature points at different scales of feature maps are extracted by DPS and aggregated as global discriminative features by GCA. On a fine-grained caries classification dataset, CariesFG achieved an accuracy of 68.36%, an F1-score of 66.77%, and a specificity of 84.17%, significantly outperforming state-of-the-art methods. Moreover, visualization results on attention parts show that CariesFG can effectively learn discriminative features and discriminative parts.
IEA-GNN: Anchor-aware graph neural network fused with information entropy for node classification and link prediction

Peiliang Zhang, Jiatao Chen, Chao Che, Liang Zhang, Bo Jin, and Yongjun Zhu

Information Sciences, Jul 2023

Graph neural networks are essential in mining complex relationships in graphs. However, most methods ignore the global location information of nodes and the discrepancy between symmetrically located nodes, resulting in the inability to distinguish between nodes with homogeneous network neighborhoods. We propose an Anchor-aware Graph Neural Network fused with Information Entropy (IEA-GNN) to capture the global location information of nodes in the graph. IEA-GNN first calculates the information entropy of nodes and constructs candidate sets of anchors. We define the calculation method of the distance from any node to the anchor points and incorporate the relative distance information between nodes at initialization. The nonlinear distance-weighted aggregation learning strategy based on the anchor points of candidate sets is used to obtain the nodes’ feature information, which can be captured more effectively by fusing the global location information to the node representation with the selected anchor points. Selecting anchor points based on information entropy avoids the aggregation of anchor points in the graph, highlighting the positional differences between nodes and making it easier to distinguish homogeneous neighborhood nodes. Experimental results of node classification and link prediction on five datasets show that IEA-GNN outperforms the baseline model.
Interpretable machine learning-based approaches for understanding suicide risk and protective factors among South Korean females using survey and social media data

Donghun Kim, Lihong Quan, Mihye Seo, Kihyun Kim, Jae-Won Kim, and Yongjun Zhu

Suicide and Life-Threatening Behavior, Jun 2023

Objective: We aimed to identify and understand risk and protective factors for suicide among South Korean females by linking survey and social media data and using interpretable machine learning approaches. Materials and Methods: We collected a wide range of potential factors including the material, psychosocial, and behavioral data from a detailed survey, which we then linked to data from social media. In addition, we adopted interpretable machine learning approaches to (1) predict the suicide risk, (2) explain the relative importance of factors and their interactions regarding suicide, and (3) understand individual differences affecting suicide risk. Results: The best-performing machine learning model achieved an AUC of 0.737. Adverse childhood experiences, social connectedness, and mean positive sentiment score of social media posts were the three risk factors that had a monotonic or unimodal relationship with suicide, and satisfaction with life, narcissistic self-presentation, and number of close friends on social media were the three protective factors that had a monotonic or unimodal relationship with suicide. We also found several meaningful interactions between specific psychiatric symptoms and narcissistic self-presentation. Conclusions: Our findings can help governmental organizations to better assess female suicide risk in South Korea and develop more informed and customized suicide prevention strategies.
An Exploratory Study of Medical Journal’s Twitter Use: Metadata, Networks, and Content Analyses

Donghun Kim, Woojin Jung, Ting Jiang, and Yongjun Zhu

Journal of Medical Internet Research, Jan 2023

Background: An increasing number of medical journals are using social media to promote themselves and communicate with their readers. However, little is known about how medical journals use Twitter and what their social media management strategies are. Objective: This study aimed to understand how medical journals use Twitter from a global standpoint. We conducted a broad, in-depth analysis of all the available Twitter accounts of medical journals indexed by major indexing services, with a particular focus on their social networks and content. Methods: The Twitter profiles and metadata of medical journals were analyzed along with the social networks on their Twitter accounts. Results: The results showed that overall, publishers used different strategies regarding Twitter adoption, Twitter use patterns, and their subsequent decisions. The following specific findings were noted: journals with Twitter accounts had a significantly higher number of publications and a greater impact than their counterparts; subscription journals had a slightly higher Twitter adoption rate (2%) than open access journals; journals with higher impact had more followers; and prestigious journals rarely followed other lesser-known journals on social media. In addition, an in-depth analysis of 2000 randomly selected tweets from 4 prestigious journals revealed that The Lancet had dedicated considerable effort to communicating with people about health information and fulfilling its social responsibility by organizing committees and activities to engage with a broad range of health-related issues; The New England Journal of Medicine and the Journal of the American Medical Association focused on promoting research articles and attempting to maximize the visibility of their research articles; and the British Medical Journal provided copious amounts of health information and discussed various health-related social problems to increase social awareness of the field of medicine. Conclusions: Our study used various perspectives to investigate how medical journals use Twitter and explored the Twitter management strategies of 4 of the most prestigious journals. Our study provides a detailed understanding of medical journals’ use of Twitter from various perspectives and can help publishers, journals, and researchers to better use Twitter for their respective purposes.
Predicting coauthorship using bibliographic network embedding

Yongjun Zhu, Lihong Quan, Pei-Ying Chen, Meen Chul Kim, and Chao Che

Journal of the Association for Information Science & Technology, Apr 2023

Coauthorship prediction applies predictive analytics to bibliographic data to predict authors who are highly likely to be coauthors. In this study, we propose an approach for coauthorship prediction based on bibliographic network embedding through a graph-based bibliographic data model that can be used to model common bibliographic data, including papers, terms, sources, authors, departments, research interests, universities, and countries. A real-world dataset released by AMiner that includes more than 2 million papers, 8 million citations, and 1.7 million authors was integrated into a large bibliographic network using the proposed bibliographic data model. Translation-based methods were applied to the entities and relationships to generate their low-dimensional embeddings while preserving their connectivity information in the original bibliographic network. We applied machine learning algorithms to embeddings that represent the coauthorship relationships of two authors and achieved high prediction results. The reference model, which is the combination of a network embedding size of 100, the most basic translation-based method, and a gradient boosting method, achieved an F1 score of 0.9, and even higher scores are obtainable with different embedding sizes and more advanced embedding methods. Thus, the strengths of the proposed approach lie in its customizable components under a unified framework.
Structured abstract summarization of scientific articles: Summarization using full-text section information

Hanseok Oh, Seojin Nam, and Yongjun Zhu

Journal of the Association for Information Science & Technology, Feb 2023

The automatic summarization of scientific articles differs from other text genres because of the structured format and longer text length. Previous approaches have focused on tackling the lengthy nature of scientific articles, aiming to improve the computational efficiency of summarizing long text using a flat, unstructured abstract. However, the structured format of scientific articles and characteristics of each section have not been fully explored, despite their importance. The lack of a sufficient investigation and discussion of various characteristics for each section and their influence on summarization results has hindered the practical use of automatic summarization for scientific articles. To provide a balanced abstract proportionally emphasizing each section of a scientific article, the community introduced the structured abstract, an abstract with distinct, labeled sections. Using this information, in this study, we aim to understand tasks ranging from data preparation to model evaluation from diverse viewpoints. Specifically, we provide a preprocessed large-scale dataset and propose a summarization method applying the introduction, methods, results, and discussion (IMRaD) format reflecting the characteristics of each section. We also discuss the objective benchmarks and perspectives of state-of-the-art algorithms and present the challenges and research directions in this area.
Suicidality Detection on Social Media Using Metadata and Text Feature Extraction and Machine Learning

Woojin Jung, Donghun Kim, Seojin Nam, and Yongjun Zhu

Archives of Suicide Research, Jan 2023

In this study, we implemented machine learning models that can detect suicidality posts on Twitter. We randomly selected and annotated 20,000 tweets and explored metadata and text features to build effective models. Metadata features were studied in great detail to understand their potential and importance in suicidality detection models. Results showed that posting type (i.e., reply or not) and time-related features such as the month, day of the week, and the time (AM vs. PM) were the most important metadata features in suicidality detection models. Specifically, the probability of a social media post being suicidal is higher if the post is a reply to other users rather than an original tweet. Moreover, tweets created in the afternoon, on Fridays and weekends, and in fall have higher probabilities of being detected as suicidality tweets compared with those created at other times. By integrating metadata and text features, we obtained a model with good performance (i.e., F1 score of 0.846) that can assist humans in real-world settings in detecting suicidality social media posts.

2022

Recommendations with residual connections and negative sampling based on knowledge graphs

Yuanyuan Liu, Zhaoqian Zhong, Chao Che, and Yongjun Zhu

Knowledge-Based Systems, Dec 2022

A knowledge graph (KG) contains a large amount of well-structured external triple information that can effectively address the problem of poor interpretability in collaborative filtering. Recently, recommendation system (RS) models relying on graph neural networks (GNNs) have been widely developed, but increasing the number of GNN layers inevitably leads to over-smoothing problems. Meanwhile, most of the current KG-based negative sampling strategies randomly collect negative samples from unobserved data to train RS models. However, these strategies are insufficient to generate negative samples reflecting genuine user demands. To overcome these obstacles, we design a model called knowledge graph residual negative sampling recommendation (KGRNS), which utilizes residual connections and a pooling operation to alleviate the over-smoothing problem and generates high-quality negative samples through negative sampling. Specifically, we devise residual connections on each output layer of the GNN and then utilize a sum pooling operation to mitigate the effects of the over-smoothing problem on the model. In addition, to generate high-quality negative samples, we create a gated strategy to mix the knowledge of both positive and negative samples to generate synthetic negative samples and then select the virtual negative sample that is closest to the positive ones through a theoretically backed hard negative sample selection strategy. We conducted broad experiments on three datasets. The experimental results showed that KGRNS achieved considerable improvements over state-of-the-art methods. Ablation studies validated the effectiveness of each part of KGRNS.
Bi-graph attention network for aspect category sentiment classification

Yongxue Shan, Chao Che, Xiaopeng Wei, Xiaodong Wang, Yongjun Zhu, and Bo Jin

Knowledge-Based Systems, Dec 2022

Aspect category sentiment classification (ACSC) aims to determine the sentiment polarities of sentences under given aspect categories, which can be used to infer finer-grained information in text sequences. It is widely used in consumer services, healthcare, and elections. Most models ignore the interaction of global sequence context and syntactic structure information in sentences and fail to fully learn the rich relations between word nodes related to specific aspect categories. To tackle these problems, this paper introduces a bi-graph attention network (BiGAT) for ACSC, which constructs two graphs to describe the sequential context information and syntactic structure information in sentences. It utilizes the graph attention network to aggregate neighbor information from each node within a single graph and uses biaffine modules to coordinate heterogeneous information between the sequential- and syntactic-based graphs. The model uses the aspect-specific mask operation and retrieval-based attention mechanism to reduce the effect of noise created by useless information in sentences. Experimental results on the SemEval 2015, SemEval 2016, and MAMS datasets show that BiGAT outperforms other state-of-the-art ACSC models.
Linking suicide and social determinants of health in South Korea: An investigation of structural determinants

Yongjun Zhu, Seojin Nam, Lihong Quan, Jihyun Baek, Hongjin Jeon, and Buzhou Tang

Frontiers in Public Health, Oct 2022

Introduction: Studies have shown that suicide is closely related to various social factors. However, due to the restriction in the data scale, our understanding of these social factors is still limited. We propose a conceptual framework for understanding social determinants of suicide at the national level and investigate the relationships between structural determinants (i.e., gender, employment statuses, and occupation) and suicide outcomes (i.e., types of suicide, places of suicide, suicide methods, and warning signs) in South Korea. Methods: We linked a national-level suicide registry from the Korea Psychological Autopsy Center with the Social Determinants of Health framework proposed by the World Health Organization’s Commission on Social Determinants of Health. Results: First, male and female suicide victims have clear differences in their typical suicide methods (fire vs. drug overdose), primary warning signs (verbal vs. mood), and places of death (suburb vs. home). Second, employees accounted for the largest proportion of murder-suicides (>30%). The proportion of students was much higher for joint suicides than for individual suicides and murder-suicides. Third, among individuals choosing pesticides as their suicide method, over 50% were primary workers. In terms of drug overdoses, professionals and laborers accounted for the largest percentage; the former also constituted the largest proportion in the method of jumping from heights. Conclusion: A clear connection exists between the investigated structural factors and various suicide outcomes, with gender, social class, and occupation all impacting suicide.
Understanding the Research Landscape of Deep Learning in Biomedical Science: Scientometric Analysis

Seojin Nam, Donghun Kim, Woojin Jung, and Yongjun Zhu

Journal of Medical Internet Research, Apr 2022

Background: Advances in biomedical research using deep learning techniques have generated a large volume of related literature. However, there is a lack of scientometric studies that provide a bird’s-eye view of them. This absence has led to a partial and fragmented understanding of the field and its progress. Objective: This study aimed to gain a quantitative and qualitative understanding of the scientific domain by analyzing diverse bibliographic entities that represent the research landscape from multiple perspectives and levels of granularity. Methods: We searched and retrieved 978 deep learning studies in biomedicine from the PubMed database. A scientometric analysis was performed by analyzing the metadata, content of influential works, and cited references. Results: In the process, we identified the current leading fields, major research topics and techniques, knowledge diffusion, and research collaboration. There was a predominant focus on applying deep learning, especially convolutional neural networks, to radiology and medical imaging, whereas a few studies focused on protein or genome analysis. Radiology and medical imaging also appeared to be the most significant knowledge sources and an important field in knowledge diffusion, followed by computer science and electrical engineering. A coauthorship analysis revealed various collaborations among engineering-oriented and biomedicine-oriented clusters of disciplines. Conclusions: This study investigated the landscape of deep learning research in biomedicine and confirmed its interdisciplinary nature. Although it has been successful, we believe that there is a need for diverse applications in certain areas to further boost the contributions of deep learning in addressing biomedical research problems. We expect the results of this study to help researchers and communities better align their present and future work.
Understanding information behavior of South Korean Twitter users who express suicidality on Twitter

Donghun Kim, Woojin Jung, Seojin Nam, Hongjin Jeon, Jihyun Baek, and Yongjun Zhu

Digital Health, Mar 2022

Objective: Although there have been a few studies on how suicidal users behave on Twitter, they investigated only partial aspects such as tweeting frequency and tweet length. Therefore, we aim to understand the various information behaviors of suicidal users in South Korea. Methods: To achieve this goal, we annotated 20,000 tweets and identified 1097 tweets with the expression of suicidality (i.e. suicidal tweets) and 229 suicidal users (i.e. experimental group). Using the data, a user profile analysis, a comparative analysis with a control group, and a tweets/hashtags analysis were performed. Results: Our results show that many suicidal users used suicide-related keywords in their user IDs, usernames, descriptions, and pinned tweets. We also found that, compared to the control group, the experimental group showed different patterns of information behavior. The experimental group did not frequently use Twitter and, on average, wrote longer texts than the control group. A clear seasonal pattern was also identified in the experimental group’s tweeting behavior. Frequently used keywords/hashtags were extracted from tweets written by the experimental group for the purpose of understanding their concerns and detecting more suicidal tweets. Conclusions: We believe that our study will help in the understanding of suicidal users’ information behavior on social media and lay the basis for more accurate actions for suicide prevention and early intervention on social media.

2021

Gender imbalance in the productivity of funded projects: A study of the outputs of National Institutes of Health R01 grants

Chaojiang Wu, Erjia Yan, Yongjun Zhu, and Kai Li

Journal of the Association for Information Science & Technology, Nov 2021

This study examines the relationship between a team’s gender composition and the outputs of funded projects using a large data set of National Institutes of Health (NIH) R01 grants and their associated publications between 1990 and 2017. This study finds that while women investigators’ presence in NIH grants is generally low, higher women investigator presence is on average related to a slightly lower number of publications. This study finds empirically that women investigators elect to work in fields in which fewer publications per million dollars of funding are the norm. For fields where women investigators are relatively well represented, they are as productive as men. The overall lower productivity of women investigators may be attributed to the low representation of women in high-productivity fields dominated by men investigators. The findings shed light on possible reasons for gender disparity in grant productivity.
Mapping scientific profile and knowledge diffusion of Library Hi Tech

Meen Chul Kim, Yuanyuan Feng, and Yongjun Zhu

Library Hi Tech, Jun 2021

Purpose: Library Hi Tech is one of the most influential journals that publish leading research in library and information science (LIS). The present study aims to understand the scholarly communication in Library Hi Tech by profiling its historic footprint, emerging trends and knowledge diffusion. Design/methodology/approach: A total of 3,131 bibliographic records between 1995 and 2018 were collected from the Web of Science. Text mining, graph analysis and data visualization were used to analyze subject category assignment, domain-level citation trends, co-occurrence of keywords, keyword bursts, networks of document co-citation and landmark articles. Findings: Findings indicated that published research in the journal was largely influenced by the psychology, education and social domain as a unidisciplinary discipline. Knowledge of the journal has been disseminated into multiple domains such as LIS, computer science and education. Dominant thematic concentrations were also identified: (1) library services in academic libraries and related to digital libraries, (2) adoption of new information technologies and (3) information-seeking behavior in these contexts. Additionally, the journal has exhibited an increased research emphasis on mixed-method user-centered studies and investigations into libraries’ use of new media. Originality/value: This study provides a promising approach to understand scientific trends and the intellectual growth of journals. It also helps Library Hi Tech to become more self-explanatory with a detailed bibliometric profile and to identify future directions in editorship and readership. Finally, researchers in the community can better position their studies within the emerging trends and current challenges of the journal.
Analyzing China’s research collaboration with the United States in high-impact and high-technology research

Yongjun Zhu, Donghun Kim, Erjia Yan, Meen Chul Kim, and Guanqiu Qi

Quantitative Science Studies, Apr 2021

This study investigates China’s international research collaboration with the United States through a bibliometric analysis of coauthorship over time using historical research publication data. We investigate from three perspectives: overall, high-impact, and high-technology research collaborations using data from Web of Science (WoS), Nature Index, and Technology Alert List maintained by the U.S. Department of State. The results show that the United States is China’s largest research collaborator and that in all three aspects, China and the United States are each other’s primary collaborators much of the time. From China’s perspective, we have found weakening collaboration with the United States over the past 2 years. In terms of high-impact research collaboration, China has historically shared a higher percentage of its research with the United States than vice versa. In terms of high-technology research, the situation is reversed, with the United States sharing more. The percentage of the United States’ high-technology research shared with China has been continuously increasing over the past 10 years, while in China the percentage has been relatively stable.

2020

Analyzing academic mobility of U.S. professors based on ORCID data and the Carnegie Classification

Erjia Yan, Yongjun Zhu, and Jiangen He

Quantitative Science Studies, Dec 2020

This paper uses two open science data sources, ORCID and the Carnegie Classification of Institutions of Higher Education (CCIHE), to identify tenure-track and tenured professors in the United States who have changed academic affiliations. Through a series of data cleaning and processing actions, 5,938 professors met the selection criteria of professorship and mobility. Using ORCID professor profiles and the Carnegie Classification, this paper reveals patterns of academic mobility in the United States from the aspects of institution types, locations, regions, funding mechanisms of institutions, and professors’ genders. We find that professors tended to move to institutions with higher research intensity, such as those with an R1 or R2 designation in the Carnegie Classification. They also tended to move from rural institutions to urban institutions. Additionally, this paper finds that female professors are more likely to move within the same geographic region than male professors and that when they move from a less research-intensive institution to a more research-intensive one, female professors are less likely to retain their rank or attain promotion.
Knowledge-driven drug repurposing using a comprehensive drug knowledge graph

Yongjun Zhu, Chao Che, Bo Jin, Ningrui Zhang, Chang Su, and Fei Wang

Health Informatics Journal, Dec 2020

Due to the huge costs associated with new drug discovery and development, drug repurposing has become an important complement to the traditional de novo approach. With the increasing number of public databases and the rapid development of analytical methodologies, computational approaches have gained great momentum in the field of drug repurposing. In this study, we introduce an approach to knowledge-driven drug repurposing based on a comprehensive drug knowledge graph. We design and develop a drug knowledge graph by systematically integrating multiple drug knowledge bases. We describe path- and embedding-based data representation methods of transforming information in the drug knowledge graph into valuable inputs to allow machine learning models to predict drug repurposing candidates. The evaluation demonstrates that the knowledge-driven approach can produce high predictive results for known diabetes mellitus treatments by only using treatment information on other diseases. In addition, this approach supports exploratory investigation through the review of meta paths that connect drugs with diseases. This knowledge-driven approach is an effective drug repurposing strategy supporting large-scale prediction and the investigation of case studies.
Drug repurposing against Parkinson’s disease by text mining the scientific literature

Yongjun Zhu, Woojin Jung, Fei Wang, and Chao Che

Library Hi Tech, Nov 2020

Purpose: Drug repurposing involves the identification of new applications for existing drugs. Owing to the enormous rise in the costs of pharmaceutical R&D, several pharmaceutical companies are leveraging repurposing strategies. Parkinson’s disease is the second most common neurodegenerative disorder worldwide, affecting approximately 1–2 percent of the human population older than 65 years. This study proposes a literature-based drug repurposing strategy in Parkinson’s disease. Design/methodology/approach: The literature-based drug repurposing strategy proposed herein combined natural language processing, network science and machine learning methods for analyzing unstructured text data and producing actionable knowledge for drug repurposing. The approach comprised multiple computational components, including the extraction of biomedical entities and their relationships, knowledge graph construction, knowledge representation learning and machine learning-based prediction. Findings: The proposed strategy was used to mine information pertaining to the mechanisms of disease treatment from known treatment relationships and predict drugs for repurposing against Parkinson’s disease. The F1 score of the best-performing method was 0.97, indicating the effectiveness of the proposed approach. The study also presents experimental results obtained by combining the different components of the strategy. Originality/value: The drug repurposing strategy proposed herein for Parkinson’s disease is distinct from those existing in the literature in that the drug repurposing pipeline includes components of natural language processing, knowledge representation and machine learning for analyzing the scientific literature. The results of the study provide important and valuable information to researchers studying different aspects of Parkinson’s disease.
Mapping scientific landscapes in UMLS research: a scientometric review

Meen Chul Kim, Seojin Nam, Fei Wang, and Yongjun Zhu

Journal of the American Medical Informatics Association, Oct 2020

Objective: The Unified Medical Language System (UMLS) is one of the most successful collaborative efforts of terminology resource development in biomedicine. The present study aims to 1) survey historical footprints, emerging technologies, and the existing challenges in the use of UMLS resources and tools, and 2) present potential future directions. Materials and Methods: We collected 10,469 bibliographic records published between 1986 and 2019, using a Web of Science database. We used graph analysis, data visualization, and text mining to analyze domain-level citations, subject categories, keyword co-occurrence and bursts, document co-citation networks, and landmark papers. Results: The findings show that the development of UMLS resources and tools has been led by interdisciplinary collaboration among medicine, biology, and computer science. Efforts encompassing multiple disciplines, such as medical informatics, biochemical sciences, and genetics, were the driving forces behind the domain’s growth. The following topics were found to be the dominant research themes from the early phases to mid-phases: 1) development and extension of ontologies and 2) enhancing the integrity and accessibility of these resources. Knowledge discovery using machine learning and natural language processing and applications in broader contexts such as drug safety surveillance have recently been receiving increasing attention. Discussion: Our analysis confirms that while reaching its scientific maturity, UMLS research aims to boundary-span to more variety in the biomedical context. We also made some recommendations for editorship and authorship in the domain. Conclusion: The present study provides a systematic approach to map the intellectual growth of science, as well as a self-explanatory bibliometric profile of the published UMLS literature. It also suggests potential future directions. Using the findings of this study, the scientific community can better align the studies within the emerging agenda and current challenges.
Nine million book items and eleven million citations: a study of book-based scholarly communication using OpenCitations

Yongjun Zhu, Erjia Yan, Silvio Peroni, and Chao Che

Scientometrics, Feb 2020

Books have been widely used to share information and contribute to human knowledge. However, the quantitative use of books as a method of scholarly communication is relatively unexamined compared to journal articles and conference papers. This study uses the COCI dataset (a comprehensive open citation dataset provided by OpenCitations) to explore books’ roles in scholarly communication. The COCI data we analyzed includes 445,826,118 citations from 46,534,705 bibliographic entities. By analyzing such a large amount of data, we provide a thorough, multifaceted understanding of books. Among the investigated factors are (1) temporal changes to book citations; (2) book citation distributions; (3) years to citation peak; (4) citation half-life; and (5) characteristics of the most-cited books. Results show that books have received less than 4% of total citations, and have been cited mainly by journal articles. Moreover, 97.96% of books have been cited fewer than ten times. Books take longer than other bibliographic materials to reach peak citation levels, yet are cited for the same duration as journal articles. Most-cited books tend to cover general (yet essential) topics, theories, and technological concepts in mathematics and statistics.
Network embedding in biomedical data science

Chang Su, Jie Tong, Yongjun Zhu, Peng Cui, and Fei Wang

Briefings in Bioinformatics, Jan 2020

Owing to the rapid development of computer technologies, an increasing amount of relational data has been emerging in modern biomedical research. Many network-based learning methods have been proposed to analyze such data; they provide a deep understanding of the topology and knowledge behind biomedical networks and benefit many applications in human healthcare. However, most network-based methods suffer from high computational and space costs. Challenges remain in handling the high dimensionality and sparsity of biomedical networks. The latest advances in network embedding technologies provide new, effective paradigms for solving the network analysis problem. Network embedding converts a network into a low-dimensional space while maximally preserving its structural properties. In this way, downstream tasks such as link prediction and node classification can be done by traditional machine learning methods. In this survey, we conduct a comprehensive review of the literature on applying network embedding to advance the biomedical domain. We first briefly introduce the widely used network embedding models. After that, we carefully discuss how the network embedding approaches were performed on biomedical networks as well as how they accelerated the downstream tasks in biomedical science. Finally, we discuss the challenges that existing network embedding applications in biomedical domains face and suggest several promising future directions for improving human healthcare.

2019

Drug knowledge bases and their applications in biomedical informatics research

Yongjun Zhu, Olivier Elemento, Jyotishman Pathak, and Fei Wang

Briefings in Bioinformatics, Jul 2019

Recent advances in biomedical research have generated a large volume of drug-related data. To effectively handle this flood of data, many initiatives have been taken to help researchers make good use of them. As a result of these initiatives, many drug knowledge bases have been constructed. They range from simple ones with specific focuses to comprehensive ones that contain information on almost every aspect of a drug. These curated drug knowledge bases have made significant contributions to the development of efficient and effective health information technologies for better health-care service delivery. Understanding and comparing existing drug knowledge bases and how they are applied in various biomedical studies will help us recognize the state of the art and design better knowledge bases in the future. In addition, researchers can gain insights into novel applications of the drug knowledge bases through a review of successful use cases. In this study, we provide a review of existing popular drug knowledge bases and their applications in drug-related studies. We discuss challenges in constructing and using drug knowledge bases as well as future research directions toward a better ecosystem of drug knowledge bases.

2018

Association networks in a matched case-control design – Co-occurrence patterns of preexisting chronic medical conditions in patients with major depression versus their matched controls

Min-hyung Kim, Samprit Banerjee, Yize Zhao, Fei Wang, Yiye Zhang, Yongjun Zhu, Joseph DeFerio, Lauren Evans, Sang Min Park, and Jyotishman Pathak

Journal of Biomedical Informatics, Nov 2018

Objective: We present a method for comparing association networks in a matched case-control design, which provides a high-level comparison of co-occurrence patterns of features after adjusting for confounding factors. We demonstrate this approach by examining the differential distribution of chronic medical conditions in patients with major depressive disorder (MDD) compared to the distribution of these conditions in their matched controls. Materials and methods: Newly diagnosed MDD patients were matched to controls based on their demographic characteristics, socioeconomic status, place of residence, and healthcare service utilization in the Korean National Health Insurance Service’s National Sample Cohort. Differences in the networks of chronic medical conditions in newly diagnosed MDD cases treated with antidepressants, and their matched controls, were prioritized with a permutation test accounting for the false discovery rate. Sensitivity analyses for the associations between prioritized pairs of chronic medical conditions and new MDD diagnosis were performed with regression modeling. Results: By comparing the association networks of chronic medical conditions in newly diagnosed depression patients and their matched controls, five pairs of such conditions were prioritized among 105 possible pairs after controlling the false discovery rate at 5%. In sensitivity analyses using regression modeling, four out of the five prioritized pairs were statistically significant for the interaction terms. Conclusion: Association networks in a matched case-control design can provide a high-level comparison of comorbid features after adjusting for confounding factors, thereby supplementing traditional clinical study approaches. We demonstrate the differential co-occurrence pattern of chronic medical conditions in patients with MDD and prioritize the chronic conditions that have statistically significant interactions in regression models for depression.
Joint modeling of the association between NIH funding and its three primary outcomes: patents, publications, and citation impact

Fengqing Zhang, Erjia Yan, Xin Niu, and Yongjun Zhu

Scientometrics, Jul 2018

This paper examines the impact of NIH funding on research outcomes using data from 108,803 projects funded by NIH between January 2009 and March 2017. We extend the prior knowledge on this topic by incorporating the correlation structure of multiple research outcomes, as well as a comprehensive list of grant-level features capturing information on funding size, gender composition and funding type. Specifically, we utilize partial least squares regression (PLS) to jointly model all three primary outcomes (publications, patents and citation impact) and identify the effects of grant-level features on research outputs. Our results show that joint modeling of research outcomes via PLS yields a more accurate prediction than analyzing each outcome separately. Additionally, we find that when other grant-level features are held constant, a 2-year-longer project duration would produce a similar improvement in research outputs to that achieved by $1 million in additional funding. Based on this finding, we recommend no-cost extension of funded projects instead of increased funding support to achieve a comparable increase in research outputs. Promoting multi-organizational grants is found to be more effective for increasing patents, whereas encouraging multiple-PI grants is more productive in terms of publications and citation impact. Of the various NIH grant types, program project/center grants (P series) and research training grants (T series) are the two most productive and impactful. Results also suggest that projects with a higher proportion of male PIs tend to produce more research outputs. This finding, however, needs to be interpreted with caution due to the limitation of our data set.
Understanding the research landscape of major depressive disorder via literature mining: an entity-level analysis of PubMed data from 1948 to 2017

Yongjun Zhu, Min-Hyung Kim, Samprit Banerjee, Joseph Deferio, George S Alexopoulos, and Jyotishman Pathak

JAMIA Open, Apr 2018

Objective: To analyze literature-based data from PubMed to identify diseases and medications that have frequently been studied with major depressive disorder (MDD). Materials and methods: Abstracts of 23 799 research articles about MDD published from 1948 to 2017 were analyzed using data and text mining approaches. Methods such as information extraction, frequent pattern mining, regression, and burst detection were used to explore diseases and medications that have been associated with MDD. Results: In addition to many mental disorders and antidepressants, we identified several nonmental health diseases and nonpsychotropic medications that have frequently been studied with MDD. Our results suggest that: (1) MDD has been studied with disorders such as Pain, Diabetes Mellitus, Wounds and Injuries, Hypertension, and Cardiovascular Diseases; (2) medications such as Hydrocortisone, Dexamethasone, Ketamine, and Lithium have been studied in terms of their side effects and off-label uses; (3) the relationships between nonmental disorders and MDD have gained increased attention from the scientific community; and (4) the bursts of Diabetes Mellitus and Cardiovascular Diseases explain the psychiatric and/or depression screening recommended by authoritative associations during the periods of the bursts. Discussion and conclusion: This study summarized and presented an overview of the previous MDD research in terms of diseases and medications that are highly relevant to MDD. The reported results can potentially facilitate hypothesis generation for future studies. The approaches proposed in the study can be used to better understand the progress and advance of the field.
Tracking word semantic change in biomedical literature

Erjia Yan, and Yongjun Zhu

International Journal of Medical Informatics, Jan 2018

Up to this point, research on written scholarly communication has focused primarily on syntactic, rather than semantic, analyses. Consequently, we have yet to understand semantic change as it applies to disciplinary discourse. The objective of this study is to illustrate word semantic change in biomedical literature. To that end, we identify a set of representative words in biomedical literature based on word frequency and word-topic probability distributions. A word2vec language model is then applied to the identified words in order to measure word- and topic-level semantic changes. We find that for the selected words in PubMed, overall, meanings are becoming more stable in the 2000s than they were in the 1980s and 1990s. At the topic level, the global distance of most topics (19 out of 20 tested) is declining, suggesting that the words used to discuss these topics are stabilizing semantically. Similarly, the local distance of most topics (19 out of 20) is also declining, showing that the meanings of words from these topics are becoming more consistent with those of their semantic neighbors. At the word level, this paper identifies two different trends in word semantics, as measured by the aforementioned distance metrics: on the one hand, words can form clusters with their semantic neighbors, and these words, as a cluster, coevolve semantically; on the other hand, words can drift apart from their semantic neighbors while nonetheless stabilizing in the global context. In relating our work to language laws on semantic change, we find no overwhelming evidence to support either the law of parallel change or the law of conformity.

2017

A natural language interface to a graph-based bibliographic information retrieval system

Yongjun Zhu, Erjia Yan, and Il-Yeol Song

Data & Knowledge Engineering, Sep 2017

With the ever-increasing volume of scientific literature, there is a need for a natural language interface to bibliographic information retrieval systems to retrieve relevant information effectively. In this paper, we propose one such interface, NLI-GIBIR, which allows users to search for a variety of bibliographic data through natural language. NLI-GIBIR makes use of a novel framework applicable to graph-based bibliographic information retrieval systems in general. This framework incorporates algorithms/heuristics for interpreting and analyzing natural language bibliographic queries via a series of text- and linguistic-based techniques, including tokenization, named entity recognition, and syntactic analysis. We find that our framework, as implemented in NLI-GIBIR, can effectively represent and address complex bibliographic information needs. Thus, the contributions of this paper are as follows: First, to our knowledge, it is the first attempt to propose a natural language interface for graph-based bibliographic information retrieval. Second, we propose a novel customized natural language processing framework that integrates a few original algorithms/heuristics for interpreting and analyzing bibliographic queries. Third, we show that the proposed framework and natural language interface provide a practical solution for building real-world bibliographic information retrieval systems. Our experimental results show that the presented system can correctly answer 39 out of 40 example natural language queries with varying lengths and complexities.
Big Data and Data Science: Opportunities and Challenges of iSchools

Il-Yeol Song, and Yongjun Zhu

Journal of Data and Information Science, Aug 2017

Due to the recent explosion of big data, our society has been rapidly going through digital transformation and entering a new world with numerous eye-opening developments. These new trends impact society and future jobs, and thus student careers. At the heart of this digital transformation is data science, the discipline that makes sense of big data. With many rapidly emerging digital challenges ahead of us, this article discusses perspectives on iSchools’ opportunities and suggestions in data science education. We argue that iSchools should empower their students with “information computing” disciplines, which we define as the ability to solve problems and create values, information, and knowledge using tools in application domains. As specific approaches to enforcing information computing disciplines in data science education, we suggest the three foci of user-based, tool-based, and application-based. These three foci will serve to differentiate the data science education of iSchools from that of computer science or business schools. We present a layered Data Science Education Framework (DSEF) with building blocks that include the three pillars of data science (people, technology, and data), computational thinking, data-driven paradigms, and data science lifecycles. Data science courses built on top of this framework should thus be executed with user-based, tool-based, and application-based approaches. This framework will help our students think about data science problems from the big picture perspective and foster appropriate problem-solving skills in conjunction with broad perspectives of data science lifecycles. We hope the DSEF discussed in this article will help fellow iSchools in their design of new data science curricula.
Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec

Yongjun Zhu, Erjia Yan, and Fei Wang

BMC Medical Informatics and Decision Making, Jul 2017

Background: Understanding semantic relatedness and similarity between biomedical terms has a great impact on a variety of applications such as biomedical information retrieval, information extraction, and recommender systems. The objective of this study is to examine word2vec’s ability in deriving semantic relatedness and similarity between biomedical terms from large publication data. Specifically, we focus on the effects of recency, size, and section of biomedical publication data on the performance of word2vec. Methods: We download abstracts of 18,777,129 articles from PubMed and 766,326 full-text articles from PubMed Central (PMC). The datasets are preprocessed and grouped into subsets by recency, size, and section. Word2vec models are trained on these subsets. Cosine similarities between biomedical terms obtained from the word2vec models are compared against reference standards. Performance of models trained on different subsets is compared to examine recency, size, and section effects. Results: Models trained on recent datasets did not boost the performance. Models trained on larger datasets identified more pairs of biomedical terms than models trained on smaller datasets in the relatedness task (from 368 at the 10% level to 494 at the 100% level) and the similarity task (from 374 at the 10% level to 491 at the 100% level). The model trained on abstracts produced results that have higher correlations with the reference standards than the one trained on article bodies (i.e., 0.65 vs. 0.62 in the similarity task and 0.66 vs. 0.59 in the relatedness task). However, the latter identified more pairs of biomedical terms than the former (i.e., 344 vs. 498 in the similarity task and 339 vs. 503 in the relatedness task). Conclusions: Increasing the size of the dataset does not always enhance the performance. Increasing the size of datasets can result in the identification of more relations of biomedical terms even though it does not guarantee better precision. As summaries of research articles, compared with article bodies, abstracts excel in accuracy but lose in coverage of identifiable relations.
Examining academic ranking and inequality in library and information science through faculty hiring networks

Yongjun Zhu, and Erjia Yan

Journal of Informetrics, May 2017

In this study, we examine academic ranking and inequality in library and information science (LIS) using a faculty hiring network of 643 faculty members from 44 LIS schools in the United States. We employ four groups of measures to study academic ranking, including adjacency, placement and hiring, distance-based measures, and hubs and authorities. Among these measures, closeness and hub measures have the highest correlation with the U.S. News ranking (r = 0.78). We study academic inequality using four distinct methods that include downward/upward placement, Lorenz curve, cliques, and egocentric networks of LIS schools and find that academic inequality exists in the LIS community. We show that the percentage of downward placement (68%) is much higher than that of upward placement (22%); meanwhile, 20% of the 30 LIS schools that have doctoral programs produced nearly 60% of all LIS faculty, with a Gini coefficient of 0.53. We also find cliques of highly ranked schools and a core/periphery structure that distinguishes LIS schools of different ranks. Overall, LIS faculty hiring networks have considerable value in deriving credible academic ranking and revealing faculty exchange within the field.
Adding the dimension of knowledge trading to source impact assessment: Approaches, indicators, and implications

Erjia Yan, and Yongjun Zhu

Journal of the Association for Information Science & Technology, May 2017

The objective of this paper is to systematically assess sources’ (e.g., journals and proceedings) impact in knowledge trading. While there have been efforts at evaluating different aspects of journal impact, the dimension of knowledge trading is largely absent. To fill the gap, this study employed a set of trading-based indicators, including weighted degree centrality, Shannon entropy, and weighted betweenness centrality, to assess sources’ trading impact. These indicators were applied to several time-sliced source-to-source citation networks that comprise 33,634 sources indexed in the Scopus database. The results show that several interdisciplinary sources, such as Nature, PLoS One, Proceedings of the National Academy of Sciences, and Science, and several specialty sources, such as Lancet, Lecture Notes in Computer Science, Journal of the American Chemical Society, Journal of Biological Chemistry, and New England Journal of Medicine, have demonstrated their marked importance in knowledge trading. Furthermore, this study also reveals that, overall, sources have established more trading partners, increased their trading volumes, broadened their trading areas, and diversified their trading contents over the past 15 years from 1997 to 2011. These results inform the understanding of source-level impact assessment and knowledge diffusion.
An investigation of the intellectual structure of opinion mining research

Yongjun Zhu, Meen Chul Kim, and Chaomei Chen

Information Research: An International Electronic Journal, Mar 2017

Introduction: Opinion mining has been receiving increasing attention from a broad range of scientific communities since the early 2000s. The present study aims to systematically investigate the intellectual structure of opinion mining research. Method: Using topic search, citation expansion, and patent search, we collected 5,596 bibliographic records of opinion mining research. Then, intellectual landscapes, emerging trends, and recent developments were identified. We also captured domain-level citation trends, subject category assignment, keyword co-occurrence, document co-citation network, and landmark articles. Analysis: Our study was guided by scientometric approaches implemented in CiteSpace, a visual analytic system based on networks of co-cited documents. We also employed a dual-map overlay technique to investigate epistemological characteristics of the domain. Results: We found that the investigation of algorithmic and linguistic aspects of opinion mining has been of the community’s greatest interest to understand, quantify, and apply the sentiment orientation of texts. Recent thematic trends reveal that practical applications of opinion mining such as the prediction of market value and investigation of social aspects of product feedback have received increasing attention from the community. Conclusion: Opinion mining is fast-growing and still developing, exploring the refinements of related techniques and applications in a variety of domains. We plan to apply the proposed analytics to more diverse domains and comprehensive publication materials to gain a more generalized understanding of the true structure of a science.
The use of a graph-based system to improve bibliographic information retrieval: System design, implementation, and evaluation

Yongjun Zhu, Erjia Yan, and Il-Yeol Song

Journal of the Association for Information Science & Technology, Feb 2017

In this article, we propose a graph-based interactive bibliographic information retrieval system—GIBIR. GIBIR provides an effective way to retrieve bibliographic information. The system represents bibliographic information as networks and provides a form-based query interface. Users can develop their queries interactively by referencing the system-generated graph queries. Complex queries such as “papers on information retrieval, which were cited by John’s papers that had been presented in SIGIR” can be effectively answered by the system. We evaluate the proposed system by developing another relational database-based bibliographic information retrieval system with the same interface and functions. Experiment results show that the proposed system executes the same queries much faster than the relational database-based system, and on average, our system reduced the execution time by 72% (for 3-node query), 89% (for 4-node query), and 99% (for 5-node query).

2016

Searching bibliographic data using graphs: A visual graph query interface

Yongjun Zhu, and Erjia Yan

Journal of Informetrics, Nov 2016

With the ever-increasing volume of scientific literature, improving the efficiency of searching bibliographic data has become an important issue. Because current bibliographic information retrieval systems offer little support for expressing complicated information needs, retrieving relevant bibliographic data is a demanding task. In this paper, we propose a visual graph query interface for bibliographic information retrieval. Through this interface, users can formulate bibliographic queries by interacting with a graph. Visual graph queries use a set of nodes with constraints and links among nodes to represent explicit and precise bibliographic information needs. The proposed visual graph query interface allows users to formulate several complex bibliographic queries (e.g., bibliographic coupling) that are not attainable in current major bibliographic information retrieval systems. In addition, the proposed interface requires fewer queries to complete everyday bibliographic search tasks.
A Model-Based Method for Information Alignment: A Case Study on Educational Standards

Namyoun Choi, Il-Yeol Song, and Yongjun Zhu

Journal of Computing Science and Engineering, Sep 2016

We propose a model-based method for information alignment using educational standards as a case study. Discrepancies and inconsistencies in educational standards across different states/cities hinder the retrieval and sharing of educational resources. Unlike existing educational standards alignment systems that only give binary judgments (either "aligned" or "not-aligned"), our proposed system classifies each pair of educational standard statements in one of seven levels of alignments: Strongly Fully-aligned, Weakly Fully-aligned, Partially-aligned^***, Partially-aligned^**, Partially-aligned^*, Poorly-aligned, and Not-aligned. Such a 7-level categorization extends the notion of binary alignment and provides a finer-grained system for comparing educational standards that can broaden categories of resource discovery and retrieval. This study continues our previous use of mathematics education as a domain, because of its generally unambiguous concepts. We adopt a materialization pattern (MP) model developed in our earlier work to represent each standard statement as a verb-phrase graph and a noun-phrase graph; we align a pair of statements using graph matching based on Bloom’s Taxonomy, WordNet, and taxonomy of mathematics concepts. Our experiments on data sets of mathematics educational standards show that our proposed system can provide alignment results with a high degree of agreement with domain expert’s judgments.
Big data and data science: what should we teach?

Il-Yeol Song, and Yongjun Zhu

Expert Systems, Aug 2016

The era of big data has arrived. Big data bring us the data-driven paradigm and enable us to take on new classes of problems we were not able to solve in the past. We are beginning to see the impacts of big data in every aspect of our lives and society. We need a science that can address these big data problems. Data science is an emerging discipline that was named to address the challenges that we are facing and will face in the big data era. Thus, education in data science is the key to success, and we need concrete strategies and approaches to better educate future data scientists. In this paper, we discuss general concepts on big data, data science, and data scientists and show the results of an extensive survey on current data science education in the United States. Finally, we propose various approaches that data science education should aim to accomplish.
Understanding the evolving academic landscape of library and information science through faculty hiring data

Yongjun Zhu, Erjia Yan, and Min Song

Scientometrics, Jun 2016

Using a 40-year (from 1975 to 2015) hiring dataset of 642 library and information science (LIS) faculty members from 44 US universities, this research reveals the disciplinary characteristics of LIS through several key aspects including gender, rank, country, university, major, and research area. Results show that genders and ranks among LIS faculty members are evenly distributed; geographically, more than 90% of LIS faculty members received doctoral degrees in the US; meanwhile, 60% of LIS faculty received a Ph.D. in LIS, followed by Computer Science and Education; in regard to research interests, Human–Computer Interaction, Digital Librarianship, Knowledge Organization and Management, and Information Behavior are the most popular research areas among LIS faculty members. Through a series of dynamic analyses, this study shows that the educational background of LIS faculty members is becoming increasingly diverse; in addition, research areas such as Human–Computer Interaction, Social Network Analysis, Services for Children and Youth, Information Literacy, Information Ethics and Policy, and Data and Text Mining, Natural Language Processing, Machine Learning have become increasingly popular. Predictive analyses are performed to discover trends on majors and research areas. Results show that the growth rate of LIS faculty members is linearly distributed. In addition, among faculty members’ Ph.D. majors, the share of LIS is decreasing while the share of Computer Science is growing; among faculty members’ research areas, the share of Human–Computer Interaction is on the rise.
Identifying Liver Cancer and Its Relations with Diseases, Drugs, and Genes: A Literature-Based Approach

Yongjun Zhu, Min Song, and Erjia Yan

PLoS ONE, May 2016

In biomedicine, scientific literature is a valuable source for knowledge discovery. Mining knowledge from textual data has become an increasingly important task as the volume of scientific literature grows at an unprecedented rate. In this paper, we propose a framework for examining a certain disease based on existing information provided by scientific literature. Disease-related entities that include diseases, drugs, and genes are systematically extracted and analyzed using a three-level network-based approach. A paper-entity network and an entity co-occurrence network (macro-level) are explored and used to construct six entity-specific networks (meso-level). Important diseases, drugs, and genes as well as salient entity relations (micro-level) are identified from these networks. Results obtained from the literature-based literature mining can serve to assist clinical applications.
How are they different? A quantitative domain comparison of information visualization and data visualization (2000–2014)

Meen Chul Kim, Yongjun Zhu, and Chaomei Chen

Scientometrics, Jan 2016

Information visualization and data visualization are often viewed as similar but distinct domains, and they have drawn an increasingly broad range of interest from diverse sectors of academia and industry. This study systematically analyzes and compares the intellectual landscapes of the two domains between 2000 and 2014. The present study is based on bibliographic records retrieved from the Web of Science. Using a topic search and a citation expansion, we collected two sets of data in each domain. Then, we identified emerging trends and recent developments in information visualization and data visualization, captured in intellectual landscapes, landmark articles, bursting keywords, and citation trends of the domains. We found that both domains have computer engineering and applications as their shared ground. Our study reveals that information visualization and data visualization scrutinized algorithmic concepts underlying the domains in their early years. Successive literature citing the datasets focuses on applying information and data visualization techniques to biomedical research. Recent thematic trends in the fields reflect that they are also diverging from each other. In data visualization, emerging topics and new developments cover dimensionality reduction and applications of visual techniques to genomics. Information visualization research is scrutinizing cognitive and theoretical aspects. In conclusion, information visualization and data visualization have co-evolved. At the same time, both fields are developing distinctively with their own scientific interests.

2015

Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

Erjia Yan, and Yongjun Zhu

Journal of Informetrics, Jul 2015

The objective of this study is to evaluate the performance of five entity extraction methods for the task of identifying entities from scientific publications, including two vocabulary-based methods (a keyword-based and a Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with keyword-based dictionary, and CRF with Wikipedia-based dictionary). These methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve are employed as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones, among which CRF with keyword-based dictionary has the best performance. Between the two vocabulary-based methods, the keyword-based one has a higher recall and the Wikipedia-based one has a higher precision. The findings of this study help inform the understanding of informetric research at a more granular level.
Dynamic subfield analysis of disciplines: an examination of the trading impact and knowledge diffusion patterns of computer science

Yongjun Zhu, and Erjia Yan

Scientometrics, Apr 2015

The objective of this research is to examine the dynamic impact and diffusion patterns at the subfield level. Using a 15-year citation data set, this research reveals the characteristics of the subfields of computer science from the aspects of citation characteristics, citation link characteristics, network characteristics, and their dynamics. Through a set of indicators including incoming citations, number of citing areas, cited/citing ratios, self-citations ratios, PageRank, and betweenness centrality, the study finds that subfields such as Computer Science Applications, Software, Artificial Intelligence, and Information Systems possessed higher scientific trading impact. Moreover, it also finds that Human–Computer Interaction, Computational Theory and Mathematics, and Computer Science Applications are among the subfields of computer science that gained the fastest growth in impact. Additionally, Engineering, Mathematics, and Decision Sciences form important knowledge channels with subfields in computer science.

2014

Dynamic faceted navigation in decision making using Semantic Web technology

Hak-jin Kim, Yongjun Zhu, Wooju Kim, and Taimao Sun

Decision Support Systems, May 2014

Categorization in decision making classifies decision makers’ experiences about the world and provides a guide to reach a goal. This implies that dynamically providing categories that reflect the given decision context can greatly enhance decision quality. This study discusses dynamic category selection in the Semantic Web environment, focusing on an implementation of a decision support system, the dynamic facet navigation system working with an ontology. Predefined fixed categories are provided to refine search results so that users can avoid complex queries and tedious review of search results, but they often yield information that is not meaningful because they never reflect differences in the search results. This paper proposes a dynamic category selection mechanism that uses the total gain ratio under a given ontology, and a reordering scheme for the resulting categories. Finally, it validates the proposed approach with a statistical analysis.