How artificial intelligence enables superior research
From an office looking over Amsterdam’s docks, Georgios Tsatsaronis, Vice President of Data Science at Elsevier, sat down with us to explain the transformative potential of artificial intelligence (AI), machine learning, natural language processing (NLP) and data science. Having led Elsevier’s data science operations to expand its own business model, Tsatsaronis sees exciting potential for AI and big data analytics to drive measurable value for researchers working in academia and health. Here’s what he had to say:
How do you use big data analytics, AI & machine learning at Elsevier?
Elsevier has the good fortune of having some of the most important, high-quality scientific content in the world. The way we use big data analytics and especially machine learning and advanced NLP is to try to understand the content that is inside these publications and try to do advanced extractions of all types of entities. At the end of the day, we can offer this via our platforms to our users and help them do their job faster, easier, better and more efficiently.
How important is it in your operations?
It is extremely important because it is probably our best bet at this point in time to be able to offer to our users all the means they need to perform better, to understand their fields and the research, but also to understand their practice and industry better. It can help them to have an integrated view of where the professional expertise is heading. Through AI and machine learning and advanced NLP, we have very good chances of being able to offer the best service to our users in order to progress their careers, and science as a whole.
Can you give us some examples of how you use AI and big data?
One important use case relates to the funding landscape. Nowadays, a very large number of researchers are working across many different disciplines, and the primary basis for successful research is obtaining funding. The research landscape offers a lot of funding opportunities, yet researchers struggle to locate the best ones in a timely fashion and to make sure their research is spot on and responds to the scope of these opportunities. So, a major use case is that we mine, aggregate and link all the funding data we collect from publications and from the funders themselves, then we put all that in a knowledge graph, where our users have all this information at their fingertips and can understand which are the biggest funding opportunities out there.
In the health landscape, the use case involves our health graph, which is our effort to text mine and data mine both structured and unstructured content and combine it in a way that lets us start offering answers to the major questions of our researchers. For example: which are the major drugs that bind to this target, or which are the major symptoms of certain diseases? We can also go the next step and offer predictive analytics that could suggest, ‘Well, this is a new hypothesis that you can actually work with in your biology field for the coming couple of years; maybe this drug will be repurposed to actually address this disease.’
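To make the idea of a knowledge graph concrete, here is a minimal sketch in Python of how mined facts could be stored as subject-relation-object triples and then queried. It is an illustration only, not Elsevier's implementation: the `KnowledgeGraph` class, the relation names and every entity name (`drug_a`, `target_x`, `disease_1` and so on) are hypothetical placeholders.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store: each fact is (subject, relation, object)."""

    def __init__(self):
        # Index facts by (relation, object) so that questions of the form
        # "which subjects have this relation to this object?" are one lookup.
        self._index = defaultdict(set)

    def add(self, subject, relation, obj):
        self._index[(relation, obj)].add(subject)

    def subjects(self, relation, obj):
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._index.get((relation, obj), set()))


# Hypothetical facts, standing in for what text mining might extract.
graph = KnowledgeGraph()
graph.add("drug_a", "binds_to", "target_x")
graph.add("drug_b", "binds_to", "target_x")
graph.add("drug_c", "binds_to", "target_y")
graph.add("fever", "symptom_of", "disease_1")

# "Which drugs are linked to this target?"
print(graph.subjects("binds_to", "target_x"))

# "Which symptoms are recorded for this disease?"
print(graph.subjects("symptom_of", "disease_1"))
```

The same structure would hold funding facts, for example a hypothetical `funded_by` relation linking publications to funders. A production graph would also need entity disambiguation and provenance for each fact, which this sketch leaves out.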
How have these developments changed the way things used to be done?
In the past, researchers would manually visit the big content providers to answer queries on the newest publications around this drug or that protein. It was very tedious work, requiring a lot of reading and systematically handling very large volumes of publications. Nowadays, these publications are of the order of a few thousand per day, so it is almost impossible for any researcher to cope with this load. With the platforms Elsevier offers, researchers can quickly access the core information from these publications and figure out the ones that are relevant to their research and progress.
How do you use data enrichment?
We have a number of enrichments that we offer on top of primary content. Important enrichments pertain to understanding who funds research. We also do a lot of topic modelling and extraction of the important scientific concepts from our papers. For example, we have an initiative which we call Elsevier ScienceDirect Topic Pages. You can think of it as the Wikipedia for scientific concepts, but it is the wisdom of experts, not crowds. All the knowledge we mine comes from very high-quality books that we publish at Elsevier, and therefore it is very high-quality material for our researchers.
How have these developments changed the business model at Elsevier?
In recent years, there has been an addition to the way we work at Elsevier. Alongside our publishing business, we have expanded our business model to include a very large variety of information analytics solutions. This includes platforms that crunch and mine our content and offer it to users via our end points in a way that makes the information very concise and high-quality. On top of that, we have the extractions that we do with advanced NLP and machine learning. I would say we have actually added a lot of value to the business that we have traditionally done around publishing.
How do you address ethical issues in your use of data and technology?
Ethics in AI is a very important topic for us and has a lot of aspects. As far as the data science community at Elsevier is concerned, some of the most important aspects pertain to the ways we actually use the models to work with the experts, and to identifying the different kinds of biases that can be present in the training of such models. We work hard, together with large organizations like Stanford and Google Brain, to understand the best way to approach the problems arising from applying advanced AI in the modern world. We ask questions like, ‘Are we missing legal frameworks or policies that frame the risks arising from such publications?’
What is the role of the data scientist in a company?
In the last 10 to 15 years, the role of data scientist has become much more common. The modern data scientist needs an amalgamation of skills that spans from software development to a very good understanding of advanced analytics and machine learning techniques. Today’s data scientists need to be able to tackle end-to-end processing of data, understand business requirements and understand how we translate those requirements into solutions. They need to be able to find the best AI or machine learning solutions for a problem. By and large, in computer science, the biggest problem is how to map a business request onto one of the very well-known algorithms or solutions, because if I manage to do that, I will probably have a good way of addressing that problem. A data scientist needs these skills to be able to cope with the complexities of data, the volume of data and increasingly complex business use cases.
How important is the variety of data sources, including third party sources?
This is extremely important. If we take scientific publications, Elsevier is of course the largest scientific publisher in the world. However, researchers can’t neglect other types of content such as patents, clinical trials, raw data, white papers, technical reports or even preprints of publications. Putting all these together and integrating this knowledge gives researchers a much better view into the state of play of the different areas: what the top methods are, what the most impactful algorithms are and what the most important protocols currently in use are. So, you need to have a well-rounded picture to offer to your users. A lot of other publishers trust Elsevier with their content to process and put together their scientific information with metadata of their publications into their platforms. In that regard, you could say Elsevier is the World Bank of metadata and scientific information.
Elsevier and LexisNexis are both part of the RELX Group.