Analyzing online activities can provide clues as to a person's chances of having cancer, Microsoft researchers showed in a paper published this week.
Specifically, the researchers demonstrated that by analyzing web query logs they were able to identify internet users who had pancreatic cancer even before they'd been diagnosed. The research is part of a larger trend in which data analytics is being used to improve healthcare.
The study suggests that "low-cost, high-coverage surveillance systems can be created to passively observe search behavior and to provide early warning for pancreatic cancer, and with extension of the methodology, for other challenging cancers," the researchers concluded. "Surveillance systems could also provide for automated capture and summarization of data and landmarks over time so as to provide patients with talking points in their discussion with medical professionals."
The researchers used proprietary logs of 9.2 million web queries on Microsoft's own Bing search engine, focusing exclusively on English-speaking people in the U.S. from October 2013 to May 2015. They tracked the characteristics of users' search and click activities to capture intentions, which provided the data used to construct a statistical model.
The study team, made up of Microsoft researchers Dr. Eric Horvitz and Dr. Ryen White and Columbia University graduate student John Paparrizos, said they anonymized the data but gave each search an identifier linked to the web browser. That enabled the extraction of search log histories.
First, the team identified searchers in logs of online search activity who made "special queries" that are suggestive of a recent diagnosis of pancreatic cancer. Those queries included phrases such as "Why did I get cancer in pancreas," and "I was told I have pancreatic cancer, what to expect."
The researchers were also able to use special Bing-created filters to weed out queries from users when more than 20% of their queries were health related, on the assumption that those searches were being performed by healthcare professionals. That left 7.2 million web queries to examine.
The team then went back "many months" before the initial queries were made to examine patterns of symptoms, as expressed in the users' earlier web searches.
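The two steps described above, finding a landmark query and then looking back through the same user's earlier searches, can be sketched roughly as follows. This is only an illustration, not the researchers' code: the function names, the substring matching, and the six-month default window are assumptions (the paper says only "many months"), and the marker phrases are the two examples quoted in this article.

```python
from datetime import datetime, timedelta

# Marker phrases taken from the examples quoted in the article.
# Matching by simple substring is an assumption made for illustration.
DIAGNOSIS_MARKERS = (
    "why did i get cancer in pancreas",
    "i was told i have pancreatic cancer",
)


def first_landmark_query(history):
    """Return the timestamp of the earliest query that suggests a recent
    diagnosis, or None. `history` is a list of (datetime, query_text)."""
    hits = [
        ts for ts, text in history
        if any(marker in text.lower() for marker in DIAGNOSIS_MARKERS)
    ]
    return min(hits) if hits else None


def lookback_queries(history, months=6):
    """Return the queries issued in the window before the landmark query,
    oldest first. The 6-month default is a hypothetical choice."""
    landmark = first_landmark_query(history)
    if landmark is None:
        return []
    window_start = landmark - timedelta(days=30 * months)
    return [
        text for ts, text in sorted(history)
        if window_start <= ts < landmark
    ]
```

In a sketch like this, the queries returned by `lookback_queries` would be the raw material from which symptom-related features are drawn for a statistical model.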
"We showed specifically that we can identify 5% to 15% of cases, while preserving extremely low false-positive rates," the researchers said in their paper. The false-positive rates ranged from one in 10,000 to one in 100,000.
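To see what those rates mean in practice, the arithmetic can be sketched as below. The function names are hypothetical, and the population sizes in the usage comments are made-up round numbers chosen only for illustration, not figures from the study.

```python
def expected_false_alarms(unaffected_users, one_in):
    """Expected number of unaffected users wrongly flagged, given a
    false-positive rate of one in `one_in`."""
    return unaffected_users / one_in


def expected_cases_caught(true_cases, percent_identified):
    """Expected number of real cases flagged, given the percentage of
    cases the model identifies."""
    return true_cases * percent_identified / 100


# Hypothetical population: 1,000,000 unaffected users, 1,000 true cases.
print(expected_false_alarms(1_000_000, 10_000))   # 100.0
print(expected_false_alarms(1_000_000, 100_000))  # 10.0
print(expected_cases_caught(1_000, 5))            # 50.0
print(expected_cases_caught(1_000, 15))           # 150.0
```

Under these assumed numbers, tightening the false-positive rate from one in 10,000 to one in 100,000 cuts the expected false alarms from 100 to 10 per million unaffected users.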
Unlike many other cancers, which can be slow growing, pancreatic cancer is among the most aggressive, meaning early diagnosis can lead to better outcomes.