Standing on the shoulder of giants

Making new discoveries usually requires knowledge from previous discoveries. Discoveries are published as text in research papers. Scientists find these research papers using keywords they plug into a search engine or when other research papers cite them. Being able to search by keywords has obviously made scientific research much easier than before.

However, the research papers themselves are still written in natural languages, and unless the authors avoid using grammatical features like the words “it” or “them” to avoid repeating nouns over and over again, it’s difficult to extract meaning from them.

So perhaps people shouldn’t be writing research papers in (e.g.) English and instead use something that is easier for computers to parse. The research paper would contain claims, some numeric indicator for how certain these claims are (or from where they came), how these claims came about (e.g. experiment (with instructions for how and experiment was set up and why), logical reasoning (with the actual chain of reasoning), etc.). Then you could, for example, search for claims, instead of just keywords.

Instead of just using the keywords “ocean plastic waste”, you could search for “plastic waste in oceans negatively impacts ocean fauna”. Instead of typing in this query, you would choose every noun, verb and other qualifiers from a searchable menu. You would be able to display this menu in any language. The words in the menu would all mean a specific thing, and anybody would be able to look at the definitions and properties of every word. Eventually, the system could be programmed to (e.g.) look at two different claims and judge whether these claims might be related, based on common subjects or objects, and generate potential new claims all by itself. For example, claim 1: “Thunderstorms may under certain circumstances cause blackouts” and claim 2: “Certain atmospheric conditions cause thunderstorms” would logically mean that “certain atmospheric conditions cause blackouts”, which is true. The system probably shouldn’t let users use the word “certain” though.

How hard would it be to make such a system? Well, it’s best to start in a single niche and slowly expand the system. Here’s a very simple (completely static) example that is a bit different in that it just generates natural English or Japanese text based on what values are selected in the menus: https://blog.qiqitori.com/cve_generator/

Saving the selections would be a first step in the right direction. Currently, only one “allows attackers to:” is allowed, but you would probably want to 1) allow an arbitrary-length chain 2) something indicating the author’s degree of certainty (“may allow”), 3) add other verbs, other nouns, and 4) the reasoning behind a set of these claims (it gets complex here, but in the end you just have to select subjects, verbs, and objects, ifs, etc.).

Perhaps selecting things from a menu would get old quick (unless you just do it for the main claims). So the computer could try its best to analyze a sentence and ask the user to confirm that its interpretation is correct. The preceding sentence could be analyzed like this, but you’ll see that this could be tricky and just selecting something from a menu might be the better option after all: So [potential solution for previous follows:] the computer [the software] could try its best [could potentially (but probably wouldn’t manage to do this perfectly)] to analyze a sentence [analyze a given input sentence] and [afterwards] ask the user to [force the user] confirm that its [the software’s] interpretation [of the given input sentence] is correct [confirm … or force the user to edit the interpretation using a menu]. As you can see, there is a lot of context involved, which makes it hard to derive (e.g.) claims from natural text.

As you can see in the CVE generator, having the facts makes it somewhat easy for the system to generate text in any language for humans to read and think about

Leave a Reply

Your email address will not be published. Required fields are marked *