An Interview with Galen Novello, Lead Data Scientist @ ClearAccessIP
Big things are happening at ClearAccessIP, not the least of which is a major update to our machine learning (“ML”) algorithms. The team is on the cutting edge of ML and is applying hyperbolic embedding, a recently developed technique for representing data in curved space, to problems in patent analysis and valuation. Although the technique has been introduced and developed over only the past year, numerous academic articles and its early integration at companies such as Facebook signal that it is an important innovation on traditional machine learning models.
The following contains excerpts from an interview with Galen Novello, lead data scientist at ClearAccessIP:
Tell us about Hyperbolic Embedding as it relates to ClearAccessIP.
Hyperbolic embedding is a cutting-edge technique in machine learning. In recent years there have been a number of developments in natural language processing that involve representing characters, words, and sequences of words as vectors. Traditionally these vectors are embedded in a flat Euclidean space; hyperbolic embeddings place them in a curved space instead. The main advantage of this is that volumes in hyperbolic spaces grow exponentially instead of polynomially as a function of radius. Early research in the area has demonstrated that this allows for achieving results that are as good, or sometimes even better, using vectors with far fewer coordinates, which reduces computational time and memory requirements.
Very few researchers have even begun to explore using these concepts in their work, but we’ve been working closely with Stanford professors to apply these concepts to problems in patent analysis and valuation. These promise to be some of the first meaningful applications of this technology to find their way into industry.
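To make the volume-growth point in this answer concrete, here is a minimal Python sketch. It is not ClearAccessIP's implementation. It assumes the Poincaré ball model of hyperbolic space with curvature -1, one common choice in the research literature, and the function names are illustrative only. It computes the hyperbolic distance between two points and compares the area of a disk of radius r in the flat plane with the area of a disk of the same radius in the hyperbolic plane.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def euclidean_disk_area(r):
    """Area of a disk of radius r in the flat plane: grows polynomially (r^2)."""
    return math.pi * r * r


def hyperbolic_disk_area(r):
    """Area of a disk of radius r in the hyperbolic plane (curvature -1):
    2 * pi * (cosh(r) - 1), which grows roughly like e^r for large r."""
    return 2.0 * math.pi * (math.cosh(r) - 1.0)


if __name__ == "__main__":
    # Compare how much "room" each geometry offers as the radius increases.
    for r in (1, 2, 5, 10):
        print(r, euclidean_disk_area(r), hyperbolic_disk_area(r))
    # Points near the boundary of the ball are far apart in hyperbolic terms.
    print(poincare_distance((0.0, 0.0), (0.5, 0.0)))
    print(poincare_distance((0.9, 0.0), (0.0, 0.9)))
```

Because the available room grows so quickly with radius, tree-like or hierarchical data can be spread out with little distortion in only a few dimensions, which is the intuition behind the "far fewer coordinates" remark above.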
Why would a company want to utilize hyperbolic embedding for machine learning?
Hyperbolic embedding gives us better results, faster, because it takes less computing power to come up with those results. We all have limited resources and budgets, and getting equivalent output otherwise would be vastly more expensive, if not impossible, with traditional methodologies.
What unique challenges do you face in analyzing the patent corpus when compared with other datasets and fields?
There is no database of training data for patents, at least not at the scale we would require, so we have to figure it out on our own. Unsupervised learning methods are standard for training word and paragraph vectors, but these techniques are still new and far from perfect, so we have to be creative about how we combine and support these methods to produce better results.
Additionally, once we get the results, vetting them is difficult without an attorney’s perspective, and that’s assuming that attorneys even agree about what a good result might look like. Patent attorneys and patent agents in prosecution firms, litigation houses and licensing organizations expect different outputs. In fact, that’s a big part of the reason why we don’t have training data: for lots of the problems we attack, the “right” answer depends on who you’re talking to. Even showing seemingly objective measures and language in documents like enablement reports still leads to different perspectives on the breadth of results.
But for me, the lack of training data makes these problems particularly exciting. It forces innovation and original insight; we can’t necessarily just throw more data at a model to get it to improve results. To give a comparison, many computer vision models rely on huge data sets (such as ImageNet) that are the product of decades of image analysis and hand labeling. For marketing in social networks, there are years of human interactions to analyze, and there is a good metric for determining how well your personalized ads performed: clicks. But this doesn’t exist in patents to the same degree. Creating training data has been prohibitively expensive, but with cutting-edge tools like Snorkel, we may be able to do so in a more cost-effective manner.
Where do you see the future of ML?
Things are only going to get better as people have more time to research, develop and refine methods. At ClearAccessIP we’re excited to get our hands dirty and test out applications of new ideas and research. As such, the big advancements will come as more research is published on the topic and we’ve had time to test and see what works for us and what doesn’t. This research can only happen if venture capital and corporations continue to commit to increasing the availability of AI in our everyday life, and I’m confident that the benefits of this research will shine through.
*Galen Novello is the Lead Data Scientist at ClearAccessIP. A former high school math teacher, Mr. Novello graduated with his bachelor’s in pure math from Cal in 2005, where he was lucky to work closely with a number of professors. Most prominently, he assisted with the instruction of Math 104 and Honors 104 (real analysis and topology) with Michael Klass for 3 semesters. In the high school arena, Mr. Novello taught AP Calculus and Statistics, CCNA networking courses, and computer science, and prepared numerous students for a variety of math competitions. Seeking further applications of these subjects led him to pursue a self-directed study of machine learning, and he found it to be a beautiful combination of many of these tools. Guided by resources from Stanford, MIT and Caltech, he developed a deep understanding of the theory behind machine learning, which he now applies, along with current research and methods, to create solutions for problems in patent analysis and valuation.