
The story of machine proofs — Part II

Table of contents

  • Introduction
  • Landscape part-1 (Theorem prover as a system)
  • Accomplishments
  • Theorem Proving Systems — Tradeoffs & Architectures
  • Landscape part-2 (Distribution of efforts across provers)
  • Conclusion

Introduction

A quick recap. In “The story of machine proofs — Part-1”, we covered how humans approached formal mathematical proofs, then introduced early attempts at machine proofs, and ended with the picture below (Figure-1) depicting our journey to Centaur Theorem Provers:

Figure-1: Centaur Theorem Provers Journey

We’ll dig into both the machine-only and human-in-the-loop provers but here’s what you can expect to take away from part-2 of this series:

  1. Landscape of theorem provers;
  2. Comparison: Where they differ from one another;
  3. Status check: How close machines have gotten to humans in the automated reasoning race

One point to note is that humans have been in the loop for a while now, so it is more the degree of their involvement that has changed over the years. As noted back in 1969 in Semi-Automated Mathematics [Guard et al.], the idea isn’t new.


“Semi-automated mathematics is an approach to theorem-proving which seeks to combine automatic logic routines with ordinary proof procedures in such a manner that the resulting procedure is both efficient and subject to human intervention in the form of control and guidance. Because it makes the mathematician an essential factor in the quest to establish theorems, this approach is a departure from the usual theorem-proving attempts in which the computer unaided seeks to establish proofs.”

But it has required deliberate attention and effort to generate human-readable proofs, or at least meaningful intermediate representations at critical stages of the pipeline, in order to allow humans to guide the machine and co-create proofs.


Let’s start with some wins from computer-assisted proofs in the last few decades:

A note on the SAT solver mentioned above: it is a very important concept in the context of theorem provers. SAT stands for SATisfiability. The idea is that if we can replace the variables in a boolean expression or formula by TRUE or FALSE such that the formula evaluates to TRUE, then we consider the formula satisfiable. This turns out to be an NP-complete problem: there is currently no known algorithm (not yet proven impossible, though; see P vs NP) that can solve all possible input instances in polynomial time. That said, a SAT solver uses efficient heuristics and approximations and is heavily used to solve problems in the semiconductor space, AI and automated theorem proving. Just within EDA (Electronic Design Automation), where I spent over 15 years, I have seen it actively used for combinational equivalence checking, logic optimization, circuit delay computation, bounded model checking and test pattern generation.
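To make the satisfiability idea concrete, here is a minimal brute-force checker, a sketch only: it tries every assignment, which is exponential in the number of variables, whereas real SAT solvers (DPLL/CDCL engines) prune this search with clever heuristics.

```python
from itertools import product

def is_satisfiable(formula, variables):
    """Try every TRUE/FALSE assignment of the variables.
    Exponential in len(variables); real SAT solvers avoid
    this blowup with learned clauses and heuristics."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return True
    return False

# (a OR b) AND (NOT a OR b) is satisfiable (e.g. a=False, b=True) ...
print(is_satisfiable(lambda v: (v["a"] or v["b"]) and (not v["a"] or v["b"]),
                     ["a", "b"]))  # True
# ... while a AND NOT a is not
print(is_satisfiable(lambda v: v["a"] and not v["a"], ["a"]))  # False
```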

Back to theorem provers …

Landscape part-1

“Names don’t constitute knowledge!”

To get at the landscape, I boiled the ocean a bit and compiled several efforts around automated reasoning, from theorem provers and verifiers to related tools and languages, risking some redundancy and scope creep. But at this point I’d rather overdo it than curate a best-of list, so that we get to see the field in its entirety. Here’s a flat list (Figure-2), organized alphabetically, with most if not all efforts of the last few decades.

Really? A perfect 10 x 16 table with 160 cells? I either got lucky or just kept finding new ones until I hit diminishing returns. (The latter!)

Figure-2: Efforts in automated theorem provers

Feynman would have taken one look at that table and asked, “What do you want me to do with 160 names? Names don’t constitute knowledge!” Here’s a short “blast” from the past that is well worth the watch:

Even if I say those names are indices into strategies and tactics for theorem proving, such as graph-based resolution, induction, model generation, proof planning, semi-automatic interactive interfaces, programmability, intuitionistic predicate logic, multi-valued logic, constraint solvers and so on, I wouldn’t be adding much value; they would still be just names, perhaps a tad more descriptive.

So how then do we slice and dice the above list? Maybe by categories of logical systems and reasoning engines (resolution, interactive, induction, satisfiability checking, model checking, constraint solvers…)? Or something like what the “theorem prover museum” (yes, museum!) does? Their goal is to archive all systems with source code and metadata. They have compiled some 30-odd theorem provers dating back to 1955 (Logic Theorist). The highlighted cells in Figure-3 below are the ones from the museum.

Figure-3: Highlighted cells from the theorem prover museum

But another, more interesting compilation is Freek Wiedijk’s “The Seventeen Provers of the World”, in which he asked the authors of the top active (as of 2005) theorem provers to each provide a machine proof of the same problem: the irrationality of sqrt(2).

While I had explored different approaches to the Pythagorean theorem across time in part-1, going all the way back to the Babylonian tablet (Plimpton 322), what I really like about Freek’s approach is the structure and template of targeted questions that he provided to the authors, making it easier to compare different approaches to the same problem. We will spend a considerable portion of part-2 discussing his paper. If you’re curious which theorem provers he picked (or got responses for), see Figure-4 below. Only 4 of his 17 appear in the museum above — IMPS, LEGO, Omega and OTTER.

Figure-4: [2005] The Seventeen Provers of the World

Theorem Prover — A system

Before we get into why or how theorem provers differ from one another, let us frame the prover as a system with an input, an output and a process. Figure-5 shows a blackbox that is fed a theorem; the output is a proof.

Figure-5: TP System

Let us expand further on the system notion with a concrete example:

Theorem: “The sum of two even integers is even”.

So the above theorem is our input. Wait! The words theorem, lemma, proposition and corollary are sometimes used interchangeably, as the differences are mildly subjective. The word theorem is typically reserved for a strong or major result that has been proven true. A lemma is a relatively minor result, popularly acknowledged as a helping or auxiliary theorem. A proposition is even less important, but interesting still. And a corollary is a consequent result that can be readily deduced from a theorem. For our purposes, let’s go with just a plain statement as input to the engine before we elevate its status to that of a theorem (Figure-6).

Figure-6

Now to prove the statement “The sum of two even integers is even”, we need to first understand everything about that statement, i.e. surface all prior knowledge about the terms and concepts involved (Figure-7).

Figure-7

Prior knowledge in this example includes definitions and properties of integers, some of which are assumed to be true, i.e. axioms. Here’s what we know about integers (Figure-8):

Figure-8: Theorem Description
Figure-9

That statement was deceptively simple! The single input has grown to three inputs, now including axioms and definitions as shown in Figure-9, and will grow to four (Figure-10) when we pull in previously proven theorems.

Figure-10
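With the statement, axioms and definitions gathered, we have exactly the inputs a modern prover consumes. As a taste of what that looks like, here is a sketch in Lean; treat it as illustrative: the `ring` tactic and general style follow Mathlib conventions, and I have not run it through a checker.

```lean
-- "n is even" encoded as ∃ k, n = 2 * k
theorem even_add_even (a b : Int)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n := by
  obtain ⟨k, hk⟩ := ha   -- unpack a = 2 * k
  obtain ⟨m, hm⟩ := hb   -- unpack b = 2 * m
  exact ⟨k + m, by rw [hk, hm]; ring⟩   -- 2k + 2m = 2(k + m)
```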

Now we need to know what to do with that knowledge, i.e. strategies and clear rules on how to use what we know about the statement (Figure-11).

Figure-11

Let us shift attention to the right-hand side, i.e. the output. A proof assistant can not only construct a proof but also check the validity of a given proof (is the statement true or false?) and additionally generate counterexamples. So the system can have multiple outputs as well. Systems focused on checking are called proof checkers or verifiers.

Figure-12

The system in Figure-12 above holds true for all theorem provers. But what exactly happens inside an automated theorem prover (ATP)? Figure-13 shows a flowchart of the ATP process from TPTP (Thousands of Problems for Theorem Provers), which begins with checking syntax and semantics, then checks the consistency and satisfiability of the axioms, establishes a logical consequence, and finally proves the theorem and verifies the proof.

ATP Flow chart
Figure-13: ATP Flowchart: http://tptp.org/Seminars/TPTPWorldTutorial/ATPProcess.html

So where do all provers differ? The answer to that lies in the fundamental building blocks of AI — Representation and Reasoning.

  1. Representation: What formal language do we use for mapping the world to the machine? Each formalism has its intrinsic axioms as the starting point for the prover. Custom axioms, however, can help the prover advance more rapidly with the resolution process in the next step.
  2. Reasoning: What algorithms do we use to manipulate the formal expressions? Additionally, how the rules are generated (by human experts, machine learning or a hybrid) matters. “The prover has several techniques at its disposal to process the theorems and the system description. Some examples, among many others, are: Rewriting — substitute terms to simplify computations; Skolemization — elimination of existential quantifiers (∃), leaving only universal ones (∀); Tableaux — break down a theorem into its constituent parts and prove or refute them separately”.
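To make “rewriting” concrete, here is a toy bottom-up rewriter over tuple-encoded binary expressions. The rule set is made up for illustration; production engines use ordered, confluence-checked rules (e.g. Knuth-Bendix completion).

```python
def rewrite(expr):
    """Simplify an expression tree bottom-up with three toy rules:
    x + 0 -> x,  x * 1 -> x,  x * 0 -> 0.
    Expressions are tuples like ("add", "x", 0) or plain atoms."""
    if not isinstance(expr, tuple):
        return expr  # atom: variable name or constant
    op, a, b = expr[0], rewrite(expr[1]), rewrite(expr[2])
    if op == "add" and b == 0:
        return a
    if op == "mul" and b == 1:
        return a
    if op == "mul" and b == 0:
        return 0
    return (op, a, b)

# (y * 1) + 0 simplifies all the way down to y
print(rewrite(("add", ("mul", "y", 1), 0)))  # 'y'
```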

The blackbox in our once-simple system essentially breaks down as shown in Figure-14 where reasoning is split into the logical formalism and the reasoning engine.

Figure-14

We’ll use the above template to walk through Freek’s seventeen theorem provers.

[Sidenote: While I have reframed the authors’ responses to Freek’s question so as to fit the above template, I would like to acknowledge that the underlying content is entirely from the paper and any error in translation/representation is solely my responsibility. I would also suggest that you refer to the paper for a richer and more detailed explanation of each solution.]

Do you have to go through all 17 provers? Not necessarily. But if you’re curious, it can be interesting. My suggestion would be to skim the notes in the centre table and just glance at the input and output to get a sense of what the machines take in and spit out. For some of the provers further down, the proofs are too long (IMPS, Metamath, Nuprl) to print here, so they have been compressed to fit the template and will be hard to read anyway. I still wanted them here to give readers a sense of the formats and readability. If after a few it seems overwhelming, feel free to jump ahead to the Accomplishments section.

How to read the picture?

The input on the left-hand side includes the statement to be proven along with relevant definitions, both specified in the language (the green REPRESENTATION box). The output proof typically uses the same language. The LOGIC (yellow box) and ENGINE (blue box) together give us the rules and approaches to operate on the input. For instance, in the PVS prover the inputs are written in Common Lisp using a higher-order logic system. Operations such as induction, rewriting, symbolic model checking or simplification using decision procedures are applied to the input to generate proof chains. In PVS, the user can interact by creating specification files, type-checking them and providing proofs.

1. HOL: Higher Order Logic Theorem Prover

Figure-15: HOL

Figure-16 shows the family tree of provers going from LCF (Logic of Computable Functions) to HOL (Higher Order Logic).

Figure-16: HOL Family Tree

2. Mizar

Figure-17: Mizar


3. PVS: Prototype Verification System

PVS is an interactive theorem prover. The philosophy behind the PVS logic has been to exploit mechanization in order to augment the expressiveness of the logic.

Figure-18 PVS


4. Coq

Figure-19: Coq


5. OTTER

OTTER: Organized Techniques for Theorem-Proving and Effective Research.

Figure-20: OTTER


6. Isabelle

Figure-21 Isabelle


7. Alfa/Agda

Figure-22: Agda


8. ACL2

ACL2 (“A Computational Logic for Applicative Common Lisp”) is a software system consisting of a programming language, an extensible theory in a first-order logic, and an automated theorem prover.

Figure-23: ACL2

9. PhoX

Figure-24: PhoX


10. IMPS

Figure-25: IMPS


11. Metamath

Figure-26: Metamath


12. Theorema

Figure-27: Theorema


13. LEGO

Figure-28: LEGO


14. Nuprl

Figure-29: Nuprl


15. Omega

Figure-30: Omega

16. B method

Figure 31: B Method


17. Minlog

Figure-32: Minlog


Accomplishments: Formalizations by provers

In the tables below (Figures 33–35), we can see an impressive list of formalizations from some of the above provers.

Figure 33 — Source: http://www.cs.ru.nl/~freek/comparison/comparison.pdf
Figure 34 — Source: http://www.cs.ru.nl/~freek/comparison/comparison.pdf
Figure 35 — Source: http://www.cs.ru.nl/~freek/comparison/comparison.pdf

Theorem Proving Systems: Tradeoffs & Architectures

Tradeoffs

As shown in the sections above, even among just these 17 provers, some use higher-order logic, first-order logic or ZFC set theory, while others use Tarski-Grothendieck set theory, a non-conservative extension of ZFC that adds Tarski’s axiom to help prove more theorems. Some use Hilbert-style axiomatic systems while others use Jaskowski’s system of natural deduction, where proofs rest on assumptions that are freely introduced but discharged under certain conditions. Some use classical logic, while others use intuitionistic or constructive logic, a restriction of classical logic in which the law of excluded middle and double-negation elimination have been removed. And one uses LUTINS (Logic of Undefined Terms for Inference in a Natural Style), a variant of Church’s simple type theory.

Type theory was originally introduced in contrast to set theory to avoid paradoxes and is known to be more expressive. But expressiveness tends to impact automation negatively. As noted in this article “A Survey on Formal Verification Techniques for Safety-Critical Systems-on-Chip”: (Figure-36)

…increase in expressiveness has the inverse effect on automation. The more expressive a logic is, the less automated it can be. This means that tools for higher-order logic usually work iteratively, where the user is responsible for providing simpler proofs to help guide the verification process.

Figure-36— Source: https://www.mdpi.com/2079-9292/7/6/81

With each choice there is a tradeoff. But to tease these tradeoffs out reliably and quickly, we need help: the turnaround time for humans to spot an error, correct it and re-check can be far too long. Around 1900, Bertrand Russell, a mathematician, logician and philosopher, found a bug (Russell’s paradox) in set theory, which eventually led to the discovery/invention of type theory. A hundred years later, around 2000, Vladimir Voevodsky (a Fields medalist) found an error in his own paper (written seven years earlier!), which led to the discovery/invention of univalent foundations. Voevodsky notes the following in an article titled The Origins and Motivations of Univalent Foundations:

In 1999–2000, again at the IAS, I was giving a series of lectures, and Pierre Deligne (Professor in the School of Mathematics) was taking notes and checking every step of my arguments. Only then did I discover that the proof of a key lemma in my paper contained a mistake and that the lemma, as stated, could not be salvaged. Fortunately, I was able to prove a weaker and more complicated lemma, which turned out to be sufficient for all applications. A corrected sequence of arguments was published in 2006. This story got me scared. Starting from 1993, multiple groups of mathematicians studied my paper at seminars and used it in their work and none of them noticed the mistake. And it clearly was not an accident. A technical argument by a trusted author, which is hard to check and looks similar to arguments known to be correct, is hardly ever checked in detail.

Voevodsky went on to propose homotopy type theory, providing univalent foundations for mathematics based on the relations between homotopy theory and type theory. This helped formalize a considerable amount of mathematics in proof assistants such as Coq and Agda, moving the needle in ATP.

Choices such as set theory versus type theory are so core to automated reasoning that this field needs a much deeper understanding of the foundations of mathematics, and hence a lot more mathematicians working alongside computer scientists, logicians and philosophers to accelerate progress. In the very nicely written article Will Computers Redefine the Roots of Math?, the author describes a recent cross-disciplinary collaboration, of which we need more:

Along with Thierry Coquand, a computer scientist at the University of Gothenburg in Sweden, Voevodsky and Awodey organized a special research year to take place at IAS during the 2012–2013 academic year. More than thirty computer scientists, logicians and mathematicians came from around the world to participate. Voevodsky said the ideas they discussed were so strange that at the outset, “there wasn’t a single person there who was entirely comfortable with it.” The ideas may have been slightly alien, but they were also exciting. Shulman deferred the start of a new job in order to take part in the project. “I think a lot of us felt we were on the cusp of something big, something really important,” he said, “and it was worth making some sacrifices to be involved with the genesis of it.”

Architectures

One trend to note is that the complexity of theorem provers has slowly been growing from simple systems to systems of systems, i.e. pulling in the best reasoners from multiple sources. But this also means that maintaining such systems (for performance, robustness, accuracy, reliability) is getting harder. Let’s take a sneak peek at the architectures of some of them: ACL2, Omega, LEAN, ALLIGATOR, iGeoTutor/GeoLogic, Matryoshka and KeYmaera.

ACL2

We saw ACL2 earlier. It has an IDE supporting several modes of interaction, provides a powerful termination-analysis engine and includes automatic bug-finding methods. Most notably, ACL2 (and its precursor Nqthm, the Boyer-Moore Theorem Prover) won the 2005 ACM Software System Award, given to software systems that have had a lasting influence. Figure-37 shows a high-level flow.

Figure-37: ACL2 Architecture Source: https://www.cs.utexas.edu/~moore/acl2/current/manual/index.html?topic=ACL2____ACL2_02System_02Architecture

Omega

We saw Omega earlier as well. Interesting to note here (Figure-38) is how external reasoners (such as OTTER and SPASS for first-order logic, or TPS and LEO for higher-order logic) are invoked as needed.

Figure-38: Omega Architecture Source: https://page.mi.fu-berlin.de/cbenzmueller/papers/C11.pdf

LEAN

LEAN is an open-source theorem prover developed at Microsoft Research and CMU. It is based on the logical foundations of CIC (the Calculus of Inductive Constructions) and has a small trusted kernel. It brings tools and approaches together into a common framework that allows for user interaction, bringing ATP and ITP closer. Figure-39 shows some of the key modules.

Figure-39— Source: https://leanprover.github.io/presentations/20150717_CICM/#/

ALLIGATOR

ALLIGATOR is a natural deduction theorem prover based on dependent type systems. It breaks goals down and generates proof objects (text files with assertions leading to a conclusion), an interesting part of the architecture shown in Figure-40. As noted in the paper “The ALLIGATOR Theorem Prover for Dependent Type Systems: Description and Proof Sample”, the authors identify the following key advantages of having proof objects:

  1. Reliability: Satisfies de Bruijn’s criterion that an algorithm can check the proof objects easily.
  2. Naturalness: Based on natural deduction (closer to human reasoning — less axiomatic)
  3. Relevance: Proof objects can identify proofs which are valid but spurious (do not consume their premises)
  4. Justification of behavior: Proof objects provide direct access to the justifications that an agent has for the conclusions and the interpretations that it constructs, i.e. more explainability.
Figure-40— Source: https://www.aclweb.org/anthology/W06-3918.pdf

iGeoTutor and GeoLogic

An important goal of efforts like iGeoTutor (see Automated Geometry Theorem Proving for Human-Readable Proofs, Figure-41) and GeoLogic is to generate more human-readable proofs in geometry. GeoLogic is a logic system for Euclidean geometry that lets users interact with the logic system and visualize the geometry (equal distances, a point lying on a line, and so on). There is also a detailed presentation by Pedro Quaresma on geometric automated theorem proving, going back to H. Gelernter’s 1959 effort (see Realization of a geometry-theorem proving machine).

Figure 41 — GeoTutor Architecture

Matryoshka

Matryoshka fuses two lines of research: automated theorem proving and interactive theorem proving. These systems are based on the premise that first-order (FO) provers are best suited for performing most of the logical work, but they go on to enrich superposition (resolution + rewriting) and SMT (Satisfiability Modulo Theories) with higher-order (HO) reasoning while preserving their desirable properties (Figure-42 and Figure-43).

Figure-42 — Source: https://matryoshka-project.github.io/
Figure-43- Source: https://matryoshka-project.github.io/

KeYmaera

KeYmaera is a theorem prover for differential dynamic logic, used to verify properties of hybrid systems (systems with mixed discrete and continuous dynamics). Figure-44 shows the three big parts of the system: the axiomatic core, the tactical prover and the user interface. The architecture gives us a sense of the many moving parts.

Figure-44— KeYmaera Architecture: https://www.ls.cs.cmu.edu/KeYmaeraX/documentation.html

What I find missing, though, is a single picture describing the entire design space of theorem provers: one that captures the key dimensions and situates automated and interactive theorem provers, Lean-based provers, resolution and tableau provers, SMT and SAT solvers, model checkers, equational reasoning, quantifier-free combinations of first-order theories, conflict-driven theory combination and the more recent machine-learning-based approaches. Maybe it exists, but I could not find one easily, nor am I currently qualified (as an ATP enthusiast) to weave one together (perhaps a future project). I did like the taxonomy in the 1999 paper “A taxonomy of theorem-proving strategies”, but it is dated and limited to first-order theorem proving (Figure-45).

Figure-45 — Source: http://profs.sci.univr.it/~bonacina/papers/LNAI1600-1999taxonomy.pdf

Landscape (part 2) — Distribution of efforts

One of the surveys on theorem provers showed that a significant percentage of efforts are first-order logic provers (Figure-46) and that the most popular calculus of reasoning is deduction (Figure-47).

Figure 46- Source: https://arxiv.org/pdf/1912.03028.pdf
Figure 47- Source: https://arxiv.org/pdf/1912.03028.pdf

Look at the variety of languages used across provers. They range from ML (MetaLanguage) and its dialects like OCaml (Objective Categorical Abstract Machine Language), Prolog, Pascal, Mathematica, and Lisp and its dialect Scheme, to more esoteric specification languages like Gallina (exclusive to Coq). Other theorem provers not mentioned here use Scala, C/C++, Java, SML or Haskell. As for the reasoning engines, some use functional programming, some follow a proof plan whose goals are automatically broken down into sub-goals, some are interactive, while others pull in multiple external reasoning systems or model checkers.

The same survey also broke down the programming designs (Figure-48) and languages (Figure-49) across provers, indicating that close to a quarter of the efforts use functional programming, with OCaml as the language of choice.

Figure 48— Source: https://arxiv.org/pdf/1912.03028.pdf
Figure 49 — Source: https://arxiv.org/pdf/1912.03028.pdf

In addition to the above differences, as noted in “A Survey on Theorem Provers in Formal Methods”, there are criteria like de Bruijn’s criterion and Poincaré’s principle that help determine how good a prover is. According to the de Bruijn criterion, “the correctness of the mathematics in the system should be guaranteed by a small checker”; that is, the prover should generate proofs that can be certified by a simple, separate checker known as the proof kernel. A small proof kernel improves reliability. Figure-50 shows which provers satisfy the criterion and which don’t.
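The spirit of the de Bruijn criterion can be sketched in a few lines: a tiny, independent checker that accepts a proof only if every step is justified. The tuple encoding and the modus-ponens-only logic below are toy assumptions for illustration, not any real kernel.

```python
def check_proof(axioms, steps):
    """Tiny proof kernel. Each step is (formula, justification), where
    justification is "axiom" or ("mp", premise). A step is accepted only
    if it is an axiom or follows by modus ponens from earlier steps.
    Implications are encoded as tuples: ("->", p, q) means p implies q."""
    proved = set()
    for formula, just in steps:
        if just == "axiom":
            if formula not in axioms:
                return False
        else:
            _, premise = just
            # Need both the premise and the implication premise -> formula
            if premise not in proved or ("->", premise, formula) not in proved:
                return False
        proved.add(formula)
    return True

axioms = {"p", ("->", "p", "q")}
good = [("p", "axiom"), (("->", "p", "q"), "axiom"), ("q", ("mp", "p"))]
bad = [("q", ("mp", "p"))]  # modus ponens without first proving the premise
print(check_proof(axioms, good), check_proof(axioms, bad))  # True False
```

However large and heuristic the search for a proof is, only this small checker needs to be trusted.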

Figure-50 — Source: https://arxiv.org/pdf/1912.03028.pdf

And a theorem prover satisfies the Poincaré principle if it can automatically prove the correctness of calculations and other small, trivial tasks. Figure-51 shows which provers satisfy the principle and which don’t.
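A toy illustration of the Poincaré principle: trivial computations should be discharged by calculation, not by hand-written proof steps. The restricted `eval` here is a stand-in assumption for a prover’s internal computation rule.

```python
def check_arith(claim):
    """Discharge claims like '2 + 2 == 4' by computing both sides.
    eval() is restricted to bare arithmetic (no builtins); a real
    prover evaluates inside its trusted logic instead."""
    left, right = claim.split("==")
    env = {"__builtins__": {}}
    return eval(left, env) == eval(right, env)

print(check_arith("2 + 2 == 4"))       # True
print(check_arith("3 * 37 == 111"))    # True
print(check_arith("2 ** 10 == 1000"))  # False
```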

Figure-51 — Source: https://arxiv.org/pdf/1912.03028.pdf

Here’s a comparison (Figure-52) of the 17 provers from Freek’s paper. A “+” means the feature is present and a “-” means it is not.

Figure-52 — Source: http://www.cs.ru.nl/~freek/comparison/comparison.pdf

In summary, here’s the distribution of theorem provers as per the survey (Figure-53).

Figure 53- Source: https://arxiv.org/pdf/1912.03028.pdf

Conclusion

Let’s return to the promised takeaways:

  1. Landscape: We saw a huge dump of 160 efforts out of which we looked at 17 provers with some of their proofs. We further looked at the architectures of a handful of provers noting the emerging complexity of software systems.
  2. Comparison: We compared the logical systems, the representation and reasoning engines across multiple provers calling out some success criteria and dominating efforts.
  3. Status Check: To address this, let us revisit the comparison table (Figure-) from part-1. But a quick digression first:

In the article, “The Work of Proof in the Age of Human–Machine Collaboration“, Stephanie Dick quotes Larry Wos and Steve Winker as having said this in their contribution to a 1983 conference on ATP:

“Proving theorems in mathematics and in logic is too complex a task for total automation, for it requires insight, deep thought, and much knowledge and experience.”

Larry Wos joined Argonne National Laboratory in 1957 and began using computers to prove mathematical theorems in 1963. What is amazing is that he is still working on it! Wos and his colleague at Argonne were the first to win the Automated Theorem Proving Prize in 1982.

With that, let’s get back to the original table below (Figure-54):

Figure-54: Part-1 status

And our updated table with an extra column to the right below: (Figure-55)

Figure-55 Part 2 Status

At a high level, it appears that theorem provers have gotten better with logical systems, explainability, geometric reasoning and model checking. There remain issues with graceful failure when decidability and satisfiability cannot be determined. There isn’t much yet on learning and “imagining/generating scenarios”. To borrow from Wos and Winker’s statement above, we are still missing the experience (learning), the insight and the deep-thought (what I call imagination) bits. And that is precisely what we will dig into in our next and final part. We’ll venture into the role that machine learning, transformers, deep neural networks and commonsense reasoning are beginning to play in the ATP/ITP space. Stay tuned …



The story of machine proofs — Part II was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.


Building a small knowledge graph using NER

Here, we implement a knowledge graph from a Wikipedia actors dataset. The article [1] by Analytics Vidhya has been heavily referenced for this, but we have made improvements in the form of:

  1. Data Preprocessing — removed punctuation and stop words
  2. Named Entity Recognition — this has helped in constructing a more meaningful graph with less noise.

Using Named Entity Recognition has helped us obtain the top-occurring actors, writers, studio names, etc.

import re
import string
import pandas as pd
import spacy
from tqdm import tqdm
from spacy.matcher import Matcher
import networkx as nx
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
tqdm._instances.clear()
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Importing data from csv and storing in a pandas dataframe

data_dir = "/content/drive/My Drive/data/wiki_sentences_v2.csv"
sentences = pd.read_csv(data_dir)
sentences.shape
(4318, 1)
sentences.head()

Here are some sample sentences containing the words “bollywood” and “contemporary” together:


sentences[sentences['sentence'].str.contains('bollywood') & sentences['sentence'].str.contains('contemporary')]

Entity Pairs Extraction Function

def get_entity(sent):
    ent1 = ""  # subject entity
    ent2 = ""  # object entity

    prev_token_text = ""  # text of the previous token
    prev_token_dep = ""   # dependency tag of the previous token

    prefix = ""
    modifier = ""

    for tok in nlp(sent):
        # Go in only if it is not a punctuation token, else next word
        if tok.dep_ != "punct":
            # Check if the token is a compound word
            if tok.dep_ == "compound":
                prefix = tok.text
                # If the previous token was also a compound, chain them
                if prev_token_dep == "compound":
                    prefix = prev_token_text + " " + prefix
            # Check if the token is a modifier
            if tok.dep_.endswith("mod"):
                modifier = tok.text
                # If the previous token was a compound, chain it
                if prev_token_dep == "compound":
                    modifier = prev_token_text + " " + modifier

            # Checking if subject
            if tok.dep_.find("subj") == True:
                ent1 = modifier + " " + prefix + " " + tok.text
                prefix = ""
                modifier = ""
            # Checking if object
            if tok.dep_.find("obj") == True:
                ent2 = modifier + " " + prefix + " " + tok.text
                prefix = ""
                modifier = ""

            # Update state for the next iteration
            prev_token_text = tok.text
            prev_token_dep = tok.dep_

    return [ent1.strip(), ent2.strip()]

Now let’s obtain entity pairs from a sentence:

[ent1, ent2] = get_entity("the film had 200 patents")
print("Subj : {a}, obj : {b}".format(a=ent1, b=ent2))
get_entity("the film had 200 patents")
Subj : film, obj : 200  patents
['film', '200 patents']
entity_pairs = []
for i in tqdm(sentences['sentence']):
    entity_pairs.append(get_entity(i))
tqdm._instances.clear()
100%|██████████| 4318/4318 [00:41<00:00, 103.43it/s]
subjects = [x[0] for x in entity_pairs]
objects = [x[1] for x in entity_pairs]

Relation Extraction

Next, the relation between nodes is extracted. The hypothesis is that the main verb, or ROOT word, forms the relation between the subject and the object.

def get_relation(sent):
    doc = nlp(sent)
    # Initialise the matcher with the shared vocab
    matcher = Matcher(nlp.vocab)
    # Define the pattern: a ROOT verb, optionally followed by a
    # preposition, an agent and an adjective
    pattern = [{'DEP': 'ROOT'},
               {'DEP': 'prep', 'OP': '?'},
               {'DEP': 'agent', 'OP': '?'},
               {'POS': 'ADJ', 'OP': '?'}]
    # Add the pattern to the matcher (spaCy v2 API)
    matcher.add("matcher_1", None, pattern)
    # Apply the matcher to the doc
    matches = matcher(doc)

    # The matcher returns a list of (match_id, start, end) tuples; the
    # span from start to end in the doc contains the relation
    span = doc[matches[0][1]:matches[0][2]]
    return span.text

Below we observe a problem. The relation for the second sentence should be "couldn't complete" in order to capture the semantics correctly, but the extractor fails to do so.

get_entity("John completed the task"), get_relation("John completed the task")
(['John', 'task'], 'completed')
get_entity("John couldn't complete the task"), get_relation("John couldn't complete the task")
(['John', 'task'], 'complete')
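This failure can be patched with a small post-processing step: if the parse contains a token with the `neg` dependency, prefix the relation with "not". Below is a minimal sketch of the idea which, for illustration only, takes the parse as a hand-written list of (text, dependency) pairs rather than a spaCy Doc; in the pipeline above you would build that list as `[(tok.text, tok.dep_) for tok in nlp(sent)]`.

```python
def negate_relation(relation, tagged_tokens):
    """Prefix the relation with 'not' when the parse contains a `neg` token.

    `tagged_tokens` is a list of (text, dependency) pairs.
    """
    if any(dep == "neg" for _, dep in tagged_tokens):
        return "not " + relation
    return relation

# "John couldn't complete the task" -- dependency tags sketched by hand
tokens = [("John", "nsubj"), ("could", "aux"), ("n't", "neg"),
          ("complete", "ROOT"), ("the", "det"), ("task", "dobj")]
print(negate_relation("complete", tokens))  # not complete
```

This keeps `get_relation` unchanged and only adjusts its output when a negation marker is present anywhere in the sentence.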

Anyway, we move on. Let’s get the relations for the entire dataset.

relations = [get_relation(i) for i in tqdm(sentences['sentence'])]
tqdm._instances.clear()
100%|██████████| 4318/4318 [00:41<00:00, 102.82it/s]

Let’s look at the most frequently occurring subjects.

pd.Series(subjects).value_counts()[:10]
it      267
film    222
        147
he      141
they     66
i        64
this     55
she      41
we       27
films    25
dtype: int64

We observe that these words are mostly generic, hence of not much use to us.

pd.Series(relations).value_counts()[:10]
is          418
was 340
released 165
are 98
were 97
include 81
produced 51
's 50
made 46
used 43
dtype: int64

Not surprisingly, “is” and “was” form the most common relations, simply because they are the most common verbs. We want more meaningful relations to be more prominent.

Data Pre-Processing

Let’s see what happens if we pre-process the data by removing stopwords and punctuation marks.

def isNotStopWord(word):
    return word not in stopwords.words('english')

def preprocess(sent):
    # Drop any parenthesised or bracketed text
    sent = re.sub(r"[\(\[].*?[\)\]]", "", sent)
    words = word_tokenize(sent)
    # Remove punctuation marks, except '.', '?' and '!'
    punctuations = '"#$%&\'()*+,-/:;<=>@\\^_`{|}~'
    words = map(lambda x: x.translate(str.maketrans('', '', punctuations)), words)

    # Lowercase the words and remove the stopwords
    words = map(str.lower, words)
    words = filter(isNotStopWord, words)
    return ' '.join(words)

preprocessed_sentences = [preprocess(i) for i in tqdm(sentences['sentence'])]
tqdm._instances.clear()
100%|██████████| 4318/4318 [00:06<00:00, 659.03it/s]

Getting entity pairs from the preprocessed sentences

entity_pairs = []
for i in tqdm(preprocessed_sentences):
    entity_pairs.append(get_entity(i))
tqdm._instances.clear()
100%|██████████| 4318/4318 [00:38<00:00, 111.16it/s]
relations = [get_relation(i) for i in tqdm(preprocessed_sentences)]
tqdm._instances.clear()
100%|██████████| 4318/4318 [00:39<00:00, 109.65it/s]
pd.Series(relations).value_counts()[:10]
released    173
include      81
produced     62
directed     52
made         50
film         48
used         42
became       35
began        35
included     34
dtype: int64

We see above that only the important relations remain, “released” being the most common; “is”, “was” and other trivial words are gone. This fulfils our goal of eliminating noise.

What if we only want named entities to be present? Entities like actors, films, studios, composers, etc.

entity_pairs2 = entity_pairs
relations2 = relations
# Keep a relation only when both its source and target entities are present
entity_pairs3 = []
relations3 = []
for i in tqdm(range(len(entity_pairs2))):
    if entity_pairs2[i][0] != '' and entity_pairs2[i][1] != '':
        entity_pairs3.append(entity_pairs2[i])
        relations3.append(relations2[i])
tqdm._instances.clear()
100%|██████████| 2851/2851 [00:00<00:00, 482837.79it/s]

As we observe, the number of relations decreases to 2851; previously it was around 4,300.

Named Entity Recognition Using Spacy

source = []
target = []
edge = []
for i in range(len(entity_pairs)):
    # Named entities for the source, joined into a single string
    doc_source = nlp(entity_pairs[i][0]).ents
    str_source = [str(word) for word in doc_source]
    doc_source = ' '.join(str_source)
    # Named entities for the target, joined into a single string
    doc_target = nlp(entity_pairs[i][1]).ents
    str_target = [str(word) for word in doc_target]
    doc_target = ' '.join(str_target)
    # Keep the triple if either side contains a named entity
    if doc_source != '' or doc_target != '':
        edge.append(relations[i])
        source.append(entity_pairs[i][0])
        target.append(entity_pairs[i][1])

We had obtained entity pairs before, but they contained many irrelevant words, and preprocessing narrowed them down quite a bit. Now we obtain the named-entity pairs, which form the source and target respectively, so each triple looks like (source, target, edge). We prune this further by removing every pair in which neither the source nor the target is a named entity, and keep the rest. The relations extracted earlier form the edges.

Now, we find the most popular sources, targets and relations.

print("###################  Most popular source entities  ###################### \n", pd.Series(source).value_counts()[:10])
print("###################  Most popular target entities  ###################### \n", pd.Series(target).value_counts()[:10])
print("###################  Most popular relations  ###################### \n", pd.Series(relations).value_counts()[:20])
###################  Most popular source entities  ###################### 
                          165
film                       64
khan                       10
first film                  9
movie                       6
indian film                 6
two certificate awards      5
music                       5
series                      5
films                       4
dtype: int64
###################  Most popular target entities  ######################
                               242
warner bros                      7
national film awards             5
film                             4
positive box office success      4
surviving studio india           4
1980s                            3
british film institute           3
undisclosed roles                3
positive reviews                 3
dtype: int64
###################  Most popular relations  ######################
released      173
include        81
produced       62
directed       52
made           50
film           48
used           42
became         35
began          35
included       34
films          32
composed       31
become         30
written        27
shot           27
set            26
received       26
introduced     25
considered     22
called         22
dtype: int64

Constructing the Knowledge Graph

We first store the knowledge graph in a pandas DataFrame. The graph will be directed.

knowledge_graph_df = pd.DataFrame({'source': source, 'target': target, 'edge': edge})
knowledge_graph_df.head()
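As an aside, the directed (source, target, edge) structure needs nothing more exotic than a dictionary; a minimal sketch with made-up triples (not the actual dataset):

```python
from collections import defaultdict

# Hypothetical triples in the same (source, target, edge) shape as the DataFrame
triples = [
    ("khan", "big boss", "hosted"),
    ("khan", "2010 film veer", "released"),
    ("warner bros", "film", "produced"),
]

graph = defaultdict(list)  # directed: source -> [(edge, target), ...]
for src, tgt, rel in triples:
    graph[src].append((rel, tgt))

print(graph["khan"])  # [('hosted', 'big boss'), ('released', '2010 film veer')]
```

networkx gives us the same structure plus layout and drawing for free, which is why we use it below.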

Let’s see what we get when the source entity is “khan”.

knowledge_graph_df[knowledge_graph_df['source']=="khan"]
# MultiDiGraph because it is a directed graph
G = nx.from_pandas_edgelist(knowledge_graph_df[knowledge_graph_df['source'] == "khan"],
                            source='source', target='target',
                            edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(10, 10))
pos = nx.spring_layout(G, k=1.0)  # k defines the distance between nodes
nx.draw(G, pos, with_labels=True, node_size=2500)
plt.show()

We observe that “khan” points to “actress katrina kaif”, “big boss” and “2010 film veer”.

Pretty obvious which “khan” is indicated here.

This is what we get when the target is “warner bros”:

knowledge_graph_df[knowledge_graph_df['target'] == "warner bros"]

G = nx.from_pandas_edgelist(knowledge_graph_df[knowledge_graph_df['target'] == "warner bros"],
                            source='source', target='target',
                            edge_attr=True, create_using=nx.MultiDiGraph())
plt.figure(figsize=(10, 10))
pos = nx.spring_layout(G, k=1.0)  # k defines the distance between nodes
nx.draw(G, pos, with_labels=True, node_size=2500)
plt.show()

So, even for a very small dataset, we get some interesting graph structures that match our intuition. We can develop this further by taking a larger dataset and calculating graph measures such as degree centrality.
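Degree centrality itself is simple to compute by hand: a node’s degree divided by the n - 1 other nodes it could connect to (networkx offers the same via `nx.degree_centrality`). A minimal sketch on a toy undirected edge list:

```python
def degree_centrality(edges):
    """Degree centrality for an undirected edge list: degree / (n - 1)."""
    nodes = {n for edge in edges for n in edge}
    degree = {n: 0 for n in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(nodes)
    return {node: d / (n - 1) for node, d in degree.items()}

# Toy graph: 'film' touches all three other nodes, so its centrality is 1.0
edges = [("film", "khan"), ("film", "warner bros"), ("film", "awards")]
print(degree_centrality(edges)["film"])  # 1.0
```

On a real knowledge graph, high-centrality nodes such as “film” would surface the hub entities of the corpus.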

References

  1. https://www.analyticsvidhya.com/blog/2019/10/how-to-build-knowledge-graph-text-using-spacy/



Building a small knowledge graph using NER was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/building-a-small-knowledge-graph-using-ner-296930592bcf?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/building-a-small-knowledge-graph-using-ner

The Four Jobs of the Data Scientist

So, what do you do for a living? Sometimes, the answer to that question can feel like, “everything!” Well, for the Data Scientist, an extreme sense of being a “jack of all trades” is common. In fact, four such trades can be defined that a top-quality Data Scientist will iterate through during any one project.

Originally from KDnuggets https://ift.tt/35ExFrF

source https://365datascience.weebly.com/the-best-data-science-blog-2020/the-four-jobs-of-the-data-scientist

The Best Tool for Data Blending is KNIME

These are the lessons and best practices I learned in many years of experience in data blending, and the software that became my most important tool in my day-to-day work.

Originally from KDnuggets https://ift.tt/3bv3gQq

source https://365datascience.weebly.com/the-best-data-science-blog-2020/the-best-tool-for-data-blending-is-knime

KDnuggets News 21:n02 Jan 13: Best Python IDEs and Code Editors; 10 Underappreciated Python Packages for Machine Learning Practitioners

Best Python IDEs and Code Editors You Should Know; 10 Underappreciated Python Packages for Machine Learning Practitioners; Top 10 Computer Vision Papers 2020; CatalyzeX: A must-have browser extension for machine learning engineers and researchers

Originally from KDnuggets https://ift.tt/35SDMZN

source https://365datascience.weebly.com/the-best-data-science-blog-2020/kdnuggets-news-21n02-jan-13-best-python-ides-and-code-editors-10-underappreciated-python-packages-for-machine-learning-practitioners

GPT-3 And Code Generation: AI-enabled Instant Software Development

Cooking noodles like ramen used to be a long and arduous task. Now we have instant noodles. Just add hot water into a bowl and add your…

Via https://becominghuman.ai/gpt-3-and-code-generation-ai-enabled-instant-software-development-270795077cbd?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/gpt-3-and-code-generationai-enabled-instant-software-development

How to stop your chatbot giving the wrong answers

You’ve built your chatbot, you’ve carefully and tirelessly trained and tested it, and you’re finally ready to launch it to go live — hoorah! But after monitoring its performance over a period of time after go-live, you notice that some user questions return incorrect intents (so give the wrong answers), despite the fact there’s training data in your model that should result in the correct intent being returned. You also spot that some user questions return the correct intent with very low confidence — so the answer isn’t presented to the user. This leaves everyone very frustrated.

As a chatbot builder, you want to understand what influence each utterance has (and even what influence each word in each utterance has) on your model’s performance. Most of the popular NLP providers use a variety of different models and algorithms that are virtually impossible for chatbot builders to fathom. And it’s also very difficult and time-consuming to do a deep dive of your training data and analyse its learning value. But you really do need to do a thorough analysis if you want to improve the performance of your bot.

Here are three techniques: two to measure your bot’s performance and one for visualising the results of its performance.


1. Cross-validation testing

This involves preparing a separate labelled dataset of utterances that your model hasn’t been trained on (typically real user questions), and then testing it against your chatbot to assess your model’s performance and see if there are any gaps in your bot’s knowledge. Cross-validation testing has its advantages and disadvantages though:

Advantage:

  • If you have a large file from many user interactions over many months, you are likely to have a great dataset that represents all the intents/subject areas you wish to cover.

Disadvantages:

  • If you are at the early stage of your chatbot model, you may still be in the process of creating new intents, and splitting or merging some existing intents. Each time this happens, you’ll have to update your cross-validation file, and this can be time-consuming.
  • Keeping this dataset out of the training data can be a challenge. Naturally, if you know the dataset, you’ll check the test results, and where a test fails you may be tempted to train your model on it, with the potential to overfit it to the cross-validation dataset.
  • It will only be as good as the data in it.
  • Ideally, cross-validation testing will give meaningful and accurate results, but there’s no way of accurately predicting what your bot will encounter in the future, and you will constantly have to update your cross-validation dataset. You’ll also have to keep testing on your cross-validation dataset to monitor for any regression.
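The test loop itself is the simple part. A minimal sketch, with a hypothetical keyword-matching `predict_intent` stub standing in for a real NLP provider call:

```python
# Hypothetical stand-in for the chatbot's intent classifier
def predict_intent(utterance):
    return "contact_us" if "telephone" in utterance.lower() else "other"

# Held-out labelled utterances the model was NOT trained on
validation_set = [
    ("what is your telephone number", "contact_us"),
    ("do you have a phone line", "contact_us"),
    ("what are your opening hours", "other"),
]

correct = sum(1 for text, intent in validation_set
              if predict_intent(text) == intent)
accuracy = correct / len(validation_set)
print(f"accuracy: {accuracy:.2f}")  # 0.67 -- "phone line" is missed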


2. K-fold cross-validation testing

K-fold cross-validation testing solves some of the issues mentioned above. It generates test data from your training data automatically, by removing some of the training data from its intent and using it as test data. It evaluates your training data by dividing it into a number of sub-samples (or folds) and then using one fold at a time as a testing dataset against the rest of the training data. For example, you might divide your training data into 10 equal folds (you could use more or fewer, but 10 is common), and then perform 10 separate tests, each time holding back one of the folds and testing your model on the data in the held-back fold. This means that all training data becomes test data at some point. This technique helps you to see weaknesses in your data. Again, it has its advantages and disadvantages.

K-Fold visualisation

Advantages:

  • If you don’t have test data, and you are in the early stage of model building, it generates the test data itself.
  • K-fold is a known technique.

Disadvantages:

  • It’s time-consuming
  • K-fold doesn’t work well with low levels of training data (fewer than 200 samples per class or intent)
  • Each change to the training data affects the fold the initial training data was in, generating variation in your test results due to the randomisation of the folding. This then makes it difficult to understand the learning value of your new changes.
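The fold mechanics described above can be sketched without any library (scikit-learn’s `KFold` does the same job more robustly): shuffle once, slice into k folds, and hold one fold out per round.

```python
import random

def k_fold_splits(data, k=10, seed=42):
    """Yield (train, test) pairs; each item lands in the test set exactly once."""
    items = list(data)
    random.Random(seed).shuffle(items)
    for i in range(k):
        test = items[i::k]  # every k-th shuffled item, offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=5):
    assert len(test) == 4 and len(train) == 16
    assert not set(train) & set(test)
```

Fixing the shuffle seed, as here, avoids the fold-randomisation variance mentioned in the last disadvantage above.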

Here are some Jupyter notebooks that show Python examples of cross-validation and K-fold.

3. Visualising your test results through a confusion matrix

This technique allows you to visualize the performance of the intent predictions in the form of a table. To build a confusion matrix, you’d use a test validation dataset. Each piece of data in your dataset needs a predicted outcome (the intent that the data should return) and an actual outcome (the intent that the data actually returns in your model). From the predicted and actual outcomes, you will get a count of the number of correct and incorrect predictions.

These numbers are then organised into a table, or matrix, where:

Confusion matrix
  • True positive (TP) for correctly predicted event values
  • False positive (FP) for incorrectly predicted event values
  • True negative (TN) for correctly predicted no-event values
  • False negative (FN) for incorrectly predicted no-event values

IMPORTANT: You’ll want to think about how you categorise the right and wrong predictions that fall below your chatbot’s confidence threshold.

These values can then be used to calculate more classification metrics like precision and recall, and F1 scores. Precision highlights any false-positive problems you may have in your model and recall highlights any false-negative problems. The F1 score is an overall measure of a model’s accuracy. The confusion matrix technique has its own advantages and disadvantages.
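The metric calculations reduce to a few ratios over the four counts; a minimal sketch with made-up counts:

```python
def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)  # how many flagged intents were right
    recall = tp / (tp + fn)     # how many right intents were flagged
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy

p, r, f1, acc = classification_metrics(tp=80, fp=20, tn=90, fn=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.80 recall=0.89 f1=0.84
```

In a multi-intent bot you would compute these per intent and average them, but the per-cell arithmetic is the same.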

Confusion matrix with precision, recall and F1

Advantages:

  • It provides a good visual of your bot’s performance (see diagram above as an example)

Disadvantages:

  • It is a very time-consuming task
  • Calculations for these additional metrics are quite complex
  • It can be quite challenging to interpret unless you’re familiar with the statistics involved
  • It won’t necessarily help you see why an utterance isn’t working

Here’s a link to some further information on the confusion matrix approach

In summary, the cross-validation, K-fold and confusion matrix methods for diagnosing and improving a chatbot are very time-consuming, and difficult to understand if you’re not a statistician. Also, you’re likely to find a lot of issues that need fixing, and you’ll probably want to fix them all at once — but this in turn will generate challenges as you try to understand which changes worked/helped your model and which didn’t. You also need to think about the possible further regression (the ripple effect of changing data in one part of the model modifying the performance of the rest of the model) of your chatbot, and it’s very difficult to identify and unpick these newly created problems.

Modern tools are arriving on the market

Some companies choose to rely on the help of a tool such as QBox. QBox is another way of testing your chatbot to help improve its accuracy and performance in a matter of minutes. It analyses and benchmarks your chatbot training data and gives insight by visualising where your chatbot does and doesn’t perform well, and why. You can see your chatbot’s performance at model level, intent level and utterance level, even word-by-word, in clear and easy-to-understand visuals.

QBox intent diagram — Visualisation of intent utterances

For example, we built a very simple banking bot based on common customer-banking questions. But I noticed that some user questions requesting a contact telephone number didn’t always return the correct intent with good confidence — despite having training data to that effect. I was able to diagnose the problem quickly by doing a deep dive into the actual training data and getting word-by-word analysis, using the visuals that QBox provides:

QBox utterance influence analysis — training-word density

This helped me to diagnose why the training data wasn’t working as expected.

From the colour-coded words (on the left), I had a good understanding of the word density, and could see I had several instances of “telephone” in a different intent, and a lack of variations on asking for a telephone number in my contact_us intent.

QBox utterance influence analysis — Words’ influence on prediction

You can get an even deeper word-by-word analysis by using the QBox Explain feature (bottom illustration). This chart shows each word according to its influence on the prediction. In this illustration, you can easily see the biggest influencer to return an incorrect intent is “telephone”, and the lack of influence this word has on my Contact intent. This information clearly showed what the problem was and what I needed to do to fix it: add a little more training data to my Contact intent to strengthen the concept of asking for a telephone number.

Once you make changes to your chatbot, you can test your model again through QBox to validate those changes and see what effect they’ve had on your model overall — whether good or bad.

To find out more, visit QBox.ai.

Conclusion

Nobody would dispute the importance of ensuring your chatbot performs well to secure customer satisfaction. More recently, we’re seeing that it’s becoming a business requirement to understand the NLP model performance, analysing regression, ensuring clear KPIs are set and building your model development as part of a DevOps (MLOps) strategy. Whichever technique you use, it should help you on the path to improving performance, and ultimately your confidence in your model.



How to stop your chatbot giving the wrong answers was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/how-to-stop-your-chatbot-giving-the-wrong-answers-50ffcd7a22c9?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/how-to-stop-your-chatbot-giving-the-wrong-answers

Top 20 Most Popular Programming Languages For 2021 and Beyond

This list of the most popular programming languages will help your business make the right choice for its next project.

Below is a listing of the most popular programming languages for 2021; it will help you choose the best language for web and mobile app development.

Many programming languages exist, and new ones are constantly emerging. The major concern is which one dominates the market, and which is the most popular and best suited for web and mobile app development.

It is not so simple to rank the most popular programming languages of 2021, but the task can be done by weighing various metrics, such as technology popularity, trends, career prospects, open-source adoption, etc.

IEEE Spectrum publishes a list of the top programming languages for 2021, ranked by popularity.

List Of Top Programming Languages

Based on the metrics mentioned above, I have compiled this list of the most popular coding languages, so you can confidently pick any of them for your next project. But always remember to choose the one that suits your project profile.

Big Data Jobs

If you are finding it difficult to choose the best-suited language for your mobile app, you can consult a good mobile app development company, whose experts can help you pick the best technology for your app in just a few hours.

1. JavaScript

JavaScript is used for both backend and frontend programming. It is one of the most popular coding languages and is widely used within the Internet of Things (IoT). JavaScript works well with other languages, is very versatile, and is updated annually.

69.7% of developers prefer working with the JavaScript programming language.

Image Source: Stack Overflow

Instagram, Accenture, Airbnb and Slack are popular organizations using JavaScript.

JavaScript Vital features

  • Simple Client-side Calculations
  • Offer greater control
  • Handling Dates and Time
  • Generating HTML Content
  • Detecting the User’s Browser and OS


2. Python

Python, a widely accepted and popular language for 2021, is used for developing web applications, desktop apps, media tools, network servers, machine learning and more. It offers outstanding library support, control capabilities, and robust integration. If you are running a startup business, I recommend Python for your app.

YouTube, Instagram, and Pinterest are popular apps built using Python.

230,906 live websites employ Python, and an additional 600,600 sites have used Python historically.

Python Vital Features

  • Simple to code
  • Extensible feature
  • GUI Programming Support
  • Object-Oriented Language
  • Python is Portable language

3. Java

Java is an object-oriented programming language with in-built open-source libraries easily accessible for users to pick from. It is simple to handle and offers excellent documentation and community support. With this technology you can build cross-platform apps, games, Android apps, embedded systems, server apps, websites, and more.

Netflix, Google, Pinterest, Instagram are a few popular names using Java.

9490 firms reportedly use Java in their tech stacks.

Java vital features

  • Architecture neutral
  • Native threads
  • Great libraries
  • Dynamic compilation
  • Automatic memory management

4. PHP

It is one of the most commonly used programming languages for mobile apps that require database access. It is an open-source language employed for command-line scripting, server-side scripting, and coding applications.

PHP is widely utilized for creating eCommerce apps, dynamic web applications, content-heavy apps, and mobile apps. This language is highly flexible and can be quickly embedded with HTML or HTML5.

Wikipedia, Facebook, and Yahoo are very popular websites developed using PHP.

34,713,433 live websites make use of the PHP programming language.

PHP vital features

  • Error handling facility
  • powerful Zend Engine
  • Case Sensitive
  • Platform Independent
  • Brand-new Spaceship and Null Coalescing operators

5. Kotlin

Kotlin is the most commonly used programming language for building modern Android apps. It has the potential to overtake other languages, like Java, for making high-performing, excellent apps.

Trello, Evernote, Coursera are some popular apps built using Kotlin.

Kotlin vital features

  • Interoperable with Java
  • Backed by JetBrains
  • Expressive Syntax
  • Null Safety

6. HTML

HTML (Hypertext Markup Language) is the ideal choice for building web-fronted or location-based apps for mobile devices. It is a markup language that employs tags to structure and display the content of a webpage. HTML5, the latest update, adds media elements, multi-platform functionality across different programs, and quick market deployment features.

Google Docs and Google Drive are the famous Google apps using HTML.

HTML vital features

  • Allows adding Images, video, and audio
  • Hypertext can be added to the text
  • Support Geolocation service
  • Multi-platform support

7. C++

C++ is one of the most popular coding languages for mobile app development. It is an object-oriented, general-purpose language with generic and low-level memory manipulation features. It is recommended for developing gaming apps, GUI-based applications, real-time mathematical simulations, and more. C++ is successful in cloud computing apps, as it can swiftly adapt to changing hardware or ecosystems.

Google, Accenture, Walmart and Telegram are very popular brands using C++. If you want to make an app like Telegram, I recommend hiring mobile app developers, as they are able to offer innovative solutions.

1245 firms reportedly use C++ in their tech stacks.

C++ vital features

  • Case-sensitive
  • Compiler-Based
  • DMA (Dynamic Memory Allocation)
  • Platform or Machine Independent/ Portable
  • Mid-level programming language.
  • Structured programming language.
  • Rich Library and memory management

8. Rust

Rust is one of the most beloved programming languages, sponsored by Mozilla. Its syntax is similar to C++. It grants agility and security without reducing performance, though Rust may be more complicated than some of the other languages on this list.

Dropbox, Yelp, Sentry, and Postmates are famous names using Rust.

Rust vital features

  • Supports functional and imperative procedural paradigms
  • Safe, concurrent, and practical language
  • Algebraic data types
  • Efficient C bindings
  • Pattern matching

9. TypeScript

TypeScript is a top programming language created and maintained by Microsoft. It is used to build JavaScript apps for client-side and server-side execution (as with Node.js or Deno). TypeScript can help you avoid the bugs developers usually run into when writing JavaScript, by type-checking the code.

Slack, Vox Media, Accenture, and Stack are famous brands using TypeScript.

3079 corporations use TypeScript in their tech stacks.

TypeScript vital features

  • Supports other JS libraries
  • Build-in Support for JavaScript Packaging
  • Static Type-checking
  • Class and Module Support

10. CSS

This programming language is used for defining the presentation of Web pages, including fonts, colours, and layout. It allows you to adapt the presentation to different types of devices, such as small screens, large screens, or printers. CSS can be utilized with any XML-based markup language.

CSS vital features

  • Advanced selectors
  • Content-visibility property
  • Contain-intrinsic-size property
  • Fix layout problem

11. Swift

Swift is an open-source technology specially designed to work with OS X, iOS, and tvOS platforms. The programming language is scalable, flexible, and can easily adopt a secure programming pattern to add smart features to any app.

Lyft, LinkedIn, Hipmunk, and more are the famous names utilizing Swift.

2037 corporations reportedly use Swift in their tech stacks.

Swift vital features

  • Closures united with function pointers
  • Robust error handling built-in
  • Functional programming patterns (map and filter)
  • Structs that support extensions, methods, and protocols

12. C#

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework. It is used to perform a broad range of tasks and objectives, mainly on Windows, and to create web apps, mobile apps, desktop apps, games and more.

Delivery Hero, Microsoft, and Accenture are the most popular companies using C#.

2049 businesses reportedly use C# in their tech stacks.

C# vital features

  • Structured programming language
  • Component oriented
  • Interoperability
  • Rich Library
  • Object-oriented

13. Perl

Perl remains one of the most popular programming languages in 2021. It is a general-purpose language originally produced for text manipulation and now employed for a broad range of tasks including web development, system administration, GUI development, network programming, and more.

Amazon, Booking.com, MIT are popular companies using Perl.

335,008 live websites are using Perl. Additionally, 1,123,725 sites used Perl historically.

Perl vital features

  • Database integration
  • C/C++ library interface
  • Easily extendible and embeddable
  • Unicode support
  • Text manipulation

14. Scala

It is one of the top programming languages, combining both the object-oriented and functional programming paradigm. Scala enables developers to make great use of usual JVM features and Java libraries. A top-rated mobile app development company prefers using Scala for building robust apps.

Twitter, Airbnb, Thatcham, Tumblr, Netflix are famous organizations using Scala.

Scala vital features

  • Lazy computation
  • Singleton object
  • String interpolation
  • Higher-order function
  • Concurrency control

15. Scheme

Scheme is one of the oldest multi-purpose programming languages. It takes a minimalistic approach to systems application development, aiming to enlarge a small core with compelling language extensions. Scheme’s syntax is simple to learn and ideal for teaching functional programming.

Scheme is utilized by large internet entities such as Reddit and Google.

16. SQL

Structured Query Language (SQL) is employed for communicating with, assessing, and manipulating the relational databases behind most applications. SQL is designed to meet particular standards, both ISO and ANSI, covering referential integrity, the relational data model, data manipulation, data query, and data access control.

SQL vital features

  • Data Manipulation Language (DML)
  • Transaction Control Language
  • Embedded SQL
  • Advanced SQL
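The DML statements listed above can be tried out directly. Below is a minimal sketch that runs `INSERT`, `UPDATE`, and `SELECT` against an in-memory SQLite database using Python's standard `sqlite3` module; the `users` table and its contents are illustrative assumptions, not part of the article.

```python
import sqlite3

# In-memory SQLite database for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# DML: INSERT new rows, then UPDATE one of them.
conn.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
conn.execute("INSERT INTO users (name) VALUES (?)", ("Grace",))
conn.execute("UPDATE users SET name = 'Ada Lovelace' WHERE name = 'Ada'")

# DML: query the result back in insertion order.
rows = conn.execute("SELECT name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace',), ('Grace',)]
```

The same statements would work, with minor dialect differences, on any ISO/ANSI-conforming database.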

17. R Programming Language

Designed for statistical analysis and for producing high-quality graphics, R is one of the top programming languages for ad hoc analysis and for examining large datasets. You can also use R for open-source data mining projects.

R Programming Language vital features

  • Adequate data handling and storage facility
  • Offers an integrated collection of tools for data analysis
  • Proffers graphical facilities
  • Provides operators for calculations on lists, arrays, vectors and matrices

18. Golang (Go)

Go is one of the newest programming languages to achieve rapid growth, as it tackles some of the most challenging computational problems with a comparatively straightforward approach. Golang combines the advantages of C, such as being a compiled, statically typed language.

Uber, Google, and Pinterest are recognized names using Go.

2304 firms employ Go in their tech stacks.

Go vital features

  • Garbage collection
  • Structural typing
  • CSP-style concurrency
  • Built-in Testing

19. Ruby

Ruby is an object-oriented, back-end scripting language used in web application development, system utilities, servers, and standard libraries. It is designed as a high-level, multi-paradigm, general-purpose, interpreted programming language.

Twitter, Bloomberg, Airbnb, and Shopify are well-known companies using Ruby.

4957 corporations reportedly use Ruby in their tech stacks.

Ruby vital features

  • Supports MVC Architecture
  • Automated Deployment
  • Active Record
  • Simple Programming Language
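The "simple programming language" point above is easiest to see in code. Below is a minimal sketch of Ruby's expressive object-oriented style and its block syntax; the `Greeter` class and the sample values are illustrative assumptions, not part of the article.

```ruby
# A small class showing Ruby's concise object-oriented syntax.
class Greeter
  def initialize(name)
    @name = name          # instance variable set in the constructor
  end

  def greet
    "Hello, #{@name}!"    # string interpolation; last expression is returned
  end
end

puts Greeter.new("Ruby").greet        # prints "Hello, Ruby!"

# Blocks and Enumerable keep collection code short:
squares = [1, 2, 3].map { |n| n * n } # => [1, 4, 9]
```

Features such as MVC architecture and Active Record come from the Rails framework, which builds on this same syntax.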

20. Elixir

Elixir is one of the best programming languages built entirely on Erlang; it uses the Erlang runtime environment (BEAM) to run its code. The language supports modern features such as macros, metaprogramming, and polymorphism.

Vox Media, Stack, and Postmates are among the best-known companies using Elixir.

Elixir vital features

  • Fault-tolerant
  • Highly concurrent
  • Pattern-matching

Ending Words

There is constant progress in the world of programming languages. Some evergreen technologies such as Java, PHP, and JavaScript have earned an enduring place on the list, while other programming languages like R and Kotlin have grown at an exceptional pace and have made it onto the list of the most popular programming languages.

The languages on this list offer leading features and functionality that can help you build robust web and mobile apps. Now choose the right technology to develop agile web and mobile apps for your business.

To make efficient use of your chosen programming language in mobile app development, get connected with a top mobile app development company and hire mobile app developers from there; this will help your business get a full-fledged mobile application.



Top 20 Most Popular Programming Languages For 2021 and Beyond was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.


source https://365datascience.weebly.com/the-best-data-science-blog-2020/top-20-most-popular-programming-languages-for-2021-and-beyond

Working With Sparse Features In Machine Learning Models

Sparse features can cause problems like overfitting and suboptimal results in learning models, and understanding why this happens is crucial when developing models. Multiple methods, including dimensionality reduction, are available to overcome issues due to sparse features.

Originally from KDnuggets https://ift.tt/3bydpfi

source https://365datascience.weebly.com/the-best-data-science-blog-2020/working-with-sparse-features-in-machine-learning-models
