## TLDR

Artificial Intelligence is a very deeply mathematical subject. I won't be getting into the math to any great degree here, others have done a much better job of that. I will include some links you may visit if you wish. In case you want to skip ahead the points I try to make here are:
• The math behind AI has been around a long, long time. Longer than computers.
• "Learning" algorithms have been around almost as long as computers.
• All "Learning" algorithms boil down to the premise that even if we can't devise a deterministic way of making a decision, we can always devise a probabilistic way of making the same decision.
• Yes probabilistic decision making is just a fancy way of saying "flipping a coin" though rolling an n-sided die would be more accurate.

## Bayesian Inference

Bayesian inference a means of statistical inference (decision making) where Bayes' theorem is used to update the probability of an hypothesis (a decision) as more evidence becomes available. We use the term Bayesian as a nod to Thomas Bayes (1702-1761) who proved a special case of Bayes' theorem though much is also due to Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem. Charles Babbage began work on his difference engine, often called the first computing device, in 1819. However, the first Turing-complete computer was ENIAC completed in 1945.

A Bayesian inference could work like this: Let's say we are playing a coin flipping game. If the coin is fair then on each flip each side (heads or tails) has an equal chance of coming up. If the idea of the game is to predict which side will be up you will be able to, over the long term, correctly predict the result one half or 50% of the flips. But let's say the coin is not fair, but that heads comes up 70% of the time. If you continue to predict each flip as if the coin is fair you will continue to guess correctly half the time. You may, if you are paying attention, notice heads are coming up more often than tails. Using Bayesian inference you could alter your prediction strategy to increase the number of times you predict heads over tails. This will improve the number of guesses you get right. If you continue this strategy you will ultimately guess heads all the time. A simple program (see below) can simulate this with these results (for 100 million coin tosses for each test):

```Coin bias: 0.5, Guess bias: 0.5, Heads: 50000082, Wins: 50004461, Losses: 49995539
Coin bias: 0.7, Guess bias: 0.5, Heads: 70006538, Wins: 50001384, Losses: 49998616
Coin bias: 0.7, Guess bias: 0.6, Heads: 69995464, Wins: 53996254, Losses: 46003746
Coin bias: 0.7, Guess bias: 0.9, Heads: 69998373, Wins: 66004210, Losses: 33995790
Coin bias: 0.7, Guess bias: 1.0, Heads: 70000956, Wins: 70000956, Losses: 29999044
```

Of course most AI programs use much more sophisticated algorithms to deal with much more complex situations, but they are all based generally on the Bayesian inference concept.

### Fixed vs Continuous Learning

What I have described above is a form of continuous learning. If, as we are playing, you continue to monitor the heads to tails ratio of the coin you can continue to update your prediction strategy. The advantage here is that you can adapt to fair and unfair coins, even coins that favor tails. The disadvantage is the work you have to do.

The other strategy is fixed learning. You build a model of coin flipping from data you already have on hand, perhaps using Bayesian inference. If that data contains results from coins that favor heads as well as fair coins you may settle on a strategy of always guessing heads. Much less work going forward, but a big disadvantage if you encounter any unfair coins that were not represented in your data set.

Continuous learning is often the ideal solution, but it comes with two very large costs. First, training very sophisticated AI algorithms can require lots of computation. This implies large fast computers, or lots of time. Second, Training requires data to be presented to the algorithm, and a score assigned to the output of the algorithm. This score, often called the cost function, is essentially how far the algorithm result is from the

## Markov Models

Bayesian inference, along with more advanced but similar techniques, are great for determining the probability single isolated events; determining if a picture is of a cat or dog, or some other kind of animal, or of the owner of a phone. But very often the problem is not to classify a single event but to classify a sequence of events. Speech to text is a very common example today with Siri, Alexa, Cortana and Google. This is where Markov models come in. A Markov model is a stochastic model (a description of a random process) used to model randomly changing systems. Clearly it would not be useful if the speech to be recognized was a truly random process; but since the content of speech can not usually be predicted until it has started it is useful to treat it as random.

From Markov models it is a short step to hidden Markov Models in which the process being modeled has observable states. This is a nice feature since it means the process states don't have to be enumerated before training starts, and training doesn't have to start over if new states are discovered or evolve. And hidden Markov models can be represented by simple dynamic Bayesian networks, bringing the problem nearly full circle. In a speech recognition system one might use each word as a state. Each word in the speech would be dependent on the previous word according to the language syntax.

So Bayesian inference allows us to adjust the probability of an event as we gather more information over time. Hidden Markov models allow us to reason about a sequence of events and draw inferences from the sequence. It doesn't take very many building blocks such as these to come up with a system that is eerily similar to what we call intelligence.

## My History with AI

This is only the beginning of what we now call Artificial Intelligence, but it is enough to talk about my early experience and work with learning algorithms. I wrote my first learning algorithm in the 1970's when I was 16 years old. My program was based on an article I read in Scientific American which discussed some of these mathematical techniques. I wrote the program on a Digital pdp-8e (pictured on the top of this blog) made available by the local community college. The software was based on a very simple dynamic Bayesian network.

In the late 1980's I became involved in a programming project using these, and more advanced, techniques on Masscomp high performance real time mini-computers and Cray supper computers. During this work I was struck by two things: 1) knowing who the software worked from the inside out, and therefor knowing how easily it could be lead astray by bad data I was always carefully skeptical and cautiously optimistic; 2) people who did not understand or appreciate how the software worked would often believe the output implicitly and assume if the program generated an answer at all it would be the right answer. Things haven't changed much since, unfortunately.

My work with AI didn't end in the 1980's. I will probably write more when it is relevant to more recent topics I want to write about.

## Pictures Masscomp computer. Rhode Island Computer Museum. Cray Y-MP Cray T-94 System Source computer Museum

## Software

### Coin Toss Simulation

```#include <iostream>
#include <random>
#include <functional>
#include <array>

class CoinToss {
public:
CoinToss() = delete;
template <class Sseq>
explicit CoinToss(Sseq& seed, double weight = 0.5) : generator{seed}, weight{weight} {}

int operator()() {
auto v = distribution(generator);
return v < weight ? 1 : 0;
}

protected:
double weight;
std::mt19937 generator;
std::uniform_real_distribution<double> distribution{0,1};
};

void run_test(double coin_bias, double guess_bias, int flip_count) {
std::random_device randomDevice;
std::seed_seq seedToss{randomDevice()};
std::seed_seq seedGuess{randomDevice()};

CoinToss    toss(seedToss, coin_bias);
CoinToss    guess(seedGuess, guess_bias);
int wins{0};

std::cout << "Coin bias: " << coin_bias << ", Guess bias: " << guess_bias;

for( int idx = 0; idx < flip_count; ++idx ) {
auto t = toss();
auto g = guess();
//        std::cout << t << ' ' << g << '\n';        if( t == g )
++wins;
if( t )
}

std::cout << ", Heads: " << heads << ", Wins: " << wins << ", Losses: " << flip_count - wins << '\n';
}

int main() {
run_test( 0.5, 0.5, 100000000);
run_test( 0.7, 0.5, 100000000);
run_test( 0.7, 0.6, 100000000);
run_test( 0.7, 0.9, 100000000);
run_test( 0.7, 1.0, 100000000);
return 0;
}```