Introduction to AI: an A to Z Tutorial for Building a Basic Linear Regression Algorithm from Scratch Without Libraries (Source Code Included)

Besarion Turmanauli
Published in Level Up Coding
11 min read · May 30, 2021

My relationship with AI began not so long ago. Yes, I had heard of it many times in movies as a child (Terminator, with Skynet becoming self-aware and destroying the world), but over time news articles and videos began describing real-life breakthroughs in gaming, self-driving, and other applications.

At some point I realized that AI is THE unavoidable next thing, not crypto or something else (although crypto will probably have multiple applications and a fanbase for years to come). That is not because AI is one of today’s trendy topics or because people think it’s cool, but because we’ve been moving in this direction all along:

  • Using primitive tools to increase labor productivity (since ancient civilizations);
  • Automating mass production lines and using steam-engine-based machines to replace the labor force and do more manual work faster (industrialization);
  • Inventing computers and the internet to do more mental work with higher precision, plus using programmable assembly-line robots (information age).

The keyword here is automation: as we accumulate experience and knowledge over time, we automate more and more complex tasks. The more complex the tasks are, the smarter the people required to do them, and the smarter the tools required to replace those people.

Do we really need all this ever-increasing productivity? The trend almost certainly doesn’t make us happier, but we want to explore and expand our understanding of the universe, and to explore and understand more we have to be able to do more.

So I purchased a course on Udemy that I didn’t finish. A year or so went by, and I ran into a data analysis problem in my own project that would have taken a huge amount of time to solve manually. I Googled a little, and sure enough, one of the solutions I found was renting one of Google Cloud’s AI services. I watched a video tutorial on YouTube, prepared my data, uploaded it, chose the linear regression algorithm, and ran it for a few hours.

Generally, I’m a big believer in not reinventing the wheel. If I have a simple task to do, even one I have been doing for months and am more or less good at, I prefer to use relevant libraries, frameworks, and so on to automate it. I think there’s no point in building something from scratch that has already been built over and over again. You might use your own classes and snippets (in software projects) or someone else’s, but why would you write exactly the same code 10 times, or even twice?

However, after a few hours Google came up with a solution, but it turned out that the solution had to be used in the same place to analyze other sets of data points. Here is a simplified version of what I was trying to do:

In the first column there are answers. The algorithm should go through this list and find a formula involving variable a and variable b that gives the highest percentage of correct answers. Let’s try the formula a*b:

  • 5*5 = 25 — OK
  • 3*2 = 6 — ERR
  • 4*3 = 12 — ERR
  • 6*6 = 36 — OK
  • 5*9 = 45 — ERR
  • 6*1 = 6 — ERR
  • 4*6 = 24 — ERR

Only 2 out of 7 answers (28.57%) turned out to be right. You can set a minimum acceptable percentage of correct answers, as well as an allowed deviation: you might not care if the correct answer is 40 and you get 42 instead, but if the difference is greater than 2, the answer won’t be accepted.

Let’s go through the same list with a maximum acceptable deviation of +/- 2 (each result’s deviation is shown in parentheses):

  • 5*5 = 25 (0): OK
  • 3*2 = 6 (-1): OK
  • 4*3 = 12 (-1): OK
  • 6*6 = 36 (0): OK
  • 5*9 = 45 (-4): ERR
  • 6*1 = 6 (-5): ERR
  • 4*6 = 24 (+2): OK

For comparison purposes, we usually want all the accuracy information we can get. In this case we definitely need the average deviation (using absolute values): (0+1+1+0+4+5+2)/7 ≈ 1.8571
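The numbers above can be reproduced with a few lines of code. This is a minimal Python sketch rather than the PHP used later in this tutorial; the function names `accuracy` and `mean_abs_deviation` are made up for illustration, and the deviations are copied from the list above.

```python
# (a, b, signed deviation) exactly as listed in the article
ROWS = [
    (5, 5, 0), (3, 2, -1), (4, 3, -1), (6, 6, 0),
    (5, 9, -4), (6, 1, -5), (4, 6, 2),
]


def accuracy(rows, tolerance=0):
    """Share of rows whose absolute deviation is within the tolerance."""
    hits = sum(1 for _, _, dev in rows if abs(dev) <= tolerance)
    return hits / len(rows)


def mean_abs_deviation(rows):
    """Average of the absolute deviations."""
    return sum(abs(dev) for _, _, dev in rows) / len(rows)


print(f"exact matches: {accuracy(ROWS):.2%}")                 # 28.57%
print(f"within +/- 2:  {accuracy(ROWS, tolerance=2):.2%}")    # 71.43%
print(f"mean absolute deviation: {mean_abs_deviation(ROWS):.4f}")  # 1.8571
```

With the tolerance of 2, five of the seven rows are accepted instead of two.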

In some scenarios, we might also have a missing variable we need to find, not just a relationship between the two known variables. We didn’t achieve 100% accuracy so we might even have a missing variable right now.

This kind of research is used, for example, to determine how gender, age, and other factors affect the expected salary of an individual, or what set of signs in a patient could indicate a possible illness with a high probability…

Those signs could be changes in blood pressure or in the production of certain hormones in the body. In combination they could be a sign of something important, yet go unnoticed by the naked eye, especially during the early stages of an illness, when it is more treatable and the symptoms aren’t severe enough to be noticed right away.

In practice, we rarely have all the factors/variables, so predictions with 100% accuracy can’t be made. A linear regression algorithm usually comes up with an approximation of the values and gives you an accuracy percentage for that approximation.
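For reference, a one-variable least-squares fit can be written without any libraries. The Python sketch below is not the script built in this tutorial; it only illustrates what "an approximation of the values" means. It fits answer ≈ slope * (a*b) + intercept to the seven example rows, and it assumes that each deviation shown earlier is the formula result minus the correct answer, so the `answers` list is derived under that assumption. The name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# a*b for each example row
products = [25, 6, 12, 36, 45, 6, 24]
# ASSUMPTION: correct answer = product minus the deviation listed earlier
answers = [25, 7, 13, 36, 49, 11, 22]

slope, intercept = fit_line(products, answers)
predictions = [slope * x + intercept for x in products]
within_2 = sum(abs(p - y) <= 2 for p, y in zip(predictions, answers))

print(f"answer ~ {slope:.3f} * (a*b) + {intercept:.3f}")
print(f"{within_2} of {len(answers)} predictions within +/- 2")
```

The fitted line is only an approximation, so the share of predictions within the allowed deviation plays the role of the accuracy percentage.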

Now that the overall concept is clear, let’s do some actual coding and create an algorithm to reverse engineer a trading bot, shall we?

SPOILER ALERT: vanilla PHP ahead :-) !

Well… the good news is it works…

Create a new PHP file with any name of your choice and let’s start:

I basically prepared the environment for an unusually long execution time and created some variables to store statistics. (PHP version is 7.x).

We need a function to output information every step of the way:

Next, a simple function for determining the relationship between two variables; it only understands >, <, and =:

We’re going to process a CSV file, so if the file is not uploaded, we’ll show an HTML file upload form:

We can also input a custom minimum accuracy value as a decimal (0.8 = 80%, default = 80%).
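As a rough illustration of the idea (a Python sketch, not the PHP code from this tutorial), a comparison helper of this kind can be as small as the following. The name `compare` and the choice to return the operator as a string are assumptions.

```python
def compare(left, right):
    """Return '>', '<' or '=' describing how left relates to right."""
    if left > right:
        return ">"
    if left < right:
        return "<"
    return "="


print(compare(1.2050, 1.2031))  # >
print(compare(30, 70))          # <
print(compare(5, 5))            # =
```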

Once we have our very first row of data, we can compare all the variables inside it using our compare function. This requires a lot of iterations (the number grows quickly with N, the number of variables in a given row), so let’s create another function to do that for us:
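A sketch of such a pairwise pass, again in Python and not the tutorial’s PHP, could look like this. It assumes every ordered pair of variables is compared, which gives N*(N-1) conditions per row; the names `compare_row` and `row`, and the sample values, are hypothetical.

```python
from itertools import permutations


def compare(left, right):
    """Return '>', '<' or '=' describing how left relates to right."""
    if left > right:
        return ">"
    if left < right:
        return "<"
    return "="


def compare_row(row):
    """Describe every ordered pair of variables in one row.

    row maps a variable name to its numeric value.
    """
    return {
        f"{name_a} {compare(val_a, val_b)} {name_b}"
        for (name_a, val_a), (name_b, val_b) in permutations(row.items(), 2)
    }


# Hypothetical row with 4 variables
row = {"open": 1.2031, "high": 1.2060, "low": 1.2020, "close": 1.2050}
conditions = compare_row(row)
print(len(conditions))  # 12, i.e. 4 * 3 ordered pairs
print(sorted(conditions)[:3])
```

Because both orders of each pair are produced, a condition and its mirror image (close > open, open < close) both appear in the result.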

Here’s our web interface so far:

Once we select the file, it is submitted automatically using JavaScript and then processed using PHP:

A brief overview: this script reverse engineers a MetaTrader 4 or 5 trading bot. We run the bot in the MetaTrader Strategy Tester (visual mode), which displays all the trades (buy/sell entries) on the chart. After the testing is complete, we add a bunch of different indicators. If you look at the chart with the naked eye, you might notice that certain conditions occur before buy or sell signals are generated. Instead of trying to identify those conditions manually, we export all the data (indicator values, buy/sell entry prices and times, OHLC price data, and so on) and import it for processing in our PHP script.

To export the data, we need to launch a MetaTrader 5 bot (created specifically for this purpose) on this chart (supports up to 26 indicators of your choice), here’s the source code:

And here’s the combined source code of the PHP file as well:

It’s worth noting that this script won’t work if the trading bot we want to dissect uses more than one set of entry criteria (when either one set or another is enough to generate a buy or sell signal). Therefore, let’s continue by creating a very simple bot for the purposes of this tutorial and see what happens:

Let’s compile and run it in the MT5 Strategy Tester. I got this chart at the end, with all trades displayed on it:

Time to launch the TradeData bot. Here’s my output in the Toolbox > Experts window:

It shows which indicators are on the chart right now (RSI with a period of 5 in this case), and we can also see where our data was exported. This is what the file looks like inside:

Before we import it into our PHP script, please make sure your server has upload_max_filesize and post_max_size php.ini settings set to 256M or more. File upload should be enabled.

If you have Microsoft Excel installed, it will usually assign *.csv files to itself automatically, and PHP will read the file type as application/vnd.ms-excel. Please modify this line in the PHP script:

if ($file_arr["type"] === "text/plain" || $file_arr["type"] === "application/octet-stream")

like this:

if ($file_arr["type"] === "application/vnd.ms-excel" || $file_arr["type"] === "text/plain" || $file_arr["type"] === "application/octet-stream")

You might need to increase the value in ini_set('max_execution_time', 300) at the beginning of the PHP file as well, in case the script takes longer to execute.

After waiting for a while, we have an output in JSON format:

Copy and paste it into a JSON viewer; I’m using http://jsonviewer.stack.hu/. Initially, there are 3 subnodes:

  • Buy conditions
  • Sell conditions
  • No entries

Unfold buy conditions and sell conditions, and you’ll see a list:

Our default setting was a minimum accuracy of 80%, so we have a list of buy entry criteria (conditions for buy) and sell entry criteria (conditions for sell) sorted by accuracy from 80% to 100%. After that we have:

  • A sum of all criteria within the 80% to 100% accuracy range;
  • Criteria that only occur during buy or sell trades: buy without sell and sell without buy (that is, buy minus sell conditions and sell minus buy conditions);
  • Buy minus no entry and sell minus no entry;
  • The last 2 subnodes in the list are the same as the previous 2 subnodes but with duplicates removed. For example, if one of the criteria for buy is close price > open price, then open price < close price will be filtered out, because it’s the same thing reversed.
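The filtering steps in this list can be sketched as follows. This is an illustrative Python version under stated assumptions, not the tutorial’s PHP. Each trade is assumed to be represented as a set of condition strings of the form "left op right", and the names `frequent_conditions` and `drop_mirrored` plus the sample rows are hypothetical.

```python
from collections import Counter

MIRROR = {">": "<", "<": ">", "=": "="}


def frequent_conditions(rows, min_accuracy=0.8):
    """Map each condition to the share of rows where it holds.

    Only conditions at or above min_accuracy are kept.
    """
    counts = Counter(cond for row in rows for cond in row)
    total = len(rows)
    return {c: n / total for c, n in counts.items() if n / total >= min_accuracy}


def drop_mirrored(conditions):
    """Keep one condition of each mirrored pair (a > b versus b < a)."""
    kept, seen = [], set()
    for cond in sorted(conditions):
        left, op, right = cond.split(" ")
        if f"{right} {MIRROR[op]} {left}" in seen:
            continue  # the reversed form was already kept
        seen.add(cond)
        kept.append(cond)
    return kept


# Hypothetical conditions observed at three buy entries and one sell entry
buy_rows = [
    {"close > open", "open < close", "high > low", "low < high"},
    {"close > open", "open < close", "high > low", "low < high"},
    {"close < open", "open > close", "high > low", "low < high"},
]
sell_rows = [{"close < open", "open > close", "high > low", "low < high"}]

buy = frequent_conditions(buy_rows, min_accuracy=0.6)
sell = frequent_conditions(sell_rows, min_accuracy=0.6)
buy_minus_sell = set(buy) - set(sell)   # conditions seen only for buys
print(drop_mirrored(buy_minus_sell))    # ['close > open']
```

The set difference removes conditions such as high > low that hold for every kind of trade, which is why they carry no information about the entry logic.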

It’s important to note that the last 4 subnodes take our accuracy setting into account and include all criteria in the 80–100% accuracy range. We could’ve chosen a higher value, but for now, let’s just open the 100% accuracy criteria:

First, we have the usual suspects:

  • high > low
  • low < high and so on... it’s almost always true anyway, not interesting…

p60, for example, means plus 60 (60 above zero, or just 60); m60 would be -60. o, h, l, and c are open, high, low, and close respectively. If they come with a number, it’s an index into the historical price data: o is the current open price, o1 is the previous open price, and so on.

In buy conditions, we see that the RSI indicator with a period of 5, RSI(5), is less than 60 (and 80, naturally) 100% of the time.
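To make the naming convention concrete, here is a small Python sketch that decodes such names. It only covers the two patterns described above (p/m levels and o/h/l/c with an optional index); the exact set of names in the JSON output may be wider, and the function name `decode` and its return shapes are assumptions.

```python
import re

PRICE_FIELDS = {"o": "open", "h": "high", "l": "low", "c": "close"}


def decode(name):
    """Translate a short variable name into a readable value.

    Returns an int for levels such as p60 or m60, a (field, bars_back)
    tuple for prices such as o or o1, and None for anything else.
    """
    level = re.fullmatch(r"([pm])(\d+)", name)
    if level:
        sign = 1 if level.group(1) == "p" else -1
        return sign * int(level.group(2))
    price = re.fullmatch(r"([ohlc])(\d*)", name)
    if price:
        bars_back = int(price.group(2) or 0)  # no number means the current bar
        return (PRICE_FIELDS[price.group(1)], bars_back)
    return None


print(decode("p60"))  # 60
print(decode("m60"))  # -60
print(decode("o"))    # ('open', 0)
print(decode("o1"))   # ('open', 1)
```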

In sell conditions, RSI(5) is more than 40 100% of the time.

Both of those are indeed true 100% of the time, but the actual oversold/overbought levels the bot is using are 20 and 80 respectively, not 60 and 40.

For some reason, the algorithm is not digging deep enough, which is something that could be addressed in the next tutorial.

Would you like more free stuff?

Disclaimer: these contain affiliate links, and I may receive compensation if you use them.

Deploy your next app in seconds: Get $100 in cloud credits from DigitalOcean using this link: https://m.do.co/c/8c5a2698b1a2

$140 from FBS: regulated by IFSC, this broker is one of the oldest and most established institutions, operating since 2009.

Requirements:

Available Markets: Cryptocurrencies, Stocks, CFDs, Metals, Commodities, Foreign Exchange

$30 from Tickmill: regulated by FSA, this broker has been operating since 2015.

Requirements:

Available Markets: Stock Indices, Oil, Precious Metals, Bonds, Foreign Exchange.

$30 from Roboforex: regulated by CySEC and IFSC, Roboforex has been operating since 2009 and is one of the most popular and trusted brokers among traders today.

Requirements:

  • Open an account and deposit $10 to verify your payment method (can be withdrawn at any time) and get $30 as a gift
  • Profits are withdrawable without limitations
  • If you trade the necessary number of lots, you can withdraw the $30 too

Available Markets: Stocks (all NYSE, NASDAQ, and AMEX shares + German and Chinese listed companies), Stock CFDs (on all stocks, $1.5 per trade fee on US-listed shares), Indices, ETFs, Commodities, Metals, Energy Commodities, Cryptocurrencies, Cryptoindices, Foreign Exchange.

Have a great day!
