∇ The Gradient

Demystifying the Black Box: A Guide to Interpretable and Explainable AI Models

Isaac Onyeakagbu — Wed, 21 Feb 2024 04:12:15 GMT

Artificial Intelligence (AI) has made remarkable strides in recent years, transforming different industries and enabling machines to make complex decisions. However, the inner workings of AI models have often been regarded as "black boxes," raising concerns about their transparency and trustworthiness. In this article, we'll unravel the mysteries of black box AI models and introduce you to interpretable and explainable AI techniques. We'll provide some code examples in Python to make the concepts tangible.

Let's dive in.

Understanding the Black Box

AI models, particularly deep learning models like neural networks, are often perceived as black boxes due to their complex architectures and the opacity of their decision-making processes. These models learn from data, but understanding why they make specific predictions can be challenging.

What is Interpretability and Why Does it Matter?

Interpretability is the ability to understand and explain how a model arrives at a particular prediction. Explainability provides transparency into the reasoning and internal logic behind AI systems. But why does interpretability matter?

Trustworthiness - Users must trust systems to adopt them. Complex models like deep learning can behave as black boxes, making errors mysterious. Explainability builds user trust in model behaviors.
Ethics and Fairness - AI systems must avoid perpetuating historical biases or discrimination. Interpretability allows auditing for fairness. The EU's GDPR gives individuals a right to meaningful information about the logic behind automated decisions that affect them.
Legal Compliance and Adoption - Regulations increasingly demand explanations of algorithmic systems. The US Federal Trade Commission may soon require explainability for certain AI applications. Interpretability is key for ethical, compliant adoption.
Debugging and Improvement - Insights from interpretable models can reveal and prevent errors, improving performance. Explanations can identify weakly modeled areas needing more training data.

Interpretable AI Models

Let's start by exploring interpretable AI models that are easy to understand and analyze.

Linear Regression

Linear regression is one of the simplest and most interpretable machine learning models. It models a linear relationship between input features and the target variable. Here's a Python code example:

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
slope = model.coef_[0]
intercept = model.intercept_
print(f"Slope: {slope}, Intercept: {intercept}")
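To see where the slope and intercept come from, the same fit can be reproduced by hand with the closed-form least-squares formulas. This is a minimal sketch on the sample data above; the helper name fit_line is illustrative and not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Slope: {slope:.2f}, Intercept: {intercept:.2f}")
```

The slope reads directly as an explanation: each one-unit increase in the feature moves the prediction by roughly 0.6.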

Decision Trees

Decision trees are another interpretable model. They make predictions by following a tree-like structure of decisions. Visualizing a decision tree can help understand its decision-making process:

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Sample data
X = [[0, 0], [1, 1]]
y = [0, 1]

# Create and fit the model
model = DecisionTreeClassifier()
model.fit(X, y)

# Visualize the decision tree
plt.figure(figsize=(10, 5))
plot_tree(model, filled=True, feature_names=["Feature 1", "Feature 2"])
plt.show()
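A tree is interpretable because every prediction is just a path of simple comparisons. The sketch below uses a small, hand-written tree (hypothetical thresholds, not a tree learned by scikit-learn) and returns both the predicted label and the decisions that led to it.

```python
# A hypothetical two-level tree: internal nodes compare one feature to a threshold.
tree = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},  # taken when x[feature] <= threshold
    "right": {
        "feature": 1, "threshold": 0.5,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}


def predict(node, x, path=None):
    """Walk from the root to a leaf, recording each decision taken."""
    path = [] if path is None else path
    if "label" in node:
        return node["label"], path
    feature, threshold = node["feature"], node["threshold"]
    if x[feature] <= threshold:
        path.append(f"x[{feature}] <= {threshold}")
        return predict(node["left"], x, path)
    path.append(f"x[{feature}] > {threshold}")
    return predict(node["right"], x, path)


label, path = predict(tree, [1, 1])
print(label, path)
```

The recorded path is the explanation: it lists exactly which comparisons produced the prediction.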

Explainable AI Models

Now, let's explore explainability techniques that shed light on the predictions of black box models.

LIME (Local Interpretable Model-Agnostic Explanations)

LIME is a powerful tool for explaining the predictions of complex models. It works by training a locally interpretable model on a dataset generated around the instance of interest. Here's a Python example:

import lime
import lime.lime_tabular
from sklearn.linear_model import LogisticRegression
import numpy as np

# Sample data for Logistic Regression (2 features)
X_logistic = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y_logistic = np.array([0, 1, 1, 0])

# Create and fit the Logistic Regression model
logistic_model = LogisticRegression()
logistic_model.fit(X_logistic, y_logistic)

# Sample data for LIME explanation (2 features)
X_lime = np.array([[0, 0]])  # Replace with your data

# Create a LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(X_logistic, mode="classification")

# Explain a specific prediction (e.g., the first data point)
explanation = explainer.explain_instance(X_lime[0], logistic_model.predict_proba)
explanation.show_in_notebook()
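The core idea behind LIME, fitting a simple model that is only trusted near one instance, can be sketched without the library. The example below is a deliberately simplified, one-feature version: the black_box function is a stand-in for an opaque model, and the perturbations are evenly spaced rather than randomly sampled as in the lime package.

```python
import math


def black_box(x):
    """Stand-in for an opaque model: a nonlinear function of one feature."""
    return math.sin(x) + 0.1 * x ** 2


def local_linear_explanation(f, x0, width=0.5, n_samples=201):
    """Fit a proximity-weighted line to f around x0; the slope is the local effect."""
    # Evenly spaced perturbations in [x0 - width, x0 + width].
    xs = [x0 + width * (2 * i / (n_samples - 1) - 1) for i in range(n_samples)]
    ys = [f(x) for x in xs]
    # Gaussian kernel: perturbations closer to x0 receive larger weights.
    sigma = width / 2
    ws = [math.exp(-((x - x0) ** 2) / (2 * sigma ** 2)) for x in xs]
    w_sum = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(ws, ys)) / w_sum
    # Weighted least squares for a single feature.
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = local_linear_explanation(black_box, x0=1.0)
print(f"Local slope near x = 1.0: {slope:.3f}")
```

The fitted slope approximates how the black box responds to small changes around x = 1.0. It says nothing about the model's behavior far from that point, which is exactly the "local" in LIME.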

SHAP (SHapley Additive exPlanations)

SHAP values are rooted in game theory and provide a unified measure of feature importance for any model. They can be used to explain both global and individual predictions:

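import shap

# `model` is assumed to be an already fitted model and `X` its feature matrix
# (for example, a NumPy array or a pandas DataFrame).

# Create an explainer
explainer = shap.Explainer(model, X)
shap_values = explainer(X)

# Visualize the SHAP values
shap.summary_plot(shap_values, X)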
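The game-theoretic idea can be made concrete with an exact Shapley computation on a toy model. This is a sketch under stated assumptions: the model, the instance, and the all-zero baseline are made up for illustration, and "removing" a feature is modeled by setting it to the baseline. The shap library uses background data and faster approximations rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            totals[p] += value_fn(frozenset(coalition)) - before
    return {p: total / len(orderings) for p, total in totals.items()}


def model(x1, x2):
    # Toy model with an interaction term.
    return 2 * x1 + 3 * x2 + x1 * x2


instance = {"x1": 1.0, "x2": 2.0}
baseline = {"x1": 0.0, "x2": 0.0}


def value_fn(coalition):
    # Features in the coalition take the instance's value; the rest stay at the baseline.
    inputs = {k: (instance[k] if k in coalition else baseline[k]) for k in instance}
    return model(**inputs)


phi = shapley_values(value_fn, list(instance))
print(phi)  # {'x1': 3.0, 'x2': 7.0}
```

The two values add up to 10, the difference between the model's output for the instance and for the baseline. The interaction term is split evenly between the two features.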

Building Interpretable Neural Networks

Neural networks, despite being black boxes, can be made more interpretable.

Saliency Maps

Saliency maps highlight regions of an input image that influence a neural network's decision. Below is a Python code snippet using TensorFlow/Keras:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

model = tf.keras.applications.MobileNetV2(weights="imagenet")

# Load an image and scale it to the [-1, 1] range that MobileNetV2 expects
image = tf.keras.preprocessing.image.load_img("cat.jpg", target_size=(224, 224))
input_image = tf.keras.preprocessing.image.img_to_array(image)
input_image = tf.keras.applications.mobilenet_v2.preprocess_input(input_image)

# Compute the gradients (saliency map)
inputs = tf.convert_to_tensor(input_image[tf.newaxis, ...], dtype=tf.float32)
with tf.GradientTape() as tape:
    tape.watch(inputs)
    predictions = model(inputs)
    top_prediction = tf.argmax(predictions[0])
    top_score = predictions[:, top_prediction]
gradient = tape.gradient(top_score, inputs).numpy()

# Normalize to [-1, 1], then shift the range to [0, 1] for display
gradient = gradient / np.max(np.abs(gradient))
gradient = (gradient + 1) / 2

# Plot the saliency map
plt.imshow(gradient[0])
plt.axis('off')
plt.show()

Feature Importance in Neural Networks

You can assess feature importance in neural networks using gradient-based methods or feature occlusion. Here's a code snippet using gradient-based feature importance:

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def load_dataset():
  # Random placeholder data: 4 features, 3 classes
  num_samples = 1000
  num_features = 4
  X = np.random.rand(num_samples, num_features)
  y = np.random.randint(0, 3, size=num_samples)
  return X, y

X_train, y_train = load_dataset()

# Convert NumPy array to a float32 TF tensor (the dtype the Keras layers use)
X_train = tf.convert_to_tensor(X_train, dtype=tf.float32)

model = tf.keras.Sequential([
  tf.keras.layers.Dense(128, activation='relu', input_shape=(4,)),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(
  optimizer='adam',
  loss='sparse_categorical_crossentropy',
  metrics=['accuracy']
)
model.fit(X_train, y_train, epochs=10)

def gradient_feature_importance(model, inputs, class_index):
  # Gradient of the chosen class score with respect to each input feature
  with tf.GradientTape() as tape:
    tape.watch(inputs)
    predictions = model(inputs)
    loss = predictions[:, class_index]
  gradient = tape.gradient(loss, inputs)
  # Average the absolute gradients over all samples
  feature_importance = tf.reduce_mean(tf.abs(gradient), axis=0)
  return feature_importance.numpy()

class_index = 0
feature_importance = gradient_feature_importance(model, X_train, class_index)

plt.bar(range(len(feature_importance)), feature_importance)
plt.xlabel('Feature Index')
plt.ylabel('Feature Importance')
plt.show()
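Feature occlusion, the other approach mentioned above, needs no gradients: replace one feature at a time with a baseline value and measure how much the prediction moves. Below is a minimal sketch in plain Python. The predict function is a hypothetical stand-in for a trained network's output for one class, and the per-feature mean is an assumed choice of baseline.

```python
def occlusion_importance(predict, rows, baseline):
    """Mean absolute change in the prediction when one feature is set to its baseline."""
    importance = []
    for j in range(len(baseline)):
        total = 0.0
        for row in rows:
            occluded = list(row)
            occluded[j] = baseline[j]  # hide feature j behind its baseline value
            total += abs(predict(row) - predict(occluded))
        importance.append(total / len(rows))
    return importance


def predict(row):
    # Hypothetical model: relies heavily on feature 0, slightly on 1, not at all on 2.
    return 3.0 * row[0] + 0.5 * row[1] + 0.0 * row[2]


rows = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [3.0, 4.0, 5.0]]
baseline = [sum(col) / len(rows) for col in zip(*rows)]  # per-feature mean
importance = occlusion_importance(predict, rows, baseline)
print(importance)
```

Feature 2 gets an importance of zero because the stand-in model ignores it. With a real network, predict would wrap a call to the model, and the choice of baseline (mean, zero, or a blurred input for images) can change the scores noticeably.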

Ethical Considerations

As AI interpretability and explainability are vital, we must also consider ethical aspects:

Fairness and Bias

AI models can inherit biases present in training data. To detect and mitigate bias, you can use libraries like AIF360 or Fairlearn in Python.

Below is a code snippet using AIF360:

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Create sample biased dataset
data = {'gender': ['male', 'female', 'female', 'male', 'male'],
        'hired': [1, 0, 1, 1, 0],
        'qualified': [1, 1, 0, 1, 0]}
df = pd.DataFrame(data)

# Encode categorical columns
df['gender'] = df['gender'].map({'male': 0, 'female': 1})

# Convert to BinaryLabelDataset
bl_data = BinaryLabelDataset(df=df,
                             label_names=['hired'],
                             protected_attribute_names=['gender'],
                             favorable_label=1,
                             unfavorable_label=0)

privileged = [{'gender': 1}]
unprivileged = [{'gender': 0}]

# Compute original bias metrics (print a metric value, not the metric object)
metric = BinaryLabelDatasetMetric(bl_data, privileged_groups=privileged, unprivileged_groups=unprivileged)
print("Mean difference before reweighing:", metric.mean_difference())

# Mitigate bias using reweighing
RW = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
rw_data = RW.fit_transform(bl_data)

# Compute bias metrics after reweighing
rw_metric = BinaryLabelDatasetMetric(rw_data, privileged_groups=privileged, unprivileged_groups=unprivileged)
print("Mean difference after reweighing:", rw_metric.mean_difference())
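It helps to know what such bias metrics actually measure. The sketch below recomputes two common ones by hand on the same toy records: the statistical parity difference (difference in selection rates) and the disparate impact (ratio of selection rates). The group assignment mirrors the snippet above, and the helper names are illustrative and not part of AIF360.

```python
records = [
    {"gender": "male", "hired": 1},
    {"gender": "female", "hired": 0},
    {"gender": "female", "hired": 1},
    {"gender": "male", "hired": 1},
    {"gender": "male", "hired": 0},
]


def selection_rate(records, group):
    """Share of favorable outcomes (hired == 1) within one group."""
    outcomes = [r["hired"] for r in records if r["gender"] == group]
    return sum(outcomes) / len(outcomes)


def parity_metrics(records, unprivileged, privileged):
    """Return (statistical parity difference, disparate impact)."""
    rate_u = selection_rate(records, unprivileged)
    rate_p = selection_rate(records, privileged)
    return rate_u - rate_p, rate_u / rate_p


# Same (arbitrary) group assignment as above: gender 0 (male) is treated as unprivileged.
spd, di = parity_metrics(records, unprivileged="male", privileged="female")
print(f"Statistical parity difference: {spd:.3f}, disparate impact: {di:.3f}")
```

A parity difference of 0 and a disparate impact of 1 would indicate equal selection rates. With only five records, these numbers are purely illustrative.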

Best Practices for Model Interpretability

To ensure model interpretability in practice, follow these best practices:

Model Documentation

Documenting your model is essential for transparency and collaboration. Here are some key components to include in your model documentation:

Model Architecture: Describe the structure of your model, including the type and number of layers, activation functions, and any regularization techniques used.
Hyperparameters: List the hyperparameters used during model training, such as learning rate, batch size, and dropout rate.
Training Data: Specify the dataset used for training, including data sources, preprocessing steps, and any data augmentation techniques applied.
Performance Metrics: Report the evaluation metrics used to assess your model's performance, such as accuracy, precision, recall, and F1-score.
Interpretability Techniques: Document the interpretability techniques employed, such as LIME, SHAP, or saliency maps.
Bias Assessment: If applicable, describe how you assessed and addressed bias in your model.
Results: Provide results and insights gained from model interpretation, including any actionable recommendations.

Model Selection

When choosing a model for your AI application, consider the trade-offs between complexity and interpretability. Here are some guidelines:

Start Simple: If interpretability is a top priority, begin with simpler models like linear regression or decision trees.
Evaluate Trade-offs: Assess the balance between model accuracy and interpretability. Sometimes, a slightly less accurate but more interpretable model is preferred.
Ensemble Models: Ensemble techniques like random forests can provide a compromise between accuracy and interpretability by combining multiple decision trees.
Regularization: Use regularization techniques (e.g., L1 regularization) to promote sparsity in neural networks, making them more interpretable.
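One way to see why L1 regularization produces sparse weights: optimizers that handle the L1 penalty with a proximal step apply soft-thresholding, which shrinks every weight toward zero and sets the small ones to exactly zero. Here is a minimal sketch with made-up weights; the function name and the numbers are illustrative.

```python
def soft_threshold(weight, strength):
    """Shrink a weight toward zero by `strength`; smaller weights become exactly 0."""
    magnitude = abs(weight) - strength
    if magnitude <= 0:
        return 0.0
    return magnitude if weight > 0 else -magnitude


weights = [0.8, -0.05, 0.02, -1.2]
sparse = [soft_threshold(w, strength=0.1) for w in weights]
kept = [i for i, w in enumerate(sparse) if w != 0.0]
print(sparse, kept)
```

Only the weights at indices 0 and 3 survive, so an explanation of the model has fewer inputs to account for.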

The Future Outlook

Explainable AI adoption will likely accelerate due to growing calls for accountability. Techniques to generate explanations must continue evolving beyond simple attention layers or feature attribution methods. The ultimate goal is demystifying even the most complex black box models.

Conclusion

In this guide, we've demystified the black-box nature of AI models by introducing you to interpretable and explainable AI techniques.

Interpretable and explainable AI models are crucial for building trustworthy and ethical AI systems. By following best practices, documenting your models, and being mindful of ethical considerations, you can harness the power of AI while maintaining transparency and accountability.

The Future of Fitness Industry: Innovative Trends, Community Challenges, and Opportunities

Mohsen Davarynejad — Tue, 13 Feb 2024 10:40:43 GMT

Introduction

I have been looking into ways to leverage the capabilities of LLMs in the fitness industry. And here we have it. The following few paragraphs are the result of feeding ChatGPT-4 through API calls with relevant information gathered from some Reddit posts and asking it to make a blog post out of it. All this is made possible by constructing a few Agents and defining the specific Tasks for them. In a forthcoming follow-up post, I will provide the full detailed process of how to get there. I shall provide an in-depth look at how Python, along with LangChain and LangGraph (to leverage the concept of agents and tasks) can be harnessed to generate insights that drive innovation in fitness and beyond. Stay tuned for a comprehensive guide that will showcase the potential of AI to transform our approach to wellness and community building.

The Future of Fitness Industry: Innovative Trends, Community Challenges, and Opportunities

In our progressively digital society, the fitness world is on a constant revolution. The most recent trends extracted from fitness discussions on Reddit underline the significance of well-structured exercise routines, balanced protein-rich diets, and the essentiality of resources for beginners.

Innovative Fitness Trends

The fitness community is buzzing with more than just the latest High-Intensity Interval Training (HIIT) workout. A deep dive into online discussions unveils that structured exercise routines are the new norm. It's not just about hitting the gym anymore - it's about following a plan, setting goals, and tracking progress for optimum results.

Nutrition is equally vital in this journey. A balanced, protein-rich diet has emerged as a fundamental part of the fitness lifestyle. It's not just what you do in the gym, but also what you consume outside it that shapes your fitness story.

For fitness novices, the journey often seems overwhelming. The high demand for beginner-friendly resources is reflective of individuals seeking guidance on where to start and how to stay committed.

Community Challenges

However, every trend brings its own set of challenges. In the fitness community, one of the most prevalent hurdles is maintaining consistency. Upholding a regular diet and fitness routine is challenging, and it becomes even more demanding when life throws curveballs.

Identifying Market Opportunities

Our findings also disclose potential opportunities in the fitness market. Despite the abundance of resources available, there's a need for personalized diet and workout plans tailored to individual requirements and objectives. Another opportunity lies in creating beginner-friendly fitness resources. While the internet is flooded with information, not all of it is easy to understand or implement. Lastly, platforms that offer community support and motivation are in high demand.

The Path Ahead

The fitness industry should take note of these insights. There's a chance to innovate by developing resources that not only educate beginners about fitness and nutrition but also mentor them and offer support to help them stay motivated.

By addressing these opportunities, we can encourage a supportive community environment, making the fitness journey less intimidating and more achievable. Whether you're a fitness novice or a seasoned pro, remember, your journey is unique, and the right resources can significantly impact your success.

Stay connected for more insights and discussions from the rapidly evolving universe of fitness. Until next time, keep moving, keep fueling right, and keep aiming for your best self!

The link to the follow-up post with detailed Python implementation will be provided here. So stay tuned!!

*Cover Image generated by DALLE 3

Scala Tutorial by Example - Platform free

Mohsen Davarynejad — Sun, 11 Feb 2024 20:17:06 GMT

This is a gentle introduction to programming in Scala. This tutorial is still under active development. In case you happen to have an interest in completing the tutorial by providing your working example, then let's get connected.

Let's start by answering the question of why I want to learn Scala.

Why Scala? The real drivers for getting into it!

Scala is among the top 10 most popular programming languages. It helps you elegantly solve real-world problems in many ways. It allows you to combine the concepts of functional programming and object-oriented design.

The reasons there are so many high-profile Scala users, like Twitter and LinkedIn, eventually boil down to runtime performance and stability as well as the availability of libraries for building concurrent and distributed applications (using actors). In addition, Scala programming significantly cuts down the development time of an application and its maintenance expense.

If I want to go a bit more into detail, I would say that Scala encourages the usage of immutable data structures, making code less error-prone, safer, and perhaps easier to understand. Moreover, Scala runs on the Java Virtual Machine (JVM) and plays well with existing Java applications, meaning that a lot of libraries can be easily used with Scala. The JVM is known for its monitoring, garbage collection, and load balancing tools.

If you want to explore the world of functional programming then Scala is a perfect choice without completely disregarding the choice of object-oriented programming.

Note: If you are new to the concept of functional programming, it is often recommended to start with Haskell, and play with it for a while. The point is that in Haskell, you cannot drift into non-functional programming. So perhaps you can do a quick but thorough brush-up to rewire your brain. The best place to practice some Haskell is Learn You a Haskell for Great Good!

For my tutorial, I will try to follow the structure presented here, but from time to time I may deviate from it. I will also borrow some of the nice visualizations presented there.

Some facts about Scala

In Scala, semicolons are optional; you need them only when you put multiple statements on one line.

Scala offers better concurrency by using immutability and actors. (I will make this point more clear as we go)

Scala runs on the JVM. A .NET backend also existed in the past, but it is no longer maintained. For more info have a look here

Now let's jump right into the installation process.

Installation

Let's first try to install Scala on a Linux machine and set up a Jupyter notebook as the IDE. The first thing we need to install or upgrade is the JVM. So go to www.java.com/en/download. To install Java you may like to follow the instructions provided here.

If you are on a Mac, it's pretty easy to install Scala. You need to go to brew.sh and install Homebrew. Now inside your terminal just type $ brew install scala and that's it.

For Linux, download the binaries from here and unpack the archive with tar zxvf fileNameHere.tgz. Then add scala and scalac to your path by editing .bashrc:

export SCALA_HOME=/home/cloudera/workspace/scala/scala-2.12.1/
export PATH=$SCALA_HOME/bin:$PATH

The best choice for me is just to download a Cloudera VM. If you are a data scientist or if you just would like to start with data science and analytics then I would recommend you go and get one of the Cloudera VMs. It comes prepacked with most of the big data tools, like Hadoop, Hue, Hive, HBase, and Impala. Scala is also installed on it, and you can run Spark jobs through the Scala or Python APIs. If you have downloaded a Cloudera VM, then the only thing you need to do is type in spark-shell, and you are ready to start your first Scala project.

And now if you are interested in running your Scala application in a Jupyter Notebook, follow the instructions here on how to install the Scala kernel for Jupyter.

You may also like to use IntelliJ IDEA, which I recommend if you are coding for a big Scala project.

OK, enough on the installation process and facts on Scala. Let's begin with some coding.

Basics of Scala

Here I would like to start with some very basic code in Scala, just to give you the feeling that Scala is a language that shares a lot of similarities with other languages. At the same time, I would like to emphasize that Scala is a paradigm shift. We will cover the fundamentals of Scala in another section.

// This is the way to put comments in your code. You can write a multiline comment by opening with /* and closing with */. This is similar to the way we put comments in Java.

// Now let's do some basic math. In your command line just type in
2 + 2 * 2
// This will create a variable res0 with type Int and a value of 6. Now you can use this res0 variable in the rest of your code.
"The value of res0 is: " + res0
// And what you get in return is: res1: String = The value of res0 is: 6. So now not only do we have the result, but the result is also stored in a variable res1 with type String.

// Note that once a value is assigned to such a variable, it cannot be changed. So if you type in
res0 = 2
// then you will get an error message, informing you that the value of the variable cannot be changed.

// If you need to define a variable that needs to be changed, then you need to use var.
var myMessage = "Make sure to import the necessary libraries."

Make big numbers

As in most programming languages, very large numbers overflow or lose precision with the default numeric types. To prevent that you can use BigDecimal and BigInt.

var myInt = BigInt("1234567890123456789012345678901234567890")
var myDecimal = BigDecimal("10.1234567890123456789012345678901234567890")

Generate random numbers with Scala

To get a random number drawn from a uniform distribution between 0 and 1 (random comes from scala.math, so import scala.math._ first):

random

You can easily generate random integers between 1 and 6:

(random * 6 + 1).toInt

Some Math

In Scala, you cannot do ++ or --. However, you can do the following:

myInt = 12
myInt += 4
myInt -= 2
myInt *= 10

First, let's import a library in Scala and do some simple math.

import scala.math._ // This will import all the members of scala.math
abs(-1) // for abs value
ceil(-1.2)
round(-1.2)
floor(-1.2)
toRadians(45)
toDegrees(0.78)
sqrt(10)
pow(10,2)
cbrt(12) // for cube root
exp(10)

Function definition

def add(a:Int, b:Int):Int = a + b
val sum:Int = add(2, 10) // val specifies immutability of Int sum
println("The value of the sum is: " + sum)

As you have noticed, pretty much anything that appears after : specifies the type. Also, there is no explicit return statement. The value of the last expression is always returned automatically. Check the next piece of code:

def MyFunc(a:Int):Int = {
    a + 1
    a + 2
    a + 3
}
val p = MyFunc(10) // Here I deliberately dropped the type of val p, and Scala does not complain about it.
println("The value of p is: " + p)

Conditionals in Scala

Now let us jump into Scala's If...Else expression blocks and loops.

Type-level programming in Scala (coming soon)

Fundamentals of Scala

In this section, I would like to cover the unique part of Scala, the side that stands apart from the crowd.

In types we trust!

In Scala, you can trust your code more than code written in many other languages. Ideally you want a platform where, if your code compiles, you are good to go. In Scala, type errors surface at compile time, not at runtime.

Functions are first-class citizens in Scala

One of the concepts in Scala that looks very new to Java users is that in Scala functions are types. In Scala, just like the recent developments in Java 8, you can pass functions around. Not something that we were used to. In your function, you can be very specific about what type of input you expect the function to accept and what should be the return type.

So in Scala function types are based on a) the number and the type of input parameters and b) the return type. So the MyFunc function that we have defined above takes an Int and returns an Int. As another example, we may have a function that takes two Int values as input and returns a String:

def makeString(x: Int, y: Int): String = x + " and " + y

I could have defined the above function like this:

val makeString: (Int, Int) => String = (x, y) => x + " and " + y

The above code says that I'm going to define a function named makeString that takes two integers and returns a String. The two incoming parameters are the integers x and y, and the returned String is x + " and " + y. Now I can try the above code in the REPL:

scala> makeString(2,4)
res1: String = 2 and 4

Note that in Scala the last executable statement of a function is what it will return.

Recursion and Stack Overflow

For-each loops, while loops, do loops, and the like are all forms of iteration. Java is designed around iteration and is not very good at handling deep recursion. In contrast, Scala, with its functional leanings, favors recursion over iteration.

Let us write a function that computes the sum of the first n natural numbers:

def sumn(n: Int): Int = {
    if (n == 0) 0 else n + sumn(n-1)
}
println("The sum of first 10 natural numbers is: " + sumn(10))

By increasing n, we will reach a point where we will get a stack overflow error. Let us see how Scala overcomes the limitations of the call stack. Let's rewrite the above code, this time using the concept of Tail Calls to avoid the stack overflow message:

def sumn(n: Int, acc: Int): Int = {
    if (n == 0) acc else sumn(n-1, acc + n)
}
println("The sum of first 10 natural numbers is: " + sumn(10, 0))

If you need more clarifications on this then please visit here.

Higher order function in Scala

A function is called a higher-order function if it takes a function as a parameter or returns a function.

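// Named square rather than sqrt: it squares its input and no longer clashes with scala.math.sqrt
def square(x: Int) = x * x
def sum(f: Int => Int, a: Int, b: Int): Int = {
    if (a == b) f(b) else f(a) + sum(f, a + 1, b)
}
println("The sum of the squares of the numbers between 10 and 20 is: " + sum(square, 10, 20))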

In the above example, the function sum is called a higher-order function.

Anonymous functions in Scala

Let's say that we need to compute the sum of the square of numbers between two integers, a and b.

def sum(f: Int => Int, a: Int, b: Int): Int = {
    if (a == b) f(b) else f(a) + sum(f, a + 1, b)
}
println("Here the assumption is that a is always smaller than b. I'm not checking if that assumption is indeed true or not!")
println("The sum of squares between 10 and 20 is: " + sum(x => x*x, 10, 20))

In this example x => x*x is an anonymous function that takes x as an input and returns x^2.

Note: Can you explain why the code does not compile properly if I replace x*x with scala.math.pow(x,2)?

Map, Filter, and Reduce: The methods of collections

Let's build an array of type Int that covers the integers from 100 up to (but not including) 500. The task is to:

Compute the square of each element of the array.
Filter the ones that are bigger than 450.
Compute the sum of all the elements of the array.

Now let's see the Map, Filter, and Reduce in action.

import Array._
var A0 = range(100, 500)
val A1 = A0.map(x => x*x)
val A2 = A0.filter(x => x > 450)
val A3 = A0.reduce((x, y) => x + y)

The above code is self-explanatory. Let's add another example, this time with Vectors. Vectors give effectively constant-time access to their elements. They are immutable, so updating an element returns a new Vector, which is cheap because the two share most of their structure. The comments in the code explain the process:

// Let's begin by making an empty Vector.
val v = scala.collection.immutable.Vector.empty
// Now we will make three Vectors, v1, v2, and v3
val v1 = v :+ 1 :+ 2 :+ 3
val v2 = Vector(2, 2, 3)
val v3 = Vector(3, 2, 3)
// The idea here is to compute the element-wise sum of v1, v2 and v3
val sum = Vector(v1, v2, v3).transpose.map(_.sum)

Pattern matching in Scala

If you append .r to any string, Scala will turn that string into a regular expression.

Define a class in Scala

OK, now we have everything in place to get more real and implement our first Scala class. The idea here is borrowed from the world of the Internet of Things (IoT), the buzzword that we hear a lot these days.

Let's imagine that an LED is connected to the internet. We would like to check its status, and we want to be able to turn it on and off. Also, we want to know the time since the last change in its status. To implement the idea let's create the Time class.

import java.util.Calendar

case class Time(val hour: Int, val minute: Int) {
    val inMinutes:Int = hour * 60 + minute
    def -(that: Time): Int = this.inMinutes - that.inMinutes
}

Perfect. So now we can get the difference between two time stamps. Now it is time to create the class LED. This class needs to be able to store the time stamp at which it was created. Also, we will need to implement methods that change its status only when the object (the LED) is not already in the desired status.

case class LED() {
    var now = Calendar.getInstance()
    // HOUR_OF_DAY uses the 24-hour clock, so time differences do not wrap at noon
    var hour = now.get(Calendar.HOUR_OF_DAY)
    var minute = now.get(Calendar.MINUTE)
    private var time = new Time(hour, minute)
    private var internalStatus = 0

    override def toString = s"The LED has a status of $internalStatus. The status has not changed since ${statusTime()} minute(s) ago."

    def report(): Unit = { println(this) }  // uses toString

    // Minutes elapsed since the last status change
    def statusTime() = {
        now = Calendar.getInstance()
        var cHour = now.get(Calendar.HOUR_OF_DAY)
        var cMinute = now.get(Calendar.MINUTE)
        Time(cHour, cMinute) - this.time
    }

    def turnOn() = {
        if (internalStatus == 0) {
            now = Calendar.getInstance()
            hour = now.get(Calendar.HOUR_OF_DAY)
            minute = now.get(Calendar.MINUTE)
            internalStatus = 1
            time = Time(hour, minute)
            println("The status has changed to ON")
        } else {
            println("The LED is already ON. No change in the status!")
        }
    }

    def turnOff() = {
        if (internalStatus == 1) {
            now = Calendar.getInstance()
            hour = now.get(Calendar.HOUR_OF_DAY)
            minute = now.get(Calendar.MINUTE)
            internalStatus = 0
            time = Time(hour, minute)
            println("The status has changed to OFF")
        } else {
            println("The LED is already OFF. No change in the status!")
        }
    }
}

Now let's make an instance of the class LED and see its behavior in REPL:

scala> var myLED = new LED // Here I instantiate the class. We are not passing any value to it, although a nice extension of the class LED could be to add the possibility of specifying the state of the LED when creating it.
myLED: LED = LED()
scala> myLED.time
res1: Time = Time(10,33)
scala> myLED.status
res2: Int = 0
scala> myLED statusTime
res84: Int = 2

Extend a class in Scala: Subclassing

This section has to be completed, but if you are interested you may want to take a look here.

Extending an object in Scala

This section has to be completed, but if you are interested you may want to take a look here.

Conclusions

In Scala, there is a special keyword called case.

In the follow-up post, we will go through concise OO and powerful functional collections in Scala.

Have a look here for more examples!

Reference

It's easy to lie with statistics. It's hard to tell the truth without statistics.
Andrejs Dunkels

*Cover Image Credit: DALLE 3

A Data Scientist's Chronicle of Lessons Learned and Strategies for Success

Mohsen Davarynejad — Mon, 05 Feb 2024 23:00:00 GMT

"A data scientist combines hacking, statistics, and machine learning to collect, scrub, examine, model, and understand data. Data scientists are not only skilled at working with data, but they also value data as a premium product."
Erwin Caniba

Around 9 years ago, I decided to leave academia to put what I had learned to the test. To mark the occasion, I have decided to write down the experiences gathered over those 9 years of my professional voyage; a journey that has taken me through diverse industries, from the corridors of academia to the landscapes of corporate challenges. I distill my experiences into a set of principles, strategies, and wisdom gained through years of exploration.

Always challenge your stakeholders on the problem definitions, problem assumptions, and the importance of the problem at hand. Make sure to understand the business context and challenges of your stakeholders.
In many real-world and industrial use cases, leveraging an existing solution promotes efficiency by avoiding unnecessary repetition. Often, you will find that your particular problem has already been solved. If it initially appears to be unique, attempt to abstract the problem, and then look for an algorithm that suits this view. This method usually uncovers that your issue corresponds to well-known problems that have established solutions.
Start from design, before code. Your initial design is the key to the project's success. Break down the solution into testable, maintainable pieces. Review your design, put it out there, and ask relevant people to challenge it. Feedback is a gift, so embrace it.
Start with implementing unit tests.
Be visible, don't work in a silo, and always speak your mind.
Focus on the data rather than focusing on the model. Also, think about other sources of data that can be used. Check on their availability and accessibility.

"Data science isn't about the quantity of data but rather the quality."
Joo Ann Lee

Always start from the simplest possible model that meets the requirements.
More often than not, models make strong assumptions. So it's always critical to accompany your analysis and findings with a sensitivity analysis: How robust are your conclusions to the model assumptions? Reliable inferences can be made only in light of a model-specific sensitivity analysis.
Ensure that the open-source tools embedded in decision-support solutions are accompanied by thorough documentation and will continue to receive support over time. Additionally, a strong user community around these tools is essential for securing the necessary expertise for future needs.
Solution adoption is more important than model performance. Use interpretable and glass box models whenever possible.
You cannot improve what you cannot measure. So always have your KPIs in place, and study the impact of the changes on the KPI.
It's good to work hard and get the whole to-do list finished. But more importantly, you need to focus on the impact. Prioritize the work, measure the impact of each feature, gauge, and show flexibility.
When joining a new organization ensure a code review process is in place. Make it part of the organization if it's not already. Always look for ways to make the code review process smoother and more productive.
Praising good work is more than a formality; it is a celebration of the commitment to excellence. Each milestone achieved and every challenge overcome is a testament to the unwavering dedication of individuals who go above and beyond. So recognize them in the meeting/stand-ups.
Naming is important. When implementing the solution, you will need to come up with names along the way, whether it's process names, file names, variable names, function names, etc. Consider this as a selling opportunity.
Share your results with your stakeholders in incremental steps. And focus on the story and the implications of the results of your work. In most cases, your stakeholders have no interest in the details of the used algorithms and failed solutions. Just get to the bottom line and avoid technical presentations.
Do not focus only on building a great ML model in a notebook. Think automation. Automate as much as possible.
When it comes to maintaining legacy code, start with increased test coverage, API solidifying, and small steps toward an ideal state, isolated to subsystems behind the solidified interfaces. Avoid massive refactors: they introduce too much change to reason about at once, and any mistake costs momentum and trust, which can lead to rolling back and abandonment at great expense. Make sure to take small steps that are easy to roll back. Avoid an all-or-nothing ethos around the success of the refactor.
Do not stop learning. Self-development is the key to both individual and team success.

"The best way to learn data science is to do data science."
Chanin Nantasenamat

*Image Credit: DALL-E 3 with the prompt: Create an image where a puzzle game is featured at the center. As you move towards the outer edges of the image, the puzzle pieces seamlessly transform into pins. These pins are intricately connected to each other with wires, resembling a complex network of junctions in a bustling city. Ensure that the pins exhibit a variety of colors to enhance the vibrant and dynamic feel of the cityscape

Classification with imbalanced data

Mohsen Davarynejad — Mon, 29 Jan 2024 08:17:40 GMT

Consider the problem of predicting late passengers at the airport. Based on some statistics, only around 8% of passengers are considered late. Assuming the availability of good features like Age, Gender, Nationality, Country of residence, Walking distance from the baggage drop-off point to the gate, etc., the easy solution would be to build a random forest and train it using, let's say, 80% of the available data set. The other 20% of the data is reserved for testing. What does it mean if the classifier gets an accuracy of 92%? Nothing good: since only 8% of passengers are late, a classifier that labels every passenger as on time reaches the same 92% without detecting a single late passenger.
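To make the 92% figure concrete, here is a minimal sketch in plain Python. The passenger counts are made up for illustration (1,000 test passengers, 8% late), and the "classifier" is a hypothetical baseline that always predicts "on time".

```python
# Hypothetical test set: 1,000 passengers, 8% of them late (label 1).
n_passengers = 1000
n_late = 80
y_true = [1] * n_late + [0] * (n_passengers - n_late)

# A baseline that ignores its inputs and always predicts "on time" (label 0).
y_pred = [0] * n_passengers

# Accuracy: fraction of passengers labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_passengers

# Recall on the late class: fraction of late passengers that were caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_late = true_positives / n_late

print(f"accuracy = {accuracy:.2f}, recall on late passengers = {recall_late:.2f}")
# prints: accuracy = 0.92, recall on late passengers = 0.00
```

Accuracy alone cannot distinguish this useless baseline from a trained model that scores 92%.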

So the question that comes to mind is why does it happen, and how to handle imbalanced classification problems?

The answer to the first question, why does it happen, is a simple one. When the problem at hand is imbalanced, the ML algorithm has little information about the minority class. So it's not a simple task to come up with a list of features that are informative enough to discriminate between the classes.

For the second question, how to handle the problem, one may provide several answers, mainly supported by conceptual reasoning and simulated results. We shall review a number of them in this article. Without loss of generality, binary classification is considered here.

Approaches to deal with imbalanced data sets

The skewed class distribution is quantified by the imbalance ratio (IR), defined as the ratio of the number of instances in the majority class to the number of instances in the minority class.
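As a small illustration of the definition, the sketch below computes the IR for a binary label list. The function name imbalance_ratio and the 920/80 split are assumptions chosen to match the airport example.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (#majority instances) / (#minority instances) for binary labels."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    return max(counts.values()) / min(counts.values())

# 920 on-time passengers (0) and 80 late passengers (1).
labels = [0] * 920 + [1] * 80
ir = imbalance_ratio(labels)
print(f"IR = {ir:.1f}")  # prints: IR = 11.5
```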

The approaches are categorized into two groups: the internal approaches and external approaches. While the former group aims at creating new algorithms or modifying the existing ones, the latter acts on the data by resampling to diminish the effect caused by class imbalance.

Internal approaches

Internal approaches modify the existing algorithms or create new ones equipped to handle imbalanced classification problems. They may adapt the decision threshold to create a bias toward the minority class, or they may modify the cost function to compensate for the minority class.

External approaches

This group acts on the data rather than the learning method. Their advantage over internal approaches is their independence from the classifier used. The main methods in this group are the data sampling methods, each with its own characteristics, strengths, and weaknesses. These techniques assume that a fully balanced dataset can be attained. Data sampling methods transform the imbalanced data set into a balanced one by under-sampling the majority class, over-sampling the minority class, and/or generating synthesized data to bring the class ratio to 50:50. A very well-known representative of the latter is the Synthetic Minority Oversampling Technique, or SMOTE. Although several studies have shown little to no difference between these data sampling techniques, we shall review all three of them.

Random Undersampling (RUS)

Random Undersampling (RUS) tries to balance the two classes by reducing the size of the majority class, which is accomplished by removing instances of the majority class. Observations from the majority class are selected and removed from the data set at random, and the process continues until the desired class ratio is reached.

Random Oversampling (ROS)

As you might have already guessed, Random Oversampling balances the dataset by increasing the size of the minority class. The desired class ratio is achieved by duplicating samples from the minority class. The items to be duplicated are selected at random.

ROS, when compared to RUS, ensures efficient use of information and prevents information loss. On the other hand, the chance of overfitting is high when using random oversampling, meaning that the training accuracy may be improved at the cost of lower performance on the test set.
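Both resampling schemes fit in a few lines of NumPy. The sketch below is an illustration under stated assumptions rather than a reference implementation: the function names are made up, the target ratio is fixed at 50:50, and the data are random numbers.

```python
import numpy as np

def random_undersample(X, y, rng):
    """Drop random majority rows until both classes have the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    # Keep as many randomly chosen majority rows as there are minority rows.
    maj_idx = rng.choice(np.flatnonzero(y == majority), size=min_idx.size, replace=False)
    keep = np.concatenate([min_idx, maj_idx])
    return X[keep], y[keep]

def random_oversample(X, y, rng):
    """Duplicate random minority rows until both classes have the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    # Sample minority rows with replacement to fill the gap to the majority size.
    extra = rng.choice(min_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # made-up features
y = np.array([1] * 80 + [0] * 920)    # 8% minority class

X_rus, y_rus = random_undersample(X, y, rng)
X_ros, y_ros = random_oversample(X, y, rng)
print(np.bincount(y_rus), np.bincount(y_ros))
```

Undersampling leaves 80 rows per class and discards 840 majority rows; oversampling leaves 920 rows per class, 840 of which are duplicates. Resampling should be applied to the training split only, so that the test set keeps the original class distribution.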

Synthetic Data Generation

This group of techniques is a type of oversampling with the aim of balancing the classes by generating synthetic data. The most widely used and well-known member of this group is the synthetic minority oversampling technique (SMOTE). This set of algorithms oversamples the minority class by the use of k-nearest neighbors.
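The core idea of SMOTE can be sketched as follows: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point somewhere on the segment between them. This is a simplified illustration in NumPy, not the full published algorithm or any library's implementation; the function name smote_sketch and all parameter values are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k, rng):
    """Create n_new synthetic points from minority samples X_min (shape n x d, k < n)."""
    n = X_min.shape[0]
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_new)                      # random minority samples
    pick = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(80, 2))                   # made-up minority class
X_new = smote_sketch(X_min, n_new=840, k=5, rng=rng)
print(X_new.shape)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already covered by the minority class.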

Cost-sensitive Learning (CSL)

The cost-sensitive learning framework takes advantage of both external approaches (by adding costs to instances) and internal approaches (by modifying the internal learning process to accept costs). In this framework, the cost of misclassifying the minority class is higher than the cost of misclassifying the majority class. The cost matrix is composed of four quadrants that resemble the confusion matrix. The penalty for True Positives and True Negatives is zero, thus the total cost is the weighted sum of $C_{FN}$ and $C_{FP}$ and may be computed using the following equation:

$$C_T=C_{FN}||FN||+C_{FP}||FP||$$

where $||FN||$ is the number of false negatives (positive examples wrongly predicted) and $||FP||$ is the number of false positives (negative observations wrongly predicted). Depending on the application, $C_{FN} > C_{FP} $ or $C_{FP} > C_{FN}$.
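A worked instance of the total cost, as a minimal sketch in plain Python. The labels (1 for the positive class, 0 for the negative class) and the cost values are made up for illustration.

```python
def total_cost(y_true, y_pred, cost_fn, cost_fp):
    """C_T = C_FN * (#false negatives) + C_FP * (#false positives); labels are 0/1."""
    n_fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n_fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return cost_fn * n_fn + cost_fp * n_fp

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 0]   # two false negatives, one false positive

# Missing a positive is assumed to be ten times as costly as a false alarm.
cost = total_cost(y_true, y_pred, cost_fn=10, cost_fp=1)
print(cost)  # prints: 21
```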

Selection of performance metrics

ROC curve and how to plot it?

In most cases, the output of the classifier is a score (and not the probability) bounded in the range of 0 and 1. If the output is less than a threshold $\theta$ then we would say the instance belongs to class $-1$, otherwise, it belongs to class $1$. Note that $\theta$ is bounded between the max and min value of the output of the model, which might be 0 and 1.

For every $\theta$, the true positive rate (TPR) and false positive rate (FPR) can be computed. TPR is the ratio of correct positive results to all positive samples available during the test. FPR is the ratio of incorrect positive results to all negative samples available during the test.

Now every choice for $\theta$ corresponds to a point in the ROC space. In ROC space, the X-axis specifies the False Positive Rate (equal to 1 − Specificity) and the Y-axis represents the True Positive Rate (also known as Sensitivity). So the y-axis reflects the proportion of actual positives that are correctly classified as positives.

The area under the ROC curve is known as the AUC (area under the curve) or c-statistic. In the ML community, the AUC statistic is the most often used metric for model comparison.

Note that twice the area between the ROC curve and the no-discrimination line (the diagonal line) is known as the Gini coefficient, so Gini = 2 × AUC − 1.
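The construction of the curve can be sketched directly from the definitions: sweep the threshold over the observed scores, compute one (FPR, TPR) point per threshold, and integrate with the trapezoidal rule. This is a minimal NumPy illustration; it assumes labels coded as 0/1 rather than -1/1, the four scores are made up, and the Gini coefficient is computed as 2 × AUC − 1.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return FPR and TPR arrays, one point per threshold, strictest threshold first."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Start above every score (nothing predicted positive), then visit each distinct score.
    thresholds = np.concatenate([[np.inf], np.sort(np.unique(scores))[::-1]])
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    tpr = np.array([np.sum((scores >= t) & (y_true == 1)) / n_pos for t in thresholds])
    fpr = np.array([np.sum((scores >= t) & (y_true == 0)) / n_neg for t in thresholds])
    return fpr, tpr

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)

# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
gini = 2 * auc - 1
print(f"AUC = {auc:.2f}, Gini = {gini:.2f}")  # prints: AUC = 0.75, Gini = 0.50
```

Plotting tpr against fpr (for example with matplotlib) gives the ROC curve itself.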

ROC curves are known to be insensitive to class balance, hence a good choice for imbalanced datasets.

Summary

While imbalanced classification is a common issue in many ML and data science problems, there is no silver-bullet solution to address the challenge. Strategies such as resampling and the use of various evaluation metrics have the potential to address it. What remains important is continuously monitoring the models we have in production and being cautious of potential data drifts.

Behind every great product there is a great product manager. That is why there are so few great products out there.
A cool data scientist, Delft

Cover Image Credit: Chris Wren and Kenn Brown/mondoworks.

11 must-know Python Pandas tips and tricks for data scientists

Mohsen Davarynejad — Fri, 26 Jan 2024 05:48:27 GMT

You can have data without information, but you cannot have information without data.
Daniel Keys Moran, science fiction writer

Pandas is considered the most widely used tool for data manipulation, filtering, and wrangling. It comes with an enormous set of features and functionalities designed for fast and easy data analytics. As a data scientist, I use its features on a daily basis, and in this post, I'd love to share with you some of the tricks of pandas.

I hope that this crafted list of features serves as a quick reference and gives you a good starting point for learning Pandas. The listed features come in no specific order, so no conclusions can be made on the order.

I will be using the NVDA ticker price data set for this post. So before going into the feature lists, I'll share with you a few lines of code to download the data locally.

Pull the NVDA exchange pricing data

We will be using Yahoo! Finance's API to download some market data. For that, we will be using the yfinance Python package. Follow this link and install yfinance. I installed it using the following single line of code:

!pip install yfinance

import os
import numpy as np
import pandas as pd
import pickle
import yfinance as yf

Let's first pull the historical NVDA price data from the Yahoo! Finance API.

def get_yf_data(ticker, period="1mo"):
    '''Pull and cache yfinance data into a dataframe. Caching is important for avoiding rate limits.'''
    pkl_file_name = ticker.replace(" ", "_") + "_" + period
    cache_path = '{}.pkl'.format(pkl_file_name).replace('/', '-')
    try:
        # Check if the pickle file is in the working directory.
        with open(cache_path, 'rb') as f:
            ticker_df = pickle.load(f)
        print('Loaded {} from cache'.format(ticker))
    except (OSError, IOError) as e:
        # The data is not locally available. Download it and store it as a pickle file.
        print('Downloading {} from yfinance'.format(ticker))
        ticker_df = yf.download(ticker, period=period)
        ticker_df.to_pickle(cache_path)
        print('Cached {} at {}'.format(ticker, cache_path))
    return ticker_df

# Pull the NVDA price data
NVDA_price = get_yf_data("NVDA", period="10d")

# If you would like to download more tickers, you can try the following.
# It downloads the Nvidia, Apple and Tesla share prices into a separate dataframe,
# so that NVDA_price keeps holding the single-ticker data used in the rest of this post.
multi_ticker_price = get_yf_data("NVDA AAPL TSLA", period="1mo")

NVDA_price.tail()

Now that we have the data ready at hand, let's have a look at my selection of Pandas tips and tricks.

Number 11: Apply aggregations on DataFrame and Series

NVDA_price.agg(['mean', 'min', 'max'])
NVDA_price.High.agg(['mean', 'min', 'max'])

.describe() provides the same functionality but .agg() is more flexible as far as I can tell.

NVDA_price.describe()

Number 10: All you need is pandas_profiling

pandas_profiling is a Python library mainly used for initial data exploration. A running example:

# Make sure that you have pandas_profiling installed.
# To install pandas_profiling run: !pip install pandas_profiling
import pandas_profiling

NVDA_price.shape

# Chart the NVDA pricing data
pandas_profiling.ProfileReport(NVDA_price)

This helps you with a few lines of code and time-saving tricks when exploring the data. The pandas_profiling library comes with more features than the one listed above. Interested readers are referred to the official website.

The package provides an extensive exploratory data analysis (EDA) that includes reporting on different types of correlations, such as Pearson's r and the Phik coefficient (φk). Phik (φk) is a new correlation coefficient that works consistently between categorical, ordinal, and interval variables. Phik can capture non-linear dependency and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Read more here.

Number 9: Easy way to convert data to `int` type

This is a simple but useful trick! When converting the data into int type, the most common approach is to use .astype('int'). However, if there are strings or other invalid types, you will face an error, and the conversion will fail. An alternative is the pd.to_numeric() command, whose errors argument controls what happens to invalid values: with errors='coerce' they become NaN instead of raising an error. You may try this for yourself.
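A minimal sketch of the difference, using a made-up Series with one unparseable entry. Filling the resulting NaN with 0 before the final cast is an arbitrary choice for illustration; dropping those rows is an equally valid option.

```python
import pandas as pd

raw = pd.Series(["1", "2", "three", "5"])

# .astype("int") fails on the non-numeric entry.
try:
    raw.astype("int")
    astype_failed = False
except (ValueError, TypeError) as err:
    astype_failed = True
    print(f"astype failed: {type(err).__name__}")

# errors="coerce" turns anything unparseable into NaN instead of raising.
as_numbers = pd.to_numeric(raw, errors="coerce")

# Fill the gaps with a placeholder (0 is an arbitrary choice here), then cast to int.
as_ints = as_numbers.fillna(0).astype("int")
print(as_ints.tolist())  # prints: [1, 2, 0, 5]
```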

Number 8: Break up strings in multiple columns

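A simple trick for splitting a column into multiple columns using the split function. With expand=True the result is a DataFrame with one column per piece, so it has to be assigned to a list of columns (here assuming every value splits into exactly two pieces; new_col_1 and new_col_2 are placeholder names):

NVDA_price[['new_col_1', 'new_col_2']] = NVDA_price['column_name'].str.split(" ", expand=True)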

Number 7: .str

I'd like to show you how awesome .str is with Pandas Series. Check the following piece of code:

# Convert the Date index to str.
NVDA_price['Date_str'] = NVDA_price.index.astype(str)
# And take the first 4 characters of the resulting string.
NVDA_price['Year_str'] = NVDA_price['Date_str'].str[:4]

It's awesome how they make this bit of magic work.

Number 6: Concatenate strings

It's so easy with pandas to create a new column by combining two existing columns:

NVDA_price['ts'] = NVDA_price.index.values.astype(np.int64) // 10 ** 9
NVDA_price['price_ts'] = NVDA_price["Open"].map(str) + " - " + NVDA_price["ts"].map(str)
NVDA_price['price_ts'].head()

and to get them back again (note that recent pandas versions require the split limit to be passed as the keyword argument n):

NVDA_price['price'], NVDA_price['ts'] = zip(*NVDA_price.price_ts.str.split(' - ', n=1))
NVDA_price[['price_ts', 'price', 'ts']].head()

Number 5: Percentage of missing data

To get the percentage of missing data in columns and print them in descending order:

# Gives the % of missing data in each column!
(NVDA_price.isnull().sum() / NVDA_price.isnull().count() * 100).sort_values(ascending=False)

Number 4: Highly correlated columns

Let's assume that we want to build a predictive model and we would like to know which columns are highly correlated (absolute correlation above 0.10) with the column Volume!

# numeric_only=True skips the string columns created in the earlier tips.
corr_coeff = NVDA_price.corr(numeric_only=True)['Volume'].abs().sort_values()
corr_coeff[corr_coeff > 0.10]

Number 3: Apply and lambda functions

One of the most awesome things that you can do with pandas is that you can store objects such as networkx graphs in them. Then you can easily .apply() the metrics and algorithms on each graph and see the tabulated results in a nice pandas dataframe. You can also easily .apply() custom functions or lambdas to modify all objects at once. The index then can be used to keep track of everything and to keep results near their original objects.

In the example, let's assume that you have multiple networks from different years. We will make the graphs at random.

import networkx as nx
import itertools
import random

years, graphs = list(zip(*[(year, nx.barbell_graph(random.randint(2, 10), random.randint(2, 50))) for year in range(2010, 2019)]))
df = pd.DataFrame({"graph": graphs}, index=years)
df["number_of_nodes"] = df.graph.apply(nx.number_of_nodes)
df["number_of_edges"] = df.graph.apply(nx.number_of_edges)
df["transitivity"] = df.graph.apply(nx.transitivity)

def nodes_odd_degree(g):
    # Calculate list of nodes with odd degree
    nodes_odd_degree = [v for v, d in g.degree() if d % 2 == 1]
    return nodes_odd_degree

def len_odd_node_pairs(g):
    # Compute all pairs of odd nodes and return the length of list of tuples!
    odd_node_pairs = list(itertools.combinations(nodes_odd_degree(g), 2))
    return len(odd_node_pairs)

df["nodes_odd_degree"] = df.graph.apply(nodes_odd_degree)
df["len_nodes_odd_degree"] = df["nodes_odd_degree"].apply(len)
df["len_odd_node_pairs"] = df.graph.apply(len_odd_node_pairs)

# And now print a selection of columns:
df[["number_of_nodes", "number_of_edges", "transitivity", "nodes_odd_degree", "len_nodes_odd_degree", "len_odd_node_pairs"]]

Number 2: drop_duplicates vs unique

Let's assume that you have a column with millions of entries and you need to find its set of unique values. df.column_name.drop_duplicates(keep="first", inplace=False) can be a lot faster than df.column_name.unique(). This can be tested using %timeit.

The keep="first" option drops all the duplicates except for the first occurrence.
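Outside a notebook, the comparison can be reproduced with the standard timeit module. The sketch below uses a made-up integer column; it only shows how to measure, since the relative timings depend on the pandas version, the dtype, and the number of distinct values. Note that the two calls return different types: drop_duplicates gives a Series that keeps the original index, while unique gives a NumPy array.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
column = pd.Series(rng.integers(0, 1_000, size=1_000_000))  # made-up data

via_drop = column.drop_duplicates(keep="first")   # Series, original index kept
via_unique = column.unique()                      # NumPy array, order of first appearance

# Both approaches return the same values in the same order.
same_values = np.array_equal(via_drop.to_numpy(), via_unique)

t_drop = timeit.timeit(lambda: column.drop_duplicates(keep="first"), number=5)
t_unique = timeit.timeit(lambda: column.unique(), number=5)
print(f"same values: {same_values}, drop_duplicates: {t_drop:.3f}s, unique: {t_unique:.3f}s")
```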

Number 1: Ufuncs

As explained here, Pandas ufuncs are a much better choice than the .apply command.
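To illustrate the point, the sketch below computes a logarithm in both ways on a small made-up price Series. The ufunc version passes the whole Series to NumPy in one vectorized call, whereas .apply invokes a Python function once per element.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.5, 99.8, 102.3])  # made-up prices

# Element-by-element Python call.
log_apply = prices.apply(lambda p: np.log(p))

# NumPy ufunc applied to the whole Series at once; the index is preserved.
log_ufunc = np.log(prices)

print(np.allclose(log_apply, log_ufunc))  # prints: True
```

Arithmetic operators on Series (+, -, *, /) are vectorized in the same way, so many .apply calls with simple lambdas can be replaced by plain expressions.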

Wrap-Up

This post should give you an idea of some Pandas tricks commonly used by data scientists. The list aims to make your code more efficient.

In my next post, I'll share some basic concepts you need for working with data in PySpark. The post will cover topics such as reading and writing data from CSV files, along with some basic statistical analysis and visualization of data.

Notes:

1 - The official pandas cheat sheet can be downloaded from here.

2 - Link to the notebook of this tutorial.

References:

1 - Data School

2 - Lukas Erhard

Please leave a message below if you would like to share your thoughts or your own tricks on Pandas.

Explainable boosting machine

Mohsen Davarynejad — Tue, 23 Jan 2024 12:39:27 GMT

The business impact of machine learning models is becoming increasingly significant whether it is in marketing campaign budget allocation or credit risk management. However, the lack of interpretability and explainability of many machine learning models makes them not only difficult to trust but also difficult to debug. When the end user can comprehend the decision-making process of a machine learning model, they are more likely to accept, trust, and adopt the model's outputs, giving rise to reliable model-based decisions.

The easiest way to ensure interpretability is to use interpretable models. Consequently, there is a growing interest in the development and utilization of interpretable machine learning models.

Explainable Boosting Machine (EBM), originally developed at Microsoft Research, is an inherently interpretable machine learning model that has gained increasing popularity in a variety of business applications due to its predictive performance and low computational cost. It falls into a category of ML models known as "Glass box". Not only can it capture non-linear relationships between dependent and independent variables, but it also has a built-in mechanism to capture interactions among independent variables. At its heart, it uses some of the well-known ML techniques like bagging and gradient boosting.

Explainable Boosting Machine is a special case of ensemble models, which involve training multiple very simple decision trees and combining their predictions to obtain a final prediction.

Imagine we are given a dataset with n features. The first decision tree is trained only on the first feature. The residuals are then calculated and we continue by training another model, but this time only using the second feature. The process continues until the last model is trained in a boosting fashion using the last feature. The order of the features will not impact the results since we keep the learning rate very small. Now we have the first iteration completed. We continue the process for many iterations, let's say 10,000. At the end of the cycle, we have 10,000 small trees trained on feature 1 that can be summarized as a single graph. We can extract such a graph for each of the features. This set of n graphs constitutes our model.
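The round-robin procedure can be sketched in NumPy for a regression target. This is a deliberately simplified illustration, not the EBM implementation: instead of a small decision tree per step it uses the mean residual within quantile bins of a single feature (which is also a piecewise-constant function of that feature), and it leaves out bagging and interaction terms. All names, the synthetic data, and the hyperparameter values are assumptions.

```python
import numpy as np

def fit_cyclic_boosting(X, y, n_rounds=200, learning_rate=0.05, n_bins=16):
    """Round-robin boosting: one small piecewise-constant update per feature per round."""
    n_samples, n_features = X.shape
    # Quantile bin edges per feature, and the bin index of every sample for every feature.
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]) for j in range(n_features)]
    bins = np.column_stack([np.searchsorted(edges[j], X[:, j]) for j in range(n_features)])

    intercept = y.mean()
    shapes = np.zeros((n_features, n_bins))   # one "graph" (shape function) per feature
    prediction = np.full(n_samples, intercept)

    for _ in range(n_rounds):
        for j in range(n_features):
            residual = y - prediction
            # Mean residual per bin of feature j only (stand-in for a tiny tree on feature j).
            sums = np.bincount(bins[:, j], weights=residual, minlength=n_bins)
            counts = np.bincount(bins[:, j], minlength=n_bins)
            step = learning_rate * sums / np.maximum(counts, 1)
            shapes[j] += step                  # accumulate all updates for feature j
            prediction += step[bins[:, j]]
    return intercept, edges, shapes

def predict(X, intercept, edges, shapes):
    """Sum the intercept and each feature's shape-function value."""
    out = np.full(X.shape[0], intercept)
    for j in range(X.shape[1]):
        out += shapes[j][np.searchsorted(edges[j], X[:, j])]
    return out

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(2000, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)

intercept, edges, shapes = fit_cyclic_boosting(X, y)
mse = float(np.mean((predict(X, intercept, edges, shapes) - y) ** 2))
print(f"training MSE = {mse:.3f}, variance of y = {y.var():.3f}")
```

Each row of shapes is the per-feature graph described above: plotting shapes[0] against the bins of feature 1 shows how that feature contributes to the prediction, which is where the interpretability comes from.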

The ability to model complex relationships and interactions while maintaining explainability makes the Explainable Boosting Machine a powerful tool for a wide range of predictive modeling tasks.

Conclusion:

You have probably heard about the trade-off between explainability and accuracy of machine learning models: the more accurate the model is, the more complex and less explainable it is. EBMs break this trade-off. With EBMs you get high accuracy and a highly explainable model, with the added bonus of editability. This makes EBMs a strong candidate for many real-world applications that support model-based decisions.

Inherently interpretable ML models

Interested in this topic? Read more on "How to design inherently interpretable machine learning models" by Sudjianto and Zhang (2021)

In my next post, I will follow up on the fairness, weakness detection, reliability, robustness, and resilience of XAI.

Overfitting in ML models

Mohsen Davarynejad — Sun, 21 Jan 2024 13:52:25 GMT

Overfitting happens when a model is more complex than it should be and starts to fit the noise in the data (or some degree of that) instead of the underlying pattern. Overfitting leads to poor model performance on new and unseen data.

While there is no single answer for what level of overfitting is acceptable (although we know it partially depends on the specific application, goals, and tradeoffs), a common approach is to use a validation data set to monitor the model's performance during training. When the performance of the model on the validation set starts to degrade while its performance on the training set continues to improve (the validation error typically traces a U-shaped curve), we are transitioning between the under- and overfitting regimes. At this point, you can stop training the model or use techniques like regularization or data augmentation to prevent overfitting. While overfitting leads to poor generalization (beyond the training set) and reduces the predictive power of the model, almost always, some level of overfitting is expected and acceptable.

Avoiding overfitting

A few techniques are known to mitigate overfitting.

Regularization: A very well-known technique that helps prevent overfitting by adding a penalty term to the model's cost function. Regularization discourages overly complex or extreme parameter values. Subsampling rows/columns in XGBoost or ML models with similar architectures is another form of regularization.
Model Simplification: Use a model that is as simple as possible, but no simpler. A simple model has fewer parameters and less possibility to fit noise. Also make sure to discard less discriminatory features and only keep the most relevant ones.
Knowledge Distillation: Train a smaller, more regularized model to mimic the behavior of a larger, more complex model. This helps in efficient knowledge transfer from the complex model to the simpler one.
Cross-Validation: While its primary purpose is not to prevent overfitting, it provides valuable insights into how well a model is likely to perform on unseen data and can indirectly help in preventing overfitting.
Early Stopping: Stop model training once its performance on the validation set plateaus or starts to worsen.
Data Augmentation: This includes not only collecting more diverse and representative data but also introducing random noise into the input data (to make the model more robust and prevent it from relying too much on specific patterns in the data).
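Of the techniques listed above, early stopping is the easiest to show in a few lines. The sketch below fits a deliberately flexible polynomial model with gradient descent and keeps the weights from the epoch with the lowest validation loss. The synthetic data, the polynomial degree, the learning rate, and the patience value are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=x.size)

# Degree-12 polynomial features: flexible enough to start fitting the noise.
features = np.vander(x, N=13, increasing=True)
X_train, y_train = features[:40], y[:40]
X_valid, y_valid = features[40:], y[40:]

weights = np.zeros(features.shape[1])
best_weights, best_loss = weights.copy(), np.inf
patience, bad_epochs = 50, 0
learning_rate = 0.05

for epoch in range(20_000):
    # One gradient-descent step on the training mean squared error.
    gradient = 2 * X_train.T @ (X_train @ weights - y_train) / len(y_train)
    weights -= learning_rate * gradient

    # Monitor the validation loss and remember the best weights seen so far.
    valid_loss = np.mean((X_valid @ weights - y_valid) ** 2)
    if valid_loss < best_loss - 1e-6:
        best_weights, best_loss, bad_epochs = weights.copy(), valid_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # no improvement for `patience` epochs: stop
            break

print(f"stopped after epoch {epoch}, best validation MSE = {best_loss:.3f}")
```

The model that goes to production is the one described by best_weights, not the weights from the final epoch.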

As a rule of thumb, I would not be comfortable deploying an ML model whose performance drops from train to test by more than 2-3%. If that is the case, make sure to use one or more of the techniques listed above to alleviate the situation.

Benign overfitting

A closely related phenomenon in machine learning is what is known as "benign overfitting", where a model overfits the training data but still performs well on the test data. These models have great expressive power. While their performance on the test data might be very good, they might be very sensitive to small perturbations in the inputs and thus very fragile in production.

Benign overfitting can be the consequence of an abundance of data or a robust model architecture, to name a few. While benign overfitting suggests that the model can handle the noise with no significant degradation in its ability to make accurate predictions on new, unseen data, we need to ensure models generalize well without relying on spurious patterns in the data.

While overfitting does not necessarily harm generalization, it negatively impacts model sensitivity and robustness.

In case you need to get a bit more practical in selecting the right model with reasonable performance and robustness you need to look into PiML (PiML GitHub).

pip install PiML

PiML for Identifying model weaknesses before putting them into production

Well-tuned models tend to be more robust in production when compared to overfit and underfit models. That holds true even when we have an overfit/underfit model with similar performance on the test data as a well-tuned model. So once you've gotten to a reasonable threshold of model performance, you're better off focusing on model stability, model generalizability, and the policies you'll put in place around the model than on over-optimizing the model itself. It's the combination of the model + the decision strategies around it that will yield meaningful results in lending. So identify model weaknesses before putting them into production.

Let's have a quick look into the impact of benign overfitting on model robustness using PiML.

The interplay between model robustness and the quality of fit

A robust model can effectively manage variations, noise, or outliers in the input data without experiencing a substantial drop in performance. Experimental studies show that the performance of a well-tuned model is statistically significantly less affected by perturbations in input data when compared to that of overfit models.

Interested in reading more on this topic? Continue here.

Avoid overfitting in DNNs

In part 2 of the same series, I'll cover techniques used to avoid overfitting in DNNs.

What's next? Markov Chain Monte Carlo for approximation of the posterior distribution: Metropolis-Hastings vs Hamiltonian Monte Carlo (Read here)

PyTorch for image classification - Part 1

Mohsen Davarynejad — Tue, 16 Jan 2024 16:35:37 GMT

1. Introduction

This tutorial shows how to classify images using a pretrained Residual Neural Network (ResNet). Image classification is a supervised learning problem with the objective of training a model that learns the relationship between input features and corresponding labels. The output of a classification model is a discrete label or category, indicating the class to which the input belongs.

We will demonstrate the following concepts:

Efficiently loading a dataset off the disk.
Pulling a pre-trained model and fine-tuning it.
Mitigating overfitting by data augmentation and dropout.

And we will follow a standard machine-learning workflow:

Examine and understand the data
Build an input pipeline
Build the model
Train the model
Test the model
Improve the model and repeat the process

In addition, we will also demonstrate how to save models locally.

Steps

Creating a custom dataset
Build a DNN model in PyTorch
Training the MLP model classifier in PyTorch
Plotting of loss and accuracy curves on training and test data
Evaluation of model performance - Confusion matrix

2- Setup

Import PyTorch and other necessary libraries:

import numpy as np
import pandas as pd
import random
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm_notebook as tqdm
import os
import time
from copy import deepcopy
from PIL import Image
from PIL import ImageOps
import matplotlib.pyplot as plt
import torchvision.transforms as transforms
from torchvision.models import resnet50, resnet101
import torch
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn.functional as F
from torchsummary import summary
import seaborn as sn
# utils is a local helper module that accompanies this tutorial
from utils import flatten_list, read_files_in_path

It's important to specify where the tensors and models are stored and processed, whether on a CPU or a GPU. The following code selects the default GPU if one is available and falls back to the CPU otherwise.

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f'Using {device} for inference')

3- Train and Test data

The dataset is divided into training and validation sets, offering a diverse collection of images of your choice. We assume that the training set, residing at BASE_PATH_TRAIN, serves as the foundation for our model to learn and generalize. Meanwhile, the validation set at BASE_PATH_VALID provides a litmus test for the model's accuracy and effectiveness.

BASE_PATH_TRAIN = "Images/Train"
BASE_PATH_VALID = "Images/Valid"

Let's look into the dataset and extract the number of classes within our training set. I need to exclude the generic "Others" category.

classes = os.listdir(BASE_PATH_TRAIN)
classes = [x for x in classes if "Others" not in x]
print(f'Number of classes: {len(classes)}: {classes}')

Next, we will index the full path of images and their corresponding category and also introduce an element of randomness by shuffling the rows. Shuffling ensures that our model encounters a diverse mix of images during each epoch of training, enhancing its ability to generalize and recognize patterns effectively.

training_data = read_files_in_path(path = BASE_PATH_TRAIN)
training_data = training_data.sample(frac=1).reset_index(drop=True)
print(training_data.shape)
training_data.head()

validation_data = read_files_in_path(path = BASE_PATH_VALID)
print(validation_data.shape)
validation_data.head()

4- Decoding the Dataset: Label Encoding

Next, we introduce an important step: label encoding. This process converts the categorical class names into integers, which not only facilitates the training of the model but also streamlines the interpretation of results.

lb = LabelEncoder()
training_data['encoded_labels'] = lb.fit_transform(training_data['labels'])
validation_data['encoded_labels'] = lb.transform(validation_data['labels'])

# Construct the Mapping Dictionary
mapping = dict(zip(lb.classes_, lb.transform(lb.classes_)))
reversed_mapping = {value: key for key, value in mapping.items()}

The above code not only encodes the categorical labels but also constructs a dictionary that maps classes to their corresponding encoded values.
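To see what the encoder and the two dictionaries contain, here is a small pure-Python sketch of the same idea. LabelEncoder assigns integers to the sorted unique class names, which is what the sorted(set(...)) line reproduces; the label values themselves are made up for illustration.

```python
# Hypothetical labels, standing in for the 'labels' column of the training data.
labels = ["Hammer", "Drill", "Wrench", "Drill", "Hammer"]

# Sorted unique class names, in the order LabelEncoder would store them in classes_.
classes = sorted(set(labels))

# Class name -> integer, and the reverse lookup used to read predictions back.
mapping = {name: index for index, name in enumerate(classes)}
reversed_mapping = {index: name for name, index in mapping.items()}

encoded = [mapping[name] for name in labels]
decoded = [reversed_mapping[index] for index in encoded]

print(mapping)   # {'Drill': 0, 'Hammer': 1, 'Wrench': 2}
print(encoded)   # [1, 0, 2, 0, 1]
```

The reversed mapping is what turns a predicted class index back into a readable class name when inspecting results.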

5- Crafting a Data Pipeline and built-in data transformation

For this project, we prefer to write our own data loader from scratch, mainly because it gives us full control over the transformations we would like to apply to the images. The pipeline loads the images off disk, applies the specified transformations, and makes the images ready for the model.

class CreatePipeline(Dataset):
    def __init__(self, data, transform):
        super(CreatePipeline, self).__init__()
        self.data = data.values
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, x):
        image, _, label = self.data[x]
        im = np.asarray(Image.open(image).convert('RGB'))
        if self.transform is not None:
            im = self.transform(im)
        return im, label

train_pipeline = CreatePipeline(training_data, image_transforms['train'])
train_loader = DataLoader(train_pipeline, batch_size=BATCH_SIZE)

validation_pipeline = CreatePipeline(validation_data, image_transforms['valid'])
validation_loader = DataLoader(validation_pipeline, batch_size=BATCH_SIZE)

Here, image_transforms is a dictionary of image transformations for the training and validation datasets. All images pass through the transformation phase (random rotations and flips, color jittering, etc.) so that the ResNet model is exposed to a rich and varied dataset. These random transformations make the model more robust, reduce overfitting, and help it generalize better by exposing it to more aspects of the data.

The list of transformations is below:

image_transforms = {
    'train': transforms.Compose([
        transforms.ToPILImage(),
        transforms.ToTensor(),
        transforms.RandomRotation(degrees=5),
        transforms.Resize((SIZE, SIZE)),
        transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=.1, hue=.1),
        transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ]),
    'valid': transforms.Compose([
        transforms.ToPILImage(),
        transforms.ToTensor(),
        transforms.RandomHorizontalFlip(),
        transforms.ColorJitter(brightness=.1, hue=.1),
        transforms.Resize((SIZE, SIZE)),
        transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ]),
}

Applying distinct transformations to the training and validation datasets balances data augmentation (which helps the model generalize better during training) against a fair evaluation of the model's performance on unseen data during validation.
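The final Normalize step in both pipelines is simple per-channel arithmetic: after ToTensor has scaled pixel values to the range [0, 1], each channel has its mean subtracted and is divided by its standard deviation. The mean and standard deviation used above are the commonly used ImageNet statistics. A small pure-Python sketch of that arithmetic on a single RGB pixel (the helper name normalize_pixel is made up for illustration):

```python
MEAN = (0.485, 0.456, 0.406)  # per-channel means used in the transforms above
STD = (0.229, 0.224, 0.225)   # per-channel standard deviations used above


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((value - mean) / std for value, mean, std in zip(rgb, MEAN, STD))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(MEAN))

# A pure white pixel ends up a little over two standard deviations above the mean.
white = normalize_pixel((1.0, 1.0, 1.0))
print(tuple(round(value, 2) for value in white))
```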

6- Model Selection

In the realm of deep learning, the choice of neural network architecture is a critical decision that can significantly impact the model's performance. ResNet, short for Residual Network, stands out as a powerful and influential architecture due to its ability to address challenges associated with training very deep networks.

Here's why ResNet is a compelling choice for our use case, image classification:

The Vanishing Gradient Problem: One common challenge in training deep neural networks is the vanishing gradient problem, where the gradients diminish as they backpropagate through numerous layers. This can hinder the training of deep networks. ResNet introduces skip connections or residual connections (shortcut paths), allowing information to bypass certain layers. This architecture mitigates the vanishing gradient problem and makes it possible to train extremely deep networks. The skip connections also allow the model to learn identity mappings, making it easier to optimize and converge during training.
Performance on Image Classification: ResNet architectures have demonstrated remarkable performance on image classification tasks, particularly in competitions like ImageNet. The ability to capture intricate features and patterns in images contributes to their success.
Transfer Learning Capability: Pre-trained ResNet models on large datasets (e.g., ImageNet) can be used as a starting point for a variety of computer vision tasks. Leveraging transfer learning allows models to benefit from features learned on diverse datasets.
Adaptability to Different Scales: The skip connections in ResNet allow the model to adapt to features at different scales, capturing both low-level and high-level information in images.
Consistent State-of-the-Art Performance: ResNet variants have consistently achieved state-of-the-art performance in various computer vision tasks, making them a reliable choice for image classification.
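To make the idea of a skip connection concrete, here is a minimal sketch of a residual block in PyTorch. It follows the simpler two-convolution "basic" block design; ResNet-50 and ResNet-101 use a deeper bottleneck variant, so treat this as an illustration of the principle rather than the exact layers inside those models. The class name and tensor sizes are chosen only for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added back to the block's input."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: the input bypasses the convolutions and is added back,
        # giving gradients a direct path through the block.
        return F.relu(out + x)


block = ResidualBlock(channels=8)
x = torch.randn(2, 8, 16, 16)  # (batch, channels, height, width)
y = block(x)
print(y.shape)  # same shape as the input
```

Because the block only has to learn a correction on top of its input, driving the convolutional branch toward zero leaves the input essentially unchanged, which is the identity mapping mentioned above.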

7- Model Hyperparameters

With our training and validation datasets prepared, we now define the hyperparameters for training.

# Hyperparameters for Training
BATCH_SIZE = 16
EPOCHS = 24
LR = 0.1
SIZE = 224
NUM_CLASSES = len(lb.classes_)
print(f"NUM_CLASSES: {NUM_CLASSES}")
RANDOM_SEED = 42
MODEL_STORE_NAME = 'equipment_classification.pt'

The above hyperparameters set the batch size to 16 images, so the model updates its weights after each small, varied group of samples. We set EPOCHS to 24, with each epoch representing a complete pass through the entire training dataset. The learning rate (LR) of 0.1 determines the size of the steps taken when the weights are updated. Each image is resized to 224x224 pixels (SIZE), and NUM_CLASSES is set to the number of encoded classes. Setting RANDOM_SEED to 42 supports the reproducibility of our training process. Finally, MODEL_STORE_NAME is the file name for the stored model.
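A short arithmetic sketch can make the roles of these values concrete. The dataset size of 1000 images is hypothetical, and the update rule shown is plain gradient descent, assumed here only for illustration since the optimizer is not specified in this part.

```python
import math

BATCH_SIZE = 16
EPOCHS = 24
LR = 0.1

num_training_images = 1000  # hypothetical dataset size

# With the DataLoader default of keeping the last partial batch, one epoch
# takes ceil(N / BATCH_SIZE) batches.
batches_per_epoch = math.ceil(num_training_images / BATCH_SIZE)
total_updates = batches_per_epoch * EPOCHS
print(batches_per_epoch, total_updates)  # 63 1512


def sgd_step(weight, gradient, lr=LR):
    """One plain gradient-descent update: step against the gradient, scaled by lr."""
    return weight - lr * gradient


# A gradient of 0.5 moves the weight by LR * 0.5 = 0.05.
new_weight = sgd_step(weight=1.0, gradient=0.5)
print(new_weight)
```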

Part 2 of this series will be published soon!!

Navigating the Depths of Data Science and Machine Learning in TheGradient.io

Mohsen Davarynejad — Sun, 14 Jan 2024 17:21:55 GMT

Hello, Data Enthusiasts!

I'm thrilled to welcome you to TheGradient.io, your go-to hub for all things data science, machine learning, and LLMs. Whether you're a seasoned data scientist or an aspiring ML engineer, this space is tailored just for you.

What's TheGradient.io All About?

At TheGradient.io, we dive deep into the realms of data science, exploring the latest trends and practical use cases and sharing valuable learning materials. Expect a blend of insightful articles, personal experiences, and a curated collection of learning resources that will sharpen your skills in the ever-evolving landscape of data and machine learning.

For Data Scientists and ML Engineers

This space is crafted with you in mind. Expect discussions on cutting-edge techniques, real-world applications, and the occasional coding adventure. Whether you're looking to refine your skills or stay updated on industry trends, TheGradient.io is here to be your companion on this exciting journey.

Got Questions? We've Got Answers!

Have a burning question about a machine learning algorithm, data preprocessing, or the best practices in the field? Look no further! This platform is not just about sharing knowledge; it's about building a community. If you've got a question or a suggestion, or just want to connect with fellow data enthusiasts, don't hesitate to reach out. Your curiosity fuels the discussions here.

Connect with Learning Materials

As we embark on this data-driven adventure together, I'll be sharing learning materials and resources, all neatly organized on my GitHub account. TheGradient.io is not just a blog; it's a knowledge hub. Stay tuned for GitHub links accompanying our posts, providing you with hands-on tools and materials to amplify your learning experience.

Get Ready for the Gradient Journey!

Whether you're here to gain insights, share experiences, or connect with a like-minded community, TheGradient.io is ready to be your companion in the exciting world of data science and machine learning. Let's dive into the gradient together, where every step brings us closer to mastering the art and science of data.

If you're as passionate about data as I am, welcome aboard! I can't wait to share this journey with you.

Feel free to explore, engage, and enjoy the data-driven ride ahead!

Cheers,

Mohsen Davarynejad

https://twitter.com/TheGradient_io/status/1746913124001648817
