Francisco Camargo

Logo

Data Scientist & Machine Learning Engineer

View the Project on GitHub francisco-camargo/francisco-camargo

Python

Return to top README.md

Install Python on Windows

Link to download Python.

During installation, be sure to add Python to PATH:

1670921358303

Check the version of Python. Note that it can have different aliases, e.g. python, py, python3 etc.

python --version
>>> Python3.11.0

If you use VSCode, be sure that the desired Python Interpreter is used: from the Command Pallette search for Python: Select Interpreter. Can check the bottom right of the window:

1670921755378

Install Python on WSL

Guide

give an alias

alias python = python3.10

Guide to install pip

sudo apt-get update
sudo apt-get upgrade # not needed according to guide, but may as well
sudo apt install python3-pip

Project structure

Python Code Environment

Download and install Python from link

Code Environment on Windows

To create an environment via the terminal, use

python -m venv env

To activate environment, use

env/Scripts/activate

In git bash may have to use . env/Scripts/activate

To update pip, use

python -m pip install --upgrade pip

To install libraries, use

pip install -r requirements.txt

To deactivate an active environment, use

deactivate

Code Environment on Ubuntu

On Ubuntu, you may need to install python3-venv, link

sudo apt install python3-venv

Create the virtual environment

python3 -m venv env

Activate the environment

. env/bin/activate

Upgrade pip

python -m pip install --upgrade pip

Install python libraries

pip install -r requirements.txt

Deactivate the environment

deactivate

Testing

Importing local code from other directories

Assume you have the following folder structure:

parent
  scriptE.py
  folder1
    scriptA.py
    scriptB.py
    folder3
      scriptF.py
  folder2
    scriptC.py
    scriptD.py

and the current working directory is folder1.

If you need to run from a submodule directly such that __name__ is '__main__', try adding the parent directory to the system path

try:
    from folder1.folder3 import scriptF
except ModuleNotFoundError as e:
    sys.path.append(os.getcwd())
    from folder1.folder3 import scriptF

If you want to do relative imports: good write-up. My current understanding is that whatever script is '__main__' will not be able to utilize relative imports!

Again assume the current working directory is folder1:

Here are some useful commands to help debug some of these issues:

    import sys
    print(f'system path: {sys.path}')
    import os
    print(f'current working dir (cwd): {os.getcwd()}')
    print(f'ls within cwd: {os.listdir('./')}')
    print(f'ls within ../cwd: {os.listdir('../')}')

Running Python code in VSCode

To use the IPython terminal, install IPython

pip install ipython

Then in the VSCode terminal run

ipython

need to load (once) the %autoreload extension, guide,

%load_ext autoreload

now can use %autoreload

%autoreload
%run main.py

I was able to run both of these lines together in the VSCode terminal by using Ctrl+o in the terminal to have multiple lines of input.

This will have the same effect as clicking Run within Spyder on whatever Python file Spyder has open. This also has the benefit of seeing changes to the code while still in IPython, which is not working when using the “import main” approach.

(old) and then within the IPython terminal that has now been instantiated, import the module you want to use

import main; main.main()

pyproject.toml

The documentation for black suggests that we can use a pyproject.toml file to replace both setup.py and setup.cfg files.

1670897817471

Packaging

Learn how to package code, there are several options, so first want to just look at all of them before picking one to run with

Guide

Guide

Guide, in it, they say: nowadays the use of setup.py is discouraged in favour of pyproject.toml together with setup.cfg. Find out how to use those here.

Poetry also does a similar thin? There may be some serious problems with Poetry. webpage Sounds like poetry also does dependance management, so maybe use it instead of venv? Poetry intro from ArjanCodes, who seems happy with it. Sounds like it can help with package publishing.

What the heck is a wheel?

Plotting

I can plot out-of-the box to a pop-up window in VSCode if I use Run, but if I use ipython to run the code, I can’t seem to be able to get the plots.

In ./src/plotting.py I placed sample code for a plotting function that handles the aesthetics well. It Accounts for; uncertainty bands, .eps formatting, size of plot for two column paper, and several other minor details.

Data Validation

Pydantic

Pandas

Data Types for DataFrames

Guide

Guide

df.apply()

df.apply()

SettingWithCopyWarning

Guide, simple explanation on how to deal with this. First line of defence: use .loc and .iloc

df.loc[idx_label, col_label] = some_new_value

I think idx_label then just needs to be an element found within df.index().

Alternatively, you could use a mask. By example:

mask = df[column_label]==some_value

This returns a mask of boolean values to pick out rows. Now use this mask instead of idx_label

df.loc[mask, col_label] = some_new_value

SciKit-Learn

Want to be able to use sklearn without having to switch back and forth between numpy arrays and dataframes. Use sklearn-pandas. I have successfully used this in make_features.py within the dsc_roadmap project; import DataFrameMapper and use it in conjunction with the sklearn SimpleImputer.

ShuffleSplit

StackOverflow on why ShuffleSplit is useful as training data size grows

FutureWarning: This Pipeline instance is not fitted yet

I created a custom class to use as part of a feature engineering Pipeline and was getting this warning. One way to get this warning to go away is to add an Estimator at the end of the Pipeline but that is undesirable as I want this pipeline to be dedicated to feature engineering.

This post has as solution that I think is better. Essentially, in variables defined within the custom class, name them with an underscore suffix.

class CustomClass(BaseEstimator, TransformerMixin):
    def __init__(self, attribute):
        self.attribute_ = attribute

    def fit(self, X, y=None):
        scale_cols_ = X[self.attribute_]
        return self

It seems like this works even if I only apply an underscore suffix to one (and not all) the attributes. Also not sure if it matter if it’s a method instead. I don’t understand why this works in the first place, but I can live with it and it enables the use of custom transformers to be used in Pipelines!

Experimental Setup

Guide into how to use Hydra to help run experiments using Python.

Video intro

Video mentions integration with mlflow around 20min mark, talks about parallelization (26min mark)

Video for hydra-zen

Model based testing

There is the hypothesis python package. Docs, demonstration

Logging

Don’t pass logger around as a function input, instead import logger into each script of interest

Very simple logging:

import logging

logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
)

logger.info('Added an entry to the logger')

More sophisticated logging:

"""
Logging Configuration Module

This module sets up a centralized logging configuration for use across the application.
It configures a logger that writes to both a file (overwriting it on each run) and the console.
The timestamp in log messages is set to GMT (UTC) time.
Log messages include the filename and line number instead of the logger name.

The module provides a pre-configured logger object that can be imported and used in other modules.

To use this in another script
  from src.logging.log_config import logger
  logger.info('Write this to the log')

Attributes:
    logger (logging.Logger): A configured logger object for use across the application.
"""

import logging
import sys
from time import gmtime

# Configure the logger
logging.Formatter.converter = gmtime  # Set time to GMT

# Create a logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Create file handler which logs even debug messages
file_handler = logging.FileHandler('src/logging/app.log', mode='w')
file_handler.setLevel(logging.INFO)

# Create console handler with a higher log level
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(logging.INFO)

# Create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(filename)s:%(lineno)d\n\t%(message)s')
file_handler.setFormatter(formatter)
console_handler.setFormatter(formatter)

# Add the handlers to the logger
logger.addHandler(file_handler)
logger.addHandler(console_handler)