Welcome to The DataChemist's Journey

I'm glad you are here. Come and take a seat while we go on this adventure. I started this blog to talk about a chemist's journey learning Computational Chemistry, Machine Learning, Data Science and everything in between. Since June 21 I've been collaborating at AyersLab as a MITACS Globalink research intern. The main purpose of this blog is to document the 12-week internship in the form of blog posts. I also want to list and make available every resource that I've found helpful or interesting on the topics covered in the blog. Stay with me on this journey!

Posts

Sep 28, 2021
September 28 - Discussing Training Dataset and Cross-Validation
Today @fwmeng88 and I had a very fruitful discussion about some modifications that can be made to the model training scripts, as these have run into problems, especially memory-related ones. These topics led us to reevaluate how we perform the nested cross-validation (CV) in the script, since it is tied to the way the hyperparameters are optimized.
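To make the link between nested cross-validation and hyperparameter optimization concrete, here is a minimal pure-Python sketch. It is not the script discussed in the post: the toy k-nearest-neighbours regressor, the synthetic data, the grid of `k` values and every function name are assumptions chosen only for illustration. The point it shows is that the inner loop picks the hyperparameter using training data only, while the outer loop scores that choice on data the inner loop never saw.

```python
import random

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error of the toy k-NN model on the test points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def split(data, fold):
    """Return (train, test) where test holds the rows indexed by fold."""
    held = set(fold)
    train = [p for i, p in enumerate(data) if i not in held]
    test = [data[i] for i in fold]
    return train, test

def cross_val_mse(data, k_neighbors, n_folds):
    """Plain k-fold CV error for one hyperparameter value."""
    scores = []
    for fold in k_fold_indices(len(data), n_folds):
        train, test = split(data, fold)
        scores.append(mse(train, test, k_neighbors))
    return sum(scores) / len(scores)

def nested_cv(data, grid, outer=5, inner=3):
    """Outer loop estimates error; inner loop selects the hyperparameter."""
    outer_scores, chosen = [], []
    for fold in k_fold_indices(len(data), outer):
        train, test = split(data, fold)
        # Hyperparameter search only ever sees the outer training split.
        best_k = min(grid, key=lambda k: cross_val_mse(train, k, inner))
        chosen.append(best_k)
        outer_scores.append(mse(train, test, best_k))
    return outer_scores, chosen

# Synthetic data (an assumption): y = 2x plus Gaussian noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(60)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

grid = [1, 3, 5, 9]
scores, chosen = nested_cv(data, grid)
print("chosen k per outer fold:", chosen)
print("mean outer MSE:", sum(scores) / len(scores))
```

Each outer fold triggers a full inner search, so the number of model fits grows as outer folds × inner folds × grid size. That multiplication is worth keeping in mind when a parameter grid is large.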
Sep 21, 2021
September 21 - Partial Results for Delta-Solv
QM descriptors
Aug 4, 2021
August 4 - Learning one or two things
Today I got the COVID vaccine. Thank goodness I was able to find a spot in Linares (a county about 1.5 hours away from Monterrey), so there I went and received my first shot of AstraZeneca. And man ... it hit me hard. Right now, as I am writing this post, I have some fever, dizziness and a little bit of a cough.
Aug 3, 2021
August 3 - Preparing notes for meeting
Prof. Paul was out of the city for some weeks, so it's time to summarize the progress that Fanwang and I have made so far in the project and present it in an upcoming meeting.
Aug 2, 2021
August 2 - 50% done!
This month will be the most intense of the internship in terms of work. Most of the issues and extensions of the work have already been looked over; now it is time to invest a good number of hours in the main projects of my internship. Here are the notes of the day ...
Jul 30, 2021
July 30 - Database meeting
Fridays are for the AyersLab and Databases meetings. For the last couple of weeks there haven't been AyersLab meetings, as Prof. Paul was on a break. The Databases meetings have been recurring, but last week's meeting was cancelled. Some of the points that were discussed are the following:
Jul 29, 2021
July 29 - Data collection for imbalanced algorithms and meeting with Fanwang
Today I spent a good amount of time reading articles about applications of ML in drug discovery, with the goal of finding out whether a significant amount of the research applying this kind of tool uses balanced datasets. Our feeling is that most real-world datasets aren't balanced, but common algorithms and approaches need balanced data. Our review of the current literature may give us a glimpse of what's happening in the industry.
Jul 28, 2021
July 28 - Troubleshooting ML jobs in CC
The model training for common algorithms hasn't gone as expected. Most models hit some type of error or can't be trained because of extensive parameter grids. To date, this is the status of the ML models.
Jul 27, 2021
July 27 - Getting geometries
Today I am going back to some of the pending work that I had in the Database workflow. Here are the notes of the day.
Jul 26, 2021
July 26 - Back on the saddle
Last week I took some days off with my family. We traveled to Mazatlan, where we enjoyed sunny days at the beach. Now, this week there is a lot of stuff to do. My to-do list includes the following work:
Jul 13, 2021
July 13 - Distracted from my activities
The good thing about having this blogging project is that it is personal, and sometimes (like today) I will be sharing more than just technical stuff. Right now I am collaborating with AyersLab at McMaster University, but I am studying for my B.Sc. at Tecnológico de Monterrey, or TEC for short.
Jul 12, 2021
July 12 - Progress in B3DB
Mondays are the days when I have recurring meetings with Fanwang. He is a PhD candidate at AyersLab and is currently working on his thesis. I am glad to be helping with some of the work he has to do, especially the part that involves B3DB and imbalanced learning. I had some pending work from last week, so today I hope to finish that.
Jul 9, 2021
July 09 - Database Meeting
After two weeks without a meeting about the work done in databases, we finally had one this Friday. There were several updates to the scripts, especially those related to the upgrades made in the new fix command.
Jul 8, 2021
July 08 - Classification Algorithms
Today I am coming back to my main project, B3DB, with the task of scripting different common classification Machine Learning (ML) models for this database. First of all, it's important to go back to the basics and remember what type of problem we are tackling. In this case it is classification, which is one of the two families of Supervised Learning (the other one is regression).
Jul 7, 2021
July 07 - Data collection and working with strange formats
Working today with databases - alkalides. These are some very curious molecules, in which an alkali metal atom (Li, Na, K, Rb, Cs, Fr) gains an electron, becoming an anion. This is quite strange, as these atoms usually have just 1 valence electron, so the "easier" thing for them is to lose that electron. That's why alkali metals make up most ionic salts (e.g. NaCl). To form these salts, alkali metals are usually paired with halogens, which are the atoms that gain the extra electron (e.g. Na⁺ and Cl⁻).
Jul 6, 2021
July 06 - B3DB Feature Selection
It's time to integrate the different strategies for data cleaning and feature selection based on multiple criteria into a single Python script. This will have the following structure:
Jul 5, 2021
July 05 - Checking QC Calculations
Last week I ended my activities scheduling and running calculations for a large subset of the Issue 14 database. Now it's time to check how those calculations ran.
Jul 2, 2021
July 2 - My database issues
After learning about databases, it is time to make my own progress. Here's how a common workflow for creating a database goes. In this example, I am creating and scheduling calculations for the issue_14 database:
Jul 1, 2021
July 1 - Happy Canada Day!
So today is Canada Day and it is a holiday for everyone in the research group, but I am over here in Mexico and that doesn't feel very Canadian. It was cool to read a little bit about Canada's history and to learn that it is a very celebrated day over there.
Jun 30, 2021
June 30 - B3DB Feature Selection
Why have feature selection? Isn't it true that the more data, the better the prediction? Well... not exactly. Here are several reasons for selecting features in an ML model:
Jun 29, 2021
June 29 - Databases and Python Scripts
Today I continued the work for Databases and, while working in the Linux terminal that I use to access Compute Canada, I also found that I needed an efficient way to rename folders, subfolders and files. Here are some of the notes from these days.
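As one possible approach to the renaming problem, here is a minimal Python sketch that replaces a piece of text in every file and folder name under a directory. It is not the script from the post: the function name `rename_tree`, the `dry_run` flag and the example folder names are all hypothetical and used only for illustration.

```python
import os
import tempfile

def rename_tree(root, old, new, dry_run=True):
    """Replace `old` with `new` in every file and folder name under root.

    Returns the list of (source, destination) pairs. With dry_run=True
    nothing on disk is changed, so the plan can be inspected first.
    """
    planned = []
    # topdown=False visits the deepest entries first, so renaming a folder
    # never invalidates the paths of entries that are still to be visited.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            if old in name:
                src = os.path.join(dirpath, name)
                dst = os.path.join(dirpath, name.replace(old, new))
                planned.append((src, dst))
                if not dry_run:
                    os.rename(src, dst)
    return planned

# Demo in a throwaway directory (the folder and file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "issue14_raw", "issue14_mol1")
    os.makedirs(nested)
    open(os.path.join(nested, "issue14_mol1.xyz"), "w").close()

    plan = rename_tree(tmp, "issue14", "issue_14", dry_run=True)
    n_planned = len(plan)  # one file plus two folders

    rename_tree(tmp, "issue14", "issue_14", dry_run=False)
    renamed_ok = os.path.exists(
        os.path.join(tmp, "issue_14_raw", "issue_14_mol1", "issue_14_mol1.xyz")
    )

print(n_planned, renamed_ok)
```

Running with `dry_run=True` first is a cheap safeguard on a shared cluster, because a bulk rename is hard to undo.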
Jun 28, 2021
June 28 - Blogging with Jekyll
During this research experience there is also a skill that I was eager to put into practice, and that is blogging. Most of the time I work in streaks of intense work, followed by many days of little to almost no work. That makes it an issue of discipline. The short but constant act of writing a summary throughout the day is definitely going to help me build consistency in my work habits.
Jun 25, 2021
June 25 - Databases
There are two main projects in which I will be involved during this summer internship. The first one is B3DB, which stands for Blood-Brain Barrier Database, where I will be working with Machine Learning models to predict a molecule's BBB permeability. The second project is the construction of molecular databases with computational data. For this project, we are assigned different "issues" of databases, where several steps need to be taken:
Jun 24, 2021
June 24 - Happy Birthday to me!
Today is my birthday! And what better birthday gift than solving your installation issue? Thank goodness the problem was solved after I spent at least 5 hours trying to figure out what was happening with Compute Canada (CC). I had already created an installation script with which I was able to reproduce all the steps up to my problem.
Jun 23, 2021
June 23 - Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the groundwork for any implementation with a Data Set. Usually, data is neither clean nor neatly formatted. Much of the time is spent dealing with NaN (missing) values, funny data and packages that are not working as they are supposed to. For this initial EDA I will try to take a look at the NaN values present in the chemical descriptor dataset.
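A first look at missing values can be as simple as counting them per column. The sketch below does this in plain Python, with no external packages. The descriptor names (`MolWt`, `TPSA`, `LogP`), the toy rows and the helper names are assumptions for illustration and are not taken from the actual dataset.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_report(rows):
    """Return {column: (missing_count, missing_fraction)} for a list of dicts."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    report = {}
    for name in columns:
        # A column absent from a row counts as missing for that row.
        n_missing = sum(is_missing(row.get(name)) for row in rows)
        report[name] = (n_missing, n_missing / len(rows))
    return report

# Toy descriptor table (made-up values).
rows = [
    {"MolWt": 180.16, "TPSA": 63.6, "LogP": float("nan")},
    {"MolWt": 46.07, "TPSA": None, "LogP": -0.31},
    {"MolWt": 78.11, "TPSA": 0.0, "LogP": float("nan")},
    {"MolWt": 151.16, "TPSA": 49.3, "LogP": 0.46},
]

report = missing_report(rows)
for name, (count, fraction) in report.items():
    print(f"{name}: {count} missing ({fraction:.0%})")
```

Note that a legitimate value of `0.0` is not counted as missing. That distinction matters for descriptors such as polar surface area, which can genuinely be zero.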
Jun 22, 2021
June 22 - SHARCNET Summer School: Bioinformatics
There is a Summer School currently running at SHARCNET. I enrolled in some courses, and today the Bioinformatics course starts. Below I share the notes from the 2-session course.
Jun 21, 2021
June 21 - First MITACS Day!
So this is the start of my MITACS Globalink Research Internship with Prof. Ayers in his lab. I am very excited for what is coming in these next 12 weeks of intense learning.
