Welcome to The DataChemist's Journey

I'm glad you are here! Come and take a seat while we go on this adventure. I've started this blog to talk about a chemist's journey learning Computational Chemistry, Machine Learning, Data Science and everything in between. Since June 21 I've been collaborating at AyersLab as a MITACS Globalink research intern. The main purpose of this blog is to document the 12-week internship in the form of blog posts. Furthermore, I wish to list and make available every resource that I've found helpful and/or interesting related to the topics that I am covering in the blog. Stay with me on this journey!

Posts

  • September 28 - Discussing Training Dataset and Cross-Validation

    Today @fwmeng88 and I had a very fruitful discussion about some modifications that can be made to the model training scripts, as these have run into some problems, especially memory-related ones. These topics led us to reevaluate the way in which we are performing the nested cross-validation (CV) in the script, since it is tied to the way in which the hyperparameters are optimized.
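
    As a rough illustration of why nested CV and hyperparameter optimization are linked, here is a minimal sketch using scikit-learn. It is not the project's actual script: the synthetic data, the logistic regression model and the small grid over `C` are all assumptions chosen to keep the example short. The inner loop picks the hyperparameters, while the outer loop estimates how well that whole tuning procedure generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: the real descriptors are not used here).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner folds are used only to choose hyperparameters.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer folds are used only to score the tuned model on unseen data.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Hypothetical, deliberately small grid; large grids multiply the cost
# by (number of candidates) x (inner folds) x (outer folds).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
)

# Each outer fold refits the whole grid search on its training part.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print("mean outer accuracy:", outer_scores.mean())
```

    Because every outer fold repeats the full inner search, shrinking the grid or the number of folds is usually the first thing to try when jobs run out of memory or time.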

  • September 21 - Partial Results for Delta-Solv

    QM descriptors

  • August 4 - Learning one or two things

    Today I got the COVID vaccine. Thank goodness I was able to find a spot in Linares (a county about 1.5 hours away from Monterrey), so there I went and received my first shot of AstraZeneca. And man ... it hit me hard. Right now, as I write this post, I have some fever, dizziness and a bit of a cough.

  • August 3 - Preparing notes for meeting

    Prof. Paul was out of the city for some weeks, so now it's time to summarize the progress that Fanwang and I have made so far in the project and present it in an upcoming meeting.

  • August 2 - 50% done!

    This month will be the most intense of the internship in terms of work. Most of the issues and extensions to the work have already been looked over; now it is time to put in a good number of hours on the main projects of my internship. Here are the notes of the day ...

  • July 30 - Database meeting

    Fridays are for the AyersLab and Databases meetings. For the last couple of weeks there haven't been AyersLab meetings, as Prof. Paul was on a break. The Databases meetings have been held regularly, but last week's meeting was cancelled. Some of the points that were discussed are the following:

  • July 29 - Data collection for imbalanced algorithms and meeting with Fanwang

    Today I spent a good amount of time reading articles about applications of ML in drug discovery, with the goal of finding out whether a significant share of the research applying these kinds of tools uses balanced datasets. Our feeling is that most real-world datasets aren't balanced, but common algorithms and approaches need balanced data. Our review of the current literature may give us a glimpse of what's happening in the industry.

  • July 28 - Troubleshooting ML jobs in CC

    The model training for common algorithms hasn't gone as expected: most models have some type of error or can't be trained because of extensive parameter grids. To date, this is the status of the ML models.

  • July 27 - Getting geometries

    Today I am going back to some of the pending work that I had in the Database workflow. Here are the notes of the day.

  • July 26 - Back on the saddle

    Last week I spent some days taking time off with my family. We traveled to Mazatlan, where we enjoyed sunny days at the beach. Now, this week there is a lot of stuff to do. My to-do list includes the following work:

  • July 13 - Distracted from my activities

    The good thing about having this blogging project is that it is personal and sometimes (like today) I will be sharing more than just technical stuff. Right now I am collaborating at AyersLab at McMaster University, but I am doing my B.Sc. at Tecnológico de Monterrey, or TEC for short.

  • July 12 - Progress in B3DB

    Mondays are the days when I have recurring meetings with Fanwang. He is a PhD candidate at AyersLab and is currently working on his thesis. I am glad to be helping with some of the work he has to do, especially the part that involves B3DB and imbalanced learning. I had some pending work from last week, so today I hope to finish that.

  • July 09 - Database Meeting

    After two weeks without a meeting about the work done in databases, we finally had one this Friday. There were several updates to the scripts, especially those related to the upgrades made in the new fix command.

  • July 08 - Classification Algorithms

    Today I am coming back to my main project, B3DB, with the task of scripting different common classification Machine Learning (ML) models for this database. First of all, it's important to go back to the basics and remember what type of problem we are tackling. In this case, it is classification, which is one of the two families in Supervised Learning (the other one is regression).

  • July 07 - Data collection and working with strange formats

    Working today with databases: alkalides. These are some very curious molecules, in which an alkali metal atom (Li, Na, K, Rb, Cs, Fr) gains an electron, becoming an anion. This is quite strange, as these atoms usually have just 1 valence electron, so the "easier" thing for them is to lose that electron. That's why alkali metals make up most ionic salts (e.g. NaCl). In those salts they usually get paired with halogens, which are the atoms that gain an extra electron (e.g. Na+ and Cl-).

  • July 06 - B3DB Feature Selection

    It's time to integrate the different strategies for data cleaning and feature selection, based on multiple criteria, into a single Python script. This will have the following structure:

  • July 05 - Checking QC Calculations

    Last week I finished my activities by scheduling and running calculations for a large subset of the Issue 14 database. Now it's time to check how those calculations ran.

  • July 2 - My database issues

    After learning about databases, it is time to make my own progress. Here's how a typical workflow for creating a database goes. In this example, I am creating and scheduling calculations for the issue_14 database:

  • July 1 - Happy Canada Day!

    So today is Canada Day and it is a holiday for everyone in the research group, but I am over here in Mexico and that doesn't feel very Canadian. It was cool to read a little bit about Canada's history and to learn that it is a very celebrated day over there.

  • June 30 - B3DB Feature Selection

    Why have feature selection? Isn't it true that the more data, the better the prediction? Well... not precisely. Here are several reasons for selecting features in an ML model:

  • June 29 - Databases and Python Scripts

    Today I continued the work for Databases and, while working with the Linux terminal that I use to access Compute Canada, I also found that I needed an efficient way to rename folders, subfolders and files. Here are some of the notes from these days.
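
    One way to do this kind of bulk renaming from Python is sketched below. It is only an illustration, not the script from the post: the function name `rename_tree`, the `old`/`new` substrings and the folder names are all hypothetical. The one design point worth noting is that paths are processed deepest first, so a folder is renamed only after everything inside it.

```python
import tempfile
from pathlib import Path


def rename_tree(root: Path, old: str, new: str) -> int:
    """Replace `old` with `new` in every file and folder name under root."""
    renamed = 0
    # Collect all paths first, then handle the deepest ones first so that
    # renaming a parent folder never invalidates a path still to be visited.
    paths = sorted(root.rglob("*"), key=lambda p: len(p.parts), reverse=True)
    for path in paths:
        if old in path.name:
            path.rename(path.with_name(path.name.replace(old, new)))
            renamed += 1
    return renamed


# Small demo inside a throwaway directory (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mol_old" / "sub_old").mkdir(parents=True)
    (root / "mol_old" / "sub_old" / "geom_old.xyz").write_text("")
    count = rename_tree(root, "old", "new")
    remaining = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    print(count, remaining)
```

    Trying it on a copy of the data first is a good idea, since renames are not easily undone.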

  • June 28 - Blogging with Jekyll

    Through this research experience, there is also a skill that I was eager to put into practice, and that is blogging. Most of the time I work in streaks of intense work, followed by many days of little to almost no work. That means that it is an issue of discipline. The short but constant act of writing a summary during the day is definitely going to help me build consistency in my work habits.

  • June 25 - Databases

    There are two main projects in which I will be involved during this summer internship. The first one is B3DB, which stands for Blood-Brain Barrier Database, where I will be working with Machine Learning models to predict a molecule's BBB permeability. The second project is the construction of molecular databases with computational data. For this project, we are assigned different "issues" of databases, where several steps need to be taken:

  • June 24 - Happy Birthday to me!

    Today is my birthday! And what better birthday gift than solving your installation issue? Thank goodness the problem was solved after spending at least 5 hours trying to figure out what was happening with Compute Canada (CC). I had already created an installation script where I was able to reproduce all the steps up to my problem.

  • June 23 - Exploratory Data Analysis

    Exploratory Data Analysis (EDA) is the groundwork for any implementation with a dataset. Usually, data is neither clean nor neatly formatted. Much of the time is spent dealing with NaN (missing) values, odd data and packages that are not working as they are supposed to. For this initial EDA I will try to take a look at the NaN values present in the chemical descriptor dataset.
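
    A first look at missing values can be as simple as the pandas sketch below. The tiny table and its column names (`MolWt`, `LogP`, `TPSA`) are made up for illustration and are not the real descriptor dataset; the 40% cutoff is likewise an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a chemical descriptor table.
df = pd.DataFrame(
    {
        "MolWt": [180.2, np.nan, 46.1, 78.1],
        "LogP": [1.2, 0.5, np.nan, np.nan],
        "TPSA": [63.6, 20.2, 20.2, 0.0],
    }
)

# Number of missing values per descriptor column.
nan_counts = df.isna().sum()

# Fraction of missing values per column (mean of a boolean mask).
nan_fraction = df.isna().mean()

# Columns where more than 40% of the rows are missing (arbitrary cutoff).
mostly_missing = nan_fraction[nan_fraction > 0.4].index.tolist()

print(nan_counts)
print(mostly_missing)
```

    Columns flagged this way are candidates for dropping or imputing, depending on how important the descriptor is.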

  • June 22 - SHARCNET Summer School: Bioinformatics

    There is a Summer School currently running at SHARCNET. I enrolled in some courses, and the Bioinformatics course starts today. Below I share the notes from the 2-session course.

  • June 21 - First MITACS Day!

    So this is the start of my MITACS Globalink Research Internship with Prof. Ayers in his lab. I am very excited for what is coming in these next 12 weeks of intense learning.
