Project: Buzzwords for Data Science jobs

Exploring job postings to find popular buzzwords

Richard Mei
3 min read · Nov 23, 2020

Data Science is a very popular phrase, but much too broad. When I learned data science at a boot camp, I was taught Python programming, statistics, machine learning, and working with big data. I essentially learned the data science workflow and got an amazing amount of knowledge out of it. After finishing the boot camp with a stronger skill set, it was of course time to find a job and get some professional experience, but before that I needed to craft a great resume.

With not much tech-industry experience behind me, I was told I needed to get all the buzzwords into my resume. I didn’t know what those words were and wished I had a list of buzzwords that actually related to my studies. That gave me an idea: go find those buzzwords myself!

Obtaining the Data

First, I decided I wanted data from job postings with titles like Data Scientist, Data Science, or Data Analyst. I searched LinkedIn using the keyword Data Science and narrowed the results to my area. After trying to scrape everything automatically, I decided to just scroll down manually and save the HTML file containing all the links to the job postings. Since this wasn’t going to be a huge project, I set up an SQLite database using the sqlite3 library.
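The database setup can be sketched like this; the filename, table name, and column list are my assumptions, chosen to match the five-placeholder INSERT statement shown later:

```python
import sqlite3

# Hypothetical schema: five columns to line up with the
# five-placeholder INSERT used in the scraping loop
conn = sqlite3.connect('job_postings.db')  # filename is my own choice
c = conn.cursor()
c.execute("""
    CREATE TABLE IF NOT EXISTS data_science (
        id INTEGER PRIMARY KEY,
        title TEXT,
        company TEXT,
        location TEXT,
        description TEXT
    );
""")
conn.commit()
```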

from bs4 import BeautifulSoup

with open('non_data_science.html', 'r') as file:  # postings for non-DS roles
    contents = file.read()

search_soup = BeautifulSoup(contents, 'lxml')
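From that soup, the posting links can be collected with a list comprehension. This is only a sketch using a toy page; the `/jobs/view/` URL pattern and the `jobs_list` name are my assumptions, not necessarily the original selector:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the saved LinkedIn search-results page
html = '<a href="/jobs/view/1234/">DS role</a><a href="/legal/">other</a>'
search_soup = BeautifulSoup(html, 'html.parser')

# Keep only anchors that look like job postings (pattern is an assumption)
jobs_list = [a['href'] for a in search_soup.find_all('a', href=True)
             if '/jobs/view/' in a['href']]
print(jobs_list)  # ['/jobs/view/1234/']
```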

I opened the HTML files, converted them into BeautifulSoup objects, and found the links to the job postings. I used the requests library to follow each scraped link. After gaining access to about 1,900 job postings, I wrote my own function called job_info_from_soup to pull data points out of each posting’s BeautifulSoup: title (of the job position), company, location, and description. The gist of the script looked like:

for job_link in jobs_list[index:]:
    response = requests.get(job_link)
    posting_soup = BeautifulSoup(response.content, 'lxml')
    data = job_info_from_soup(posting_soup)
    data = [index] + data  # prepend the row id
    query = """INSERT INTO {} VALUES (?, ?, ?, ?, ?);""".format(job_type)
    c.execute(query, data)
    index += 1

In the end, I had an SQLite database with one table of data science postings. I repeated the process, this time without a keyword search on LinkedIn, to get a second table of non data science postings. That second table would be for potentially doing some NLP as an extra component of this project.

Data Handling

After gathering all my data, I queried the two tables and stored the results in a Pandas data frame. I labeled the data science roles with a target value of 1 and the non data science roles with a target value of 0.

(Screenshots: head and tail of the labeled data frame)
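The loading-and-labeling step can be sketched with a toy in-memory database standing in for the real one (table names and columns here are my assumptions):

```python
import sqlite3
import pandas as pd

# Toy in-memory stand-in for the real scraped database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE data_science (title TEXT, description TEXT)")
conn.execute("CREATE TABLE non_data_science (title TEXT, description TEXT)")
conn.execute("INSERT INTO data_science VALUES ('Data Scientist', 'python sql team')")
conn.execute("INSERT INTO non_data_science VALUES ('Accountant', 'ledgers audits')")

ds = pd.read_sql("SELECT * FROM data_science", conn)
non_ds = pd.read_sql("SELECT * FROM non_data_science", conn)
ds['target'] = 1      # data science roles
non_ds['target'] = 0  # everything else
df = pd.concat([ds, non_ds], ignore_index=True)
```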

I then wrote my own cleaning function that uses regex to tokenize the descriptions. After gathering the descriptions from every row of the data frame with a target of 1, I used the FreqDist function from the nltk library to build a frequency distribution counting each token.
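A minimal version of that cleaning-and-counting step might look like this; the regex rule (lowercase alphabetic tokens only) is my own guess at the cleaning function:

```python
import re
from nltk import FreqDist

def clean_and_tokenize(text):
    # Hypothetical regex tokenizer: lowercase, alphabetic tokens only
    return re.findall(r'[a-z]+', text.lower())

sample = "Data scientists work with data on a team. The team ships models."
freqdist = FreqDist(clean_and_tokenize(sample))
print(freqdist['data'], freqdist['team'])  # 2 2
```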

With the help of a word cloud library in Python, I made an easy visual of the most common words.

from wordcloud import WordCloud

WordCloud(colormap='Spectral').generate_from_frequencies(freqdist)

Takeaways

The takeaway of this project is to work these words into my resume. For example, data of course needs to appear somewhere. Since I see the word team, I would try to craft a sentence about my teamwork skills.

To improve on this project, I would want to look at bigrams: counting the frequency of word pairs to see whether some of them appear more often than these unigrams.
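Bigram counting is a small extension of the unigram step. A sketch using only the standard library (nltk.bigrams would produce the same consecutive pairs):

```python
from collections import Counter

tokens = ['machine', 'learning', 'on', 'big', 'data',
          'and', 'machine', 'learning', 'at', 'scale']
# Consecutive word pairs via zip; compare their counts to the unigram counts
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[('machine', 'learning')])  # 2
```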
