Scraping LinkedIn Data With Python

    My goal for this lab was to use Python to scrape LinkedIn data without parsing HTML tags. LinkedIn does provide an API, but it sits behind a paywall. For a simple experiment like this, I think the open web would suffice. The next step is to dissect the data using a pandas DataFrame and visualize that data using Matplotlib.

    Where Is Waldo?

    Before I could dive deeper into the “how”, I needed the “where”. I knew I could use Python code for the “how”, but where was the data located? Since I wasn’t scraping HTML, where else could I find LinkedIn data?

    Typically, when you want to scrape data off a website, you would use Beautiful Soup or a similar tool to extract data from HTML tags. That often solves the problem, but I wanted to go straight to the source and skip the HTML tags entirely.
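    For contrast, here is a minimal sketch of that HTML-parsing approach. The page URL and the CSS selector below are hypothetical, purely to illustrate the step I wanted to skip.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical example of the usual HTML-scraping approach.
    html = requests.get("https://www.example.com/jobs").text
    soup = BeautifulSoup(html, "html.parser")

    # Extract text from HTML tags with a (made-up) CSS selector.
    for title in soup.select("div.job-card h3"):
        print(title.get_text(strip=True))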

    Now, whether you have been developing websites as long as I have or you’re just joining me on your web dev journey, you can simply open the web developer console and switch to the Network tab for clues.

    Now I have the “where”. Sort of; more on this later.

    Let’s dig deeper.

    HTTP Network Requests

    Unfortunately, simply switching to the Network tab won’t solve the puzzle. You have to be a detective: grab your magnifying glass, put on your Sherlock Holmes hat, and investigate.

    But what exactly was I looking for?

    The answer?

    JSON.

    Now I’m getting closer.

    For the most part, if you want to find specific text in any of the HTTP requests in the Network tab, you can use one of the search tools (there is more than one), then choose an item in the results list, which opens a detail panel on the right. With the detail panel open, use its search feature to find text within an HTTP response body or its headers.

    Once I found the text and the JSON that fit my use case, I could grab the HTTP request URL and use it in my Python code.
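    Before writing the full script, a quick sanity check helps: replay the copied URL with requests and confirm the body parses as JSON. This is only a hedged sketch; the URL below is a placeholder, and a real LinkedIn request will likely also need cookies and headers copied from the browser session.

    import requests

    # Placeholder: paste the request URL copied from the Network tab here.
    url = "https://www.linkedin.com/"

    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    print(r.status_code)

    # .json() raises a ValueError if the body is not actually JSON.
    print(r.json())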

    Python Code

    Note: The code is incomplete to focus on the main objectives of this article.

    To initialize my project, I imported the Python libraries I needed.

    import requests
    import time
    import os
    from os.path import exists
    from datetime import datetime
    import re
    from requests.models import ProtocolError
    from slugify import slugify

    You’ll notice that I did not use a Python class. That’s intentional. This project was small enough to use a functional programming approach.

    headers = {
        'Host': 'www.linkedin.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/110.0'
    }

    # This block runs inside the pagination loop (not shown); hence the breaks.
    try:
        # The full endpoint and query string are omitted from the article; the
        # real URL is formatted with the search location, keyword, and page offset.
        url = "https://www.linkedin.com/".format(location, key_word, i)
        r = requests.get(url, headers=headers, data=payload)
    except (requests.ConnectionError, requests.HTTPError, KeyError, ValueError) as error:
        print(error)
        raise

    try:
        response = r.json()
        jobs = response.get("included")
        if len(jobs) == 0:
            break
    except (ValueError, TypeError):
        # Stop paginating when the body isn't JSON or has no "included" list.
        break

    if jobs is not None:
        for c in jobs:
            if c.get("jobPostingId") is not None:
                job_id = c.get("jobPostingId")
                job_post_id.append(job_id)
                # Append each job ID to a text file, one per line.
                with open("job_ids.txt", "a+") as file_object:
                    file_object.write("{}\n".format(job_id))

    print("Job ids found {}".format(len(job_post_id)))

    Here is example raw data returned by the request above. The challenge was finding the key text I wanted, using various methods and JSON tools.

    { "dashEntityUrn": "urn:li:fsd_jobPosting:3509879850", "companyDetails": { "company": "urn:li:fs_normalized_company:157327", "*companyResolutionResult": "urn:li:fs_normalized_company:157327", "$recipeTypes": [ "com.linkedin.voyager.deco.jobs.web.shared.WebCompactJobPostingCompany" ], "$type": "com.linkedin.voyager.jobs.JobPostingCompany" }

    Here is the final result: the JSON data cleaned and trimmed down to just the fields I wanted.

    { "Job Title": "Cloud Engineer", "Company Name": "BlackLine", "Location": "United States" }

    Here is the Python code I used to generate a bar chart using matplotlib.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the cleaned records, then drop duplicates and empty rows.
    a = pd.read_json("dataset.json", orient='records')
    a.drop_duplicates(inplace=True)
    a.dropna(inplace=True)

    # Normalize the location strings (e.g. strip a trailing "Remote").
    a['Location'] = a['Location'].str.replace('Remote', '').str.strip()

    # Count postings per location, largest first, and plot as a bar chart.
    counts = a.groupby('Location').size().sort_values(ascending=False)
    counts.plot(kind='bar', figsize=(10, 5))
    plt.title('Number of Job Postings by Location')
    plt.xlabel('Location')
    plt.ylabel('Count')
    plt.show()

    And here is the final result of the experiment. The x-axis, Location, had too many cities, so I had to shrink the labels down to fit the text. The y-axis, Count, shows the number of jobs in a given location.
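    One way to make long city names fit, sketched here with hypothetical counts since the exact tweak isn't shown above, is to rotate and shrink the tick labels:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical counts standing in for the real groupby result.
    counts = pd.Series({'United States': 40, 'New York, NY': 12, 'Austin, TX': 7})
    counts.plot(kind='bar', figsize=(10, 5))

    # Rotate and shrink the tick labels so long city names fit under the bars.
    plt.xticks(rotation=45, ha='right', fontsize=8)
    plt.tight_layout()
    plt.show()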

    [Bar chart: number of job postings by location]