Running SQL Syntax on a Python DataFrame

If you were introduced to SQL syntax for sorting through data before learning Python (like I was), or you simply prefer writing queries to find your data while working on a Python project, here is a quick tutorial on how to use pandasql. I will be using some examples from a recent project of mine, where I analyzed 2019 residential sales data to see how different features of a home affect its sale price.

The first thing you need to do is import pandas and pandasql.

So let's check out the structure of this process. First, I save my query to a variable, which helps keep things clear for anyone trying to read your code. I use triple quotes so that I can easily spread the query across multiple lines, giving it the cleaner format we are used to seeing with SQL. Finally, I run the query by passing the query variable (q1) to ps.sqldf(). (I know this query only does the same thing as df.head() in the pandas library, but it is only an example.)

Creating and combining columns in your DataFrame is very simple with pandasql. After the columns in your SELECT statement, place a comma, add the concat function, and give the result a name. Just make sure you save the new DataFrame to a variable.

SQL syntax is great for sorting through your data. First off, SQL makes it very easy to combine “and”, “or”, and “not” operators to filter your data, and it also lets you use wildcards so that you can grab similar rows as well.

Our DataFrame is looking pretty good right now, but after doing some research on the PropertyType column, I know that it also includes condos and multifamily homes, which I want to filter out for my purposes. Also, by graphing my sale prices, I noticed that some homes were sold for 0 dollars, which is not realistic in King County, so let's clear those out real quick with a couple of “and” and “or” operators.

Note that I placed all of my “or” conditions within parentheses so that they are evaluated as a group and attached to the “and” condition. Without the parentheses, “and” would bind more tightly than “or”, and we would still end up with sale prices of 0 where property types are between 11 and 13.

We now have a DataFrame we can use for some analysis. From here we could do some extra cleaning, such as using CASE statements for basic encoding or dropping columns by defining a new DataFrame that keeps only the columns named in the SELECT statement. The pandas library has some great features that do these things as well, so I will leave it up to you to decide which one you are more comfortable with.
