Aster nPath functionality (Volume 2)
I’m assuming you’ve already read Aster nPath functionality (Volume 1) and are now ready to dive deep into Use Case scenarios. First we’ll cover some additional concepts, and then move into real-life examples that bring out the value of nPath. Here’s the lesson plan:
- Understanding PARTITION BY and ORDER BY
- LAG function
- nPath Use Case queries. Want to find:
- Users who started at Home page, clicked on Product, then bought that Product
- Top 5 paths after a Web event
- Average dwell time between pages
- Customers who bought a Product within 5 minutes of a Product view
1. Understanding PARTITION BY and ORDER BY
I forgot to include this concept in Volume 1 so want to briefly describe it here. It will help you visualize how the data is manipulated before nPath starts looking for patterns. Consider the following code:
SELECT DISTINCT user_id
FROM nPath (
    ON clicks
    PARTITION BY user_id
    ORDER BY datestamp
    MODE (NONOVERLAPPING)
    PATTERN ('H.V.C')
    SYMBOLS (page = 'home' AS H,
             page = 'view_product' AS V,
             page = 'checkout' AS C)
    RESULT (FIRST(user_id OF H) AS user_id)
)
ORDER BY user_id;
The first thing nPath does is get the data organized so it can be processed. It first looks at the PARTITION BY column and gets all those rows on the same v-Worker. If you are lucky enough to have this column (in our case, user_id) defined in the CREATE TABLE as the hash column, then each distinct user_id is already on the same v-Worker and no data movement is necessary. However if another column was defined as the hash column, then it may be necessary to move data between v-Workers of the Worker nodes.
Once the user_id rows are on the same v-Worker, it’s now time to sort those rows based on the ORDER BY column. In our case, we will sort by datestamp. Since we did not specify ASCENDING or DESCENDING, the default ASCENDING will take precedence.
So in our example, it will look something like this:
At this point, you have all your ducks in a row and it’s time to process the remaining keywords (MODE, PATTERN, SYMBOLS, RESULT). Processing occurs in a top-to-bottom fashion as nPath searches for the defined PATTERN.
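To make the partition-then-sort step concrete, here’s a minimal plain-Python sketch of what the engine does conceptually before pattern matching begins. This is not Aster code, and the sample rows (user_id, datestamp, page) are made up for illustration:

```python
from collections import defaultdict

# Hypothetical clickstream rows: (user_id, datestamp, page)
rows = [
    (2, "2013-01-01 10:05", "view_product"),
    (1, "2013-01-01 09:00", "home"),
    (1, "2013-01-01 09:02", "view_product"),
    (2, "2013-01-01 10:00", "home"),
    (1, "2013-01-01 09:04", "checkout"),
]

def partition_and_order(rows):
    """PARTITION BY user_id: gather each key's rows together (on the
    same v-Worker), then ORDER BY datestamp ascending within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)            # partition step
    for key in groups:
        groups[key].sort(key=lambda r: r[1])  # order step (default ASCENDING)
    return dict(groups)

groups = partition_and_order(rows)
```

After this, each user’s clicks sit together in timestamp order, which is exactly the shape the top-to-bottom pattern scan needs.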
2. LAG function
You can compare the current row with a previously seen row to decide if a symbol applies to it. To do this, use the LAG function in the SYMBOLS clause. Here’s the syntax (which may not make sense until we show you a few examples, so just hang in there for a few minutes).
LAG ( expression-prev , lag-rows [, default-value ] ) operator expression-current
expression-current operator LAG ( expression-prev , lag-rows [, default-value ] )
• expression-current is the name of the column from the current row, or an expression operating on this column.
• operator can be >, >=, <, <=, =, or !=
• expression-prev is the name of the column from the earlier row (the row lag-rows back), or an expression operating on that column.
• lag-rows is the number of rows to count back from the current row to reach the row we designate as the earlier row. For example, to compare with the immediately preceding row, use “1”.
• default-value is the optional value to use when no row can be designated as the “earlier row.” In that case, the default-value is evaluated on the current row and used in place of expression-prev.
To evaluate the LAG expression, nCluster uses the operator to compare the value of expression-current with the value of expression-prev.
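Before the full example, here’s a tiny plain-Python model of those semantics. This is only a sketch of how LAG behaves, not Aster internals; the `lag` helper, the sample amounts, and the 2% comparison are all illustrative:

```python
def lag(rows, i, col, lag_rows, default=None):
    """Return rows[i - lag_rows][col] if that earlier row exists;
    otherwise fall back to the default (evaluated on the current row)."""
    j = i - lag_rows
    if j >= 0:
        return rows[j][col]
    return default(rows[i]) if callable(default) else default

amounts = [{"amount": 600}, {"amount": 590}]

# Symbol test on row 1: is the previous amount within 2% of the current one?
cur = amounts[1]["amount"]
prev = lag(amounts, 1, "amount", 1)
matches = prev >= int(cur * 0.98) and prev <= int(cur * 1.02)
```

On the first row of a partition there is no earlier row, so the default kicks in; for every later row, LAG simply reaches back `lag_rows` rows.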
Yeah, I know. Techno-babble doesn't make a lot of sense to me either. But after walking through the example below, it will.
Here’s the scenario we want to solve:
From the savingstrans table, identify any pairs of adjacent deposits (ignore debits) whose amounts are over $500 and within 2% of each other. Return the customer, the two deposit amounts, and the time difference between the 1st and 2nd deposits.
Suppose our table looks like this.
Here’s the code using the LAG function to solve the criteria listed:
SELECT customer, last_2, Time2 - Time1 AS Time_Lag
FROM nPath (
    ON (SELECT * FROM savingstrans WHERE amount > 0)
    PARTITION BY customer
    ORDER BY eventtimestamp
    MODE (OVERLAPPING)
    PATTERN ('First500.Second500')
    SYMBOLS (amount > 500 AS First500,
             amount > 500
               AND lag(amount,1) >= (amount*.98)::int
               AND lag(amount,1) <= (amount*1.02)::int AS Second500)
    RESULT (LAST(customer OF First500) AS customer,
            ACCUMULATE(amount OF ANY(First500, Second500)) AS last_2,
            FIRST(eventtimestamp OF First500) AS Time1,
            FIRST(eventtimestamp OF Second500) AS Time2)
);
... and here's the Answer set.
customer | last_2 | time_lag
mike | [1050, 1040] | 14 days
mike | [1051, 1059] | 13 days
There’s a lot to talk about here so let’s dig in. I’ll discuss based on order of execution:
- FROM nPath starts us off with the function we’ll be using. So it’ll be a pattern search
- ON (SELECT * FROM savingstrans WHERE amount > 0). The WHERE clause filters out debits. Notice I did not use > $500 since that would create the ‘wrong’ answer set per the criteria
- PARTITION BY customer ORDER BY eventtimestamp. This organizes the data (as we discussed earlier) into the correct group (customer) and then orders each group by timestamp, in the order the deposits were made
- MODE(OVERLAPPING) since I’m looking for any adjacent pairs
- PATTERN('First500.Second500') is looking for a > $500 deposit followed by a second > $500 deposit. Pretty straightforward
- SYMBOLS is where all the heavy lifting is. First500 is simple enough. Let’s take a look at Second500 and break it down piece by piece:
amount > 500 and lag(amount,1) >= (amount*.98)::int and lag(amount,1) <= (amount*1.02)::int as Second500
- We want to make sure amount > $500 or else Mark’s rows will sneak into the answer set
- Note it’s legal to have multiple conditions in SYMBOLS that use LAG (as long as you combine them with AND). However, OR is not supported
- When using LAG, it must lead the left-hand side of the comparison and cannot be buried inside arithmetic. The right-hand side is typically an arithmetic expression. So LAG(amount,1) < (amount*1.02) is legal, but ((amount * 2) – LAG(amount,1)) < (amount*1.02) is illegal. You must lead with LAG
- LAG(amount,1) >= (amount*.98)::int means:
- Go back 1 row and get the value from the AMOUNT column. lag-rows must be a positive number since LAG can only look at preceding rows; it cannot see following rows (ie: -1). It can reference any one preceding row but cannot specify a range of rows
- So the code means: the AMOUNT value from the preceding row >= 98% of the current row’s AMOUNT value
- What’s up with the < ::int > syntax?? This is a CAST. The AMOUNT column was defined as Integer in the CREATE TABLE statement, but since we are doing decimal arithmetic here, we need a CAST back to integer or we’ll end up with a ‘type incompatibility’ error message
- RESULT has a lot of verbiage so let’s look at it in detail too:
RESULT (LAST(customer OF First500) AS customer,
        ACCUMULATE(amount OF ANY(First500, Second500)) AS last_2,
        FIRST(eventtimestamp OF First500) AS Time1,
        FIRST(eventtimestamp OF Second500) AS Time2)
- I need the CUSTOMER in my result set. There are numerous ways to write this (FIRST, LAST both work and I could have used FIRST500 or SECOND500)
- I used the ACCUMULATE to dump both First500 and Second500 AMOUNTS into a single column in the result set
- Since criteria asked for Time lag between 2 deposits, I’m getting those 2 bits of information and will subtract them in next step
- SELECT customer, last_2, Time2 - Time1 AS Time_Lag. As you probably noticed, I used alias names in the RESULT section, so now I apply those names here. I could not use SELECT * since I have to do some subtraction to get the difference in time between the 2 deposits
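To double-check the logic, here’s a plain-Python sketch of the same adjacent-pair scan. The source table isn’t shown in full, so these deposit rows for one customer are invented to reproduce the answer set above (note the small $200 deposit that breaks adjacency between the two matched pairs):

```python
from datetime import date

# Hypothetical deposits for one customer, already filtered (amount > 0)
# and ordered by timestamp: (amount, eventtimestamp)
deposits = [
    (1050, date(2013, 1, 1)),
    (1040, date(2013, 1, 15)),
    (200,  date(2013, 1, 20)),   # small deposit breaks adjacency
    (1051, date(2013, 1, 21)),
    (1059, date(2013, 2, 3)),
]

def adjacent_pairs(rows):
    """OVERLAPPING scan for 'First500.Second500': adjacent deposits,
    both > 500, with the earlier amount within 2% of the current one."""
    out = []
    for i in range(1, len(rows)):
        prev_amt, prev_ts = rows[i - 1]
        cur_amt, cur_ts = rows[i]
        first500 = prev_amt > 500
        second500 = (cur_amt > 500
                     and prev_amt >= int(cur_amt * 0.98)   # lag >= 98%
                     and prev_amt <= int(cur_amt * 1.02))  # lag <= 102%
        if first500 and second500:
            out.append(([prev_amt, cur_amt], (cur_ts - prev_ts).days))
    return out

pairs = adjacent_pairs(deposits)
```

Running this yields the two pairs from the answer set, with time lags of 14 and 13 days.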
So that’s the tour of using LAG in nPath. Taking each line of code separately really makes a difference in understanding what’s going on. Now that we have the nPath fundamentals down, let’s move on to some real Use Case examples.
3. nPath Use Case examples
So all these concepts are great, but how can I use them to generate sales? I’m glad you asked. Here are some pretty cool examples. And even better, you can execute these on the Aster VMware images if you have loaded the RETAIL_WEB_CLICKS (3 million rows) and BANK_WEB_CLICKS (1.4 million rows) tables. See Using Aster Express: ACT 2, Loading Data to get this set up. (Note I did create a duplicate table of the RETAIL_WEB_CLICKS table. I named it CLICKS and renamed the cookie_id column to user_id to make the examples below more relevant).
A. Analyzing Clickstream logs – Want list of all users who started at our Home page, clicked on Product, then bought that Product via Checkout with idea of sending them a follow-up sales promotion.
SELECT DISTINCT user_id
FROM nPath (
    ON clicks
    PARTITION BY user_id
    ORDER BY datestamp
    MODE (NONOVERLAPPING)
    PATTERN ('H.V.C')
    SYMBOLS (page = 'home' AS H,
             page = 'view_product' AS V,
             page = 'checkout' AS C)
    RESULT (FIRST(user_id OF H) AS user_id)
)
ORDER BY user_id;
Here’s a partial listing of the result set:
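Conceptually, the H.V.C scan is just a search for three consecutive pages per user. Here’s a plain-Python sketch of that idea on made-up click streams (the user IDs and page sequences are invented, not from the RETAIL_WEB_CLICKS data):

```python
# Hypothetical ordered page sequences, one list per user
sessions = {
    "u1": ["home", "view_product", "checkout"],
    "u2": ["home", "view_product", "home"],
    "u3": ["view_product", "home", "view_product", "checkout"],
}

def bought_after_home(pages):
    """True if the consecutive sequence home -> view_product -> checkout
    appears anywhere in this user's ordered click stream."""
    target = ["home", "view_product", "checkout"]
    return any(pages[i:i + 3] == target for i in range(len(pages) - 2))

buyers = sorted(u for u, pages in sessions.items() if bought_after_home(pages))
```

Note that u3 still qualifies: the pattern doesn’t have to start at the first click, it just has to occur somewhere in the ordered stream.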
B. Top 5 paths after an Online Statement Enrollment – Want to see where my customers are navigating on my Web page after signing up for ‘Online Statement Enrollment’. Using this information, I can tailor that Web page accordingly.
SELECT path, count(*) AS freq
FROM nPath (
    ON bank_web_clicks
    PARTITION BY customer_id
    ORDER BY datestamp
    MODE (OVERLAPPING)
    PATTERN ('A.B')
    SYMBOLS (page ILIKE 'ONLINE STATEMENT ENROLLMENT' AS A,
             TRUE AS B)
    RESULT (ACCUMULATE(page OF B) AS path)
)
GROUP BY path
ORDER BY freq DESC
LIMIT 5;
Notice the ILIKE keyword in the above code. As you may recall from my Volume 1 article, Aster is case-sensitive by default. However, sometimes you may want a case-blind query. For case-insensitive text comparisons, you can use the ILIKE and NOT ILIKE keywords.
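Here’s a plain-Python sketch of what this query computes: find every overlapping A.B pair where A matches the enrollment page case-insensitively, then count the next pages. The page sequences are invented for illustration, and the ILIKE match is emulated with a lowercase comparison:

```python
from collections import Counter

# Hypothetical per-customer page sequences, already ordered by timestamp
streams = [
    ["Online Statement Enrollment", "ACCOUNT SUMMARY"],
    ["online statement enrollment", "ACCOUNT SUMMARY"],
    ["ONLINE STATEMENT ENROLLMENT", "FAQ"],
]

next_pages = Counter()
for pages in streams:
    for a, b in zip(pages, pages[1:]):                    # OVERLAPPING 'A.B' pairs
        if a.lower() == "online statement enrollment":    # emulates ILIKE
            next_pages[b] += 1

top5 = next_pages.most_common(5)
```

All three spellings of the enrollment page match, which is exactly why ILIKE matters here.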
Here’s the result set. As the president of the Bank, I’m concerned by what I see here. It appears my customers are having a hard time setting up their ONLINE STATEMENT ENROLLMENTS. This should be a 1-time occurrence, and it looks like they are getting kicked out of the web page and then coming back to complete the form. So I’ll need to look into this. I also notice that I don’t see our INVESTMENTS path in here. This is a concern, so I will make sure I put a banner on the ONLINE STATEMENT ENROLLMENT page to highlight those services as well.
C. Find Average dwell time between pages – I just installed a new Collaborative Filter by Aster (ie: people who bought this also buy these) and I want to see how long customers spend on these web pages compared to my previous filter.
SELECT avg(b_time - a_time) AS Avg_Page_Dwell
FROM nPath (
    ON clicks
    PARTITION BY user_id, session_id
    ORDER BY datestamp
    MODE (NONOVERLAPPING)
    PATTERN ('A.B')
    SYMBOLS (TRUE AS A, TRUE AS B)
    RESULT (LAST(datestamp OF B) AS b_time,
            FIRST(datestamp OF A) AS a_time)
);
Notice I have a composite PARTITION BY defined. I want to make sure each user’s clicks stay within the same session_id. It looks like the average dwell time is almost 3 minutes. That’s over 30 seconds more than the last time we ran this. It appears the new Collaborative Filter is doing a better job of retaining customers’ interest.
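The dwell-time math reduces to averaging the gaps between adjacent clicks within each (user_id, session_id) partition. Here’s a plain-Python sketch with invented timestamps, just to show the arithmetic:

```python
from datetime import datetime

# Hypothetical clicks keyed by (user_id, session_id); each value is the
# ordered list of click timestamps for that session.
clicks = {
    ("u1", 1): [datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 3)],
    ("u1", 2): [datetime(2013, 1, 1, 10, 0), datetime(2013, 1, 1, 10, 2)],
}

# 'A.B' over consecutive rows within each partition: collect the gap
# between every adjacent pair of clicks, then average.
gaps = [
    (b - a).total_seconds()
    for stamps in clicks.values()
    for a, b in zip(stamps, stamps[1:])
]
avg_dwell_seconds = sum(gaps) / len(gaps)
```

The composite partition key matters: without session_id, the gap between a user’s last click today and first click tomorrow would wreck the average.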
D. Find Customers who buy a Product (Checkout) within 5 minutes of Viewing that Product – I want to know which products bought by which customers are flying off the shelves. I’ll be using the LAG function on this one.
SELECT *
FROM nPath (
    ON clicks
    PARTITION BY user_id
    ORDER BY datestamp
    MODE (NONOVERLAPPING)
    PATTERN ('Prod.Buy')
    SYMBOLS (page = 'view_product' AS Prod,
             page = 'checkout'
               AND lag(datestamp,1) >= datestamp - interval '5 minutes' AS Buy)
    RESULT (FIRST(user_id OF Prod) AS user_id,
            FIRST(product_id OF Prod) AS Product,
            FIRST(datestamp OF Prod) AS ViewProd_ts,
            FIRST(datestamp OF Buy) AS Buy_ts)
);
Here's a partial listing:
I can tell I'm going to put more advertising dollars into Product 1003. It's a hot seller.
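The LAG condition here is just a timestamp comparison between adjacent rows. A plain-Python sketch of the same check, on invented clicks for one user (page, product_id, timestamp), looks like this:

```python
from datetime import datetime, timedelta

# Hypothetical ordered clicks for one user: (page, product_id, datestamp)
clicks = [
    ("view_product", 1003, datetime(2013, 1, 1, 9, 0)),
    ("checkout",     1003, datetime(2013, 1, 1, 9, 4)),
    ("view_product", 2001, datetime(2013, 1, 1, 9, 10)),
    ("checkout",     2001, datetime(2013, 1, 1, 9, 30)),  # too slow to count
]

fast_buys = []
for prev, cur in zip(clicks, clicks[1:]):
    view_then_buy = prev[0] == "view_product" and cur[0] == "checkout"
    # LAG(datestamp,1) >= datestamp - interval '5 minutes'
    within_5_min = prev[2] >= cur[2] - timedelta(minutes=5)
    if view_then_buy and within_5_min:
        fast_buys.append((prev[1], prev[2], cur[2]))
```

Only the 1003 view/checkout pair survives the 5-minute window; the 2001 checkout came 20 minutes after the view, so it drops out.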
As you can tell, nPath can provide tons of insight to vendors concerning which products and services their customers are buying. The above are just 4 simple examples. So what’s the biggest roadblock to using this function?
Typically data is not the problem. If anything, there’s almost too much of it. And most data scientists have the software tools they need to find these patterns. The key to getting value out of nPath is being creative in figuring out what behavior you want to track. I spend more time dreaming up what I want to analyze than I spend writing the code. Focus your energies on this part of the equation and you’ll be finding interesting trends in your data that can lead to increased sales, incremental revenue streams, better customer retention, more loyalty, etc.