Extracting the nth Delimited Substring : SQL

Problem

You want to extract a specified, delimited substring from a string. Consider the following view V, which generates source data for this problem:

 create view V as
select 'mo,larry,curly' as name
from t1
union all
select 'tina,gina,jaunita,regina,leena' as name
from t1

Output from the view is as follows:

 
select * from v


NAME
-------------------
mo,larry,curly
tina,gina,jaunita,regina,leena

You would like to extract the second name in each row, so the final result set would be:

   SUB
-----
larry
gina

Solution

The key to solving this problem is to return each name as an individual row while preserving the order in which the name exists in the list. Exactly how you do these things depends on which DBMS you are using.

DB2

After walking the NAMEs returned by view V, use the function ROW_NUMBER to keep only the second name from each string:

  1 select substr(c,2,locate(',',c,2)-2)
2 from (
3 select pos, name, substr(name, pos) c,
4 row_number() over( partition by name
5 order by length(substr(name,pos)) desc) rn
6 from (
7 select ',' ||csv.name|| ',' as name,
8 cast(iter.pos as integer) as pos
9 from V csv,
10 (select row_number( ) over( ) pos from t100 ) iter
11 where iter.pos <= length(csv.name)+2
12 ) x
13 where length(substr(name,pos)) > 1
14 and substr(substr(name,pos),1,1) = ','
15 ) y
16 where rn = 2

MySQL

After walking the NAMEs returned by view V, use the position of the commas to return only the second name in each string:

  1 select name
2 from (
3 select iter.pos,
4 substring_index(
5 substring_index(src.name,',',iter.pos),',',-1) name
6 from V src,
7 (select id pos from t10) iter,
8 where iter.pos <=
9 length(src.name)-length(replace(src.name,',',''))
10 ) x
11 where pos = 2

Oracle

After walking the NAMEs returned by view V, retrieve the second name in each list by using SUBSTR and INSTR:

  1 select sub
2 from (
3 select iter.pos,
4 src.name,
5 substr( src.name,
6 instr( src.name,',',1,iter.pos )+1,
7 instr( src.name,',',1,iter.pos+1 ) -
8 instr( src.name,',',1,iter.pos )-1) sub
9 from (select ','||name||',' as name from V) src,
10 (select rownum pos from emp) iter
11 where iter.pos <>
12 )
13 where pos = 2

PostgreSQL

Use the function SPLIT_PART to help return each individual name as a row:

  1 select name
2 from (
3 select iter.pos, split_part(src.name,',',iter.pos) as name
4 from (select id as pos from t10) iter,
5 (select cast(name as text) as name from v) src
7 where iter.pos <=
8 length(src.name)-length(replace(src.name,',',''))+1
9 ) x
10 where pos = 2

SQL Server

After walking the NAMEs returned by view V, use the function ROW_NUMBER to keep only the second name from each string:

  1 select substring(c,2,charindex(',',c,2)-2)
2 from (
3 select pos, name, substring(name, pos, len(name)) as c,
4 row_number() over(
5 partition by name
6 order by len(substring(name,pos,len(name))) desc) rn
7 from (
8 select ',' + csv.name + ',' as name,
9 iter.pos
10 from V csv,
11 (select id as pos from t100 ) iter
12 where iter.pos <= len(csv.name)+2
13 ) x
14 where len(substring(name,pos,len(name))) > 1
15 and substring(substring(name,pos,len(name)),1,1) = ','
16 ) y
17 where rn = 2

Discussion

DB2 and SQL Server

The syntax is slightly different between these two DBMSs, but the technique is the same. I will use the solution for DB2 for the discussion. The strings are walked and the results are represented by inline view X:

 
select ','||csv.name|| ',' as name,
iter.pos
from v csv,
(select row_number() over( ) pos from t100 ) iter
where iter.pos <= length(csv.name)+2


EMPS POS
------------------------------- ----
,tina,gina,jaunita,regina,leena, 1
,tina,gina,jaunita,regina,leena, 2
,tina,gina,jaunita,regina,leena, 3


The next step is to then step through each character in each string:

 
select pos, name, substr(name, pos) c,
row_number() over(partition by name
order by length(substr(name, pos)) desc) rn
from (
select ','||csv.name||',' as name,
cast(iter.pos as integer) as pos
from v csv,
(select row_number() over() pos from t100 ) iter
where iter.pos <= length(csv.name)+2
) x
where length(substr(name,pos)) > 1


POS EMPS C RN
--- --------------- ---------------- --
1 ,mo,larry,curly, ,mo,larry,curly, 1
2 ,mo,larry,curly, mo,larry,curly, 2
3 ,mo,larry,curly, o,larry,curly, 3
4 ,mo,larry,curly, ,larry,curly, 4


Now that different portions of the string are available to you, simply identify which rows to keep. The rows you are interested in are the ones that begin with a comma; the rest can be discarded:

 
select pos, name, substr(name,pos) c,
row_number() over(partition by name
order by length(substr(name, pos)) desc) rn
from (
select ','||csv.name||',' as name,
cast(iter.pos as integer) as pos
from v csv,
(select row_number() over( ) pos from t100 ) iter
where iter.pos <= length(csv.name)+2
) x
where length(substr(name,pos)) > 1
and substr(substr(name,pos),1,1) = ','


POS EMPS C RN
--- -------------- ---------------- --
1 ,mo,larry,curly, ,mo,larry,curly, 1
4 ,mo,larry,curly, ,larry,curly, 2
10 ,mo,larry,curly, ,curly, 3
1 ,tina,gina,jaunita,regina,leena, ,tina,gina,jaunita,regina,leena, 1
6 ,tina,gina,jaunita,regina,leena, ,gina,jaunita,regina,leena, 2
11 ,tina,gina,jaunita,regina,leena, ,jaunita,regina,leena, 3
19 ,tina,gina,jaunita,regina,leena, ,regina,leena, 4
26 ,tina,gina,jaunita,regina,leena, ,leena, 5

This is an important step as it sets up how you will get the nth substring. Notice that many rows have been eliminated from this query because of the following condition in the WHERE clause:

 substr(substr(name,pos),1,1) = ','

You'll notice that ,larry,curly, was ranked 4, but now is ranked 2. Remember, the WHERE clause is evaluated before the SELECT, so the rows with leading commas are kept, then ROW_NUMBER performs its ranking. At this point it's easy to see that, to get the nth substring you want rows where RN equals n. The last step is to keep only the rows you are interested in (in this case where RN equals 2) and use SUBSTR to extract the name from that row. The name to keep is the first name in the row: larry from ,larry,curly, and gina from ,gina,jaunita,regina,leena,.

MySQL

The inline view X walks each string. You can determine how many values are in each string by counting the delimiters in the string:

 
select iter.pos, src.name
from (select id pos from t10) iter,
V src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))


+------+--------------------------------+
| pos | name |
+------+--------------------------------+
| 1 | mo,larry,curly |
| 2 | mo,larry,curly |
| 1 | tina,gina,jaunita,regina,leena |
| 2 | tina,gina,jaunita,regina,leena |
| 3 | tina,gina,jaunita,regina,leena |
| 4 | tina,gina,jaunita,regina,leena |
+------+--------------------------------+

In this case, there is one fewer row than values in each string because that's all that is needed. The function SUBSTRING_INDEX takes care of parsing the needed values:

 
select iter.pos,src.name name1,
substring_index(src.name,',',iter.pos) name2,
substring_index(
substring_index(src.name,',',iter.pos),',',-1) name3
from (select id pos from t10) iter,
V src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))


+------+--------------------------------+--------------------------+---------+
| pos | name1 | name2 | name3 |
+------+--------------------------------+--------------------------+---------+
| 1 | mo,larry,curly | mo | mo |
| 2 | mo,larry,curly | mo,larry | larry |
| 1 | tina,gina,jaunita,regina,leena | tina | tina |
| 2 | tina,gina,jaunita,regina,leena | tina,gina | gina |
| 3 | tina,gina,jaunita,regina,leena | tina,gina,jaunita | jaunita |
| 4 | tina,gina,jaunita,regina,leena | tina,gina,jaunita,regina | regina |
+------+--------------------------------+--------------------------+---------+

I've shown three name fields, so you can see how the nested SUBSTRING_INDEX calls work. The inner call returns all characters to the left of the nth occurrence of a comma. The outer call returns everything to the right of the first comma it finds (starting from the end of the string). The final step is to keep the value for NAME3 where POS equals n, in this case 2.

Oracle

The inline view walks each string. The number of times each string is returned is determined by how many values are in each string. The solution finds the number of values in each string by counting the number of delimiters in it. Because each string is enclosed in commas, the number of values in a string is the number of commas minus one. The strings are then UNIONed and joined to a table with a cardinality that is at least the number of values in the largest string. The functions SUBSTR and INSTR use the value of POS to parse each string:

 
select iter.pos, src.name,
substr( src.name,
instr( src.name,',',1,iter.pos )+1,
instr( src.name,',',1,iter.pos+1 )
instr( src.name,',',1,iter.pos )-1) sub
from (select ','||name||',' as name from v) src,
(select rownum pos from emp) iter
where iter.pos <>


POS NAME SUB
--- --------------------------------- -------------
1 ,mo,larry,curly, mo
1 , tina,gina,jaunita,regina,leena, tina
2 ,mo,larry,curly, larry
2 , tina,gina,jaunita,regina,leena, gina
3 ,mo,larry,curly, curly
3 , tina,gina,jaunita,regina,leena, jaunita
4 , tina,gina,jaunita,regina,leena, regina
5 , tina,gina,jaunita,regina,leena, leena

The first call to INSTR within SUBSTR determines the start position of the substring to extract. The next call to INSTR within SUBSTR finds the position of the nth comma (same as the start position) as well the position of the nth + 1 comma. Subtracting the two values returns the length of the substring to extract. Because every value is parsed into its own row, simply specify WHERE POS = n to keep the nth substring (in this case, where POS = 2, so, the second substring in the list).

PostgreSQL

The inline view X walks each string. The number of rows returned is determined by how many values are in each string. To find the number of values in each string, find the number of delimiters in each string and add one. The function SPLIT_PART uses the values in POS to find the nth occurrence of the delimiter and parse the string into values:

 
select iter.pos, src.name as name1,
split_part(src.name,',',iter.pos) as name2
from (select id as pos from t10) iter,
(select cast(name as text) as name from v) src
where iter.pos <=
length(src.name)-length(replace(src.name,',',''))+1


pos | name1 | name2
-----+--------------------------------+---------
1 | mo,larry,curly | mo
2 | mo,larry,curly | larry
3 | mo,larry,curly | curly
1 | tina,gina,jaunita,regina,leena | tina
2 | tina,gina,jaunita,regina,leena | gina
3 | tina,gina,jaunita,regina,leena | jaunita
4 | tina,gina,jaunita,regina,leena | regina
5 | tina,gina,jaunita,regina,leena | leena

I've shown NAME twice so you can see how SPLIT_PART parses each string using POS. Once each string is parsed, the final step is the keep the rows where POS equals the nth substring you are interested in, in this case, 2.

** If you want the Full Table detail. Refer the SQL Table in Label List. Or Click here to View the Table

0 Comments: