Monday, September 9, 2019

Non fractional division (int/float)

For me, an RDBMS is a high-level tool, the very opposite of low-level languages like C, C++ or... Assembler. From the user/developer perspective it should exist outside the binary realm; you shouldn't be exposed to any binary considerations. Until you want to be, of course.
It's obvious that data has some binary representation, and it can be beneficial if the user, and especially the developer, is aware of it, but it shouldn't be a requirement.
This binary representation is a crucial thing to know for administrators (maybe not at the row level, but at the data file level). Database performance relies on it greatly. It's best if that representation is aligned with the hardware: things like endianness, word lengths or architecture (RISC/CISC) imply different internal implementations. We can imagine that if data files are in line with such factors, the RDBMS will perform better, as no conversions will be necessary. True, but it's the administrator's task to make it work like that (if the RDBMS allows for different data file formats). Unfortunately (from my point of view), SQL Server has always lived in Microsoft's x86-based world, so these things never, ever changed (apart from being expanded with new features).

It's very obvious that the SQL Server developers decided to expose the CPU's native datatypes as the server's datatypes. We can see types like int, bigint or float. Sure, it's probably quick and very effective, especially when it comes to storage, but it also introduces some pitfalls.

The most notable one is truncating (remainder-discarding) division when both operands are integers:
select 2/3 as result
result
-----------
0


To make sure you get what you want (unless this IS what you want), you have to cast at least one of the operands to a fractional type:
select 2.0/3 as result;
select cast(2 as float)/3 as result;
select cast(2 as numeric)/3 as result;
result
---------------------------------------
0.666666

(1 row affected)

result
----------------------
0,666666666666667

(1 row affected)

result
---------------------------------------
0.666666

(1 row affected)



The other thing worth mentioning is the way floating point data is represented in binary. It's all explained nicely in SQL Server's documentation, so I won't repeat it here. Suffice it to say the precision is highly influenced by the exponent, so let me quote the documentation exactly: "Floating point data is approximate; therefore, not all values in the data type range can be represented exactly".
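A one-liner illustrates the quote. Treat this as a sketch: the exact digits you see depend on your client's display settings, but the idea holds on any IEEE-754 platform.

```sql
-- 0.1 has no exact binary representation; the stored float is a nearby value
select cast(0.1 as float) * 3 as float_result,          -- approximately 0.30000000000000004
       cast(0.1 as numeric(10,4)) * 3 as exact_result;  -- exactly 0.3000
```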

Fortunately, SQL Server does support exact numeric datatypes, numeric for example, which I strongly encourage everyone to consider using, especially when working with financial data.
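As a minimal sketch of why I'd pick numeric for money: sum a cent-sized value many times and float drifts away from the exact total, while numeric stays exact (the variable names here are made up for illustration).

```sql
declare @balance_f float = 0, @balance_n numeric(19,4) = 0, @i int = 0;
while @i < 1000
begin
    set @balance_f += 0.01;  -- accumulates binary rounding error
    set @balance_n += 0.01;  -- exact decimal arithmetic at 4 decimal places
    set @i += 1;
end
-- @balance_n is exactly 10.0000; @balance_f typically is not exactly 10
select @balance_f as float_balance, @balance_n as numeric_balance;
```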

Just make sure you know what you're doing. It's generally a good idea to stay away from binary-level concerns unless that's exactly what you want to deal with. Using int/bigint for identifiers is all right, but beyond cases like that, I find these types very questionable.


Best

Friday, September 6, 2019

Inconsistent naming convention

The first thing you're faced with when you switch from Oracle to SQL Server is the object naming convention. There are whole books about how these databases work, and the full topic would cover a lot of areas such as administration, access control and whatnot, so here it is in broad strokes, without the details.
In the Oracle world you have database instances that don't know anything about each other. Inside such an instance you have schemas, and you generally work within their boundaries. Since a schema is basically a user, there's very limited interaction between them (some privileges are required). In MS SQL Server, however, multiple databases may exist at the same time, and schemas, though they exist, are merely logical separators.

In Oracle you're strictly limited to 1 and 2-part naming conventions:
[SCHEMA.]<OBJECT>
where [SCHEMA.] is optional; if omitted, CURRENT_SCHEMA is used (CURRENT_SCHEMA is a system variable that generally works like MS SQL's USE statement; by default CURRENT_SCHEMA is the logged-in user's name). That's it.
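As a sketch of how CURRENT_SCHEMA behaves (assuming the classic HR sample schema is installed):

```sql
alter session set current_schema = HR;
-- from now on, unqualified names resolve in HR, much like after USE in MS SQL
select count(*) from employees;   -- actually reads HR.employees
```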

In MS SQL Server, however, you can use one of: 1, 2, 3 and 4-part naming conventions:
[INSTANCE.][DATABASE.][SCHEMA.]<OBJECT>
It's pretty self explanatory. One note - in case of SQL Server default [SCHEMA] is called dbo. In case of 3 and 4-part naming conventions you can still omit the schema name when pointing to objects in dbo by using a so called double-dot convention (just remove dbo from the query).

Each of the following would lead to the same table:
use master
go
select * from SaleTotals
go
select * from dbo.SaleTotals
go
select * from master.dbo.SaleTotals
go
select * from master..SaleTotals
go
select * from dbserver.master.dbo.SaleTotals
go

It's not so different from Oracle. It certainly makes sense. No issues about it whatsoever from my side.

My issue is that SQL Server is very inconsistent when it comes to enforcing these conventions. This mainly relates to DDL operations and... yes, you guessed it: functions.

So, what did I expect? I expected that all those conventions could be used in all situations, but no. They apply only to the select statement, and not even in its entirety - minus function calls, of course ;)

How about DDL then?
create table DropMe1 ( x int )
go
-- Commands completed successfully.
create table dbo.DropMe2 ( x int )
go
-- Commands completed successfully.
create table master.dbo.DropMe3 ( x int )
go
-- Commands completed successfully.
create table dbserver.master.dbo.DropMe4 ( x int )
go
Msg 117, Level 15, State 1, Line 168
The object name 'dbserver.master.dbo.DropMe4' contains more than the maximum number of prefixes.
The maximum is 2.
Ok. Inconsistent? Yes, but that's fine. You can't create objects on another server directly anyway, so it's not a problem at all. Strange that they thought of a 4-part naming convention for DDL at all, but I'm going to let it pass and not bother about it ever again.

Let's try something else, shall we?
use master
go
create view DropMe_V1 as select 1 as x
go
-- Commands completed successfully.
create view dbo.DropMe_V2 as select 1 as x
go
-- Commands completed successfully.
create view master.dbo.DropMe_V3 as select 1 as x
go
Msg 166, Level 15, State 1, Line 177
'CREATE/ALTER VIEW' does not allow specifying the database name as a prefix to the object name.
What? You can't create a view using the 3-part naming convention? Duh... Didn't expect that (as a matter of fact - I did, I already knew it would happen ;D). It's the same for other DDL commands, including functions. We need one to demonstrate another example of such inconsistency.

Say we have the following:
create or alter function dbo.DropMeF () 
returns int as 
begin
    return 1
end
go

Let's try using it.
select master.dbo.DropMeF () as fresult;
fresult
-----------
1

Ok, now this:
select dbo.DropMeF () as fresult;
fresult
-----------
1

So far so good, now this:
select DropMeF () as fresult;
Msg 195, Level 15, State 10, Line 196
'DropMeF' is not a recognized built-in function name.


Oh... bummer. It appears to me as if Microsoft decided that the schema prefix is mandatory so as not to confuse UDFs with built-in functions. I suspect the MS developers have a strong affection for conditional if and case statements (that suspicion is a topic for another blog post). This is bollocks to me. I would rather see an "object already exists" error when accidentally trying to overwrite a system function instead of stumbling on something like this.

After this we already can expect something strange when using double-dot, can't we?
select master..DropMeF () as fresult;
Msg 102, Level 15, State 1, Line 199
Incorrect syntax near '.'.


Of course! We already assumed they enforce the schema name for function calls, so it's no different in this case. Still stupid.

Do you know any other (different) cases of such inconsistency in SQL Server?

Best

Wednesday, September 4, 2019

Function determinism

Just recently, while working on another post, on the multicolumn IN statement, I came across another random SQL Server stupidity. Maybe "stupidity" is an exaggeration, but it's a very irritating thing nevertheless. Every other day I find SQL Server's User Defined Functions to be more and more crippled and useless... I have tons of things to write about (and will do so), but since I just stumbled upon this one, I'll do it now.

Well... it seems UDFs in SQL Server must be deterministic by design. The server won't let you create any function that incorporates anything the server knows is non-deterministic, and, first and foremost, that includes system functions like rand() and newid().
create or alter function NonDeterministic () 
returns float
as
begin
 return rand()
end
go

Msg 443, Level 16, State 1, Procedure NonDeterministic, Line 6 [Batch Start Line 121]
Invalid use of a side-effecting operator 'rand' within a function.



Why is that even a thing? What is the reasoning behind it? Why does a function have to be deterministic in SQL Server? Well... since people have already found workarounds, and it's easy to write a function that will pass compilation and behave non-deterministically... what's the point of all of this?
create or alter view RandomView as
 select rand() as RandomValue
go

create or alter function NonDeterministic () 
returns float
as
begin
 declare @rv float
 select @rv = RandomValue from RandomView
 return @rv
end
go

select dbo.NonDeterministic() as r1, dbo.NonDeterministic() as r2, dbo.NonDeterministic() as r3
from (select 1 as id union all select 2 as id union all select 3 as id) severalrowsofdata;
r1                     r2                     r3
---------------------- ---------------------- ----------------------
0,911081997364809      0,778907737504499      0,824760706292227
0,358165531868303      0,610646082434233      0,353739725288632
0,678415101351906      0,679777567642382      0,893867037873934

(3 rows affected)


There are no such problems in Oracle:
create or replace function NonDeterministic 
return number is
begin
    return dbms_random.value(0,1);
end;
/

select NonDeterministic as r1, NonDeterministic as r2, NonDeterministic as r3
from (
    select 1 as id from dual union all 
    select 2 as id from dual union all 
    select 3 as id from dual
) severalrowsofdata;
        R1         R2         R3
---------- ---------- ----------
,235553683 ,684116267 ,294452895
,486510785 ,207421073 ,845284566
,488748819 ,387015281 ,316854148


Even more: in Oracle ALL functions are non-deterministic by definition! You can declare a function deterministic, but even that won't disallow anything in the function's body! It only hints to the database engine that the function will return the same output when given the same parameters, thus allowing it to cache results. You can obviously put yourself in trouble when not paying attention ;) Just take a look:
create or replace function NonDeterministic2
return number deterministic is
begin
    return dbms_random.value(0,1);
end;
/

select NonDeterministic2 as r1, NonDeterministic2 as r2, NonDeterministic2 as r3
from (
    select 1 as id from dual union all 
    select 2 as id from dual union all 
    select 3 as id from dual
) severalrowsofdata;
        R1         R2         R3
---------- ---------- ----------
,0430362939 ,0106835894 ,717701516
,0430362939 ,0106835894 ,717701516
,0430362939 ,0106835894 ,717701516


Not so random anymore, huh?

Hope it helps. I wish I could hear why Microsoft designed it this way.

Best


Monday, September 2, 2019

Multicolumn IN statement

Well... the multicolumn IN statement. What about it? Well... it doesn't exist in SQL Server...

What am I even talking about? What multicolumn IN statement? What the heck is it?

Well... you know what the IN statement is? What it does? No? It's used for filtering rows by a set of values, for example: give me finance data for Q1 and Q2 of each year. The list doesn't have to be a static one; it can be the result of a subquery, for example: give me finance data for the quarters listed in a configuration table.

Remember a table function we created in another post? We'll use it to create a sales generator to have some data to work on ;)



Now we have two tables: SalesDetails and SaleTotals that we can work on.

So, let's get all the details for months 1, 3 and 4 of each year:
select sale_year, sale_month, sale_amount
from SalesDetails 
where sale_month in (1, 3, 4);

Now, let's use subquery. Let's see all the details for years when sale amount exceeded 2500:
select sale_year, sale_month, sale_amount
from SalesDetails 
where sale_year in (
    select sale_year 
    from SaleTotals 
    group by sale_year 
    having sum(total_sale) > 2500
);

So far, so good.

But what if we wanted to see the details for months when the sale was above a certain threshold? For example:
select * from SaleTotals where total_sale > 500;

In Oracle, we could use multiple columns in the in statement, like so:
select sale_year, sale_month, sale_product, sale_amount
from SalesDetails 
where (sale_year, sale_month) in (
    select sale_year, sale_month from SaleTotals where total_sale > 500
) order by 1, 2;

No problem.

But in SQL Server we're getting:
Msg 4145, Level 15, State 1, Line 100
An expression of non-boolean type specified in a context where a condition is expected, near ','.


Creative again, huh?

There are better and worse solutions. One of the latter type would be to concatenate the key columns somehow. In this case it could work, especially if we did not just concatenate but combined the columns arithmetically. For year and month, easy arithmetic would do, for example:
select sale_year, sale_month, sale_product, sale_amount
from SalesDetails 
where sale_year * 100 + sale_month in (
    select sale_year * 100 + sale_month from SaleTotals where total_sale > 500
) order by 1, 2;
but in the case of more random data, especially text, and especially when the codepage is not known (usually for n-prefixed text columns), it probably isn't a good approach. The same applies to auto-increment IDs of sorts: they can get very large, so arithmetic may be out of the question, and plain concatenation may lead to ambiguity. In that case some separating characters would do the work, but... how about performance? How about the length limits of string variables?

Another common alternative is inner join:
select d.sale_year, d.sale_month, d.sale_product, d.sale_amount
from SalesDetails d
inner join
(
 select sale_year, sale_month from SaleTotals where total_sale > 500
) t
on
 d.sale_year = t.sale_year and d.sale_month = t.sale_month
;
But this approach has another issue you need to be aware of. What will happen when there's more than one matching key value in the subquery (two entries for the same month and year, in the case of this example)?
We will start getting duplicates in the output. You need to get rid of duplicates prior to the join, either by selecting distinct values or with some aggregation (or whatever else fits your case):
select d.sale_year, d.sale_month, d.sale_product, d.sale_amount
from SalesDetails d
inner join
(
 select distinct sale_year, sale_month from SaleTotals where total_sale > 500
) t
on
 d.sale_year = t.sale_year and d.sale_month = t.sale_month
;
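For completeness, a correlated EXISTS is, in my view, the cleanest substitute here: it's a true semi-join, so it never multiplies rows and needs no distinct (same tables as above):

```sql
select d.sale_year, d.sale_month, d.sale_product, d.sale_amount
from SalesDetails d
where exists (
    -- semi-join: each detail row is returned at most once,
    -- no matter how many SaleTotals rows match
    select 1
    from SaleTotals t
    where t.sale_year = d.sale_year
      and t.sale_month = d.sale_month
      and t.total_sale > 500
);
```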


Best

Friday, August 30, 2019

Duplicate columns in order by

Another hard-to-understand anti-feature of MS SQL Server presents itself in the order by clause. SQL Server actively disallows a single column from appearing multiple times in an order by.
Why, why, why?
Why can't it just ignore all but the first occurrence of a column?

select * 
from (
    select 2 as x union all
    select 4 as x union all
    select 1 as x union all
    select 3 as x
) dataset
order by x, x;
Msg 169, Level 15, State 1, Line 105
A column has been specified more than once in the order by list. Columns in the order by list must be unique.


Oracle does not care about it:
select * from (
    select 2 as x from dual union all
    select 4 as x from dual union all
    select 1 as x from dual union all
    select 3 as x from dual
) dataset
order by x, x;
         X
----------
         1
         2
         3
         4


Again, I expect to hear that it's not a problem to delete/comment out part of your code. It's not, but you very often move columns around when prototyping, especially temporarily moving a column from the back of the list to the beginning. Normally I'd copy the column name, put it in front, run, and delete it afterwards. With MS SQL you have to do more steps, and still, sooner or later, you miss one or two places and see the above-mentioned error. If you add this to the other MS SQL Server order by issues... eh...
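One workaround I've seen - an assumption on my part, as the duplicate check appears to be purely syntactic - is to wrap the repeated column in a no-op expression so the parser no longer sees a literal duplicate:

```sql
select * 
from (
    select 2 as x union all
    select 4 as x union all
    select 1 as x union all
    select 3 as x
) dataset
order by x, x + 0;  -- x + 0 is an expression, not the column x itself, so error 169 should not fire
```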

Best.

Wednesday, August 28, 2019

Table Functions

Hi,

Although there's a lot to hate about SQL Server not everything is bad there. In fact, there's quite a lot other database engines could implement... at least Oracle could. More on it later.
This time I'm more interested in a functionality that both SQL Server and Oracle already have in place but which, from an ease-to-implement point of view, SQL Server does better (minus the T-SQL syntax, of course ;D) - Table Functions (or Table-Valued User Defined Functions, as Microsoft decided to call them).

What is a Table Function? That's the question we have to ask in the first place. Well, a Table Function is a function that returns some kind of collection and can be used just like any table or view, as part of a typical select statement. From a developer's point of view it can be seen as a sort of parameterized view although, being a function, it allows arbitrary operations in its body: you can do whatever the RDBMS allows you to do in regular functions. A widely used example is parsing CLOB data, usually some kind of XML or JSON, resulting in multiple output rows. Generally, Table Functions are not among the most used tools in databases nowadays, but when you get to the point where you need them, you'll be relieved they're there.

SQL Server allows for two different Table Function syntaxes. Both are useful, but they serve slightly different purposes. The first, much easier to implement, is the Inline Table-Valued Function. In this case the function really serves as a parameterized view. The syntax allows only for a single select statement and goes like this (the syntax comes from the official documentation):

-- Transact-SQL Inline Table-Valued Function Syntax  
CREATE [ OR ALTER ] FUNCTION [ schema_name. ] function_name  
( [ { @parameter_name [ AS ] [ type_schema_name. ] parameter_data_type  
    [ = default ] [ READONLY ] }  
    [ ,...n ] 
  ] 

RETURNS TABLE 
    [ WITH <function_option> [ ,...n ] ] 
    [ AS ] 
    RETURN [ ( ] select_stmt [ ) ] 
[ ; ]


In practice you'll get whatever comes from the select, nothing more, nothing less. It's extremely easy to code, but it has quite limited use.
That's how it can look in practice:
create or alter function MyInlineFunction ( @mp int )
returns table as
return
(
 select p.p*@mp as p, q.q*@mp as q
 from (
  select 1 as p union all
  select 2 as p union all
  select 3 as p
 ) p
 cross join
 (
  select 1 as q union all
  select 2 as q union all
  select 3 as q
 ) q
)
go

select * from MyInlineFunction( 2 )
p           q
----------- -----------
2           2
2           4
2           6
4           2
4           4
4           6
6           2
6           4
6           6


We can already see that no matter what we do, we will always get 9 rows in the output data set. How can we overcome it? By using the other, Multi-Statement Table-Valued Function syntax, obviously.

Its syntax goes like this:
-- Transact-SQL Multi-Statement Table-Valued Function Syntax 
CREATE [ OR ALTER ] FUNCTION [ schema_name. ] function_name  
( [ { @parameter_name [ AS ] [ type_schema_name. ] parameter_data_type  
    [ = default ] [READONLY] }  
    [ ,...n ] 
  ] 

RETURNS @return_variable TABLE <table_type_definition> 
    [ WITH <function_option> [ ,...n ] ] 
    [ AS ] 
    BEGIN  
        function_body  
        RETURN 
    END 
[ ; ]


So... we could code such a function like this:
create or alter function MyMultiStatementFunction ( @pcount int, @qcount int, @mp int )
RETURNS @t TABLE ( 
 p int,
 q int 
)
AS
begin
 declare @p_iterator int = 1
 declare @q_iterator int = 1

 while @p_iterator <= @pcount
 begin
  while @q_iterator <= @qcount
  begin
   insert into @t (p, q) values ( @p_iterator*@mp, @q_iterator*@mp )
   set @q_iterator = @q_iterator + 1
  end
  set @q_iterator = 1
  set @p_iterator = @p_iterator + 1
 end

 return
end
go

select * from MyMultiStatementFunction( 3, 3, 2 )
This gives us exactly the same result as the previous one, but now we can play with the first two parameters to get a different number of rows.

The syntax still doesn't look bad. It's... surprisingly easy! You just define the output type as a table, insert rows inside the function and return it using a single return statement. It really can't get easier than this!

In the Oracle world, however, everything is of the latter type, and there are so many details a developer has to think of that it requires quite a lot of reading and education. First of all, you have an easy way of creating table functions (still harder and more cumbersome than MS SQL), but in this case... well... you need to know how it works internally. The easy version builds a collection inside the function, caches it, and returns the whole thing once the function is finished. This behaviour may require substantial amounts of memory and leads to situations where a query appears to hang, since no rows are produced during processing. The general cumbersomeness of PL/SQL table functions comes from the fact that, unlike MS SQL, which declares the return type in the function declaration, PL/SQL requires those types to exist in the database. Duh... although 12.1 allows them to exist in a package, so there's light...
How would the above function look in PL/SQL then?
create or replace type t_pq_r is object 
( 
    p number,
    q number 
)
/

create or replace type t_pq is table of t_pq_r;
/

create or replace function MyOracleTableFunction ( pcount number, qcount number, mp number ) 
return t_pq
is
    piterator number := 1;
    qiterator number := 1;
    rw t_pq_r;
    rv t_pq;
begin
    rv := t_pq();
    while piterator <= pcount loop
        while qiterator <= qcount loop
            rw := t_pq_r( piterator*mp, qiterator*mp );
            rv.extend( 1 );
            rv( rv.count ) := rw;
            qiterator := qiterator + 1;
        end loop;
        qiterator := 1;
        piterator := piterator + 1;
    end loop;
    return rv;
end;
/

select * from TABLE ( MyOracleTableFunction( 3, 3, 2 ) );

Although not overly complicated, it doesn't look nearly as nice as the SQL Server version. It is very similar to what you'd see in other languages (Java, C++, etc.) when building collections, though.

So, what can we do to make it more interactive in terms of getting rows? We could pipeline it.
create or replace function MyOracleTablePipelinedFunction ( pcount number, qcount number, mp number ) 
return t_pq pipelined
is
    piterator number := 1;
    qiterator number := 1;
    rw t_pq_r;
begin
    while piterator <= pcount loop
        while qiterator <= qcount loop
            rw := t_pq_r( piterator*mp, qiterator*mp );
            PIPE ROW(rw);
            qiterator := qiterator + 1;
        end loop;
        qiterator := 1;
        piterator := piterator + 1;
    end loop;
    return;
end;
/

select * from TABLE ( MyOracleTablePipelinedFunction( 3, 3, 2 ) );
Now we get results instantaneously. It's so much quicker all in all...

I did some research and my own testing and... initially I said SQL Server is better in this regard; now I'm drifting towards "no, it isn't".

I ran the Oracle jobs on a small db.t2.micro (1 CPU, 1 GB RAM) instance. SQL Server, on the other hand, ran locally on an 8 CPU, 32 GB RAM machine... So what were the execution times of each of those functions (the ones taking 3 parameters)?

And mind you, Oracle does not cache results of executions of Table Functions: "Multiple invocations of a table function, either within the same query or in separate queries result in multiple executions of the underlying implementation. That is, in general, there is no buffering or reuse of rows."

Here's what I got:

[Chart: elapsed time (logarithmic scale) vs. the value of the p/q parameters, for each function variant]

On the x axis is the value of the p/q parameters (both got the same value); on the y axis is the elapsed time (the scale is logarithmic). What does it tell us? That Oracle's pipelined execution is unbeatable here, and MS SQL is roughly comparable (in terms of performance) to the basic Oracle way of doing this. However, after a certain threshold (~1000 for p and q) the basic Oracle version becomes unusable, whereas MS SQL didn't experience any problems.

So, all in all, I dare say Oracle wins once again. It's easier to do in SQL Server, yes. Do I wish Oracle didn't need explicit type definitions? Yes, I do. But despite all of it, the performance difference is so favorable to Oracle that it's a no-brainer.

One final disclaimer: I haven't read all the books and whitepapers, didn't take the courses and who knows what else, so there's a chance there are some magic tricks that make table functions perform better in MS SQL Server. If so, don't hesitate to set me straight. I also know that these functions could have been written in a trillion other ways. If you wish to present better (in any respect) versions - be my guest!

Best

Monday, August 26, 2019

Packages (procedures/functions)

I was recently asked to compare T-SQL and PL/SQL packages. At least that's what I understood. Since T-SQL supports virtually nothing resembling PL/SQL packages, it may seem pointless. I disagree, however. I did some research and found a rather interesting (at least socially) discussion on StackOverflow. There are some valid points made there, but most of the voices against PL/SQL packages come from people who seem to be very uneducated developers. It's also a pretty old topic, so maybe Oracle was so much different back then? No, it wasn't ;)

So, instead of just finishing with "SQL Server doesn't support packages", which by itself is both true and something SQL Server can be hated for, I decided to elaborate a bit on PL/SQL packages and why they are so much better than yet another bollocks workaround T-SQL developers are forced to make.

So, what's the package? I'm going to directly quote gadsUK here:

"The Package has two elements: a header file and a body file. The header file is your public interface, and contains the signature (name, params and return type if applicable) of all the stored procedures or functions (in Oracle a function returns a value, a stored proc doesn't) that are directly callable. The package body must implement all the procedure signatures in the package header file.
The body element of the package contains all the stored procs and logic that actually do the work. You may have a Save procedure declared in the package header that calls an insert or update proc that exists in the body. The developer can only see the "Save" proc. It's important to keep in mind that the package body can also implement procs or functions not declared in the package header, they're just not accessible outside of the package itself."

So, unlike what some seem to be suggesting, a package is not only used for grouping code. It's more of a single entity, a class. Remember, you can keep package variables and constants there as well, so it's possible to share state across calls within a session or use a package as a Singleton if need be. It's all well documented. It's not impossible to mimic such behaviour in SQL Server, but you would have to code it manually, probably using temporary tables.
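As a sketch of the "code it manually" route: SQL Server 2016+ offers session context, which can stand in for a package-level session variable (the key name here is made up for illustration):

```sql
-- set once per session; readable from any procedure running in that session
exec sp_set_session_context @key = N'tax_rate', @value = 0.23;

-- later, anywhere in the same session:
select cast(session_context(N'tax_rate') as numeric(5,2)) as tax_rate;
```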

Some say, PL/SQL packages have this huge installation drawback. Let me quote gunnar here:

"The best argument against oracle packages is that based on experience and research on the Ask Tom site, one can't update a package without taking it off line. This is unacceptable. With SQL Server, we can update stored procedures on the fly, without interrupting production operation."

This is absolute bollocks, as it was in 2012 and 2014 (when those comments were made). I don't know if Gunnar never used Oracle or was still using some pre-10 version. I suppose the issue he is referring to is the inability to compile packages in certain situations. Those situations change with the Oracle version, and... the same happens for regular procedures and functions. Hell... for most database objects, really! What they're talking about is the fact that you can't compile a package while it's being used - that is, if there are any sessions running any procedures/functions from the package. The same applies to regular procedures and functions as well. And you can't run any DDL on currently used objects either. That alone means this opinion doesn't make any sense.

How did it change with Oracle versions? Well, in 10g you couldn't compile a package if it was referenced by anything being executed at the time of compilation. In 11g it's impossible only if the particular package itself is active. I think in 12-something or some later version it'll be possible to compile a package at any time (obviously only new sessions/executions will see the new package), so the issue is disappearing anyway.

Also shout-out to belgariontheking for most pointless and most uneducated opinion on the topic ever. You should switch to some other job, bro. You're clearly not fit for the Oracle. Maybe find SQL Server position somewhere :P

 Best


