Monday, September 9, 2019

Non fractional division (int/float)

For me an RDBMS is a high-level tool, the very opposite of working in low-level languages like C, C++ or... Assembler. From the user's or developer's perspective it should exist outside the binary realm. You shouldn't be exposed to any binary considerations. Until you want to be, of course.
Obviously, data has some binary representation, and it may be beneficial if the user, and the developer especially, is aware of it, but it shouldn't be a requirement.
This binary representation is a crucial thing to know for administrators (maybe not at the row level, but at the data file level). Database performance relies on it greatly. It's best if such representation is aligned with the hardware. Things like endianness, word lengths or architecture (RISC/CISC) imply different internal implementations. We can imagine that if data files are in line with such factors, the RDBMS will perform better, as no conversions will be necessary. True, but it's the administrator's task to make it work like that (if the RDBMS allows for different data file formats). Unfortunately (from my point of view), for SQL Server users it has always existed in Microsoft's x86-based world, and as such these things never ever changed (apart from being expanded with new features).

It's very obvious that SQL Server's developers decided to map the CPU's native datatypes directly onto the server's datatypes. We can see types like int, bigint or float. It's probably quick and very effective, especially when it comes to storage, but it also introduces some pitfalls.

The most notable one is truncating division: when both operands are integers, the fractional part of the result is simply discarded:
select 2/3 as result
result
-----------
0
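The same truncation happens in most languages when both operands are integers. A quick Python sketch of the analogous behaviour (Python conveniently separates the two operators, which makes the difference easy to see):

```python
q = 2 // 3   # integer (floor) division, like T-SQL's int/int: 0
f = 2 / 3    # true division always yields a float: 0.666...
print(q, f)
```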


To make sure you get what you want (unless it IS what you want) you have to cast at least one of the operands to a fractional type:
select 2.0/3 as result;
select cast(2 as float)/3 as result;
select cast(2 as numeric)/3 as result;
result
---------------------------------------
0.666666

(1 row affected)

result
----------------------
0,666666666666667

(1 row affected)

result
---------------------------------------
0.666666

(1 row affected)



The other thing worth mentioning is the way floating point data is represented in binary. It's all explained nicely in the SQL Server documentation, so I won't repeat it here. Suffice to say the precision is highly influenced by the exponent, so let me quote the documentation exactly: "Floating point data is approximate; therefore, not all values in the data type range can be represented exactly".
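That quote is easy to see in action outside the database too. Python's built-in float is the same IEEE 754 double format as SQL Server's float, so the classic example carries over:

```python
# 0.1 and 0.2 have no exact binary representation, so their sum drifts
print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False
```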

Fortunately SQL Server does support exact numeric datatypes, like numeric (a.k.a. decimal), which I strongly encourage anyone to consider using, especially when working with financial data.
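The exact-vs-approximate difference can be demonstrated outside the database as well. Python's decimal module behaves like SQL's numeric (exact decimal arithmetic), while its float is a binary double; a minimal sketch:

```python
from decimal import Decimal

# float: ten payments of 0.1 don't sum to exactly 1.0
print(sum([0.1] * 10))             # 0.9999999999999999
# exact decimal arithmetic: no representation error
print(sum([Decimal("0.1")] * 10))  # 1.0
```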

Just make sure you know what you're doing; it's generally a good idea to stay away from the binary-backed types unless that's exactly what you need. It's all right to use int/bigint for identifiers, but beyond that their use is very questionable to me.


Best

Friday, September 6, 2019

Inconsistent naming convention

The first thing you're faced with when you switch from Oracle to SQL Server is the object naming convention. There are whole books about how these databases work, and the full topic would cover a lot of areas such as administration, access control and what not, so I'll stick to the big points, no details.
In the Oracle world you have database instances that don't know anything about each other. Inside such an instance you have schemas, and you generally work within their boundaries. Since a schema is basically a user, there's very limited interaction between them (some privileges are required). In MS SQL Server, however, multiple databases may exist at the same time, and schemas, though they exist, are merely logical separators.

In Oracle you're strictly limited to 1 and 2-part naming conventions:
[SCHEMA.]<OBJECT>
where [SCHEMA.] is optional; if omitted, CURRENT_SCHEMA is used (CURRENT_SCHEMA is a system variable and generally works like MS SQL's USE statement; by default CURRENT_SCHEMA is the logged-in user's name). That's it.

In MS SQL Server, however, you can use one of: 1, 2, 3 and 4-part naming conventions:
[INSTANCE.][DATABASE.][SCHEMA.]<OBJECT>
It's pretty self-explanatory. One note: in SQL Server the default [SCHEMA] is called dbo. With the 3 and 4-part naming conventions you can still omit the schema name when pointing at objects in dbo by using the so-called double-dot convention (just remove dbo from the query).

Each of the following would lead to the same table:
use master
go
select * from SaleTotals
go
select * from dbo.SaleTotals
go
select * from master.dbo.SaleTotals
go
select * from master..SaleTotals
go
select * from dbserver.master.dbo.SaleTotals
go

It's not so different from Oracle. It certainly makes sense. No issues about it whatsoever from my side.

My issue is that SQL Server is very inconsistent when it comes to enforcing these conventions. This mainly relates to DDL operations and... yes, you guessed it: functions.

So, what did I expect? I expected all those conventions to be usable in all situations, but no. They apply only to the select statement, and not even in its entirety - minus function calls, of course ;)

How about DDL then?
create table DropMe1 ( x int )
go
-- Commands completed successfully.
create table dbo.DropMe2 ( x int )
go
-- Commands completed successfully.
create table master.dbo.DropMe3 ( x int )
go
-- Commands completed successfully.
create table dbserver.master.dbo.DropMe4 ( x int )
go
Msg 117, Level 15, State 1, Line 168
The object name 'dbserver.master.dbo.DropMe4' contains more than the maximum number of prefixes.
The maximum is 2.
Ok. Inconsistent? Yes, but that's fine. You can't create objects on another server directly anyway, so it's not a problem at all. Strange they thought of the 4-part naming convention here at all, but I'm going to let it pass and not bother about it ever again.

Let's try something else, shall we?
use master
go
create view DropMe_V1 as select 1 as x
go
-- Commands completed successfully.
create view dbo.DropMe_V2 as select 1 as x
go
-- Commands completed successfully.
create view master.dbo.DropMe_V3 as select 1 as x
go
Msg 166, Level 15, State 1, Line 177
'CREATE/ALTER VIEW' does not allow specifying the database name as a prefix to the object name.
What? You can't create a view using the 3-part naming convention? Duh... Didn't expect that (as a matter of fact - I did, I already knew it would happen ;D). It's the same for many other DDL commands, including functions. We need one to demonstrate another example of such inconsistency.

Say we have the following:
create or alter function dbo.DropMeF () 
returns int as 
begin
    return 1
end
go

Let's try using it.
select master.dbo.DropMeF () as fresult;
fresult
-----------
1

Ok, now this:
select dbo.DropMeF () as fresult;
fresult
-----------
1

So far so good, now this:
select DropMeF () as fresult;
Msg 195, Level 15, State 10, Line 196
'DropMeF' is not a recognized built-in function name.


Oh... bummer. It appears Microsoft decided that the schema prefix is mandatory so as not to confuse UDFs with built-in functions. I suspect the MS developers have a strong affection for conditional if and case statements (that suspicion is a topic for another blog post). This is bollocks to me. I would rather see some form of object already exists error when accidentally trying to overwrite a system function, instead of stumbling on something like this.

After this we can already expect something strange when using the double-dot convention, can't we?
select master..DropMeF () as fresult;
Msg 102, Level 15, State 1, Line 199
Incorrect syntax near '.'.


Of course! We already established that they enforce the schema name in function calls, so it's no different in this case. Still stupid.

Do you know any other (different) cases of such inconsistency in SQL Server?

Best

Wednesday, September 4, 2019

Function determinism

Just recently, while working on another post on the multicolumn IN statement, I came across another random SQL Server stupidity. Maybe that's an exaggeration, but it's a very irritating thing nevertheless. Every other day I find SQL Server's User Defined Functions to be more and more crippled and useless... I have tons of things to write about (and will), but since I just stumbled upon this one, I'll do it now.

Well... it seems UDFs in SQL Server must be deterministic by design. The server won't let you create any function that incorporates anything it knows to be non-deterministic; first and foremost, that includes system functions like rand() and newid().
create or alter function NonDeterministic () 
returns float
as
begin
 return rand()
end
go

Msg 443, Level 16, State 1, Procedure NonDeterministic, Line 6 [Batch Start Line 121]
Invalid use of a side-effecting operator 'rand' within a function.



Why is that even a thing? What is the reasoning behind it? Why does a function have to be deterministic in SQL Server? Well... since people have already found workarounds, and it's easy to write a function that will compile and still behave non-deterministically... what's the point of all of this?
create or alter view RandomView as
 select rand() as RandomValue
go

create or alter function NonDeterministic () 
returns float
as
begin
 declare @rv float
 select @rv = RandomValue from RandomView
 return @rv
end
go

select dbo.NonDeterministic() as r1, dbo.NonDeterministic() as r2, dbo.NonDeterministic() as r3
from (select 1 as id union all select 2 as id union all select 3 as id) severalrowsofdata;
r1                     r2                     r3
---------------------- ---------------------- ----------------------
0,911081997364809      0,778907737504499      0,824760706292227
0,358165531868303      0,610646082434233      0,353739725288632
0,678415101351906      0,679777567642382      0,893867037873934

(3 rows affected)


There are no such problems in Oracle:
create or replace function NonDeterministic 
return number is
begin
    return dbms_random.value(0,1);
end;
/

select NonDeterministic as r1, NonDeterministic as r2, NonDeterministic as r3
from (
    select 1 as id from dual union all 
    select 2 as id from dual union all 
    select 3 as id from dual
) severalrowsofdata;
        R1         R2         R3
---------- ---------- ----------
,235553683 ,684116267 ,294452895
,486510785 ,207421073 ,845284566
,488748819 ,387015281 ,316854148


There's more. In Oracle ALL functions are non-deterministic by default! You can declare a function deterministic, but even that won't disallow anything in the function's body! It only hints to the database engine that the function will return the same output when given the same parameters, thus allowing it to cache the results. You can obviously get yourself in trouble when not paying attention ;) Just take a look:
create or replace function NonDeterministic2
return number deterministic is
begin
    return dbms_random.value(0,1);
end;
/

select NonDeterministic2 as r1, NonDeterministic2 as r2, NonDeterministic2 as r3
from (
    select 1 as id from dual union all 
    select 2 as id from dual union all 
    select 3 as id from dual
) severalrowsofdata;
        R1         R2         R3
---------- ---------- ----------
,0430362939 ,0106835894 ,717701516
,0430362939 ,0106835894 ,717701516
,0430362939 ,0106835894 ,717701516


Not so random anymore, huh?
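The caching that Oracle's DETERMINISTIC hint enables can be mimicked outside the database too. A minimal Python sketch using functools.lru_cache shows the same foot-gun: the runtime trusts the "cacheable" declaration and reuses the first result (the function name here is made up for illustration):

```python
import functools
import random

@functools.lru_cache(maxsize=None)
def not_so_random():
    # Declared cacheable, but the body is non-deterministic -
    # every call after the first returns the cached value.
    return random.random()

print(not_so_random() == not_so_random())  # True: "not so random anymore"
```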

Hope it helps. I'd love to hear why Microsoft designed it this way.

Best


Monday, September 2, 2019

Multicolumn IN statement

Well... the multicolumn IN statement. What about it? Well... it doesn't exist in SQL Server...

What am I even talking about? What multicolumn IN statement? What the heck is it?

Well... you know what the IN statement is? What it does? No? It's used for filtering rows by a set of values, for example: give me finance data for Q1 and Q2 of each year. The list doesn't have to be static; it can be the result of a subquery, for example: give me finance data for the quarters listed in a configuration table.

Remember the table function we created in another post? We'll use it to create a sales generator so we have some data to work on ;)



Now we have two tables: SalesDetails and SaleTotals that we can work on.

So, let's get all the details for months 1, 3 and 4 of each year:
select sale_year, sale_month, sale_amount
from SalesDetails 
where sale_month in (1, 3, 4);

Now, let's use a subquery. Let's see all the details for years when the sale amount exceeded 2500:
select sale_year, sale_month, sale_amount
from SalesDetails 
where sale_year in (
    select sale_year 
    from SaleTotals 
    group by sale_year 
    having sum(total_sale) > 2500
);

So far, so good.

But what if we wanted to see the details only for the months when sales were above a certain threshold? For example:
select * from SaleTotals where total_sale > 500;

In Oracle, we could use multiple columns in the IN statement, like so:
select sale_year, sale_month, sale_product, sale_amount
from SalesDetails 
where (sale_year, sale_month) in (
    select sale_year, sale_month from SaleTotals where total_sale > 500
) order by 1, 2;

No problem.
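The semantics Oracle implements here are plain tuple membership, which is easy to sketch in Python (the rows and column positions below are made-up stand-ins for SalesDetails and the SaleTotals subquery):

```python
details = [
    (2018, 1, "widget", 300.0),
    (2019, 2, "gadget", 700.0),
    (2019, 3, "gizmo",  100.0),
]
# (year, month) pairs that would come from the SaleTotals subquery
big_months = {(2019, 2)}

# keep a detail row only if its (year, month) pair is in the set
filtered = [row for row in details if (row[0], row[1]) in big_months]
print(filtered)
```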

But in SQL Server we're getting:
Msg 4145, Level 15, State 1, Line 100
An expression of non-boolean type specified in a context where a condition is expected, near ','.


Creative again, huh?

There are better and worse solutions. One of the worse ones would be to concatenate the key columns somehow. In this case it could work, especially if we didn't just concatenate but combined the columns arithmetically. For a year and a month, simple arithmetic would do, for example:
select sale_year, sale_month, sale_product, sale_amount
from SalesDetails 
where sale_year * 100 + sale_month in (
    select sale_year * 100 + sale_month from SaleTotals where total_sale > 500
) order by 1, 2;
but in the case of more random data, especially text, and especially when the codepage is unknown (usually for n-prefixed text columns), it probably isn't a good approach. The same applies to autoincrement-style IDs. They can get very large, so arithmetic may be out of the question, and plain concatenation may lead to ambiguity. Separator characters would solve that, but... what about performance? What about the length limits of string variables?
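The ambiguity is easy to demonstrate: without a separator, different key pairs can collapse into the same concatenated key, producing false matches:

```python
# (1, 23) and (12, 3) concatenate to the same string - a false match
key_a = str(1) + str(23)
key_b = str(12) + str(3)
print(key_a == key_b)  # True: both are "123"

# a separator character restores uniqueness
print(str(1) + "|" + str(23) == str(12) + "|" + str(3))  # False
```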

Another common alternative is an inner join:
select d.sale_year, d.sale_month, d.sale_product, d.sale_amount
from SalesDetails d
inner join
(
 select sale_year, sale_month from SaleTotals where total_sale > 500
) t
on
 d.sale_year = t.sale_year and d.sale_month = t.sale_month
;
But this approach has an issue you need to be aware of. What happens when there's more than one row per key value in the subquery (two entries for the same month and year, in this example)?
We start getting duplicates in the output. You need to get rid of the duplicates before the join, either with distinct values or some aggregation (or whatever else fits your case):
select d.sale_year, d.sale_month, d.sale_product, d.sale_amount
from SalesDetails d
inner join
(
 select distinct sale_year, sale_month from SaleTotals where total_sale > 500
) t
on
 d.sale_year = t.sale_year and d.sale_month = t.sale_month
;
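The duplication effect can be sketched in a few lines of Python with made-up rows: when two totals rows share one key, a plain inner join emits the matching detail row twice, while deduplicating the keys first (the SELECT DISTINCT equivalent) does not:

```python
details = [(2019, 1, "widget", 100.0)]
totals  = [(2019, 1, 600.0), (2019, 1, 800.0)]  # same (year, month) twice

# plain inner join: the one detail row matches both totals rows
joined = [d for d in details for t in totals if (d[0], d[1]) == (t[0], t[1])]
print(len(joined))    # 2 - duplicated

# deduplicate the join keys first, like SELECT DISTINCT
keys = {(t[0], t[1]) for t in totals}
filtered = [d for d in details if (d[0], d[1]) in keys]
print(len(filtered))  # 1
```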


Best
