Monday, December 19, 2011

Storing Unicode string into VARBINARY (MAX)

I came across an interesting issue while working on a client project a couple of days ago. I thought it might be a good idea to share this information.
We had to populate a “data” field which is of type “varbinary(max)”, the content of which is picked up by sharepoint to index.
We had two types of items

  1. Items which have files (these files were directly written to the ‘data’ field in their binary form)
  2. Items which didn’t have any files (the ‘data’ field needed to be populated with the title and description of item)
There were no issues with files, but when ‘data’ field was populated with title and description, nothing from that field was getting indexed.
Both the title and description fields where of type ‘nvarchar(max)’.

Let’s look at what happens when you write data of ‘varchar(max)’ and ‘nvarchar(max)’ into the ‘varbinary(max)’ field.

--DECLARING FIELDS TO STORE DATA IN DIFFERENT DATA TYPES
DECLARE @MyUnicodeName NVARCHAR(MAX)
DECLARE @MyName VARCHAR(MAX)
DECLARE @BinaryFieldForUnicode VARBINARY(MAX)
DECLARE @BinaryField VARBINARY(MAX)

--POPULATE NAME VARIABLES
SELECT @MyUnicodeName = N'KHAJA'
SELECT @MyName = 'KHAJA'

--POPULATE BINARY VARIABLES
SELECT @BinaryField = CAST(@MyName AS VARBINARY(MAX))
, @BinaryFieldForUnicode = CAST(@MyUnicodeName AS VARBINARY(MAX))

--DISPLAY BINARY VARIABLES
SELECT @BinaryField "Binary Field"
SELECT @BinaryFieldForUnicode "Binary Unicode Field"


Upon running the above code the following result is returned :


It is evident that it has extra ‘00’ for the unicode data.
If you were to read these fields and write it to a text file (using a program),


Data in txt file
Binary Field KHAJA
Binary Unicode Field K H A J A




Here you can see that the txt file for Binary Unicode field has extra spaces.
Problems/Issues
  1. When you write the content to a txt file, a space is appended to each character written. Which makes it not really useful for anything(in addition to the extra storage required to save the data)
  2. Full text search fails on the data stored in the varbinary(max) column.

Fix
the fix is simple, just do the following when storing the Unicode string into varbinary(max)
SELECT @BinaryFieldForUnicode = 0xFFFE + CAST(@MyUnicodeName AS VARBINARY(MAX))

Adding the byte order mark 0xFFFE at the start will fix both the issues. If you write data from varbinary(max) field to txt file it will write the proper text without extra space characters and the full text search will pick up proper data to index.

Hope this article was helpful. If you have any questions or comments I will be happy to answer.

5 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Good stuff Khaja. Just a couple of questions.

    Do you know why the Unicode representation adds the trailing 0's? Just from the look of it, it looks like it's trying to fill up bits to represent a binary format but never actually converts the Hex representation of the Unicode value to binary.

    Secondly I was curious to know why 0xFFFE fixes it and what it actually represents. I tried to google it, but all I got was errors about it and couldn't be bothered working out what it meant.

    Mind you it's 5 am in the morning so I'm probably not thinking too clearly.

    ReplyDelete
  3. Hi Daryl,
    sorry about the late reply...
    0xFFFE is the byte order mark which specifies that the string that follows is a unicode string.
    when the binary Reader/Writer reads this at the start, it recognizes it as a unicode string hence it knows that those two trailing '00' are part of that single character.

    ReplyDelete
  4. answer to your first question is the extra byte that the unicode character uses for storage. when that extra byte is 00 then its the same character as ASCII

    ReplyDelete