Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query: Improved performance for String.StartsWith() filters #7429

Closed
jemiller0 opened this issue Jan 17, 2017 · 39 comments
Closed

Query: Improved performance for String.StartsWith() filters #7429

jemiller0 opened this issue Jan 17, 2017 · 39 comments
Assignees
Labels
area-perf closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Milestone

Comments

@jemiller0
Copy link

Describe what is not working as expected.

I have a LINQ query that is using a .StartsWith() search on a string column. I found that it is being translated into the following SQL WHERE clause.

WHERE [b].[updated_by] LIKE N'jemiller%' + N'%' AND (CHARINDEX(N'jemiller%', [b].[updated_by]) = 1)

The full SQL statement is.

SELECT [b].[bib_id], [b].[content], [b].[date_created], [b].[created_by], [b].[fast_add], [b].[former_id], [b].[unique_id_prefix], [b].[date_updated], [b].[updated_by], [b].[staff_only], [b].[status], [b].[status_updated_date], [b].[status_updated_by], [b.BibExt].[id], [b.BibExt].[author], [b.BibExt].[creation_time], [b.BibExt].[creation_user_name], [b.BibExt].[last_write_time], [b.BibExt].[last_write_user_name], [b.BibExt].[level], [b.BibExt].[oclc_number], [b.BibExt].[publication_date], [b.BibExt].[subscription_status], [b.BibExt].[title], [b.BibExt].[type]
FROM [ole_ds_bib_t] AS [b]
LEFT JOIN [uc_bib_ext] AS [b.BibExt] ON [b].[bib_id] = [b.BibExt].[id]
WHERE [b].[updated_by] LIKE N'jemiller' + N'%' AND (CHARINDEX(N'jemiller', [b].[updated_by]) = 1)
ORDER BY [b].[date_updated] DESC
OFFSET 0 ROWS FETCH NEXT 100 ROWS ONLY

The problem that I'm having is that the query is taking 11 seconds to complete. I'm using SQL Server 2016. The primary table that is being queried has 7.7 million rows and the column that is being filtered on has an index. If I comment out the "AND (CHARINDEX(N'jemiller', [b].[updated_by]) = 1)" part of the WHERE clause so that it's only doing a LIKE, then, the query only takes 1 second. Although, it looked like it took longer the first time through. I think the extra call to CHARINDEX() may be throwing the query optimizer in SQL Server off. If I view the execution plan in SQL Management Studio that it has 9 different steps when CHARINDEX() is included an only 7 without it. I can send the execution plan XML if that would help.

Issue #474 talks about why the CHARINDEX() check is included. Would it be possible to only include that check if the input string includes a LIKE special character such as %, _, [, or ]?

Or maybe have it use the LIKE ESCAPE clause? It would be nice if it only did it if the special characters were present, but, someone mentioned something about it causing a problem caching. CHARINDEX() looks like a good way to handle it, but, it looks like it could cause performance problems, at least for some queries. In general, it seems like it would be good to keep the expression as simple as possible and only make it more complicated when necessary.

Failing that, would it be possible to make this behavior configurable? This is pretty much a blocker for my app because performance is just too slow.

Further technical details

EF Core version: 1.2.0-preview1-23129
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Anniversary Update
IDE: Visual Studio 2015

@popcatalin81
Copy link

popcatalin81 commented Jan 18, 2017

@jemiller0 what type is your updated_by column, VARCHAR or NVARCHAR?

In case it's varchar then the piece CHARINDEX(N'jemiller', [b].[updated_by], may force an conversion of the column to nvarchar row by row bypassing the index, while LIKE alone may not force a conversion and allow the index to be used.

@jemiller0
Copy link
Author

The column is a NVARCHAR(40) NULL. I generated the database using EF Core and it's a string property. Hopefully, EF Core wouldn't include the N'' strings if it was a VARCHAR.

I have the same database in PostgreSQL and MySQL. MySQL (Pomelo provider) does something similar and generates the following SQL.

WHERE b.updated_byLIKE CONCAT('jemiller','%') AND (POSITION('jemiller' INb.updated_by) = 1)

The query takes around 1 to 2 seconds with MySQL. Commenting out the call to POSITION() appears to not have a noticeable effect.

PostgreSQL generates the following SQL.

WHERE "b"."updated_by" LIKE ('jemiller' || '%')

I'm assuming || is a concatenation operator in PostgreSQL. PostgreSQL doesn't address the issue.

One thing I've been wondering, why isn't EF Core using db parameters? I'm pretty sure other ORMs like NHibernate do. I'm surprised that it doesn't. Wouldn't not using parameters slow things down because the database server would have to keep recompiling the SQL on the backend?

In the case of SQL Server, I noticed that the query takes about 15 seconds the first time through. Then, it takes 1 second if the CHARINDEX() call is removed. Otherwise, it consistently takes 11 seconds with it.

@jemiller0
Copy link
Author

One other thing I noticed is that if I comment out the JOIN to the other table, then, the query takes around 0.7 second with the call to CHARINDEX() and ironically, it takes 1.7 seconds without it. I ran it multiple times and the results were the same.

So, in the simple case, you may not see the issue. It probably only happens when you are joining in other tables.

@popcatalin81
Copy link

@jemiller0 if removing the join gives a difference of 0.7 and 1.7 with CHARINDEX respectively, then this means the carnality estimator gets confused, it doesn't think the index on [updated_by] is more selective than presumably (if it exists) the index on [bib_id]. I would suggest to help the query planner to work better by Updating and/or Creating a new set o statics on the columns that need to be considered for index usage.

I would try the following, and if one step does not solve the problem moving to the next:

  1. Update statistics UPDATE STATISTICS
  2. Ensure Indexes are not fragmented (Rebuild or Defragment) Link
  3. Create new statistics on [updated_by] with FULLSCAN if possible, or as as big sample as possible
  4. Create new statistics on [bib_id] with FULLSCAN if possible, or as as big sample as possible

If neither of the above work, then the query is simply NOT friendly to the optimizer and the only way to fix this would be to rewrite the query using some alternatives, besides forcing an execution plan.

@jemiller0
Copy link
Author

Thanks. I will give that a try.

I still think that EF Core shouldn't add that check unless the input contains a special character though. I think it would result in less problems out of the box.

@divega
Copy link
Contributor

divega commented Jan 18, 2017

@jemiller0 @popcatalin81 I was also going to suggest trying UPDATE STATISTICs hoping that would help. If it doesn't help we can consider other options.

Sniffing into the argument for StartsWith() to check if it contains wildcards is a possibility, but it is potentially provider specific and would only work if the argument is coming from the client.

Also, this is a scenario in which having actual support for LIKE in LINQ (#6159) would help because LIKE by definition is free to treat any wildcards as such.

@maumar
Copy link
Contributor

maumar commented Jan 18, 2017

Another thing worth considering is caching. Currently we only depend on parameter nullability determine the resulting SQL (with a given query plan we will always generate the same SQL, regardless of the parameter value, unless the value is null). If we were to add the extra check only when needed, we would have 3 possible states: null, non-null-with-escaping, non-null-without-escaping. For all cases apart from scenarios around 'Like', states 2 and 3 would produce the same sql, so they are effectively unnecessary cache entries.

@rowanmiller
Copy link
Contributor

@jemiller0 did UPDATE STATISTICS make a difference?

@jemiller0
Copy link
Author

I tried UPDATE STATISTICS and it didn't make a difference. I haven't had a chance to test items 2, 3, and 4 listed above. I've been meaning to do that, but, got tied up with unrelated projects.

@rowanmiller
Copy link
Contributor

@maumar can you do a time boxed effort to try and reproduce this

@jemiller0
Copy link
Author

I'm going to try the rest of the UPDATE STATISTICS related commands shortly.

I tried testing the query using the ESCAPE clause just for the heck of it. I found that performance when doing it that way is by far the worst. I ran it twice and both times it took 1 minute 17 seconds. Seems like it must be causing it to do a table scan or something.

WHERE [b].[updated_by] LIKE N'jemiller' + N'%' ESCAPE '!'

@jemiller0
Copy link
Author

One other thing I was thinking of. Couldn't you just apply the StartsWith() filter in C# as the rows are being streamed into RAM (in addition to just doing a LIKE by itself)? I know it wouldn't be ideal. Just a thought.

@roji
Copy link
Member

roji commented Feb 6, 2017

@jemiller0, reading your comment above, can you please elaborate on "PostgreSQL doesn't address the issue" (I'm the Npgsql maintainer)? I'm asking because I'm currently working on fixing #474 for PostgreSQL as well.

@roji
Copy link
Member

roji commented Feb 6, 2017

Sorry I'm guessing you mean there's no double-check (i.e. LIKE and a CHARINDEX-equivalent)

@jemiller0
Copy link
Author

Yeah, I think it doesn't use CHARINDEX() if I remember correctly.

@roji
Copy link
Member

roji commented Feb 6, 2017

I'll be introducing that soon.

@jemiller0
Copy link
Author

Here's what I'm seeing. I attached screenshots for what the execution plan looks like for LIKE by itself, LIKE with ESCAPE, and LIKE with CHARINDEX() for SQL Server. LIKE by itself usually takes around 2 seconds, although it might be something like 10 seconds the first time through. LIKE with CHARINDEX() is usually around 11 seconds. LIKE with ESCAPE is the worst and usually takes around 1 minute 15 seconds or greater.

like
likewithcharindex
likewithescape

@ErikEJ
Copy link
Contributor

ErikEJ commented Feb 7, 2017

Remove the ORDER BY...

@jemiller0
Copy link
Author

An ORDER BY of some kind is needed since I'm using paging and it's using OFFSET FETCH. I did notice that if I change the column that is being ordered by to the same column that is being filtered on, that the query returns immediately for all 3 types of filtering. It makes sense that it would have trouble filtering on one column and sorting on another, but, somehow it works fine when you only use LIKE. When I ran the queries in the images above, I forgot and ran the queries that the current release version of EF Core generate. They have an extra column in the ORDER BY. That bug was fixed separately. Removing the extra column ([b].[bib_id]) doesn't appear to have much of an effect.

@jemiller0
Copy link
Author

@roji mentioned using LEFT(LEN()) (similar to what is done for .EndsWith()) instead of CHARINDEX() in #474. I tested this as in

WHERE [b].[updated_by] LIKE N'jemiller' + N'%' AND N'jemiller' = LEFT([b].[updated_by], LEN(N'jemiller'))

and found that it was faster than even using LIKE by itself. For my problem query, it was the fasted performance yet. Consistently less than 1 second with the LEFT() check while it was around 2 seconds with only LIKE. Also, I tested it without LIKE at all and that took around 2 seconds.

I'm not sure if it would need an extra OR where it checks for = '' (empty string). I noticed that when you do a .EndsWith() it does that at least sometimes.

Would it be possible to switch to using LEFT(LEN()) instead of CHARINDEX() for the SQL Server provider? This seems to cure the problem that I've been seeing.

@divega
Copy link
Contributor

divega commented Feb 8, 2017

@jemiller0 thanks for this new datapoint. Would you be able to provide the corresponding query plan screenshot?

@jemiller0
Copy link
Author

Sure.

leftlen
likewithleftlen

@popcatalin81
Copy link

In all 5 execution plans the primary index used is on the [date_updated] column, In none of the execution plans is the index on [updated_by] column used (if any). If the index [updated_by] doesn't exist or is not used, then any assumptions the performance characteristics about different ways to translate StartsWith() could be false positives.

StartsWith() type of search in SQL Server, LIKE 'term%' is by definition a type of search that can take Advantage of indexes , and it's normally the fastest type (can even use index Seeks). There needs to be a baseline for comparisons where the index is actually used and an optimal execution plan is used, and after that test for various other ways to write the query.

@jemiller0 does an index on the [updated_by] exist?

@jemiller0
Copy link
Author

@popcatalin81 Yes it has an index. Here is a screen shot of the table definitions.

bibexttable
bibtable

Here's some info on the index fragmentation in case it helps. They are all less than a percent.

bibextindexfragmentation
bibindexfragmentation

The tables have close to 8 million rows in them and the main table has an XML field in it.

@divega
Copy link
Contributor

divega commented Feb 8, 2017

FWIW, we have seen in our own testing that query plan selection is very sensitive to a variety of factors. You can even say it can be a bit hard to setup a scenario in which the index is consistently used.

In general, the performance characteristics of all these variations could be different for different kinds of queries, different SQL Server versions and different data sets, but this is still valid empirical data.

Also it is interesting that of the previous screenshots, these two showed the index being used:

  • LIKE AND CHARINDEX()
  • LIKE + ESCAPE

The query plan for the query that has LIKE alone does not appear to use the index.

BTW, screenshot for LIKE AND LEFT() shows a slightly different ORDER BY key. @jemiller0 was that intentional?

@jemiller0
Copy link
Author

@divega That was by mistake. I copied the queries from the log file I have from EF Core. I'm currently using the current release version. The nightly builds fix #6861. The current release version includes extra sort columns that it shouldn't. I think it needs them for one-to-manys, but, doesn't for many-to-ones. I re-ran the queries without the extra sort column after the fact and the results appeared to be the same. I can post new screen captures if you want them.

@divega
Copy link
Contributor

divega commented Feb 8, 2017

@jemiller0 thanks. No need for new screenshots. Just wanted to be sure the results were the same.

@jemiller0
Copy link
Author

Here they are. I didn't check to see if the execution plans were exactly the same. I just saw what the timings were.
like
likewithcharindex
likewithescape

@jemiller0
Copy link
Author

So, is there any disadvantage to using LEFT(LEN()) instead of CHARINDEX()? This is somewhat of a blocker for me. I want to switch from EF 6 to EF Core, but, I can't do it until I'm able to resolve performance issues such as this. Also, is there any word on when interception will be implemented? If interception were implemented, I could do a search and replace within the SQL text to work around the problem. Also, it would be great if fixes could be published a little quicker. It's been awhile since the last version was published and there are other blockers such as the problem where it was adding extra sort columns that I need. I'm assuming that one would be a problem for anyone working with large tables.

@maumar
Copy link
Contributor

maumar commented Feb 27, 2017

@jemiller0 we are in the process of adding support for explicit Like function which should hopefully address your problem, PR is being reviewed at the moment: #7611

@jemiller0
Copy link
Author

Thanks, but, from my perspective, I would rather not use Like(). I would rather have a more efficient StartsWith(). I'm using Telerik's RadGrid component which generates Dynamic LINQ which uses StartsWith(). I could modify the string that they are creating, but, would prefer not to. Is there something more efficient about using CHARINDEX() than using LEFT(LEN())? It seems like CHARINDEX() would be less efficient as it would have to scan the whole string. LEFT(LEN()) is the same approach that EndsWith() is using, except it's using LEFT() instead of RIGHT().

@roji
Copy link
Member

roji commented Feb 27, 2017

Agree with @jemiller0 that optimizing Startswith() is orthogonal to providing direct LIKE support via #7611...

@maumar
Copy link
Contributor

maumar commented Feb 27, 2017

We are also planning to switch our translation to LEFT(LEN()), it doesn't seem to have any disadvantages, as far as we know. We will consult some SQL experts just to be sure. The change will come in 2.0.0

@jemiller0
Copy link
Author

That sounds great. Thanks for everyone's work on this. Is there going to be a 1.2 release before 2.0?

@maumar
Copy link
Contributor

maumar commented Feb 27, 2017

We don't have the plans for it currently, but if we end up having 1.2 before 2.0, then this fix will very likely be a part of that release.

@jemiller0
Copy link
Author

That sounds great. Thanks again.

maumar added a commit that referenced this issue Apr 7, 2017
Fix is to use more permanent translation:

property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern

Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
maumar added a commit that referenced this issue Apr 8, 2017
Fix is to use more permanent translation:

property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern

Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
maumar added a commit that referenced this issue Apr 10, 2017
Fix is to use more permanent translation:

property.StartsWith(pattern) => property LIKE pattern + '%' AND LEFT(property, LEN(pattern)) = pattern

Also fixed minor issue with null semantics where we would apply C# null semantics for comparisons to null-valued parameters
@maumar
Copy link
Contributor

maumar commented Apr 10, 2017

Fixed in d8567c0

@maumar maumar closed this as completed Apr 10, 2017
@maumar maumar added type-enhancement closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. area-perf and removed type-investigation labels Apr 10, 2017
@ajcvickers ajcvickers changed the title StartsWith() filter slow with use of SQL CHARINDEX() Query: StartsWith() filter slow with use of SQL CHARINDEX() May 9, 2017
@divega divega changed the title Query: StartsWith() filter slow with use of SQL CHARINDEX() Query: Improved performance for String.StartsWith() filters May 10, 2017
@simeyla
Copy link

simeyla commented Jun 23, 2019

I still find it very odd that the LEFT thing is added when there's no special characters. It seems to completely prevent indexes from being used which utterly killed my performance.

It isn't obvious at all that LEFT doesn't work with indexing in SQL Server so until you have to end up researching why it's hard to guage the reason why things are so slow.

Fortunately EF.Functions.Like(...) will solve this and doesn't introduce LEFT into the SQL. Absolute MASSIVE performance increase and I was just doing email.StartsWith(searchQuery).

@smitpatel
Copy link
Contributor

@simeyla - Your issue has already been fixed. Tracked via #14657

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-perf closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Projects
None yet
Development

No branches or pull requests

10 participants