More EXPLAIN: using it to inspect how Yugabyte processes queries.
TL;DR: Knowing that data is hashed in Yugabyte prompts me to try some range-queries, and to create some indexes, with more or less predictable results. But there is at least one outcome that I cannot explain yet.
Background:
Yugabyte uses a hash to shard the data over tablets. Hashes are ideal for point-lookups, e.g. equality-conditions like: WHERE id = 42. But hash-spread data is less ideal when you need a range, for example to select the last 24 hrs: WHERE dt > now() - interval '1 day'.
Let's try a few, and see what happens.
Setup:
See the earlier blog for the definitions of tables t01 and t16: split into 1 tablet and split into 16 tablets, respectively. Again, we put 100 records into the tables.
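For reference, a minimal sketch of that setup; the exact DDL is in the earlier blog, and apart from id and dt, the payload column here is an assumption:

  -- Sketch of the setup from the earlier blog (payload column is an assumption):
  CREATE TABLE t01 (id int, dt timestamp, payload text,
                    PRIMARY KEY (id HASH)) SPLIT INTO 1 TABLETS;
  CREATE TABLE t16 (id int, dt timestamp, payload text,
                    PRIMARY KEY (id HASH)) SPLIT INTO 16 TABLETS;

  -- 100 records, with dt spread over the past 100 days:
  INSERT INTO t01
    SELECT n, now() - make_interval(days => n), 'record_' || n
    FROM generate_series(1, 100) AS n;
  INSERT INTO t16 SELECT * FROM t01;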
And I will create an index on dt, the date-time field, just for good measure, because I may want to look for "recent data", or data-from-last-year.
I'm going to ask for:
The record with id=42 (a point-lookup).
All records with id between 40 and 60 (a range).
All records where dt (date) is in the last 30 days (e.g. show me recent data).
In trying this we will add some indexes and see how that works out.
Note: After 20+ yrs of observing "optimizers", I am very much aware that optimizers can be much smarter than I think I am. I will humbly look at all the results, and maybe ask Yugabyte for an explanation when I don't quite see what I expect.
The Queries, Finding 1 record is Easy:
Looking for 1 record, in both tables t01 and t16.
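A sketch of the two lookups, using the DIST option of explain-analyze to report the distributed storage calls:

  EXPLAIN (ANALYZE, DIST) SELECT * FROM t01 WHERE id = 42;
  EXPLAIN (ANALYZE, DIST) SELECT * FROM t16 WHERE id = 42;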
The explains for id=42 are more or less as expected: an Index Scan on the PK, and in both cases just 1 read-call or storage call. This was a very precise lookup.
Next Query: Finding a range of IDs:
Let me try to fetch a set of records. Because the records are hash-indexed on the PK, the id, I would expect a seq-scan, or some iteration over the list of integers in the range.
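The range-query, again as a sketch:

  EXPLAIN (ANALYZE, DIST) SELECT * FROM t01 WHERE id BETWEEN 40 AND 60;
  EXPLAIN (ANALYZE, DIST) SELECT * FROM t16 WHERE id BETWEEN 40 AND 60;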
This is what explain said happened:
For t01, it reports a Seq Scan, rows=21, and a single Storage Table Read Request. More or less what is expected for a single-tablet table.
For t16, it also reports a Seq Scan, rows=21, but only 2 Storage Table Read Requests? I would have expected those 21 ids to be hashed over more tablets...? Would the small range of records be kept in just two tablets? Checking with a larger range...
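The wider-range check (sketch):

  EXPLAIN (ANALYZE, DIST) SELECT * FROM t16 WHERE id BETWEEN 0 AND 100;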
Nope: even if I increase the range to 0-100, i.e. all records in the table, it still reports only 2 Storage Table Read Requests.
Question: Can someone explain this? I would expect the Seq Scan on t16 to read all 16 tablets, because I know that the condition 0..100 filters for all records, and those records are (expected to be) hashed over all 16 tablets. But explain-analyze-dist only reports 2? What clever trick is used here?
Looking for a Datetime-Range:
Next test is to look for records in a date-range, say the last 30 days. A sketch of the query:
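  EXPLAIN (ANALYZE, DIST) SELECT * FROM t01 WHERE dt > now() - interval '30 days';
  EXPLAIN (ANALYZE, DIST) SELECT * FROM t16 WHERE dt > now() - interval '30 days';

The result is as expected: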
For t01, we see a Seq Scan, and 1 Storage Table Read Request. All the data can be scanned from that single tablet.
For t16, we see the same Seq Scan, and 16 storage calls. It has to check all tablets to inspect and select the records that conform to the filter. No magic here, rather "expected behaviour".
And on a large table, we may not want a Seq Scan to find a date-range.
Should we consider an index...?
Date-range with index, Hash or Asc:
We create an index on t16, and we even make it a "covering index" for this query. Ideally, the query can now pick up all its data just from this index.
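A sketch of such a covering index: by default YSQL hashes the leading index column, and the INCLUDE clause adds the other columns so the index covers the query (index name and column list are assumptions):

  -- Leading column dt is hash-sharded by default:
  CREATE INDEX t16_dt_hash ON t16 (dt HASH) INCLUDE (id, payload);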
This index didn't help: we still see a Seq Scan and 16 calls to storage.
And we know why:
Because in Yugabyte, all data is hashed over tablets. And indexes, in this case, are by default "hash-indexes", and therefore only suitable for equality-lookups. Hence the SQL-planner chose the Seq Scan again, to scan the whole table. Our index is useless for range-queries.
But we know how to fix that: create an index and tell it to use ASCending order...
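A sketch of that fix; the ASC keyword makes it a range-sharded index (again, index name and columns are assumptions):

  CREATE INDEX t16_dt_asc ON t16 (dt ASC) INCLUDE (id, payload);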
And that fixed it: now we can find records from a date-range using the Index Only Scan, as we had hoped. The covering index contains all the data we are looking for, and in this case that data was located in a single tablet of the index: 1 Storage Index Read Request was enough.
Lessons:
Firstly, we can confirm that the data is "hashed over tablets". Nearly all of the results are consistent with "hashed data" or hash-indexes.
The Exception is the "Seq Scan with only Two Storage Calls". I have a question about that, below.
If range-searches (by id or by date) are your typical access-path, you should consider making your indexes, and possibly your PK, ASC rather than hashed (see the sketch after these lessons). Using ASC indexes on the PK will probably have the same effect as "ordered tables" in some other databases (Index-Organized Tables in Oracle, or the clustered PK index in MS-SQL).
Indexes, even covering indexes, are not necessarily helpful when the filter-condition isn't "equal". Proven by the two indexes on the date-range.
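As a sketch of that PK idea: in YSQL the primary key can be declared ASC, so the table itself is stored range-sharded, in order (table name and columns are hypothetical):

  CREATE TABLE t_range (id int, dt timestamp, payload text,
                        PRIMARY KEY (id ASC));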
There is always more, and Questions ...
For further research, I might look at range-sharding and its effects. I am also curious to see how non-covering indexes (e.g. in this case an index on just dt, sketched below) would work out. Something to test in a next run?
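The non-covering variant I have in mind would be something like this (a sketch):

  CREATE INDEX t16_dt_only ON t16 (dt ASC);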
It may also be interesting to see how Yugabyte handles (automatic) range-sharding on large sets. Something to investigate.
Q1: What exactly is a "storage call"? How do we know for sure it is 1 call, and not hiding some "parallel" work under water? And what about 16 storage-calls: on a 7-node cluster, could (would?) some of those calls have been combined?
Q2: Is there any caching or buffering that we can measure or report? E.g. if data is buffered in the cache of the query process: nice to know. If data happened to be cached in the postgres-global-area on the TServer: nice to know.
Q3: Possibly interesting to know if the RPCs from the SQL-engine to the storage-system return many records or just a few.
Q4 (my most puzzling question at the moment): How can a Seq Scan be executed with only 2 storage calls, in the case of id-between-x-and-y?
Q5: When I have a covering (hash) index, the system still chooses to seq-scan the table, not the index. Any reason for this? IMHO, both would have been equally good or bad when the index is hashed.
-------- End of Blogpost - possibly notes + more questions below -----
Q: Would larger volumes help to indicate where reads/writes take place? Yb-admin and yugatool show some IO information; would that help to spot the 2 storage-reads? Any other way to snoop on RPCs or subsystems?
Q: Observing with yugatool: are reads/writes cumulative? Would I still catch IO when measuring with, say, a 60-sec interval? Or is yb-admin better (showing io/sec)?