Here’s a story: a team trained a model, everything worked fine—until the dataset doubled. Suddenly, their R pipeline crawled for hours. The root cause wasn’t the algorithm at all. It was compute context. They were running in local compute, dragging every row across the network into memory. One switch to SQL compute context pushed the R script to run directly on the server, kept the data in place, and turned the crawl into a sprint. That’s the rule of thumb: if your dataset is large, prefer SQL compute context to avoid moving rows over the network. Try it yourself—run the same R script locally and then in SQL compute. Compare wall-clock time and watch your network traffic. You’ll see the difference. And once you understand that setting, the next question becomes obvious: where’s the real drag hiding when the data starts to flow?

The Invisible Bottleneck

What most people don’t notice at first is a hidden drag inside their workflow: the invisible bottleneck. It isn’t a bug in your model or a quirk in your code—it’s the way your compute context decides where the work happens. When you run in local compute context, R runs on your laptop. Every row from SQL Server has to travel across the network and squeeze through your machine’s memory. That transfer alone can strangle performance. Switch to SQL Server compute context, and the script executes inside the server itself, right next to the data. No shuffling rows across the wire, no bandwidth penalty—processing stays local to the engine built to handle it. A lot of people miss this because small test sets don’t show the pain. Ten thousand rows? Your laptop shrugs. Ten million rows? Now you’re lugging a library home page by page, wondering why the clock melted. The fix isn’t complex tuning or endless loop rewrites. It’s setting the compute context properly so the heavy lifting happens on the server that was designed for it. That doesn’t mean compute context is a magic cure-all.
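The switch itself is a couple of lines of R. Here’s a minimal sketch using RevoScaleR—the connection string, table name, and column are placeholders you’d adapt to your own server:

```r
# Sketch: switching between local and SQL Server compute contexts in RevoScaleR.
# Connection string, table, and column names below are placeholders.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=yes"

# Local compute context: every row travels over the network into this machine.
rxSetComputeContext("local")

# SQL Server compute context: the R work runs inside the server, next to the data.
sqlCC <- RxInSqlServer(connectionString = connStr)
rxSetComputeContext(sqlCC)

# The same rx* call now executes server-side -- the analysis code doesn't change.
flights <- RxSqlServerData(table = "dbo.Flights", connectionString = connStr)
rxSummary(~ ArrDelay, data = flights)
```

The key design point: the analysis call (`rxSummary`) is identical in both contexts. Only the `rxSetComputeContext` line decides where the rows live while the work happens.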
If your data sources live outside SQL Server, you’ll still need to plan ETL to bring them in first. SQL compute context only removes the transfer tax if the data is already inside SQL Server. Think of it this way: the server’s a fortress smithy; if you want the blacksmith to forge your weapon fast, you bring the ore to him rather than hauling each strike back and forth across town. This is why so many hours get wasted on what looks like “optimization.” Teams adjust algorithms, rework pipeline logic, and tweak parameters trying to speed things up. But if the rows themselves are making round trips over the network, no amount of clever code will win. You’re simply locked into bandwidth drag. Change the compute context, and the fight shifts in your favor before you even sharpen the code. Still, it’s worth remembering: not every crawl is caused by compute context. If performance stalls, check three things in order. First, confirm compute context—local versus SQL Server. Second, inspect your query shape—are you pulling the right columns and rows, or everything under the sun? Third, look at batch size, because how many rows you feed into R at a time can make or break throughput. That checklist saves you from wasting cycles on the wrong fix. Notice the theme: network trips are the real tax collector here. With local compute, you pay tolls on every row. With SQL compute, the toll booths vanish. And once you start running analysis where the data actually resides, your pipeline feels like it finally got unstuck from molasses. But even with the right compute context, another dial lurks in the pipeline—how the rows are chunked and handed off. Leave that setting on default, and you can still find yourself feeding a beast one mouse at a time. That’s where the next performance lever comes in.

Batch Size: Potion of Speed or Slowness

Batch size is the next lever, and it behaves like a potion: dose it right and you gain speed, misjudge it and you stagger.
In SQL Server, the batch size is controlled by the `rowsPerRead` parameter. By default, `rowsPerRead` is set to 50,000. That’s a safe middle ground, but once you start working with millions of rows, it often starves the process—like feeding a dragon one mouse at a time and wondering why it still looks hungry. Adjusting `rowsPerRead` changes how many rows SQL Server hands over to R in each batch. Too few, and R wastes time waiting for its next delivery. Too many, and the server may choke, running out of memory or paging to disk. The trick is to find the point where the flow into R keeps it busy without overwhelming the system. A practical way to approach this is simple: test in steps. Start with the default 50,000, then increase to 500,000, and if the server has plenty of memory, try one million. Each time, watch runtime and keep an eye on RAM usage. If you see memory paging, you’ve pushed too far. Roll back to the previous setting and call that your sweet spot. The actual number will vary based on your workload, but this test plan keeps you on safe ground. The shape of your data matters just as much as the row count. Wide tables—those with hundreds of columns—or those that include heavy text or blob fields are more demanding. In those cases, even if the row count looks small, the payload per row is huge. Rule of thumb: if your table is wide or includes large object columns, lower `rowsPerRead` to prevent paging. Narrow, numeric-only tables can usually handle much larger values before hitting trouble. Once tuned, the effect can be dramatic. Raising the batch size from 50,000 to 500,000 rows can cut wait times significantly because R spends its time processing instead of constantly pausing for the next shipment. Push past a million rows and you might get even faster results on the right hardware. The runtime difference feels closer to a network upgrade than a code tweak—even though the script itself hasn’t changed at all. 
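The step-test described above is easy to script. Here’s a sketch that times the same summary at each candidate batch size—server, table, and column names are placeholders, and you’d watch server RAM alongside these timings:

```r
# Sketch: stepping rowsPerRead upward and timing each run.
# Connection string, table, and column names are placeholders.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=yes"
rxSetComputeContext(RxInSqlServer(connectionString = connStr))

for (batch in c(50000, 500000, 1000000)) {
  flights <- RxSqlServerData(table = "dbo.Flights",
                             connectionString = connStr,
                             rowsPerRead = batch)
  elapsed <- system.time(
    rxSummary(~ ArrDelay, data = flights)
  )["elapsed"]
  cat(sprintf("rowsPerRead = %7d -> %.1f s\n", batch, elapsed))
  # If server memory starts paging at a given step, roll back to the
  # previous batch size and treat that as the sweet spot.
}
```

Because `rowsPerRead` lives on the data source object, not the analysis call, you can retune it without touching the modeling code at all.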
A common mistake is ignoring `rowsPerRead` entirely and assuming the default is “good enough.” That choice often leads to pipelines that crawl during joins, aggregations, or transformations. The problem isn’t the SQL engine or the R code—it’s the constant interruption from feeding R too slowly. On the flip side, maxing out `rowsPerRead` without testing can be just as costly, because one oversized batch can tip memory over the edge and stall the process completely. That balance is why experimentation matters. Think of it as tuning a character build: one point too heavy on offense and you drop your defenses, one point too light and you can’t win the fight. Same here—batch size is a knob that lets you choose between throughput and resource safety, and only trial runs tell you where your system maxes out. The takeaway is clear: don’t treat `rowsPerRead` as a background setting. Use it as an active tool in your tuning kit. Small increments, careful monitoring, and attention to your dataset’s structure will get you to the best setting faster than guesswork ever will. And while batch size can smooth how much work reaches R at once, it can’t make up for sloppy queries. If the SQL feeding the pipeline is inefficient, then even a well-tuned batch size will struggle. That’s why the next focus is on something even more decisive: how the query itself gets written and whether the engine can break it into parallel streams.

The Query That Unlocks Parallel Worlds

Writing SQL can feel like pulling levers in a control room. Use the wrong switch and everything crawls through one rusty conveyor. Use the right one and suddenly the machine splits work across multiple belts at once. Same table, same data, but the outcome is night and day. The real trick isn’t about raw compute—it’s whether your query hands the optimizer enough structure to break the task into parallel paths. SQL Server will parallelize happily—but only if the query plan gives it that chance.
A naive “just point to the table” approach looks simple, but it often leaves the optimizer no option but a single-thread execution. That’s exactly what happens when you pass `table=` into `RxSqlServerData`. It pulls everything row by row, and parallelism rarely triggers. By contrast, defining `sqlQuery=` in `RxSqlServerData` with a well-shaped SELECT gives the database optimizer room to generate a parallel plan. One choice silently bottlenecks you; the other unlocks extra workers without touching your R code. You see the same theme with SELECT statements. “SELECT *” isn’t clever, it’s dead weight. Never SELECT *. Project only what you need, and toss the excess columns early. Columns that R can’t digest cleanly—like GUIDs, rowguids, or occasionally odd timestamp formats—should be dropped or cast in SQL itself, or wrapped in a view before you hand them to R. A lean query makes it easier for the optimizer to split tasks, and it keeps memory from being wasted on junk you’ll never use. Parallelism also extends beyond query shape into how you call R from SQL Server. There are two main dials here. If you’re running your own scripts through `sp_execute_external_script` and not using RevoScaleR functions, explicitly set `@parallel = 1`. That tells SQL it can attempt parallel processes on your behalf. But if you are using the RevoScaleR suite—the functions with the rx* prefix—then parallel work is managed automatically inside the SQL compute context, and you steer it with the `numTasks` parameter. Just remember: asking for 8 or 16 tasks doesn’t guarantee that many will spin up. SQL still honors the server’s MAXDOP and resource governance. You might request 16, but get 6 if that’s all the server is willing to hand out under current load. The lesson is simple: test both methods against your workload, and watch how the server responds. One smart diagnostic step is to check your query in Management Studio before ever running it with R. 
Execute it, right-click the plan, and look: do you see parallel streams, or is it a single-line serial path? A missing index, a sloppy SELECT, or too broad a scan can quietly kill parallelism before R ever enters the picture.
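Pulling the query-shape advice together, here’s a sketch contrasting the two `RxSqlServerData` styles and the `numTasks` dial—again with placeholder server, table, and column names:

```r
# Sketch: a shaped sqlQuery plus numTasks, versus a bare table= reference.
# Connection string, table, and column names are placeholders.
library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=yes"

# Bare table reference: the optimizer gets no shape to work with,
# and parallel plans rarely trigger.
serialSrc <- RxSqlServerData(table = "dbo.Flights", connectionString = connStr)

# Lean SELECT: project only the columns you need, filter early, and cast
# awkward types (GUIDs, odd timestamps) in SQL before R ever sees them.
query <- "SELECT ArrDelay, DepDelay, CAST(FlightDate AS date) AS FlightDate
          FROM dbo.Flights
          WHERE Cancelled = 0"
parallelSrc <- RxSqlServerData(sqlQuery = query, connectionString = connStr)

# numTasks is a request, not a guarantee -- MAXDOP and resource governance
# still cap what the server actually grants under current load.
cc <- RxInSqlServer(connectionString = connStr, numTasks = 8)
rxSetComputeContext(cc)

# The rx* function distributes its work across whatever tasks were granted.
rxLinMod(ArrDelay ~ DepDelay, data = parallelSrc)
```

The design choice mirrors the text: `sqlQuery=` hands the optimizer a plan it can split, while `numTasks` only asks the compute context to exploit whatever parallelism the server allows.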
Become a supporter of this podcast: https://www.spreaker.com/podcast/m365-show-modern-work-security-and-productivity-with-microsoft-365--6704921/support.
Follow us on:
LinkedIn
Substack