TMK Clusterize on thousands of .tmk files #211

jlekas · 2019-10-17T16:44:44Z

I was attempting to use the tmk-clusterize tool on a set of around 50,000 videos and the program crashed before it was finished running. Is there a limit to the number of hash files that the clusterize tool can operate on?

johnkerl · 2019-10-19T00:26:33Z

We need to incorporate FAISS into the clusterizer.

That said, the program might have run out of RAM as it's a single-threaded all-in-one demo -- do you have more context on why it crashed, and/or what you saw on the terminal around the time of the crash?
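As a rough back-of-envelope sketch of why scale matters here: if the clusterizer holds every hash in memory and compares all pairs (an assumption about the demo, not something confirmed in this thread), the work grows quadratically with the number of files. The helper names and the per-file size below are hypothetical placeholders; measure your own .tmk files before relying on the memory figure.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def resident_bytes(n: int, bytes_per_hash: int) -> int:
    """Bytes needed to hold n hashes in memory at once, ignoring overhead."""
    return n * bytes_per_hash


if __name__ == "__main__":
    n = 50_000                 # roughly the number of videos mentioned above
    assumed_size = 256 * 1024  # hypothetical per-file size in bytes (assumption)
    print(f"{pair_count(n):,} pairwise comparisons")
    print(f"{resident_bytes(n, assumed_size) / 2**30:.1f} GiB of hashes resident")
```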

jlekas · 2019-10-21T14:38:35Z

The program was left running for 1-2 days, and at some point when I switched back to the open terminal it had stopped running, although I unfortunately do not remember the output. Do you think it would be worthwhile to fork the project and write a multi-threaded or multi-process version of the file to help it run on larger data sets, or would it be better to wait for FAISS to be incorporated into the clusterizer?

johnkerl · 2019-10-22T00:43:51Z

@jlekas both. :)

I really apologize for the delay. My team is actively working on unblocking a couple key flows in ThreatExchange and I've not been able to dedicate time to unblocking TMK+FAISS ...

jlekas · 2019-10-22T18:06:07Z

Oh, no worries. Thank you so much for your timely responses and help in solving this problem. I will look into writing a multi-threaded and multi-process version of the file to see if it works well for my larger data set.

github-actions · 2020-11-13T17:12:28Z

This issue is being marked as stale because it has no recent activity. It will be closed automatically in 14 days
unless it becomes active before then. To prevent closing, please comment on the issue before that time. If the
issue is no longer relevant, please feel free to close it prior to that time.

Cleaning up stale issues helps redirect focus to the issues top of mind of the community. Thank you for your help
with this.

github-actions · 2020-11-27T17:13:47Z

This issue has been closed due to no recent activity. If you need this issue reopened, please let us know.
Thanks!

Dcallies · 2022-04-12T19:18:16Z

@dxdc - How are things going after your changes - is it performant enough to consider this task closed out?

dxdc · 2022-04-12T20:29:30Z

@Dcallies I've tested up to 25k tmk files, works great. There may be some better ways to optimize the parallelization, but it would be for smaller scale improvements at this point I would imagine.
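For readers wondering what parallelizing a clusterizer can look like in general, here is a minimal sketch. It is not the code from #959 and not TMK's actual scoring: cosine similarity over toy vectors stands in for the real hash comparison, the grouping is plain single-linkage via union-find, and every name (`cosine`, `match_rows`, `cluster`, `rows_per_job`) is hypothetical. The idea is to split the upper triangle of the pairwise comparison by rows, hand each block of rows to a worker process, and merge the matches afterwards.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def cosine(a, b):
    """Cosine similarity; a stand-in for the real TMK comparison (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_rows(job):
    """Compare each row i in the block against every j > i; return matches."""
    vectors, rows, threshold = job
    n = len(vectors)
    return [(i, j) for i in rows for j in range(i + 1, n)
            if cosine(vectors[i], vectors[j]) >= threshold]


def cluster(vectors, threshold=0.9, workers=1, rows_per_job=64):
    """Single-linkage clusters of indices whose similarity meets the threshold."""
    n = len(vectors)
    # Each job carries the full vector list, so it is pickled once per job
    # when worker processes are used; fine for a sketch, costly at scale.
    jobs = [(vectors, range(s, min(s + rows_per_job, n)), threshold)
            for s in range(0, n, rows_per_job)]
    if workers > 1:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(match_rows, jobs))
    else:
        results = [match_rows(job) for job in jobs]  # serial fallback

    parent = list(range(n))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for matches in results:
        for i, j in matches:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    print(cluster(toy, threshold=0.9, workers=2))
```

Row blocks near the top of the triangle do more work than those near the bottom, so a real implementation would likely balance the blocks or use smaller ones. This still performs every pairwise comparison; it spreads the quadratic cost across cores rather than removing it, which is where an index such as FAISS would come in.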

github-actions bot added the Stale label Nov 13, 2020

github-actions bot closed this as completed Nov 27, 2020

Dcallies reopened this Mar 23, 2022

Dcallies added the enhancement, help wanted, do-not-reap, performance, and tmk+pdqf labels and removed the Stale label Mar 23, 2022

dxdc mentioned this issue Mar 26, 2022

Parallel, multicore processing for TMK #959

Merged

Dcallies removed the do-not-reap label Apr 12, 2022
